# Experiments
A/B testing and experimentation with Hanzo Insights — statistical significance, holdout groups, and automatic winner detection.
Hanzo Insights provides a full experimentation platform built on top of feature flags. Run A/B tests with automatic statistical significance tracking, Bayesian analysis, and winner detection.
## Creating an Experiment
- Go to insights.hanzo.ai → Experiments → New Experiment
- Configure:
| Field | Description |
|---|---|
| Name | Descriptive name (e.g., "Pricing Page CTA Copy") |
| Feature Flag | The flag that controls variants (created automatically or linked) |
| Variants | Control + 1 or more test variants |
| Goal Metric | The event/action you're optimizing for |
| Secondary Metrics | Additional metrics to track |
| Minimum Sample Size | Calculated based on expected effect size |
| Significance Level | Default: 95% (p < 0.05) |
## Implementation
Experiments use feature flags under the hood. The SDK returns the variant for the current user:
```tsx
// Get the user's variant
const variant = insights.getFeatureFlag('pricing-cta-experiment')

switch (variant) {
  case 'control':
    return <Button>Start Free Trial</Button>
  case 'test':
    return <Button>Get Started — It's Free</Button>
}

// Track the conversion event (goal metric)
function onSignup() {
  insights.capture('signup_completed', {
    experiment: 'pricing-cta-experiment',
    variant,
  })
}
```

### React
```tsx
import { useFeatureFlag } from '@hanzo/insights-react'

function PricingCTA() {
  const variant = useFeatureFlag('pricing-cta-experiment')
  return variant === 'test'
    ? <Button onClick={onSignup}>Get Started — It's Free</Button>
    : <Button onClick={onSignup}>Start Free Trial</Button>
}
```

### Server-Side
```ts
import { Insights } from '@hanzo/insights-node'

const insights = new Insights('your-api-key', {
  host: 'https://insights.hanzo.ai',
  personalApiKey: 'your-personal-api-key',
})

// SSR: get variant and render accordingly
const variant = await insights.getFeatureFlag('pricing-cta-experiment', userId)
```

## Statistical Methods
Insights supports two statistical approaches:
### Bayesian (Default)
- Calculates probability that each variant is the best
- Provides credible intervals for effect size
- No fixed sample size required
- Results are interpretable as "Variant A has a 95% probability of being better"
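The "probability of being best" idea can be sketched with a small Monte Carlo simulation: put a Beta posterior over each variant's conversion rate and count how often one variant's sampled rate beats the other's. This is an illustrative sketch of the general technique, not Insights' actual implementation:

```typescript
// Marsaglia–Tsang gamma sampler (used to build Beta samples).
function sampleGamma(shape: number): number {
  if (shape < 1) {
    // Boost trick for shape < 1
    return sampleGamma(shape + 1) * Math.pow(1 - Math.random(), 1 / shape)
  }
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  for (;;) {
    let x = 0, v = 0
    do {
      // Box–Muller standard normal
      const u1 = 1 - Math.random(), u2 = Math.random()
      x = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
      v = 1 + c * x
    } while (v <= 0)
    v = v * v * v
    const u = Math.random()
    if (u < 1 - 0.0331 * x ** 4) return d * v
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v
  }
}

function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a)
  return x / (x + sampleGamma(b))
}

// P(B's true conversion rate > A's), given conversions and exposures per arm,
// using Beta(1, 1) priors updated with the observed counts.
function probBBeatsA(convA: number, nA: number, convB: number, nB: number, draws = 20000): number {
  let wins = 0
  for (let i = 0; i < draws; i++) {
    const pA = sampleBeta(1 + convA, 1 + nA - convA)
    const pB = sampleBeta(1 + convB, 1 + nB - convB)
    if (pB > pA) wins++
  }
  return wins / draws
}
```

With 100/1000 conversions on control and 150/1000 on test, the estimate lands near 1, which reads directly as "the test variant has a ~99%+ probability of being better."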
### Frequentist
- Classic hypothesis testing with p-values
- Requires pre-calculated sample size
- Sequential testing with adjustable significance boundaries
- Results are interpretable as "We can reject the null hypothesis at p < 0.05"
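The pre-calculated sample size typically comes from a two-proportion power formula. A sketch using the common textbook version, with z-values hard-coded for a two-sided α = 0.05 and 80% power (Insights' built-in calculator may differ):

```typescript
// Required users per arm to detect a shift from baseline rate p1 to target rate p2.
// zAlpha ≈ 1.96 (two-sided α = 0.05), zBeta ≈ 0.84 (power = 0.8).
function sampleSizePerArm(p1: number, p2: number, zAlpha = 1.96, zBeta = 0.84): number {
  const variance = p1 * (1 - p1) + p2 * (1 - p2)
  const delta = p1 - p2
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta))
}

// e.g. baseline 10% conversion, hoping to detect a lift to 12%:
// sampleSizePerArm(0.10, 0.12) → 3834 users per arm
```

Smaller expected effects blow the requirement up quadratically, which is why the Minimum Sample Size field depends on expected effect size.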
## Experiment Lifecycle
```
Draft → Running → Significant → Complete
           ↘ Inconclusive ↗
```

| State | Description |
|---|---|
| Draft | Experiment configured but not yet launched |
| Running | Collecting data, variants being served |
| Significant | One variant has reached statistical significance |
| Inconclusive | Minimum sample reached but no significant difference |
| Complete | Experiment ended, winner (or no winner) declared |
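The lifecycle above can be encoded as an allowed-transition map, which is handy if you mirror experiment state in your own tooling (hypothetical encoding, not an Insights API):

```typescript
type State = 'draft' | 'running' | 'significant' | 'inconclusive' | 'complete'

// Legal moves between states, matching the diagram above.
const transitions: Record<State, State[]> = {
  draft: ['running'],
  running: ['significant', 'inconclusive'],
  significant: ['complete'],
  inconclusive: ['complete'],
  complete: [],
}

function canTransition(from: State, to: State): boolean {
  return transitions[from].includes(to)
}
```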
## Holdout Groups
Reserve a percentage of users who never see any experiment, providing a clean baseline:
```
Experiment: pricing-cta
  Holdout: 10% (never see any variant — always get default experience)
  Control: 45% (original CTA)
  Test:    45% (new CTA)
```

## Guardrail Metrics
Track metrics that should not degrade even if the goal metric improves:
```
Goal: signup_completed (should increase)
Guardrails:
  - page_load_time (should not increase > 10%)
  - error_rate (should not increase > 0.5%)
  - bounce_rate (should not increase > 5%)
```

If a guardrail is violated, the experiment dashboard shows a warning.
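A guardrail check boils down to comparing each metric's relative change between control and test against its threshold. A minimal sketch (names and shapes are my own, not Insights internals):

```typescript
// A guardrail: a metric name plus the maximum tolerated relative increase
// (0.10 = 10%). Assumes "higher is worse" for every guardrail metric.
interface Guardrail {
  metric: string
  maxIncrease: number
}

// Returns the names of guardrail metrics whose relative increase in the
// test group exceeds the tolerated threshold.
function violatedGuardrails(
  control: Record<string, number>,
  test: Record<string, number>,
  guardrails: Guardrail[],
): string[] {
  return guardrails
    .filter(g => (test[g.metric] - control[g.metric]) / control[g.metric] > g.maxIncrease)
    .map(g => g.metric)
}
```

For example, a page load time of 120 ms in test vs 100 ms in control is a 20% increase and trips a 10% guardrail; 105 ms would not.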
## Multi-Armed Bandits
For optimization (not just testing), use the bandit mode that automatically shifts traffic toward the winning variant:
- Start with equal traffic split
- As data accumulates, shift traffic toward better-performing variants
- Minimize regret while still collecting statistical evidence
Enable in experiment settings: Optimization mode → Multi-armed bandit
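The "shift traffic toward winners" behavior can be sketched with Thompson sampling: for each request, sample a plausible conversion rate per arm from its posterior and serve the arm with the highest sample. This version uses a normal approximation to the posterior for brevity; it illustrates the general technique, not Insights' exact algorithm:

```typescript
interface Arm {
  conversions: number
  exposures: number
}

// Box–Muller standard normal draw.
function gaussian(): number {
  const u1 = 1 - Math.random(), u2 = Math.random()
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
}

// Thompson sampling step: sample a conversion rate for each arm from an
// approximate posterior and pick the arm with the highest sample. Arms with
// little data have wide posteriors, so they still get explored.
function pickArm(arms: Arm[]): number {
  let best = 0
  let bestSample = -Infinity
  arms.forEach((arm, i) => {
    const n = arm.exposures + 2                 // +2: weak prior, avoids n = 0
    const p = (arm.conversions + 1) / n
    const sample = p + gaussian() * Math.sqrt((p * (1 - p)) / n)
    if (sample > bestSample) {
      bestSample = sample
      best = i
    }
  })
  return best
}
```

Early on, all arms win samples roughly equally (an equal traffic split); as evidence accumulates, the better arm's posterior tightens around a higher rate and it wins nearly every draw, which is exactly the regret-minimizing shift described above.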
## Best Practices
- **Define metrics before launching** — Decide what success looks like upfront
- **Run until significant** — Don't peek at results and stop early
- **One change per experiment** — Isolate the variable you're testing
- **Use holdout groups** — Measure cumulative experiment impact
- **Monitor guardrails** — Ensure you don't degrade core metrics
- **Document learnings** — Record what you learned regardless of outcome