# Experiments
A/B testing and experimentation with Hanzo Insights — statistical significance, holdout groups, and automatic winner detection.
Hanzo Insights provides a full experimentation platform built on top of feature flags. Run A/B tests with automatic statistical significance tracking, Bayesian analysis, and winner detection.
## Creating an Experiment
- Go to insights.hanzo.ai → Experiments → New Experiment
- Configure:
| Field | Description |
|---|---|
| Name | Descriptive name (e.g., "Pricing Page CTA Copy") |
| Feature Flag | The flag that controls variants (created automatically or linked) |
| Variants | Control + 1 or more test variants |
| Goal Metric | The event/action you're optimizing for |
| Secondary Metrics | Additional metrics to track |
| Minimum Sample Size | Calculated based on expected effect size |
| Significance Level | Default: 95% (p < 0.05) |
## Implementation
Experiments use feature flags under the hood. The SDK returns the variant for the current user:
```tsx
// Get the user's variant
const variant = insights.getFeatureFlag('pricing-cta-experiment')

switch (variant) {
  case 'control':
    return <Button>Start Free Trial</Button>
  case 'test':
    return <Button>Get Started — It's Free</Button>
}

// Track the conversion event (goal metric)
function onSignup() {
  insights.capture('signup_completed', {
    experiment: 'pricing-cta-experiment',
    variant,
  })
}
```

### React
```tsx
import { useFeatureFlag } from '@hanzo/insights-react'

function PricingCTA() {
  const variant = useFeatureFlag('pricing-cta-experiment')
  return variant === 'test'
    ? <Button onClick={onSignup}>Get Started — It's Free</Button>
    : <Button onClick={onSignup}>Start Free Trial</Button>
}
```

### Server-Side
```ts
import { Insights } from '@hanzo/insights-node'

const insights = new Insights('your-api-key', {
  host: 'https://insights.hanzo.ai',
  personalApiKey: 'your-personal-api-key',
})

// SSR: get variant and render accordingly
const variant = await insights.getFeatureFlag('pricing-cta-experiment', userId)
```

## Statistical Methods
Insights supports two statistical approaches:
### Bayesian (Default)
- Calculates probability that each variant is the best
- Provides credible intervals for effect size
- No fixed sample size required
- Results are interpretable as "Variant A has a 95% probability of being better"
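The "probability of being best" idea can be sketched with a small Monte Carlo simulation: put a Beta posterior over each variant's conversion rate and count how often one variant's sampled rate beats the other's. This is an illustrative sketch of the general technique, not Insights' actual implementation:

```typescript
// Marsaglia–Tsang gamma sampler (used to build Beta samples).
function sampleGamma(shape: number): number {
  if (shape < 1) {
    // Boost trick for shape < 1
    return sampleGamma(shape + 1) * Math.pow(1 - Math.random(), 1 / shape)
  }
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  for (;;) {
    let x = 0, v = 0
    do {
      // Box–Muller standard normal
      const u1 = 1 - Math.random(), u2 = Math.random()
      x = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
      v = 1 + c * x
    } while (v <= 0)
    v = v * v * v
    const u = Math.random()
    if (u < 1 - 0.0331 * x ** 4) return d * v
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v
  }
}

function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a)
  return x / (x + sampleGamma(b))
}

// P(B's true conversion rate > A's), given conversions and exposures per arm,
// using Beta(1, 1) priors updated with the observed counts.
function probBBeatsA(convA: number, nA: number, convB: number, nB: number, draws = 20000): number {
  let wins = 0
  for (let i = 0; i < draws; i++) {
    const pA = sampleBeta(1 + convA, 1 + nA - convA)
    const pB = sampleBeta(1 + convB, 1 + nB - convB)
    if (pB > pA) wins++
  }
  return wins / draws
}
```

With 100/1000 conversions on control and 150/1000 on test, the estimate lands near 1, which reads directly as "the test variant has a ~99%+ probability of being better."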
### Frequentist
- Classic hypothesis testing with p-values
- Requires pre-calculated sample size
- Sequential testing with adjustable significance boundaries
- Results are interpretable as "We can reject the null hypothesis at p < 0.05"
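The pre-calculated sample size typically comes from a two-proportion power formula. A sketch using the common textbook version, with z-values hard-coded for a two-sided α = 0.05 and 80% power (Insights' built-in calculator may differ):

```typescript
// Required users per arm to detect a shift from baseline rate p1 to target rate p2.
// zAlpha ≈ 1.96 (two-sided α = 0.05), zBeta ≈ 0.84 (power = 0.8).
function sampleSizePerArm(p1: number, p2: number, zAlpha = 1.96, zBeta = 0.84): number {
  const variance = p1 * (1 - p1) + p2 * (1 - p2)
  const delta = p1 - p2
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta))
}

// e.g. baseline 10% conversion, hoping to detect a lift to 12%:
// sampleSizePerArm(0.10, 0.12) → 3834 users per arm
```

Smaller expected effects blow the requirement up quadratically, which is why the Minimum Sample Size field depends on expected effect size.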
## Experiment Lifecycle
```
Draft → Running → Significant → Complete
           ↘ Inconclusive ↗
```

| State | Description |
|---|---|
| Draft | Experiment configured but not yet launched |
| Running | Collecting data, variants being served |
| Significant | One variant has reached statistical significance |
| Inconclusive | Minimum sample reached but no significant difference |
| Complete | Experiment ended, winner (or no winner) declared |
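The lifecycle above can be encoded as an allowed-transition map, which is handy if you mirror experiment state in your own tooling (hypothetical encoding, not an Insights API):

```typescript
type State = 'draft' | 'running' | 'significant' | 'inconclusive' | 'complete'

// Legal moves between states, matching the diagram above.
const transitions: Record<State, State[]> = {
  draft: ['running'],
  running: ['significant', 'inconclusive'],
  significant: ['complete'],
  inconclusive: ['complete'],
  complete: [],
}

function canTransition(from: State, to: State): boolean {
  return transitions[from].includes(to)
}
```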
## Holdout Groups
Reserve a percentage of users who never see any experiment, providing a clean baseline:
```
Experiment: pricing-cta
  Holdout: 10% (never see any variant — always get default experience)
  Control: 45% (original CTA)
  Test:    45% (new CTA)
```

## Guardrail Metrics
Track metrics that should not degrade even if the goal metric improves:
```
Goal: signup_completed (should increase)
Guardrails:
  - page_load_time (should not increase > 10%)
  - error_rate (should not increase > 0.5%)
  - bounce_rate (should not increase > 5%)
```

If a guardrail is violated, the experiment dashboard shows a warning.
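A guardrail check boils down to comparing each metric's relative change between control and test against its threshold. A minimal sketch (names and shapes are my own, not Insights internals):

```typescript
// A guardrail: a metric name plus the maximum tolerated relative increase
// (0.10 = 10%). Assumes "higher is worse" for every guardrail metric.
interface Guardrail {
  metric: string
  maxIncrease: number
}

// Returns the names of guardrail metrics whose relative increase in the
// test group exceeds the tolerated threshold.
function violatedGuardrails(
  control: Record<string, number>,
  test: Record<string, number>,
  guardrails: Guardrail[],
): string[] {
  return guardrails
    .filter(g => (test[g.metric] - control[g.metric]) / control[g.metric] > g.maxIncrease)
    .map(g => g.metric)
}
```

For example, a page load time of 120 ms in test vs 100 ms in control is a 20% increase and trips a 10% guardrail; 105 ms would not.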
## Multi-Armed Bandits
For optimization (not just testing), use the bandit mode that automatically shifts traffic toward the winning variant:
- Start with equal traffic split
- As data accumulates, shift traffic toward better-performing variants
- Minimize regret while still collecting statistical evidence
Enable in experiment settings: Optimization mode → Multi-armed bandit
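The "shift traffic toward winners" behavior can be sketched with Thompson sampling: for each request, sample a plausible conversion rate per arm from its posterior and serve the arm with the highest sample. This version uses a normal approximation to the posterior for brevity; it illustrates the general technique, not Insights' exact algorithm:

```typescript
interface Arm {
  conversions: number
  exposures: number
}

// Box–Muller standard normal draw.
function gaussian(): number {
  const u1 = 1 - Math.random(), u2 = Math.random()
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
}

// Thompson sampling step: sample a conversion rate for each arm from an
// approximate posterior and pick the arm with the highest sample. Arms with
// little data have wide posteriors, so they still get explored.
function pickArm(arms: Arm[]): number {
  let best = 0
  let bestSample = -Infinity
  arms.forEach((arm, i) => {
    const n = arm.exposures + 2                 // +2: weak prior, avoids n = 0
    const p = (arm.conversions + 1) / n
    const sample = p + gaussian() * Math.sqrt((p * (1 - p)) / n)
    if (sample > bestSample) {
      bestSample = sample
      best = i
    }
  })
  return best
}
```

Early on, all arms win samples roughly equally (an equal traffic split); as evidence accumulates, the better arm's posterior tightens around a higher rate and it wins nearly every draw, which is exactly the regret-minimizing shift described above.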
## Best Practices
- **Define metrics before launching** — Decide what success looks like upfront
- **Run until significant** — Don't peek at results and stop early
- **One change per experiment** — Isolate the variable you're testing
- **Use holdout groups** — Measure cumulative experiment impact
- **Monitor guardrails** — Ensure you don't degrade core metrics
- **Document learnings** — Record what you learned regardless of outcome