
Why your A/B tests are lying to you

Justin T. Huang

Most email marketing teams run an A/B testing program in some form. Subject line A against subject line B, one sender name against another, 9am against 2pm: a winner is declared, the playbook gets updated, and the team moves on to the next campaign. The discipline is real. The trouble is that a two-arm test is much too small an instrument to answer the question marketers actually want answered.

The two-variant trap

A classical A/B test takes a high-dimensional object (a message that varies in subject line wording, framing, length, tone, specificity, social proof, send time, and audience segment) and asks it a single binary question: is this version better than that one? Suppose your email has eight dimensions you could meaningfully vary, with four reasonable settings each. That is 4^8 = 65,536 distinct messages, and a two-arm test explores two of them. Even if the experiment is clean and well-powered, what you have learned at the end is the relative ranking of two specific points in a 65,536-point space.
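To make the arithmetic concrete, here is a quick enumeration of a design space like the one described. The dimension names and settings below are invented placeholders, not a real taxonomy:

```python
# Count the design space: eight dimensions, four settings each.
# All names here are illustrative stand-ins.
from itertools import product

dimensions = {
    "framing": ["loss", "gain", "question", "neutral"],
    "length": ["short", "medium", "long", "very_long"],
    "tone": ["formal", "casual", "urgent", "playful"],
    "specificity": ["vague", "some", "detailed", "quantified"],
    "social_proof": ["none", "counts", "testimonial", "expert"],
    "sender": ["brand", "founder", "rep", "team"],
    "send_time": ["9am", "12pm", "2pm", "6pm"],
    "segment": ["new", "active", "lapsing", "churned"],
}

variants = list(product(*dimensions.values()))
print(len(variants))  # 4**8 = 65536 distinct messages
print(f"a two-arm test covers {2 / len(variants):.5%} of the space")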

You have not learned which dimensions did the work, how they interact, or whether the winning variant would still win for a different segment, a different product, or a different week. The headline is real, but it is local in a way the dashboard does not advertise.

Your average customer doesn't exist

The second thing a two-arm test obscures is heterogeneity. When the dashboard reports that variant B lifted conversion by 3.2%, that number is an average over your entire audience, and averages are famously capable of describing no one in particular. In one segment B might lift conversion by 15%; in another it might depress it by 8%. The reported lift is what you get when you pool those, and the marketer who ships "the winner" never finds out that half the audience was worse off.
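A toy simulation shows how this happens. The base rate and segment lifts below are invented to roughly match the numbers above; they are not drawn from any real campaign:

```python
# Hedged sketch: a pooled lift that hides opposite segment-level effects.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # emails per arm; all numbers are assumptions

segment = rng.integers(0, 2, size=n)            # two equal-sized segments
base = 0.030                                    # control conversion rate
true_lift = np.where(segment == 0, 1.15, 0.92)  # B helps one segment, hurts the other

conv_a = rng.random(n) < base
conv_b = rng.random(n) < base * true_lift

print(f"pooled lift: {conv_b.mean() / conv_a.mean() - 1:+.1%}")
for s in (0, 1):
    m = segment == s
    print(f"segment {s} lift: {conv_b[m].mean() / conv_a[m].mean() - 1:+.1%}")
```

The pooled number comes out around +3.5%, looks shippable, and says nothing about the segment that just got worse.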

The fix is not to stop running tests. The fix is to design tests that respect heterogeneity, which means powering experiments to detect interactions between message features and audience characteristics rather than only headline averages. That is more expensive in sample size. It is also closer to the question you wanted to answer.

What to do instead

Three shifts make the program more useful.

The first is to treat experimentation as a way to reduce uncertainty, not as an audit that issues verdicts. A test is information about a response surface, used to make next week's decisions less wrong than this week's. Multi-armed bandits are designed for exactly this setting, with Thompson sampling as the standard Bayesian approach: they allocate traffic toward arms that are doing well while continuing to explore, and they handle many arms gracefully. "Many arms" is closer to the real problem than "two."
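As a minimal sketch of the idea, here is a Beta-Bernoulli Thompson sampler over five hypothetical arms. The true conversion rates are invented and would of course be unknown in practice:

```python
# Minimal Thompson sampling over a many-armed email test.
import numpy as np

rng = np.random.default_rng(1)
true_rates = np.array([0.028, 0.031, 0.025, 0.035, 0.030])  # hidden in practice
k = len(true_rates)
alpha = np.ones(k)  # Beta(1, 1) prior per arm:
beta = np.ones(k)   # one pseudo-success, one pseudo-failure

for _ in range(50_000):              # one send per iteration
    theta = rng.beta(alpha, beta)    # sample a plausible rate for each arm
    arm = int(np.argmax(theta))      # send the variant that currently looks best
    converted = rng.random() < true_rates[arm]
    alpha[arm] += converted          # conjugate posterior update
    beta[arm] += 1 - converted

print("sends per arm:", (alpha + beta - 2).astype(int))
print("posterior means:", (alpha / (alpha + beta)).round(4))
```

Traffic concentrates on the strong arms without ever fully abandoning the others, which is the exploration-exploitation trade the two-arm test never gets to make.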

The second is to model dimensions, not variants. Instead of testing Subject Line A against Subject Line B as opaque objects, decompose them into features (loss vs. gain framing, specificity, second-person voice, length, urgency cues) and estimate the effect of each. A model with partial pooling across features lets you make predictions about variants you never ran, which is what you actually wanted from the test in the first place.
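A simplified version of this, using plain logistic regression on message features rather than the full hierarchical model where the partial pooling would live, might look like the following. The features, effect sizes, and data are all simulated:

```python
# Sketch: estimate feature effects, then score a variant never sent.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 50_000

# Each sent email as a binary feature vector:
# [loss framing, specific number, second person, urgency cue]
X = rng.integers(0, 2, size=(n, 4))
true_w = np.array([0.20, 0.35, 0.10, -0.15])  # invented feature effects
y = rng.random(n) < 1 / (1 + np.exp(-(-3.5 + X @ true_w)))

model = LogisticRegression().fit(X, y)
print("estimated feature effects:", model.coef_.round(2))

# A variant you never ran: loss-framed, specific, second person, no urgency.
new_variant = np.array([[1, 1, 1, 0]])
print("predicted conversion:", model.predict_proba(new_variant)[0, 1].round(4))
```

The payoff is in the last two lines: the model scores a message that was never in the test, because it learned the dimensions rather than the variants.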

The third is to budget for heterogeneity in advance. Before launching, name the customer characteristics you expect to moderate the effect (segment, lifecycle stage, recent purchase behavior) and power the test to detect those interactions. If the budget cannot support that power, the headline average is not actionable, even if it is statistically significant.
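One way to do that budgeting is by simulation. The sketch below estimates power for a segment-by-treatment interaction in a 2x2 design, reusing the invented effect sizes from earlier; every number is an assumption to be replaced with your own:

```python
# Hedged sketch: simulated power for a segment-by-arm interaction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def interaction_power(n_per_cell, p_base=0.030, lift_seg0=1.15, lift_seg1=0.92,
                      sims=2_000, alpha=0.05):
    hits = 0
    for _ in range(sims):
        # Four cells: (segment 0/1) x (arm A/B), n_per_cell emails each.
        a0 = rng.binomial(n_per_cell, p_base)
        b0 = rng.binomial(n_per_cell, p_base * lift_seg0)
        a1 = rng.binomial(n_per_cell, p_base)
        b1 = rng.binomial(n_per_cell, p_base * lift_seg1)
        # z-test on the difference of segment-level lifts (diff-in-diff).
        d = (b0 - a0) / n_per_cell - (b1 - a1) / n_per_cell
        p = np.array([a0, b0, a1, b1]) / n_per_cell
        se = np.sqrt(np.sum(p * (1 - p)) / n_per_cell)
        hits += abs(d) / se > stats.norm.ppf(1 - alpha / 2)
    return hits / sims

for n in (5_000, 20_000, 80_000):
    print(f"n per cell {n:>6}: power ≈ {interaction_power(n):.0%}")
```

Note how slowly the power climbs: interaction effects are small differences of small differences, and detecting them is expensive. That expense is the honest price of the question.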

A locally true number

Running two-variant tests on a high-dimensional problem is not wrong the way a broken thermometer is wrong. It is wrong the way a thermometer in one room is wrong: technically accurate, useful where it hangs, and quietly unreliable as a guide to the building's heating system.

A two-arm test reports a small truth in a confident voice. The work of an experimentation program is to gradually trade some of that confidence for the ability to say larger things.