The top 3 mistakes that make your A/B test results invalid

A Fortune 500 company asked that I review their A/B testing strategy.

The results were good, the hypotheses strong, everything seemed to be in order… until I looked at the log of changes in their experimentation tool.

I noticed several blunders: In some experiments, they had adjusted the traffic allocation for the variations mid-experiment; some variations had been paused for a few days, then resumed; and experiments were stopped as soon as statistical significance was reached.

When it comes to testing, too many companies worry about the “what”, or the design of their variations, and not enough worry about the “how”, the execution of their experiments.

Don’t get me wrong, variation design is important: you need solid hypotheses supported by strong evidence. However, if you believe your work is finished once you have come up with variations for an experiment and pressed the launch button, you’re wrong.

In fact, the way you run your A/B tests is the most difficult and most important piece of the optimization puzzle.

”There are three kinds of lies: lies, damned lies, and statistics.”

– Mark Twain

In this post, I will share the biggest mistakes you can make within each step of the testing process: the design, launch, and analysis of an experiment, and how to avoid them. This post is fairly technical. Here’s how you should read it:
- If you are just getting started with conversion optimization (CRO), or are not directly involved in designing or analyzing tests, feel free to skip the more technical sections and simply skim for insights.
- If you are an expert in CRO or are involved in designing and analyzing tests, you will want to pay attention to the technical details. These sections are highlighted in blue.

Introduction
Your test has too many variations
You change experiment settings in the middle of a test
You’re doing post-test segmentation incorrectly
Conclusion

Mistake #1: Your test has too many variations

The more variations, the more insights you’ll get, right?

Not exactly. Having too many variations slows down your tests but, more importantly, it can impact the integrity of your data in 2 ways.

First, the more variations you test against each other, the more traffic you will need, and the longer you’ll have to run your test to get results that you can trust. This is simple math.

But the issue with running a longer test is that you are more likely to be exposed to cookie deletion. If you run an A/B test for more than 3–4 weeks, the risk of sample pollution increases: in that time, people will have deleted their cookies and may enter a different variation than the one they were originally in.

”Within 2 weeks, you can get a 10% dropout of people deleting cookies and that can really affect your sample quality.”

– Ton Wesseling, Founder, Online Dialogue

The second risk when testing multiple variations is that the significance level goes down as the number of variations increases.

For example, if you use the accepted significance level of 0.05 and decide to test 20 different scenarios, one of those will be significant purely by chance (20 * 0.05). If you test 100 different scenarios, the number goes up to five (100 * 0.05).

In other words, the more variations, the higher the chance of a false positive i.e. the higher your chances of finding a winner that is not significant.

Google’s 41 shades of blue is a good example of this. In 2009, when Google could not decide which shades of blue would generate the most clicks on their search results page, they decided to test 41 shades. At a 95% confidence level, the chance of getting a false positive was 88%. If they had tested 10 shades, the chance of getting a false positive would have been 40%, 9% with 3 shades, and down to 5% with 2 shades.

This is called the Multiple Comparison Problem.

You can calculate the chance of getting a false positive using the following formula: 1-(1-a)^m with m being the total number of variations tested and a being the significance level. With a significance level of 0.05, the equation would look like this:1-(1-0.05)^m or 1-0.95^m.

You can fix the multiple comparison problem using the Bonferroni correction, which calculates the confidence level for an individual test when more than one variation or hypothesis is being tested.

Wikipedia illustrates the Bonferroni correction with the following example: “If an experimenter is testing m hypotheses, [and] the desired significance level for the whole family of tests is a, then the Bonferroni correction would test each individual hypothesis at a significance level of a/m.

For example, if [you are] testing m = 8 hypotheses with a desired a = 0.05, then the Bonferroni correction would test each individual hypothesis at a = 0.05/8=0.00625.”
In other words, you’ll need a 0.625% significance level, which is the same as a 99.375% confidence level (100% – 0.625%) for an individual test.

The Bonferroni correction tends to be a bit too conservative and is based on the assumption that all tests are independent of each other. However, it demonstrates how multiple comparisons can skew your data if you don’t adjust the significance level accordingly.

The following tables summarize the multiple comparison problem.

Probability of a false positive with a 0.05 significance level:

Adjusted significance and confidence levels to maintain a 5% false discovery probability:

In this section, I’m talking about the risks of testing a high number of variations in an experiment. But the same problem also applies when you test multiple goals and segments, which we’ll review a bit later.

”Each additional variation and goal adds a new combination of individual statistics for online experiments comparisons to an experiment. In a scenario where there are four variations and four goals, that’s 16 potential outcomes that need to be controlled for separately.”

– Optimizely Practical Guide to Stats

Some A/B testing tools, such as VWO and Optimizely, adjust for the multiple comparison problem. These tools will make sure that the false positive rate of your experiment matches the false positive rate you think you are getting.

In other words, the false positive rate you set in your significance threshold will reflect the true chance of getting a false positive: you won’t need to correct and adjust the confidence level using the Bonferroni or any other methods.

One final problem with testing multiple variations can occur when you are analyzing the results of your test. You may be tempted to declare the variation with the highest lift the winner, even though there is no statistically significant difference between the winner and the runner up. This means that, even though one variation may be performing better in the current test, the runner up could “win” in the next round.

You should consider both variations as winners.
Mistake #2: You change experiment settings in the middle of a test

When you launch an experiment, you need to commit to it fully. Do not change the experiment settings, the test goals, the design of the variation or of the Control mid-experiment. And don’t change traffic allocations to variations.

Changing the traffic split between variations during an experiment will impact the integrity of your results because of a problem known as Simpson’s Paradox.This statistical paradox appears when we see a trend in different groups of data which disappears when those groups are combined.

Ronny Kohavi from Microsoft shares an example wherein a website gets one million daily visitors, on both Friday and Saturday. On Friday, 1% of the traffic is assigned to the treatment (i.e. the variation), and on Saturday that percentage is raised to 50%.

Even though the treatment has a higher conversion rate than the Control on both Friday (2.30% vs. 2.02%) and Saturday (1.2% vs. 1.00%), when the data is combined over the two days, the treatment seems to underperform (1.20% vs. 1.68%).

This is because we are dealing with weighted averages. The data from Saturday, a day with an overall worse conversion rate, impacted the treatment more than that from Friday.

Source: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web

We will return to Simpson’s Paradox in just a bit.

Changing the traffic allocation mid-test will also skew your results because it alters the sampling of your returning visitors.

Changes made to the traffic allocation only affect new users. Once visitors are bucketed into a variation, they will continue to see that variation for as long as the experiment is running.

So, let’s say you start a test by allocating 80% of your traffic to the Control and 20% to the variation. Then, after a few days you change it to a 50/50 split. All new users will be allocated accordingly from then on.

However, all the users that entered the experiment prior to the change will be bucketed into the same variation they entered previously. In our current example, this means that the returning visitors will still be assigned to the Control and you will now have a large proportion of returning visitors (who are more likely to convert) in the Control.

Note: This problem of changing traffic allocation mid-test only happens if you make a change at the variation level. You can change the traffic allocation at the experiment level mid-experiment. This is useful if you want to have a ramp up period where you target only 50% of your traffic for the first few days of a test before increasing it to 100%. This won’t impact the integrity of your results.

As I mentioned earlier, the “do not change mid-test rule” extends to your test goals and the designs of your variations. If you’re tracking multiple goals during an experiment, you may be tempted to change what the main goal should be mid-experiment. Don’t do it.

All Optimizers have a favorite variation that we secretly hope will win during any given test. This is not a problem until you start giving weight to the metrics that favor this variation. Decide on a goal metric that you can measure in the short term (the duration of a test) and that can predict your success in the long term. Track it and stick to it.

It is useful to track other key metrics to gain insights and/or debug an experiment, if something looks wrong. However, these are not the metrics you should look at to make a decision, even though they may favor your favorite variation.

Mistake #3: You’re doing post-test segmentation incorrectly
Let’s say you have avoided the 2 mistakes I’ve already discussed, and you’re pretty confident about the results you see in your A/B testing tool. It’s time to analyze the results, right?

Not so fast! Did you stop the test as soon as it reached statistical significance?

I hope not…

Statistical significance should not dictate when you stop a test. It only tells you if there is a difference between your Control and your variations. This is why you should not wait for a test to be significant (because it may never happen) or stop a test as soon as it is significant. Instead, you need to wait for the calculated sample size to be reached before stopping a test. Use a test duration calculator to understand better when to stop a test.

A test duration calculator like this one from VWO can help you determine how long to run an A/B test for.

Now, assuming you’ve stopped your test at the correct time, we can move on to segmentation. Segmentation and personalization are hot topics in marketing right now, and more and more tools enable segmentation and personalization.

It follows then, that after you stop a test, you probably start dissecting the results based on segments such as traffic source, new vs. returning users, device type, etc. This technique is called post-test segmentation and is one of the 3 ways to create personalization and segmentation hypotheses to test.

There are 2 main problems with post-test segmentation, however, that will impact the statistical validity of your segments (when done incorrectly).
1. The sample size of your segments is too small. You stopped the test when you reached the calculated sample size, but at a segment level the sample size is likely too small and the lift between segments has no statistical validity.
2. The multiple comparison problem. The more segments you compare, the greater the likelihood that you’ll get a false positive among those tests. With a 95% confidence level, you’re likely to get a false positive every 20 post-test segments you look at.
There are different ways to prevent these two issues, but the easiest and most accurate strategy is to create targeted tests (rather than breaking down results per segment post-test).

I don’t advocate against post-test segmentation―quite the opposite. In fact, looking at too much aggregate data can be misleading. (Simpson’s Paradox strikes back.)

The Wikipedia definition for Simpson’s Paradox provides a real-life example from a medical study comparing the success rates of two treatments for kidney stones.The table below shows the success rates and numbers of treatments for treatments involving both small and large kidney stones.)

Source: Confounding and Simpson’s Paradox

The paradoxical conclusion is that treatment A is more effective when used on small stones, and also when used on large stones, yet treatment B is more effective when considering both sizes at the same time.

In the context of an A/B test, this would look something like this:

Source: Segmenting Data for Web Analytics – The Simpson’s Paradox

Simpson’s Paradox surfaces when sampling is not uniform—that is the sample size of your segments is different. There are a few things you can do to prevent getting lost in and misled by this paradox.

First, you can prevent this problem from happening altogether by using stratified sampling, which is the process of dividing members of the population into homogeneous and mutually exclusive subgroups before sampling. However, most tools don’t offer this option.

If you are already in a situation where you have to decide whether to act on aggregate data or on segment data, Georgi Georgiev recommends you look at the story behind the numbers, rather than at the numbers themselves.

“My recommendation in the specific example [illustrated in the table above] is to refrain from making a decision with the data in the table. Instead, we should consider looking at each traffic source/landing page couple from a qualitative standpoint first. Based on the nature of each traffic source (one-time, seasonal, stable) we might reach a different final decision. For example, we may consider retaining both landing pages, but for different sources.

In order to do that in a data-driven manner, we should treat each source/page couple as a separate test variation and perform some additional testing until we reach the desired statistically significant result for each pair (currently we do not have significant results pair-wise).”

In a nutshell, it can be complicated to get post-test segmentation right, but when you do, it will unveil insights that your aggregate data can’t. Remember, you will have to validate the data for each segment in a separate follow up test.
The execution of an experiment is the most important part of a successful optimization strategy. If your tests are not executed properly, your results will be invalid and you will be relying on misleading data.

It is always tempting to showcase good results. Results are often the most important factor when your boss is evaluating the success of your conversion optimization department or agency.

But results aren’t always trustworthy. Too often, the numbers you see in case studies are lacking valid statistical inferences: either they rely too heavily on an A/B testing tool’s unreliable stats engine and/or they haven’t addressed the common pitfalls outlined in this post.

Use case studies as a source of inspiration, but make sure that you are executing your tests properly by doing the following:
- If your A/B testing tool doesn’t adjust for the multiple comparison problem, make sure to correct your significance level for tests with more than 1 variation
- Don’t change your experiment settings mid-experiment
- Don’t use statistical significance as an indicator of when to stop a test, and make sure to calculate the sample size you need to reach before calling a test complete
- Finally, keep segmenting your data post-test. But make sure you are not falling into the multiple comparison trap and are comparing segments that are significant and have a big enough sample size