How to measure A/B tests for maximum impact and insight

Sean Patterson

One of the core principles of experimentation is that we measure the value of experimentation in impact and insight. We don’t expect winning tests every time, but if we test well, we should always expect to draw insights from them. The only real ‘failed test’ is one that doesn’t win and that we learn nothing from.

In our eagerness to start testing, it’s common to come up with an idea (hopefully based on data, with an accompanying hypothesis!), get it designed and built, and set it live. Most of the thought goes into the design and execution of the idea, and often too little goes into how to measure the test to ensure we get the insight we need.

By the end of this article, you should have a clear view of what to measure in every test you run to get maximum impact and insight from it.

In every experiment it’s important to define a primary goal upfront – the goal that will ultimately judge the test a win or a loss. It’s rarely enough to track just this one goal, though. If the test wins, great, but we may not fully understand why. Similarly, if the test loses and we only track the main goal, the only insight we are left with is that it didn’t win. In that case, we don’t just have a losing test, we also have a test where we lose the ability to learn – the second key measure of how we get value from testing. And remember, most tests lose!

If we don’t track other goals and interactions in the test, we will miss the behavioral nuances and micro-interactions that can give us valuable insight into how the test affected user behavior. This is particularly important in tests where a positive result on the main KPI could actually harm another key business metric.

One example comes from a test we ran recently for a camera vendor. We introduced add-to-basket CTAs on a product listing page, so that users who already knew which product they wanted wouldn’t have to navigate down to the product page to purchase.

This led to a positive uplift in orders; however, it had a negative effect on average order value. The reason was that the product page was an important place for users to discover accessories for their products, including product care packages. As the test encouraged users to add the main product directly, they were less inclined to buy accessories and add-ons. The margins on accessories and add-on products are far higher than on cameras, so a lower average order value driven by fewer accessories is definitely a negative outcome.

Insights from well-tracked tests should be a key part of how your testing strategy develops, as new learnings inform better iterations and open up new areas for testing by revealing user behavior you were previously unaware of.

In any test, there is an almost endless number of things you could measure, and the solution to not tracking enough shouldn’t be to track everything. Measure too much and you’ll be swamped analyzing data points that have no value, and you’ll win no favour with the developers who have to implement all the tracking! Measure too little and you may miss the valuable insights that could turn a losing test into a winning iteration. The challenge is to measure the right things for each test.

What to measure?

Your North Star Metric

It goes without saying that every test should be aligned to your strategic goal for testing, and that strategic goal should always have a clear, measurable KPI. For an ecommerce site it will likely be orders or revenue; leads for a lead gen site or page; pages per visit or scroll depth for a content site, and so on. This KPI will be the key measure of whether your test succeeds or fails, and for that reason we call it the North Star metric. In essence, regardless of whatever else happens in the test, if we can’t move the needle on this metric, the test doesn’t win. Unsurprisingly, this metric should be tracked in every test you run.
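As a concrete illustration, here is one way the North Star goal might be wired up client-side. This is a minimal sketch assuming a Google Tag Manager-style dataLayer; the event name, experiment ID and variant labels are hypothetical placeholders, not any particular tool’s schema.

```typescript
// A minimal sketch, assuming a GTM-style dataLayer is present on the page.
// Event name, experiment ID and variant labels are illustrative only.

function trackNorthStar(experimentId: string, variant: string, orderValue: number): void {
  const w = window as unknown as { dataLayer?: Record<string, unknown>[] };
  w.dataLayer = w.dataLayer ?? [];
  w.dataLayer.push({
    event: "experiment_conversion", // the primary (North Star) goal firing
    experiment_id: experimentId,    // e.g. "plp_add_to_basket" (hypothetical)
    variant,                        // e.g. "control" or "variation_1"
    order_value: orderValue,        // captured now so guardrails can be analysed later
  });
}

// Fired on the order confirmation page:
trackNorthStar("plp_add_to_basket", "variation_1", 249.99);
```

Tagging every conversion with the experiment and variant is what lets you attribute the result back to the test later.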

You’ll know if the test wins, but what other effects did it have on your site? What effect did it have on purchase behavior and revenue? Did it lead to a decrease in other metrics that might be important to the business?

The performance of the North Star metric determines whether or not your hypothesis is proven or disproven. Your hypothesis in turn should be directly related to your primary objective.

Guardrail Metrics

You should also be defining ‘guardrail metrics’. These tend to be second-tier metrics that relate to key business outcomes; if they perform negatively, they call into question how successful the test really is. Conversely, if the test loses but they perform well, it’s probably a sign you’re on the right track. They don’t define success or failure on their own like the North Star metric does, but they contextualise the North Star metric when reporting on the test.

For an ecommerce site, if we assume the North Star metric is orders, then two obvious guardrail metrics would be revenue and average order value (AOV). If we run a test that increases orders but, as a result, users buy fewer items, or lower value items as in the example above, this would decrease AOV and could harm revenue.
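To make the relationship between those three numbers concrete, here is a minimal sketch of rolling raw order data up into per-variant metrics. The Order record shape is an assumption for illustration, not any analytics tool’s API.

```typescript
// A minimal sketch of computing the North Star metric (orders) and the two
// guardrail metrics (revenue and AOV) per variant from raw order records.

interface Order {
  variant: string; // e.g. "control" or "variation"
  value: number;   // order total
}

interface VariantMetrics {
  orders: number;  // North Star
  revenue: number; // guardrail 1
  aov: number;     // guardrail 2: revenue / orders
}

function summariseByVariant(orders: Order[]): Map<string, VariantMetrics> {
  const metrics = new Map<string, VariantMetrics>();
  for (const order of orders) {
    const m = metrics.get(order.variant) ?? { orders: 0, revenue: 0, aov: 0 };
    m.orders += 1;
    m.revenue += order.value;
    metrics.set(order.variant, m);
  }
  for (const m of metrics.values()) {
    m.aov = m.orders > 0 ? m.revenue / m.orders : 0;
  }
  return metrics;
}
```

Because AOV is derived from the other two, a variant can win on orders while quietly losing on AOV and revenue – exactly the pattern in the camera vendor example above.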

Tests can become much more insightful just by adding these two extra metrics. Not only can we see that the test drove more orders, but we can also see what effect our execution had on the value and quantity of products being bought. This gives us the opportunity to change the execution of the test to address the negative impact on our guardrail metrics and iterate. In this sense, measuring tests effectively is a core part of an iterative test and learn approach.

At a minimum, you should be tracking your North Star metric and guardrail metrics. These will tell you the impact of the test on the bottom line for the business.

Your guardrail metrics will generally be closely related to your North Star metric.

Secondary Metrics

Some tests you run may only impact your North Star metric – a test on the payment step of a funnel is a good example, where the most likely outcome is either more orders or fewer orders, and not much else. What you’ll learn is whether the change pushed users over the line.

Most other tests, however, will have a number of different effects. Your test may radically change the way users interact with the page, and measuring your tests at a deeper level than just the North Star and guardrail metrics will help you understand what effect the change has on user behavior.

We work with an online food delivery company where meal deals are the main way customers browse and shop. Given the number of meal deals on offer, one issue we found through our initial insights was that users struggle to navigate through them all to find something relevant. We ran a test that introduced filtering options to the meal deal page, covering how many people the deal feeds, what types of food it contains, the saving amount and the price point. Along with the key metrics, we also tracked every filter option in the test.
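As an illustration of what that instrumentation might look like, here is a sketch that fires one trackable event per filter option, using the same dataLayer pattern as above. The event schema and the data-filter attributes are assumptions for the example, not the client’s actual markup.

```typescript
// An illustrative sketch of instrumenting each filter option as its own
// secondary metric. Event schema and data attributes are hypothetical.

type MealDealFilter = "price" | "feeds" | "food_type" | "saving";

function trackFilterUse(filter: MealDealFilter, value: string): void {
  const w = window as unknown as { dataLayer?: Record<string, unknown>[] };
  w.dataLayer = w.dataLayer ?? [];
  w.dataLayer.push({
    event: "meal_deal_filter_used", // one trackable goal per filter option
    filter_type: filter,            // which facet the user chose
    filter_value: value,            // e.g. "under_20" or "feeds_4"
  });
}

// Wired to each filter control on the meal deal page:
document.querySelectorAll<HTMLElement>("[data-filter]").forEach((el) => {
  el.addEventListener("click", () => {
    trackFilterUse(el.dataset.filter as MealDealFilter, el.dataset.value ?? "");
  });
});
```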

This test didn’t drive any additional orders; in fact, few users interacted with the filter at all, suggesting it wasn’t very useful in helping users curate the meal deals. What we did notice, however, was that the users who did use it overwhelmingly filtered meal deals by price first, and by how many people the deal feeds second. So: a ‘flat’ test, but we now know two very important pieces of information about what users look for when selecting deals.

This in turn led to a series of tests on how to better highlight price and how many people the meal feeds, both at different points in the user journey and on the meal deal offers themselves. These insights have helped shape the direction of our testing strategy by shedding light on user preferences. Had we only tracked the North Star and guardrail metrics, these insights would have been lost.

For each test you run, really think through the possible user journeys and interactions that could result from the change, and make sure you track them. That doesn’t mean track everything, but start to see tests as a way of learning about your users, not just a way to drive growth.

Secondary metrics help contextualise your North Star and Guardrail metrics, as well as shed light on other behaviors.

Segmentation

If you’ve managed to track your North Star, guardrail and some secondary metrics in your tests, you’re in a great place. One other thing to think about is how to segment your data. Segmenting your test results is hugely important, especially when different user groups respond differently to a change. Device is an obvious segment you should be looking at in every test. We’ve seen tests that had double digit uplifts on desktop but didn’t move the needle at all on mobile.

If your test involves introducing a new feature or piece of functionality that users can interact with, it’s helpful to create a segment for users who interact with that feature. This will help shed light on how interaction with the new functionality affects user behavior.
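As a rough illustration of both kinds of segmentation, here is a minimal sketch assuming a hypothetical Session record per user; in practice these segments would come out of your analytics or testing tool rather than hand-rolled code.

```typescript
// A minimal sketch of slicing test results by device and by interaction
// with the new feature. The Session shape is hypothetical.

interface Session {
  variant: string;
  device: "desktop" | "mobile" | "tablet";
  usedNewFeature: boolean;
  converted: boolean; // did this session hit the North Star goal?
}

function conversionRate(sessions: Session[]): number {
  if (sessions.length === 0) return 0;
  return sessions.filter((s) => s.converted).length / sessions.length;
}

function segmentReport(sessions: Session[], variant: string): void {
  const inVariant = sessions.filter((s) => s.variant === variant);
  // Device segments: a test can win on desktop and be flat on mobile.
  for (const device of ["desktop", "mobile", "tablet"] as const) {
    const segment = inVariant.filter((s) => s.device === device);
    console.log(`${variant}/${device}: ${(conversionRate(segment) * 100).toFixed(1)}%`);
  }
  // Interaction segment: users who actually touched the new feature.
  const interacted = inVariant.filter((s) => s.usedNewFeature);
  console.log(`${variant}/interacted: ${(conversionRate(interacted) * 100).toFixed(1)}%`);
}
```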

Key takeaways

Successful tests are measured by impact and insight. The only ‘failed’ test is one that doesn’t win and from which you learn nothing. Insightful tests allow you to better understand why a test performed the way it did, meaning you can learn, iterate and improve more rapidly, leading to better, more effective testing.

If you would like to learn more about our approach, get in touch today!
