The Big Debate: What should your primary metric be?

One of the biggest myths in testing is that your primary metric shouldn’t be the purchase or conversion at the end of the user journey.

In fact, one of the biggest names in the game, Optimizely, states:

“Your primary metric (and the main metric in your hypothesis) should always be the behavior closest to the change you are making in the variation you are employing.”

Optimizely

We disagree – and want to show how this approach can actually limit the effectiveness of your experimentation program.

Introduction
But first… what is a primary metric?
So what’s the big debate?
Why is the myth wrong?
But is it really a big deal?
But it’s so noisy!
But where do you draw the line?
But what if you are struggling to reach significance?

But first… what is a primary metric?
Your primary metric is the metric you will use to decide whether the experiment is a winner or not.

We also recommend tracking:
- Secondary metrics – to gain more insight into your users’ behavior
- Guardrail metrics – to ensure your test isn’t causing harm to other important business KPIs.
So what’s the big debate?
Some argue that your primary metric should be the next action you want the user to take, not final conversion.

Diagram: Next action vs final action

For example, on a travel website selling holidays, the ‘final conversion’ is a holiday booking – this is the ultimate action you want the user to take. However, if you have a test on a landing page, the next action you want the user to take is to click forward into the booking funnel.

The main motive for using the next action as your primary metric is that it will be quicker to reach statistical significance. Moreover, it is less likely to give an inconclusive result. This is because:
- Inevitably more users will click forward (as opposed to making a final booking) so you’ll have a higher baseline conversion rate, meaning a shorter experiment duration.
- The test has a direct impact on click forward as it is the next action you are persuading the user to take. Meanwhile there may be multiple steps between the landing page and the final conversion. This means many other things could influence the user’s behavior, creating a lot of noise.
- There could even be a time lag. For example, if a customer is looking for a holiday online, they are unlikely to book in their first session. Instead they may have a think about it and have a couple more sessions on the site before taking the final step and converting.
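The effect of the baseline rate on experiment size can be sketched with a standard sample size approximation for a two-sided two-proportion z-test. This is a minimal illustration, not a replacement for a proper test duration calculator: the function name, the 30% click forward rate, the 2% booking rate and the 10% relative uplift are all assumptions chosen for the example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_uplift, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical rates: 30% of users click forward, 2% complete a booking.
# Both metrics are sized to detect the same 10% relative uplift.
click_forward_n = sample_size_per_arm(0.30, 0.10)
booking_n = sample_size_per_arm(0.02, 0.10)

print(f"Click forward: {click_forward_n} users per arm")
print(f"Final booking: {booking_n} users per arm")
```

On these made-up numbers the booking metric needs roughly twenty times as many users per arm as the click forward metric, which is the speed advantage the 'next action' camp is pointing to.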

Why is the myth wrong?
Because it can lead you to make the wrong decisions.

Example 1: The Trojan horse

Take this B2B landing page below: LinkedIn promotes their ‘Sales Navigator’ product with an appealing free trial. What’s not to like? You get to try out the product for free so it is bound to get a high click through rate.

But wait…when you click forward you get a nasty shock as the site asks you to enter your payment details. You can expect a high drop-off rate at this point in the funnel.

On this landing page LinkedIn doesn’t tell the user about the credit card form waiting two steps away

LinkedIn requires users to enter their payment details to access the free trial, but this was not made clear on the landing page

A good idea would be to test the impact of giving the user forewarning that payment details will be required. This is what Norton Security have under the “Try Now” CTA on their landing page.

Norton Security lets their users know that a credit card is required, so there are no nasty surprises

In an experiment like this, it is likely that you would see a fall in click through (the ‘next action’ from the landing page). However, you might well see an uplift in final conversion – because the user receives clear, honest, upfront communication.

In this LinkedIn Sales Navigator example:
- If you were to use clicks forward as your primary metric, you would declare the test a loser, despite the fact that it increases conversion.
- If you were to use free trial sign ups as your primary metric, you would declare the test a winner – a correct interpretation of the results.
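To make the arithmetic concrete, here is a small sketch of how the two metrics can point in opposite directions. All rates below are hypothetical and are not LinkedIn or Norton data; the `funnel` helper is invented for this illustration.

```python
def funnel(visitors, click_rate, completion_rate):
    """Return (clicks, sign_ups) for a simple two-step funnel."""
    clicks = round(visitors * click_rate)
    sign_ups = round(clicks * completion_rate)
    return clicks, sign_ups


visitors = 10_000

# Hypothetical control: the card requirement is hidden, so more users click
# forward but many drop off when they meet the payment form.
control_clicks, control_sign_ups = funnel(visitors, click_rate=0.20, completion_rate=0.10)

# Hypothetical variant: the card requirement is disclosed up front, so fewer
# users click forward but those who do are better prepared to finish.
variant_clicks, variant_sign_ups = funnel(visitors, click_rate=0.15, completion_rate=0.16)

print(f"Control: {control_clicks} clicks, {control_sign_ups} sign ups")
print(f"Variant: {variant_clicks} clicks, {variant_sign_ups} sign ups")
# Control: 2000 clicks, 200 sign ups
# Variant: 1500 clicks, 240 sign ups
```

Judged on clicks forward, the variant loses by 25%. Judged on free trial sign ups, it wins by 20%.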
Example 2: The irresistible big red button

The ‘big red button’ phenomenon is another scenario that will help to bust this troublesome myth:

When you see a big red button, all you want to do is push it – it’s human nature.

The big red button phenomenon

This concept is often taken advantage of by marketers:

Imagine you have a site selling experience gifts (e.g. ‘fine dining experience for two’ or ‘one day acrobatics course’). You decide to test increasing the prominence of the main CTA on the product page. You do this by increasing the CTA size and removing informational content (or moving it below the fold) to remove distractions. Users might be more inclined to click the CTA and arrive in the checkout funnel. However, this could damage conversion. Users may click forward but then find they are lacking information and are not ready to be in the funnel – so actual experience bookings may fall.

Again, in this scenario using click forward as your primary metric will lead you to the wrong conclusions. Using final conversion as your primary metric aligns with your objective and will lead you to the correct conclusions.

There are plenty more examples like these. And this isn’t a made-up situation or a rare case. We frequently see an inverse relationship between clickthrough and conversion in experimentation.

This is why PPC agencies and teams always report on final conversion, not just click through to the site. It is commonly known that a PPC advert has not done its job simply by getting lots of users to the site. If this were the case you would find your website inundated with unqualified traffic that bounces immediately. No – the PPC team is responsible for getting qualified traffic to your site, which they measure by final conversion rate.
But is it really a big deal?

Some people say, ‘Does it really matter? As long as you are measuring both the ‘next action’ and the final conversion then you can interpret the results depending on the context of the test.’

That’s true to some extent, but the problem is that practitioners often interpret results incorrectly. Time and time again we see tests being declared as winners when they’ve made no impact on the final conversion – or may have even damaged it.

Why would people do this? Well, there is a crude underlying motive for some practitioners. It makes them look more successful at their job – with higher win rates and quicker results.

And there are numerous knock-on effects from this choice:

1. Wasting resources

When an individual declares a test as a winner incorrectly, the test will need to get coded into the website. This will be added to the development team’s vast pile of work. A huge waste of valuable resources when the change is not truly improving the user experience and may well be harming it.

2. Reducing learnings

Using next action as your primary metric often leads to incorrect interpretation of results. In turn, vital information about the test’s true impact gets left out of communications. Miscommunication of results means businesses miss out on valuable insights about their users.

Always question your results to increase your understanding of your users. If you are seeing an uplift in the next action, ask yourself, ‘Does this really indicate an improvement for users? What else could it indicate?’ If you are not asking these questions, then you are testing for the sake of it rather than testing to improve and learn.

3. Sacrificing ROI

With misinterpreted results, you may sacrifice the opportunity to iterate and find a better solution that will work. Instead of implementing a fake winner, iterate, find a true winner and implement that!

Moreover, you may cut an experiment short, having seen a significant fall in next step conversion. Whereas if you had let the experiment run for longer, it could have given a significant uplift in final conversion. Declaring a test a loser when it is in fact a winner will of course sacrifice your ROI.

4. Harming stakeholder buy-in

On the surface, using click-through as your primary metric may look great when reporting on your program metrics. It will give your testing velocity and win rate a nice boost. But it doesn’t take long, once someone looks beneath the surface, to see that all your “winners” are not actually impacting the bottom line. This can damage stakeholder buy-in, as your work is all assumptive rather than factual and data-driven.
But it’s so noisy!

A common complaint we hear from believers of the myth is that there is too much noise we can’t account for. For example, there might be 4 steps in the funnel between the test page and the final conversion. Therefore, there are so many other things that may have influenced the user in the time between step 1 and step 4 that could lead them to drop off.

That’s true. But the world is a noisy place. Does that mean we shouldn’t test at all? Of course not.

For instance, I might search “blue jacket” and Google links me through to an ASOS product page for their latest denim item. Between this page and the final conversion we have 3 steps: basket, sign in, checkout.

Look at all the noise that could sway my decision to purchase along each step of the journey:

As you can see there is a lot of unavoidable noise on the website and a lot of unavoidable noise external to the site. Imagine ASOS were to run a test on the product page and were only measuring the next action (“add to basket” clicks). Their users are still exposed to a lot of website noise and external noise during this first step.

However, one thing is for sure: all users will face this noise, regardless of whether they are in the control or the variant. As the test runs, the sample size will get larger and larger, and the likelihood of seeing a false uplift due to this noise gets smaller and smaller. This is exactly why we ensure we don’t make conclusions before the test has gathered enough data.
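This shrinking risk can be illustrated with a simulated A/A test, where control and variant share the same true conversion rate and any observed 'uplift' is pure noise. The sketch below is a simplified assumption-laden example: the 10% conversion rate, the 10% uplift threshold, the sample sizes and the function name are all invented for illustration.

```python
import random


def false_uplift_rate(n_per_arm, true_rate=0.10, threshold=0.10, runs=200, seed=1):
    """Share of A/A runs where the 'variant' looks at least `threshold` better by chance."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    false_wins = 0
    for _ in range(runs):
        # Both arms draw from the same true rate: there is no real effect.
        control = sum(rng.random() < true_rate for _ in range(n_per_arm))
        variant = sum(rng.random() < true_rate for _ in range(n_per_arm))
        if control > 0 and (variant - control) / control >= threshold:
            false_wins += 1
    return false_wins / runs


small = false_uplift_rate(200)    # a test stopped very early
large = false_uplift_rate(5000)   # the same test with far more data

print(f"200 users per arm:  {small:.0%} of runs show a 10%+ 'uplift'")
print(f"5000 users per arm: {large:.0%} of runs show a 10%+ 'uplift'")
```

The noise is identical in both arms, so with more users the chance-driven 'uplifts' become much rarer.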

The same goes when we use final conversion as our primary metric rather than ‘next action’. Sure, there is more noise, which is one of the reasons why it takes longer to reach statistical significance. But once you reach statistical significance, your results are just as valid, and are more aligned with your ultimate objective.
But where do you draw the line?
Back to our LinkedIn Sales Navigator example: as discussed above, the primary metric should be free trial sign ups. But this isn’t actually the ultimate conversion you want from the user. The ultimate conversion is for the user to become a full-time subscriber to your product, beyond the free trial.

You should think of it like a relay race.

The objective of the landing page is to generate free trials. → The objective of the free trial is to generate full time subscriptions. → The objective of the full time subscription is to maintain the customer (or even upsell other product options):

Each part of the relay race is responsible for getting the customer to the next touch point. The landing page has a lot of power to influence how many users end up starting the free trial. It has less power to influence how successful the free trial is and whether the user will continue beyond the trial.

Nonetheless, we’ve seen experiments whereby the change does have a positive impact beyond the first leg of the relay race, as it were. In one experiment we explained the product more clearly on the landing page. This increased the user’s understanding of it, making them more likely to actually use their free trial (and be successful in doing so). This led to an uplift in full subscription purchases 30 days later.

For this kind of experiment that could have an ongoing influence, you may wish to keep the experiment running for longer to get a read on this. It is sensible to define a decision policy up-front in this instance. In this example, where the impact on full purchases is likely to be flat or positive, your decision policy might be:
- If we see a flat result or a fall in free trial sign ups (primary KPI) we will do the following:
  - Stop the test and iterate with a new execution based on our learnings from the test.
- If we see a significant uplift in free trial sign ups (primary KPI), we will do the following:
  - Serve the test to 95% and keep a 5% hold back to continue measuring the impact on full subscription purchases (secondary KPI).
This way, you will be able to make the right decisions and move on to your next experiments while still learning the full value of your experiment.
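Writing the policy down as code before the test starts is one way to make sure it is agreed up-front and cannot be bent once the results arrive. The sketch below simply encodes the two branches described above; the function name and inputs are hypothetical, and it assumes significance has already been assessed by your testing tool.

```python
def decide(primary_lift, primary_significant):
    """Map the primary KPI result (free trial sign ups) to the agreed next step.

    primary_lift: observed relative change in free trial sign ups (e.g. 0.08 for +8%).
    primary_significant: whether that change reached statistical significance.
    """
    if primary_significant and primary_lift > 0:
        # Significant uplift: roll out, but keep measuring the secondary KPI.
        return "serve to 95%, keep a 5% hold back to measure full subscription purchases"
    # Flat result or a fall: stop and iterate.
    return "stop the test and iterate with a new execution"


print(decide(primary_lift=0.08, primary_significant=True))
print(decide(primary_lift=0.02, primary_significant=False))
print(decide(primary_lift=-0.05, primary_significant=True))
```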

For a test where there is a higher risk of a negative impact on full subscription purchases, you may do the following things:
1. Define the full subscription metric as your guardrail metric.
2. Design a stricter decision policy whereby you gather enough data to confirm there is no negative impact on full subscription purchases.
But what if you are struggling to reach significance?
For many, using the next action as the primary metric allows them to experiment faster. So does low traffic justify testing to the next action instead of sale? Sometimes, but only if you’ve considered these options first:

1. Don’t run experiments

That’s not to say you shouldn’t be improving your website. Experiments are the truest form of evidence to understand your audience. But if you don’t have enough traffic, the next best thing is to inform and validate your optimization with other forms of evidence, such as usability testing. Gathering insights via analytics data & user research is extremely powerful. This is something we continually do alongside experimentation, for all our clients.

2. Be more patient

For a particularly risky change, you might be willing to be patient and choose to run an experiment that will take longer to reach significance. Before you do this, ensure you plug the numbers into a test duration calculator so that you have a good idea of exactly how patient you are going to need to be. Here are a couple of good ones that are independent of any particular testing tool:
- Analytics-toolkit.com
- Evan Miller
3. Run tests on higher traffic areas & audiences

If you are trying to run tests to a very specific audience or a low traffic page, you aren’t going to have much luck in reaching statistical significance. Make sure you look at your site analytics data and prioritize your audiences and areas by their relative size.

With all that being said, you do have a fourth option.

If you are really struggling to reach statistical significance then you might want to use the next action as your primary metric. This isn’t always a disaster – so long as you interpret your results correctly. The problem is that so often people don’t.

For a site with small traffic it may make sense to take this approach if you are experienced in interpreting experiment results.

However, for sites with lots of traffic, there’s really no excuse. So start making the switch today. Your win rates might fall slightly, but when you get a win, you can feel confident that you are making a true difference to the bottom line.

To find out more about our approach to experimentation, get in touch today!