Confidence AI: the next generation of A/B test prioritization

Most experimentation teams have far more test ideas than they could ever conceivably launch.

This creates a problem:

How do we decide, in as objective a manner as possible, which tests to prioritize in order to maximize a program’s overall impact?

Historically, there have been a number of prioritization frameworks developed to solve this problem. Unfortunately, all of them have fallen short in a number of fairly big ways.

Now, with the help of artificial intelligence, we think we’ve finally found a close-to-optimal solution.

In this article, we’re going to briefly discuss the shortcomings of past prioritization approaches, before sharing an overview of our new prioritization tool – Confidence AI – and how it overcomes them.

And for those of you that are skeptical, here’s a fact to keep you reading:

Confidence AI is able to predict the results of winning a/b tests with 63% accuracy.* Based on standard industry win rates, this suggests it is several times more accurate than the average experimentation practitioner.

*Concepts with a confidence score of 66% and above are classed as predictions of a winner. For more on how confidence scores work, see the What is Confidence AI and how does it work? section of this page.

Introduction
Prioritization tools of the past
What is Confidence AI and how does it work?
How accurate is Confidence AI?
How we actually use Confidence AI: beyond win rate
Final thoughts: silver bullet or something else?

Prioritization tools of the past
Prior to now, all leading prioritization tools have fallen short in at least one of two ways:
1. Subjectivity – at its core, our goal as experimenters is to use objective data to make better decisions. Most prioritization tools rely heavily on human interpretation, which introduces an undesired element of subjectivity into the experimentation process.
2. One size fits all – every page, every website, every company, and every industry is different. To prioritize our tests appropriately, we need to evaluate test ideas using flexible, data-backed criteria that adapt to the unique context within which each test is being conducted (more on this shortly). This is a big ask, and unsurprisingly, every prioritization tool up until now falls short in this respect too.
To illustrate these points, consider two of our industry’s favorite prioritization frameworks: ICE and PXL.

(If you’re not interested in a comparison of different prioritization methods, feel free to skip ahead!)

Growthhacker’s ICE framework prioritizes a/b test concepts based on three factors:
- Impact – if this works, how big will its impact be?
- Confidence – how confident are we that this will work?
- Ease – how easy will this be to implement?
In essence, you score each concept out of ten for each of these three factors, and you then add up the scores and divide them by three to give an average out of ten. Concepts with the highest scores are prioritized.

As will be evident from this brief explanation, this method – while undeniably useful – is extremely flawed. After all, there’s no reliable way of generating an estimate of a test’s ‘impact’ or ‘confidence’ short of actually running the test itself. Any estimate we do make is therefore bound to be a very crude guess.

Experimentation is supposed to be about eliminating gut-feel and intuition where at all possible, and yet this prioritization framework is literally built on these things.

CXL’s PXL framework attempts to minimize subjectivity by creating a set of criteria that is believed to predict the impact of any given experiment. These criteria are weighted based on expected importance, and the scores for each individual criterion are then summed to give an overall prioritization score. As with ICE, experiments with higher prioritization scores are prioritized.

ConversionXL’s PXL framework

This approach has a number of advantages over ICE:
- Many of the questions involve factual yes or no answers, e.g. whether or not the test is above the fold, whether or not it involves removing or adding an element, etc. The less ambiguity involved in the scoring, the less room there is for subjectivity to creep in.
- Many of the questions can be answered using empirical data (rather than intuition), e.g. whether or not the change is noticeable within 5 seconds can be answered using a simple usability test. Again, this minimizes subjectivity.
But while PXL does a laudable job of side-stepping the subjectivity objection, it falls quite badly afoul of the one-size-fits-all objection.

To see why, consider the first two criteria in the PXL framework:
1. ‘Above the fold?’
2. ‘Noticeable within 5 sec?’
On a homepage experiment, there’s every possibility that each of these criteria will be strongly correlated with the impact of the test. With a product page experiment, on the other hand, users are often prepared to delve deep beneath the fold, so the importance of these criteria is likely to be much less pronounced.

In actual fact, based on our own internal analysis, we’ve found that experiments beneath the fold have a similar – sometimes even higher – win-rate than those above it (see graph below).

On mobile, above the fold tests have a win rate of 31% vs. 41% for below the fold. On desktop, it is 37% vs. 36%. Sample size: 505 a/b tests.

All this to say, this one-size-fits-all approach means that certain kinds of tests will be prioritized for no other reason than that the set of criteria being used is biased in their favor.

Like ICE, this framework is undeniably useful, but also like ICE, it falls far short of the ideal.

So, this begs a question:

What is the ideal?

Well, in our view, an ideal prioritization tool would:
- Be as close to completely objective as possible;
- Produce flexible criteria for each test concept;
- Use past results from each individual program to inform prioritization criteria and scores;
- Require no additional work on top of what you already do;
- Be completely dynamic, so that concepts are continuously reprioritized when new results and insights are uncovered. (This is another big issue with current prioritization approaches: prioritization only happens once at the beginning of a quarter or strategy. That means the backlog is a freeze frame of what the priority was at the time of creation, rather than being continuously updated based on the latest learnings from the experiments that are being run.)
This is, we realize, a huge ask.

Up until recently, it has been far beyond the capabilities of any prioritization tool in existence.

But with Confidence AI, that’s all changed…
What is Confidence AI and how does it work?

Confidence AI is a machine learning model that we’ve developed to predict the results of a/b tests. By embedding Confidence AI into our prioritization approach, we’re able to almost completely remove both of the shortcomings of traditional prioritization methods identified above (subjectivity and one-size-fits-all-ness).

Here’s an outline of how it works:

As some of you may know, here at Conversion we store extensive data on every experiment we run. This includes data about the client, the client’s industry, the page, the levers, the change type, the psychological principle, and more.

We’ve been in business for over 15 years, and we’ve worked with over 200 clients in over 30 industries. This means that we now have a huge experiment database, consisting of over 20,000 experiments and hundreds of thousands of data points.

Having trained Confidence AI on this vast dataset, we can now input the parameters of tests we plan to run into the model and it will compute a confidence score based on how likely it predicts each experiment is to win.

Confidence AI is integrated within each client’s Experiment OS, where it prioritizes their experiment backlog based on the confidence score of each test concept. Concepts with a high confidence score are pushed to the top of the priority list and those with a low confidence score are pushed to the bottom.

Screenshot taken from Experiment OS

What’s more, as we run more tests and gather more data, Confidence AI dynamically updates the confidence scores of experiments in the backlog to reflect new learnings from each client’s experiments as they come in.

How accurate is Confidence AI?
This may all sound very impressive in theory, but in practice, Confidence AI is really only as good as the predictions it produces – so, the question on everyone’s mind:

Does Confidence AI actually work?

Here’s the data:

10 months ago we rolled out Confidence AI across our entire consulting team. What this means, in practice, is that each time our consultants develop a new concept for one of our clients, the concept is fed into Confidence AI.

Confidence AI then takes this concept and computes a confidence score out of 100 based on how likely it believes an experiment is to win.

We grouped these confidence scores into 3 categories – low confidence (0-33), medium confidence (33-66), and high confidence (66-100) – and then looked at the actual average win rates for each of these categories.

Here are the results:
- Low confidence (0-33) – 223 completed experiments – win rate of 27%.
- Medium confidence (33-66) – 120 completed experiments – win rate of 43%.
- High confidence (66-100) – 49 completed experiments – win rate of 63%.
Given that elite experimentation organizations like Microsoft, Airbnb, Google, and Booking.com report win rates in the range of 8-30%* – and if we assume that every experiment is conducted under the assumption that it is going to win – then it seems that Confidence AI is massively outperforming the average practitioner in terms of its ability to predict winners.

*There may be a little bit of noise here. There are a number of reasons that the win-rates of these tech giants are often so low. To name two: 1) no-brainers are often implemented without testing; 2) website assets are often already very well-optimized, making wins harder to come by. Nonetheless, Confidence AI’s accuracy is significantly higher than any industry win-rate we’ve ever come across – including our own! – which gives us good reason to be confident in its utility.
How we actually use Confidence AI: beyond win rate

The eagle-eyed amongst you may have spotted what seems to be a bit of a discrepancy in the last section:

If Confidence AI is supposedly prioritizing experiments with the highest confidence scores, why did we run so few high confidence tests during the trial period (49 high confidence, 120 medium confidence, and 230 low confidence)?

This raises an important point: when deciding which experiments to run, there are many factors that come into play in addition to likelihood of winning. On a basic level, we also need to consider things like build size and dependencies, which will determine how resource expensive a test is likely to be.

For example, if we have a high confidence test concept that is going to take a month to build and that also requires signoff from our client’s entire board of directors, we might choose to prioritize a lower confidence test with fewer dependencies and a quicker build.

This approach allows us to move at speed and gather learnings as we go – learnings that feed back into Confidence AI and that generate more accurate predictions.

But going beyond considerations about resource constraints and velocity, it’s also worth making a more general point: experimentation is about more than exploiting low hanging fruit (as identified by Confidence AI); it’s about exploring the full landscape of potential solutions to help our clients find the overall optimal solution to their problem – the global maxima.

Our goal is to help our clients find their global maxima

Confidence AI is a tool – an extremely powerful one – in our consultants’ toolkit. It provides them with a high-fidelity picture of the risk-reward landscape that they are operating within, which means they can – if they choose to – aggressively exploit high confidence tests that are likely to deliver strong short-term ROI for our clients.

But our goal as an agency is to maximize long-term – not just short-term – ROI for our clients.

In our experience, safe, high-confidence tests are great for generating incremental uplifts, but the most value tends to come when we use experimentation to help our clients take bold risks with a safety net.

This is a big part of the reason that we’re so often able to help our clients move beyond the plateaus on which they were previously stuck and continue their ascent towards their respective global maximas.

Saying that, we of course allow each of our clients to set the agenda for their own programs. If our client asks us to use Confidence AI to drive as much short-term value as possible, this is absolutely what we will do. But more often than not, our clients understand the long-term value that experimentation can bring to their entire business, so the emphasis of most programs is on measured exploration as much as it is on exploitation.

Circling back to the question asked at the start of this section, then: the primary reason that we often run fewer high-confidence tests than might be expected is that we often choose to explore uncharted territory and help our clients discover the high-risk, huge-reward solutions that have the power to revolutionize their businesses.
Final thoughts: silver bullet or something else?

Confidence AI is an incredibly exciting development that has significantly improved our ability to prioritize promising avenues of experimentation.

What’s more, the model is still in its infancy. This is the first iteration of Confidence AI, and we have every reason to believe that as we gather more data and roll out further iterations, the model will get more accurate over time.

But it’s important to emphasize that Confidence AI is not a silver bullet.

Ultimately, it’s only as effective as the experiment concepts themselves. If supporting research is poor or if experiment executions are weak, then Confidence AI’s predictive power goes way down.

Equally important, while Confidence AI may be able to give us a strong indication about how to maximize the short-term impact of a program, ascending to a program’s global maxima requires trialling bold, innovative ideas – the kind of ideas that a tool like Confidence AI is not particularly well equipped to evaluate.

All this to say, maybe one day Confidence AI will be able to predict every kind of experiment result with unerring accuracy. Until then, it is simply a powerful tool that is as effective – or as ineffective – as the practitioner wielding it.