SCORE: A dynamic prioritization framework for AB tests from Conversion.com

Stephen Pavlovich

Why prioritize?

With experimentation and conversion optimization, there is never a shortage of ideas to test.

In other industries, specialist knowledge is often a prerequisite. It’s hard to have an opinion on electrical engineering or pharmaceutical research without prior knowledge.

But with experimentation everyone can have an opinion: marketing, product, engineering, customer service – even our customers themselves. They can all suggest ideas to improve the website’s performance.

The challenge is how you prioritize the right experiments.

There’s a finite number of experiments we can run – we’re limited both by the resources to create and analyze experiments, and by the traffic to run them on.

Prioritization is the method to maximize impact with an efficient use of resources.


Where most prioritization frameworks fall down

There are multiple prioritization frameworks – PIE (from WiderFunnel), PXL (from ConversionXL), and more recently the native functionality within Optimizely’s Program Management.

Each framework has a broadly consistent approach: prioritization is based on a combination of (a) the value of the experiment, and (b) the ease of execution.

For example, ConversionXL’s PXL framework asks a series of yes/no questions to objectively assess each experiment’s value and ease.

Experiments that are above the fold and based on quantitative and qualitative research will rightly score higher than a subtle experiment based on gut instinct alone.

This approach works well: it rewards the right behavior (and can even help drive the right behavior in the future, as users submit concepts that are more likely to score well).

But while it improves the objectivity in scoring, it lacks two fundamental elements:

  1. It accounts for page traffic, but not page value. So an above-the-fold research-backed experiment on a zero-value page could be prioritized above experiments that could have a much higher impact. (We used to work with a university in the US whose highest-traffic page was a blog post on ramen noodle recipes. It generated zero leads – but the PXL framework wouldn’t account for that automatically.)
  2. While it values qualitative and quantitative research, it doesn’t appear to include data from previous experiments in its prioritization. We know that qualitative research can sometimes be misleading (customers may say one thing and do something completely different). That’s why we validate our research with experimentation. But this model’s focus is purely on research – whereas a conclusive experiment is the best indicator of a future iteration’s success.

Moreover, most frameworks struggle to adapt as an experimentation program develops. They tend to work in isolation at the start – prioritizing a long backlog of concepts – but over time, real life gets in the way.

Competing business goals, fire-fighting and resource challenges mean that the prioritization becomes out-of-date – and you’re left with a backlog of experiments that is more static than a dynamic experimentation program demands.

Introducing SCORE – Conversion.com’s prioritization process

Our approach to prioritization is based on more than 10 years’ experience running experimentation programs for clients big and small.

We wanted to create an approach that accounts for value as well as traffic, that learns from previous experiments, and that stays dynamic as the program develops.

But the downside is that it’s not a simple checklist model. In our experience, there’s no easy answer to prioritization – it takes work. But it’s better to spend a little more time on prioritization than waste a lot more effort building the wrong experiments.


With that in mind, we’re presenting SCORE – Conversion.com’s prioritization process: Strategy, Concepts, Order, Roadmap, Experimentation.

As you’ll see, the prioritization of concepts against each other happens in the middle of the process (“Order”) and is contingent on the program’s strategy.

Strategy: Prioritizing your experimentation framework

At Conversion.com, our experimentation framework is fundamental to our approach. Before we start on concepts, we first define the goal, KPIs, audiences, areas and levers (the factors that we believe affect user behavior).

You can read more about our framework here and you can create your own with the templates here.

When your framework is complete (or, at least, started – it’s never really complete), we can prioritize at the macro level – before we even think about experiments.

Assuming we’ve defined and narrowed down the goal and KPIs, we then need to prioritize the audiences, areas and levers:

Audiences

Prioritize your audiences on volume (how many users there are), value (e.g. the profit per user) and potential (how much improvement you believe is possible).

You can, of course, change the criteria here to adapt the framework to better suit your requirements. But as a starting point, we suggest combining the profit per user and the potential improvement.

Don’t forget: we want to prioritize the biggest-value audiences first – so that typically means targeting as many users as possible, rather than segmenting or personalizing too soon.
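To make that concrete, here’s a minimal sketch in Python – the audiences, numbers and the simple volume × value × potential formula are illustrative assumptions, not a fixed part of the framework:

    # Illustrative only: score each audience as volume x value x potential.
    audiences = {
        # name: (monthly users, profit per user, estimated uplift potential)
        "new visitors":       (100_000, 2.0, 0.10),
        "returning visitors": ( 40_000, 5.0, 0.05),
        "mobile users":       ( 80_000, 1.5, 0.12),
    }

    scores = {
        name: users * profit * potential
        for name, (users, profit, potential) in audiences.items()
    }

    # Highest combined score first
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:,.0f}")

The point isn’t the exact formula – it’s that volume, value and potential are scored together, so a huge but worthless audience can’t float to the top.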

Areas

In much the same way as audiences, we can prioritize the areas – the key content that the user interacts with.

For example, identify the key pages on the website (homepage, listings page, product page, etc) and score them on the same criteria: volume, value and potential.

(It might sound like we’re falling into the trap of other prioritization models: asking you to estimate potential, which can be subjective. But, in our experience, people are more likely to score an area objectively than an experiment that they created and are passionate about.)

Also, this approach doesn’t need to be limited to your website. You can apply it to any other touchpoint in the user journey too – including offline. Your cart abandonment email, customer calls and Facebook ads can (and should) be used in this framework.

If your KPI is profit, you may want to include offline content like returns labels in the prioritization model.


Levers

As above, levers are defined as the key factors or themes that you think affect an audience’s motivation or ability to convert on a specific area.

These might be themes like pricing, trust, delivery, returns, form usability, and so on. (Take another look at the experimentation framework to see why it’s important to separate the lever from the execution.)

When you’re starting to experiment, it’s hard to prioritize your levers – you won’t know what will work and what won’t.

That’s why you can prioritize them on either your confidence that the lever will change behavior, or the win rate of previous experiments against that lever.

Of course, if you’re starting experimentation, you won’t have a win rate to rely on (so estimating the confidence is a fantastic start).

But if you’ve got a good history of experimentation – and you’ve run the experiments correctly, and focused them on a single lever – then you should use this data to inform your prioritization here.

Again, the more we experiment, the more accurate this gets – so don’t obsess over every detail. (After all, it’s possible that a valid lever may have a low win rate simply because of a couple of experiments with poor creative.)  
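As a hypothetical sketch of that idea (the experiment history and the smoothing constant are assumptions – smoothing simply stops a lever with only one or two poorly executed experiments from being written off):

    from collections import defaultdict

    # Hypothetical experiment history: (lever, did the variation win?)
    past_experiments = [
        ("trust", True), ("trust", True), ("trust", False),
        ("delivery", False), ("delivery", False),
        ("pricing", True),
    ]

    wins, total = defaultdict(int), defaultdict(int)
    for lever, won in past_experiments:
        total[lever] += 1
        wins[lever] += won  # True counts as 1

    for lever in total:
        # Smoothed win rate: an untested lever would default to 50%
        win_rate = (wins[lever] + 1) / (total[lever] + 2)
        print(f"{lever}: {win_rate:.0%} smoothed ({wins[lever]}/{total[lever]} raw)")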

Putting this all together, you can now start to prioritize the audiences, areas and levers that should be focused on – for example, by combining the individual scores, as in the sketch below.
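A minimal sketch of that combination step, reusing the scoring ideas above (all names and numbers are invented):

    from itertools import product

    # Scores produced by the audience, area and lever steps above (invented)
    audience_scores = {"new visitors": 20_000, "mobile users": 14_400}
    area_scores     = {"product page": 3.0, "checkout": 2.5}
    lever_scores    = {"trust": 0.67, "pricing": 0.60, "delivery": 0.25}

    def priority(combo):
        audience, area, lever = combo
        return audience_scores[audience] * area_scores[area] * lever_scores[lever]

    ranked = sorted(product(audience_scores, area_scores, lever_scores),
                    key=priority, reverse=True)

    for combo in ranked[:3]:  # the top combinations to focus on
        print(" / ".join(combo), f"-> {priority(combo):,.0f}")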

As you can see, we haven’t even started to think about concepts and execution – but we have a strong foundation for our prioritization.

Concepts: Getting the right ideas

After defining the strategy, you can now run structured ideation around the KPIs, audiences, areas and levers that you’ve defined – the ideal structure for generating ideas.

Rather than starting with, “What do we want to test?” or “How can we improve product pages?”, we’re instead focusing on the core hypotheses that we want to validate – for example: for first-time visitors (audience), will emphasizing delivery and returns information (lever) on the product page (area) increase order rate?

This structured ideation around a single hypothesis generates far better ideas – and means you’re less susceptible to the tendency to throw everything into a single experiment (and not knowing which part caused the positive/negative result afterwards).

Order: Prioritizing the concepts

When prioritizing the concepts – especially when a lever hasn’t been validated by prior experiments – you should look to start with the minimum viable experiment (MVE).

Just like a minimum viable product, we want to define the simplest experiment that allows us to validate the hypothesis. (Can we test a hypothesis with 5 hours of development time rather than 50?)


This is a hugely important concept – and one that’s easily overlooked. It’s natural that we want to create the “best” iteration for the content we’re working on – but that can limit the success of our experimentation program. It’s far better to run ten MVEs across multiple levers that take 5 hours each to build than one monster experiment that takes 50 hours. We’ll learn 10x as much, and drive significantly higher value.

In one AB test for a real estate client, we created a fully functional “map view”. It was based on a significant volume of user research – but the minimum viable experiment would have been simply to test adding a “Map view” button without the underlying functionality.


So at the end of this phase, we should have defined the MVE for each of the high priority levers that we’re going to start with.

Roadmap: Creating an effective roadmap

There are many factors that can affect your experimentation roadmap – factors that stop you from starting at the top of your prioritized list and working your way down: competing business goals, fire-fighting, resource constraints, product changes, marketing campaigns, seasonality, and dozens more. They can all block individual experiments – but they shouldn’t block experimentation altogether.

That’s why planning your roadmap is as important as prioritizing the experiments. Planning delivers the largest impact (and insight) in spite of external factors.


To plan effectively, sequence the backlog around these blockers rather than simply working down the prioritized list: when the top experiment is blocked, run the next-highest priority experiment that isn’t, as in the sketch below.
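For example (a hypothetical sketch – the backlog and blocking rules are invented), a simple scheduler can always pick the highest-priority experiment whose area isn’t currently blocked:

    # Hypothetical: keep the program moving when the top experiment is blocked.
    backlog = [
        # (priority score, area) - higher score = run sooner
        (90, "checkout"),
        (80, "product page"),
        (70, "homepage"),
    ]
    blocked_areas = {"checkout"}  # e.g. a product release is in flight there

    runnable = [exp for exp in sorted(backlog, reverse=True)
                if exp[1] not in blocked_areas]
    next_experiment = runnable[0] if runnable else None
    print(next_experiment)  # -> (80, 'product page')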

Experimentation: Running and analyzing the experiments

With each experiment, you’ll learn more about your users: what changes their behavior and what doesn’t.
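On the analysis side, here’s a minimal sketch of a two-proportion z-test – the counts are invented, and most testing tools report this for you:

    from math import sqrt
    from statistics import NormalDist

    # Invented results: conversions and visitors per variation
    control_conv, control_n = 500, 10_000
    variant_conv, variant_n = 570, 10_000

    p1, p2 = control_conv / control_n, variant_conv / variant_n
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed

    print(f"uplift {(p2 - p1) / p1:+.1%}, z = {z:.2f}, p = {p_value:.3f}")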

You can scale successful concepts and challenge unsuccessful concepts.

For successful experiments, you can iterate by applying the winning lever to other areas and audiences, or by testing bolder executions of it.

Meanwhile, an experiment may be unsuccessful because the lever itself isn’t valid, because the execution was poor, or because external factors got in the way.

In experiment post-mortems, it’s crucial to investigate which of these is most likely, so we don’t reject a lever because of poor execution or external factors.


What’s good (and bad) about this approach

This approach works for Conversion.com – we’ve validated it with clients big and small for more than ten years, and have improved it significantly along the way.

It’s good because it prioritizes on value as well as traffic, it builds on the results of previous experiments, and it stays dynamic as the program develops.

On the flip side, its weakness is that it isn’t a simple checklist model – it takes more time and judgment to apply.

So, what now?

  1. If you haven’t already, print out or copy this Google slide for Conversion.com’s experimentation framework.
  2. Email marketing@conversion.com to join our mailing list. We like sharing how we approach experimentation.
  3. Share your feedback below. What do you like? What do you do differently?
