Exploration vs. Exploitation: how to balance short-term results with long-term impact

Edmund Beggs and Frazer Mawson

Today in 2024, pretty much every well-functioning experimentation team understands the importance of iteration:

  1. You run a test;
  2. You analyze the results to work out why the test did or didn’t work;
  3. You build another test to exploit this learning;
  4. You analyze this second test to work out why it did or didn’t work;
  5. You build another test to exploit this learning;
  6. etc.

This approach can yield some very strong results in the short-to-medium term, but in the longer-term, you’re likely to find that the wins from your chosen line of testing begin to dry up.

From here, it’s only a matter of time until you encounter the most feared phenomenon in all of optimization:

The plateau.

We’ve spent years helping in-house teams push through this kind of performance plateau, and in our experience, it’s almost always caused by the same thing:

A less-than-optimal approach to the explore-exploit tradeoff.

In this article, we’re going to explain what the explore-exploit tradeoff is, how we’re using our Levers™ Framework to optimally solve it for our clients – and how you too can do the same for your program.

Contents:

  1. What is the explore-exploit tradeoff?
  2. What is the Levers™ Framework?
  3. Using the Levers Framework to balance exploration and exploitation

 

1. What is the explore-exploit tradeoff?

In essence, the explore-exploit tradeoff is the tradeoff between gathering new information (exploration) and using that information to improve performance (exploitation).

When you’re exploring new information, you’re not exploiting the information you already have to drive impact now.

When you’re exploiting preexisting information, you’re not gathering new information that might drive an even bigger impact in the future.

As it turns out, the explore-exploit tradeoff shows up in a ridiculously broad range of contexts: reinforcement learning (the classic multi-armed bandit problem), clinical trial design, animal foraging – even choosing between your favorite restaurant and the new place down the street.
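To make the tradeoff concrete, here’s a minimal Python sketch of the multi-armed bandit setting just mentioned, using an epsilon-greedy policy. The variant names and conversion rates are entirely made up for illustration:

```python
import random

# Hypothetical true conversion rates for three page variants.
# In reality these are unknown - discovering them is the whole problem.
TRUE_RATES = {"A": 0.030, "B": 0.035, "C": 0.025}

counts = {v: 0 for v in TRUE_RATES}      # times each variant was served
successes = {v: 0 for v in TRUE_RATES}   # conversions per variant

def choose_variant(epsilon: float = 0.1) -> str:
    """Epsilon-greedy policy: explore a random variant with probability
    epsilon, otherwise exploit the best-performing variant so far."""
    if random.random() < epsilon:
        return random.choice(list(TRUE_RATES))  # explore
    observed = {v: successes[v] / counts[v] if counts[v] else 0.0
                for v in TRUE_RATES}
    return max(observed, key=observed.get)      # exploit

for _ in range(100_000):
    v = choose_variant()
    counts[v] += 1
    if random.random() < TRUE_RATES[v]:
        successes[v] += 1

for v in TRUE_RATES:
    rate = successes[v] / counts[v] if counts[v] else 0.0
    print(f"{v}: served {counts[v]:>6}x, observed rate {rate:.4f}")
```

Raise epsilon and you gather information faster at the cost of short-term conversions; lower it and you convert more now but risk settling on the wrong variant. That, in miniature, is the explore-exploit tradeoff.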

We’re not saying that we’ve found a global solution to the explore-exploit dilemma – one that will apply to all of the various domains mentioned above.

What we are saying, though, is that we believe we’ve found a strong, close-to-optimal solution to this problem within the context of CRO/experimentation – one that will ultimately allow you to move off the local maximum (plateau) you’re currently stuck on and towards your global maximum.

 

Global maximum diagram

 

Central to this approach is our Levers™ Framework.

2. What is the Levers™ Framework?

We’re not going to delve too deeply into our Levers™ Framework here, since we’ve got other pieces of content that fulfill this purpose already – see our recent white paper, webinar, and blog.

That said, our entire solution to the explore-exploit tradeoff is built around the Levers Framework, so it’s worth offering up a quick high-level overview before we go any further. If you’re already acquainted with this stuff, feel free to skip ahead.

So, to begin: in order to explain what the Levers™ Framework is, we first need to define what we mean by the word ‘lever’.

For us, a lever is any feature of the user experience that influences user behavior.

For instance, sales countdown timers exploit a sense of urgency. Within the Levers Framework, an experiment that deploys a sales countdown timer would therefore be categorized under the urgency lever, since this is the means by which it influences user behavior.

In essence, the Levers Framework is a comprehensive taxonomy of the user experience features that influence user behavior (see below). The framework is a treelike structure that aims to categorize these features of user experience at three levels of generality: Master Levers (most general); Levers (middle layer); and Sub-levers (most specific).

Levers Framework overview

High-level overview of our Levers Framework.

In principle, this means that every experiment we run – every lever we pull – can be categorized at three different levels of generality.

For example, let’s say we’ve added a Trustpilot logo to a landing page of one of our clients. By adding this logo we are:

  • pulling the Trust Master Lever (the most general categorization);
  • pulling a more specific Lever within Trust – in this case, something like Credibility;
  • pulling a specific Sub-lever – in this case, something like social proof.

The Levers Framework is the product of more than 16 years’ worth of iterations, and it has been validated both by its efficacy in our day-to-day client work and by its predictive power.

The framework has a huge range of applications that we won’t touch on here (check out the white paper for more), but one thing worth flagging is that it serves as a fine-grained, comprehensive map of the various user experience solutions that influence conversion…

…and once you have a trustworthy map, exploring the territory becomes a whole lot easier.

3. Using the Levers Framework to balance exploration and exploitation

On the surface, our approach to the explore-exploit tradeoff is quite simple:

When we first start working on a website, we make a conscious effort to run exploratory experiments on all 5 of our Master Levers, i.e. Cost, Trust, Motivation, Usability, and Comprehension.

While this approach means that our initial win-rate may be a little lower than it would have been had we focussed solely on low-hanging fruit, it allows us to gather valuable information about the kinds of interventions that are (and aren’t) likely to be most effective on any given website.

Graph showing the win-rate of programs focussed on short-term wins vs. programs with a structured approach

Programs that pursue quick wins tend to show lower win-rates in the long term than programs focussed on balancing exploration and exploitation in a structured way.

By intentionally collecting information about a broad selection of levers, we are then able to explore the full range of possible solutions in a structured way.

Once we’re confident in our results, we can finally shift into exploit mode and start ruthlessly folding poorly performing levers while doubling down on successful ones to drive maximum impact for our clients.

Putting this in terms of the local/global maximum analogy above:

At the start of a program, we will essentially fly over the entire optimization landscape in search of the region within which the global maximum exists.

Once we think we’ve found it, we’ll then drop down into this general region and begin performing a ‘hill-climbing’ operation, which basically involves iteratively improving the website to gradually ascend to the global maximum.

Of course, the reality of the situation is a good deal more complicated than this theoretical sketch may suggest, but we’ve found that the principle behind this approach is sound – and that it often provides an optimal path through the explore-exploit quandary, allowing us to maximize long-term value for our clients.
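For intuition, here’s a toy Python illustration of the difference between pure hill-climbing and an exploratory ‘fly-over’ followed by hill-climbing. The landscape function is invented purely for the example, with a small local peak and a taller global one:

```python
import math

def landscape(x: float) -> float:
    """Toy 'conversion landscape': a local peak near x=2 and a
    taller global peak near x=8."""
    return math.exp(-(x - 2) ** 2) + 2 * math.exp(-((x - 8) ** 2) / 4)

def hill_climb(x: float, step: float = 0.1, iters: int = 200):
    """Greedy local search: keep moving to a neighbor while it scores
    higher; stop when neither neighbor improves on the current point."""
    for _ in range(iters):
        best = max([x - step, x, x + step], key=landscape)
        if best == x:
            break
        x = best
    return x, landscape(x)

# Pure exploitation from a poor starting point climbs the nearest hill
# and gets stuck on the local maximum (the plateau).
print(hill_climb(1.0))  # ends near x=2, height ~1.0

# An exploratory sweep across the whole landscape first, then climbing
# from the most promising region, finds the global maximum instead.
sweep = [hill_climb(x0) for x0 in range(0, 11, 2)]
print(max(sweep, key=lambda r: r[1]))  # ends near x=8, height ~2.0
```

The ‘fly-over’ phase is exactly what testing across all 5 Master Levers achieves: it tells you which hill is worth climbing before you commit to climbing it.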

To support you in actually applying this approach to your own work, we’re going to run through the key steps in our process, with the goal of adding some additional detail and actionability to the picture painted thus far.

1. Research

As mentioned above, our approach involves distributing our experiments across all 5 Master Levers.

However, before we do this, we first need to identify the most impactful levers within those 5 Master Levers, as well as the types of experiments that are most likely to succeed for each.

We therefore typically begin by running a UX research project, which will include methodologies like analytics reviews, user testing, surveys, scroll/heatmaps, competitor analysis, and more.

This allows us to collect a huge range of observations about the barriers and motivations that are active on any given website.

We will then combine sets of these individual observations into what we term ‘insights’, which are the unifying themes under which observations can be grouped.

So, to give a more concrete example: if user testing shows participants questioning whether the service actually works, and survey responses include comments like ‘how do I know this will deliver what it promises?’, we might group these observations under a single insight: ‘users lack trust in the efficacy of the service.’

Once we’ve combined all of our observations into insights, we start assigning each insight to a Master Lever, Lever, and Sub-lever, working our way down the framework to establish an increasingly specific understanding of the problem we’re trying to address.

Observations to insights to levers

Different research methods generate observations, which we cluster together under different themes known as insights. We then aim to map each of these insights to the lever that relates most closely to it.

So, returning to the example from above: if our insight is ‘users lack trust in the efficacy of the service,’ this is clearly a trust issue, so we will assign this insight to the Trust Master Lever.

Visual representation of the Trust Master Lever

The Trust Master Lever, along with its constituent Levers and Sub-levers.

Moving one layer further down the framework, we must then ask: ‘is this a legitimacy question, a credibility question, or a security question?’

In our framework, Credibility is about whether a company is able to live up to the claims that it makes on its website, so clearly this is a Credibility question rather than one pertaining to Security or Legitimacy.

Identifying the right Sub-lever may be slightly more difficult, but for now, we can tentatively class this as an Authority issue, since an increase in authority would likely assuage the trust-related concerns associated with this insight.

Using this approach, we will attempt to tag every insight we’ve collected from our research to a specific Master Lever, Lever, and Sub-lever.
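If you want to track this tagging process in a spreadsheet or in code, the underlying data structure is simple. Here’s a hedged Python sketch; the field names and example values are illustrative, not an export of our framework:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Insight:
    """A theme distilled from multiple research observations, tagged
    at all three levels of generality in the framework."""
    summary: str
    observations: list[str] = field(default_factory=list)
    master_lever: str = ""  # e.g. "Trust"
    lever: str = ""         # e.g. "Credibility"
    sub_lever: str = ""     # e.g. "Authority"

insights = [
    Insight(
        summary="Users lack trust in the efficacy of the service",
        observations=[
            "User testing: participants questioned whether the service works",
            "Survey: 'How do I know this will deliver what it promises?'",
        ],
        master_lever="Trust",
        lever="Credibility",
        sub_lever="Authority",
    ),
    # ...one Insight per theme your research surfaces...
]

# Counting tagged insights per Master Lever produces the kind of
# per-lever distribution shown in the graph below.
print(Counter(i.master_lever for i in insights))
```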

This will then typically leave us with a huge range of insights, distributed across all 5 of our Master Levers (see graph below).

Insights vs. Master Levers

Insights from one of our programs, distributed across our 5 Master Levers.

2. Ideation

Now that you have each of your insights assigned to a lever, you’ll need to develop the execution for each of these specific levers.

To clarify, let’s once again return to the example from the previous section, where we found that the Master Lever was Trust, the Lever was Credibility, and the Sub Lever was Authority.

So far, we have quite a specific idea about what we might want to test, i.e. anything that is going to enhance the authority of our brand. This still leaves us with considerable scope, however, as to the actual experiment we run.

For example, we might cite independent research that validates the service, display industry accreditations, or add endorsements from recognized experts – each a different execution of the same Authority Sub-lever.

We have an entire process for developing high-impact experiment concepts from our initial research – we’ll be sharing more info about this in the future – but for now, the key is simply to generate a range of candidate executions for each of the levers your research has flagged.

3. Roadmap strategy

Once you’ve developed executions for each of your initial concepts, you’ll then need to prioritize these executions.

We’ve developed a machine-learning-assisted prioritization tool for this purpose, but feel free to apply whatever prioritization framework you currently use.

The goal is to end up with a relatively long list of experiment executions, prioritized based on things like:

  1. the strength of the supporting research;
  2. the expected impact on your primary metric;
  3. the ease (and cost) of implementation.
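If you don’t already have a prioritization framework, here’s a minimal ICE-style (Impact, Confidence, Ease) sketch in Python. It’s a deliberately simple stand-in for our actual tool, and the executions and scores below are invented:

```python
from dataclasses import dataclass

@dataclass
class Execution:
    name: str
    impact: int      # expected impact on the primary metric, 1-10
    confidence: int  # strength of the supporting research, 1-10
    ease: int        # ease/cheapness of implementation, 1-10

    @property
    def score(self) -> float:
        # Simple average; substitute whatever weighting you prefer.
        return (self.impact + self.confidence + self.ease) / 3

backlog = [
    Execution("Cite independent research on landing page", 8, 6, 7),
    Execution("Add expert endorsements to product pages", 7, 8, 6),
    Execution("Show industry accreditations in footer", 4, 5, 9),
]

for e in sorted(backlog, key=lambda e: e.score, reverse=True):
    print(f"{e.score:.1f}  {e.name}")
```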

Once you’ve got this prioritized list together, you then need to build your roadmap.

This is the stage where you get to exert more intentional control over your balance of exploration vs. exploitation.

We would always recommend testing across all 5 Master Levers, but perhaps you want to weight slightly more towards exploitation than exploration. In that case, for the first 20 experiments in your roadmap, you might run 8 on the Master Lever with the most insights attached to it and only 1 or 2 on the Master Lever with the fewest.

Conversely, if you want to ensure that your initial exploration is as thorough as possible, you will need to make sure that your tests are fairly evenly diversified across all 5 Master Levers so as to gather as much information about their respective efficacies as possible.

That said, we do not recommend running tests without any supporting research. If one of your Master Levers has few or no insights attached to it, we would recommend shifting your attention to the other Master Levers.

CRO is often a balance between earning and learning; by intentionally weighting your balance of exploration and exploitation at this step, you can ensure that your roadmap aligns as closely as possible with your goals.
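One way to encode this balancing act is to give every researched Master Lever a small exploration floor and distribute the rest of your roadmap in proportion to the research behind each lever. A hedged Python sketch, with illustrative insight counts:

```python
# Illustrative insight counts per Master Lever - not real client data.
insights = {"Trust": 14, "Motivation": 9, "Comprehension": 6,
            "Usability": 4, "Cost": 0}

TOTAL_EXPERIMENTS = 20
EXPLORATION_FLOOR = 2  # minimum tests per lever that has supporting research

# Skip levers with no supporting insights (don't test without research).
researched = {k: v for k, v in insights.items() if v > 0}

allocation = {k: EXPLORATION_FLOOR for k in researched}
remaining = TOTAL_EXPERIMENTS - sum(allocation.values())

# Distribute the remaining slots in proportion to each lever's share
# of the total insights.
total = sum(researched.values())
for lever, count in researched.items():
    allocation[lever] += round(remaining * count / total)

# Rounding can leave a slot or two unassigned - place those by judgment.
print(allocation, "->", sum(allocation.values()), "of", TOTAL_EXPERIMENTS)
```

Raising the floor pushes the roadmap towards exploration; lowering it (or dropping it entirely) pushes towards exploitation.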

4. Experiment

Once you’ve decided which tests you want to run, the next thing to do is actually run them!

5. Iterate

You may think that once you’ve run your initial tests, you’re ready to start folding losers and doubling down on winners to drive value now.

For winning tests, this is more or less how it works. When you find an effective lever, our recommendation is that you exploit that lever relentlessly, for as long as it continues to deliver value.

With one of our clients, we’ve run 46 iterations on a single lever – and it still delivers results to this day!

For losing experiments, on the other hand, there are some additional considerations worth factoring in. As a pretty reliable rule of thumb, a test will lose for one of two reasons (sketched in code after this list):

  1. Execution – the lever you selected may actually be effective, but the execution you chose was poor.
  2. Lever – the lever you tested is itself ineffective. This breaks down into two further cases:
    • The Master Lever and Lever you tested are ineffective.
    • The specific Sub-lever you chose is ineffective, but your framing of the problem at the Master Lever and Lever level is correct.
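To make the distinction concrete, here’s a minimal Python sketch encoding these failure modes and the iteration each one suggests; the enum simply mirrors the list above:

```python
from enum import Enum, auto

class LossReason(Enum):
    POOR_EXECUTION = auto()   # right lever, weak execution
    WRONG_LEVER = auto()      # the Master Lever / Lever itself is ineffective
    WRONG_SUB_LEVER = auto()  # right Lever, wrong Sub-lever

def next_step(reason: LossReason) -> str:
    """Suggested follow-up for each diagnosis."""
    if reason is LossReason.POOR_EXECUTION:
        return "Iterate: re-test the same lever with a stronger execution."
    if reason is LossReason.WRONG_SUB_LEVER:
        return "Iterate: try a different Sub-lever under the same Lever."
    return "Explore: redirect effort towards other Master Levers."

print(next_step(LossReason.POOR_EXECUTION))
```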

We’ve previously written in detail about our process for diagnosing the cause of a test’s loss, as well as how to iterate on this type of result. This post is already pretty long, so we won’t repeat that here, but if you’d like to read about the subject in detail, click here.

One important thing to keep in mind is that losing experiments are not ‘the end of the line’ for a lever. In fact, they often tell you a lot more than inconclusive experiments – and therefore provide direction for future testing.

Ultimately, while no single non-winning experiment is sufficient to rule out a lever’s importance, losing experiments have the advantage of telling you something more: that what you are doing at least matters to users.

This suggests that a better message or change relevant to that lever might well intervene in a way that makes a positive rather than negative difference. Equally, simply doing less of – or the opposite of – what had the negative effect might be most effective.

Final thoughts

In our experience, a sub-optimal balance between exploration and exploitation is the cause of 9 out of 10 performance plateaus. In this blog, we shared the approach we’ve been using to help our clients successfully navigate the explore-exploit tradeoff and once again begin driving revenue with CRO/experimentation.

As will be clear by now, much of this approach is driven by our Levers Framework, so if you’re keen to put the method laid out here into practice, we’d recommend that you download our recent white paper. This should hopefully give you everything you need to get started.

And in the meantime, if you’ve got any further questions about how any of this works, feel free to drop us a line – we’re passionate about experimentation and are always keen to share our expertise where we can!
