• For the past 15+ years, we’ve been storing extensive data on every experiment we run.

    As the world’s largest experimentation agency, this means we now have a huge repository of past a/b test results, spanning countless industries, verticals, and company sizes.

    We’ve always known that this repository had the potential to offer an incredible competitive edge to our clients, but up until recently, we’ve had trouble unlocking its full potential.

    Recently, though, that’s all changed.

    In fact, thanks in large part to our experiment repository and the meta-analytic techniques it’s opened up for us, we’ve been able to achieve some of our most impressive agency-wide results ever.

    So what have we done and how did we do it?

    In this blog, we’re going to share all.

    Whether you run a mature experimentation program or you’re just getting started, this blog will give you everything you need to build an experiment repository and begin unlocking the power of meta-analysis in your own work.


  • What is meta-analysis?

    A/b tests – or randomized controlled trials (RCTs) – offer one of the strongest forms of evidence available for understanding causal relationships.

    But there’s one form of evidence that’s even stronger than RCTs:

    Meta-analyses of RCTs.

    The hierarchy of evidence

    A meta-analysis is essentially an analysis of multiple RCTs with the goal of combining their data to increase sample sizes and unearth macro-level trends that arise across trials.

    Think of it like this: even with a strong a/b test, there’s still some chance of error creeping into your results. Maybe a confounding variable has distorted your data. Maybe your result was an unlikely statistical fluke. Maybe it was something else.

    By merging multiple a/b test results, you stand a much better chance of separating out the signal from the noise.

    Combining the results of multiple a/b tests
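    To make the "combining" step concrete, here's a minimal sketch of one standard approach – an inverse-variance-weighted (fixed-effect) meta-analysis – written in Python with made-up numbers. This isn't our production tooling, just an illustration of how pooling precise and noisy tests shrinks the overall uncertainty.

```python
import math

# Illustrative results from three independent a/b tests on the same lever:
# each entry is the observed relative lift and the standard error of that estimate.
# (Made-up numbers - not real experiment data.)
results = [
    {"lift": 0.042, "se": 0.031},
    {"lift": 0.015, "se": 0.025},
    {"lift": 0.058, "se": 0.040},
]

# Fixed-effect meta-analysis: weight each test by the inverse of its variance,
# so more precise tests contribute more to the pooled estimate.
weights = [1 / r["se"] ** 2 for r in results]
pooled_lift = sum(w * r["lift"] for w, r in zip(weights, results)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"Pooled lift: {pooled_lift:.1%} (95% CI ± {1.96 * pooled_se:.1%})")
```

    In practice, a random-effects model is often more appropriate when the tests come from different sites or audiences, but the core idea is the same: pooling separates the signal from the noise.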

    Meta-analysis is a very well-utilized tool within academic contexts, but in the world of business experimentation, very few people are doing it – let alone doing it well.

    In our view, based on the results we’ve been seeing, this is a huge missed opportunity.

    Think what you could do if you could combine every experiment you’ve ever run and extract macro-level insights to help inform things like:

    • UX – identify levers with the highest and lowest success rates to understand what is truly most important to your users.
    • Product & pricing – find out what your customers really want and then build and optimize products and pricing strategies that respond to these desires and preferences.
    • Business strategy – aggregate your experiment results to inform the direction of your business.
    • Experimentation strategy – leverage macro insights to answer questions like: do tests with big builds justify the extra resource expenditure? Do high-risk tests justify the added risk? Is our prioritization methodology prioritizing the right tests? etc.
    • And more.
  • Why bother building an experiment repository? 5 good reasons

    Building an experiment repository is the first and most important step on the road to running successful meta-analyses.

    Put simply, if your data is effectively stored and organized in a centralized repository, then running high-level meta-analyses will be relatively straightforward.

    Unfortunately, building an effective experiment repository can be quite labor-intensive, so here are five good reasons, drawn from our own experience, why we think you should invest the time.

    1. Improved performance

    This one’s the most generic benefit we’re going to talk about, but we thought we’d best mention it up top because, well – numbers speak. All of the other benefits below ladder up to this one.

    During the last few years, we’ve experienced some incredible growth as an agency – but as any growing team will know, the bigger you become, the harder it can be to maintain standards.

    We work tirelessly to mitigate this risk, and thanks to our emphasis on good processes and a systematized methodology, we’re generally quite successful at doing so.

    But in 2022, when we first began to really operationalize our experiment repository, not only were we able to keep key agency-wide metrics steady – we were able to raise them.

    In fact, we hit our highest agency-wide win-rate ever in 2022* – 26% higher than the average from the previous four years – and we’ve remained on the same trajectory ever since.

    None of this would have been possible without our experiment repository and the innovations that surround it.

    Our agency-wide win-rate from 2018-2022

    *In addition to win-rate, which can be a slightly problematic metric at times, we also track things like testing velocity, volume of experiments, etc., and use these as important program- and agency-wide metrics to monitor and optimize our performance over time. We wouldn’t expect our experiment repository to impact these metrics in any real way, which is why the emphasis here is on win-rate alone.

    2. Unearth macro-level insights

    An effective experiment repository will allow you to unearth macro-level insights that were completely inaccessible to you before.

    The value of this capability can’t be overstated.

    It allows you to start interrogating your data in completely new ways, meaning you can combine all of your experiment results to ask questions like:

    • Which areas of our website do we have the most – and the least – success on?
    • Which levers do our users respond most – and least – positively to?
    • Which levers are most effective on which page types?
    • How do our levers and areas data vary across different user segments?
    • Etc.
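    To make these questions concrete, here's a minimal sketch of how you might interrogate a tagged repository export with pandas. The column names ("lever", "area", "winner") and rows are illustrative assumptions, not a required schema.

```python
import pandas as pd

# Illustrative export of a tagged experiment repository (made-up rows).
experiments = pd.DataFrame([
    {"lever": "Social proof",  "area": "Product page", "winner": True},
    {"lever": "Social proof",  "area": "Checkout",     "winner": False},
    {"lever": "Urgency",       "area": "Product page", "winner": True},
    {"lever": "Urgency",       "area": "Checkout",     "winner": True},
    {"lever": "Trust signals", "area": "Checkout",     "winner": False},
])

# Win rate by lever - "which levers do our users respond most positively to?"
print(experiments.groupby("lever")["winner"].mean().sort_values(ascending=False))

# Win rate by lever and page area - "which levers work best on which page types?"
print(experiments.groupby(["lever", "area"])["winner"].mean().unstack())
```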

    This sounds cool in theory – and in practice, it’s a way to unearth novel insights, open up completely new avenues of testing, and ultimately take your experimentation program to the next level.

    We’ve seen some of the best program-wide results in our agency’s history by using these kinds of meta-analytic techniques. Here’s a particularly striking example.

    3. Improved organizational memory & collaboration

    Since really building out our experiment repository, we now have an easily accessible, rigorously tagged, endlessly filterable storehouse of data.

    This functionality means it’s easier than ever to harness insights from past experiment programs to drive results for our clients today. This works in two main ways:

    • Improved organizational memory – without an experiment repository, experiment results get lost. Either the person who ran the test leaves the organization, taking their insights with them, or the insights just get forgotten. A well-structured experiment repository ensures that insights from a/b tests are 1) preserved and 2) accessible to everyone.
    • Improved cross-team information sharing – without a centralized repository, experiment data tends to become siloed within specific teams. Our experiment repository means any team can go into our database and filter for any type of experiment they need data on. This allows us to squeeze significantly more value out of each a/b test than we would have been able to without our repository. (This benefit is likely to be even more pronounced for product and optimization teams working on the same website.)

    To give one example (of many) of how we’re benefiting from this aspect of our repository:

    Database research has become one of our core research methods as an agency. In essence, at the start of most programs, we will filter our database along certain dimensions to uncover insights and inspiration from past programs that we can use to inform present roadmaps and strategy.

    Of course, we don’t blindly follow what the database tells us. What worked on one website won’t necessarily work on another – which is why experimentation is so important. But this kind of database research gives us additional insight and data that we can use to gain a head start at the beginning of a new program – and that we can use to enrich our experiment data further down the line.

    Ultimately, this means we can take more informed risks and drive more client value than would be possible without the repository and the insights it holds.

    4. Enhanced prioritization – Confidence AI

    Based on thousands of a/b tests, Harvard Business School found that about 1 in 10 business experiments has a positive impact on its primary metric – in other words, only 1 in 10 are winners.

    Assuming that more or less every team behind these experiments expected their experiment to win – surely a fairly safe assumption – what this number tells us is that human beings are pretty poor at predicting experiment results…

    …but AI isn’t.

    Over the last twelve months, we’ve been training a machine learning model to take the data in our experiment repository – made up of more than 25,000 data points – and use it to predict the results of future a/b tests.

    Confidence AI
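    To be clear, the snippet below is not Confidence AI – it's just a minimal sketch of the general idea: one-hot encode the taxonomy tags in a repository export and fit an off-the-shelf classifier to predict winners. The file name and column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical export of a tagged repository: taxonomy tags in, outcome out.
df = pd.read_csv("experiment_repository.csv")            # assumed file
features = ["lever", "area", "industry", "kpi", "risk"]  # assumed columns

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["winner"], test_size=0.2, random_state=42
)

# One-hot encode the categorical taxonomy tags, then fit a simple classifier.
model = Pipeline([
    ("encode", ColumnTransformer([
        ("tags", OneHotEncoder(handle_unknown="ignore"), features),
    ])),
    ("classify", GradientBoostingClassifier()),
])
model.fit(X_train, y_train)

# predict_proba gives a 0-1 score that can be read as a confidence of winning.
win_probability = model.predict_proba(X_test)[:, 1]
print(win_probability[:5])
```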

    Though this is only an early iteration, so far this model – dubbed Confidence AI – has been able to predict which a/b tests will win with 63% accuracy.*

    This is obviously miles above the 10% figure cited above – and it’s causing quite a stir within Conversion.

    Though there are myriad different possible uses for this kind of technology, the main value we’re seeing from it right now relates to prioritization:

    Once our consultants have done their research and come up with different test concepts for a program, they then need to decide which concepts to prioritize. Confidence AI analyzes each of these concepts based on a range of factors and then computes a confidence score that incorporates all of this information.

    By pairing this confidence score with information about the expected build size and possible dependencies associated with each experiment, we can then prioritize our backlog as effectively as possible, based on mountains of data and a cutting-edge machine learning model.

    This allows us to zero in on winners much more reliably – and to also deprioritize tests that are, based on the data, much less likely to win.
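    Here's a rough sketch of what that kind of scoring might look like in code. The ranking rule, field names, and numbers below are illustrative assumptions – not our actual prioritization formula.

```python
from dataclasses import dataclass

@dataclass
class Concept:
    name: str
    confidence: float    # 0-100 score from a prediction model
    build_days: float    # estimated build effort
    blocked: bool = False  # e.g. waiting on another experiment or a release

def priority(c: Concept) -> float:
    """Illustrative ranking rule: expected value per unit of build effort."""
    if c.blocked:
        return 0.0
    return c.confidence / max(c.build_days, 0.5)

backlog = [
    Concept("Multi-step free-trial funnel", confidence=72, build_days=5),
    Concept("Simplified progress bar",      confidence=68, build_days=2),
    Concept("New pricing table",            confidence=40, build_days=8, blocked=True),
]

for c in sorted(backlog, key=priority, reverse=True):
    print(f"{c.name}: priority {priority(c):.1f}")
```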

    Before you can even think about running this kind of machine learning assisted prioritization, you need to have a robust experiment database – with powerful taxonomies – in place.

    *When Confidence AI computes a confidence score of greater than 66/100, we count this as a prediction of a winner.

    5. Sharpen executions

    Once we’ve come up with a hypothesis, we next need to decide on our execution, i.e. the specific experiment that will allow us to test that hypothesis.

    Our experiment repository is proving to be extremely useful when it comes to fine-tuning our executions. For example, take this experiment:

    One of our clients had a single-page free-trial funnel.

    For a range of data-backed reasons, we hypothesized that we could increase their free trial sign-up rate by splitting this journey out into multiple steps.

    As part of this test, we knew we were going to need to design a new progress bar.

    By filtering our repository by industry, website area, and component, we were able to find past experiments on similar websites that had involved a progress bar redesign.

    In this instance, we found that more-detailed progress bars tended to be less effective than less-detailed progress bars.

    Our meta-components study revealed that low-detail progress bars tend to perform best

    We therefore chose to design a less-detailed progress bar, which ultimately contributed to a 7.5% uplift in free trials in this experiment.

    Simplified progress bar design

  • How to build a high-impact experiment repository of your own: 4 steps

    Now that we’ve (hopefully!) convinced you that you need to start building an experiment repository of your own, we’re going to share the main steps we’ve gone through to get our repository to where it is today.

    Where possible, we’ll share the resources and specific solutions that we ourselves have used to overcome the challenges associated with building an effective experiment repository.

    To start, then, an effective repository needs to meet three criteria:

    • Accessible – data has to be stored in a centralized database where anyone in your team can access it.
    • Filterable – people need to be able to filter the database to find the results and insights they need.
    • Self-updating – as new experiment results come in, they need to be continually added to the repository.

    Each of the steps described below is geared towards meeting these three criteria.

    1. Choose a tool

    Really, the first thing you need to decide on when it comes to building an experiment repository is where you want to house it.

    There are a number of tools specifically built for this purpose, e.g. our very own Liftmap tool.

    Screenshot taken from Liftmap

    Other tools, such as Airtable or Notion, weren’t built specifically with meta-analysis in mind, but their power and customizability mean they offer a good option to anyone with enough time and skill to use them effectively.

    Some things to consider when selecting a tool:

    • Integrations – how well does the tool integrate with your current tech stack? How much can you automate? The clunkier your processes, the more laborious it will be to create and maintain your repository. More on this in the buy-in and workflows section below.
    • Time – tools like Liftmap and Effective Experiments will be almost good-to-go out of the box. With tools like Airtable or Notion, you will essentially have to build your repository from the ground up.
    • Customizability – some of the tools mentioned above (e.g. Airtable) are extremely customizable. That means you can basically turn them into whatever you need them to be. Others are more rigid.
    • Portability – is your data locked into the tool you’re using, or can you easily take it out of the tool and use it elsewhere?

    When it comes to tooling for meta-analysis, each experimentation team is likely to have different requirements, so a one-size-fits-all approach probably won’t work.

    2. Settle on your taxonomies

    Once you’ve got all of your experiments in one place, you need a systematic way of categorizing them along certain dimensions.

    That’s where taxonomies come in.

    Taxonomies are systems of categorization. To a large extent, the effectiveness of a repository is dependent upon the effectiveness of the taxonomies it uses.

    Without good, shared, systematic taxonomies you can’t:

    1. Categorize experiments consistently – instead, everyone will use their own intuitions about how best to categorize their experiments, resulting in all sorts of bias and imprecision being baked into your data.
    2. Filter experiments – if everyone is categorizing experiments by their own rules, how can you filter your database to find the experiments you’re looking for?
    3. Extract macro insights – taxonomies are the lens through which macro-level patterns in your data begin to emerge. Without powerful taxonomies, it’s all just a morass of undifferentiated data points.
    4. Run machine learning – taxonomies give your machine learning model salient dimensions that it can use to observe and unearth patterns.

    Really, it wasn’t until we developed watertight taxonomies – with the Levers Framework at the forefront – that we were able to extract the full value from our experiment repository.

    High-level view of our Levers Framework

    Here are some of the most important taxonomies we use:

    • Our Levers Framework – this is undoubtedly THE most important taxonomy we use. A lever, as we define it, is any feature of the user experience that influences user behavior. We’ve spent years building and refining our levers framework, and we know that it captures something very important because it is the taxonomy that does the lion’s share of the predictive work for our prediction tool, Confidence AI. Here’s our whitepaper and webinar about the Levers Framework – feel free to take it and apply it to your own repository.
    • Psychological principles – we took the skeleton of our psychological principles taxonomy from the book ‘Smart Persuasion’. In essence, we define each experiment based on the psychological principle that is in play in that experiment.
    • Risk level – we define every experiment we run based on the amount of risk involved. Here’s an overview of our approach to risk.
    • Area – which area of the website was the experiment run on?
    • Industry – which industry does this client operate in?
    • KPI – what was the primary KPI for this test?
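    To show how taxonomies like these can translate into a concrete, consistently tagged record, here's a minimal sketch of an experiment schema. The field names, enum values, and example record are illustrative assumptions, not our actual setup.

```python
from dataclasses import dataclass, field
from enum import Enum

# Illustrative taxonomy values - your own levers, risk levels, etc. will differ.
class Lever(Enum):
    SOCIAL_PROOF = "Social proof"
    URGENCY = "Urgency"
    TRUST = "Trust signals"

class Risk(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"

@dataclass
class ExperimentRecord:
    name: str
    lever: Lever                    # Levers Framework tag
    psychological_principle: str
    risk: Risk
    area: str                       # e.g. "Free-trial funnel"
    industry: str                   # e.g. "SaaS"
    primary_kpi: str                # e.g. "Free-trial sign-ups"
    winner: bool | None = None      # None until the result is in
    tags: list[str] = field(default_factory=list)

record = ExperimentRecord(
    name="Multi-step free-trial funnel",
    lever=Lever.SOCIAL_PROOF,
    psychological_principle="Commitment & consistency",
    risk=Risk.MEDIUM,
    area="Free-trial funnel",
    industry="SaaS",
    primary_kpi="Free-trial sign-ups",
)
```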

    3. Tagging

    Once you’ve decided on your taxonomies, you then need to tag up all of the past experiments that you’re planning to add to your database.

    We spent months working back through our experiment database, tagging more or less every experiment we’ve ever run.

    This was quite labor-intensive, but it’s a one-time job – and once it’s done, it brings all of your past experiment results that have been collecting dust back into play.

    With this complete, you can now filter your database in all kinds of ways, and begin drawing upon past experiments to inform future directions.

    4. Buy-in and workflows

    At this point, your repository should be more or less good-to-go – but there’s still one more thing you’ll need to get set up before you can really get the ball rolling: workflows.

    Your repository shouldn’t be a fixed, static thing; it should be a living, growing database that is continually updated as new experiment data comes in.

    The more data you add to the repository, the more valuable it becomes.

    Unfortunately, data capture is one of the biggest challenges to overcome when building your database – especially if you’re democratizing experimentation across your entire company.

    The first step here is about gaining buy-in. Ultimately, people on the ground will be the ones capturing the data, so they need to know why they’re being asked to do this extra work – and what the payoff will be for them and for the wider program.

    Making sure that the relevant teams have access to this information will make everything run much more smoothly.

    Once you’ve got buy-in and everyone is on the same page, the next step is about building workflows that support your meta-analysis.

    This will vary a lot from company to company, but to give you a bit of inspiration, here are a few things we’ve done to encourage our team to capture their data:

    • Almost all of the data points used in our meta-analysis are built into our Hypothesis Framework. That means when our consultants are creating their hypotheses, they’re naturally tagging a lot of the data points already.
    • Our R&D team has created automated experiment plan decks, but in order for consultants to benefit from this automation, they need to fill out certain fields in our database.
    • We’ve also created a data completion metric, which tracks the percentage of required inputs a person has yet to record. At the end of each week, we send out a reminder asking the owner to fill in any missing data points (a simple sketch of this kind of metric follows below).
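    As an illustration of that last point, here's a minimal sketch of how a data completion metric could be computed. The required field names and example record are assumptions for illustration.

```python
# Assumed set of fields that every experiment record should have filled in.
REQUIRED_FIELDS = ["lever", "area", "industry", "primary_kpi", "risk", "result"]

def data_completion(record: dict) -> float:
    """Share of required fields that have actually been filled in (0-1)."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)

# Flag records that should trigger the end-of-week reminder (made-up example).
experiments = [
    {"owner": "alex", "lever": "Urgency", "area": "Checkout", "industry": "Retail",
     "primary_kpi": "Orders", "risk": "Low", "result": None},
]
for record in experiments:
    if data_completion(record) < 1.0:
        print(f"Reminder for {record['owner']}: "
              f"record is {data_completion(record):.0%} complete")
```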

    As you can imagine, these workflows have taken time and effort to build, but we’re now at a point where our experiment database is growing more or less organically, without the need for constant intervention or micro-managing.

    This is the final goal – the point at which your repository has become a true second brain that effortlessly records data and makes it accessible to your entire organization.

  • Final thoughts: is it worth it?

    Some people in the experimentation space have been known to question the value of meta-analysis.

    Can you really take insights from past programs and apply them to current problems? Can insights from one business, or team, be applied to another? Does it actually work?

    These concerns are completely legitimate – but here’s the thing:

    This isn’t a subject we need to sit around debating.

    It’s an empirical question, and like any good empirical question, it can be answered by an experiment.

    We’ve run the experiment.

    The results are in.

    By incorporating meta-analytic techniques into our approach to experimentation, we’ve massively increased our primary metric – win rate – while holding our guardrail metrics – velocity, volume, and client satisfaction – steady.

    The test is a winner.

    Of course, this isn’t a controlled experiment; there are all kinds of confounding variables involved.

    (Heck, we might even need to run a meta-analysis to see whether meta-analysis is effective!)

    But so far, at least, the data’s all pointing in the right direction.