Stephen Pavlovich, Author at Conversion.com

It shouldn’t take a year to launch the wrong product: How to make better products with experimentation

We are our choices.

So says JP Sartre (and Dumbledore).

The same is true of product.

Everything we produce is the result of our choices. Which products and features do we roll out? Which do we roll back? And which ideas never even make it on the backlog?

The problem is – most of us suck at making choices.

Decisions are made by consensus, based on opinion not evidence. We’re riddled with subjectivity and bias, often masquerading as “experience”, “best practice” or “gut instinct”.

But there’s a better way – using experimentation to define your product roadmap.

Experimentation as a product development framework

For many product organisations, experimentation serves two functions:

1. Safety check: Product and engineering run A/B tests (or feature flags) to measure the impact of new features.

2. Conversion optimisation: Marketing and growth run A/B tests – on the sign-up flow, for example – to optimise acquisition.

But this neglects experimentation’s most important function:

3. Product strategy: Product teams use experimentation to find out which features and ideas their customers will actually use and enjoy.

In doing so, you can use experimentation to inform product – not just validate it. You can test bolder ideas safely, creating better products for your customers. By putting experimentation at the heart of their business, organisations like Facebook, Amazon, Uber and Spotify have created and developed products used by billions worldwide.

But they’re in the minority. They represent the 1% of brands that have adopted experimentation as not just a safety check, but as a driving force for their product.

So how can the other 99% of us adopt experimentation more effectively?

Five principles of product experimentation

#1 Experiment to solve your biggest problems.

First, and most importantly, you should experiment on your biggest problems – not your smallest.

If experimentation is only used to “finesse the detail” by A/B testing minor changes, you’re wasting the opportunity.

To start, map out the products or features you’re planning. What are the assumptions you’re making, and what are the risks you’re taking? How can you validate these assumptions with experimentation? 

Also, what are the risks you’re not taking – but would love to at least try with an A/B test?


At Domino’s, we’re evangelising the role of experimentation for both customer experience optimisation and product development.



#2 Be bold.

Experimentation lets you take risks with the confidence of a safety net.

Because experiments are – by their nature – measurable and reversible, they give us a huge opportunity to test ideas bolder than we’d otherwise dare.

In his 2015 letter to shareholders, Jeff Bezos talked about Type 1 and Type 2 decisions.

Type 1 decisions are irreversible – “one-way doors”:

“These decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before.”

Type 2 decisions are reversible – “two-way doors”:

“But most decisions aren’t like [Type 1 decisions] – they are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through.”

Fast forward to his 2018 letter to shareholders, and Bezos doubles down on this approach:

“As a company grows, everything needs to scale, including the size of your failed experiments. If the size of your failures isn’t growing, you’re not going to be inventing at a size that can actually move the needle. Amazon will be experimenting at the right scale for a company of our size if we occasionally have multibillion-dollar failures.”

If we aren’t prepared to risk failure, then we don’t innovate. Instead, we stagnate and become Blockbuster in the Netflix era.

Experimentation, on the other hand, gives us a safety net to take risks. We can test our boldest concepts and ideas, which would otherwise be blocked or watered down by committee. After all, it’s only a test…

#3 Test early / test often.

Experimentation works best when you test early and often.

But most product teams test only once, at the end – to measure the impact of a new feature before or just after it launches. (This is the “safety check” function mentioned above.)

Their process normally looks like this: 

Whether the experiment wins or loses – whether the impact is positive or negative – the feature is typically rolled out anyway.

Why? Because of the emotional and financial investment in it. If you’ve spent 6 or 12 months building something and then find out it doesn’t work, what do you do? 

You could revert and write off the last 6 months’ investment. Or you could persevere and try to fix it as you go.

Most companies choose the second option – they invest time and money in making their product worse. 

As Carson Forter (ex-Twitch, now Future Research) says of bigger feature releases:

“By the time something this big has been built, the launch is very, very unlikely to be permanently rolled back no matter what the metrics say.” 

That’s why we should validate early concepts as well as ready-to-launch products. We start testing as early as possible – before we commit to the full investment – to get data on what works and what doesn’t.

After all, it’s easier to turn off a failed experiment than it is to write off a failed product launch. What’s more, gathering data from experiments will help us guide the direction of the product.

#4 Start small and scale.

To do that – to test early and often – you’ll frequently have to start with the “minimum viable experiment” (MVE).

Just like a minimum viable product, we’re looking to test a concept that is as simple and as impactful as possible.

Henrik Kniberg’s drawing illustrates this well:

So what does this look like in practice? Often “painted door tests” work well here. You don’t build the full product or feature and test that. After all, by that point, you’ve already committed to the majority of the investment. Instead, you create the illusion of the product or feature.

Suppose a retailer wanted to test a subscription product. They could build the full functionality and promotional material and then find out if it works. Or they could add a subscription option to their product details pages, and see if people select it.


A retailer could add a “Subscribe and Save” option similar to Amazon’s product page – even without building the underlying functionality. Then, they track the percentage of customers who try to select this option.


Ideally, before they run the experiment, they’d plan what they’d do next based on the uptake. So if fewer than 5% of customers click that option, they may deprioritise it. If 10% choose it, they might add it to the backlog. And if 20% or more go for it, then it may become their #1 priority until it ships.
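To make that agreement concrete, here’s a minimal sketch of the decision rule described above. The thresholds mirror the 5% / 10% / 20% example; the function name and the idea of writing it down as code are illustrative, not part of the original article.

```python
# Hypothetical sketch: agree the follow-up action before the painted-door test runs.
def next_step(uptake_rate: float) -> str:
    """Map the share of users who click the fake option to an agreed action."""
    if uptake_rate >= 0.20:
        return "make it the #1 priority until it ships"
    if uptake_rate >= 0.10:
        return "add it to the backlog"
    if uptake_rate < 0.05:
        return "deprioritise it"
    return "gather more evidence before deciding"

print(next_step(0.12))  # -> add it to the backlog
```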

We’ve helped our clients apply this to every aspect of their business. Should a food delivery company have Uber-style surge pricing? Should they allow tipping? What product should they launch next?

#5 Measure what matters.

The measurement of the experiment is obviously crucial. If you can’t measure the behaviour that you’re looking to drive, there’s probably little point in running the experiment. 

So it’s essential to define both:

  • the primary metric or “overall evaluation criterion” – essentially, the metric that shows whether the experiment wins or loses, and 
  • any secondary or “guardrail” metrics – metrics you’re not necessarily trying to affect, but that you don’t want to perform any worse. 

You’d set these with any experiment – whether you’re optimising a user journey or creating a new product. 

As far as possible – and as far as sample size and statistical significance allow – focus these metrics on commercial measures that affect business performance. So “engagement” may be acceptable when testing an MVE (like the fake subscription radio button above), but in future iterations you should build out the next step in the flow to ensure that the positive response is maintained throughout the funnel.
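As a minimal sketch of what “defining both” might look like in practice – the metric names below are hypothetical, and the structure is just one way to write the decision down before launch:

```python
# Hypothetical sketch: declare the primary and guardrail metrics up front.
from dataclasses import dataclass, field

@dataclass
class ExperimentMetrics:
    primary: str                                          # overall evaluation criterion
    guardrails: list[str] = field(default_factory=list)   # must not get worse

subscription_mve = ExperimentMetrics(
    primary="subscribe_option_click_rate",
    guardrails=["add_to_basket_rate", "checkout_conversion_rate"],
)
```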

Why is this approach better?

1. You build products with the strongest form of evidence – not opinion.
Casey Winters talks about the dichotomy between product visionaries and product leaders. A visionary relies more on opinion and self-belief, while a leader helps everyone to understand the vision, then builds the process and uses data to validate and iterate.

And the validation we get from experiments is stronger than any other form of evidence. Unlike traditional forms of product research – focus groups, customer interviews, etc – experimentation is both faster and more aligned with future customer behaviour.

The pyramid below shows the “hierarchy of evidence” – with the strongest forms of evidence at the top, and the weakest at the bottom.

You can see that randomised controlled trials (experiments or A/B tests) are second only to meta-analyses of multiple experiments in terms of quality of evidence and minimal risk of bias:

2. Low investment – financially and emotionally.
When we constantly test and iterate, we limit the financial and emotional fallout. Because we test early, we’ll quickly see if our product or feature resonates with users. If it does, we iterate and expand. If it doesn’t, we can modify the experiment or change direction. Either way, we’re limiting our exposure.

This applies emotionally as well as financially. There’s less attachment to a minimum viable experiment than there is to a fully built product. It’s easier to kill it and move on.

And because we’re reducing the financial investment, it means that…

3. You can test more ideas.
In a standard product development process, you have to choose the products or features to launch, without strong data to rely on. (Instead, you may have market research and focus groups, which are beneficial but don’t always translate to sales). 

In doing so, you narrow down your product roadmap unnecessarily – and you gamble everything on the product you launch.

But with experimentation, you can test all those initial ideas (and others that were maybe too risky to be included). Then you can iterate and develop the concept to a point where you’re launching with confidence.

It’s like cheating at product development – we can see what happens before we have to make our choice.

“Right now it’s only a notion, but I think I can get money to make it into a concept, and later turn it into an idea.” – Annie Hall (1977)

4. Test high-risk ideas in a low-risk way.
Because of the safety net that experimentation gives us (we can just turn off the test), it means we can make our concepts 10x bolder.

We don’t have to water down our products to reach a consensus with every stakeholder. Instead, we can test radical ideas – and just see what happens.

Like Bill Murray in Groundhog Day, we get to try again and again to see what works and what doesn’t. So we don’t have to play it safe with our ideas – we can test whatever we want.


Don’t forget, if we challenge the status quo – if we test the concepts that others won’t – then we get a competitive advantage. Not by copying our competitors, but by innovating with our products.

And this approach is, of course, hugely empowering for teams…

5. Experiment with autonomy.
Once you’ve set the KPIs for experimentation – ideally the North Star Metric that directs the product – then your team can experiment with autonomy.

There’s less need for continual approval, because the opinion you need is not from your colleagues and seniors within the business, but from your customers.

And this is a hugely liberating concept. Teams are free to experiment to create the best experience for their customers, rather than to win approval from their line manager.

6. Faster.
Experimentation doesn’t just give you data you can’t get anywhere else – it’s almost always faster too.

Suppose Domino’s Pizza want to launch a new pizza. A typical approach to R&D might mean they commission a study of consumer trends and behaviour, then use this to shortlist potential products, then run focus groups and taste tests, then build the supply chain and roll out the new product to their franchisees, and then…

Well, then – 12+ months after starting this process – they see whether customers choose to buy the new pizza. And if they don’t…

But with experimentation, that can all change. Instead of the 12+ month process above, Domino’s can run a “painted door” experiment on the menu. Instead of completing the full product development, they can add potential pizzas to the menu that look just like any other product. Then, they measure the add-to-basket rate for each.

This experiment-led approach might take just a couple of weeks – and a fraction of the cost – compared with traditional product development. What’s more, the data gathered is, as above, likely to correlate more closely with future sales.

7. Better for customers.
When people first hear about painted door tests like this Domino’s example, they worry about the impact on the customer.

“Isn’t that a bad customer experience – showing them a product they can’t order?”

And that’s fair – it’s obviously not a good experience for the customer. But the potential alternative is that you invest 12 months’ work in building a product nobody wants.

It’s far better to mildly frustrate a small sample of users in an experiment, than it is to launch products that people don’t love.

To find out more about our approach to product experimentation, please get in touch with Conversion.

SCORE: A dynamic prioritisation framework for A/B tests from Conversion.com

Why prioritise?

With experimentation and conversion optimisation, there is never a shortage of ideas to test.

In other industries, specialist knowledge is often a prerequisite. It’s hard to have an opinion on electrical engineering or pharmaceutical research without prior knowledge.

But with experimentation everyone can have an opinion: marketing, product, engineering, customer service – even our customers themselves. They can all suggest ideas to improve the website’s performance.

The challenge is how you prioritise the right experiments.

There’s a finite number of experiments that we can run – we’re limited both by the resource to create and analyse experiments, and also the traffic to run experiments on.

Prioritisation is the method to maximise impact with an efficient use of resources.

Where most prioritisation frameworks fall down

There are multiple prioritisation frameworks – PIE (from WiderFunnel), PXL (from ConversionXL), and more recently the native functionality within Optimizely’s Program Management.

Each framework has a broadly consistent approach: prioritisation is based on a combination of (a) the value of the experiment, and (b) the ease of execution.

WiderFunnel’s PIE framework uses three factors, scored out of 10:

  • potential (how much improvement can be made on the pages?)
  • importance (how valuable is the traffic to the page?) and
  • ease (how complicated will the test be to implement?)

This is effective: it ensures that you consider the potential uplift from the experiment alongside the importance of the page. (A high-impact experiment on a low-value page should rightfully be deprioritised.)

But it can be challenging to score these factors objectively – especially when considering an experiment’s potential.

ConversionXL’s PXL framework looks to address this. Rather than asking you to rate an experiment out of 10, it asks a series of yes/no questions to objectively assess its value and ease.

Experiments that are above the fold and based on quantitative and qualitative research will rightly score higher than a subtle experiment based on gut instinct alone.

This approach works well: it rewards the right behaviour (and can even help drive the right behaviour in the future, as users submit concepts that are more likely to score well).

But while it improves the objectivity in scoring, it lacks two fundamental elements:

  1. It accounts for page traffic, but not page value. So an above-the-fold research-backed experiment on a zero-value page could be prioritised above experiments that could have a much higher impact. (We used to work with a university in the US whose highest-traffic page was a blog post on ramen noodle recipes. It generated zero leads – but the PXL framework wouldn’t account for that automatically.)
  2. While it values qualitative and quantitative research, it doesn’t appear to include data from the previous experiments in its prioritisation. We know that qualitative research can sometimes be misleading (customers may say one thing and do something completely different). That’s why we validate our research with experimentation. But in this model, its focus is purely on research – whereas a conclusive experiment is the best indicator of a future iteration’s success.

Moreover, most frameworks struggle to adapt as an experimentation programme develops. They tend to work in isolation at the start – prioritising a long backlog of concepts – but over time, real life gets in the way.

Competing business goals, fire-fighting and resource challenges mean that the prioritisation becomes out-of-date – and you’re left with a backlog of experiments that is more static than a dynamic experimentation programme demands.

Introducing SCORE – Conversion.com’s prioritisation process

Our approach to prioritisation is based on more than 10 years’ experience running experimentation programmes for clients big and small.

We wanted to create an approach that:

  • Prioritises the right experiments: So you can deliver impact (and insight) rapidly.
  • Adapts based on insight + results: The more experiments you run, the stronger your prioritisation becomes.
  • Removes subjectivity: As far as possible, data should be driving prioritisation – not opinion.
  • Allows for the practicalities of running an experimentation programme: It adapts to the reality of working in a business where the wider priorities, goals and resources change.

But the downside is that it’s not a simple checklist model. In our experience, there’s no easy answer to prioritisation – it takes work. But it’s better to spend a little more time on prioritisation than waste a lot more effort building the wrong experiments.

With that in mind, we’re presenting SCORE – Conversion.com’s prioritisation process:

  • Strategy
  • Concepts
  • Order
  • Roadmap
  • Experimentation

As you’ll see, the prioritisation of concepts against each other happens in the middle of the process (“Order”) and is contingent on the programme’s strategy.

Strategy: Prioritising your experimentation framework

At Conversion.com, our experimentation framework is fundamental to our approach. Before we start on concepts, we first define the goal, KPIs, audiences, areas and levers (the factors that we believe affect user behaviour).

You can read more about our framework here and you can create your own with the templates here.

When your framework is complete (or, at least, started – it’s never really complete), we can prioritise at the macro level – before we even think about experiments.

Assuming we’ve defined and narrowed down the goal and KPIs, we then need to prioritise the audiences, areas and levers:

Audiences

Prioritise your audiences on volume, value and potential:

  • Volume – the monthly unique visitors of this audience. (That’s why it’s helpful to define identifiable audiences like “prospects”, “users on a free trial”, “new customers”, and so on.)
  • Value – the revenue or profit per user. (Continuing the above example, new customers are of course worth more than prospects – but at a far lower volume.)
  • Potential – the likelihood that you’ll be able to modify their behaviour. On a retail website, for example, there may be less potential to impact returning customers than potential customers – it may be harder to increase their motivation and ability to convert relative to a user who is new to the website.

You can, of course, change the criteria here to adapt the framework to better suit your requirements. But as a starting point, we suggest combining the profit per user and the potential improvement.

Don’t forget, we want to prioritise the biggest value audiences first – so that typically means targeting as many users as possible, rather than segmenting or personalising too soon.
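As a minimal sketch of how that scoring might look in a spreadsheet or script – the audiences and figures below are illustrative, not real client data:

```python
# Hypothetical sketch: score audiences by volume, value and potential,
# combining profit per user with the potential improvement as suggested above.
audiences = [
    {"name": "prospects",        "monthly_visitors": 400_000, "profit_per_user": 0.80,  "potential": 0.7},
    {"name": "free-trial users", "monthly_visitors": 60_000,  "profit_per_user": 6.50,  "potential": 0.5},
    {"name": "new customers",    "monthly_visitors": 25_000,  "profit_per_user": 18.00, "potential": 0.3},
]

for a in audiences:
    # value at stake = traffic x profit per user, weighted by how movable the audience is
    a["score"] = a["monthly_visitors"] * a["profit_per_user"] * a["potential"]

for a in sorted(audiences, key=lambda x: x["score"], reverse=True):
    print(f'{a["name"]:<18} {a["score"]:>12,.0f}')
```

The same scoring applies to areas – and, with confidence or win rate standing in for potential, to levers.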

Areas

In much the same way as audiences, we can prioritise the areas – the key content that the user interacts with.

For example, identify the key pages on the website (homepage, listings page, product page, etc) and score them on:

  • Volume – the monthly unique visitors for the area.
  • Value – the revenue or profit from the area.
  • Potential – the likelihood that you’ll be able to improve the area’s performance. (Now’s a good time to use your quantitative and qualitative research to inform this scoring.)

(It might sound like we’re falling into the trap of other prioritisation models: asking you to estimate potential, which can be subjective. But, in our experience, people are more likely to score an area objectively, rather than an experiment that they created and are passionate about.)

Also, this approach doesn’t need to be limited to your website. You can apply it to any other touchpoint in the user journey too – including offline. Your cart abandonment email, customer calls and Facebook ads can (and should) be used in this framework.

If your KPI is profit, you may want to include offline content like returns labels in your prioritisation model.

Levers

As above, levers are defined as the key factors or themes that you think affect an audience’s motivation or ability to convert on a specific area.

These might be themes like pricing, trust, delivery, returns, form usability, and so on. (Take another look at the experimentation framework to see why it’s important to separate the lever from the execution.)

When you’re starting to experiment, it’s hard to prioritise your levers – you won’t know what will work and what won’t.

That’s why you can prioritise them on either:

  • Confidence – a simple score to reflect the quantitative and qualitative research that supports the lever. If every research method shows trust as a major concern for your users, it should score higher than another lever that only appears occasionally.
  • Win rate – If you have run experiments on this lever in the past, what was their win rate? It’s normally a good indicator of future success.

Of course, if you’re starting experimentation, you won’t have a win rate to rely on (so estimating the confidence is a fantastic start).

But if you’ve got a good history of experimentation – and you’ve run the experiments correctly, and focused them on a single lever – then you should use this data to inform your prioritisation here.

Again, the more we experiment, the more accurate this gets – so don’t obsess over every detail. (After all, it’s possible that a valid lever may have a low win rate simply because of a couple of experiments with poor creative.)  

Putting this all together, you can now start to prioritise the audiences, areas and levers that should be focused on:

As you can see, we haven’t even started to think about concepts and execution – but we have a strong foundation for our prioritisation.

Concepts: Getting the right ideas

After defining the strategy, you can now run structured ideation around the KPIs, audiences, areas and levers that you’ve defined.

This creates the ideal structure for ideation.

Rather than starting with, “What do we want to test?” or “How can we improve product pages?”, we’re instead focusing on the core hypotheses that we want to validate:

  • How can we improve the perception of pricing on product pages for new customers?
  • How can we overcome concerns around delivery in the basket for all users?
  • And so on.

This structured ideation around a single hypothesis generates far better ideas – and means you’re less susceptible to the tendency to throw everything into a single experiment (and not knowing which part caused the positive/negative result afterwards).

Order: Prioritising the concepts

When prioritising the concepts – especially when a lever hasn’t been validated by prior experiments – you should look to start with the minimum viable experiment (MVE).

Just like a minimum viable product, we want to define the simplest experiment that allows us to validate the hypothesis. (Can we test a hypothesis with 5 hours of development time rather than 50?)

This is a hugely important concept – and one that’s easily overlooked. It’s natural that we want to create the “best” iteration for the content we’re working on – but that can limit the success of our experimentation programme. It’s far better to run ten MVEs across multiple levers that take 5 hours each to build, rather than one monster experiment that takes 50 hours to build. We’ll learn 10x as much, and drive significantly higher value.

In one A/B test for a real estate client, we created a fully functional “map view”. It was based on a significant volume of user research – but the minimum viable experiment would have been simply to test adding a “Map view” button without the underlying functionality.

So at the end of this phase, we should have defined the MVE for each of the high priority levers that we’re going to start with.

Roadmap: Creating an effective roadmap

There are many factors that can affect your experimentation roadmap – factors that stop you from starting at the top of your prioritised list and working your way down:

  • You may have limited resource, meaning that the bigger experiments have to wait till later.
  • There may be upcoming page changes or product promotions that will affect the experiment.
  • Other teams may be running experiments too, which you’ll need to plan around.

And there are dozens more: resource, product changes, marketing and seasonality can all block individual experiments – but they shouldn’t block experimentation altogether.

That’s why planning your roadmap is as important as prioritising the experiments. Planning delivers the largest impact (and insight) in spite of external factors.

To plan effectively:

  • Identify your swimlanes: These are the audiences and areas from your framework that you’ll be experimenting on. (Again, make sure you focus on the high priority audiences and areas – don’t be tempted to segment or personalise too early.)
  • Estimate experiment duration: Use an appropriate minimum detectable effect for the audience and area to calculate the duration, then block out this time in the roadmap (see the sketch after this list).
  • Experiment across multiple levers: Gather more insight (and spread your risk) by experimenting across multiple levers. If you focus heavily on a lever like “trust” with your first six experiments, you might have to start again if the first two or three experiments aren’t successful.
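Here’s a minimal sketch of that duration estimate, using the standard two-proportion sample-size formula. The baseline rate, minimum detectable effect and traffic figures are illustrative assumptions, not recommendations:

```python
# Hypothetical sketch: estimate how long an experiment needs to run.
from scipy.stats import norm

def sample_size_per_variation(baseline, mde_relative, alpha=0.05, power=0.8):
    """Visitors needed per variation to detect a relative uplift of mde_relative."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

n = sample_size_per_variation(baseline=0.04, mde_relative=0.10)  # 4% baseline, 10% relative MDE
weekly_traffic = 50_000                                          # visitors entering the experiment each week
weeks = (2 * n) / weekly_traffic                                 # control + one variation
print(f"~{n:,.0f} visitors per variation, roughly {weeks:.1f} weeks")
```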

Experimentation: Running and analysing the experiments

With each experiment, you’ll learn more about your users: what changes their behaviour and what doesn’t.

You can scale successful concepts and challenge unsuccessful concepts.

For successful experiments, you can iterate by:

  • Moving incrementally from minimum viable experiments to more impactful creative. (With one Conversion.com client, we started with a simple experiment that promoted the speed of delivery. After multiple successful experiments around delivery, we eventually worked with the client to test the commercial viability of same-day delivery.)
  • Applying the same lever to other areas and potentially audiences. If amplifying trust messaging on the basket page works well, it’ll probably work well on listing and product pages too.

Meanwhile, an experiment may be unsuccessful because:

  • The lever was invalidated – Qualitative research may have said customers care about the lever, but in practice it makes no difference.
  • The execution was poor – It happens sometimes. Every audience/area/lever combination can have thousands of possible executions – you won’t get it right first time, every time, and you risk rejecting a valid lever because of a lousy experiment.
  • There was an external factor – It’s also possible that other factors affected the test: there was a bug, the underlying page code changed, or a promotion or stock availability affected performance. It doesn’t happen often, but it needs to be checked.

In experiment post-mortems, it’s crucial to investigate which of these is most likely, so we don’t reject a lever because of poor execution or external factors.

Conduct experiment post-mortems so you don’t reject a lever because of poor execution or external factors.

What’s good (and bad) about this approach

This approach works for Conversion.com – we’ve validated it on clients big and small for more than ten years, and have improved it significantly along the way.

It’s good because:

  • It’s a structured and effective prioritisation strategy.
  • It doesn’t just reward data and insight – it actively adapts and improves over time.
  • It works in the real-world, allowing for the practicalities of running an experimentation programme.

On the flip side, its weaknesses are that:

  • It takes time to do properly. (You should create and prioritise your framework first.)
  • You can’t feed in 100 concepts and expect it to spit out a nicely ordered list. (But in our experience, you probably don’t want to.)

So, what now?

  1. If you haven’t already, print out or copy this Google slide for Conversion.com’s experimentation framework.
  2. Email marketing@conversion.com to join our mailing list. We like sharing how we approach experimentation.
  3. Share your feedback below. What do you like? What do you do differently?

Introducing our hypothesis framework

Download printable versions of our hypothesis framework here.

Experiments are the building blocks of optimisation programmes. Each experiment will at minimum teach us more about the audience – what makes them more or less likely to convert – and will often drive a significant uplift on key metrics.

At the heart of each experiment is the hypothesis – the statement that the experiment is built around.

But hypotheses can range in quality. In fact, many wouldn’t even qualify as a hypothesis – eg “What if we removed the registration step from checkout?” That might be fine to get an idea across, but it’s going to underperform as a test hypothesis.

For us, an effective hypothesis is made up of eight key components. If it’s reduced to just one component showing what you’ll change (the “test concept”), you’ll not just weaken the potential impact of the test – you’ll undermine the entire testing programme.

That’s why we created our hypothesis framework. Based on almost 10 years’ experience in optimisation and testing, we’ve created a simple framework that’s applicable to any industry.

Conversion.com’s hypothesis framework

What makes this framework effective?

It’s a simple framework – but there are three factors that make it so effective.

  1. Putting data first. Quantitative and qualitative data is literally the first element in the framework. It focuses the optimiser on understanding why visitors aren’t converting, rather than brainstorming solutions and hoping there’ll be a problem to match.
  2. Separating lever and concept. This distinction is relatively rare – but for us, it’s crucial. A lever is the core theme for a test (eg “emphasising urgency”), whereas the concept is the application of that lever to a specific area (eg “showing the number of available rooms on the hotel page”). It’s important to make the distinction as it affects what happens after a test completes. If a test wins, you can apply the same lever to other areas, as well as testing bolder creative on the original area. If it loses, then it’s important to question whether the lever or the concept was at fault – ie did you run a lousy test, or were users just not affected by the lever after all?
  3. Validating success criteria upfront: The KPI and duration elements are crucial factors in any test, and are often the most overlooked. Many experiments fail by optimising for a KPI that’s not a priority – eg increasing add-to-baskets without increasing sales. Likewise the duration should not be an afterthought, but instead the result of statistical analysis on the current conversion rate, volume of traffic, and the minimum detectable uplift. All too often, a team will define, build and start an experiment, before realising that its likely duration will be several months.

Terminology

Quant and qual data

What’s the data and insight that supports the test? This can come from a huge number of sources, like web analytics, sales data, form analysis, session replay, heatmapping, onsite surveys, offsite surveys, focus groups and usability tests. Eg “We know that 96% of visitors to the property results page don’t contact an agent. In usability tests, all users wanted to see the results on a map, rather than just as a list.”

Lever

What’s the core theme of the test, if distilled down to a simple phrase? Each lever can have multiple implementations or test concepts, so it’s important to distinguish between the lever and the concept. Eg a lever might be “emphasising urgency” or “simplifying the form”.

Audience

What’s the audience or segment that will be included in the test? Like with the area, make sure the audience has sufficient potential and traffic to merit being tested. Eg an audience may be “all visitors” or “returning visitors” or “desktop visitors”.

Goal

What’s the goal for the test? It’s important to prioritise the goals, as this will affect the KPIs. Eg the goal may be “increase orders” or “increase profit” or “increase new accounts”.

Test concept

What’s the implementation of the lever? This shows how you’re applying the lever in this test. Eg “adding a map of the local area that integrates with the search filters”.

Area

What’s the flow, page or element that the test is focused on? You’ll need to make sure there’s sufficient potential in the area (ie that an increase will have a meaningful impact) as well as sufficient traffic too (ie that the test can be completed within a reasonable duration – see below). Eg the area may be “the header”, “the application form” or “the search results page”.

KPI

The KPI defines how we’ll measure the goal. Eg the KPI could be “the number of successful applications” or “the average profit per order”.

Duration

Finally, the duration is how long you expect the test to run. It’s important to calculate this in advance – then stick to it. Eg the duration may be “2 weeks”.
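As a minimal sketch, the eight components could be captured in a single structure so nothing is skipped. The example below reuses the property-search examples from the definitions above where possible; the lever and goal shown are hypothetical stand-ins:

```python
# Hypothetical sketch: one record per hypothesis, covering all eight components.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    data: str          # quant and qual data supporting the test
    lever: str         # the core theme
    audience: str      # who is included
    goal: str          # what the business wants to move
    test_concept: str  # how the lever is applied
    area: str          # where it is applied
    kpi: str           # how the goal is measured
    duration: str      # pre-calculated run time

map_view = Hypothesis(
    data="96% of results-page visitors don't contact an agent; usability tests asked for a map view",
    lever="making results easier to evaluate",   # hypothetical lever
    audience="all visitors",
    goal="increase agent enquiries",             # hypothetical goal
    test_concept="add a map of the local area that integrates with the search filters",
    area="the search results page",
    kpi="number of agent enquiries",
    duration="2 weeks",
)
```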

Taking this further

This hypothesis framework isn’t limited to A/B tests on your website – it can apply anywhere: to your advertising creative and channels, even to your SEO, product and pricing strategy.
Any change and any experience can be optimised – and doing that effectively requires a data-driven and controlled framework like this.

Don’t forget – you can download printable versions of the hypothesis framework here.

Managed Service Sucks

Software and Services Don’t Mix

Why you shouldn’t buy services from your testing platform.

Split-testing software vendors have traditionally relied on their managed service to win and retain clients.

From Maxymiser to Adobe, Monetate to Qubit, the managed service has been essential to their growth. Even today, most companies cite a lack of resource as the biggest barrier in their optimisation program – and a managed service can help overcome that.

Except most managed services suck.

For software vendors, a managed service can throttle their growth and limit their potential. And for their customers, a managed service can lead to substandard results in their conversion optimisation programme.

And as the optimisation and testing industry continues to expand exponentially, this is only going to get worse.

The core of the problem is simple:

Software and service don’t scale at the same rate.

Scale is crucial to the success of software vendors. After all, most testing platforms have taken significant investment: Qubit has taken $75M, Monetate $46M, and Maxymiser was acquired by Oracle in August 2015.

But it’s challenging when these companies offer essentially two products – software and service – that scale at very different rates.

With limited cost of sales, a fast-growth software vendor may expect to increase its sales 3–5x in a year.

Look at the rise of Optimizely. Their product’s ease-of-use and their partner program allowed them to focus on the software, not a managed service. And that meant they could grow their market share rapidly:

 

Between 2012 and 2015, they grew 8x.

Now compare that growth to a marketing services agency. Even a fast-growth mid-size agency may only grow 50% a year – or to put it another way, 1.5x.

If you combine software and service in one company, you’re creating a business that is growing at two very different rates. And this creates a challenge for testing platforms who offer a managed service.

They have three options:

  1. Move away from managed service to self-serve and partner-led growth.
  2. Attempt to scale managed service to keep up with software growth.
  3. Some combination of 1 and 2.

Most will choose option 2 or 3, rather than going all-out on 1. And this choice threatens the quality of their managed service and their ability to scale through partners.

The cost of scaling services

To enable scaling – and to minimise costs – software vendors have to exploit efficiencies at the expense of quality:

  1. They strip back the service to the absolute minimum. They typically cut out the quantitative and qualitative analysis that supports good testing.
  2. They rely on cookie-cutter testing. Instead of creating a bespoke testing strategy for each client, they replicate the same test across multiple websites, regardless of whether it’s the right test to run.
  3. They load account managers with 10–20 clients – meaning the service is focused on doing the minimum necessary to limit churn.

In short, to keep up with the growth of the platform, they inevitably have to sacrifice the quality of the managed service in the interest of making it scale.

Let’s look at each of these three points in turn.

#1 Stripped-back service

At its core, conversion optimisation is simple:

Find out why people aren’t converting, then fix it.

The problem is that the first part – finding out why they aren’t converting – is actually pretty hard.

Earlier this year, I shared our take on Maslow’s hierarchy of needs – our “hierarchy of testing”:

The principle is the same as Maslow’s – the layers at the bottom of the pyramid are fundamental.

Starting at the top, there’s no point testing without a strategy. You can’t have a strategy without insight and data to support it. And you can’t get that without defining the goals and KPIs for the project.

In other words, you start at the bottom and work your way up. You don’t jump straight in with testing and hope to get good results.

In particular, the layers in the middle – data and insight – are essential for success. They link the testing program’s goals to the tests. Without them, you’re just guessing.

But all of this comes at a cost – and it’s typically the first cost that managed services cut. Instead of using a similar model to the pyramid above, they jump straight to the top and start testing, without the data and insight to show where and what they should be testing.

Ask them where they get their ideas from, and they’ll probably say heuristics – a nicer way of saying “best practice”.

#2 Cookie-cutter testing

Creating tests that aren’t based on data and insight is just the start.

To maximise efficiency (again, at the expense of quality), managed services will typically use similar tests across multiple clients. After all, why build a unique test for one client when you can roll it out across 10 websites with only minimal changes?

Break down the fees that managed services charge, and it’s easy to see why they have to do this.

Let’s assume Vendor X is charging £3k to deliver 2 tests per month. If we allow £1k/day as a standard managed service rate, that buys three days of work – roughly 24 hours, or 12 hours per test.

At Conversion.com, we know that even building an effective test can take longer than 12 hours – and that’s before you add in time for strategy, design, QA and project management.

The cookie-cutter approach is problematic for two core reasons:

  1. They start with the solution, and then find a problem for it to fix. It’s clear that this is going to deliver average results at best. (Imagine if a doctor or mechanic took a similar approach.)
  2. It limits the type of tests to those that can be easily applied across multiple websites. In other words, the concepts aren’t integrated into the website experience, but are just pasted on the UI. That’s why these tests typically add popups, modify the calls-to-action and tweak page elements.

#3 Account manager loading

This focus on efficiencies means that account managers are able to work across at least 10–20 clients. Even assuming that account managers are working at 80% utilisation, that means that clients are getting between 1.5 and 3 hours of their time each week.
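As a quick check of that maths – assuming a 37.5-hour working week, which isn’t stated above:

```python
# Quick worked example: a 37.5-hour week at 80% utilisation, split across 10-20 clients.
available_hours = 37.5 * 0.8                        # 30 hours of billable time per week
print(available_hours / 20, available_hours / 10)   # 1.5 to 3.0 hours per client per week
```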

Is that a problem?

At Conversion.com, our consultants manage 3–5 clients in total. We feel that limit is essential to deliver an effective strategy for optimisation.

Ultimately, it reflects our belief that conversion optimisation can and should be integral to how a company operates and markets itself – and that takes time.

Conversion optimisation should let you answer questions about your commercial, product and marketing strategy:

  • How should we price our product to maximise lifetime value?
  • How do we identify different user segments that let us personalise the experience?
  • Which marketing messages are most impactful – both on our website and in our online and offline advertising?

Not “Which colour button might work best?”

Conversion optimisation isn’t a series of tactical cookie-cutter tests that can be churned out for your website, while 19 other clients compete for your AM’s attention.

The impact on test results

It’s not surprising that a managed service with a “one-size-fits-most” approach for its clients doesn’t perform as well as a testing strategy from a dedicated optimisation agency.

The difference in approach is reflected in results (and, of course, the cost of the service).

But some managed services are misleading their clients over the success of their testing program.

There are three warning signs that the value of a managed service is being overreported:

  1. Weak KPIs: A KPI should be as closely linked as possible to revenue. For example, you may want to see whether a new product page design increases sales. But many managed services will track – and claim credit for – other KPIs, like increasing “add to cart”. While it may be interesting to track, it doesn’t indicate the success of a test. No business made more money just by getting more visitors to add to cart.
  2. Too many KPIs: There’s a reason why managed services often track these weak KPIs alongside effective KPIs like visit-to-purchase or qualified leads. The more KPIs you track – bounce rate, add to cart, step 1 of checkout – the more likely you are to see something significant in the results. At 95% significance, there’s a 1 in 20 chance of a false positive on any single comparison. So if you’re testing 4 variations against the control and measuring 5 KPIs for each, the chances are you’ll get a positive result on some KPI even when there’s no real effect (see the quick calculation after this list).
  3. Statistical significance: The industry’s approach to statistical significance has matured. People are less focused on just hitting a p value of 0.05 or less (ie 95% significance). Instead, strategists and platforms are also factoring in the volume of visitors, the number of conversions, and the overall test duration. And yet somehow we still hear about companies using a managed service for their testing, where the only result in the last 12 months is a modest uplift at 75% significance.
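To put a rough number on the multiple-KPI problem – assuming, simplistically, that the 20 comparisons are independent:

```python
# Quick worked example: 4 variations x 5 KPIs = 20 comparisons at 95% significance.
comparisons = 4 * 5
p_at_least_one_false_positive = 1 - 0.95 ** comparisons
print(f"{p_at_least_one_false_positive:.0%}")  # ~64% chance of at least one spurious "winner"
```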

The role of managed service

Managed service has a place. It can be essential to expand a new market – especially where the product’s learning curve is steep and may limit its appeal to a self-serve audience.

But the focus should always be on the quality of the service. Vendors can subsidise the cost of their service if needed – whether through funding or the higher profit margin in software – to deliver an effective optimisation program.

Then, their growth should come through self-service and partners. As above, service and software scale at different rates – and the faster a software vendor champions self-service and a partner program, the faster they’ll grow.

 

Disclaimer: I’m the CEO of Conversion.com, an agency that specialises in conversion optimisation. We partner with many of the software vendors above. While we have a vested interest in companies choosing us over managed service, we have an even greater interest in making sure they’re testing effectively.

2016: Five predictions for the conversion optimization industry

Every January for the last five years, I’ve thought to myself: “This year is the one when conversion optimization will become mainstream.”

Not just another process that’s occasionally bolted on to marketing or web design – but a mindset that’s core to how every company operates and grows.

But it’s never quite worked out like that.

Conversion optimization has come a long way: data-driven companies like Facebook are leading the charge, and more companies than ever are embracing testing. Conversion optimization has become huge business: not just for the brands who embrace a “test and learn” philosophy, but also for the software and service companies that support them.

But there’s still a long way to go. Here are my predictions for the year ahead…

 

#1 There’ll be a high-profile web redesign disaster

“Most brands are still redesigning their websites on a 3–5 year cycle”

We’re not fans of redesigns. They’re typically unfocused, unmeasured, “best practice”-riddled disasters – and the antithesis of the continual improvement that we promote. (That’s why our creative team don’t do redesigns – they focus on tests instead.)

It’s been two years since Marks and Spencer famously botched their redesign. It cost them £150 million and lowered their sales 8%.

But they won’t be the last – most brands are still redesigning their websites on a 3–5 year cycle. Meanwhile a company like Amazon continually optimizes, tests, iterates – without ever really appearing to “redesign”.

 

#2 Personalization will be the new battleground for websites and testing platforms

“Personalization can become a brand’s competitive advantage.”

I have a love/hate relationship with personalization.

It offers a huge opportunity to companies with a strong foundation of split-testing.

But if you don’t have this foundation – if you don’t know which user segments to target, what motivates them and what stops them – then you risk forking your website and creating multiple suboptimal experiences.

But personalization can and should be huge – both for brands and the software vendors that support them.

For brands with a mature testing program, personalization offers a way to drive even more value from every visitor. They already know from their A/B tests that their visitors behave differently: some will respond positively to a test, while others may be neutral or negative. Personalization offers a way to fix this – and opens up huge new opportunities for growth.

Most importantly, it can become a brand’s competitive advantage: while a competitor may be able to learn from or copy your website’s testing, they can never fully discover your personalization strategy.

Likewise for the testing platforms themselves, they will live or die by the success of their personalization offering. A/B testing platforms are all based on the same premise – and it’s easy to switch out one with another.

But personalization offers a huge opportunity for software vendors. They can differentiate with the sophistication of their platform, its ease of use, and potentially the AI that supports it.

And most importantly for software vendors, it’s difficult to swap one personalization setup for another: the more data they collect, the more complex the setup, the more value they add, the harder it’ll be to move away from – and the more they can charge.

 

#3 Google’s new optimization platform will disrupt the market

“Google’s biggest opportunity is to create a full-funnel testing platform.”

Some time in 2016, Google will publicly launch a testing platform that will disrupt the market.

It was ten years ago that they launched Google Website Optimizer – allowing anyone to A/B test without a five-figure monthly price tag. GWO has since been retired – replaced by a much weaker product, Content Experiments – but all that is set to change.

Google’s new testing platform is rumored to be in beta, and its release is set to disrupt the market. At minimum, it’ll bring a huge amount of interest and attention as more people discover the opportunities for optimization.

But we don’t know yet whether it’ll be a “me too” product – possibly with some additional features (like Content Experiments’ multi-armed bandit model) – or whether it’ll be a game changer like GWO was ten years ago.

Google’s biggest opportunity is to create a full-funnel testing platform: spanning acquisition, conversion and analytics. They have a huge competitive advantage – Google has the market share on analytics and online advertising. With two-way integrations for both, they can not only bring more people to testing, they can also bring the rigor and theory of optimization and testing to online advertising.

Right now, conversion optimization and testing is focused on websites – but brands are spending 99x as much bringing traffic to their website as they are on optimising the website itself. Google is perfectly positioned to take advantage of this.

 

#4 Testing will spread outside of website optimization

“There are opportunities throughout the funnel: from advertising through to CRM.”

The concept of testing and continual improvement isn’t unique to conversion optimization. What’s interesting is how conversion is starting to have an impact on complementary disciplines – like SEO and PPC.

This has already started. In January 2015, Pinterest posted about their success with split-testing for SEO. Then in December, Distilled announced a server-side solution for companies looking to split-test their SEO strategy. Meanwhile in PPC, Brainlabs have developed A/B Labs, allowing you to split-test campaign structures, bidding software and even different PPC agencies.

In 2016, this will gather pace. There are opportunities throughout the funnel:

  • At the top of the funnel, advertisers can leverage the insight and process of website optimization. By analysing the same qualitative data that informs testing on the website – as well as the content that is proven to motivate users – advertisers can create more appealing, persuasive creative. Then, they can increase the sophistication of their ad testing: not just in PPC, but even TV and outdoor – traditionally untested campaigns that would fall into the category of “Half the money I spend on advertising is wasted; the trouble is I don’t know which half.”
  • At the bottom of the funnel, brands can increase customer satisfaction, loyalty and lifetime value by again leveraging the insight and process of website optimization. By understanding the principles that motivate users to become customers – and by applying principles of testing to CRM – brands can test and optimise every customer touchpoint.  

 

#5 A testing/personalization vendor will IPO

“Optimizely have the market share in optimization, and the investment to push heavily on personalization.”

If the predicted “softening” of the tech market allows, we’ll see one of the pureplay testing and personalization vendors preparing to IPO.

The prime contender is Optimizely. They have the market share in optimization, and the investment to push heavily on personalization.

There’s a chance they could be an acquisition target instead. Interestingly, as of October 2015, Salesforce Ventures is hedging its bets with investments in both Optimizely and Qubit. (It’s only been a few months since Maxymiser was acquired by Oracle in August 2015.)

Any IPO or acquisition will depend on the success of either company’s personalization offering – as above, their monthly recurring revenue can increase exponentially if they get it right.

 

If you want to make these ideas a reality – and help brands exploit the potential of optimization and personalization – please look at our careers page!