Ep: 4 // Ronny Kohavi, Airbnb // Creating Trust in Experimentation
Ronny Kohavi is a driving force and leader in experimentation culture at enterprise legends like Amazon and Microsoft, and, most recently, as Airbnb's VP and Technical Fellow. Among his noteworthy contributions, he drove fundamental changes and achieved wild success at Microsoft by running experiments at scale. Ronny recently published Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. And with all that said, would you expect any less from a Stanford graduate who joined his first startup at age 15?
“If you define an intrapreneur as somebody who is within the company that tries to promote some innovation or some change, then certainly, I fit that definition.”
Running hundreds of experiments a day is not something many people have done in their careers. Yet despite his massive success in the industry, Ronny is no stranger to less-than-stellar results, even with a world-class personalization team. Ronny explains that even the experts end up being humbled by the reality that users don't like what we do most of the time.
Of all his accomplishments, Ronny is most proud of his ability to change experimentation culture. As he tells Chris, "The main thing that I'm proud of is a cultural change." Overcoming obstacles like getting executive buy-in, working through organizational shifts, changing design and development patterns, and building a platform you can trust are topics Chris and Ronny candidly discuss in this episode.
Top takeaways include:
- How to pick a path of least resistance to demonstrate value to management
- Why selecting stories will help evangelize experimentation across your organization
- The importance of validating before celebrating
Episode Transcript
Ronny Kohavi:
It’s not about steady progress that you know when you will deliver. A lot of these are hypotheses. Many of them will fail, but those that succeed are going to provide you with the learning.
Chris Goward:
This is Insights for Growth, the show where we hear insights from intrapreneurs who drive change within large organizations. I'm Chris Goward, Founder of Widerfunnel. Widerfunnel helps great companies design digital experiences that work, proven through rigorous experimentation systems. Today on the show, we have a great conversation lined up.
All right. So, why don’t we start? Why don’t you tell us your name and title?
Ronny Kohavi:
Hello, my name is Ronny Kohavi. I'm VP and Technical Fellow for Relevance and Personalization at Airbnb.
Chris Goward:
Great. Let’s go back to beginning, you started your career in programming. Since then, you’ve led teams at companies like Amazon, Microsoft, and now, Airbnb. Now, of course, you’ve been publishing journal articles. You’ve just recently released your new book, which I have sitting over here on my shelf. It’s very good, I recommend it. So, yeah. Your book is called Trustworthy Online Controlled Experiments. You’ve led internal development of software products that have had widespread adoption at companies. Would you consider yourself a change agent or an intrapreneur? How do you think about that?
Ronny Kohavi:
Yeah. So, if you define an intrapreneur as somebody who is within the company that tries to promote some innovation or some change, then certainly, I fit that definition. In the companies where I worked, I tried to introduce new innovation. Especially at Amazon and Microsoft, I have achieved what I consider to be fundamental changes in the way we have done things. So, as an example, at Microsoft, when I arrived, there were practically no controlled experiments going on, no A/B testing. There were a few one-off attempts that were analyzed, and then we built the platforms there to run controlled experiments at scale. When I left, which was about a year ago, we were starting over 100 controlled experiments every day. These were running across approximately 15 of the major product groups. So, anywhere from Bing and MSN, where we started, to Office and Microsoft Teams and Xbox, all the major product groups were starting to use experimentation and scaling it very nicely.
Chris Goward:
Right. So, that’s obviously must be a good feeling, having created that kind of change in a company of the scale of Microsoft. I mean, running hundreds of experiments a day is not something that a lot of people have done in their career. Looking back, I mean, you have a distinguished career. You’ve made a lot of change. What would you say stands out as the big accomplishment of what you’ve done so far?
Ronny Kohavi:
The main thing that I’m proud of is this cultural change of going from a model where people planned the release of a product, executed it against it, and then it was out there for a while. I think of Office as the extreme case, where nice model, 27 steps. You start with some research. You come up with key pillars. You refine them. And then three years later, there is a shipping product. That model was good early on in the ’90s. I think now that we have the internet and the ability to instrument and collect information from users, we have a superior model where we can try things out there, see the impact on users, and then iterate more quickly. That cultural change going from a fairly waterfall model into something where today, if you take Office today, Office ships every month. In that shipping release are hundreds of experiments that you will be in some of them in the control, some of them in treatment one, some of them are treatment two. And then data is collected, decisions are made. And then the process is iterated. So, you’re always able to provide users with features that turned out to be provably useful to some users and to the organization. I think that cultural change was the thing that I’m most proud of in my Microsoft career. I encourage companies that can do this to of course do this. I mean, we are at the extreme with software in the sense that the magic of software allows you to do this very cheaply, very easily. As you get into hardware and more physical instances, it’s harder to run experiments, it’s more expensive. Some people have been doing this. You don’t get the agility and the scale that you get in software, but there’s still value to it. Again, if you’re in the software business and you’re not running controlled experiments, you’re losing an opportunity to iterate much more quickly and gain all this valuable insight that ultimately leads to a better product.
Chris Goward:
So really, shifting the culture, especially in a software-driven company, from that waterfall process of months of pre-planning to a more agile, experiment-driven approach has been what you've been pushing. Has pushing that kind of experiment-driven, agile approach been the theme of what you've been doing throughout your career?
Ronny Kohavi:
Yeah. So, first of all, I’m not going to say throughout my whole career. I started out fairly young. I joined the startup when I was 15. And then I spent some time in Israeli army and a computer center. And then I did my PhD. And then I went to Silicon Graphics. In Silicon Graphics, we built a classical product. It was a data mining and visualization product, but it was built in the classical release cycles. Then I was at Blue Martini, which, again, was very much a regular cycle of delivery. Amazon, it was the first time where I started getting exposed to experiments, because it was a service-based model. Because it was a website, it was very easy to do. Amazon, especially with Jeff who liked to be more data driven, encouraged us to try things out and learn.
I remember this epiphany of realizing that most ideas fail.
I remember, I owned several teams there when I was Director of Data Mining and Personalization. The personalization team was world leading. We had great ideas, and the team was superb, with the best people. Yet, when we looked at the number of experiments that were failing, we were failing more than half the time.
I had this observation that, "Wow, even the best experts, we always think our ideas are great. We end up being humbled by the reality that the users don't like what we do most of the time." The features are not implemented as well as we hoped. Bugs happen more often. The tool of controlled experiments, these A/B tests, allows you to discover things really, really quickly. So, there's the long scale of controlled experiments, which may be several weeks, but there's also the short scale: if you're able to do near real-time generation of scorecards, you're able to detect early failures pretty quickly. That was a theme that evolved later on in my career at Bing, where the first iteration of the platform gave you this first scorecard a day later. Later iterations were giving you results in near real time. So, you launch something. Twenty minutes later, you could start looking at what happened. If the change had a bug in it… Again, this was one of the surprises, that approximately 10% of the time, there was a serious enough bug that we would abort. It became so common that we said, "We will automate this process." So as we generated this near real-time scorecard, as we were able to tell you what happened to the key metrics, we would say, "If something is so bad, we are losing, the user sessions are getting smaller, there are fewer sessions, we're losing a lot of money." Any one of those key metrics, you can think of it as a small number of metrics, if they're down so much, we should abort the experiment automatically. So, we're doing that at 10:00 PM at night, at 2:00 AM at night. That's the beauty of being able to automate the experimentation platform that can give you that. So, I think that's the phases of going from waterfall, to experiments over slow daily or weekly release cycles and updates of scorecards, to even near real time. What's nice about this is that there's a notion of statistical power, which is the ability to detect a change if it exists. So, if there is a change, let's say you're dropping revenue by 20%, the power tells you the probability that you will detect that change, and you normally want something like 80 or 90%. That usually depends on the number of users you're able to expose and the metric itself, which has its inherent properties, its own variance. The sensitivity, the smallest change you're able to detect, improves with the square root of the number of users. So, when you are looking for small changes, you need a lot more users. When you're looking for an egregious change, like we dropped a key metric by 10%, you don't need a lot of users. That's why the whole real-time concept works so well, because if something drops severely, you'll be able to detect this in a very short amount of time.
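To make the square-root relationship Ronny describes concrete, here is a minimal sketch in Python (not from the episode or the book; the function name, baseline rate, and example lifts are illustrative assumptions) of the standard normal-approximation sample-size calculation for a two-proportion test:

```python
# A minimal sketch: users needed per variant to detect a relative lift on a
# conversion rate, at a given significance level and statistical power.
from scipy.stats import norm

def users_per_group(baseline_rate: float, relative_lift: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per variant for a two-sided, two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired power (e.g., 80%)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2))

# A subtle 1% relative lift on a 5% conversion rate needs millions of users per
# group; a 10% regression needs only tens of thousands.
print(users_per_group(0.05, 0.01))    # ~3,000,000 per group
print(users_per_group(0.05, -0.10))   # ~28,000 per group
```

The roughly 100x gap between the two cases is the square-root relationship at work: detecting an effect 10x smaller takes about 100x the users, which is why egregious regressions surface in near real time while subtle wins take weeks.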
Chris Goward:
Now we’re going to get into a little bit more about statistical power and making better decisions. So, I don’t want to lose listeners when you start talking about square roots or anything like that, but I do want to also get back to the idea of the culture change. So, it sounds like you were bitten by the bug of experimentation and seeing the power of it at Amazon. Is that fair to say?
Ronny Kohavi:
Yes, absolutely. I mean, there was this eye-opening observation that we are mostly wrong. Most of our hypotheses, most of the things that we believe are going to be useful… any feature that a team builds, unless it's a regulatory feature that you have to build, is normally designed to improve the product. So, it was very humbling to see most ideas fail.
Chris Goward:
Yeah, I’m sure. So, that ties back to what you were first mentioning is this idea of creating culture change. That is what you’re most proud of. It seems like that is really closely tied to culture, it has to be. Experimentation requires a certain humility and a certain recognition that we’re wrong a lot of the time. No matter how smart we think we are, how smart we might be, we have blind spots. We are not our full sample size of our audience, right?
Ronny Kohavi:
I’ll tell you a funny story, which is when I came to Microsoft, that was the initial pitch I started to give, which is look, at Amazon, we were failing more than 50% of the time. But the culture wasn’t ready for that. The typical response was, “We have better program managers here.”
Chris Goward:
Oh, wow. Okay. Yeah, right. Right, right, right. So, that's interesting, right? You're trying to create this cultural change, this acceptance of data-driven, insight-driven, experiment-led, iterative product design or customer experience design, however you want to categorize it. You faced some resistance, right? In any large organization, or any organization of more than one person, there's going to be resistance to change, especially so in large companies. So, how did you face and overcome that resistance? What would you do to help overcome that inertia?
Ronny Kohavi:
So, first of all, I’ll say it wasn’t easy. The first few years were really hard in the sense that I was trying to build something that few people appreciated. The things that have helped are to pick an area where you know that this is going to help. So, we started off when we built the first platform, we went after MSN, because it’s a website. It’s easy to change. It’s not as hard as Office, which has to change their whole three-year release cycle. So, we went after the Agile groups. It was MSN and then Bing also, where you can see the value quickly. The main thing is to establish a beachhead with a few areas where you can have success. So, the ability to show that when people are open to running experiments, some of them are going to be surprisingly good, some of them are going to be surprisingly bad. There’s this notion of it. What is a surprising experiment really? It’s one where the result deviates a lot from your expectation, but in absolute value. So, if you think something was going to be great and it was great, then well, it’s not surprising. But if you thought something was going to be great and it failed, well, you learn something. But the opposite is even more interesting, which is you thought this project was going to be mediocre and it was this home run, then you learned a lot, right? I mean, the book opens up with an example of a really small change to the way we displayed ads at Bing, an idea that was on the backlog for months and months and months. It was just not appreciated. It was even such a small idea that the person that ultimately implemented, it took days to implement. This was this breakthrough idea that increased our revenue by over 10%. Nobody believed it initially. It’s like, “Wow, this minor change created this thing.” It was so extreme that I remember when they launched the experiment even, which was a percentage of the users, alarms started firing that there is something wrong with our logging, because we’re making too much money. By the way, that used to happen once in a while. Somebody would log revenue twice and would seem like we’re making a lot of money, but it wasn’t real. In this case, the small change that took a couple of days to implement was so big in terms of revenue that all these alarms fired. We didn’t believe the results. We stopped and reiterated and tried it. We re-did it multiple times, but it was real. Those things that allow you to show the surprises, meaning something that was deemed as mediocre that is a huge win or vice versa, those are the ones that become stories that you have to evangelize across the org and get the executives to see them, believe them, repeat them, and say that “This is how we want to move forward.”
Chris Goward:
Yeah, well, you touched on a couple points there that are interesting and that really resonate with my experience leading experimentation programs for 13, almost 14 years now at Widerfunnel. First of all, to create that culture, sometimes you have to start with a little bit of protective shelter to get momentum and prove out some experiments that show good wins before it's ready to be exposed to the light of day in the larger organization, where the inertia could kill it initially, right? Sometimes, if people try to go too big, too fast, when there isn't the culture of support and acceptance, especially at the executive level, you can lose momentum quickly when things don't happen as quickly as expected.
I also think that what you’re touching on, the nature of experimentation sometimes is counter intuitive, where people are used to seeing regular progression over time. If you’re chipping away at a stone, you expect that every time your hammer hits the stone, you’re going to chip something off of it. So, you see regular progression. But that’s not the way experimentation works, is it? You learn something. Hopefully, you design your experiments properly every time.
But the big wins come in large bursts usually, not progressively evenly over time.
Ronny Kohavi:
Absolutely. In fact, this is one of the differences that the culture has to adapt to. So, when you think about the Office model, this waterfall model, they were able to say, "We are going to release on this date three years out," and meet that, because it was, like you said, chiseling at the stone. There was a process. You would have to code things. Things would be late. You work hard overnight to meet the schedule, but it was activity that you knew would deliver: if there was a feature and there was a good spec and you were able to write it, then QA was able to test it. There was a cycle of stabilization and reducing the bugs. You can be more predictable about when you would deliver something. But in our world, when you try to solve problems like relevance for search, now that I'm doing this at Airbnb and also did at Bing, and even when you're introducing new features, the mindset has to change from, "We're going to build it," to "Let's evaluate if this is useful." Is it actually going to be used? Is this going to improve something about the user experience? Are you able to give the user a capability that they didn't have before, or make something that existed before faster, easier, more discoverable, all these things? That's what experimentation gives you. As you said beautifully,
…it’s not about steady progress that you know when you will deliver. A lot of these are hypothesis. Many of them will fail, but those that succeed are going to provide you with a learning.
I would say that there are certain domains where this is more important than others. I think of search; one of the key features in any retail site or search engine is one of those things that tends to have models behind it, usually machine learning models. It's very hard to spec, to say in advance, "Build X and it will be useful." Many of the ideas that we try fail. We tend to focus on the anecdotal examples that say our new feature will work. We tend to not always see that it's a recall problem: where are the places that this will hurt the user experience? That's what experimentation gives you. You actually expose this new idea to users. You measure how the user experience changes, what the key metrics are and how they move. And then you're able to objectively determine if the feature is useful or not, relative to the metrics that you determined.
Chris Goward:
Right. So, when making this case, whether it's a business case or a promotion of the culture of experimentation, there are two sides to it then. You're looking at the big gains and the wins, but then also the risk mitigation, right? The potential for loss if you release a feature that actually harms the user experience when you'd predicted exactly the inverse.
Ronny Kohavi:
Yes, I mean, there’s two ways to improve the company model, which is to raise the top line or reduce the cost or the losses. Experimentation gives you both, right? We know of things where if you introduce a positive feature that helps the users, they’re going to sign up more. They reduce churn. They may generate more revenue if you have an immediate model, like an ask-based model. But there’s also the stuff that hurts you. Knowing that something hurts key metrics and stopping that release is also very, very important.
Chris Goward:
Yeah, talking about the cultural aspect of it, would you say that that’s been the biggest challenge overall, or were there other equal challenges you faced in trying to implement this kind of program?
Ronny Kohavi:
So, I think the cultural challenges are the hardest in terms of it's hard to predict… I think that building a good experimentation platform is a challenge. Building something that you can trust is a challenge. I often point at the example of the early versions of Optimizely. They built their A/B testing platform, which initially had not just software bugs, but statistically incorrect outcomes in the results they gave you. They were not aware of something called multiple hypothesis testing. They allowed you to peek at the results very often, which increases the probability that you will make something look statistically significant when it isn't. We don't have to go into the details, but let's say that they declared success too many times. Now, getting the statistics right is also important. It's not easy. Optimizely did fix that later on with what they call the New Stats Engine, but the fact is that initially they didn't get it right. Also, when I was involved in building the platforms, we got things wrong. I think it is a key thing to always test yourself, always triangulate. There's this idea called an A/A test, which we may get to later on, but there are tests that you can do to evaluate that the technology you build is reliable. So while I think the cultural challenges are harder to predict, will the executives buy in? How well will the organization shift? Will we change some of the design patterns, some of the development patterns? I also think that there are challenges around building a platform that you can trust. I've seen examples where the first experiment might be this huge win and then you realize you made a mistake. It's all about credibility.
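To illustrate the peeking problem Ronny describes, here is a minimal simulation sketch in Python (not from the episode; the batch sizes, conversion rate, and number of simulated tests are arbitrary assumptions). Both arms are identical, so every "significant" result is a false positive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_batches, batch_size, alpha = 2000, 20, 500, 0.05
peeking_fp, fixed_horizon_fp = 0, 0

for _ in range(n_tests):
    # Identical arms: the true conversion rate is 5% in both A and A'.
    a = rng.binomial(1, 0.05, n_batches * batch_size)
    b = rng.binomial(1, 0.05, n_batches * batch_size)
    significant_at_any_peek = False
    for i in range(1, n_batches + 1):
        _, p = stats.ttest_ind(a[: i * batch_size], b[: i * batch_size])
        if p < alpha:
            significant_at_any_peek = True
    peeking_fp += significant_at_any_peek
    fixed_horizon_fp += p < alpha  # p from the last iteration = the full sample

print(f"false positive rate when peeking after every batch: {peeking_fp / n_tests:.1%}")
print(f"false positive rate with a single, planned analysis: {fixed_horizon_fp / n_tests:.1%}")
```

Stopping on the first "significant" peek inflates the nominal 5% error rate several times over, which is roughly the failure mode described here; a fixed analysis horizon or a sequential-testing correction avoids it.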
Chris Goward:
You’re right.
Ronny Kohavi:
Be careful with that.
Chris Goward:
Yeah. So, that’s something that I know you’ve been advocating for quite a lot. Of course, your book is Trustworthy Online Controlled Experiments. So, it’s not just online controlled experiments, but how do we make them trustworthy, right? So, it seems like your theme these days. Perhaps why are you such an advocate of this?
Ronny Kohavi:
Yeah. So, that theme, by the way, goes back to one of the first things that we did when we built the original experimentation platform, which was to set our mission statement: to accelerate innovation using trustworthy experimentation. That was from the early days, around 2006, when we formed the initial platform. The reason we chose the word 'trustworthy' is that we learned that many of the initial results that we had… Again, my exposure was at Amazon. Some of the most extreme results, when you dug into them, were incorrect. Building that trust means that when you do get a result, you should be able to say, "Yes, this is a number that I believe in." That's hard. There's a saying that I put in several of the papers, which is, "Getting numbers is easy. Getting numbers you can trust is hard." So, the initial versions of platforms that people build might generate numbers, but the question is, "Can I trust them?"
Chris Goward:
Right. If you have numbers that are untrustworthy, then what’s the point, right? The only thing worse than an experiment result that’s losing is an experiment result that wins that is untrustworthy, because now you’re actually making a poor decision that you’re very confident in.
Ronny Kohavi:
But it’s the unknown unknown, right? Because people will give you the P value and it will have five digits of precision. And then they’ll give you a confidence interval. What you don’t know is that these are incorrect. So, desire to validate that what you’re telling your users when you show a scorecard is trustworthy, not everybody has that. I think that’s actually that skepticism, that desire to revalidate, to triangulate is a property of the best data scientists out there. They look for things to disprove that this is such a great win, because it’s very unlikely that you have a 10% experiment win, right? So, when you have that, don’t declare success before you validate it in many ways and triangulate and see that this is a real effect.
Chris Goward:
Well, it seems like we could probably generalize that more broadly. The best decision makers are probably skeptics overall, right? They have some curiosity to really understand what's behind it.
Ronny Kohavi:
Right. They triangulate, they measure the same thing from multiple angles. You look for ways to disprove this. I mean, this is hard, right? You want to celebrate the win. You have to have this instinct to say, "Before I celebrate, let me validate it to make sure that this is real. Can I explain why, and can I segment and see that the effect is good in multiple segments? Because it can't be that something is good in only one segment, yet the overall average is so high, right? Ultimately, something that's a huge win has to impact a lot of large segments."
Chris Goward:
Okay, so let’s get into a little bit of that. What you talked about in the book is how to create more trust in your experimentation. If you’re going to experiment, you might as well make sure that you’re experimenting properly. Otherwise, yeah, you’re probably wasting your time. So, you speak a lot about Twyman’s law and how that plays out. Why don’t you tell us about that?
Ronny Kohavi:
Yeah. So, Twyman’s law is actually very interesting in the sense that I have seen this, worded in several different ways, but there is no real book that says Twyman’s law, right? People ascribed to Twyman something that was said, but the common version of this is basically around the idea that any figure that looks interesting or different is usually wrong. This is the earliest wording of this that I found. The one that I actually was exposed to this is any statistics that appear interesting is almost certainly a mistake.
Basically, the idea of Twyman’s law is that if you’re running a lot of experiments and most of them come out to be small impact, I managed to get 0.5% or 1% or you lose your 0.5%, lose your 2%. Suddenly, you have a 5% or 10% win, it’s interesting, it’s surprising. But your prior should be, “Hold on, it’s probably not real, because the probability of that happening is very, very low, very low.” So, we made it a habit when something extreme happens is to say, “Hold on, let’s make sure this doesn’t violate Twyman’s law. Let’s validate the results, spend more time to see that this is real.” There are wins. I mean, again, I opened the book with an example that was over 10% winning revenue, but it was rare. We had other examples that we thought that looked initially like a great win, but we show that they were not trustworthy, that there were some bugs. There were some reason why we were skewed. For example, a classical thing that happens is we do a redirect. So, the treatment does a redirect. The slow users fail that redirect more often. So, only the good users make it to your treatment. So, they’re going to spend more money, they’re going to have faster internet connections, they’ll show better metrics, but you lost users on the way.
That’s why we built this test called a sample ratio mismatch to detect those effects. So, very important law. There’s a book that I recently read that has the same theme. They never used the word Twyman’s law, but the book’s name is Calling BS. They give examples that’s very much aligned with Twyman’s law.
Chris Goward:
Yeah, that sounds like my kind of book. I'd like to take a look at it. So, okay. So, let's talk about some more of those things that you've learned over the many years of experimentation that you've put into place to create more trust in the experimental results. You talk about the sample ratio mismatch metric. Of course, you mentioned the A/B, A/A test as well; we've been doing that for many years at Widerfunnel too. Often revalidating something that looks surprising with an A/A/B/B just to triple check that everything in the instrumentation is working. So, you have some other instrumentation tests that you've put into place.
Ronny Kohavi:
Yeah. So, I think the main insight was that we should run A/A tests. So, first, let me back up and say, "What is an A/A test?" You normally think of an A/B test, where you have a version A and version B, or the control and the treatment. You're looking for differences between what exists today, the control or the A, and the new feature that you've launched, which is the B or the treatment. In an A/A test, there is no change. You split the users into two, A and A prime, you can think about it, but we know that there is no difference. So, what you want is that the system should tell you 95% of the time, if you use a common P value of 0.05, that there is no difference. I remember when we first talked about running A/A tests, there was this discussion of, "Why do it? Are we fighting the math?" No, we're actually fighting the implementation of the math. The math tells us that you should get a non-statistically significant result 95% of the time, but maybe our implementation is wrong. It was actually surprising to see how many amazing insights came from these A/A tests. So, I remember the first one was that we weren't correctly computing the variance, which then impacted the confidence intervals. We weren't computing the variance of ratio metrics correctly. So, if you have a ratio metric, like click-through rate, number of clicks divided by number of pages, it turns out that the standard computation that you learned in Stats 101 is going to fail. So, you have to realize that. What the A/A test showed us is that yes, we did not correctly estimate the variance or the standard deviation. That caused us to declare success or failure a lot more often than we should have. So, that's an example of something that we discovered. Then we had to go and understand, "Why are these metrics showing statistically significant more often than they should?" We learned that and adapted. There are ways to work around these. Sample ratio mismatch was another example. If you're running an experiment, and for most experiments we recommend running with an equal assignment to the control and the treatment, you want to assign 10%/10% or 50%/50%. If you then get 51% versus 49%, or if your experiment is large and you get 49.9% versus 50.1%, something is wrong. Because the law of large numbers tells you that when you have a million users, a 0.1% deviation should be extremely rare. So, you should ask yourself, "What is the probability that I will get such a difference?" You can compute that probability. If you get a value like this should happen once in a million and you've only run 50 experiments, then whoa, did I just observe a really rare event? Did I just get really, really lucky, like winning the lottery by chance? No, it's most likely you have some bug. Debugging these sample ratio mismatches is very, very hard.
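As a concrete illustration of the sample ratio mismatch check, here is a minimal sketch in Python (not the book's implementation; the function name, the 50/50 expectation, and the example counts are assumptions) using a chi-squared goodness-of-fit test against the configured split:

```python
from scipy.stats import chisquare

def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """p-value that the observed split is consistent with the configured split."""
    total = control_users + treatment_users
    expected = [total * expected_control_share, total * (1 - expected_control_share)]
    _, p = chisquare([control_users, treatment_users], f_exp=expected)
    return p

# With a million users, even small imbalances against a 50/50 design are telling:
print(srm_p_value(510_000, 490_000))  # 51% vs 49%: ~0, assignment is clearly broken
print(srm_p_value(502_000, 498_000))  # 50.2% vs 49.8%: ~6e-5, investigate before trusting metrics
print(srm_p_value(500_300, 499_700))  # 50.03% vs 49.97%: ~0.55, consistent with 50/50
```

A common practice is to alert on any SRM below a very small threshold and withhold the scorecard, because a mismatched split usually means users were lost or double-counted on one side, exactly like the redirect example above.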
Chris Goward:
Thinking about a lot of our listeners here will be business decision makers, they’re innovators, champions of innovation. They understand that being data driven and making these kinds of decisions is very important. They want to have that in place. They may not be statisticians. They may not be programmers. They may not be really able to even dig into that or understand, but they need to know enough to be able to ask the questions of the people informing them and reporting back on these kinds of things. Do you have some advice on the kinds of questions they should be asking of their statisticians, data scientists or programmers, developers or whoever is implementing their experiments?
Ronny Kohavi:
That’s a great question. The easiest thing for me to say is if you run controlled experiments, did you do some validations that the result does not have anomalies that would invalidate it? You can point people at articles or book. Those statisticians should be able to run those tests. But the key is to allocate time for that to say, “Is every result that I’m getting… Has it passed the three statistical tests that are recommended in chapter three of the book?” That builds that level of trust, but I’ll go even more basic. We talked about this hierarchy of evidence early in the book. A lot of the executives may not even understand that different analyses have different inherent reliability to them. So, there’s something called hierarchy of evidence. If you heard something from a user that complained, well, maybe. If you ran a study, and show that X percent of users do something and Y percent of users do something, you have a good hypothesis that may be pushing one to the other may help, but it is an observational study. It does not necessarily imply this causality, which is the key that we want to prove, that if you push people to do something, if you introduce a feature, you will actually cause metrics to move. So, that’s the second rung, which is to do an observational study. And then at the top of the pyramid are these controlled experiments, the ones that we recommend, the A/B test, where there is a randomized control going on and you split the users. Nothing should be different between these two groups, except the fact that there was a random number generation process. So, something went external in the stock market, if there’s a spike in corona, whatever the other factors that are external to the system happen, they should impact the control and treatment in the same way. So, that if you see a difference between them, it is likely due to my change. That’s why you have to still run a statistical test to say, “Is this delta that I’m seeing, is that statistically significant?” So, going back, the executives first have to understand this rung of this hierarchy. That when they get a report that says this is happening, if it’s not based on a controlled experiment, they should not assign to this the same level of trust as a controlled experiment. I think that was one of the early things that I learned that many people were not… They necessarily don’t have a science background, or they don’t think about the fact that this is an interesting study, but it’s all observational. A lot of decisions happen based on observations. And then people are surprised that, “Well, we did this, and nothing happened.” Well, it was not causal, right? There’s multiple reasons why the ground is wet. It could be that it was raining, but it could also be that the sprinklers were on. So, you accounted for this other factor. So, that’s why controlled experiments, in terms of the scientific rigor of them, they’re considered to be the gold standard. In fact, there’s one more which is multiple controlled experiments, right? If you’re able to run a controlled experiment and replicate it, that gives even more confidence that the feature that you build is actually useful to the metrics.
Chris Goward:
You’re right, which is the basis of the scientific method, having peer reviewed journals of replicable experiments.
Ronny Kohavi:
Right. But that’s why we also have this problem right now in the community of many things not being replicated, right? Because we do a lot of these studies and because there’s a bias to publish only things that are statistically significant, you run enough ideas, some of them will be statistically significant by chance and they will not replicate. That’s why it’s so important to both yourself replicate, don’t cheat yourself. We run so many ideas, most of them fail. Therefore, if the P value is not very low, then replicating it is a good idea, right? The standard .05 that we’re used to in the broad literature may not be good enough if you’re starting 100 new experiments every day. Because then think about if 5 of them, if there is no difference, you will get statistically significant and a P value of .05 for 5 out of 100.
Chris Goward:
Yeah. So, okay. So, you’re talking about the hierarchy of believability or trustworthiness in your methodology. It’s similar to something that I’ve been talking about for the last year or so is this a mix of methodologies that have different purposes for different outcomes. So, in some cases where we want high trustworthiness, then an A/B test or an A/A/B/B test is probably your highest confidence level in a controlled experiment, but in some cases, the controlled experiment is not actually going to be the right tool for the job if what you want is a rich understanding of the why behind it. Sometimes you need qualitative input to enrich your understanding of the customer experience. So, we’ve been thinking about this idea of a quadrant of methodologies where some methods are behavioral, where you’re observing how people actually behave. Others are attitudinal, which is how people say they behave or what they believe. Some are qualitative, and some are quantitative. So, it’s this quadrant of different methods. Sometimes user testing is the right thing to find new ideas about what may be a problem, which you wouldn’t have even known to test if you didn’t do that.
Ronny Kohavi:
It’s called hypothesis generation. So, absolutely. Hypothesis generation comes from a multitude of sources. You could do some data mining on the data. You can build some models. As you said, you can do some user studies. You can just look online. We have the amazing opportunity to looking at how users flow through our website, as an example, right? So, these studies are all true ways of generating ideas. Of course, there’s anecdotal evidence, right? The CEO sends you an email saying, “God, why are we showing this result?” or “Why are we not doing this correctly?” So, all these are sources of ideas, of hypotheses, then comes the question of, “Which of them do we have an idea on how to improve something? Does it have enough coverage?” That’s one of the challenges that I tell people. It’s very easy to solve this example. Are you able to generalize it, so that it will cover 10% of users, 20% of cases, 30% of cases, right? The higher the coverage, the more opportunity to have to make a big impact. One of the things that is important to realize is when you run a controlled experiment, if it only triggers, meaning, it only impacts 1% of users, then you’re very limited in what you can do, right? You can make that 1% of users 50% more effective, but ultimately, you’ve moved the key metric by .5%. No experiment gives you 50% more effectiveness. If we get a few percentage points, it’s a huge win. If you got a few percentage points and you only apply them to 1% multiply the two out, you’ve made a very small impact. So, that challenge between the ability to have large coverage or a large triggering rate and have a large impact on the population, there’s always this trade off, which is a very interesting one.
Chris Goward:
Right. So, it gets into a whole other area of prioritizing opportunities and finding the things that will have impact on the actual business rather than just being an interesting experiment that you have an idea about.
Ronny Kohavi:
Absolutely.
Chris Goward:
So, you’ve been promoting some other ideas that we’ve been big fans of at Widerfunnel as well. A few years ago, we started thinking more about the business impacting metrics. When you’re running an experiment, you need to think about making sure that you’re not only trustworthy in the results, but that the results are actually important to the business. So, there was this idea of the True North Metric that that came out about finding the one metric that is indicative of improvement in the business.
Ronny Kohavi:
What matters, yes.
Chris Goward:
Yeah, and you went a step further and started talking about this Overall Evaluation Criterion, OEC, which is a blend of metrics that indicate what's important to the business. We love that concept of looking at that blend. We do that quite a bit here with all of our clients. What are all of the metrics that indicate success in a variety of ways, for the user, so their goals are being met, as well as the business's goals being met? And then you also talk about guardrail metrics being really important. So, what's that? Tell us about that?
Ronny Kohavi:
Okay, so let me start with the OEC, which I think is a very, very important concept. When you start off running A/B tests, the first question you should ask is, "What am I trying to improve?" I think just getting that organizational alignment on, "What is the small set of metrics that matter for the business?" is a huge win. So, if you're able to get that agreement… I'll tell you a funny story, which is when I first worked with Microsoft Support, they came to us. They said, "We want to onboard to the experimentation platform." They gave us all the statistics about how many people visit their site and what it could do if we were able to be more effective in supporting them. And then I said, "What is the metric that you're optimizing for?" It was really interesting, because they were like, "Ronny, we read your papers. We even know the term OEC, the Overall Evaluation Criterion. It is time." I thought for a minute. I was like, "Wait a minute. Is more time on a support site a good thing or a bad thing?" It was funny, because half the people in the room from that group thought that people spending more time is a good thing and half the people thought it was a bad thing. So, when you can't get the direction right, obviously, you haven't thought about the problem well enough. People have their reasons, right? If you're able to solve the problem quickly and get out, then it's a good thing. However, it's also possible that you abandoned the site and went to a search engine, because you couldn't find the answers. If we kept you longer, that's a good indication. Ultimately, I think that reducing the time, conditioned on the fact that you solved your problem, is the metric. But this is where we get to the OEC. So, a lot of the time people will say, "Oh, my OEC is simple, I want to make more money," for example. And then the challenge is, well, remember, you're running an experiment for a relatively short duration. You're probably thinking, "I want to make money in the long term." Some of those decisions conflict. For example, you're able to make more money in the short term if you raise prices, if you put up more ads, but these are going to cause more abandonment. Over time, you will lose the users. So, the metric has to be more complicated than just make more money. In this example about Bing the search engine, or Google for that matter, it's the same thing, there is a decision of how many ads to put on the page. The OEC that we ended up with is we want to make more money, but we also restrict the ads to a certain proportion of the page on average. So, now, it becomes a constrained optimization problem. You can choose whether to show zero or one or two or three ads, but on average, this is the number of pixels that we give you. So, now, it's a much better-defined problem, because now they have to choose: do we show an ad for this query? If we show one, how many do we show, knowing, and we showed this, that if we show too many ads, people start to abandon? The OEC has this unique property that, one, it has to be measurable in a relatively short term, a couple of weeks is the typical experiment, but you must believe that it will cause users to move your key business metrics in the long term. So, there are multiple examples of this. YouTube started out declaring minutes as their top-level metric, but then they refined it. They wanted to say, "Well, it has to be minutes on good content, that people don't abandon quickly, that is not abusive."
So over time, you begin to refine that OEC and come up with other metrics. We looked at clicks on algorithmic results on a search engine, but then you want to say, "Well, it's got to be a click where they don't come back too quickly, so it wasn't clickbait." So, you refine the OEC to come up with these metrics that help you understand what is good behavior for the long term. So, the OEC is a very important concept, and getting alignment on it is very important. It's also very hard. In every org that we onboarded, there was a long discussion with the business for us to understand it and to come up with those initial metrics, realizing that they will evolve over time as you start to understand where the initial metric is off. It could be the vector is correct, I want people to spend more time on my site if it's a browsing site, but I also want the content to be not clickbait. So, I have to establish some quality criteria.
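Purely as an illustration of what a constrained OEC might look like in code, here is a minimal sketch in Python; the metric names, weights, and the ad-coverage cap are invented assumptions, not the OEC Ronny used at Bing:

```python
from typing import Dict

# Hypothetical weights on short-term metric deltas believed to predict long-term value.
WEIGHTS = {"sessions_per_user": 0.4, "query_success_rate": 0.4, "revenue_per_user": 0.2}
MAX_AD_COVERAGE = 0.15  # hypothetical cap: ads may not exceed 15% of the page on average

def oec(metric_deltas: Dict[str, float], avg_ad_coverage: float) -> float:
    """Score a treatment scorecard: weighted sum of relative metric deltas,
    rejected outright if it violates the ad-coverage constraint."""
    if avg_ad_coverage > MAX_AD_COVERAGE:
        return float("-inf")  # constraint violated: do not ship, whatever the revenue lift
    return sum(WEIGHTS[m] * delta for m, delta in metric_deltas.items())

# A treatment that gains revenue but stays within the ad-space constraint scores positively.
print(oec({"sessions_per_user": 0.01, "query_success_rate": 0.005,
           "revenue_per_user": 0.03}, avg_ad_coverage=0.12))
```

The point of the sketch is the shape, not the numbers: a single scalar the whole org agrees to optimize, measurable within a two-week experiment, with hard constraints standing in for the long-term harms that short-term revenue can mask.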
Chris Goward:
Yeah, it reminds me of an experiment. It wasn't too long ago that we ran it for HP Instant Ink, within the printer setup portal, trying to get people to sign up for the Instant Ink service. We had designed a pretty dramatic redesign test, using some choice-architecture type of behavioral design. We ran an experiment and found that it more than doubled the signup rate for the Instant Ink service. So, that might be something that would fall under Twyman's law at first. You're thinking, "Well, hang on, how did we double the signup rate from this new design?" So, we looked into the instrumentation. Everything looked right. We looked into the revenue. Yeah, money was flowing in. It was actually pretty remarkable. But then we said, "Let's just double check with customer support and see if any of this new cohort was having any different types of feedback." It turned out that yeah, given a few more days or weeks, they started getting charged for things they didn't expect. What happened was that the new design, although it was more effective at getting people to sign up, was also causing some confusion about what they were signing up for. So, people didn't quite have as much clarity.
Ronny Kohavi:
Cancellation rates may go up, and you have to make the trade-off.
Chris Goward:
Dissatisfaction, more people calling service saying, “Hang on, I didn’t realize what I’d signed up for. Now I’m getting these charges and stuff like that.” So, that’s I guess what we’re referring to as a guardrail metric of saying like, “Okay, let’s make sure that the success that we’re having in a test doesn’t have unintended negative effects.”
Ronny Kohavi:
Yeah, you’re right. I want to continue to mention the OEC and then you mentioned guardrails. I think coming up with organizational guardrail metrics that say, “Look, if you violate this guardrail metric, hold on, because the OEC may not reflect some of the things that you may have caused.” So, it may be things like tickets. It may be just crash rates. You may have done something that causes users to crash more often you’re not realizing it. That may have some unintended consequences like they now have more sessions. Sessions is a great metric. It indicates that people are coming more often, but you know what? If you’re causing more crashes, people have more sessions. So, these are the things that protect you.
Now, as part of the guardrails, we want to make sure that you check this SRM test, the sample ratio mismatch that we discussed. Performance is another amazing example of something that we always wanted to make sure is a guardrail. The evidence is just very conclusive that performance is critical. When we ran a slowdown experiment at Bing… There's a chapter in the book about this. … the effect was so much bigger than people expected that we had a whole team funded to try and improve the performance. We came up with this statement that if you're able to improve the performance by 10 milliseconds, you can fund an engineer for a year. Ten milliseconds is faster than your eye blinks. 150 to 400 milliseconds is the typical range at which people can perceive something. This was so much smaller, but the effect was so large that coming up with this simple statement resonated really, really well. Somebody came to us and said, "I got this idea and I'm moving some metric, but I'm slowing the page by 30 milliseconds." I was like, "Okay, you're willing to give up three people on your team at this point? How often do you want to ship that?" It was interesting that over time, as we became faster, people said, "Well, is that study still relevant?" Now, we got Bing to a point where the server was returning the results in under a second. People said, "The result can't hold at this point." We reran the experiment. We were surprised. It was now four milliseconds, because Bing [inaudible] so much at the time, that four milliseconds now funded an engineer for a year. It still mattered, still getting that down. It's because of a lot of things that happen that people don't realize. Some people are on slower bandwidth. They may be on another continent. Even a small slowdown may cause some failures and retries. So, performance is much more impactful than people think. Where the performance is, is also very, very important. So, the ability to give an early flash of the results, to show that something is coming, is much more important than if you refresh something on the page or lazy load some component three seconds later.
Chris Goward:
Right. Okay. So, yeah, there’s a lot to think about when setting up the experimentation program, like not only how you design experiments or how you come up with ideas, but how you implement, how you measure, that the instrumentation is done properly, that you’re not slowing down. I guess the other part of instrumentation is also that the instrumentation itself can slow down the experience. So, if you’re running tests and it creates that flicker or slowdown in itself, you can actually have a cost for running the experiment that you have to consider. So, yes, the executive thinking about leaving these things, I guess it sounds like they should just make sure that whoever is leading their experimentation program should have read your book to make sure that they’ve checked off all of these boxes.
Ronny Kohavi:
Like you said it, yup.
Chris Goward:
Yeah, absolutely. We’re all big fans of the book here, for sure. A lot of good principles in there but making sure that you’re trustworthy in your experimentation. So, taking a step back then and thinking about where the ideas come from, in your time of running all of these thousands of experiments, I’m curious where you found the best areas for… I think you mentioned a little bit earlier about where hypotheses come from. Do you find trends in where the best ideas tend to come from or the kinds of product managers or owners that tend to have the best ideas and what their disciplines are for finding those?
Ronny Kohavi:
Yeah, so I’ll say something, which I hope is not too controversial, but usually the best ideas come from seeing your competitors launch something if they’re data driven. So, Bing learned a lot from Google. Google learned a lot from Bing. We were both running a lot of experiments. When something launched and I said, “Wow, this is a useful feature,” a lot of the times when we evaluated it, we saw that it was useful to our users too. Not always, but certainly, I would say, that’s one source. Being aware of what’s happening in the industry, being aware of some trends, that’s certainly a good overall scheme to generate ideas. I’ll give you a very specific example. Back in 2009, there was a person at MSN that suggested that we open the Hotmail link in a new window or tab. It looked like this. Why would we do this? So, he ran the experiment. It was wildly successful. We could not imagine the value of that simple idea, which is a one line change, right? There was a huge debate that this is not industry standard and some users are complaining. But ultimately, it took us multiple years to actually learn what are good scenarios for opening in a new tab or window. Normally, if you take somebody to another place, the good example is MSN. It’s a portal. If you open your mail and you open it in a new window or tab, when you are done and you close that tab or window, you go back to the MSN homepage. We may actually show you some interesting news. So, we learned something, and that became a very interesting industry standard. Lots of companies, Facebook and Twitter and others did the same thing over time. So, these are trends in the industry that just happen. These are ideas that are generally when you see a lot of people do that, it’s great. Autocomplete, another example, in search. Autocomplete is known to be such a useful feature. As user start typing, see if you can guess and save them keystrokes. You can show common queries that other people have typed. Very, very useful. So, that’s one area. I think, as you asked about the typical people, the program managers that tend to germinate the best ideas live in the data. They’re good enough to be able to either on themselves or pair up with some data scientists, but they understand the user base and what they’re doing. How long does it take people to do something efficient, effective on our site? What is the time to book? What is the time to buy? What is the time to have a successful query where you don’t come back from, because you have answered your query? Playing with that leads to very, very interesting…I’ll give you one funny example. We started at Bing doing what we call instant answers, which is when you type some query, the answer is there. You don’t need to click, right? So, you ask, “What time is it now in the UK?” We give you the answer. One of the classical instant answers was a calculator. You type 178 times 256, boom, you get the answer right there. I was looking at, “Which of our instant answers are failing?” and instant answers that’s failing is one where you actually have users clicking on links below the instant answer. So, we didn’t do a good job, because the answer wasn’t there as it’s supposed to be. I was shocked to see that the calculator answer is terrible when people type 5 over 3. I was like, “Why is that?” I remember, looking at that and then realizing that Five Thirds is actually the name of a bank. You don’t want to hide the banks name in there. People are looking for the bank. 
What’s the chance that people are looking for 5 over 3? No, they’re probably looking for the bank. So, it’s a funny anecdote, but those are the things that I think lead to better hypotheses, is to look at failure cases. Okay, where users are trying to do something and we’re not giving them the answer and it takes them too long or they have to look through a lot of pages to get to what they want to do. So, people are able to generate hypotheses based on this real data, based on actual user behavior typically generate good ideas.
Chris Goward:
Right. Okay, so being connected with the customer through the data, seeing customer behavior, and then being curious about evaluating failures or things that are going wrong, I absolutely agree with that one. A lot of people in the early stages of their maturity in experimentation are aiming for the wins. That's all they care about. When a loss happens, they almost want to ignore it and just move on to the next potential winner, the next test idea, when there's so much value to be gained in really analyzing the results. Because once you've got an experiment result, no matter what the outcome is, if you've designed the experiment properly, there should be something interesting in there to learn.
Ronny Kohavi:
Right. So, learning from failures is very important. When Thomas Edison tested filaments for light bulbs, he said, "I've tried 1,600 filaments, and I know what doesn't work." You run families of things. The same thing with experiments: if you tried an idea in a given area and it fails, it's always, "Do we iterate more? Do we learn from this, or do we move on to another area?" That's the million-dollar decision. That's usually a very hard one.
Chris Goward:
Okay. So, this leads us right into thinking about your career overall. Learning from success isn't the only way; you learn from successes what works. What about the other side? Have you had failures in your career, or has it just been a long string of successes?
Ronny Kohavi:
No, no, no. Everybody has failures in their careers, and I've had my share of failures. I think one of the things I've learned is that many times, if you're pitching an idea, it's best to learn what the other party is looking for and couch it in things that matter to them. I'll give you one example of that: we built experimentation at Microsoft. The big break was initially MSN. The big breakthrough was at Bing, when we scaled the platform a lot. When we went into Office, the pitch that seemed to resonate better was not so much that this will allow you to innovate, because they were sure they were the best innovation machine on earth. They'd been doing this and they were the cash cow. It was more of, "We will give you a mechanism so that when you deploy, we can assure that you didn't hurt anything." They were really afraid that changes they were making to the product were hurting, because they have a large user base. For them, detecting bugs early was critical, so as I talked about earlier, this ability to do near real-time detection. So, just the ability to do what are called safe deploys: here's a deploy, we're going to deploy it, and we'll be able to tell you relatively quickly, in a matter of half an hour, an hour, two hours, if there's something egregiously bad about it. And then if there is, abort. They had their own terminology. They were building such mechanisms. Our pitch was to say, "Look, we're not just going to watch the metrics over time the way you do." The classical thing is organizations build these time series graphs, and they look at performance or other things over time. And then when there's a blip, somebody goes, "Hey, something is wrong with performance." There's a better way, which is to use the scientific method of controlled experiments. With a controlled experiment, you're able to use the statistics to say, "Look, this is not just a blip because some event happened and everybody's listening to the news. Both control and treatment are running, and they were impacted differently." So, that ability to alert was the way to sell Office on controlled experiments. That's how we got our beachhead into them. They started to build the infrastructure for deployments using controlled experiments. Over time, they saw more and more of the value of this as a way to test hypotheses and features and other things.
Chris Goward:
Okay, so starting small with what you are able to get access to and influence.
Ronny Kohavi:
And impact what they care about, right? They didn't think we were going to help them with ideas, because theirs are so great, but they cared about reducing the risk, reducing their outages, and reducing bad deploys. That's when we were able to couch the whole idea of a controlled experiment in their terms, in a way that they immediately see value. That then would be a win, and would lead over time to this recognition of the infrastructure that allows them to use this at a broader scale. So, that beachhead success is very, very important.
Chris Goward:
Okay, yeah, great. So, then let’s think about the listener who is working in a large enterprise, I think that’s a really good tip. That’s good advice for how to maybe make change or overcome some objections. So, someone who’s maybe a manager, director, in a large enterprise, they want to become this intrapreneur change maker. They have a vision for a change they can make. So, they want to become Ronny, what would you advise them or tell them that you wish you had known early on? What can we learn from your experience?
Ronny Kohavi:
I already mentioned couching whatever you're pushing in terms that matter to them in the near term. I think another one is you need some level of executive support. In some sense, I got lucky with Bing, having two executives out of the several there that supported and understood experimentation. One was Harry Shum, the head of the development organization, who was enough of a scientist to know the value. He was a great supporter. The other one was Qi Lu. He came from Yahoo. He was doing experiments at Yahoo. He saw the value, he supported this. That allowed us to scale the platform. Coming up with this OEC, asking people, "What are you optimizing for?", that's not an easy question. If the executives aren't on board, thinking about it and optimizing for it, then you get this disconnect. So, that's what I would advise. Try to find those supporters in the org, and they have to be high enough, so that they talk about this, they talk about the successes. You educate them in a way that they can then share some of the successes and failures. No less important is to also share that, "We did something. We thought it was going to work." It's okay to fail. I mean, as I said, this is one of the most humbling things about working with controlled experiments, but it's also a way of thinking about product development, which is that not all ideas are going to be good. Change the culture, so that we don't have a timeline where we deliver something that is clearly going to work for users, because we don't know.
Start to evaluate projects not based on shipping, but based on their value to users.
This is another example of a fundamental change that happened in Microsoft. When I joined Microsoft, Microsoft gave out these ship-it awards, right? I've got one in my room here. It looks like this. You can see it, but I'll read what it says because I think it's important. It says, "Every time a product ships, it takes us one step closer to the vision. Empower people through great software anytime, anyplace, any device." But that's not true, right? We now know that any time a product ships, maybe you have a 30% chance of it taking you one step closer to the goal.
Chris Goward:
Right, or maybe it’s taking us backwards if you’re-
Ronny Kohavi:
Yes, exactly. So, the whole idea of the activity of shipping has to change from, "We shipped, let's celebrate," to shipping being the step that allows you to evaluate what this does to key metrics. You celebrate when the metrics move, right? It's like a stock: you don't celebrate the transaction when you buy or sell, you have to see what happened to it over time. I think that change… and Satya made that change, I think it was about six months after he became CEO, he stopped these ship-it awards. A very interesting message to the org: "We don't care about shipping as much as actually moving the metrics." That's a nice cultural change.
Chris Goward:
Yeah. Okay, great. I think that’s a good thought to end on is thinking about the impact you’re having in the outcomes, not just the activity. Great. Well, thank you for your time, Ronny. That’s been great. For the listeners, definitely, again, recommend picking up the book, Trustworthy Online Controlled Experiments. Lovely to have this conversation and looking forward to the next time. Thanks, Ronny.
Ronny Kohavi:
Chris, thank you very much.
This show was made possible by Widerfunnel, the company that designs digital experiences that work for enterprise brands proven through experimentation. For more information, visit widerfunnel.com/tellmemore. That’s W-I-D-E-R-F-U-N-N-E-L.com/tellmemore.