Research: When A/B Testing Doesn’t Tell You the Whole Story

Every year, marketers spend billions of dollars on campaigns meant to attract, retain, and upsell customers. Yet despite this massive investment, it can be extremely challenging to determine how effective these initiatives actually are, and how they can be improved. One common method of measuring a campaign’s Return on Investment (ROI) is to run an A/B test: Marketers will target customers with two different interventions, and then compare results between the two groups. With the right approach to analysis, these A/B tests can provide useful insights — but they also have the potential to be highly misleading.

To understand the shortcomings of how A/B tests are often employed, it’s helpful to consider a hypothetical example. Imagine you work for a large arts organization that is concerned about declining retention rates among its members. You’re thinking about sending a small gift along with the renewal notification to the members you’ve determined are at a higher risk of canceling their memberships, but since that comes at a cost, you want to make sure the intervention is effective before rolling it out more broadly. So you decide to run a small pilot campaign, randomly choosing one group of “at risk” members to receive a gift and one not to, in order to see if those who receive the gift are more likely to renew.

Now, say you don’t find any difference in retention rates between members who receive the gift and those in the control group. If you ended your analysis there, you would likely cancel the gift program, since the data seems to suggest that sending gifts has no impact on retention. But upon closer examination of the data, you might find that for a certain subgroup of customers, such as those who had visited the venue in the last year, the gift did in fact significantly increase their chances of renewing. For customers who had not visited the venue in a long time, the gift actually made them less likely to renew, perhaps because it served as a more salient reminder of how infrequently they had been using their membership. Averaged together, the two opposing effects cancel out. Using an A/B test to evaluate only the average effect of an intervention can cover up important insights around which customers are likely to be more or less receptive to that campaign (whether the analysis suggests the intervention has a positive, negative, or, as in this example, an insignificant effect), leading marketers to make the wrong decisions around which campaigns to run with which customers.
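To make the arithmetic behind this example concrete, here is a minimal sketch in Python. The counts are invented for illustration and do not come from the organization described; the segment names and the `lift` helper are likewise hypothetical. It shows how two opposite subgroup effects can add up to an average effect of zero.

```python
# Hypothetical pilot results: (members who renewed, members in the arm).
# All counts are invented for illustration only.
pilot = {
    "visited_last_year": {"gift": (180, 250), "control": (150, 250)},
    "not_visited":       {"gift": (70, 250),  "control": (100, 250)},
}

def lift(arms):
    """Renewal rate in the gift arm minus renewal rate in the control arm."""
    gift_renewed, gift_total = arms["gift"]
    ctrl_renewed, ctrl_total = arms["control"]
    return gift_renewed / gift_total - ctrl_renewed / ctrl_total

def pooled(segments):
    """Add up the counts across segments, as an average-only analysis would."""
    totals = {"gift": [0, 0], "control": [0, 0]}
    for arms in segments.values():
        for arm, (renewed, total) in arms.items():
            totals[arm][0] += renewed
            totals[arm][1] += total
    return {arm: tuple(counts) for arm, counts in totals.items()}

overall_lift = lift(pooled(pilot))
segment_lifts = {segment: lift(arms) for segment, arms in pilot.items()}

print(f"overall lift: {overall_lift:+.2f}")
for segment, value in segment_lifts.items():
    print(f"{segment}: {value:+.2f}")
```

In this made-up pilot the pooled comparison shows no difference between the arms, while the per-segment comparison shows a gain for recent visitors and an equal loss for lapsed ones. A real analysis would also need to check that each subgroup difference is statistically reliable before acting on it.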

Optimizing Churn Prevention Campaigns

This isn’t just hypothetical — in fact, this example is based on a real organization I worked with as part of my research. When it comes to increasing retention, companies typically identify “high risk” customers — that is, customers whose recent behavior or other characteristics suggest they are particularly likely to cancel their subscriptions or stop purchasing a company’s product — and then run A/B tests to determine if their retention campaigns will be effective with this group. While this is an understandable strategy (certainly you don’t want to waste marketing resources on customers who weren’t going to churn anyway), my research suggests that it can seriously backfire, as it can lead marketers to make flawed decisions that actually reduce overall retention rates and ROI on marketing spend.

Specifically, I conducted field experiments with two large companies that were implementing retention campaigns. In the first part of my study, both companies developed churn reduction interventions and then ran A/B tests tracking churn rates for more than 14,000 customers in total, in which one randomly assigned group of customers received the interventions and the other did not. Next, I collected a rich dataset of customer information, including recent activity and engagement with the company, tenure as a customer, location, and other metrics that were used to predict churn risk, and examined which of these characteristics correlated with a positive response to the retention campaigns.

Across both companies, I found that the customers who had been identified as having the highest risk of churning were not necessarily the best targets for the retention programs — in fact, there was little correlation between customers’ churn risk level and their sensitivity to the interventions. The data showed that there was a distinct group of customers who responded strongly to each intervention (customers with particular behavioral or demographic characteristics that consistently correlated with being much less likely to churn after receiving the interventions), but that “high-sensitivity” group had almost no overlap with the people identified as “high churn risk.” And this had serious implications for ROI: My analysis found that if the two companies were to spend the same amount of marketing budget targeting the high-sensitivity group rather than the high-churn-risk group, it would reduce their churn rates by an additional 5% and 8% respectively.
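The difference between the two targeting rules can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the six customers, their risk scores, their estimated sensitivities, and the budget of two contacts are invented, and the sketch is not the analysis used in the study.

```python
from typing import NamedTuple

class Customer(NamedTuple):
    name: str
    churn_risk: float   # predicted probability of churning without the campaign
    sensitivity: float  # estimated drop in churn probability if targeted

# Hypothetical customers; every number is invented for illustration.
customers = [
    Customer("a", 0.80, 0.01),
    Customer("b", 0.75, -0.02),  # the campaign backfires for this customer
    Customer("c", 0.60, 0.02),
    Customer("d", 0.30, 0.15),
    Customer("e", 0.25, 0.12),
    Customer("f", 0.10, 0.00),
]

BUDGET = 2  # the same number of customers can be contacted under either rule

def top(pool, key, k):
    """Return the k customers with the highest value of the given attribute."""
    return sorted(pool, key=lambda c: getattr(c, key), reverse=True)[:k]

def expected_churn_prevented(targets):
    """Sum of the estimated drops in churn probability across targeted customers."""
    return sum(c.sensitivity for c in targets)

by_risk = top(customers, "churn_risk", BUDGET)
by_sensitivity = top(customers, "sensitivity", BUDGET)

print("risk-based:       ", [c.name for c in by_risk],
      round(expected_churn_prevented(by_risk), 2))
print("sensitivity-based:", [c.name for c in by_sensitivity],
      round(expected_churn_prevented(by_sensitivity), 2))
```

With these invented numbers the two rules pick entirely different customers, and the risk-based rule spends the budget on people the campaign barely moves. How large the gap is in practice depends on how well sensitivity can be estimated from pilot data.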

Of course, the specific factors that make a customer more likely to be receptive to a retention campaign will vary from organization to organization and even from campaign to campaign, but running pilots like the ones described above can help you identify the characteristics that will be the best predictors of your customers’ sensitivity to a specific intervention. For example, one of the organizations in my study was a telecommunications company with access to detailed data on behavioral metrics such as the number of calls customers had made in the last month, the number of texts they sent, gigabytes of data downloaded, and more. For this company, the data showed that how recently a customer had last engaged with the company predicted their level of churn risk, but did not predict their sensitivity to the churn intervention. What did predict sensitivity was their data usage, suggesting that to maximize ROI, the company should consider targeting their retention campaign not at the customers who hadn’t engaged in a long time, but at the customers who used the most data.
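One simple way to look for such predictors in pilot data is to split customers on each characteristic and compare the campaign's effect in the two halves. The Python sketch below does this on simulated data. The simulation is an assumption built to mirror the telecom pattern described above (recency drives risk, data usage drives response); the field names, probabilities, and the `sensitivity_gap` helper are hypothetical and are not taken from the study.

```python
import random
from statistics import median

def simulate_pilot(n=20000, seed=0):
    """Generate hypothetical A/B pilot records; all numbers are invented."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        recency = rng.uniform(0, 90)    # days since last engagement
        data_gb = rng.uniform(0, 20)    # monthly data usage
        treated = rng.random() < 0.5    # random assignment, as in an A/B test
        p_churn = 0.2 + 0.4 * recency / 90   # assumed: risk depends on recency only
        if treated and data_gb > 10:
            p_churn -= 0.15                   # assumed: response depends on data usage only
        records.append({
            "days_since_last_engagement": recency,
            "data_gb": data_gb,
            "treated": treated,
            "churned": rng.random() < p_churn,
        })
    return records

def churn_rate(rows):
    return sum(r["churned"] for r in rows) / len(rows)

def churn_reduction(rows):
    """Control churn rate minus treated churn rate within a set of customers."""
    treated = [r for r in rows if r["treated"]]
    control = [r for r in rows if not r["treated"]]
    return churn_rate(control) - churn_rate(treated)

def sensitivity_gap(records, feature):
    """Churn reduction among above-median customers minus below-median customers."""
    cutoff = median(r[feature] for r in records)
    high = [r for r in records if r[feature] > cutoff]
    low = [r for r in records if r[feature] <= cutoff]
    return churn_reduction(high) - churn_reduction(low)

records = simulate_pilot()
gaps = {f: sensitivity_gap(records, f)
        for f in ("days_since_last_engagement", "data_gb")}
for feature, gap in gaps.items():
    print(f"{feature}: gap in churn reduction = {gap:+.3f}")
```

A characteristic with a gap near zero tells you little about who responds, even if it predicts churn risk well, while a large gap flags a candidate targeting variable. A median split is the crudest version of this idea. With many characteristics, a model that estimates each customer's response directly would be the more usual choice.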

Moving from Prediction to Prescription

So what does this mean for marketers? The key insight is that marketing interventions should be targeted based on each customer’s expected response to that intervention, not on what customers are expected to do in the absence of that intervention. In a sense, marketers are like doctors: Doctors don’t just give random treatments to the patients who are most likely to die — they prescribe specific treatments to the patients who are most likely to respond positively to those treatments.

Rather than trying to predict what customers will do (i.e., trying to determine their risk of churning), marketers should focus on how different types of customers will respond to particular campaigns, and then design campaigns that are most likely to be effective at reducing churn among a given group of customers. Companies should leverage A/B test data not simply to attempt to measure the overall effectiveness of a campaign among all customers, but to explore which types of customers will be most sensitive to certain interventions. That means combining customers’ historical transaction and demographic data with the data collected through A/B tests to identify the behaviors and traits that make a customer most likely to respond to a particular intervention. Luckily, many companies already collect all this data — it’s merely a matter of leveraging it in a new way.

***

The concept of targeted marketing campaigns is nothing new — but it’s critical to think carefully about how you’re making those targeting decisions. Rather than just guessing about what factors might indicate that someone is a strong target, or focusing on a group that’s been deemed high priority (such as high churn risk customers), firms should target the customers who will be the most sensitive to the specific intervention they’re implementing. To maximize ROI, marketers need to stop asking, “Is this intervention effective?” and start asking, “For whom is this intervention most effective?” — and then target their campaigns accordingly.

Source: HBR