A new way to analyze data in the cloud

August 4, 2013 at 3:56pm
David Crawford

Home

## SQL Case Study: Removing bias from customer analysis

A/B tests are a powerful way of deciding whether to make a change to your product, marketing, or sales, and there’s a wealth of resources available to help you run and analyze them. A/B tests focus an experiment on a single change, and make sure that other factors (what your users ate for breakfast, whether it was raining outside, etc.) don’t impact the experiment’s result beyond a measurable confidence level.

As analysts, our lives would be simple (but boring) if every analysis was a randomized experiment. But we can’t run an experiment for every decision. Running experiments can be very expensive. It’s easy when you’re testing changing the background color of your landing page, but what if you want to know whether prioritizing one feature will have an impact on 1 month retention? You’ll need to wait at least a month before you get results. Experiments on conversions that are high value but low frequency take a long time to reach statistical significance. Not only might you have to wait a while, but you might have to put extensive resources into some tests. If you want to know the effect of a new feature you have to build the new feature.

In reality, we’re often trying to interpret trends in historical data collected during the course of business. Trends are useful for alerting us to changes (everybody loves a dashboard), but they have nothing to say about causality. If we’re trying to use trends to make decisions, we need to get around the fact that our historical data is riddled with bias.

## Bias

When you’re looking for causation, you want to know if a particular “treatment” has a particular “effect.” The effect is measured by comparing to a “control.” Bias is when another variable is correlated with who received the treatment. If that variable has an impact on the effect, than we could observe its effect in the effect we’re measuring from the treatment.

Here’s an example: between sending your first marketing email and your second, your subscriber base changes. Was the second email more effective because you had a better subject line? Or was it because you had more receptive subscribers (a bunch of interested new readers signed up)?

Email 1 Email 2
13%
open rate
29%
open rate

The bad news is, there’s always bias. The good news is, if you can measure the bias, you can adjust for it. In this article, I’ll cover the example of determining what product feature is driving retention, and I’ll include full SQL examples for doing the analysis. I’ve hosted the sample data set on a MetricaDB, and you can click through from any of the embedded examples below to try running your own queries against it.

Before we get too far along, I want to point out a couple of caveats. I’m going to show how to adjust for known bias, but that doesn’t mean you can ignore “hidden bias”, or other aspects of your sample that you can’t measure. We have to assume that there are hidden factors in our data that influence our results. In addition, the more we look for correlations in a set of data, the more likely we’re going to find a spurious one. These sorts of analyses are very valuable, but are best when followed up by an experiment to confirm (or contradict) the results.

## The Scenario: Which feature increases retention?

Your product has a couple of major features, and you want to know which one is most important in driving retention. Once you know which feature is making users stick around, you can focus your marketing message on that feature, and get more people to love your product.

Your new marketing slogan?

Let’s imagine you’ve got a recipe site, and you want to decide which of three features to make prominent: recipe search, cooking videos, or forums.

The bias comes from the fact that there are other factors which influence retention. One big influence can be the way the user was introduced to the product, whether through facebook, an email campaign, an article, a referral, or some other source. Some sources of traffic will bring highly qualified users who are likely to stick with the product. Others bring people who are just curious, don’t have a need for the product, and likely won’t stay long.

So we need to remove the effect of the marketing source to get a clearer picture of each feature’s impact.

To start you need to wrangle your data into a table which has all the relevant metrics for each user. We’ll assume you can create a table like this:

create table user_retention (
user_id int,
marketing_source varchar,
used_search int,
used_videos int,
used_forums int,
retention_time float
);


We’ll represent the used_* columns as either 1 or 0. A 1 could mean any use of the feature, or you could determine thresholds of use that are relevant (e.g., if the user has watched more than 2 videos, then used_videos is 1). To simplify the SQL, you might create a view or put this data in a temporary table.

Now it’s tempting to do three queries:

If only it were so simple, but we know the marketing source may be skewing our data. To adjust, we’ll borrow a method used heavily by epidemiologists: stratification. Stratification is a technique that splits subjects up into groups (strata) that are similar. We’ll examine each feature one at a time. For each feature, we’ll stratify users by their marketing source, compare results for those who used or did not use the feature, and then recombine the results.

### Stratify users

We want to group similar users together into strata so that we can observe the effects of using a feature among users who all come from the same marketing source. We’ll separate users who used the feature from those who didn’t and compare the retention times between the two.

### Find effect size from each strata

Once we split the users into strata, we determine the effect of using search on each strata. Our effect is the difference between the mean retention time of users who used search and the mean retention time of those who didn’t. We’ll call that difference the ‘effect size’.

### Combine results to get an overall effect

To get the adjusted effect size, we then average the effect size from all strata, weighted by the number of users in each strata. If we have $$S$$ as the number of strata, $$N$$ as the total number of users, and $$n_s$$ as the number of users in each strata, then the weighted average looks like this:

$$\text{effect size} = \sum_{s=1}^{S} \frac{n_s}{N} \left\{\text{avg retention for search users} - \text{avg retention for non search users} \right\}$$

And here’s the full SQL query:

We do this once for each feature, and we get the relative effect size. The largest result is the feature with the most impact on retention. Here’s are the final results, before and after adjustment

SearchVideosForums

## Next steps

We’ve adjusted for only one factor here (marketing source), but I’m sure there are more you can think of. Luckily it’s easy to adjust for more than one other variable. Simply add another dimension to the table and group by that dimension when creating your strata.

When looking at marketing source, we had a pretty small set of values, and therefore a small number of strata. If you want to stratify on a continuous value (such as age or number of friends), you will probably need to bucket the values in order to get enough users in each strata.

This technique applies to many more cases than just retention analysis, so I hope you’ve got a few ideas for ways to improve your own analysis. If you have any questions or suggestions, drop me a note in the comments.

We’re building a new platform for analysts, starting with SQL analytics on cloud data such as Salesforce and Google Analytics. Please sign up at metricadb.com if you’re interested. Thanks!

### Notes

1. danielpearson reblogged this from metricadb
2. metricadb posted this