A new way to analyze data in the cloud

August 9, 2013 at 2:20pm
David Crawford

0 notes

## Analytics Roundup #3

Here’s the latest roundup.

Going for it with analytics (davidstarke.com)
What can we learn about successful analytics from a high school football coach from Arkansas? Seems like a lot. I especially love this concisely worded takeaway from Starke on the value of experimentation: “The process and tools of analytics are much better at giving you feedback on things that you try than they are at explaining what might happen for things that you haven’t tried.”
Rubrics and the bimodality of 1-5 ratings (zedshaw.com)
A great summary of the full thought process behind the rating system on a (now defunct) guitar competition website. The author discusses trying to improve the quality of the scoring by suggesting a rubric, using standard deviation to describe the summary rating, and give code examples for calculating summary statistics from a stream of data. The blog post seems to have been lost in a website redesign, but I tracked it down with the wayback machine.
Kartlytics: Applying Big Data Analytics to Mario Kart (joyent.com)
From the overengineering department, we have Joyent’s Dave Pacheco describing what looks like about 5 months of work to create an analytics system for the Mario Kart games played in his office. Full source code is available for the computer vision work done to extract race positions over time from a video of the match.
Data visualization: A view of every Points of View column (blogs.nature.com)
Nature Methods, a meta-journal from the preeminent scientific publication Nature, has just opened up a great new resource to the public. This blog post is a guide to the Points of View column from the journal, providing practical advice on data visualization from the top practitioners in the life sciences, available for free on their website.
Want to get these in your email? We’re starting a custom email list just for analytics roundup posts. No spam, no solicitation, just a handful of top-quality analytics links designed to make you a better inquisitor, delivered roughly once a week.

August 4, 2013 at 3:56pm
David Crawford

1 note

## SQL Case Study: Removing bias from customer analysis

A/B tests are a powerful way of deciding whether to make a change to your product, marketing, or sales, and there’s a wealth of resources available to help you run and analyze them. A/B tests focus an experiment on a single change, and make sure that other factors (what your users ate for breakfast, whether it was raining outside, etc.) don’t impact the experiment’s result beyond a measurable confidence level.

As analysts, our lives would be simple (but boring) if every analysis was a randomized experiment. But we can’t run an experiment for every decision. Running experiments can be very expensive. It’s easy when you’re testing changing the background color of your landing page, but what if you want to know whether prioritizing one feature will have an impact on 1 month retention? You’ll need to wait at least a month before you get results. Experiments on conversions that are high value but low frequency take a long time to reach statistical significance. Not only might you have to wait a while, but you might have to put extensive resources into some tests. If you want to know the effect of a new feature you have to build the new feature.

In reality, we’re often trying to interpret trends in historical data collected during the course of business. Trends are useful for alerting us to changes (everybody loves a dashboard), but they have nothing to say about causality. If we’re trying to use trends to make decisions, we need to get around the fact that our historical data is riddled with bias.

## Bias

When you’re looking for causation, you want to know if a particular “treatment” has a particular “effect.” The effect is measured by comparing to a “control.” Bias is when another variable is correlated with who received the treatment. If that variable has an impact on the effect, than we could observe its effect in the effect we’re measuring from the treatment.

Here’s an example: between sending your first marketing email and your second, your subscriber base changes. Was the second email more effective because you had a better subject line? Or was it because you had more receptive subscribers (a bunch of interested new readers signed up)?

Email 1 Email 2
13%
open rate
29%
open rate

The bad news is, there’s always bias. The good news is, if you can measure the bias, you can adjust for it. In this article, I’ll cover the example of determining what product feature is driving retention, and I’ll include full SQL examples for doing the analysis. I’ve hosted the sample data set on a MetricaDB, and you can click through from any of the embedded examples below to try running your own queries against it.

Before we get too far along, I want to point out a couple of caveats. I’m going to show how to adjust for known bias, but that doesn’t mean you can ignore “hidden bias”, or other aspects of your sample that you can’t measure. We have to assume that there are hidden factors in our data that influence our results. In addition, the more we look for correlations in a set of data, the more likely we’re going to find a spurious one. These sorts of analyses are very valuable, but are best when followed up by an experiment to confirm (or contradict) the results.

## The Scenario: Which feature increases retention?

Your product has a couple of major features, and you want to know which one is most important in driving retention. Once you know which feature is making users stick around, you can focus your marketing message on that feature, and get more people to love your product.

Let’s imagine you’ve got a recipe site, and you want to decide which of three features to make prominent: recipe search, cooking videos, or forums.

The bias comes from the fact that there are other factors which influence retention. One big influence can be the way the user was introduced to the product, whether through facebook, an email campaign, an article, a referral, or some other source. Some sources of traffic will bring highly qualified users who are likely to stick with the product. Others bring people who are just curious, don’t have a need for the product, and likely won’t stay long.

So we need to remove the effect of the marketing source to get a clearer picture of each feature’s impact.

To start you need to wrangle your data into a table which has all the relevant metrics for each user. We’ll assume you can create a table like this:

create table user_retention (
user_id int,
marketing_source varchar,
used_search int,
used_videos int,
used_forums int,
retention_time float
);


We’ll represent the used_* columns as either 1 or 0. A 1 could mean any use of the feature, or you could determine thresholds of use that are relevant (e.g., if the user has watched more than 2 videos, then used_videos is 1). To simplify the SQL, you might create a view or put this data in a temporary table.

Now it’s tempting to do three queries:

If only it were so simple, but we know the marketing source may be skewing our data. To adjust, we’ll borrow a method used heavily by epidemiologists: stratification. Stratification is a technique that splits subjects up into groups (strata) that are similar. We’ll examine each feature one at a time. For each feature, we’ll stratify users by their marketing source, compare results for those who used or did not use the feature, and then recombine the results.

### Stratify users

We want to group similar users together into strata so that we can observe the effects of using a feature among users who all come from the same marketing source. We’ll separate users who used the feature from those who didn’t and compare the retention times between the two.

### Find effect size from each strata

Once we split the users into strata, we determine the effect of using search on each strata. Our effect is the difference between the mean retention time of users who used search and the mean retention time of those who didn’t. We’ll call that difference the ‘effect size’.

### Combine results to get an overall effect

To get the adjusted effect size, we then average the effect size from all strata, weighted by the number of users in each strata. If we have $$S$$ as the number of strata, $$N$$ as the total number of users, and $$n_s$$ as the number of users in each strata, then the weighted average looks like this:

$$\text{effect size} = \sum_{s=1}^{S} \frac{n_s}{N} \left\{\text{avg retention for search users} - \text{avg retention for non search users} \right\}$$

And here’s the full SQL query:

We do this once for each feature, and we get the relative effect size. The largest result is the feature with the most impact on retention. Here’s are the final results, before and after adjustment

SearchVideosForums

## Next steps

We’ve adjusted for only one factor here (marketing source), but I’m sure there are more you can think of. Luckily it’s easy to adjust for more than one other variable. Simply add another dimension to the table and group by that dimension when creating your strata.

When looking at marketing source, we had a pretty small set of values, and therefore a small number of strata. If you want to stratify on a continuous value (such as age or number of friends), you will probably need to bucket the values in order to get enough users in each strata.

This technique applies to many more cases than just retention analysis, so I hope you’ve got a few ideas for ways to improve your own analysis. If you have any questions or suggestions, drop me a note in the comments.

We’re building a new platform for analysts, starting with SQL analytics on cloud data such as Salesforce and Google Analytics. Please sign up at metricadb.com if you’re interested. Thanks!

July 28, 2013 at 5:32pm
David Crawford

0 notes

## Analytics Roundup #2

"Data" is a pretty general term, so many data-focused blogs stretch pretty far in what they cover, from interesting new studies to infrastructure technologies. This blog is about the practice of data analysis, so we curate articles that provide a greater understanding of your data, or actionable explanations of new techniques or tools. Here’s the latest roundup.

Four common statistical misconceptions you should avoid (lifehacker.com)
This article gives simple explanations and clear examples to very common errors in interpreting data, without covering old ground. The Simpson’s paradox is sometimes used in Facebook PM interviews.
Useful Unix commands for data science (gregreda.com)
Covers usage of tools like awk, head & tail, wc, and grep for dealing with data on the command line. Many of these tools have whole books dedicated to using them well, but even learning a couple of specific tricks with them can make you dramatically more efficient. I like to use these tools in combination with OS X’s pbcopy and pbpaste to get stuff in and out of my clipboard.
How not to sort by average rating (evanmiller.org)
An oldie but goodie. Evan Miller covers the challenges of sorting items when some have more data than others. He includes code samples for a solution in both Ruby and SQL. Make sure to also check out the Bayesian approach.
Measuring content page performance (snowplowanalytics.com)
Snowplow analytics provides a wealth of in-depth analytics examples in their Analytics Cookbook. They’re all based on the Snowplow Analytics “page ping” data, but can be easily adapted to work with any sort of event data. They’re full of interesting SQL and visualization examples, and will hopefully inspire some of your own web analytics exploration.

July 21, 2013 at 1:03pm
David Crawford

0 notes

## Analytics Roundup

Some great analytics reads we’ve found around the web:

How to grow your app revenue with DuPont analysis (andrewchen.com)
This post borrows a technique from financial analysts, and shows how it can help you learn where to focus to improve your metrics. The technique involves breaking a target metric (such as ARPU, average revenue per user) into components, to determine which is the most significant driver.
Pivoting in Postgres (craigkerstiens.com)
Craig Kerstiens, product manager for Heroku Postgres, introduces the crosstab function for pivoting data directly in the database. This follow up shows how to do it in plain SQL – a technique I’ve used to great effect many times.
How Aziz Ansari tests jokes with analytics (fastcolabs.com)
On the lighter side, Aziz Ansari polls his fans and creates demographic-based test audiences to see how different “customer segments” respond to his jokes. Data is eating comedy!
There’s more than one kind of data scientist (strata.oreilly.com)
It doesn’t happen enough, but when researchers put themselves under the microscope, the analysis is revealing. This article introduces a free Strata report surveying analytics professionals on their skills and education. Are you really a data scientist, or are you a “Data Businessperson,” “Data Creative,” “Data Developer,” or “Data Researcher”?
That’s all for now. We’ll be back soon with more analytics links from around the web.

July 5, 2013 at 11:19pm
David Crawford

1 note

## Finding the most recent purchase by each customer with SQL

If you have transaction history such as purchases in a database (or really any list of items with timestamps), you frequently want to pick out recent items. SQL makes it easy to find the most recent items in the table:

SELECT * FROM purchases ORDER BY created_at DESC;

id  | customer_id | subtotal | shipping |   tax   |  total   |         created_at
------+-------------+----------+----------+---------+----------+----------------------------
1101 |       83922 |   102.44 |     4.99 |  8.7074 | 116.1374 | 2013-07-05 22:49:09.20922
1051 |       83922 |    51.75 |     4.99 | 4.39875 | 61.13875 | 2013-07-05 22:48:45.353301
1079 |       48582 |    47.02 |     4.99 |  3.9967 |  56.0067 | 2013-07-05 22:48:02.866348
1057 |       48582 |    30.64 |     4.99 |     2.6 |    38.23 | 2013-07-05 22:47:12.535409
1048 |       48582 |    81.72 |     4.99 |    6.95 |    93.66 | 2013-07-05 22:46:18.304818


Or to find the time of the most recent events for each group:

SELECT MAX(created_at) AS most_recent FROM purchases GROUP BY customer_id;

most_recent
----------------------------
2013-07-05 22:48:02.866348
2013-07-05 22:49:09.20922


But sometimes you want to get the full record for each group, not just the timestamp. This is where window functions come in handy again.

Here’s how PostgreSQL’s documentation defines window functions:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

In this case, what we want to do is use a window function to sort each purchase from a customer by its created_at date and add a column with the rank of the purchase by that sort. We’ll use the row_number() PostgreSQL function. When we define the window partition and the sort order, row_number() will return a number starting from and counting up for each partition. (There is actually a function called rank, but if two rows are tied in order it gives them the same rank number, and in this case I only want one purchase per customer, even if two were made at the exact same time.)

Here’s an example. We have a table with some purchases. We add the row number, partitioning by customer_id and ordering by the created_at timestamp:

SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id ORDER BY created_at DESC
)
AS created_order
FROM purchases

id  | customer_id | subtotal | shipping |   tax   |  total   |         created_at         | created_order
------+-------------+----------+----------+---------+----------+----------------------------+---------------
1079 |       48582 |    47.02 |     4.99 |  3.9967 |  56.0067 | 2013-07-05 22:48:02.866348 |             1
1057 |       48582 |    30.64 |     4.99 |     2.6 |    38.23 | 2013-07-05 22:47:12.535409 |             2
1048 |       48582 |    81.72 |     4.99 |    6.95 |    93.66 | 2013-07-05 22:46:18.304818 |             3
1101 |       83922 |   102.44 |     4.99 |  8.7074 | 116.1374 | 2013-07-05 22:49:09.20922  |             1
1051 |       83922 |    51.75 |     4.99 | 4.39875 | 61.13875 | 2013-07-05 22:48:45.353301 |             2

So to get the most recent purchase from each customer, we simply select the rows where created_order equals 1:

WITH ordered_purchases AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id ORDER BY created_at DESC
)
AS created_order
FROM purchases
)
SELECT * FROM ordered_purchases WHERE created_order = 1;

id  | customer_id | subtotal | shipping |  tax   |  total   |         created_at         | created_order
------+-------------+----------+----------+--------+----------+----------------------------+---------------
1079 |       48582 |    47.02 |     4.99 | 3.9967 |  56.0067 | 2013-07-05 22:48:02.866348 |             1
1101 |       83922 |   102.44 |     4.99 | 8.7074 | 116.1374 | 2013-07-05 22:49:09.20922  |             1

And that’s it. Of course you can also do this for any other list where you want to pick out certain rows from each group. Happy querying!

May 27, 2013 at 2:35pm
David Crawford

10 notes
Reblogged from mononcqc

## Secondary Brain Garbage: Interesting bits from "Why Do Computers Stop and What Can Be Done About It?" →

Why Do Computers Stop and What Can Be Done About It? is a technical report by Jim Gray from Tandem Computers, written in 1985. It’s what I’d call “an oldie but a goldie”. Lots of insights from production systems, very likely still applicable today, but mostly forgotten and ignored.

The paper…

April 25, 2013 at 5:41pm
David Crawford

0 notes

## Meet our six Structure 2013 finalists →

Metrica is proud to be a finalist at GigaOM Structure this year on June 19th. Hope you can make it!

April 16, 2013 at 4:12pm
David Crawford

8 notes

## Finding your One Metric That Matters using SQL

Last time, we did a detailed walk-through on how to calculate retention from your user and event tables in SQL. Now we get to the real pay-off—ad-hoc segmentation of your customers based on retention.

Segmentation, in this context, is a grouping of your users in a way that helps you make business decisions. Here are a few examples:

1. You have a B2B product, and different sized companies need different features (security, LDAP authentication, user management, admin control, etc). You want to figure out how your customers and potential customers fall into different groups so that you can fine-tune your marketing copy for each kind of customer. (This is why most B2B companies talk about “solutions” rather than products—one product may be used different ways to solve problems for different kinds of customers.)
2. You have a freemium app and want to find out who your product qualified leads are. This means, you want to know who is most likely to convert to a highly engaged user. There is often an important tipping point, and knowing it helps you focus on getting people to that tipping point, and then upselling when they hit it. Twitter has honed in on 7 days per month—if you use Twitter that much, you are highly likely to stick around for many months. FriendFeed once found that if a user had 5 friends on the site, they’d become an active user. Finding the tipping point is a kind of segmentation that requires exploring your dataset for both the factors and the quantities that predict the user is getting a lot of value.

So lets find out how FriendFeed might have gotten that magic number. How can we find out what minimum number of friends you need before you stick?

Determining what metric leads to retention is difficult. Since most metrics go up with retention, it’s easy to reverse the causality—maybe users just get more friends because they stick around longer. So I’m not going to show a foolproof analysis here, but we can use this problem as an interesting use case to try some cool SQL.

If we’re right that hitting a certain friend threshold leads to high engagement and retention in a social product, then we should see an inflection point after users hit that threshold. High engagement and retention should lead to even higher friend counts. So users who drop out after a few weeks should have no more than X friends, and users who stay longer should have much more than X.

To try to spot this threshold number, we can generate a chart of retention versus friend count for our users.

Let’s imagine our tables are the same as last time, but we now have a friend count field on the users table called ‘friends’:

Users

• id
• sign_up_time
• friends

Events

• user_id
• time

To start, we need to figure out the retention for each user. We want to get the most recent login time, and subtract that from the sign up time.

SELECT
users.id,
-- Find most recent login time and calculate retention
MAX(events.time) - users.sign_up_time as retention_time
FROM users
JOIN events ON users.id = events.user_id
GROUP BY 1;


Pretty straightforward. In PostgreSQL, we can simply subtract two timestamps and we’ll get an “interval”. Here’s what the result looks like:

  id  | retention_time
------+----------------
1074 | 1 day 22:01:14
887 | 34 days 02:23:38
1209 | 14 days 13:33:20
1420 | 89 days 05:03:45
1532 | 87 days 01:16:32
1101 | 4 days 18:44:43
577 | 71 days 09:52:11
...


This looks pretty nice. But we want to plot friend counts against retention_time. Most charting software won’t handle the text description of an interval so well, so we need to get a numeric value into retention_time. In PostgreSQL, we can use the extract function to pull a particular field out of the interval. In this case, we want to look at weeks. PostgreSQL extract doesn’t support weeks, but it does support days:

SELECT
users.id,
-- Find most recent login time and calculate retention in weeks
floor(extract(days from MAX(events.time) - users.sign_up_time) / 7) as retention_weeks
FROM users
JOIN events ON users.id = events.user_id
GROUP BY 1;

id  | retention_weeks
------+-----------------
1074 |               1
887 |               4
1209 |               2
1420 |              12
1532 |              12
1101 |               0
577 |              10
...


In this report, we’re looking at users by how long they were retained. But in the case of new users, we don’t know how long they’re going to stick around. In this query, a user who signed up one week ago will look the same as a user who signed up three months ago but only stayed one week. So we will cut out any user who signed up less than four weeks ago.

SELECT
users.id,
-- Find most recent login time and calculate retention in weeks
floor(extract(days from MAX(events.time) - users.sign_up_time) / 7) as retention_weeks
FROM users
JOIN events
ON users.id = events.user_id
-- Ignore recent sign ups, we don't know if they'll abandon yet
AND users.sign_up_time > now() - interval '4 weeks'
GROUP BY 1;


Now we get to group by friend count. But we can’t do it in this query, since we’re already grouping by users.id to find the most recent login event. We need to use this query as a subquery, so we can aggregate it again:

SELECT
-- Show friend counts by # of weeks retained
retention.retention_weeks,
avg(users.friends) as avg_friends
FROM users
JOIN
(
SELECT
users.id,
-- Find most recent login time and calculate retention in weeks
floor(extract(days from MAX(events.time) - users.sign_up_time) / 7) as retention_weeks
FROM users
JOIN events
ON users.id = events.user_id
-- Ignore recent sign ups, we don't know if they'll abandon yet
AND users.sign_up_time > now() - interval '4 weeks'
GROUP BY 1
) retention
ON users.id = retention.id
GROUP BY 1
ORDER BY 1;


And here’s what the result might look like. Remember, this shows “For users who use the product for X weeks and then abandon, the average number of friends is Y.”

 retention_weeks |     avg_friends
-----------------+---------------------
0 | 1.02144631384730000
1 | 2.48581334349008808
2 | 6.74929485869492110
3 | 3.86893085019385901
4 | 10.1238238084029199
5 | 11.4918409583109385
6 | 45.0391858292459861
...


As we expected (because I’m using fake data), there’s a disconnect between people who stick around for only 3 weeks and those who stick around for 4 or longer. However, there’s one way we could refine this query. We’re looking at the average number of friends for a group, but averages are subject to skew in the case of outliers. Also, we probably are more interested in the top of the range than the middle. One trick to remove the influence of outliers is to look at the 90th percentile of the data.

Percentiles are a great chance to use window functions. Window functions are a really cool part of SQL that as far as I can tell very few people know about. Here’s how the PostgreSQL documentation describes them:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

That’s a little confusing, so here’s a simple example. Take a table of grades with the columns: year, student, gpa. We could use a window function to add a 5th column which shows each student’s rank in their year. Let’s say the table looks like:

SELECT * FROM gpas;

year | student | gpa
----------------------
2012 | Alice   | 4.0
2012 | Bob     | 3.2
2012 | Charlie | 4.0
2012 | Dylan   | 2.6
2011 | Elaine  | 3.2
2011 | Frank   | 3.9
2011 | Greg    | 3.3
2011 | Harriet | 3.7


Then we can find the student rank using a window function. Window functions require an “OVER” clause which defines the window frame, the set of rows that are the context for the function, and an ordering for those rows. PostgreSQL has a window function called rank() which gives you the 1-based rank for the row based on the window frame and ordering.

SELECT year, student, gpa, rank() OVER (PARTITION BY year ORDER BY gpa DESC) AS class_rank FROM gpas;

year | student | gpa | class_rank
-----------------------------------
2012 | Alice   | 4.0 |          1
2012 | Bob     | 3.2 |          3
2012 | Charlie | 4.0 |          1
2012 | Dylan   | 2.6 |          4
2011 | Elaine  | 3.2 |          4
2011 | Frank   | 3.9 |          1
2011 | Greg    | 3.3 |          3
2011 | Harriet | 3.7 |          2


Cool. So here’s how it works for us. We can use a window function to add a new column to our subquery table showing what percentile that user is in based on the user’s friend count. And we can calculate the percentile among users who were retained for the same number of weeks. Here’s the new subquery:

SELECT
users.id,
-- Find most recent login time and calculate retention in weeks
extract(days from MAX(events.time) - users.sign_up_time) / 7 as retention_weeks,
users.friends,
ntile(100) OVER (PARTITION BY extract(days from MAX(events.time) - users.sign_up_time) / 7 ORDER BY users.friends ASC) as percentile
FROM users JOIN events ON users.id = events.user_id
GROUP BY 1


And the result would look like:

  id  | retention_weeks | friends | percentile
------+----------------------------------------
1074 |               1 |       0 |          3
887 |               4 |      20 |         72
1209 |               2 |       3 |         88
1420 |              12 |     120 |         70
1532 |              12 |     123 |         70
1101 |               0 |       0 |         20
577 |              10 |       4 |          2
...


NTILE(n) is a PostgreSQL window function that splits the window frame into n even buckets and assigns a bucket number. Now we can find the 90th percentile by just picking the rows with percentile = 90. However, since there is more than one row in each percentile, we’ll group by retention weeks and pick the maximum friend number in the 90th percentile bucket for each retention week.

SELECT
retention.retention_weeks,
max(retention.friends) as friends_90_percentile
FROM (
SELECT
users.id,
users.friends,
ntile(100) OVER (PARTITION BY extract(days from MAX(events.time) - users.sign_up_time) / 7 ORDER BY users.friends ASC) as percentile,
-- Find most recent login time and calculate retention in weeks
extract(days from MAX(events.time) - users.sign_up_time) / 7 as retention_weeks
FROM users JOIN events ON users.id = events.user_id
GROUP BY 1
) retention
WHERE percentile = 90
GROUP BY retention_weeks;


And we get our final result:

 retention_weeks |     avg_friends
-----------------+---------------------
0 | 1
1 | 3
2 | 3
3 | 5
4 | 12
5 | 17
6 | 70
...


And it looks like our number is somewhere between 5 and 12.

Again, I don’t think this analysis is easy, and I didn’t have a social networking data set to work on. You’d probably want to look at your numbers a few more ways. It would be illustrative to look at a simple histogram of friend counts—you might see a dip in the distribution around your activation number, suggesting that if people get at least 6 friends, they’ll stick around and then most likely accumulate a few more (and not stay at the 6 number for long).

Also, while I was excited to show off window functions. It’s probably better to just plot the individual user data as a scatterplot. Then you can visualize the distribution of friend counts within each retention time frame. However, if your user base is large, this would be a lot of data, and the aggregation is a good first step that lets your database do all the work.

I hope this introduced you to some new ways to perform analysis in SQL. I first used SQL a decade ago, but I continue to learn new things about it all the time. If you need help with any of the concepts here, feel free to email me at david@getmetrica.com.

March 24, 2013 at 11:26am
David Crawford

0 notes

## NYU launches initiative in data science

We’re excited to hear that NYU has created a Center for Data Science, with a new MS. From one of the professors:

"The CDS is a multi-disciplinary center that will bring people who
produce theory, tools, and methods (from mathematics, statistics and
computer science) together with people who need to extract knowledge
from data from the physical, life and social sciences, as well as from

http://datascience.nyu.edu/nyu-launches-initiative-in-data-science-and-statistics-and-center-for-data-science/

March 7, 2013 at 5:14pm
David Crawford

5 notes

## Calculating rolling cohort retention with SQL

One of the most important metrics you need to track for almost any product is retention. It helps you calculate your customer lifetime value so you can determine how much to spend on marketing. It helps you know the impact of new features on your bottom line. It helps you court investors and partners. If you’re serious about metrics you need to know your retention rate. You are what you measure.

KISSmetrics, Mixpanel, and others are building great tools for this, but there are always times you want to calculate it yourself. It’s dangerous to use metrics you don’t understand, so it’s worth the exercise just to learn how retention is calculated. And sometimes you want to go deeper than the available tools. If your retention is low or stagnant, you may want to dig into retention metrics to determine a root cause.

In this article we’ll go through the process of creating a retention report from your own database. We’ll show different ways to slice it in a follow-up post.

## The database

In order to calculate cohort retention you need two pieces of information:

(You may want to use some other action to check retention, but login is pretty universal so we’ll use it here.)

We’ll assume the following tables:

Users

• id
• sign_up_time

Events

• user_id
• type
• time

## The queries

Let’s start by calculating the D1 retention for users who signed up two days ago. That means, “of all users who signed up two days ago, how many logged in yesterday?”

Here are the users who signed up two days ago:

SELECT count(1) as signed_up_users
FROM users
WHERE users.sign_up_time >= date_trunc('day', now() - interval '2 days')
AND users.sign_up_time < date_trunc('day', now() - interval '1 day');


What? Let’s break it down.

To get the date for “two days ago” we do now() - interval '2 days'. In PostgreSQL, that gives us the time exactly 48 hours ago, which will be a different time of day depending on what time you run the query. We don’t want our metrics changing based on the time of day we run the query, so we clip to midnight with date_trunc('day', time). Altogether, our where clause:

WHERE users.sign_up_time >= date_trunc('day', now() - interval '2 days')
AND users.sign_up_time < date_trunc('day', now() - interval '1 day')


means where the user signed up between midnight two days ago and midnight yesterday.

Here are the *users who logged in yesterday*:

SELECT count(distinct events.user_id) as logged_in_users
FROM events
AND events.time >= date_trunc('day', now() - interval '1 day')
AND events.time < date_trunc('day', now())


So there are two new parts. First, I’m assuming you have multiple types of events in your event database, so we filter to ‘login’ type events. More interesting is the count(distinct events.user_id). Since users can log in more than once per day, a simple count would give us the total number of log ins, which could be more than the number of users who logged in. We want to know what percent of users logged in versus signed up, so we need to use count(distinct events.user_id).

From now on I’m going to drop the date_trunc() stuff and use absolute dates to make the queries simpler.

Now we have the number of users who signed up two days ago and the number who logged in yesterday. But that counts *all* logins yesterday. We want to know of the users who signed up how many logged in the next day. So we need to do a join.

SELECT
count(distinct users.id) as signed_up_users,
count(distinct events.user_id) as logged_in_users
FROM users
LEFT JOIN events
ON users.id = events.user_id
AND events.time >= '2013-01-04'
AND events.time < '2013-01-05'
WHERE users.sign_up_time >= '2013-01-03'
AND users.sign_up_time < '2013-01-04';


We switch from count(1) to count(distinct users.id) because joining to the events table means we may have more than one row per user (if a user logged in more than once), so we use distinct to make sure we count each user only once.

We use a LEFT JOIN to get the logins from the event table for each user. If we used a simple JOIN, the result would omit the rows for users that didn’t log in the next day, so we would lose the total sign up count. A LEFT JOIN preserves those rows, and returns NULL for all columns in the events table. Here’s an example of what’s going on in that LEFT JOIN:

SELECT * from users LEFT JOIN events ON users.id = events.user_id LIMIT 5;

id  |    sign_up_time     | id  | user_id | type  |        time
-----+---------------------+-----+---------+-------+---------------------
651 | 2013-01-05 13:38:22 |     |         |       |
652 | 2013-01-05 14:04:13 |     |         |       |
653 | 2013-01-05 14:16:36 | 363 |     653 | login | 2013-01-06 12:16:36
653 | 2013-01-05 14:16:36 | 362 |     653 | login | 2013-01-06 12:16:36
654 | 2013-01-05 14:19:21 | 364 |     654 | login | 2013-01-06 12:19:21


So now we have the D1 retention for users who signed up on 2013-01-03. Next we’ll get a full report of the retention numbers for users who signed up between 2013-01-03 and 2013-01-13.

SELECT
date_trunc('day', users.sign_up_time) as day,
count(distinct users.id) as signed_up_users,
count(distinct events.user_id) as logged_in_users,
case when count(distinct users.id) > 0 then count(distinct events.user_id) * 100 / count(distinct users.id) else 0 end as retention_pct
FROM users
LEFT JOIN events
ON users.id = events.user_id
AND events.time >= date_trunc('day', users.sign_up_time + interval '1 day')
AND events.time < date_trunc('day', users.sign_up_time + interval '2 days')
WHERE users.sign_up_time >= '2013-01-03'
AND users.sign_up_time < '2013-01-13'
GROUP BY 1
ORDER BY 1;


We made three changes. First, we added the users.sign_up_time day to the select clause and grouped by it. Second we changed the join clause to use times that are relative to users.sign_up_time. Now we need to check each event row to find out whether it occurred the day after the user sign up. Since we’re no longer looking at just one day of sign ups, we need to calculate that day based on the user we’re joining to. Finally, we added a retention_pct as a convenience. It’s a little complicated because if you have no signed up users for a period, the calculation will divide by zero and error, so we added a CASE WHEN clause to check for that case.

And here’s the result:


day         | signed_up_users | logged_in_users | retention_pct
---------------------+-----------------+-----------------+---------------
2013-01-03 00:00:00 |              74 |              42 |            56
2013-01-04 00:00:00 |              85 |              25 |            29
2013-01-05 00:00:00 |              97 |              34 |            35
2013-01-06 00:00:00 |             103 |              37 |            35
2013-01-07 00:00:00 |             100 |              31 |            31
2013-01-08 00:00:00 |              96 |              26 |            27
2013-01-09 00:00:00 |              98 |              26 |            26
2013-01-10 00:00:00 |              89 |              27 |            30
2013-01-11 00:00:00 |              97 |              31 |            31
2013-01-12 00:00:00 |              96 |              22 |            22
(10 rows)


A lot of hard work for a small set of numbers, but those numbers are the life blood of your company. We’ll follow up soon with ways to dig deeper into the data to find the factors behind this result.

March 3, 2013 at 10:39pm
David Crawford

1 note

Everyone wants a data scientist. And no wonder—in a world of polyglot persistence analysts can no longer get access to their data with SQL alone. When you hire someone who knows stats but can’t hack, you’ll eventually end up hiring another hacker to make up the difference.

Finding a good data scientist is hard, but that’s not the only problem. To really turn data into value, you need all three elements of the data science venn diagram

Hiring managers need to know they are making a tradeoff between these three skills when choosing a candidate. Skimp on hacking and your employee will always be waiting for someone to get them data. Skimp on stats and they’ll misinterpret noise as signal. Skimp on expertise (domain knowledge) and they’ll overlook important business drivers.

If at all possible, remove one of these requirements. You’ll hire candidates who are much stronger in the other two.

March 2, 2013 at 8:31am
David Crawford

0 notes

## Grab first, structure later

I know I’m not the only one who’s frustrated by the term “Big Data”. It means everything and nothing.

Even 1000 rows can be big data to a human who’s trying to understand it. For many, the question isn’t whether the machines can handle the number of bits, it’s whether their brains can handle the structure of the data.

Data storage costs have been dropping exponentially since the computer was invented, but when you add the proliferation of key-value stores and document databases of the last 5 years, you get a significant change. You get companies saving data before they know what they want it for. They’re designing data models for maximum capture. Pythian’s Gwen Shapira calls it "the most important paradigm of the post-relational-database era: Grab data first, structure it later."

Developers are capturing much more data. But the work of structuring it has shifted: from the developer to the analyst, and from write-time to query-time.

So analysts today start with more but also less. More data, less coherence. I think we’re going to see a lot more tools popping up to help analysts tackle this challenge.

Discuss on Hacker News.

February 26, 2013 at 1:59am
David Crawford

0 notes

John Rauser of Amazon delivers a great talk with clear, concrete examples of how to get deeper on your data.  Gave me a new appreciation for histograms.

## Use Elastic IPs on EC2 from the Start

In the three months we’ve been using EC2, two of our boxes have been “retired.”  It sounds as if this is happening more frequently, according to a blog post by awe.sm.  And while Amazon tries to make it as painless as possible, if you’re not using elastic IPs, you can get yourself in trouble.

When there are problems with a physical machine, Amazon may retire it, and you will need to move your image to another machine.  They’ll give you anywhere between a few days and a few weeks of notice—you’ll get an email, and a clock icon will show up next to the instance name signaling a “scheduled event.”  If you do nothing, your instance will be stopped on the scheduled day with no further notice (you only get one email, so don’t ignore it).

You can restart the instance any time after the retirement is scheduled, and your image will be automatically loaded on a new machine on startup.  This is convenient enough.  However, your machine will also be assigned new IP addresses and domain names.  And this is why you want to be using elastic IPs from the start.  A new domain name could mean updating your DNS settings, or even updating configuration and restarting other servers that connect to the affected one.

An alternative to elastic IPs would be to use only your own domain names in configuration (such as database server connection strings).  Then, when a box is retired, you can update DNS in order to avoid changing configuration, but you’ll still have downtime up to the TTL on your DNS settings.  With elastic IP settings, you can simply re-assign the IP and public domain (e.g. ec2-12-34-56-78.compute-1.amazonaws.com) to the new machine, and the changes take place immediately.

12:10am
David Crawford

2 notes
Reblogged from evanrose

(via Laser-cut wooden maps of underwater contours) cool.