As far as technical skills go, SQL is a really nice skill to have for product managers and product marketers. Instead of constantly running into performance issues in Excel or begging "technical" people to look stuff up for you, you can get answers to your questions directly from data.
Unfortunately, good, non-encyclopedic resources are few and far between on the Internet. Many of them just tell you the syntax without context. Many assume you are already a programmer and/or omit the technical details of setting up the environment.
The approach here is different. In a series of blog posts, I will walk through a simple but common data analysis scenario. Each step introduces a new concept in SQL that's tied to a natural business question. By the end of this series, you should feel comfortable getting read-only access to your company's databases.
It is helpful to keep the following mental model throughout this tutorial
Excel | SQL |
---|---|
Table | Table |
Sort | ORDER BY |
Filter | WHERE |
Pivot Table | GROUP BY |
VLOOKUP | JOIN |
Array Functions | Aggregation Functions |
Imagine that you are a product manager at an e-commerce or SaaS startup. Then, the critical metric to keep pulse on your product and business is CLTV: Customer Lifetime Value. If you know how much revenue each customer brings to your business over their lifetime, you have a much better idea of how to meet revenue goals and evolve your product.
Also, if you know each customer's LTV, then you can make more efficient business decisions. Which customer acquisition channel has high LTVs? Which one is losing money? Are their regional differences?
The goal of this tutorial is to compute the average customer lifetime value from two data sources: users data and payments data.
For this series, I will be using SQLite. SQLite is simple, free, and shipped with recent versions of Mac OSX, which I assume many of you are on. Windows is also supported (See here, for example, on how to download SQLite on Windows).
Also, this might be the first time you are working with the command line interface. I promise that you won't have to do much besides entering SQL commands.
First, let's open up Terminal. If you are on a Mac, it's as simple as searching for it in the search bar like this:
GIF IMAGE HERE
The first step is to start sqlite3 with pre-populated tables. Download bootstrap.sql and place it in the directory of your choice. Then, in your terminal, run the following command in the same directory:
sqlite3 -init bootstrap.sql
This should start up a SQLite interactive shell like this:
sqlite>
Let's see what tables (again, they are like the tables in Excel) are there. For sqlite3, typing in ".tables" does the job:
sqlite> .tables
payments users
Great. There are two tables here, "payments" and "users". Unlike Excel, you need to actually write a query to see what the data looks like.
To do so, we run the simplest SQL query, which just grabs all the columns and rows in a table. DO NOT DO THIS for large tables. But here, the table is small (with just 10 rows), so go ahead and type in "SELECT * FROM users;" and hit enter. Do not forget the trailing semicolon.
sqlite> SELECT * FROM users;
id campaign signed_up_on
---------- ---------- ------------
1 facebook 2014-10-01
2 twitter 2014-10-02
3 direct 2014-10-02
4 facebook 2014-10-03
5 organic 2014-10-03
6 organic 2014-10-03
7 organic 2014-10-04
8 direct 2014-10-05
9 twitter 2014-10-05
10 organic 2014-10-05
As you can see, this table has three fields/columns:
- "id" is the user's ID. This will be referenced in the "payments" table
- "campaign" is the campaign used to acquire that user.
- "signed_up_on" is the date when the user signed up for the website.
Just like Excel's tables, SQL lets you sort data by one or more columns. To do so, we add ORDER BY <column_names>
to your SQL statement.
sqlite> SELECT * FROM users ORDER BY campaign;
id campaign signed_up_on
---------- ---------- ------------
3 direct 2014-10-02
8 direct 2014-10-05
1 facebook 2014-10-01
4 facebook 2014-10-03
5 organic 2014-10-03
6 organic 2014-10-03
7 organic 2014-10-04
10 organic 2014-10-05
2 twitter 2014-10-02
9 twitter 2014-10-05
Here is how you can sort by campaign AND signed_up_on.
sqlite> SELECT * FROM users ORDER BY campaign, signed_up_on;
id campaign signed_up_on
---------- ---------- ------------
3 direct 2014-10-02
8 direct 2014-10-05
1 facebook 2014-10-01
4 facebook 2014-10-03
5 organic 2014-10-03
6 organic 2014-10-03
7 organic 2014-10-04
10 organic 2014-10-05
2 twitter 2014-10-02
9 twitter 2014-10-05
You can also sort in reverse order by adding "DESC".
sqlite> SELECT * FROM users ORDER BY campaign DESC;
id campaign signed_up_on
---------- ---------- ------------
2 twitter 2014-10-02
9 twitter 2014-10-05
5 organic 2014-10-03
6 organic 2014-10-03
7 organic 2014-10-04
10 organic 2014-10-05
1 facebook 2014-10-01
4 facebook 2014-10-03
3 direct 2014-10-02
8 direct 2014-10-05
And of course, "DESC" can be used when sorted by multiple columns.
sqlite> SELECT * FROM users ORDER BY campaign, signed_up_on DESC;
id campaign signed_up_on
---------- ---------- ------------
8 direct 2014-10-05
3 direct 2014-10-02
4 facebook 2014-10-03
1 facebook 2014-10-01
10 organic 2014-10-05
7 organic 2014-10-04
5 organic 2014-10-03
6 organic 2014-10-03
9 twitter 2014-10-05
2 twitter 2014-10-02
One of Excel's most used features is filters. Of course, SQL, too, has filters. They are called WHERE clauses and can express an even broader range of filter conditions than Excel's filters.
Here is the query that fetches all the users that signed up organically:
sqlite> SELECT * FROM users WHERE campaign = 'organic';
id campaign signed_up_on
---------- ---------- ------------
5 organic 2014-10-03
6 organic 2014-10-03
7 organic 2014-10-04
10 organic 2014-10-05
As you can see, we have "WHERE campaign = 'organic'" in the above query. An analog in Excel is going to the "campaign" column and selecting just 'organic'.
What if you want to select multiple values? No problem, SQL can handle that with "IN" like this.
sqlite> SELECT * FROM users WHERE campaign IN ('facebook', 'twitter');
id campaign signed_up_on
---------- ---------- ------------
1 facebook 2014-10-01
2 twitter 2014-10-02
4 facebook 2014-10-03
9 twitter 2014-10-05
The above query fetches all the users that signed up through Facebook or Twitter. As you can see, you can list up multiple fields separated by commas inside the parenthesis.
What if you wanted to fetch all the users EXCEPT the ones that came Facebook or Twitter? This is supported with "NOT IN" like this:
sqlite> SELECT * FROM users WHERE campaign NOT IN ('facebook', 'twitter');
id campaign signed_up_on
---------- ---------- ------------
3 direct 2014-10-02
5 organic 2014-10-03
6 organic 2014-10-03
7 organic 2014-10-04
8 direct 2014-10-05
10 organic 2014-10-05
Okay, but all the filtering thus far involved a single column. Can SQL filter by multiple columns? They answer is yes, and it uses AND to concatenate filter conditions. The following query fetches all the users that came from Facebook or Twitter campaigns that signed up on Oct. 1, 2014.
sqlite> SELECT * FROM users WHERE campaign in ('facebook', 'twitter') AND signed_up_on = '2014-10-01';
id campaign signed_up_on
---------- ---------- ------------
1 facebook 2014-10-01
Now, it's time to show that SQL's WHERE is more powerful than Excel's filters. In addition to AND, You can actually use OR to say something like "Get me all the users that signed up before 2014-10-04 OR came in organically". Here is the query:
sqlite> SELECT * FROM users WHERE campaign = 'organic' OR signed_up_on < '2014-10-04';
id campaign signed_up_on
---------- ---------- ------------
1 facebook 2014-10-01
2 twitter 2014-10-02
3 direct 2014-10-02
4 facebook 2014-10-03
5 organic 2014-10-03
6 organic 2014-10-03
7 organic 2014-10-04
10 organic 2014-10-05
As you might have guessed by now, SQL allows you to filter and sort in one go. The syntax is simple: Have both WHERE and ORDER BY in your query, but make sure WHERE comes before ORDER BY. Here is a query that fetches all the Facebook-/Twitter-sourced users, sorted by campaign.
sqlite> SELECT * FROM users WHERE campaign in ('facebook', 'twitter') ORDER BY campaign;
id campaign signed_up_on
---------- ---------- ------------
1 facebook 2014-10-01
4 facebook 2014-10-03
2 twitter 2014-10-02
9 twitter 2014-10-05
Congratulations! You now know how to do the SQL equivalent of Excel's sorting and filtering. In the next entry, I will show how to do "pivot tables" in SQL.