Skip to content

Instantly share code, notes, and snippets.

@ewoodh2o
Created October 11, 2012 16:45
Show Gist options
  • Save ewoodh2o/3873775 to your computer and use it in GitHub Desktop.
Save ewoodh2o/3873775 to your computer and use it in GitHub Desktop.
Analytics Design Question

Loading Metrics from a Database

Given the following database schema:

| page_metrics                                         |
+------------------------------------------------------+
| id | page_id | date       | fans | comments | clicks |
+----+---------+------------+------+----------+--------+
| 11 |     123 | 2012-08-01 | 1750 |       23 |    103 |
| 12 |     456 | 2012-08-01 |   58 |        8 |     15 |
| 13 |     123 | 2012-08-02 | 1755 |       48 |    421 |
| 14 |     456 | 2012-08-02 |   61 |        6 |     25 |
|                    ...                               |
+----+---------+------------+------+----------+--------+

1. Basic query

Write a function that takes the following parameters:

page_id, metric, start_time, end_time

and returns data in the following format:

[
  { 'date': '2012-08-01', 'fans': 1750 },
  { 'date': '2012-08-01', 'fans': 1755 }
]

You may assume that you have a simple DB library that provides some basic functionality. For example: DB[:table_name].select('SELECT ...your SQL statement') could run a SELECT statement and return an array of hashes containing the matching data.

2. Multiple metrics

What if we wanted the metric parameter in the previous function to accept an array of metrics?

3. Multiple pages

What if we wanted the page_id parameter to also accept an array? In this case, we would still only return a single date series, but the values would be the sum of all the pages indicated.

[
  { 'date': '2012-08-01', 'fans': 1808 },
  { 'date': '2012-08-01', 'fans': 1816 }
]

Multiple Values per Day

Now suppose we started tracking metrics that had multiple values per day. For example, on row 11 above, we would know that 1500 of those fans were from the United States, 200 were from Canada, and 50 were from Mexico.

We would want to be able to maintain performant queries to get the total number of likes. We would also want to be able to query for fans_by_country and receive a broken-down list. The response might look as follows:

[
  {
    'date': '2012-08-01',
    'fans': 1750,
    'comments': 23,
    'fans_by_country': {
      'United States': 1500,
      'Canada': 200,
      'Mexico': 50
    }
  },
  {
    'date': '2012-08-01',
    'fans': 1755,
    'comments': 48,
    'fans_by_country': {
      'United States': 1502,
      'Canada': 201,
      'Mexico': 52
    }
  }
]

4. Schema design

How would you design a schema to support that breakdown?

5. Query functionality

How would you add to your previous function so that it could return data as specified?

6. Database optimization

Imagine this database at full scale:

  • 10,000 pages
  • A year worth of data
  • Dozens of metrics
  • All broken down by country

What sort of changes might be warranted in your database schema design? What optimizations could be made to your functions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment