Skip to content

Instantly share code, notes, and snippets.

@msukmanowsky
Last active December 12, 2019 18:15
Show Gist options
  • Save msukmanowsky/805db14a7b4eeadb3f52798074018baa to your computer and use it in GitHub Desktop.
Save msukmanowsky/805db14a7b4eeadb3f52798074018baa to your computer and use it in GitHub Desktop.

LookML vs Materialized Views

Say we're working with clickstream data:

CREATE TABLE pageviews (
  "timestamp"   timestamp,
  url           text,
  referrer      text,
  user_agent    text,
  ip_address    inet,
  is_bot        boolean
)

Most users won't care about the raw pageviews table and it might even be error-prone to query it directly, so we usually create views on top of this (plain views or materialized):

CREATE VIEW daily_pageviews AS (
  SELECT
    date_trunc('day', "timestamp") as day,
    url,
    count(*) as page_views,
    count(distinct ip_address) as users,
    sum(case when lower(user_agent) like "%chrome%" then 1 else 0 end) as chrome_pageviews
  FROM pageviews
  WHERE
    is_bot=false
  GROUP BY
    1, 2
)

Of course it's likely that we'll want to support other granularities, so we might end up maintaining multiple views like weekly_pageviews or monthly_pageviews.

Beyond it being annoying to maintain multiple views, a larger issue has potentially just crept up. Consider the users column in daily_pageviews which is a count(distinct ip_address). What we'd like to do is flag that this column cannot be summed if aggregating further (say trying to get unique users for the month), but that kind of information is impossible to encode in SQL and the engine will happily allow a user to sum(users) and arrive at an invalid number.

What LookML lets you do is say that the measure called "users" is intended to be a count distint on the ip_address column. We can also say that the raw pageviews table contains a dimension called timestamp which Looker will already know can be truncated to second, minute, hour, day, etc. What Looker then enforces is that any time I request the "users" measure for any aggregate, it'll know that it must perform a count distinct and not try to sum on top of some materialized table.

You're basically enforcing data definitions via LookML that you can't via plain SQL. Does that make sense or was this confusing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment