Say we're working with clickstream data:
CREATE TABLE pageviews (
"timestamp" timestamp,
url text,
referrer text,
user_agent text,
ip_address inet,
is_bot boolean
)
Most users won't care about the raw pageviews
table and it might even be error-prone to query it directly, so we usually create views on top of this (plain views or materialized):
CREATE VIEW daily_pageviews AS (
SELECT
date_trunc('day', "timestamp") as day,
url,
count(*) as page_views,
count(distinct ip_address) as users,
sum(case when lower(user_agent) like "%chrome%" then 1 else 0 end) as chrome_pageviews
FROM pageviews
WHERE
is_bot=false
GROUP BY
1, 2
)
Of course it's likely that we'll want to support other granularities, so we might end up maintaining multiple views like weekly_pageviews
or monthly_pageviews
.
Beyond it being annoying to maintain multiple views, a larger issue has potentially just crept up. Consider the users
column in daily_pageviews
which is a count(distinct ip_address)
. What we'd like to do is flag that this column cannot be summed if aggregating further (say trying to get unique users for the month), but that kind of information is impossible to encode in SQL and the engine will happily allow a user to sum(users)
and arrive at an invalid number.
What LookML lets you do is say that the measure called "users" is intended to be a count distint
on the ip_address
column. We can also say that the raw pageviews
table contains a dimension called timestamp which Looker will already know can be truncated to second, minute, hour, day, etc. What Looker then enforces is that any time I request the "users" measure for any aggregate, it'll know that it must perform a count distinct
and not try to sum on top of some materialized table.
You're basically enforcing data definitions via LookML that you can't via plain SQL. Does that make sense or was this confusing?