
@bstancil
Last active October 18, 2023 16:07
These queries let you find user sessions in event data logged to Segment SQL, Snowplow, or Google BigQuery.
-- These queries let you find user sessions in event data
-- logged to Segment SQL, Snowplow, or Google BigQuery.
-- For more details, see the full post:
-- LINK
--- SEGMENT SQL
-- Finding the start of every session
SELECT *
  FROM (
       SELECT *,
              LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
         FROM "your_project".tracks
       ) last
 WHERE EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
    OR last_event IS NULL
-- Mapping every event to its session
SELECT *,
       SUM(is_new_session) OVER (ORDER BY user_id, sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
  FROM (
       SELECT *,
              CASE WHEN EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
                     OR last_event IS NULL
                   THEN 1 ELSE 0 END AS is_new_session
         FROM (
              SELECT *,
                     LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
                FROM "your_project".tracks
              ) last
       ) final
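-- Once every event carries a user_session_id, session-level metrics fall
-- out of a simple aggregation. A sketch, assuming the mapping query above
-- is wrapped as a subquery aliased "mapped" (a placeholder name):
SELECT user_id,
       user_session_id,
       MIN(sent_at) AS session_start,
       MAX(sent_at) AS session_end,
       COUNT(*)     AS events_in_session
  FROM mapped  -- the full "Mapping every event to its session" query above
 GROUP BY user_id, user_session_id
 ORDER BY user_id, user_session_id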
--- SNOWPLOW
-- Finding the start of every session
SELECT *
  FROM (
       SELECT *,
              LAG(collector_tstamp,1) OVER (PARTITION BY domain_userid ORDER BY collector_tstamp) AS last_event
         FROM "your_project".events
       ) last
 WHERE EXTRACT('EPOCH' FROM collector_tstamp) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
    OR last_event IS NULL
-- Mapping every event to its session
SELECT *,
       SUM(is_new_session) OVER (ORDER BY domain_userid, collector_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
       SUM(is_new_session) OVER (PARTITION BY domain_userid ORDER BY collector_tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
  FROM (
       SELECT *,
              CASE WHEN EXTRACT('EPOCH' FROM collector_tstamp) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
                     OR last_event IS NULL
                   THEN 1 ELSE 0 END AS is_new_session
         FROM (
              SELECT *,
                     LAG(collector_tstamp,1) OVER (PARTITION BY domain_userid ORDER BY collector_tstamp) AS last_event
                FROM "your_project".events
              ) last
       ) final
--- GOOGLE BIGQUERY
-- Finding the start of every session
SELECT *
  FROM (
       SELECT *,
              LAG(occurred_at,1) OVER (PARTITION BY fullVisitorId ORDER BY occurred_at) AS last_event
         FROM (
              SELECT fullVisitorId,
                     visitStartTime + hits.time/1000 AS occurred_at
                FROM "your_tables"
              ) pre
       ) last
 WHERE occurred_at - last_event >= (60 * 10)
    OR last_event IS NULL
-- Mapping every event to its session
SELECT *,
       SUM(is_new_session) OVER (ORDER BY fullVisitorId, occurred_at) AS global_session_id
  FROM (
       SELECT *,
              CASE WHEN occurred_at - last_event >= (60 * 10) OR last_event IS NULL THEN 1 ELSE 0 END AS is_new_session
         FROM (
              SELECT *,
                     LAG(occurred_at,1) OVER (PARTITION BY fullVisitorId ORDER BY occurred_at) AS last_event
                FROM (
                     SELECT fullVisitorId,
                            visitStartTime + hits.time/1000 AS occurred_at
                       FROM "your_tables"
                     ) pre
              ) last
       ) final
 ORDER BY 1,2
@fraserdeans

Thanks so much for these. FYI there is a missing comma before the LAG on the first Segment SQL.

Should be...

SELECT *
  FROM (
       SELECT *,
              LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
        FROM "your_project".tracks
      ) last
WHERE EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
   OR last_event IS NULL

@alfurb

alfurb commented Nov 22, 2017

In the Google BigQuery queries, what is the reason for adding hits.time/1000 in visitStartTime + hits.time/1000 AS occurred_at?

@guillaumewibaux

A note that could help some:
It can happen that the ordering of events by timestamp is nondeterministic, i.e. several events share the same timestamp (say, within the same second). (More here: https://docs.aws.amazon.com/redshift/latest/dg/r_Examples_order_by_WF.html)

This can lead to some fun behavior with the window functions and session IDs above. To counter it, you can use the event id as a secondary sort key to make the ordering deterministic across the different windows.
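A sketch of that suggestion for the Snowplow query: adding a unique id as a tiebreaker in each window's ORDER BY makes the sort deterministic (event_id here stands in for whatever unique identifier your events table carries; in Snowplow's canonical event model it is event_id):

```sql
-- Break timestamp ties with the event's unique id so LAG (and the
-- SUM windows downstream) see a stable ordering on every run.
SELECT *,
       LAG(collector_tstamp,1) OVER (PARTITION BY domain_userid
                                     ORDER BY collector_tstamp, event_id) AS last_event
  FROM "your_project".events
```

The same secondary key would need to go into the ORDER BY of the SUM(is_new_session) windows as well.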

@axrdb

axrdb commented Feb 8, 2018

Is it possible to update the query to also assign new sessions based on the gap between events? We want to calculate sessions as continuous events within 60 min, for example: if there is a gap of more than 60 min for that user, then it's a different session. Any idea how to extend your query to accomplish the above?
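For what it's worth, the inactivity timeout in the queries above is the (60 * 10) term, i.e. 10 minutes expressed in seconds; a 60-minute gap rule appears to be the same query with that threshold changed:

```sql
-- Same session-start condition with a 60-minute inactivity window:
WHERE EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 60)
   OR last_event IS NULL
```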

@ankitagarwal

For Segment, wouldn't using the "timestamp" field instead of "sent_at" make more sense?

@dooberss

In the Google BigQuery queries, what is the reason for adding hits.time/1000 in visitStartTime + hits.time/1000 AS occurred_at?

Following up here, was this ever solved? I'm using BigQuery as well and it would be great to get this data.
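If it helps: in the GA BigQuery export schema, visitStartTime is the visit start as a Unix timestamp in seconds, while hits.time is each hit's offset from the visit start in milliseconds, so dividing by 1000 puts both terms in the same unit before adding them:

```sql
-- visitStartTime: visit start, Unix seconds
-- hits.time:      milliseconds elapsed since the visit started
SELECT fullVisitorId,
       visitStartTime + hits.time/1000 AS occurred_at  -- hit time in Unix seconds
  FROM "your_tables"
```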
