Skip to content

Instantly share code, notes, and snippets.

@AlexDinahl
Last active February 23, 2024 14:44
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save AlexDinahl/7dd82169290dfb28359c3092f5408809 to your computer and use it in GitHub Desktop.
Save AlexDinahl/7dd82169290dfb28359c3092f5408809 to your computer and use it in GitHub Desktop.
Last-Click non direct model attribution by source, medium, campaign
/* Disclaimer: This is the modified code originally taken from this page:
https://tanelytics.com/ga4-bigquery-session-traffic_source/
So, all credits to Taneli Salonen
I just modified this code for my needs.
Sessions calculated as HyperLogLog: https://developers.google.com/analytics/blog/2022/hll#about_hll
So, approximate calculation that takes less recources
Changes:
CTEs: -date_range - this is the report date_range you wish to see (I took one day)
-lookup_range - always 30 days for last_click_non_direct
This is just an assumption, and made for purpose if you don't want to extract all events table as in the original code here:
https://tanelytics.com/ga4-bigquery-session-traffic_source/
My logic as follows:
1. Code collects 'collected_traffic_source' for all sessions within 30 days range
2. Then it looks up for the very first last-click non-direct source within the table
3. If traffic source is still direct then code addresses to the 'traffic_source' which is the very first traffic source for user
4. Thus I added the code below
(
select
as struct traffic_source.source as first_source,
traffic_source.medium as first_medium,
traffic_source.name as first_campaign
) as first_traffic_source
5. Moreover this code can still give descrepances for CRM for example, because imagine someone entered your website 3 months ago by 'cpc',
then after a month (2 month ago) by 'email', and finally 10 days ago via 'direct'. 3 visits in total.
Your lookup table takes range only for 30 days so in the table last session will be 'direct', first traffic source will be 'cpc'
which this query will return, but in reality it is 'email'.
But in this case if you need better quality, do as in the article. My method is just allows to get 'quick' glance.
And still an approximation.
*/
with date_range as (
select
'20231005' as start_date,
'20231005' as end_date),
lookup_range as (select
format_date('%Y%m%d',date_sub(parse_date('%Y%m%d',start_date), interval 30 day)) as lookup_start_date,
end_date as lookup_end_date
from date_range),
events as (
select
cast(event_date as date format 'YYYYMMDD') as date,
-- unique session id
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
user_pseudo_id,
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_start,
-- wrap all traffic source dimensions into a struct for the next step
(
select
as struct collected_traffic_source.manual_source as source,
collected_traffic_source.manual_medium as medium,
collected_traffic_source.manual_campaign_name as campaign,
collected_traffic_source.gclid as gclid
) as traffic_source,
--We collect first traffic source to apply it for empty ones, when the lookup window is not covering it
(
select
as struct traffic_source.source as first_source,
traffic_source.medium as first_medium,
traffic_source.name as first_campaign
) as first_traffic_source,
event_timestamp
from
`your_account.analytics_1234567.events_*`,date_range,lookup_range
where
(_table_suffix >= lookup_start_date and _table_suffix <= lookup_end_date)
and event_name not in ('session_start', 'first_visit')
),
sessions as (
select
min(date) as date,
session_id,
user_pseudo_id,
session_start,
-- the traffic source of the first event in the session with session_start and first_visit excluded
array_agg(
if(
coalesce(traffic_source.source, traffic_source.medium, traffic_source.campaign, traffic_source.gclid) is not null,
(
select
as struct if(traffic_source.gclid is not null, 'google', traffic_source.source) as source,
if(traffic_source.gclid is not null, 'cpc', traffic_source.medium) as medium,
traffic_source.campaign,
traffic_source.gclid
),
null
)
order by
event_timestamp asc
limit
1
) [safe_offset(0)] as session_first_traffic_source,
-- the last not null traffic source of the session
array_agg(
if(
coalesce(traffic_source.source,traffic_source.medium,traffic_source.campaign,traffic_source.gclid) is not null,
(
select
as struct if(traffic_source.gclid is not null, 'google', traffic_source.source) as source,
if(traffic_source.gclid is not null, 'cpc', traffic_source.medium) as medium,
traffic_source.campaign,
traffic_source.gclid
),
null
) ignore nulls
order by
event_timestamp desc
limit
1
) [safe_offset(0)] as session_last_traffic_source,
array_agg(
if(
coalesce(first_traffic_source.first_source,first_traffic_source.first_medium,first_traffic_source.first_campaign) is not null,
traffic_source,
null
) ignore nulls
order by
event_timestamp desc
limit
1
) [safe_offset(0)] as first_visit_traffic_source,
from
events
where
session_id is not null
group by
session_id,
user_pseudo_id,
session_start
),
attribution as (
select date
--Here if session_first_traffic_source is null take session_last_traffic_source within lookup window which is not direct, and finally if this one is null take the very first one
,ifnull(ifnull(
session_first_traffic_source,
last_value(session_last_traffic_source ignore nulls) over(
partition by user_pseudo_id
order by
session_start range between 2592000 preceding
and 1 preceding -- 30 days lookback
)
),first_visit_traffic_source) as last_non_direct,session_id
from sessions
)
select concat(ifnull(last_non_direct.source,'(direct)'),' / ',ifnull(last_non_direct.medium,'(none)')) as sm,count(distinct session_id) as sessions,hll_count.extract(hll_count.init(session_id,12)) as approx_sessions
from attribution,date_range
where date between parse_date('%Y%m%d',start_date) and parse_date('%Y%m%d',end_date)
group by 1
order by 2 desc;
#############################Ver 2.Beta Version. Based on string_agg##############################
--This is beta version and attemp to optimize of what is written above
--Temp function to extract parameters from events
create temp function GetParamValue(params any type, target_key string)
as (
(select `value`from unnest(params) where key = target_key limit 1)
);
--Main code
with date_range as (
select
'20240222' as start_date,
'20240222' as end_date),
channels as (
select distinct
cast(event_date as date format 'YYYYMMDD') as visit_date,
event_timestamp,
user_pseudo_id,
event_name,
GetParamValue(event_params, 'ga_session_id').int_value as session_id,
if(collected_traffic_source.manual_source is null and collected_traffic_source.gclid is not null,
'google',collected_traffic_source.manual_source) as utm_source
from `your_account.analytics_1234567.events_*`,date_range
where _table_suffix between start_date and end_date),
chains as (
select visit_date
,user_pseudo_id
,session_id
,utm_source
,string_agg(utm_source,',') over (partition by user_pseudo_id order by session_id,event_timestamp) as usource
,instr(string_agg(event_name,',') over (partition by concat(user_pseudo_id,session_id) order by session_id,event_timestamp),'session_start') as event
,array_length(regexp_extract_all(string_agg(utm_source,',') over (partition by user_pseudo_id order by session_id,event_timestamp),',')) as size
from channels),
attribution as (
select visit_date
,user_pseudo_id
,session_id
,usource
,case when split(usource,',')[safe_offset(size)] is null and event=0 then '(not set)'
when split(usource,',')[safe_offset(size)] is null and event>0 then '(direct)'
else split(usource,',')[safe_offset(size)] end as last_click_source
,event
from chains
)
select last_click_source,count(distinct concat(user_pseudo_id, session_id)) as sessions
from attribution
group by 1
order by 2 desc;
@martinee-io
Copy link

how can i add measure like purchase, purchase_revenue_in_usd, new_users ?

since prep CTE select min date, aggregation of those measure is not aggreaged correctly...

@AlexDinahl
Copy link
Author

AlexDinahl commented Jan 8, 2024

Hi!

In this case you can create additional CTE. For example for E-Commerce, and join it to the 'attribution' CTE.

This is how modified code will look like:

with date_range as (
  select
    '20231005' as start_date,
    '20231005' as end_date)

{....omitted code.....}

attribution as (
  select date
  --Here if session_first_traffic_source is null take session_last_traffic_source within lookup window which is not direct, and finally if this one is null take the very first one
  ,ifnull(ifnull(
    session_first_traffic_source,
    last_value(session_last_traffic_source ignore nulls) over(
      partition by user_pseudo_id
      order by
        session_start range between 2592000 preceding
        and 1 preceding -- 30 days lookback
    )
  ),first_visit_traffic_source) as last_non_direct,session_id
  from sessions
  ),
--Added CTE for E-commerce
ecommerce  as (
select
    concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
    if(ecommerce.transaction_id='(not set)',null,ecommerce.transaction_id) as transaction_id,
    sum(ecommerce.purchase_revenue) as purchase_revenue,
from `your_account.analytics_1234567.events_*` , date_range
where _table_suffix between start_date and end_date
group by 1,2
)
  select 
  concat(ifnull(last_non_direct.source,'(direct)'),' / ',ifnull(last_non_direct.medium,'(none)')) as sm
  ,count(distinct session_id) as sessions
  --,hll_count.extract(hll_count.init(session_id,12)) as approx_sessions
  ,count(distinct transaction_id) as transactions
  ,sum(purchase_revenue) as revenue
  from attribution,date_range
  left join ecommerce using (session_id)
  where date between parse_date('%Y%m%d',start_date) and parse_date('%Y%m%d',end_date)
  group by 1
  order by 2 desc

But this is not very efficient way in terms of data consumtion within Big Query. Better to create a daily updated table (based on the query above or herehttps://tanelytics.com/ga4-bigquery-session-traffic_source/ ) with attribution which will look like:

timestamp, session_id, source, medium, campaign, term etc.

and then join to this table (where session_id is already attributed to source/medium campaign) any other parameters like, ecom, new users, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment