Last active
February 23, 2024 14:44
-
-
Save AlexDinahl/7dd82169290dfb28359c3092f5408809 to your computer and use it in GitHub Desktop.
Last-Click non direct model attribution by source, medium, campaign
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* Disclaimer: This is the modified code originally taken from this page: | |
https://tanelytics.com/ga4-bigquery-session-traffic_source/ | |
So, all credits to Taneli Salonen | |
I just modified this code for my needs. | |
Sessions calculated as HyperLogLog: https://developers.google.com/analytics/blog/2022/hll#about_hll | |
So, approximate calculation that takes less recources | |
Changes: | |
CTEs: -date_range - this is the report date_range you wish to see (I took one day) | |
-lookup_range - always 30 days for last_click_non_direct | |
This is just an assumption, and made for purpose if you don't want to extract all events table as in the original code here: | |
https://tanelytics.com/ga4-bigquery-session-traffic_source/ | |
My logic as follows: | |
1. Code collects 'collected_traffic_source' for all sessions within 30 days range | |
2. Then it looks up for the very first last-click non-direct source within the table | |
3. If traffic source is still direct then code addresses to the 'traffic_source' which is the very first traffic source for user | |
4. Thus I added the code below | |
( | |
select | |
as struct traffic_source.source as first_source, | |
traffic_source.medium as first_medium, | |
traffic_source.name as first_campaign | |
) as first_traffic_source | |
5. Moreover this code can still give descrepances for CRM for example, because imagine someone entered your website 3 months ago by 'cpc', | |
then after a month (2 month ago) by 'email', and finally 10 days ago via 'direct'. 3 visits in total. | |
Your lookup table takes range only for 30 days so in the table last session will be 'direct', first traffic source will be 'cpc' | |
which this query will return, but in reality it is 'email'. | |
But in this case if you need better quality, do as in the article. My method is just allows to get 'quick' glance. | |
And still an approximation. | |
*/ | |
with date_range as ( | |
select | |
'20231005' as start_date, | |
'20231005' as end_date), | |
lookup_range as (select | |
format_date('%Y%m%d',date_sub(parse_date('%Y%m%d',start_date), interval 30 day)) as lookup_start_date, | |
end_date as lookup_end_date | |
from date_range), | |
events as ( | |
select | |
cast(event_date as date format 'YYYYMMDD') as date, | |
-- unique session id | |
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id, | |
user_pseudo_id, | |
(select value.int_value from unnest(event_params) where key = 'ga_session_id') as session_start, | |
-- wrap all traffic source dimensions into a struct for the next step | |
( | |
select | |
as struct collected_traffic_source.manual_source as source, | |
collected_traffic_source.manual_medium as medium, | |
collected_traffic_source.manual_campaign_name as campaign, | |
collected_traffic_source.gclid as gclid | |
) as traffic_source, | |
--We collect first traffic source to apply it for empty ones, when the lookup window is not covering it | |
( | |
select | |
as struct traffic_source.source as first_source, | |
traffic_source.medium as first_medium, | |
traffic_source.name as first_campaign | |
) as first_traffic_source, | |
event_timestamp | |
from | |
`your_account.analytics_1234567.events_*`,date_range,lookup_range | |
where | |
(_table_suffix >= lookup_start_date and _table_suffix <= lookup_end_date) | |
and event_name not in ('session_start', 'first_visit') | |
), | |
sessions as ( | |
select | |
min(date) as date, | |
session_id, | |
user_pseudo_id, | |
session_start, | |
-- the traffic source of the first event in the session with session_start and first_visit excluded | |
array_agg( | |
if( | |
coalesce(traffic_source.source, traffic_source.medium, traffic_source.campaign, traffic_source.gclid) is not null, | |
( | |
select | |
as struct if(traffic_source.gclid is not null, 'google', traffic_source.source) as source, | |
if(traffic_source.gclid is not null, 'cpc', traffic_source.medium) as medium, | |
traffic_source.campaign, | |
traffic_source.gclid | |
), | |
null | |
) | |
order by | |
event_timestamp asc | |
limit | |
1 | |
) [safe_offset(0)] as session_first_traffic_source, | |
-- the last not null traffic source of the session | |
array_agg( | |
if( | |
coalesce(traffic_source.source,traffic_source.medium,traffic_source.campaign,traffic_source.gclid) is not null, | |
( | |
select | |
as struct if(traffic_source.gclid is not null, 'google', traffic_source.source) as source, | |
if(traffic_source.gclid is not null, 'cpc', traffic_source.medium) as medium, | |
traffic_source.campaign, | |
traffic_source.gclid | |
), | |
null | |
) ignore nulls | |
order by | |
event_timestamp desc | |
limit | |
1 | |
) [safe_offset(0)] as session_last_traffic_source, | |
array_agg( | |
if( | |
coalesce(first_traffic_source.first_source,first_traffic_source.first_medium,first_traffic_source.first_campaign) is not null, | |
traffic_source, | |
null | |
) ignore nulls | |
order by | |
event_timestamp desc | |
limit | |
1 | |
) [safe_offset(0)] as first_visit_traffic_source, | |
from | |
events | |
where | |
session_id is not null | |
group by | |
session_id, | |
user_pseudo_id, | |
session_start | |
), | |
attribution as ( | |
select date | |
--Here if session_first_traffic_source is null take session_last_traffic_source within lookup window which is not direct, and finally if this one is null take the very first one | |
,ifnull(ifnull( | |
session_first_traffic_source, | |
last_value(session_last_traffic_source ignore nulls) over( | |
partition by user_pseudo_id | |
order by | |
session_start range between 2592000 preceding | |
and 1 preceding -- 30 days lookback | |
) | |
),first_visit_traffic_source) as last_non_direct,session_id | |
from sessions | |
) | |
select concat(ifnull(last_non_direct.source,'(direct)'),' / ',ifnull(last_non_direct.medium,'(none)')) as sm,count(distinct session_id) as sessions,hll_count.extract(hll_count.init(session_id,12)) as approx_sessions | |
from attribution,date_range | |
where date between parse_date('%Y%m%d',start_date) and parse_date('%Y%m%d',end_date) | |
group by 1 | |
order by 2 desc; | |
#############################Ver 2.Beta Version. Based on string_agg############################## | |
--This is beta version and attemp to optimize of what is written above | |
--Temp function to extract parameters from events | |
create temp function GetParamValue(params any type, target_key string) | |
as ( | |
(select `value`from unnest(params) where key = target_key limit 1) | |
); | |
--Main code | |
with date_range as ( | |
select | |
'20240222' as start_date, | |
'20240222' as end_date), | |
channels as ( | |
select distinct | |
cast(event_date as date format 'YYYYMMDD') as visit_date, | |
event_timestamp, | |
user_pseudo_id, | |
event_name, | |
GetParamValue(event_params, 'ga_session_id').int_value as session_id, | |
if(collected_traffic_source.manual_source is null and collected_traffic_source.gclid is not null, | |
'google',collected_traffic_source.manual_source) as utm_source | |
from `your_account.analytics_1234567.events_*`,date_range | |
where _table_suffix between start_date and end_date), | |
chains as ( | |
select visit_date | |
,user_pseudo_id | |
,session_id | |
,utm_source | |
,string_agg(utm_source,',') over (partition by user_pseudo_id order by session_id,event_timestamp) as usource | |
,instr(string_agg(event_name,',') over (partition by concat(user_pseudo_id,session_id) order by session_id,event_timestamp),'session_start') as event | |
,array_length(regexp_extract_all(string_agg(utm_source,',') over (partition by user_pseudo_id order by session_id,event_timestamp),',')) as size | |
from channels), | |
attribution as ( | |
select visit_date | |
,user_pseudo_id | |
,session_id | |
,usource | |
,case when split(usource,',')[safe_offset(size)] is null and event=0 then '(not set)' | |
when split(usource,',')[safe_offset(size)] is null and event>0 then '(direct)' | |
else split(usource,',')[safe_offset(size)] end as last_click_source | |
,event | |
from chains | |
) | |
select last_click_source,count(distinct concat(user_pseudo_id, session_id)) as sessions | |
from attribution | |
group by 1 | |
order by 2 desc; | |
Hi!
In this case you can create additional CTE. For example for E-Commerce, and join it to the 'attribution' CTE.
This is how modified code will look like:
with date_range as (
select
'20231005' as start_date,
'20231005' as end_date)
{....omitted code.....}
attribution as (
select date
--Here if session_first_traffic_source is null take session_last_traffic_source within lookup window which is not direct, and finally if this one is null take the very first one
,ifnull(ifnull(
session_first_traffic_source,
last_value(session_last_traffic_source ignore nulls) over(
partition by user_pseudo_id
order by
session_start range between 2592000 preceding
and 1 preceding -- 30 days lookback
)
),first_visit_traffic_source) as last_non_direct,session_id
from sessions
),
--Added CTE for E-commerce
ecommerce as (
select
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
if(ecommerce.transaction_id='(not set)',null,ecommerce.transaction_id) as transaction_id,
sum(ecommerce.purchase_revenue) as purchase_revenue,
from `your_account.analytics_1234567.events_*` , date_range
where _table_suffix between start_date and end_date
group by 1,2
)
select
concat(ifnull(last_non_direct.source,'(direct)'),' / ',ifnull(last_non_direct.medium,'(none)')) as sm
,count(distinct session_id) as sessions
--,hll_count.extract(hll_count.init(session_id,12)) as approx_sessions
,count(distinct transaction_id) as transactions
,sum(purchase_revenue) as revenue
from attribution,date_range
left join ecommerce using (session_id)
where date between parse_date('%Y%m%d',start_date) and parse_date('%Y%m%d',end_date)
group by 1
order by 2 desc
But this is not very efficient way in terms of data consumtion within Big Query. Better to create a daily updated table (based on the query above or herehttps://tanelytics.com/ga4-bigquery-session-traffic_source/ ) with attribution which will look like:
timestamp, session_id, source, medium, campaign, term etc.
and then join to this table (where session_id is already attributed to source/medium campaign) any other parameters like, ecom, new users, etc.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
how can i add measure like purchase, purchase_revenue_in_usd, new_users ?
since prep CTE select min date, aggregation of those measure is not aggreaged correctly...