Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save QuantVI/4d32c4ddde109a1ba31a0edb4197d00e to your computer and use it in GitHub Desktop.
Save QuantVI/4d32c4ddde109a1ba31a0edb4197d00e to your computer and use it in GitHub Desktop.
A HIVE/Hadoop HQL query over a large distributed cluster. For a single client, it scans 45+ days worth of re-targeting campaign data related to clicks, click revenue, sales and order value. It returns an aggregated result based on site-visitor segments, such as "Abandoned Cart" visitors.
-- Query to breakout events into legacy segmentation
CREATE TEMPORARY FUNCTION rank AS 'ABCxyz.QWERTY.hadoop.hive.udf.Rank';
SET mapred.map.output.compression.codec = ABCxyz.hadoop.compression.lzo.LzopCodec;
SET hive.cli.print.header = TRUE;
SET mapred.job.priority = HIGH;
SET partnerid = 1731;
SET golivedate = "2014-03-01";
-- abandondate is 15 days before golivedate
SET abandondate = "2014-02-14";
-- lowerfunneldate is 11 days before golivedate
SET lowerfunneldate = "2014-02-18";
-- doublebasket is 45 days before golivedate
SET doublebasketdate = "2014-01-15";
-- The abandoned cart eligible window in days
SET acwindow = 15;
SELECT
clicksegment.campaign
,clicksegment.clicks
,clicksegment.clickrevenue
,transsegment.sales
,transsegment.ordervalue
FROM
-- ================================================================
-- ================================================================
-- The below part is one of two major parts to this query
-- clicksegment below is joined to the lower part
-- on the segment names: Browsers, Customers, Shoppers, Abandoned Cart
-- The two parts otherwise do NOT mix. Debug them separately.
-- ================================================================
-- ================================================================
-- Subquery clicksegment
-- Buckets all clicks from golivedate
-- into segments of Browsers, Customers, Shoppers, and Abandoned Cart.
-- Also gets the Click Revenue
-- NOTE. Since the clicks can have nulls, there is an Unknown Segment group
-- in cligksegment, that is not defined in the lower part. That is why we take segment names
-- from click segment.
(
Select
maxcontext.Test5CX campaign
,count(maxcontext.a_click) clicks
,sum(maxcontext.click_revenue) clickrevenue
From
-- Subquery maxcontext
-- 1 row per user click on golivedate, with the context of that click.
-- For each user associated with the click, it also gives
-- the last basket event, and last sales event for the user.
-- The last_basket and last_sale events are timestamps to compare with the click timestamp
(
select
allclicksgoliveday.user_id as userid,
allclicksgoliveday.a_click ,
allclicksgoliveday.context_id ,
allclicksgoliveday.click_revenue ,
mxbasket14day.last_basket ,
mxsale14day.last_sale,
CASE when mxbasket14day.last_basket is NULL
then (case when mxsale14day.last_sale is NULL
then (case when allclicksgoliveday.context_id is NULL then 'Unknown'
when allclicksgoliveday.context_id in ('2','3','4','5','10') then 'Browsers'
when allclicksgoliveday.context_id in ('7','8','9') then 'Customers'
when allclicksgoliveday.context_id in ('6') then 'Shoppers'
end)
ELSE (case when allclicksgoliveday.context_id is NULL then 'Customers'
when allclicksgoliveday.context_id in ('2','3','4','5','10') then 'Browsers'
when allclicksgoliveday.context_id in ('7','8','9') then 'Customers'
when allclicksgoliveday.context_id in ('6') then 'Shoppers'
end)
end)
ELSE (case when mxsale14day.last_sale is NULL then 'Abndn Crt'
when mxsale14day.last_sale > mxbasket14day.last_basket
then (case when allclicksgoliveday.context_id is NULL then 'Customers'
when allclicksgoliveday.context_id in ('2','3','4','5','10') then 'Browsers'
when allclicksgoliveday.context_id in ('7','8','9') then 'Customers'
when allclicksgoliveday.context_id in ('6') then 'Shoppers'
end)
ELSE 'Abndn Crt'
end)
END Test5CX
from
-- every click on golivedate
(
select
cl.user_id ,
cl.unixtime a_click ,
context_id ,
revenue_local click_revenue
from
bi_data.partnerdb_bi_click cl
where
partner_id = 1731
and partner_partition = 731
and DAY = ${hiveconf:golivedate}
and cl.user_id <> '00000000-0000-0000-0000-000000000000'
)
allclicksgoliveday
left outer join
-- the date of the most recent (last) basket event for a User between abandondate and golivedate
(
select
ae1.user_id ,
MAX(ae1.unixtime) last_basket
from
bi_data.partnerdb_bi_advertiser_event ae1
where
ae1.event_name = 'Basket'
--and partner_partition = pmod(${hiveconf:partnerid},1000)
and partner_id = 1731
and DAY BETWEEN ${hiveconf:abandondate} and ${hiveconf:golivedate}
and ae1.user_id <> '00000000-0000-0000-0000-000000000000'
group by
ae1.user_id
)
mxbasket14day
on allclicksgoliveday.user_id = mxbasket14day.user_id
left outer join
-- the date of the most recent (last) Sale event for a User between abandondate and golivedate
(
select
ae2.user_id ,
MAX(ae2.unixtime) last_sale
from
bi_data.partnerdb_bi_advertiser_event ae2
where
ae2.event_name = 'Sales'
--and partner_partition = pmod(${hiveconf:partnerid},1000)
and partner_id = 1731
and DAY BETWEEN ${hiveconf:abandondate} and ${hiveconf:golivedate}
and ae2.user_id <> '00000000-0000-0000-0000-000000000000'
group by
ae2.user_id
)
mxsale14day
on allclicksgoliveday.user_id = mxsale14day.user_id
)
maxcontext
Group By
maxcontext.Test5CX
)
clicksegment
LEFT OUTER JOIN
-- ================================================================
-- ================================================================
-- The above and below parts are the two major parts to this query
-- transsegment below is joined to the above part
-- on the segment names: Browsers, Customers, Shoppers, Abandoned Cart
-- The two parts otherwise do NOT mix. Debug them separately.
-- ================================================================
-- ================================================================
-- Subqery transsegment
-- Buckets all transactions from golivedate
-- into segments of Browsers, Customers, Shoppers, and Abandoned Cart.
-- Also gets the Order Value
(
select
attribtrans.attrib
,sum(attribtrans.sales) sales
,sum(attribtrans.ordervalue) ordervalue
from
-- Subquery attribtrans
-- With all the transactions, we find which came from Abandoned Cart users
-- (as defined by subquery acartfinder)
-- We look at the context_id to attribute the other sales to their own campaigns
(
Select
ordervaluez.userid,
ordervaluez.context_id,
ordervaluez.sales,
ordervaluez.ordervalue,
Case when acartfinder.bkaftersl is NULL
then (Case when ordervaluez.context_id is NULL then 'Unknown'
when ordervaluez.context_id in ('2','3','4','5','10') then 'Browsers'
when ordervaluez.context_id in ('7','8','9') then 'Customers'
when ordervaluez.context_id in ('6') then 'Shoppers'
end)
else 'Abndn Crt'
end attrib
From
-- Subquery ordervaluez
-- Using the matched_transactions_rich table, sum up the number of sales and total value of
-- all transactions that happen up to 30 days before golivedate. (Currently same as CPOP Order Value)
(
select
rawtransactions1.userid,
rawtransactions1.context_id,
count(distinct rawtransactions1.sales) as sales,
sum(rawtransactions1.ordervalue) as ordervalue
from
(
select
mtr.user_id as userid
,mtr.context_id
,mtr.transaction_id as sales
,mtr.order_value as ordervalue
,mtr.trans_timestamp
,mtr.click_timestamp
from bi_data.cpop_matched_transactions_rich mtr
where mtr.partner_id = 1731
and mtr.day = ${hiveconf:golivedate}
and mtr.click_timestamp IS NOT NULL
) rawtransactions1
group by
rawtransactions1.userid,
rawtransactions1.context_id
)
ordervaluez
Left Outer Join
-- Subqeury acartfinder
-- Finds Sales and Basket events up to 45 days ago for transactions from golivedate.
-- It compares Sales and Basket events from the same user to find out which happened first/last
-- When a basket for the user is the last or only event, this will become an Abandonded Cart user.
(
SELECT
doubleabandon.userid,
dabasketonly.baskettime,
dasalesonly.salestime,
case when dabasketonly.baskettime > dasalesonly.salestime then 'Yes: BK > Sale'
-- when check > 1 instead of IS NOT NULL because that is faster for query execution
-- unixtime in partnerdb_bi_advertiser_event is an INT so this check is OK to use
when dabasketonly.baskettime > 1 and dasalesonly.salestime is NULL then 'Yes: BK, no Sl'
end bkaftersl
FROM
-- Subquery doubleabandon
-- Finds every basket event up to 15 days before the last_click of a matched transaction
-- finds every sales event up to 15 days before the last_click of a matched transaction
-- these are unioned for comparison to see which possibly happened before the other
-- The purpose of the suqbquery is to help us identify which users that transacted on
-- golivedate that were previously abandonded cart users (This is not the same as users who
-- became abandoned cart on golivedate)
(
-- query for last basket events
SELECT
ae.user_id userid,
max(ae.unixtime) eventime,
'Basket' evtype
FROM
bi_data.partnerdb_bi_advertiser_event ae
JOIN
-- rawtransactions gives the 30day post click trans for a golivedate
(
select
mtr.user_id ,
mtr.click_timestamp as last_assoc_click
from
bi_data.cpop_matched_transactions_rich mtr
where
mtr.partner_id = 1731
and mtr.day = ${hiveconf:golivedate}
and mtr.click_timestamp IS NOT NULL
) rawtransactionsdb
ON ae.user_id = rawtransactionsdb.user_id
WHERE
-- restrictions on advertiser_event
ae.event_name = 'Basket'
AND partner_partition = 731
AND partner_id = 1731
AND DAY BETWEEN ${hiveconf:doublebasketdate} AND ${hiveconf:golivedate}
AND ae.user_id <> '00000000-0000-0000-0000-000000000000'
-- restrictions on the join.
-- These can't be in the join clause since only equi join is allowed in hive
-- the event happens before the click
AND ae.unixtime < rawtransactionsdb.last_assoc_click
-- adding 15 days to the event time should exceed the click time
-- if it did not, that would mean the click is MORE than 15 days away
AND ae.unixtime + ${hiveconf:acwindow}*24*3600 >= rawtransactionsdb.last_assoc_click
GROUP BY
ae.user_id
UNION ALL
-- query for last sales events
SELECT
ae2.user_id userid,
max(ae2.unixtime) eventime,
'Sales' evtype
FROM
bi_data.partnerdb_bi_advertiser_event ae2
JOIN
-- rawtransactions gives the 30day post click trans for a golivedate
(
select
mtr.user_id ,
mtr.click_timestamp as last_assoc_click
from
bi_data.cpop_matched_transactions_rich mtr
where
mtr.partner_id = 1731
and mtr.day = ${hiveconf:golivedate}
and mtr.click_timestamp IS NOT NULL
) rawtransactionsds
ON ae2.user_id = rawtransactionsds.user_id
WHERE
-- restrictions on advertiser_event
ae2.event_name = 'Sales'
AND partner_partition = 731
AND partner_id = 1731
AND DAY BETWEEN ${hiveconf:doublebasketdate} AND ${hiveconf:golivedate}
AND ae2.user_id <> '00000000-0000-0000-0000-000000000000'
-- restrictions on the join.
-- These can't be in the join clause since only equi join is allowed in hive
-- the event happens before the click
AND ae2.unixtime < rawtransactionsds.last_assoc_click
-- adding 15 days to the event time should exceed the click time
-- if it did not, that would mean the click is MORE than 15 days away
AND ae2.unixtime + ${hiveconf:acwindow}*24*3600 >= rawtransactionsds.last_assoc_click
GROUP BY
ae2.user_id
)
doubleabandon
LEFT OUTER JOIN
-- query for last basket events
(
SELECT
aeb.user_id as userid,
max(aeb.unixtime) as baskettime,
'Basket' as evtype
FROM
bi_data.partnerdb_bi_advertiser_event aeb
JOIN
-- rawtransactions gives the 30day post click trans for a golivedate
(
select
mtr.user_id ,
mtr.click_timestamp as last_assoc_click
from
bi_data.cpop_matched_transactions_rich mtr
where
mtr.partner_id = 1731
and mtr.day = ${hiveconf:golivedate}
and mtr.click_timestamp IS NOT NULL
) rawtransactions
ON aeb.user_id = rawtransactions.user_id
WHERE
-- restrictions on advertiser_event
event_name = 'Basket'
AND partner_partition = 731
AND partner_id = 1731
AND DAY BETWEEN ${hiveconf:doublebasketdate} AND ${hiveconf:golivedate}
AND aeb.user_id <> '00000000-0000-0000-0000-000000000000'
-- restrictions on the join.
-- These can't be in the join clause since only equi join is allowed in hive
-- the event happens before the click
AND aeb.unixtime < rawtransactions.last_assoc_click
-- adding 15 days to the event time should exceed the click time
-- if it did not, that would mean the click is MORE than 15 days away
AND aeb.unixtime + ${hiveconf:acwindow}*24*3600 >= rawtransactions.last_assoc_click
GROUP BY
aeb.user_id
)
dabasketonly
ON doubleabandon.userid = dabasketonly.userid
AND dabasketonly.evtype = 'Basket'
LEFT OUTER JOIN
-- query for last sales events
(
SELECT
aes.user_id as userid,
max(aes.unixtime) as salestime,
'Sales' as evtype
FROM
bi_data.partnerdb_bi_advertiser_event aes
JOIN
-- rawtransactions gives the 30day post click trans for a golivedate
(
select
mtr.user_id ,
mtr.click_timestamp as last_assoc_click
from
bi_data.cpop_matched_transactions_rich mtr
where
mtr.partner_id = 1731
and mtr.day = ${hiveconf:golivedate}
and mtr.click_timestamp IS NOT NULL
) rawtransactions
ON aes.user_id = rawtransactions.user_id
WHERE
-- restrictions on advertiser_event
event_name = 'Sales'
AND partner_partition = 731
AND partner_id = 1731
AND DAY BETWEEN ${hiveconf:doublebasketdate} AND ${hiveconf:golivedate}
AND aes.user_id <> '00000000-0000-0000-0000-000000000000'
-- restrictions on the join.
-- These can't be in the join clause since only equi join is allowed in hive
-- the event happens before the click
AND aes.unixtime < rawtransactions.last_assoc_click
-- adding 15 days to the event time should exceed the click time
-- if it did not, that would mean the click is MORE than 15 days away
AND aes.unixtime + ${hiveconf:acwindow}*24*3600 >= rawtransactions.last_assoc_click
GROUP BY
aes.user_id
)
dasalesonly
ON doubleabandon.userid = dasalesonly.userid
AND dasalesonly.evtype = 'Sales'
GROUP BY
doubleabandon.userid,
dabasketonly.baskettime,
dasalesonly.salestime
)
acartfinder
On ordervaluez.userid = acartfinder.userid
)
attribtrans
group by
attribtrans.attrib
)
transsegment
ON clicksegment.campaign = transsegment.attrib
@QuantVI
Copy link
Author

QuantVI commented Jul 24, 2019

This was the longest and most intricate query I wrote as a Business Intelligence analyst. It's full of various forms of business logic, and even required hard-coding of certain items. Since the data behind the query was indeed "big data", it turned out that without certain times hard-coded, the query would fail. This final version of the query, a .hql file for a HIVE/hadoop cluster, still took 10+ minutes to return results.

From the initial design, to re-implementation to testing, re-testing, debugging, getting help with optimizations (such as the rank function fro m the software devs), to the final version was more than 3 months. Several other people in the Business Intelligence/Data Analytics department had tried to write a similar query, but ran into walls (e.g. query timeout) due to the vast amounts of data the different parts of the query had to access and keep track of during execution.

It was only through many questions about underlying tables, and chiefly from narrowing in on the exact and most minimal logic needed to tie users together for proper attribution that this was completed. It is of course the crowning achievement of my time in that role.

The query spans 466 lines, with comments, s p a c i n g   and indentation indicative of its long evolution, and required to keep it understandable. For a given client, it combed millions of rows in multiple tables, and took 10+ minutes to either complete or crash. In contrast, the output was something very simple, resembling the following:

campaign clicks clickrevenue sales ordervalue
Abndn Crt 4012 2094.969 409 36117.214
Browsers 1309 226.759 37 2569.686
Customers 3079 1806.089 260 24738.453
Shoppers 4199 963.430 136 12385.677
Unknown 49 18.050 NULL NULL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment