Created
July 24, 2019 00:08
-
-
Save QuantVI/4d32c4ddde109a1ba31a0edb4197d00e to your computer and use it in GitHub Desktop.
A HIVE/Hadoop HQL query over a large distributed cluster. For a single client, it scans 45+ days worth of re-targeting campaign data related to clicks, click revenue, sales and order value. It returns an aggregated result based on site-visitor segments, such as "Abandoned Cart" visitors.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-- Query to breakout events into legacy segmentation | |
CREATE TEMPORARY FUNCTION rank AS 'ABCxyz.QWERTY.hadoop.hive.udf.Rank'; | |
SET mapred.map.output.compression.codec = ABCxyz.hadoop.compression.lzo.LzopCodec; | |
SET hive.cli.print.header = TRUE; | |
SET mapred.job.priority = HIGH; | |
SET partnerid = 1731; | |
SET golivedate = "2014-03-01"; | |
-- abandondate is 15 days before golivedate | |
SET abandondate = "2014-02-14"; | |
-- lowerfunneldate is 11 days before golivedate | |
SET lowerfunneldate = "2014-02-18"; | |
-- doublebasket is 45 days before golivedate | |
SET doublebasketdate = "2014-01-15"; | |
-- The abandoned cart eligible window in days | |
SET acwindow = 15; | |
SELECT | |
clicksegment.campaign | |
,clicksegment.clicks | |
,clicksegment.clickrevenue | |
,transsegment.sales | |
,transsegment.ordervalue | |
FROM | |
-- ================================================================ | |
-- ================================================================ | |
-- The below part is one of two major parts to this query | |
-- clicksegment below is joined to the lower part | |
-- on the segment names: Browsers, Customers, Shoppers, Abandoned Cart | |
-- The two parts otherwise do NOT mix. Debug them separately. | |
-- ================================================================ | |
-- ================================================================ | |
-- Subquery clicksegment | |
-- Buckets all clicks from golivedate | |
-- into segments of Browsers, Customers, Shoppers, and Abandoned Cart. | |
-- Also gets the Click Revenue | |
-- NOTE. Since the clicks can have nulls, there is an Unknown Segment group | |
-- in cligksegment, that is not defined in the lower part. That is why we take segment names | |
-- from click segment. | |
( | |
Select | |
maxcontext.Test5CX campaign | |
,count(maxcontext.a_click) clicks | |
,sum(maxcontext.click_revenue) clickrevenue | |
From | |
-- Subquery maxcontext | |
-- 1 row per user click on golivedate, with the context of that click. | |
-- For each user associated with the click, it also gives | |
-- the last basket event, and last sales event for the user. | |
-- The last_basket and last_sale events are timestamps to compare with the click timestamp | |
( | |
select | |
allclicksgoliveday.user_id as userid, | |
allclicksgoliveday.a_click , | |
allclicksgoliveday.context_id , | |
allclicksgoliveday.click_revenue , | |
mxbasket14day.last_basket , | |
mxsale14day.last_sale, | |
CASE when mxbasket14day.last_basket is NULL | |
then (case when mxsale14day.last_sale is NULL | |
then (case when allclicksgoliveday.context_id is NULL then 'Unknown' | |
when allclicksgoliveday.context_id in ('2','3','4','5','10') then 'Browsers' | |
when allclicksgoliveday.context_id in ('7','8','9') then 'Customers' | |
when allclicksgoliveday.context_id in ('6') then 'Shoppers' | |
end) | |
ELSE (case when allclicksgoliveday.context_id is NULL then 'Customers' | |
when allclicksgoliveday.context_id in ('2','3','4','5','10') then 'Browsers' | |
when allclicksgoliveday.context_id in ('7','8','9') then 'Customers' | |
when allclicksgoliveday.context_id in ('6') then 'Shoppers' | |
end) | |
end) | |
ELSE (case when mxsale14day.last_sale is NULL then 'Abndn Crt' | |
when mxsale14day.last_sale > mxbasket14day.last_basket | |
then (case when allclicksgoliveday.context_id is NULL then 'Customers' | |
when allclicksgoliveday.context_id in ('2','3','4','5','10') then 'Browsers' | |
when allclicksgoliveday.context_id in ('7','8','9') then 'Customers' | |
when allclicksgoliveday.context_id in ('6') then 'Shoppers' | |
end) | |
ELSE 'Abndn Crt' | |
end) | |
END Test5CX | |
from | |
-- every click on golivedate | |
( | |
select | |
cl.user_id , | |
cl.unixtime a_click , | |
context_id , | |
revenue_local click_revenue | |
from | |
bi_data.partnerdb_bi_click cl | |
where | |
partner_id = 1731 | |
and partner_partition = 731 | |
and DAY = ${hiveconf:golivedate} | |
and cl.user_id <> '00000000-0000-0000-0000-000000000000' | |
) | |
allclicksgoliveday | |
left outer join | |
-- the date of the most recent (last) basket event for a User between abandondate and golivedate | |
( | |
select | |
ae1.user_id , | |
MAX(ae1.unixtime) last_basket | |
from | |
bi_data.partnerdb_bi_advertiser_event ae1 | |
where | |
ae1.event_name = 'Basket' | |
--and partner_partition = pmod(${hiveconf:partnerid},1000) | |
and partner_id = 1731 | |
and DAY BETWEEN ${hiveconf:abandondate} and ${hiveconf:golivedate} | |
and ae1.user_id <> '00000000-0000-0000-0000-000000000000' | |
group by | |
ae1.user_id | |
) | |
mxbasket14day | |
on allclicksgoliveday.user_id = mxbasket14day.user_id | |
left outer join | |
-- the date of the most recent (last) Sale event for a User between abandondate and golivedate | |
( | |
select | |
ae2.user_id , | |
MAX(ae2.unixtime) last_sale | |
from | |
bi_data.partnerdb_bi_advertiser_event ae2 | |
where | |
ae2.event_name = 'Sales' | |
--and partner_partition = pmod(${hiveconf:partnerid},1000) | |
and partner_id = 1731 | |
and DAY BETWEEN ${hiveconf:abandondate} and ${hiveconf:golivedate} | |
and ae2.user_id <> '00000000-0000-0000-0000-000000000000' | |
group by | |
ae2.user_id | |
) | |
mxsale14day | |
on allclicksgoliveday.user_id = mxsale14day.user_id | |
) | |
maxcontext | |
Group By | |
maxcontext.Test5CX | |
) | |
clicksegment | |
LEFT OUTER JOIN | |
-- ================================================================ | |
-- ================================================================ | |
-- The above and below parts are the two major parts to this query | |
-- transsegment below is joined to the above part | |
-- on the segment names: Browsers, Customers, Shoppers, Abandoned Cart | |
-- The two parts otherwise do NOT mix. Debug them separately. | |
-- ================================================================ | |
-- ================================================================ | |
-- Subqery transsegment | |
-- Buckets all transactions from golivedate | |
-- into segments of Browsers, Customers, Shoppers, and Abandoned Cart. | |
-- Also gets the Order Value | |
( | |
select | |
attribtrans.attrib | |
,sum(attribtrans.sales) sales | |
,sum(attribtrans.ordervalue) ordervalue | |
from | |
-- Subquery attribtrans | |
-- With all the transactions, we find which came from Abandoned Cart users | |
-- (as defined by subquery acartfinder) | |
-- We look at the context_id to attribute the other sales to their own campaigns | |
( | |
Select | |
ordervaluez.userid, | |
ordervaluez.context_id, | |
ordervaluez.sales, | |
ordervaluez.ordervalue, | |
Case when acartfinder.bkaftersl is NULL | |
then (Case when ordervaluez.context_id is NULL then 'Unknown' | |
when ordervaluez.context_id in ('2','3','4','5','10') then 'Browsers' | |
when ordervaluez.context_id in ('7','8','9') then 'Customers' | |
when ordervaluez.context_id in ('6') then 'Shoppers' | |
end) | |
else 'Abndn Crt' | |
end attrib | |
From | |
-- Subquery ordervaluez | |
-- Using the matched_transactions_rich table, sum up the number of sales and total value of | |
-- all transactions that happen up to 30 days before golivedate. (Currently same as CPOP Order Value) | |
( | |
select | |
rawtransactions1.userid, | |
rawtransactions1.context_id, | |
count(distinct rawtransactions1.sales) as sales, | |
sum(rawtransactions1.ordervalue) as ordervalue | |
from | |
( | |
select | |
mtr.user_id as userid | |
,mtr.context_id | |
,mtr.transaction_id as sales | |
,mtr.order_value as ordervalue | |
,mtr.trans_timestamp | |
,mtr.click_timestamp | |
from bi_data.cpop_matched_transactions_rich mtr | |
where mtr.partner_id = 1731 | |
and mtr.day = ${hiveconf:golivedate} | |
and mtr.click_timestamp IS NOT NULL | |
) rawtransactions1 | |
group by | |
rawtransactions1.userid, | |
rawtransactions1.context_id | |
) | |
ordervaluez | |
Left Outer Join | |
-- Subqeury acartfinder | |
-- Finds Sales and Basket events up to 45 days ago for transactions from golivedate. | |
-- It compares Sales and Basket events from the same user to find out which happened first/last | |
-- When a basket for the user is the last or only event, this will become an Abandonded Cart user. | |
( | |
SELECT | |
doubleabandon.userid, | |
dabasketonly.baskettime, | |
dasalesonly.salestime, | |
case when dabasketonly.baskettime > dasalesonly.salestime then 'Yes: BK > Sale' | |
-- when check > 1 instead of IS NOT NULL because that is faster for query execution | |
-- unixtime in partnerdb_bi_advertiser_event is an INT so this check is OK to use | |
when dabasketonly.baskettime > 1 and dasalesonly.salestime is NULL then 'Yes: BK, no Sl' | |
end bkaftersl | |
FROM | |
-- Subquery doubleabandon | |
-- Finds every basket event up to 15 days before the last_click of a matched transaction | |
-- finds every sales event up to 15 days before the last_click of a matched transaction | |
-- these are unioned for comparison to see which possibly happened before the other | |
-- The purpose of the suqbquery is to help us identify which users that transacted on | |
-- golivedate that were previously abandonded cart users (This is not the same as users who | |
-- became abandoned cart on golivedate) | |
( | |
-- query for last basket events | |
SELECT | |
ae.user_id userid, | |
max(ae.unixtime) eventime, | |
'Basket' evtype | |
FROM | |
bi_data.partnerdb_bi_advertiser_event ae | |
JOIN | |
-- rawtransactions gives the 30day post click trans for a golivedate | |
( | |
select | |
mtr.user_id , | |
mtr.click_timestamp as last_assoc_click | |
from | |
bi_data.cpop_matched_transactions_rich mtr | |
where | |
mtr.partner_id = 1731 | |
and mtr.day = ${hiveconf:golivedate} | |
and mtr.click_timestamp IS NOT NULL | |
) rawtransactionsdb | |
ON ae.user_id = rawtransactionsdb.user_id | |
WHERE | |
-- restrictions on advertiser_event | |
ae.event_name = 'Basket' | |
AND partner_partition = 731 | |
AND partner_id = 1731 | |
AND DAY BETWEEN ${hiveconf:doublebasketdate} AND ${hiveconf:golivedate} | |
AND ae.user_id <> '00000000-0000-0000-0000-000000000000' | |
-- restrictions on the join. | |
-- These can't be in the join clause since only equi join is allowed in hive | |
-- the event happens before the click | |
AND ae.unixtime < rawtransactionsdb.last_assoc_click | |
-- adding 15 days to the event time should exceed the click time | |
-- if it did not, that would mean the click is MORE than 15 days away | |
AND ae.unixtime + ${hiveconf:acwindow}*24*3600 >= rawtransactionsdb.last_assoc_click | |
GROUP BY | |
ae.user_id | |
UNION ALL | |
-- query for last sales events | |
SELECT | |
ae2.user_id userid, | |
max(ae2.unixtime) eventime, | |
'Sales' evtype | |
FROM | |
bi_data.partnerdb_bi_advertiser_event ae2 | |
JOIN | |
-- rawtransactions gives the 30day post click trans for a golivedate | |
( | |
select | |
mtr.user_id , | |
mtr.click_timestamp as last_assoc_click | |
from | |
bi_data.cpop_matched_transactions_rich mtr | |
where | |
mtr.partner_id = 1731 | |
and mtr.day = ${hiveconf:golivedate} | |
and mtr.click_timestamp IS NOT NULL | |
) rawtransactionsds | |
ON ae2.user_id = rawtransactionsds.user_id | |
WHERE | |
-- restrictions on advertiser_event | |
ae2.event_name = 'Sales' | |
AND partner_partition = 731 | |
AND partner_id = 1731 | |
AND DAY BETWEEN ${hiveconf:doublebasketdate} AND ${hiveconf:golivedate} | |
AND ae2.user_id <> '00000000-0000-0000-0000-000000000000' | |
-- restrictions on the join. | |
-- These can't be in the join clause since only equi join is allowed in hive | |
-- the event happens before the click | |
AND ae2.unixtime < rawtransactionsds.last_assoc_click | |
-- adding 15 days to the event time should exceed the click time | |
-- if it did not, that would mean the click is MORE than 15 days away | |
AND ae2.unixtime + ${hiveconf:acwindow}*24*3600 >= rawtransactionsds.last_assoc_click | |
GROUP BY | |
ae2.user_id | |
) | |
doubleabandon | |
LEFT OUTER JOIN | |
-- query for last basket events | |
( | |
SELECT | |
aeb.user_id as userid, | |
max(aeb.unixtime) as baskettime, | |
'Basket' as evtype | |
FROM | |
bi_data.partnerdb_bi_advertiser_event aeb | |
JOIN | |
-- rawtransactions gives the 30day post click trans for a golivedate | |
( | |
select | |
mtr.user_id , | |
mtr.click_timestamp as last_assoc_click | |
from | |
bi_data.cpop_matched_transactions_rich mtr | |
where | |
mtr.partner_id = 1731 | |
and mtr.day = ${hiveconf:golivedate} | |
and mtr.click_timestamp IS NOT NULL | |
) rawtransactions | |
ON aeb.user_id = rawtransactions.user_id | |
WHERE | |
-- restrictions on advertiser_event | |
event_name = 'Basket' | |
AND partner_partition = 731 | |
AND partner_id = 1731 | |
AND DAY BETWEEN ${hiveconf:doublebasketdate} AND ${hiveconf:golivedate} | |
AND aeb.user_id <> '00000000-0000-0000-0000-000000000000' | |
-- restrictions on the join. | |
-- These can't be in the join clause since only equi join is allowed in hive | |
-- the event happens before the click | |
AND aeb.unixtime < rawtransactions.last_assoc_click | |
-- adding 15 days to the event time should exceed the click time | |
-- if it did not, that would mean the click is MORE than 15 days away | |
AND aeb.unixtime + ${hiveconf:acwindow}*24*3600 >= rawtransactions.last_assoc_click | |
GROUP BY | |
aeb.user_id | |
) | |
dabasketonly | |
ON doubleabandon.userid = dabasketonly.userid | |
AND dabasketonly.evtype = 'Basket' | |
LEFT OUTER JOIN | |
-- query for last sales events | |
( | |
SELECT | |
aes.user_id as userid, | |
max(aes.unixtime) as salestime, | |
'Sales' as evtype | |
FROM | |
bi_data.partnerdb_bi_advertiser_event aes | |
JOIN | |
-- rawtransactions gives the 30day post click trans for a golivedate | |
( | |
select | |
mtr.user_id , | |
mtr.click_timestamp as last_assoc_click | |
from | |
bi_data.cpop_matched_transactions_rich mtr | |
where | |
mtr.partner_id = 1731 | |
and mtr.day = ${hiveconf:golivedate} | |
and mtr.click_timestamp IS NOT NULL | |
) rawtransactions | |
ON aes.user_id = rawtransactions.user_id | |
WHERE | |
-- restrictions on advertiser_event | |
event_name = 'Sales' | |
AND partner_partition = 731 | |
AND partner_id = 1731 | |
AND DAY BETWEEN ${hiveconf:doublebasketdate} AND ${hiveconf:golivedate} | |
AND aes.user_id <> '00000000-0000-0000-0000-000000000000' | |
-- restrictions on the join. | |
-- These can't be in the join clause since only equi join is allowed in hive | |
-- the event happens before the click | |
AND aes.unixtime < rawtransactions.last_assoc_click | |
-- adding 15 days to the event time should exceed the click time | |
-- if it did not, that would mean the click is MORE than 15 days away | |
AND aes.unixtime + ${hiveconf:acwindow}*24*3600 >= rawtransactions.last_assoc_click | |
GROUP BY | |
aes.user_id | |
) | |
dasalesonly | |
ON doubleabandon.userid = dasalesonly.userid | |
AND dasalesonly.evtype = 'Sales' | |
GROUP BY | |
doubleabandon.userid, | |
dabasketonly.baskettime, | |
dasalesonly.salestime | |
) | |
acartfinder | |
On ordervaluez.userid = acartfinder.userid | |
) | |
attribtrans | |
group by | |
attribtrans.attrib | |
) | |
transsegment | |
ON clicksegment.campaign = transsegment.attrib | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This was the longest and most intricate query I wrote as a Business Intelligence analyst. It's full of various forms of business logic, and even required hard-coding of certain items. Since the data behind the query was indeed "big data", it turned out that without certain times hard-coded, the query would fail. This final version of the query, a .hql file for a HIVE/hadoop cluster, still took 10+ minutes to return results.
From the initial design, to re-implementation to testing, re-testing, debugging, getting help with optimizations (such as the rank function fro m the software devs), to the final version was more than 3 months. Several other people in the Business Intelligence/Data Analytics department had tried to write a similar query, but ran into walls (e.g. query timeout) due to the vast amounts of data the different parts of the query had to access and keep track of during execution.
It was only through many questions about underlying tables, and chiefly from narrowing in on the exact and most minimal logic needed to tie users together for proper attribution that this was completed. It is of course the crowning achievement of my time in that role.
The query spans 466 lines, with
comments
, s p a c i n g and indentation indicative of its long evolution, and required to keep it understandable. For a given client, it combed millions of rows in multiple tables, and took 10+ minutes to either complete or crash. In contrast, the output was something very simple, resembling the following: