Skip to content

Instantly share code, notes, and snippets.

@arun057
Created November 13, 2019 08:28
Show Gist options
  • Save arun057/d8dee65ece54de42b574f287ed9e601d to your computer and use it in GitHub Desktop.
Save arun057/d8dee65ece54de42b574f287ed9e601d to your computer and use it in GitHub Desktop.
-- Imported data into postgresql running locally. For larger data sets I would go the map reduce route, seemed overkill here.
-- All of the questions are answered with sql - I did end up double checking some of the results with python - happy to provide that code on request.
-- Consider only the rows with country_id = "BDV" (there are 844 such rows).
-- For each site_id, we can compute the number of unique user_id's found in these 844 rows.
-- Which site_id has the largest number of unique users? And what's the number?
SELECT x.site_id, COUNT(x.site_id) as site_count
FROM (SELECT DISTINCT user_id, site_id FROM visits WHERE country_id = 'BDV') AS x
GROUP BY x.site_id
ORDER BY site_count DESC
LIMIT 1
-- site_id site_count
-- "5NPAU" "544"
-- Between 2019-02-03 00:00:00 and 2019-02-04 23:59:59,
-- there are four users who visited a certain site more than 10 times.
-- Find these four users & which sites they (each) visited more than 10 times.
-- (Simply provides four triples in the form (user_id, site_id, number of visits)
-- in the box below.)
SELECT user_id, site_id, (COUNT(site_id)) AS site_count
FROM visits
WHERE ts > '2019-02-03 00:00:00'
and ts < '2019-02-04 23:59:59'
GROUP BY user_id, site_id
HAVING COUNT(site_id) > 10
ORDER BY site_count DESC;
-- user_id site_id site_count
-- "LC3A59" "N0OTG" "26"
-- "LC06C3" "N0OTG" "25"
-- "LC3C9D" "N0OTG" "17"
-- "LC3C7E" "3POLC" "15"
-- For each site, compute the unique number of users whose last visit
-- (found in the original data set) was to that site.
-- For instance, user "LC3561"'s last visit is to "N0OTG" based on
-- timestamp data. Based on this measure, what are top three sites?
-- (hint: site "3POLC" is ranked at 5th with 28 users whose last
-- visit in the data set was to 3POLC;
-- simply provide three pairs in the form (site_id, number of users).)
SELECT v.site_id, COUNT(v.user_id) as site_count
FROM visits AS v
INNER JOIN (
SELECT MAX(ts) as latest_ts, user_id
FROM visits
GROUP BY user_id
ORDER BY user_id
) AS x
ON v.user_id=x.user_id AND v.ts=x.latest_ts
GROUP BY v.site_id
ORDER BY site_count DESC
LIMIT 3
-- site_id site_count
-- "5NPAU" "992"
-- "N0OTG" "561"
-- "QGO3G" "289"
-- For each user, determine the first site he/she visited and
-- the last site he/she visited based on the timestamp data.
-- Compute the number of users whose first/last visits
-- are to the same website. What is the number?
SELECT COUNT(*) FROM (
SELECT a.user_id, a.site_id, a.ts
FROM visits AS a
INNER JOIN
(
SELECT MIN(c.ts) as first_visit, c.user_id
FROM visits AS c
GROUP BY c.user_id
) AS b
ON b.first_visit=a.ts AND b.user_id=a.user_id
) AS low
INNER JOIN (
SELECT a.user_id, a.site_id, a.ts
FROM visits AS a
INNER JOIN
(
SELECT MAX(c.ts) as first_visit, c.user_id
FROM visits AS c
GROUP BY c.user_id
) AS b
ON b.first_visit=a.ts AND b.user_id=a.user_id
) AS high
ON low.site_id=high.site_id AND low.user_id=high.user_id
-- count
-- "1670"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment