Skip to content

Instantly share code, notes, and snippets.

@pif
Last active August 21, 2018 12:45
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save pif/49c21510e637d3dee6389d904f894d97 to your computer and use it in GitHub Desktop.
Save pif/49c21510e637d3dee6389d904f894d97 to your computer and use it in GitHub Desktop.
---- Query 1: query all dataset and write presence of tags into temp table `tags_counts`
SELECT
url,
CAST(REGEXP_CONTAINS(body, r'(?i)<div\s[^>]*\s*onclick\s*=\s*') AS INT64) tag_div,
CAST(REGEXP_CONTAINS(body, r'(?i)<a\s[^>]*\s*onclick\s*=\s*') AS INT64) tag_a,
CAST(REGEXP_CONTAINS(body, r'(?i)<button\s') AS INT64) tag_button,
CAST(REGEXP_CONTAINS(body, r'(?i)<input\s[^>]*\s*type\s*=\s*[\'"]button[\'"]') AS INT64) tag_input
FROM
`httparchive.response_bodies.2018_08_01_desktop`
WHERE page = url
---- Query 2: get some aggregated stats
SELECT COUNT(url) urls, SUM(tag_div) tag_div, SUM(tag_a) tag_a, SUM(tag_button) tag_button, SUM(tag_input) tag_input
FROM
`http_archive.tags_counts` counts
WHERE
tag_div > 0 OR tag_a > 0 OR tag_button > 0 OR tag_input > 0
----
-- urls tag_div tag_a tag_button tag_input
-- 781,290 86,881 361,872 533,464 81,187
-- 100% 11.12% 46.32% 68.28% 10.39%
-- Out of all pages 68% use `button`, 46% use `a onclick`. `div onclick` and `input type=button` are used on ~11% websites each.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment