@steren steren/regexp-github.sql
Last active Oct 24, 2019

Extract constant Go regular expressions from GitHub
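# Extracts constant Go regular expressions (string literals passed to
# Compile, MustCompile, or MatchString calls) from GitHub,
# using the BigQuery GitHub public dataset.
# Written in BigQuery legacy SQL (note the [project:dataset.table] references).
# To run on the entire GitHub corpus,
# remove the `sample_` prefix from the table names.
# Warning: This query processes ~2.2 TB of data, which exceeds the BigQuery free quota.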
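SELECT
  # Strip the surrounding quotes or backticks from the matched literal.
  REGEXP_EXTRACT(pattern, r'^[\"\`](.*)[\"\`]$') AS pattern,
  COUNT(*) AS cnt
FROM (
  SELECT
    # Capture the string literal passed as the first argument of the call.
    REGEXP_EXTRACT(content, r'\.(?:(?:Must)?Compile|MatchString)\((\"[^\"]+\"|\`[^\`]+\`)') AS pattern
  FROM (
    SELECT
      id,
      # One fragment per occurrence of "regexp", so each call is examined separately.
      SPLIT(content, "regexp") AS content
    FROM
      [bigquery-public-data:github_repos.sample_contents]
    WHERE
      REGEXP_MATCH(content, r'\.(?:(?:Must)?Compile|MatchString)\((\"[^\"]+\"|\`[^\`]+\`)')) AS C
  JOIN (
    # Keep only blobs that appear under a .go path.
    SELECT
      id
    FROM
      [bigquery-public-data:github_repos.sample_files]
    WHERE
      path LIKE '%.go'
    GROUP BY
      id) AS F
  ON
    C.id = F.id )
WHERE
  # REGEXP_EXTRACT returns NULL when a fragment contains no match.
  pattern IS NOT NULL
GROUP BY
  pattern
ORDER BY
  cnt DESC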
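
The query above uses BigQuery legacy SQL. For reference, here is a rough sketch of how the same extraction might look in BigQuery standard SQL. It is an untested port, not the author's query, and it makes a few assumptions: the alias names `c`, `fragment`, and `quoted` are made up for illustration, the join against the files table is written as an `IN` subquery, the separate `REGEXP_MATCH` pre-filter is dropped because the `IS NOT NULL` filter already discards fragments without a match, and the dot before the method name is escaped so that it matches literally.

```sql
#standardSQL
# Sketch only: a standard SQL port of the legacy query, under the assumptions stated above.
SELECT
  # Strip the surrounding quotes or backticks from the matched literal.
  REGEXP_EXTRACT(quoted, r'^["`](.*)["`]$') AS pattern,
  COUNT(*) AS cnt
FROM (
  SELECT
    # Capture the string literal passed as the first argument of the call.
    REGEXP_EXTRACT(fragment, r'\.(?:(?:Must)?Compile|MatchString)\(("[^"]+"|`[^`]+`)') AS quoted
  FROM
    `bigquery-public-data.github_repos.sample_contents` AS c,
    # One row per piece of the file content between occurrences of "regexp".
    UNNEST(SPLIT(c.content, 'regexp')) AS fragment
  WHERE
    # Keep only blobs that appear under a .go path.
    c.id IN (
      SELECT
        id
      FROM
        `bigquery-public-data.github_repos.sample_files`
      WHERE
        path LIKE '%.go')
)
WHERE
  quoted IS NOT NULL
GROUP BY
  pattern
ORDER BY
  cnt DESC
```

As with the legacy version, removing the `sample_` prefix from both table names would run the sketch against the full corpus, with the same cost caveat.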