Skip to content

Instantly share code, notes, and snippets.

@tswast
Created April 4, 2017 19:29
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save tswast/19f1e7988095207e9dd34d9031e0c816 to your computer and use it in GitHub Desktop.
Save tswast/19f1e7988095207e9dd34d9031e0c816 to your computer and use it in GitHub Desktop.
USA Names Conditional Probabilities
#standardSQL
SELECT
a.name AS name,
a.state AS state,
a.gender AS gender,
a.year AS year,
a.number AS number,
(a.number / b.total_number) AS name_frequency
FROM
`bigquery-public-data.usa_names.usa_1910_current` a
JOIN (
SELECT
name,
gender,
year,
SUM(number) AS total_number
FROM
`bigquery-public-data.usa_names.usa_1910_current`
GROUP BY
name,
gender,
year) b
ON
a.name = b.name
AND a.gender = b.gender
AND a.year = b.year
@tswast
Copy link
Author

tswast commented Apr 4, 2017

Why is name frequency equal to “number / total number”? The name frequency values represent conditional probabilities, for example

P(State=OH|Name=Pearl & Gender=M & Year=1917) =
    P(State=OH & Name=Pearl & Gender=M & Year=1917) /
    P(Name=Pearl & Gender=M & Year=1917)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment