Skip to content

Instantly share code, notes, and snippets.

@lorinc
Last active August 22, 2019 17:09
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save lorinc/40339e874e2aac8eb08230fed5f7d329 to your computer and use it in GitHub Desktop.
Save lorinc/40339e874e2aac8eb08230fed5f7d329 to your computer and use it in GitHub Desktop.
Visualization of political polarization in Hungary using public parliamentary voting data using Google BigQuery. Pro bono work for atlatszo.hu, the recognized Hungarian pro-democracy data journalist group.
/*
Pairwise distance between every single members of the parliament
active in that specific term, for every single voting that happened.
Generates a 61GB dataset for the 14,452 voting events.
Next step is reducing the amount of data, using roling window aggregation
on the time dimension, and clustering on the people dimension.
*/
WITH
extended AS (
SELECT
person,
party,
vote_timestamp,
-- to reduce data volume, we aggregate
-- distances down to weekly granurality
--
-- also, analytic functions don't work
-- on date/time data types, we need number
DATE(vote_timestamp) AS date_week, -- human readable
DATE_DIFF( -- for analytic functions
DATE(vote_timestamp),
DATE('2009-01-01'),
WEEK
) AS week_no,
vote_subject_hash,
cycle,
(
CASE
WHEN vote = 'igen'
THEN 1
WHEN vote = 'nem'
THEN 0
-- considering absence
-- and not voting equal
ELSE .5
END
) AS vote
FROM
`atlo-parliament-polarization.scraped.ready_for_distance_calculation`
)
SELECT
date_week,
week_no,
vote_timestamp,
vote_subject_hash,
cycle,
ARRAY_AGG(
STRUCT(
q1.person AS source_person,
q1.party AS source_party,
q2.person AS target_person,
q2.party AS target_party,
ABS(q1.vote - q2.vote) AS distance
)
) AS edges
FROM
extended AS q1
FULL OUTER JOIN
extended AS q2
USING(
date_week,
week_no,
cycle,
vote_timestamp,
vote_subject_hash
)
WHERE
-- calculating edges only once
q1.person > q2.person
GROUP BY
date_week,
week_no,
vote_timestamp,
vote_subject_hash,
cycle
/*
This query (falsely) assumes that members of the parliament
serve full cycles, and flags members as absent from voting,
even if they resigned mid-term and not actually active.
*/
WITH
ingestion AS (
SELECT * FROM `atlo-parliament-polarization.scraped.y2010`
UNION ALL
SELECT * FROM `atlo-parliament-polarization.scraped.y2011`
UNION ALL
SELECT * FROM `atlo-parliament-polarization.scraped.y2012`
UNION ALL
SELECT * FROM `atlo-parliament-polarization.scraped.y2013`
UNION ALL
SELECT * FROM `atlo-parliament-polarization.scraped.y2014`
UNION ALL
SELECT * FROM `atlo-parliament-polarization.scraped.y2015`
UNION ALL
SELECT * FROM `atlo-parliament-polarization.scraped.y2016`
UNION ALL
SELECT * FROM `atlo-parliament-polarization.scraped.y2017`
UNION ALL
SELECT * FROM `atlo-parliament-polarization.scraped.y2018`
),
cleaned AS (
SELECT
ingestion.string_field_0 AS person,
ingestion.string_field_1 AS party,
MD5(ingestion.string_field_2) AS vote_subject_hash,
PARSE_TIMESTAMP(
"%Y.%m.%d.%X",
SUBSTR(ingestion.string_field_3, 0, 19),
"Europe/Budapest"
) AS vote_timestamp,
ingestion.string_field_4 AS vote
FROM
ingestion
),
actual_votes AS (
SELECT
person,
party,
vote_subject_hash,
vote_timestamp,
vote,
(
CASE
WHEN vote_timestamp > TIMESTAMP('2018-04-09T00:00:00', 'Europe/Budapest') THEN '2018'
WHEN vote_timestamp > TIMESTAMP('2014-04-17T00:00:00', 'Europe/Budapest') THEN '2014'
WHEN vote_timestamp > TIMESTAMP('2010-04-26T00:00:00', 'Europe/Budapest') THEN '2010'
ELSE '2006'
END
) AS cycle
FROM
cleaned
),
voting_events AS (
SELECT DISTINCT
vote_timestamp,
vote_subject_hash,
cycle
FROM
actual_votes
),
people_per_cycle AS (
SELECT DISTINCT
person,
party,
cycle
FROM
actual_votes
),
all_possible_votes AS (
SELECT
person,
party,
cycle,
vote_timestamp,
vote_subject_hash
FROM
voting_events
LEFT JOIN
people_per_cycle
USING(cycle)
)
SELECT
person,
party,
vote_timestamp,
vote_subject_hash,
apv.cycle,
IFNULL(vote, 'távol') AS vote
FROM
all_possible_votes AS apv
LEFT JOIN
actual_votes
USING(person, party, vote_timestamp, vote_subject_hash)
/*
THIS IS WORK IN PROGRESS
Refactor: ARRAY_AGG(analytic_function) is not supported
1, Weekly distance averages for each person-person pairs
2, Rolling 4-week averages for each person-person pairs
(Averaging averages is *really* not elegant...)
*/
WITH
weekly_distances AS (
SELECT
DATE(MIN(votes.vote_timestamp)) AS week,
cycle,
edges.source_person,
edges.source_party,
edges.target_person,
edges.target_party,
AVG(edges.distance) AS weekly_distance
FROM
`atlo-parliament-polarization.scraped.pairwise_distances_for_14k_voting_events` AS votes
CROSS JOIN
UNNEST(votes.edges) AS edges
GROUP BY
week_no,
cycle,
edges.source_person,
edges.source_party,
edges.target_person,
edges.target_party
LIMIT
1000
)
SELECT
week,
cycle,
ARRAY_AGG(
STRUCT(
source_person,
source_party,
target_person,
target_party,
AVG(weekly_distance)
OVER(
PARTITION BY
week,
cycle,
source_person,
source_party,
target_person,
target_party
ORDER BY
week
ROWS BETWEEN
3 PRECEDING
AND CURRENT ROW
)
)
)
FROM
weekly_distances
LIMIT
50
@lorinc
Copy link
Author

lorinc commented Aug 19, 2019

To use D3.js force-directed graph for visualization, must build distance data in this format: https://gist.githubusercontent.com/mbostock/4062045/raw/5916d145c8c048a6e3086915a6be464467391c62/miserables.json

I do not know, how D3.js will display 40k edges dynamically. Force layout algorithms take cubic time to finish. While I can reduce the granularity on the time scale, I can not simply do the same on the people dimension. It's likely that I got to compromise, and replace interactive visualization with pre-rendered one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment