Last active
August 22, 2019 17:09
-
-
Save lorinc/40339e874e2aac8eb08230fed5f7d329 to your computer and use it in GitHub Desktop.
Visualization of political polarization in Hungary using public parliamentary voting data using Google BigQuery. Pro bono work for atlatszo.hu, the recognized Hungarian pro-democracy data journalist group.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* | |
Pairwise distance between every single members of the parliament | |
active in that specific term, for every single voting that happened. | |
Generates a 61GB dataset for the 14,452 voting events. | |
Next step is reducing the amount of data, using roling window aggregation | |
on the time dimension, and clustering on the people dimension. | |
*/ | |
WITH | |
extended AS ( | |
SELECT | |
person, | |
party, | |
vote_timestamp, | |
-- to reduce data volume, we aggregate | |
-- distances down to weekly granurality | |
-- | |
-- also, analytic functions don't work | |
-- on date/time data types, we need number | |
DATE(vote_timestamp) AS date_week, -- human readable | |
DATE_DIFF( -- for analytic functions | |
DATE(vote_timestamp), | |
DATE('2009-01-01'), | |
WEEK | |
) AS week_no, | |
vote_subject_hash, | |
cycle, | |
( | |
CASE | |
WHEN vote = 'igen' | |
THEN 1 | |
WHEN vote = 'nem' | |
THEN 0 | |
-- considering absence | |
-- and not voting equal | |
ELSE .5 | |
END | |
) AS vote | |
FROM | |
`atlo-parliament-polarization.scraped.ready_for_distance_calculation` | |
) | |
SELECT | |
date_week, | |
week_no, | |
vote_timestamp, | |
vote_subject_hash, | |
cycle, | |
ARRAY_AGG( | |
STRUCT( | |
q1.person AS source_person, | |
q1.party AS source_party, | |
q2.person AS target_person, | |
q2.party AS target_party, | |
ABS(q1.vote - q2.vote) AS distance | |
) | |
) AS edges | |
FROM | |
extended AS q1 | |
FULL OUTER JOIN | |
extended AS q2 | |
USING( | |
date_week, | |
week_no, | |
cycle, | |
vote_timestamp, | |
vote_subject_hash | |
) | |
WHERE | |
-- calculating edges only once | |
q1.person > q2.person | |
GROUP BY | |
date_week, | |
week_no, | |
vote_timestamp, | |
vote_subject_hash, | |
cycle |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* | |
This query (falsely) assumes that members of the parliament | |
serve full cycles, and flags members as absent from voting, | |
even if they resigned mid-term and not actually active. | |
*/ | |
WITH | |
ingestion AS ( | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2010` | |
UNION ALL | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2011` | |
UNION ALL | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2012` | |
UNION ALL | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2013` | |
UNION ALL | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2014` | |
UNION ALL | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2015` | |
UNION ALL | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2016` | |
UNION ALL | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2017` | |
UNION ALL | |
SELECT * FROM `atlo-parliament-polarization.scraped.y2018` | |
), | |
cleaned AS ( | |
SELECT | |
ingestion.string_field_0 AS person, | |
ingestion.string_field_1 AS party, | |
MD5(ingestion.string_field_2) AS vote_subject_hash, | |
PARSE_TIMESTAMP( | |
"%Y.%m.%d.%X", | |
SUBSTR(ingestion.string_field_3, 0, 19), | |
"Europe/Budapest" | |
) AS vote_timestamp, | |
ingestion.string_field_4 AS vote | |
FROM | |
ingestion | |
), | |
actual_votes AS ( | |
SELECT | |
person, | |
party, | |
vote_subject_hash, | |
vote_timestamp, | |
vote, | |
( | |
CASE | |
WHEN vote_timestamp > TIMESTAMP('2018-04-09T00:00:00', 'Europe/Budapest') THEN '2018' | |
WHEN vote_timestamp > TIMESTAMP('2014-04-17T00:00:00', 'Europe/Budapest') THEN '2014' | |
WHEN vote_timestamp > TIMESTAMP('2010-04-26T00:00:00', 'Europe/Budapest') THEN '2010' | |
ELSE '2006' | |
END | |
) AS cycle | |
FROM | |
cleaned | |
), | |
voting_events AS ( | |
SELECT DISTINCT | |
vote_timestamp, | |
vote_subject_hash, | |
cycle | |
FROM | |
actual_votes | |
), | |
people_per_cycle AS ( | |
SELECT DISTINCT | |
person, | |
party, | |
cycle | |
FROM | |
actual_votes | |
), | |
all_possible_votes AS ( | |
SELECT | |
person, | |
party, | |
cycle, | |
vote_timestamp, | |
vote_subject_hash | |
FROM | |
voting_events | |
LEFT JOIN | |
people_per_cycle | |
USING(cycle) | |
) | |
SELECT | |
person, | |
party, | |
vote_timestamp, | |
vote_subject_hash, | |
apv.cycle, | |
IFNULL(vote, 'távol') AS vote | |
FROM | |
all_possible_votes AS apv | |
LEFT JOIN | |
actual_votes | |
USING(person, party, vote_timestamp, vote_subject_hash) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* | |
THIS IS WORK IN PROGRESS | |
Refactor: ARRAY_AGG(analytic_function) is not supported | |
1, Weekly distance averages for each person-person pairs | |
2, Rolling 4-week averages for each person-person pairs | |
(Averaging averages is *really* not elegant...) | |
*/ | |
WITH | |
weekly_distances AS ( | |
SELECT | |
DATE(MIN(votes.vote_timestamp)) AS week, | |
cycle, | |
edges.source_person, | |
edges.source_party, | |
edges.target_person, | |
edges.target_party, | |
AVG(edges.distance) AS weekly_distance | |
FROM | |
`atlo-parliament-polarization.scraped.pairwise_distances_for_14k_voting_events` AS votes | |
CROSS JOIN | |
UNNEST(votes.edges) AS edges | |
GROUP BY | |
week_no, | |
cycle, | |
edges.source_person, | |
edges.source_party, | |
edges.target_person, | |
edges.target_party | |
LIMIT | |
1000 | |
) | |
SELECT | |
week, | |
cycle, | |
ARRAY_AGG( | |
STRUCT( | |
source_person, | |
source_party, | |
target_person, | |
target_party, | |
AVG(weekly_distance) | |
OVER( | |
PARTITION BY | |
week, | |
cycle, | |
source_person, | |
source_party, | |
target_person, | |
target_party | |
ORDER BY | |
week | |
ROWS BETWEEN | |
3 PRECEDING | |
AND CURRENT ROW | |
) | |
) | |
) | |
FROM | |
weekly_distances | |
LIMIT | |
50 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
To use D3.js force-directed graph for visualization, must build distance data in this format: https://gist.githubusercontent.com/mbostock/4062045/raw/5916d145c8c048a6e3086915a6be464467391c62/miserables.json
I do not know, how D3.js will display 40k edges dynamically. Force layout algorithms take cubic time to finish. While I can reduce the granularity on the time scale, I can not simply do the same on the people dimension. It's likely that I got to compromise, and replace interactive visualization with pre-rendered one.