replicating "Using NLP to measure democracy"
To produce the ADS I relied on supervised learning. I tried three different approaches, compared the results, and picked the approach that worked best. More specifically, I tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperformed the alternatives.
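The winning approach, Wordscores (Laver, Benoit & Garry 2003), can be sketched in a few lines of NumPy. This is a minimal dense illustration of the algorithm itself, not the author's out-of-core implementation (linked below); the function name and the assumption that every word appears in at least one reference document are mine.

```python
import numpy as np

def wordscores(ref_counts, ref_scores, virgin_counts):
    """Minimal dense sketch of Wordscores.

    ref_counts:    (n_ref_docs, n_words) term-frequency matrix of reference texts
    ref_scores:    (n_ref_docs,) known scores for the reference texts
    virgin_counts: (n_virgin_docs, n_words) term-frequency matrix of texts to score

    Assumes every word occurs in at least one reference document.
    """
    # Relative frequency of each word within each reference document,
    # then normalized across documents to get P(ref doc r | word w).
    rel = ref_counts / ref_counts.sum(axis=1, keepdims=True)
    p_rw = rel / rel.sum(axis=0, keepdims=True)
    # Word score = P(r|w)-weighted average of the reference scores.
    word_scores = p_rw.T @ ref_scores
    # Virgin text score = frequency-weighted average of its word scores.
    virgin_rel = virgin_counts / virgin_counts.sum(axis=1, keepdims=True)
    return virgin_rel @ word_scores

# Two reference texts anchored at -1 and +1; virgin texts get scores in between.
scores = wordscores(np.array([[5., 0.], [0., 5.]]),
                    np.array([-1., 1.]),
                    np.array([[3., 1.], [1., 1.]]))
```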
I created a <a href="http://democracy-scores.org">web application</a> where anyone can tweak the training data and see how the results change (no coding required). <u>Data and code</u>. The two corpora (A and B) are available in <a href="http://math.nist.gov/MatrixMarket/formats.html#MMformat">MatrixMarket format</a>. Each corpus is accompanied by other files: an internal index; a Python pickle with a dictionary mapping word IDs to words; and a Python pickle with a dictionary mapping words to word IDs. Here are the links: <a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a.mm">Corpus A</a> (<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a.mm.index">index</a>, <a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a_id2token">id2token</a>, <a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a_token2id">token2id</a>) and <a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b.mm">Corpus B</a> (<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b.mm.index">index</a>, <a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b_id2token">id2token</a>, <a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b_token2id">token2id</a>). There is also a cleaned-up version of the UDS dataset - <a href="https://s3.amazonaws.com/thiagomarzagao/uds.csv">here</a>.
As for the code, I used Python 2.7.6. For LSA and LDA I used the package gensim (v. 0.8.9), and for the tree methods I used the package scikit-learn (v. 0.14.1). All the LSA/LDA/trees scripts are available online (change the "num_topics", "ipath", "opath", and "udsfile" variables as needed): <a href="https://gist.github.com/thiagomarzagao/1b7ecc3335f758fdf713">LSA</a>, <a href="https://gist.github.com/thiagomarzagao/459ebc07a0abe32407bd">LDA</a>, <a href="https://gist.github.com/thiagomarzagao/22cb3f26a750c9c7c2d3">tree-based predictions</a>, and <a href="https://gist.github.com/thiagomarzagao/116a40aadf70e52e5596">list of country-years</a> (must be in the same folder as the LSA and LDA scripts). To run Wordscores I had to implement it in Python, as the existing implementations (in R and Stata) do not handle out-of-core data. The code is not pretty, though, so if you want to replicate the Wordscores part it may be easier to write your own code.
If you do want to use my code, first you'll need to convert the term-frequency matrix from sparse to dense format, split it twice (once row-wise into chunks of 106,241 rows each and once column-wise into chunks of 49 columns each), compute the relative frequencies, split the matrix of relative frequencies (column-wise into chunks of 49 columns each), save all the chunks in <a href="https://www.hdfgroup.org/HDF5/">HDF5</a> format, name the chunks in very specific ways (see code), and download a cleaned-up version of the UDS dataset from <a href="https://s3.amazonaws.com/thiagomarzagao/uds.csv">here</a>. If you're willing to endure all that then my code should work - you can find it <a href="https://gist.github.com/thiagomarzagao/406be950a4fb67af3bde">here</a>.
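The splitting steps above can be sketched in plain NumPy (the actual HDF5 writing and chunk naming are in the linked script, so they are omitted here). The function name and the per-document normalization for relative frequencies are my assumptions; the chunk sizes come from the text.

```python
import numpy as np

def make_chunks(tf_matrix, row_size=106241, col_size=49):
    """Split a dense term-frequency matrix once row-wise and once
    column-wise, and split the relative-frequency matrix column-wise.
    Each chunk would then be saved to HDF5 (e.g. via h5py) under the
    specific names the Wordscores script expects."""
    row_chunks = [tf_matrix[i:i + row_size]
                  for i in range(0, tf_matrix.shape[0], row_size)]
    col_chunks = [tf_matrix[:, j:j + col_size]
                  for j in range(0, tf_matrix.shape[1], col_size)]
    # Relative frequencies: each row (document) sums to 1 (an assumption).
    rel = tf_matrix / tf_matrix.sum(axis=1, keepdims=True)
    rel_col_chunks = [rel[:, j:j + col_size]
                      for j in range(0, rel.shape[1], col_size)]
    return row_chunks, col_chunks, rel_col_chunks
```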
Running LSA, LDA, and Wordscores required high-performance computers. To run LSA, LDA, and Wordscores on corpus A I used a cluster of memory-optimized servers from Amazon EC2. Each server had an 8-core Intel Xeon E5-2670 v2 (Ivy Bridge) CPU and 61GB of RAM. To run LSA and LDA on corpus B I used a cluster of nodes from the Ohio Supercomputer Center. Each node had a 12-core Intel Xeon X5650 CPU and 48GB of RAM. Most LSA and LDA specifications took about a day to run, but a few (especially LDA with 300 topics) took almost a week. Total computing time was 1,512 hours. The tree-based methods only took a few seconds for each batch and did not require high-performance computers.