replicating "Using NLP to measure democracy"
To produce the ADS (Automated Democracy Scores) I relied on supervised learning. I tried three different approaches, compared the results, and picked the one that worked best. More specifically, I tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperformed the alternatives.
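To make approach (a) concrete, here is a minimal sketch of an LSA-plus-trees pipeline with the tools named further down (gensim for LSA, scikit-learn for the trees). The file names, the number of topics, and the regressor settings are placeholders, not the configuration actually used; the linked scripts contain the real settings. The "uds_scores.txt" file is hypothetical.

<pre>
import pickle

import numpy as np
from gensim import corpora, models, matutils
from sklearn.ensemble import RandomForestRegressor

# load the term-frequency corpus (MatrixMarket format) and the id -> word map
corpus = corpora.MmCorpus('corpora_a.mm')
with open('corpora_a_id2token', 'rb') as f:
    id2token = pickle.load(f)

# step 1: project the documents onto a small number of LSA dimensions
num_topics = 100  # placeholder; the linked scripts read this from a "num_topics" variable
lsi = models.LsiModel(corpus, id2word=id2token, num_topics=num_topics)
X = matutils.corpus2dense(lsi[corpus], num_terms=num_topics).T  # documents x topics

# step 2: fit a tree-based regressor on the documents that have UDS scores;
# 'uds_scores.txt' is a hypothetical file with one score per labeled document,
# assuming (for the sketch) that the labeled documents come first in the corpus
y = np.loadtxt('uds_scores.txt')
regressor = RandomForestRegressor(n_estimators=500)
regressor.fit(X[:len(y)], y)

# step 3: score every document; the democracy scores come from these predictions
scores = regressor.predict(X)
</pre>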
I created a <a href="http://democracy-scores.org">web application</a> where anyone can tweak the training data and see how the results change (no coding required).

<u>Data and code</u>

The two corpora (A and B) are available in <a href="http://math.nist.gov/MatrixMarket/formats.html#MMformat">MatrixMarket format</a>. Each corpus is accompanied by three other files: an internal index; a Python pickle with a dictionary mapping word IDs to words; and a Python pickle with a dictionary mapping words to word IDs. Here are the links:
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a.mm">Corpus A</a>
(<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a.mm.index">index</a>,
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a_id2token">id2token</a>,
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a_token2id">token2id</a>)
and <a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b.mm">Corpus B</a>
(<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b.mm.index">index</a>,
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b_id2token">id2token</a>,
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b_token2id">token2id</a>).
There is also a cleaned-up version of the UDS (Unified Democracy Scores) dataset, available <a href="https://s3.amazonaws.com/thiagomarzagao/uds.csv">here</a>.
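If you just want to inspect the corpora, a sketch like the one below should work, assuming the companion .mm.index and pickle files sit in the working directory next to the .mm file:

<pre>
import pickle

from gensim import corpora

# gensim picks up the companion 'corpora_a.mm.index' file automatically,
# which is what makes corpus[i] lookups possible
corpus = corpora.MmCorpus('corpora_a.mm')
print(corpus)  # number of documents, vocabulary size, non-zero entries

# the two pickles map word IDs to words and words to word IDs
# (they were saved under Python 2; under Python 3 you may need encoding='latin1')
with open('corpora_a_id2token', 'rb') as f:
    id2token = pickle.load(f)
with open('corpora_a_token2id', 'rb') as f:
    token2id = pickle.load(f)

# each document is a bag of words: a list of (word_id, count) pairs
first_doc = corpus[0]
print([(id2token[word_id], count) for word_id, count in first_doc[:10]])
</pre>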
As for the code, I used Python 2.7.6. For LSA and LDA I used the gensim package (v. 0.8.9), and for the tree-based methods I used scikit-learn (v. 0.14.1). All the LSA/LDA/trees scripts are available online (change the "num_topics", "ipath", "opath", and "udsfile" variables as needed):
<a href="https://gist.github.com/thiagomarzagao/1b7ecc3335f758fdf713">LSA</a>,
<a href="https://gist.github.com/thiagomarzagao/459ebc07a0abe32407bd">LDA</a>,
<a href="https://gist.github.com/thiagomarzagao/22cb3f26a750c9c7c2d3">tree-based predictions</a>,
and <a href="https://gist.github.com/thiagomarzagao/116a40aadf70e52e5596">list of country-years</a>
(which must be in the same folder as the LSA and LDA scripts).

To run Wordscores I had to implement it in Python, as the existing implementations (in R and Stata) do not handle out-of-core data. The code is not pretty, though, so if you want to replicate the Wordscores part it may be easier to write your own code. If you do want to use my code, you will first need to: convert the term-frequency matrix from sparse to dense format; split it twice (once row-wise, into chunks of 106,241 rows each, and once column-wise, into chunks of 49 columns each); compute the relative frequencies; split the matrix of relative frequencies column-wise, into chunks of 49 columns each; save all the chunks in <a href="https://www.hdfgroup.org/HDF5/">HDF5</a> format; name the chunks in very specific ways (see the code); and download the cleaned-up version of the UDS dataset from <a href="https://s3.amazonaws.com/thiagomarzagao/uds.csv">here</a>. If you are willing to endure all that, my code should work; you can find it <a href="https://gist.github.com/thiagomarzagao/406be950a4fb67af3bde">here</a>.
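To give a rough sense of that preprocessing, here is a sketch of the densify/chunk/HDF5 part. It uses h5py, but any HDF5 writer works; the dataset names are made up (my script expects its own naming scheme, so check the linked code), and the relative frequencies here follow the usual Wordscores definition of word count over document total:

<pre>
import h5py
from gensim import corpora, matutils

ROW_CHUNK = 106241  # rows (documents) per row-wise chunk
COL_CHUNK = 49      # columns (words) per column-wise chunk

# densify the term-frequency matrix (documents as rows, words as columns);
# for the larger corpus this will not fit in memory at once, so in practice
# the densification itself has to be done block by block
corpus = corpora.MmCorpus('corpora_a.mm')
tf = matutils.corpus2dense(corpus, num_terms=corpus.num_terms).T

# relative frequencies: each row divided by its word total
totals = tf.sum(axis=1, keepdims=True)
totals[totals == 0] = 1.0  # guard against empty documents
rel = tf / totals

# save the row-wise and column-wise chunks to HDF5;
# the dataset names below are illustrative only
with h5py.File('chunks.h5', 'w') as store:
    for i in range(0, tf.shape[0], ROW_CHUNK):
        store.create_dataset('tf_rows_%d' % i, data=tf[i:i + ROW_CHUNK])
    for j in range(0, tf.shape[1], COL_CHUNK):
        store.create_dataset('tf_cols_%d' % j, data=tf[:, j:j + COL_CHUNK])
        store.create_dataset('rel_cols_%d' % j, data=rel[:, j:j + COL_CHUNK])
</pre>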
Running LSA, LDA, and Wordscores required high-performance computers. To run LSA, LDA, and Wordscores on corpus A I used a cluster of memory-optimized servers from Amazon EC2; each server had an 8-core Intel Xeon E5-2670 v2 (Ivy Bridge) CPU and 61GB of RAM. To run LSA and LDA on corpus B I used a cluster of nodes from the Ohio Supercomputer Center; each node had 12 Intel Xeon X5650 cores and 48GB of RAM. Most LSA and LDA specifications took about a day to run, but a few (especially LDA with 300 topics) took almost a week. Total computing time was 1,512 hours. The tree-based methods took only a few seconds per batch and did not require high-performance computers.