replicating "Using NLP to measure democracy"
To produce the ADS (Automated Democracy Scores) I relied on supervised learning. I tried three different approaches, compared the results, and picked the one that worked best. More specifically, I tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperformed the alternatives.
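To make approach (a) concrete, here is a minimal sketch of an LSA-plus-trees pipeline with the tools named further down (gensim for LSA, scikit-learn for the trees). The file names, the number of topics, and the regressor settings are placeholders, not the configuration actually used; the linked scripts contain the real settings. The "uds_scores.txt" file is hypothetical.

<pre>
import pickle

import numpy as np
from gensim import corpora, models, matutils
from sklearn.ensemble import RandomForestRegressor

# load the term-frequency corpus (MatrixMarket format) and the id -> word map
corpus = corpora.MmCorpus('corpora_a.mm')
with open('corpora_a_id2token', 'rb') as f:
    id2token = pickle.load(f)

# step 1: project the documents onto a small number of LSA dimensions
num_topics = 100  # placeholder; the linked scripts read this from a "num_topics" variable
lsi = models.LsiModel(corpus, id2word=id2token, num_topics=num_topics)
X = matutils.corpus2dense(lsi[corpus], num_terms=num_topics).T  # documents x topics

# step 2: fit a tree-based regressor on the documents that have UDS scores;
# 'uds_scores.txt' is a hypothetical file with one score per labeled document,
# assuming (for the sketch) that the labeled documents come first in the corpus
y = np.loadtxt('uds_scores.txt')
regressor = RandomForestRegressor(n_estimators=500)
regressor.fit(X[:len(y)], y)

# step 3: score every document; the democracy scores come from these predictions
scores = regressor.predict(X)
</pre>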
I created a <a href="http://democracy-scores.org">web application</a> where anyone can tweak the training data and see how the results change (no coding required).

<u>Data and code</u>

The two corpora (A and B) are available in <a href="http://math.nist.gov/MatrixMarket/formats.html#MMformat">MatrixMarket format</a>. Each corpus is accompanied by three other files: an internal index; a Python pickle with a dictionary mapping word IDs to words; and a Python pickle with a dictionary mapping words to word IDs. Here are the links:
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a.mm">Corpus A</a>
(<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a.mm.index">index</a>,
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a_id2token">id2token</a>,
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_a/corpora_a_token2id">token2id</a>)
and <a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b.mm">Corpus B</a>
(<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b.mm.index">index</a>,
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b_id2token">id2token</a>,
<a href="https://s3.amazonaws.com/thiagomarzagao/corpora_b/corpora_b_token2id">token2id</a>).
There is also a cleaned-up version of the UDS (Unified Democracy Scores) dataset, available <a href="https://s3.amazonaws.com/thiagomarzagao/uds.csv">here</a>.
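If you just want to inspect the corpora, a sketch like the one below should work, assuming the companion .mm.index and pickle files sit in the working directory next to the .mm file:

<pre>
import pickle

from gensim import corpora

# gensim picks up the companion 'corpora_a.mm.index' file automatically,
# which is what makes corpus[i] lookups possible
corpus = corpora.MmCorpus('corpora_a.mm')
print(corpus)  # number of documents, vocabulary size, non-zero entries

# the two pickles map word IDs to words and words to word IDs
# (they were saved under Python 2; under Python 3 you may need encoding='latin1')
with open('corpora_a_id2token', 'rb') as f:
    id2token = pickle.load(f)
with open('corpora_a_token2id', 'rb') as f:
    token2id = pickle.load(f)

# each document is a bag of words: a list of (word_id, count) pairs
first_doc = corpus[0]
print([(id2token[word_id], count) for word_id, count in first_doc[:10]])
</pre>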
As for the code, I used Python 2.7.6. For LSA and LDA I used the gensim package (v. 0.8.9), and for the tree-based methods I used scikit-learn (v. 0.14.1). All the LSA/LDA/trees scripts are available online (change the "num_topics", "ipath", "opath", and "udsfile" variables as needed):
<a href="https://gist.github.com/thiagomarzagao/1b7ecc3335f758fdf713">LSA</a>,
<a href="https://gist.github.com/thiagomarzagao/459ebc07a0abe32407bd">LDA</a>,
<a href="https://gist.github.com/thiagomarzagao/22cb3f26a750c9c7c2d3">tree-based predictions</a>,
and <a href="https://gist.github.com/thiagomarzagao/116a40aadf70e52e5596">list of country-years</a>
(which must be in the same folder as the LSA and LDA scripts).

To run Wordscores I had to implement it in Python, as the existing implementations (in R and Stata) do not handle out-of-core data. The code is not pretty, though, so if you want to replicate the Wordscores part it may be easier to write your own code. If you do want to use my code, you will first need to: convert the term-frequency matrix from sparse to dense format; split it twice (once row-wise, into chunks of 106,241 rows each, and once column-wise, into chunks of 49 columns each); compute the relative frequencies; split the matrix of relative frequencies column-wise, into chunks of 49 columns each; save all the chunks in <a href="https://www.hdfgroup.org/HDF5/">HDF5</a> format; name the chunks in very specific ways (see the code); and download the cleaned-up version of the UDS dataset from <a href="https://s3.amazonaws.com/thiagomarzagao/uds.csv">here</a>. If you are willing to endure all that, my code should work; you can find it <a href="https://gist.github.com/thiagomarzagao/406be950a4fb67af3bde">here</a>.
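To give a rough sense of that preprocessing, here is a sketch of the densify/chunk/HDF5 part. It uses h5py, but any HDF5 writer works; the dataset names are made up (my script expects its own naming scheme, so check the linked code), and the relative frequencies here follow the usual Wordscores definition of word count over document total:

<pre>
import h5py
from gensim import corpora, matutils

ROW_CHUNK = 106241  # rows (documents) per row-wise chunk
COL_CHUNK = 49      # columns (words) per column-wise chunk

# densify the term-frequency matrix (documents as rows, words as columns);
# for the larger corpus this will not fit in memory at once, so in practice
# the densification itself has to be done block by block
corpus = corpora.MmCorpus('corpora_a.mm')
tf = matutils.corpus2dense(corpus, num_terms=corpus.num_terms).T

# relative frequencies: each row divided by its word total
totals = tf.sum(axis=1, keepdims=True)
totals[totals == 0] = 1.0  # guard against empty documents
rel = tf / totals

# save the row-wise and column-wise chunks to HDF5;
# the dataset names below are illustrative only
with h5py.File('chunks.h5', 'w') as store:
    for i in range(0, tf.shape[0], ROW_CHUNK):
        store.create_dataset('tf_rows_%d' % i, data=tf[i:i + ROW_CHUNK])
    for j in range(0, tf.shape[1], COL_CHUNK):
        store.create_dataset('tf_cols_%d' % j, data=tf[:, j:j + COL_CHUNK])
        store.create_dataset('rel_cols_%d' % j, data=rel[:, j:j + COL_CHUNK])
</pre>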
Running LSA, LDA, and Wordscores required high-performance computers. To run LSA, LDA, and Wordscores on corpus A I used a cluster of memory-optimized servers from Amazon EC2; each server had an 8-core Intel Xeon E5-2670 v2 (Ivy Bridge) CPU and 61GB of RAM. To run LSA and LDA on corpus B I used a cluster of nodes from the Ohio Supercomputer Center; each node had 12 Intel Xeon X5650 cores and 48GB of RAM. Most LSA and LDA specifications took about a day to run, but a few (especially LDA with 300 topics) took almost a week. Total computing time was 1,512 hours. The tree-based methods took only a few seconds per batch and did not require high-performance computers.