Journalists and Numbers
Some things journalists may want to consider:
1. Anecdotes can mislead. People seeing yet another episodic story on crime may infer that crime is increasing.
So report numbers where trustworthy numerical data are available.
2. But numbers need to be reported carefully. Most people, when reading news, do not do back-of-the-envelope calculations to interpret data correctly.
So ill-reported numbers can mislead.
3. Rules for numbers:
a. Distinguish % changes from changes in % (percentage points). The former looks more impressive when the base rate is low. The latter is generally a better way to report things. If confused, report the values at t1 and t2.
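To make the distinction in rule (a) concrete, here is a minimal sketch in Python. The function names and the 1% to 2% figures are made up for illustration; they do not come from the original note.

```python
def percent_change(t1, t2):
    """Relative change from t1 to t2, expressed as a percent of t1."""
    if t1 == 0:
        raise ValueError("percent change is undefined when the base (t1) is 0")
    return (t2 - t1) / t1 * 100


def percentage_point_change(t1, t2):
    """Absolute change between two rates that are already in percent."""
    return t2 - t1


# Hypothetical example: a rate that moves from 1% to 2%.
t1, t2 = 1.0, 2.0
print(f"% change: {percent_change(t1, t2):.0f}%")
print(f"change in %: {percentage_point_change(t1, t2):.0f} percentage point(s)")
print(f"or simply report both values: {t1}% at t1, {t2}% at t2")
```

With a low base rate, the same one-point move reads as a doubling when reported as a % change, which is why reporting the t1 and t2 values themselves is the safest fallback.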
soodoku / Hillary_Clinton
Last active Aug 29, 2015
Calculating Hillary's Missing Emails
55000/(365*4) ~ 37.7 per day. That seems a touch low for a Secretary of State.
1. Clinton may have used more than one private server
2. Clinton may have sent emails from other servers to unofficial accounts of other State Department employees
Lower bound for missing emails from Clinton:
Take a small weighted random sample (weighting seniority more) of top State Department employees.
soodoku /
Last active Aug 29, 2015
Get Congressional Speech Data Via CapitolWords API
Gets Congressional speech text, arranged by speaker.
Produces a csv (capitolwords.csv) with the following columns:
Uses the Sunlight Foundation library:
soodoku /
Last active Aug 29, 2015
Salvage Corrupted CSV
What does it do?
Goes through a corrupted csv sequentially and outputs rows that are clean.
Also outputs the total number of rows and the number of corrupted rows.
@author: Gaurav Sood
Run: python input_csv output_csv
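The script itself is not shown in this preview, so the following is only a sketch of the general idea, not the author's code. It assumes a row is "clean" when it parses on its own and has the same number of fields as the first line; the function name `salvage_rows` and the sample data are hypothetical. Quoted fields that span several lines would be counted as corrupted under this assumption.

```python
import csv


def salvage_rows(lines, expected_fields=None):
    """Return (clean_rows, total_n, corrupted_n) for an iterable of raw lines.

    Assumption: a row is clean if it parses by itself and has the same
    number of fields as the first parsed line (usually the header).
    """
    clean, total, corrupted = [], 0, 0
    for line in lines:
        total += 1
        if "\0" in line:  # NUL bytes are a common sign of corruption
            corrupted += 1
            continue
        try:
            row = next(csv.reader([line]))
        except (csv.Error, StopIteration):
            corrupted += 1
            continue
        if expected_fields is None:
            expected_fields = len(row)
        if len(row) != expected_fields:
            corrupted += 1
            continue
        clean.append(row)
    return clean, total, corrupted


# Made-up input: two of the five lines have the wrong number of fields.
sample = [
    "id,name,score\n",
    "1,Ann,3.5\n",
    "2,Bob\n",
    "3,Cy,4.0,extra\n",
    "4,Dee,2.0\n",
]
rows, total_n, corrupted_n = salvage_rows(sample)
print(f"total n = {total_n}, corrupted n = {corrupted_n}, clean n = {len(rows)}")
```

For a real file, the same function could be fed an open file handle (opened with `newline=""` and `errors="replace"`), and the clean rows written out with `csv.writer`.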
soodoku / prop_weights.R
Created May 31, 2015
Weighting datasets by propensity scores (~YouGov Method for Sampling)
Weighting by Propensity Scores
Last Edited: 5/31/2015
Task Outline:
1. Two datasets:
dataset 1: large, population-representative sample
dataset 2: convenience sample
2. Create weights for dataset 2 so that its marginals are close to dataset 1 on some vars.
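The R script is not visible in this preview, so here is a rough Python sketch of the task as outlined, under simplifying assumptions. It uses a single categorical variable and a saturated (cell-frequency) propensity model in place of a fitted model such as logistic regression; the function names and the toy age data are hypothetical.

```python
from collections import Counter


def propensity_weights(reference, convenience):
    """Weight each convenience-sample unit by its odds of being in the reference sample.

    Propensity here is P(unit is from the reference sample | x), estimated
    from cell counts after stacking the two datasets.
    """
    n_ref = Counter(reference)
    n_conv = Counter(convenience)
    weights = []
    for x in convenience:
        p = n_ref[x] / (n_ref[x] + n_conv[x])
        weights.append(p / (1 - p))  # odds, equal to n_ref[x] / n_conv[x]
    total = sum(weights)
    if total == 0:
        raise ValueError("the two datasets share no categories")
    # Rescale so the weights sum to the convenience-sample size.
    return [w * len(convenience) / total for w in weights]


def weighted_share(values, weights, category):
    """Weighted proportion of units falling in `category`."""
    return sum(w for v, w in zip(values, weights) if v == category) / sum(weights)


# Toy data: dataset 1 is 30% young, dataset 2 is 60% young.
reference = ["young"] * 30 + ["old"] * 70
convenience = ["young"] * 60 + ["old"] * 40
weights = propensity_weights(reference, convenience)
print(round(weighted_share(convenience, weights, "young"), 3))
```

With several covariates, the cell counts would normally be replaced by predicted probabilities from a model fitted on the stacked data, but the odds-weighting step stays the same.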
soodoku / Distributed
Last active Aug 29, 2015
Reducing Costs for Producing Training Data and Implementing Semi-Automated Systems

The goal is to make it easier to produce distributed Human Intelligence Tasks (HITs, nomenclature courtesy of Amazon). HITs include the production of training data; the general class of recognition problems, such as image recognition tasks, that humans can do with very little error and that machines are still somewhat bad at; surveys (where the source of data is the human being surveyed); etc.

The general idea traces its ancestry to CAPTCHA, which was developed to solve two problems at the same time: provide a way for websites to distinguish between humans and bots, and help OCR written (or heard) material. But it differs from CAPTCHA in three ways. First, our goal is not to solve two problems at once. Thus, unlike current CAPTCHA systems, which make it as hard as possible for humans to get the answer right, we want to invert that logic and make it as easy as possible for humans to get the answer right. Second, we want to build it for tasks other than recognition tasks. Third, we plan to attach it to a payment architectu

soodoku / server_installs
Last active Aug 30, 2015
Basic R related installs for Initializing Scrapers on Digital Ocean Ubuntu
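# Refresh the package lists first, then upgrade installed packages
sudo apt-get update
sudo apt-get upgrade
# Editor, R, and the development headers that R's curl and XML packages need
sudo aptitude install emacs24
sudo aptitude install r-base
sudo aptitude install libcurl4-openssl-dev
sudo aptitude install libxml2-dev
# Java for rJava; quote the glob so the shell does not expand it
sudo apt-get install 'openjdk-7-*'
R CMD javareconf -e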
soodoku /
Last active Sep 20, 2015
Making it Count: Counting Women On the Street

Making'em Count: Counting Women On the Street

Proposal for a crowd-sourced study:

The purpose: to estimate the proportion of men among the people on the streets.

Some priors: the proportion varies by time of day and by place. The proportion of women out on the city's streets likely declines at night, tragic as the reasons for that are, and the proportion of men is likely greater around office complexes than on residential streets. The aim is to get data from a diverse set of places and from a range of times.
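One way the crowd-sourced tallies could be summarized is sketched below. This is an illustration and not part of the proposal: the record layout (place, time of day, men counted, women counted), the function names, and the numbers are all assumptions. It pools counts within each place-by-time stratum and attaches a Wilson score interval to each proportion.

```python
import math
from collections import defaultdict


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n == 0:
        raise ValueError("need at least one observation")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def proportion_men_by_stratum(observations):
    """Pool tallies by (place, time of day) and estimate the share of men."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [men, people]
    for place, time_of_day, men, women in observations:
        totals[(place, time_of_day)][0] += men
        totals[(place, time_of_day)][1] += men + women
    return {s: (m / n, wilson_interval(m, n)) for s, (m, n) in totals.items()}


# Made-up tallies from three hypothetical volunteers.
observations = [
    ("office", "day", 30, 10),
    ("office", "day", 30, 10),
    ("residential", "day", 25, 25),
]
estimates = proportion_men_by_stratum(observations)
for stratum, (share, (low, high)) in estimates.items():
    print(stratum, f"{share:.2f}", f"[{low:.2f}, {high:.2f}]")
```

Keeping the strata separate, instead of pooling everything, reflects the priors above that the proportion varies by place and time.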

soodoku /
Last active Nov 14, 2015
Basic sentiment analysis with AFINN or custom word database
Basic Sentiment Analysis
Builds on:
Utilizes AFINN or a custom sentiment db
Example Snippets at end from:
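The gist's code is not shown in this preview. As a rough sketch of how lexicon-based scoring of this kind generally works, the example below sums word valences over a text. The tiny lexicon is made up for illustration and does not reproduce the real AFINN values; the function name is also hypothetical.

```python
import re

# Toy lexicon in the AFINN style (word -> integer valence).
# These values are invented for the example, not taken from AFINN.
TOY_LEXICON = {"good": 2, "great": 3, "bad": -2, "awful": -3}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every word in `text` that appears in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


print(sentiment_score("Good, good, bad"))
print(sentiment_score("A great start but an awful ending"))
```

A custom sentiment database can be swapped in by passing a different word-to-score dictionary as `lexicon`.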
soodoku / cong.csv
Last active Nov 22, 2015
Educational Qualifications of Members of the 111th Congress
Name District Education Science Law
Jeff Sessions (R) AL-Senate B.A., Huntingdon College; J.D. University of Alabama School of Law 1
Richard Shelby (R) AL-Senate B.A., University of Alabama; J.D. University of Alabama School of Law 1
Jo Bonner (R) AL-1 B.A. Journalism, University of Alabama 0
Bobby Bright (D) AL-2 B.A. Political Science, Auburn University; M.S. Criminal Justice, Troy State University; J.D. Thomas Goode Jones School of Law 1
Mike Rogers (R) AL-3 B.A., Political Science; M.P.A., Jackson State University; J.D. Birmingham School of Law 1
Robert Aderholt (R) AL-4 B.A., Political Science/History, Birmingham Southern College; J.D., Samford University 1
Parker Griffith (D) AL-5 B.S.; M.D., Louisiana State University 0
Spencer Bachus (R) AL-6 B.A., Auburn University; J.D., University of Alabama 1
Artur Davis (D) AL-7 B.A., Government, Harvard University; J.D., Harvard University School of Law 1