@gruber
gruber / Liberal Regex Pattern for All URLs
Last active March 20, 2024 20:28
Liberal, Accurate Regex Pattern for Matching All URLs
The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
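The single-line pattern can be exercised directly with Python's `re` module; the sample text below is illustrative:

```python
import re

# The single-line pattern from above, verbatim, in a raw string.
URL_PATTERN = re.compile(
    r'''(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
)

def find_urls(text):
    """Return every URL-like substring the pattern matches in `text`."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]
```

Note the trailing character class deliberately excludes closing punctuation, so a URL at the end of a sentence does not swallow the final period or quote mark.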
@konklone
konklone / icpsr_to_bioguide.rb
Created January 19, 2012 20:32
Matching up ICPSR IDs to Bioguide IDs using Charles Stewart's data, and the Sunlight Labs Congress API.
# house-icpsr.csv and senate-icpsr.csv are made by converting the XLS files found here to CSV:
# http://web.mit.edu/17.251/www/data_page.html#2
# Specifically, these files that list information and IDs for members from the 103rd to 112th Congress:
# http://web.mit.edu/cstewart/www/data/house_members_103-112-1.xls
# http://web.mit.edu/cstewart/www/data/senators_103-112-1.xls
# This script looks through the two original CSVs, caches the ICPSR ID of every member from the 110th Congress onward,
# then goes through every legislator in the Sunlight Labs Congress API and tries to match them up by a combination of
# last name, state, and party.
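The gist itself is Ruby, but the matching step it describes is language-agnostic. A minimal Python sketch, where the field names (`last_name`, `state`, `party`, `icpsr`, `bioguide_id`) are illustrative assumptions about the row format:

```python
def match_key(record):
    """Normalized (last name, state, party) key used to align the two datasets."""
    return (record["last_name"].strip().lower(),
            record["state"].strip().upper(),
            record["party"].strip().upper())

def match_icpsr(stewart_rows, sunlight_rows):
    """Map each Sunlight legislator's Bioguide ID to an ICPSR ID via the shared key.

    Legislators with no match get None; collisions on the key would need
    manual review, as the gist's comments imply.
    """
    icpsr_by_key = {match_key(r): r["icpsr"] for r in stewart_rows}
    return {r["bioguide_id"]: icpsr_by_key.get(match_key(r)) for r in sunlight_rows}
```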
@patrickfuller
patrickfuller / github_traversal.py
Created December 27, 2013 06:19
A basic graph traversal script using the Github API. Starting from a specified root-node Github user, this will traverse everyone they are following. This repeats in a breadth-first pattern until a threshold is reached.
import requests
import getpass
import sys
import json
import queue  # Python 3; the original gist predates this and imported the Python 2 `Queue` module
# This is a script, so let's be lazy: fill up this global and print it at the end.
g = {"nodes": {}, "edges": []}
# And here's the cutoff criterion: stop once this many users have been seen.
MAX_NODES = 1000
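The breadth-first loop itself is not shown in this preview. A minimal sketch of it, assuming a `get_following(user)` callable that wraps the GitHub API call (in the gist, a `requests` call against the `/users/:user/following` endpoint):

```python
from collections import deque

def traverse(root, get_following, max_nodes=1000):
    """Breadth-first traversal of a 'following' graph, stopping once
    max_nodes users have been seen.

    get_following(user) -> iterable of users that `user` follows.
    """
    graph = {"nodes": {}, "edges": []}
    graph["nodes"][root] = {}
    frontier = deque([root])
    while frontier and len(graph["nodes"]) < max_nodes:
        user = frontier.popleft()
        for followed in get_following(user):
            graph["edges"].append((user, followed))
            if followed not in graph["nodes"]:
                graph["nodes"][followed] = {}
                frontier.append(followed)
    return graph
```

A fake `get_following` (a dict lookup) is enough to exercise the traversal without touching the network.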
Some things journalists may want to consider:
1. Anecdotes can mislead. People seeing yet another episodic story on crime may infer that crime is increasing.
So report numbers where trustworthy numerical data are available.
2. But numbers need to be reported carefully. Most people, when reading the news, do not do back-of-the-envelope calculations to interpret data correctly.
So poorly reported numbers can mislead.
3. Rules for numbers:
a. Distinguish % changes from changes in percentage points. The former looks more impressive when the base rate is low; the latter is generally the better way to report things. If in doubt, report the raw values at t1 and t2.
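The distinction in (a) can be made concrete with a small worked example:

```python
def percent_change(old, new):
    """Relative change: (new - old) / old, expressed as a percent."""
    return 100.0 * (new - old) / old

def point_change(old_pct, new_pct):
    """Change in percentage points: the simple difference of two percentages."""
    return new_pct - old_pct

# A rate moving from 1% to 2% is a 100% increase,
# but only a 1 percentage-point increase.
```

Headlining "100% increase" and headlining "up 1 point" describe the same facts; reporting both raw values (1% at t1, 2% at t2) lets readers judge for themselves.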
@soodoku
soodoku / scrape_wisconsin_ads.py
Last active September 14, 2016 13:21
Get text from Wisconsin ad pdfs using pyPdf
'''
Text from Searchable pdfs
Scrape Text off Wisconsin Ads pdfs
Uses pyPdf to get text from searchable pdfs. The script is tailored to getting data
from the Wisconsin Political Ads Database: http://wiscadproject.wisc.edu/Storyboards.
@author: Gaurav Sood
Created on November 02, 2011
@soodoku
soodoku / basic_sentiment_analysis.py
Last active November 14, 2015 05:51
Basic sentiment analysis with AFINN or custom word database
'''
Basic Sentiment Analysis
Builds on:
https://finnaarupnielsen.wordpress.com/2011/06/20/simplest-sentiment-analysis-in-python-with-af/
Utilizes AFINN or a custom sentiment db
Example Snippets at end from: https://code.google.com/p/sentana/wiki/ExampleSentiments
'''
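The core of an AFINN-style scorer reduces to a dictionary lookup. A minimal sketch, assuming a `{word: valence}` dict and whitespace tokenization (the gist's actual tokenization may differ):

```python
def sentiment_score(text, valences):
    """Sum word valences from an AFINN-style {word: score} dict.

    Words absent from the dict contribute 0, so the score is the net
    valence of the recognized sentiment words.
    """
    return sum(valences.get(word, 0) for word in text.lower().split())
```

The same function works unchanged with a custom sentiment database, which is the gist's other supported source.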
@soodoku
soodoku / Hillary_Clinton
Last active August 29, 2015 14:17
Calculating Hillary's Missing Emails
Note:
55,000/(365*4) ≈ 37.7 emails per day. That seems a touch low for a Secretary of State.
Caveats:
1. Clinton may have used more than one private server
2. Clinton may have sent emails from other servers to unofficial accounts of other State Department employees
Lower bound for missing emails from Clinton:
Take a small weighted random sample (weighting seniority more heavily) of top State Department employees.
@soodoku
soodoku / capitol_speech.py
Last active August 29, 2015 14:17
Get Congressional Speech Data Via CapitolWords API
'''
Gets Congressional speech text, arranged by speaker.
Produces a csv (capitolwords.csv) with the following columns:
speaker_state,speaker_raw,speaker_first,congress,title,origin_url,number,id,volume,chamber,session,speaker_last,
pages,speaker_party,date,bills,bioguide_id,order,speaking,capitolwords_url
Uses the Sunlight foundation library: http://python-sunlight.readthedocs.org/en/latest/
'''
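The column order in the docstring can be enforced with `csv.DictWriter`. A minimal writer sketch, assuming the API results arrive as dicts (the Sunlight CapitolWords API has since been retired, so the fetch step is not shown):

```python
import csv

# Column order as listed in the gist's docstring.
FIELDS = ["speaker_state", "speaker_raw", "speaker_first", "congress", "title",
          "origin_url", "number", "id", "volume", "chamber", "session",
          "speaker_last", "pages", "speaker_party", "date", "bills",
          "bioguide_id", "order", "speaking", "capitolwords_url"]

def write_rows(path, rows):
    """Write API result dicts to a capitolwords.csv-style file.

    Missing keys become empty cells; unexpected keys are dropped.
    """
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS,
                                restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```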
@soodoku
soodoku / salvage_csv.py
Last active August 29, 2015 14:20
Salvage Corrupted CSV
'''
What does it do?
Goes through a corrupted csv sequentially and outputs the rows that are clean.
Also outputs total n and total corrupted n.
@author: Gaurav Sood
Run: python salvage_csv.py input_csv output_csv
'''
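A minimal sketch of the salvage idea, assuming "clean" means "has the expected number of fields" (the gist may apply a different criterion):

```python
import csv

def salvage(in_path, out_path, expected_cols):
    """Copy rows with exactly `expected_cols` fields from in_path to out_path.

    Returns (total_rows, corrupted_rows), matching the counts the gist reports.
    """
    total = corrupted = 0
    with open(in_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            total += 1
            if len(row) == expected_cols:
                writer.writerow(row)
            else:
                corrupted += 1
    return total, corrupted
```

Reading row by row is what lets this survive corruption mid-file: one bad row is counted and skipped rather than aborting the whole parse.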
@soodoku
soodoku / text_classifier.R
Last active December 15, 2016 17:44
Basic Text Classifier
"
Basic Text Classifier
- Takes a csv with a text column and a column of labels
- Splits the data into train and test sets
- Preprocesses text with tm: bag-of-words plus 1st/2nd-order Markov (unigram/bigram) features
- Classifies using SVM and Lasso
@author: Gaurav Sood
"
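For comparison, the split and feature-extraction steps (the R gist uses tm, then feeds the matrix to SVM and Lasso) can be sketched in plain Python; function names and defaults here are illustrative:

```python
import random
from collections import Counter

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle (text, label) pairs and split into train/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

def ngrams(tokens, n):
    """Contiguous n-grams; n=1 and n=2 give the 1st/2nd-order features."""
    return list(zip(*(tokens[i:] for i in range(n))))

def bag_of_words(text, order=1):
    """Counts of lower-cased n-grams up to `order`, keyed by token tuple."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, order + 1):
        counts.update(ngrams(tokens, n))
    return counts
```

From here, the counts would be assembled into a document-term matrix and handed to the classifiers, which is the part the R gist delegates to its modeling libraries.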