dannguyen/congress-twitter-data-getter.md

## congress-twitter-data-getter.md

      
    Raw
  

              congress-twitter-data-getter.md
            
          
    Find out who Congress follows on Twitter using the command line

Part of a lesson for the Stanford Journalism Computational Methods in the Civic Sphere
This is a short tutorial on how to use command-line tools, including csvfix and t, the command-line Twitter interface, to access and parse data from the Sunlight Foundation and Twitter. The end goal of this exercise is to gather who everyone in Congress follows (friends, in the parlance of Twitter), and then count up the common friends to find out which Twitter accounts are most followed by members of Congress.
Here's a screenshot of the result, after it's been imported into Google Spreadsheets. Note that it's sorted by total number of followers, not necessarily the most followed by Congress:

If you don't care about the code, you can visit the Google Spreadsheet here, which contains an excerpt of the data (top 600 accounts by total follower count).
The code in this tutorial is meant to be brief and expects that you have some familiarity with the basic Unix tools (such as grep and sort), as well as the concept of pipes and processes.
Getting legislator data from the Sunlight Foundationz

The Sunlight Foundation has a convenient spreadsheet of legislators, their party affiliation, and social media handles.
# save to legislators.csv
curl -s -O http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
`
The sheet contains past legislators, so filter it by the "in_office" column, which is at position 10, and save it to a new file named current-legislators.csv
csvfix find -f 10 -s 1 < legislators.csv > current-legislators.csv
Then we extract the Twitter handles (column 22) and send it off to the t users command
csvfix order -smq -f 22 < current-legislators.csv | grep '[A-z]' | 
  xargs t users --csv > legislators-twitter-profiles.csv
Batch retrieval of friend IDs

At this point, you'll have to use a more low-level API to do the batch lookup of friend ids as t followings will exceed the rate limit for a few of the legislators, such as @DarrellIssa with  30,000+ followings/friends. I recommend the Ruby Twitter gem, or, if you want to stay at the command line, twurl.
For example, here's an example of twurl wrapped in a while loop. The Twitter API endpoint, friends/ids returns up to 5,000 Twitter account IDs at a time. For accounts with 5,000+ friends, the response includes a next_cursor value, which you assign to the cursor parameter in the API call.
When that next_cursor is 0, you've reached the end of the list of friend IDs. For accounts with lots of friends, you might want to add a sleep interval (the Twitter API lets you make 15 calls per 15-minute window).
username=DarrellIssa
next_cursor=-1
while [[ $next_cursor -ne 0 && $next_cursor != "" && $next_cursor != 'null' ]]; do 
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo $json | jq 'has("errors")') == 'true' ]]; then 
    next_cursor=0
    echo "errors: $(echo $json | jq '.errors[0] .message')"
  else  # just exist if there's an error
    echo $json | jq '.ids[]'
    next_cursor=$(echo $json | jq -r '.next_cursor')
  fi
done
In any case, you can download the friend IDs I've fetched for our current legislators in this zip file.
The rest of this script assumes you have a subfolder named friend_ids with text files named after the respective lowercased-Twitter handles, e.g.
   friend_ids/
     |__aaronschock.txt
     |__andercrenshaw.txt

An aside: some random statistics

To find the total number of users who Congressmembers follow on Twitter (based on the zipped snapshot of their Twitter friend ids):
cat friend_ids/*.txt | wc -l
# 755918
To find the number of unique users followed by Congress:
cat friend_ids/*.txt | sort | uniq | wc -l
# 380599
Interested in finding out how many members of Congress follow you? If you've downloaded the zip of Congressmembers' friend_ids (and have access to the t and csvfix tools, of course):
# assuming you don't know your Twitter ID offhand, use `t user` to retrieve it,
# and then, csvfix to grab it from the first column
# (insert your username instead of mine)
my_twitter_id=$(t user dancow --csv | csvfix order -smq -f 1 | tail -n 1)
cat friend_ids/*.txt | grep -c $my_twitter_id
# 4
Alternatively, use grep -l to find out who exactly follows you:
grep -l $(t user dancow --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/darrellissa.txt
# friend_ids/peterroskam.txt
# friend_ids/repgaramendi.txt
# friend_ids/reppeterdefazio.txt
Not too many followers of @God (the Twitter account, I mean). More followers of  @Stanford than @Harvard, though  @TheOnion has more Congressional fans than those colleges combined.
# God
grep -l $(t user God --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/lorettasanchez.txt
# friend_ids/repdianadegette.txt
# friend_ids/repjustinamash.txt
# friend_ids/repsandylevin.txt

# Harvard
grep -l $(t user Harvard --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt 
# friend_ids/chakafattah.txt
# friend_ids/nikiinthehouse.txt
# friend_ids/repbobbyscott.txt
# friend_ids/repkarenbass.txt
# friend_ids/senatorshaheen.txt
# friend_ids/senschumer.txt

# Stanford
grep -l $(t user Stanford --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt 
# friend_ids/dorismatsui.txt
# friend_ids/joaquincastrotx.txt
# friend_ids/repannaeshoo.txt
# friend_ids/repbecerra.txt
# friend_ids/repgaramendi.txt
# friend_ids/repspeier.txt
# friend_ids/reptimryan.txt
# friend_ids/senfeinstein.txt

# Lots of Onion fans
cat friend_ids/*.txt | grep -c $(t user TheOnion --csv | csvfix order -smq -f 1 | tail -n 1)
# 45
What Twitter accounts are most popular among Republicans versus Democrats?

This next bit of code is exceptionally gross looking, but only because I wanted to throw everything into as few lines as possible. What it does is given the current-legislators.csv file, filters by party (R or D), then selects the Twitter username of the filtered legislators. This list of usernames is fed by process to grep -f, which uses those usernames to find all the friend_ids/*.txt files. Each of those text files contains a list of Twitter IDs (numbers, not readable usernames), and so sort | uniq -c | sort -rn is used to count up the most frequently occuring Twitter IDs for the given party of legislators. I save the results into two temp files, /tmp/democrat-friends.csv and /tmp/republican-friends.csv
grep -f <(csvfix find -f 7 -s D < current-legislators.csv | 
  csvfix order -smq -f 22 | grep '[A-z]' | tr [:upper:] [:lower:]) \
  <(ls friend_ids/*.txt) |
  xargs cat | sort | uniq -c | sort -rn | 
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/democrat-friends.csv

# Now the Republicans
grep -f <(csvfix find -f 7 -s R < current-legislators.csv | 
  csvfix order -smq -f 22 | grep '[A-z]' | tr [:upper:] [:lower:]) \
  <(ls friend_ids/*.txt) |
  xargs cat | sort | uniq -c | sort -rn | 
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/republican-friends.csv

Let's filter this list to accounts followed by at least 10 legislators:
echo "ID,democrat_friends,republican_friends" > friends_by_party.csv
csvfix join -f 1:1 /tmp/democrat-friends.csv /tmp/republican-friends.csv | 
  csvfix find -smq -if '($2 + $3) > 10' >> friends_by_party.csv
Finally we take the IDs in this sheet and pass them to t users, which will do a batch lookup of these 4,400+ IDs. We then use the csvfix join command to combine the friends_by_party data with the Twitter profile data.
csvfix join -f 1:1 \
  friends_by_party.csv \
  <(csvfix order -smq -f 1 friends_by_party.csv | xargs t users --id --csv) > friends_by_party_profiles.csv
I've posted a truncated result on this Google Spreadsheet. The adjusted_democrat_friends and adjusted_ratio reflect that there are currently 300 Republicans to 236 Democrats:
csvfix order -f 7 < current-legislators.csv | grep 'R' | wc -l
csvfix order -f 7 < current-legislators.csv | grep 'D' | wc -l
And so I've multiplied the democrat_friends column by a factor of 1.2, as some Twitter accounts have more Republican followers simply because there are more Republicans. In the case of the New York Times (@nytimes), which has 131 Democrat followers to 134 Republican followers, the adjusted number of Democrat followers is 157.2.
Check out the Google Spreadsheet:
https://docs.google.com/spreadsheets/d/1cjRJyrPYj8KAhUrot8ubPOWgbAwm0OqIgPtsswJ2sjM/edit#gid=1358907576
The top 25 Twitter accounts by ratio of Democrat to Republican followers:

The top 25 Twitter accounts by ratio of Republican to Democrat followers:

Comparing media outlets

A quick lookup of which media outlets are followed by Congressmember twitter accounts.
If you don't feel like running all the code yourself to get the results, you can download my copy of friends_by_party_profiles.csv, which again, is limited to the 4,000+ Twitter users followed by at least 10 Congressmembers.
Note: because I didn't do a search for exact names, some terms (such as cbs) returned a lot of affiliated accounts. I trimmed the list down to 100 entities that were interesting to me, and of course, the csvfix/find search expressions were just news orgs off the top of my head, so this isn't an absolute top-of-all-media list (for example, @rollcallpols has 226 total followers, the same as @FoxNews, but I arbitrarily trimmed it since @rollcall is already one of the top items. And then I got bored and so the trimming isn't consistent).
csvfix order -fn "Screen name,democrat_friends,republican_friends" < friends_by_party_profiles.csv | 
  csvfix find -f 1 \
  -ei "nytimes" -ei "washingtonpost"  -ei "politico"  -ei "wsj"  \
  -ei "foxnews" -ei "thehill"  -ei "newshour"  -ei "npr"  -ei 'huffingtonpost' \
  -ei "cspan"  -ei "cnn"  -ei "msnbc"  -ei "latimes"  -ei 'rollcall' \
  -ei "\btime\b"  -ei "propublica"  -ei "newsweek"  -ei "newyorker"  \
  -ei "theatlantic"  -ei "qz"  -ei "reuters"  -ei "\bap\b"  \
  -ei "abc" -ei "nbc" -ei "cbs" -ei "pbs" -ei "usatoday" -ei 'slate' \
  -ei "bbc" -ei "guardian" -ei "forbes" -ei "theeconomist" -ei "rollingstone" \
  -ei "propublica"  -ei "voxdotcom"  -ei "thisisfusion"  -ei "reddit"  \
  -ei "gawker"  -ei "buzzfeed"  -ei "vicenews"  -ei "the_intercept"  \
  -ei "digg"  -ei "thedailyshow" -ei "lastweektonight" -ei "stephenathome" \
  -ei "theonion"  -ei "conanobrien" -ei "espn" |
  csvfix eval -e '($2 + $3)' | csvfix sort -smq -f 4:DN
Excerpted results:


Screename
Democrat friends
Republican friends
Total Congressfriends


thehill
143
205
348


cspan
140
190
330


politico
139
190
329


rollcall
138
182
320


washingtonpost
130
171
301


WSJ
110
186
296


nytimes
131
134
265


cnnbrk
108
129
237


CNN
112
124
236


FoxNews
42
184
226


nprpolitics
108
86
194


nprnews
113
79
192


CBSNews
91
101
192


cspanwj
71
114
185


JoeNBC
73
112
185


jparkABC
74
97
171


ABC
84
87
171


TheEconomist
73
98
171


Reuters
75
93
168


USATODAY
69
96
165


BuzzFeedBen
74
88
162


HuffingtonPost
97
57
154


NBCNews
76
77
153


JohnKingCNN
64
80
144


msnbc
92
52
144


CNBC
45
84
129


NewsHour
70
56
126


thehilltweets
48
76
124


NBCNightlyNews
63
59
122


foxnewspolitics
14
98
112


Newsweek
60
49
109


NBCPolitics
59
49
108


CNNPolitics
53
55
108


FoxNewsSunday
17
87
104


WSJPolitics
34
70
104


NYTimeskrugman
84
19
103


BBCWorld
62
40
102


BBCBreaking
51
50
101


PBS
70
30
100


BuzzFeedAndrew
46
51
97


ABCPolitics
28
64
92


latimes
49
42
91


Forbes
31
59
90


Slate
57
29
86


politicoalex
42
40
82


TheAtlantic
55
24
79


NewYorker
52
25
77


WSJopinion
10
66
76


StephenAtHome
56
20
76


NPR
58
16
74


cspanradio
33
39
72


CNNnewsroom
32
39
71


TheDailyShow
52
17
69


CBSThisMorning
29
35
64


ThisWeekABC
38
26
64


CNNSitRoom
31
29
60


BuzzFeed
35
25
60


CBSPolitics
27
31
58


politico2016
22
34
56


CBSEveningNews
26
28
54


foxnewsradio
3
48
51


NBCFirstRead
26
25
51


POLITICOPro
20
30
50


WSJbreakingnews
18
30
48


ProPublica
34
12
46


msnbc_breaking
30
14
44


TheOnion
19
25
44


nbc
20
23
43


guardian
28
13
41


nytimesworld
29
12
41


CNNgo
25
15
40


CNNMoney
21
18
39


foxnewsalert
2
34
36


BuzzFeedNews
18
17
35


CapitolHillCNN
15
20
35


nprscottsimon
18
17
35


BuzzFeedPol
20
14
34


voxdotcom
31
3
34


FoxNewsLive
2
31
33


NPRinskeep
27
4
31


CBSRadioNews
12
19
31


espn
14
17
31


NYTimesDowd
29
1
30


ReutersPolitics
11
19
30


POLITICO_Steve
16
14
30


NBCNewsWorld
18
12
30


ThisIsFusion
24
5
29


msnbcvideo
21
7
28


BBCNews
13
14
27


nytimesbusiness
13
14
27


NPRHealth
19
7
26


CQRollCall
16
9
25


ConanOBrien
14
11
25


nbcwashington
14
10
24


guardiannews
17
5
22


DatelineNBC
11
11
22


nytimeshealth
14
8
22


CSPANBlog
12
9
21


frontlinepbs
16
5
21


RollingStone
11
4
15


Gawker
9
4
13


GuardianUS
5
7
12
Screename	Democrat friends	Republican friends	Total Congressfriends
thehill	143	205	348
cspan	140	190	330
politico	139	190	329
rollcall	138	182	320
washingtonpost	130	171	301
WSJ	110	186	296
nytimes	131	134	265
cnnbrk	108	129	237
CNN	112	124	236
FoxNews	42	184	226
nprpolitics	108	86	194
nprnews	113	79	192
CBSNews	91	101	192
cspanwj	71	114	185
JoeNBC	73	112	185
jparkABC	74	97	171
ABC	84	87	171
TheEconomist	73	98	171
Reuters	75	93	168
USATODAY	69	96	165
BuzzFeedBen	74	88	162
HuffingtonPost	97	57	154
NBCNews	76	77	153
JohnKingCNN	64	80	144
msnbc	92	52	144
CNBC	45	84	129
NewsHour	70	56	126
thehilltweets	48	76	124
NBCNightlyNews	63	59	122
foxnewspolitics	14	98	112
Newsweek	60	49	109
NBCPolitics	59	49	108
CNNPolitics	53	55	108
FoxNewsSunday	17	87	104
WSJPolitics	34	70	104
NYTimeskrugman	84	19	103
BBCWorld	62	40	102
BBCBreaking	51	50	101
PBS	70	30	100
BuzzFeedAndrew	46	51	97
ABCPolitics	28	64	92
latimes	49	42	91
Forbes	31	59	90
Slate	57	29	86
politicoalex	42	40	82
TheAtlantic	55	24	79
NewYorker	52	25	77
WSJopinion	10	66	76
StephenAtHome	56	20	76
NPR	58	16	74
cspanradio	33	39	72
CNNnewsroom	32	39	71
TheDailyShow	52	17	69
CBSThisMorning	29	35	64
ThisWeekABC	38	26	64
CNNSitRoom	31	29	60
BuzzFeed	35	25	60
CBSPolitics	27	31	58
politico2016	22	34	56
CBSEveningNews	26	28	54
foxnewsradio	3	48	51
NBCFirstRead	26	25	51
POLITICOPro	20	30	50
WSJbreakingnews	18	30	48
ProPublica	34	12	46
msnbc_breaking	30	14	44
TheOnion	19	25	44
nbc	20	23	43
guardian	28	13	41
nytimesworld	29	12	41
CNNgo	25	15	40
CNNMoney	21	18	39
foxnewsalert	2	34	36
BuzzFeedNews	18	17	35
CapitolHillCNN	15	20	35
nprscottsimon	18	17	35
BuzzFeedPol	20	14	34
voxdotcom	31	3	34
FoxNewsLive	2	31	33
NPRinskeep	27	4	31
CBSRadioNews	12	19	31
espn	14	17	31
NYTimesDowd	29	1	30
ReutersPolitics	11	19	30
POLITICO_Steve	16	14	30
NBCNewsWorld	18	12	30
ThisIsFusion	24	5	29
msnbcvideo	21	7	28
BBCNews	13	14	27
nytimesbusiness	13	14	27
NPRHealth	19	7	26
CQRollCall	16	9	25
ConanOBrien	14	11	25
nbcwashington	14	10	24
guardiannews	17	5	22
DatelineNBC	11	11	22
nytimeshealth	14	8	22
CSPANBlog	12	9	21
frontlinepbs	16	5	21
RollingStone	11	4	15
Gawker	9	4	13
GuardianUS	5	7	12