Find out who Congress follows on Twitter using the command line
Part of a lesson for the Stanford Journalism class Computational Methods in the Civic Sphere
This is a short tutorial on how to use command-line tools, including csvfix and t, the command-line Twitter interface, to access and parse data from the Sunlight Foundation and Twitter. The end goal of this exercise is to gather who everyone in Congress follows (friends, in the parlance of Twitter), and then count up the common friends to find out which Twitter accounts are most followed by members of Congress.
Here's a screenshot of the result, after it's been imported into Google Spreadsheets. Note that it's sorted by total number of followers, not necessarily the most followed by Congress:
If you don't care about the code, you can visit the Google Spreadsheet here, which contains an excerpt of the data (top 600 accounts by total follower count).
Getting legislator data from the Sunlight Foundation
The Sunlight Foundation has a convenient spreadsheet of legislators, their party affiliation, and social media handles.
# save to legislators.csv
curl -s -O http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
The sheet also contains past legislators, so filter it by the "in_office" column, which is at position 10, and save the result to a new file named current-legislators.csv:
csvfix find -f 10 -s 1 < legislators.csv > current-legislators.csv
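The column positions used in this tutorial (10 for in_office, 22 for the Twitter handle) depend on the layout of the Sunlight CSV, so it may be worth confirming them before filtering. Here is a minimal sketch that numbers the header fields, assuming the header row has no embedded commas; the function name number_columns is made up for illustration:

```shell
# Print each header field of a CSV next to its 1-based position.
# Assumption: header fields contain no embedded commas or newlines.
number_columns() {
  head -n 1 "$1" | tr -d '"\r' | tr ',' '\n' | awk '{ print NR ": " $0 }'
}
```

Running `number_columns legislators.csv | grep in_office` should then show the position that csvfix find -f expects.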
Then we extract the Twitter handles (column 22) and send them off to the t users command:
csvfix order -smq -f 22 < current-legislators.csv | grep '[A-Za-z]' | xargs t users --csv > legislators-twitter-profiles.csv
Batch retrieval of friend IDs
At this point, you'll have to use a lower-level API to do the batch lookup of friend IDs, as t followings will exceed the rate limit for a few of the legislators, such as @DarrellIssa with 30,000+ followings/friends. I recommend the Ruby Twitter gem, or, if you want to stay at the command line, twurl.
For example, here's twurl wrapped in a while loop. The Twitter API endpoint friends/ids returns up to 5,000 Twitter account IDs at a time. For accounts with 5,000+ friends, the response includes a next_cursor value, which you assign to the cursor parameter in the next API call. When next_cursor is 0, you've reached the end of the list of friend IDs. For accounts with lots of friends, you might want to add a sleep interval (the Twitter API lets you make 15 calls per 15-minute window).
username=DarrellIssa
next_cursor=-1
while [[ $next_cursor -ne 0 && $next_cursor != "" && $next_cursor != 'null' ]]; do
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo "$json" | jq 'has("errors")') == 'true' ]]; then
    # just exit the loop if there's an error
    next_cursor=0
    echo "errors: $(echo "$json" | jq '.errors[].message')"
  else
    # print one friend ID per line
    echo "$json" | jq '.ids[]'
    next_cursor=$(echo "$json" | jq -r '.next_cursor')
  fi
done
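The cursor logic is easier to see with the network call stubbed out. The sketch below is not the loop above but a simplified illustration of the same idea: start at cursor -1, feed each next_cursor back in, stop at 0, and sleep between calls. Both function names are hypothetical, and fetch_page is a canned stand-in for the real twurl call.

```shell
# Hypothetical stand-in for the API call: prints next_cursor on the first
# line, then one friend ID per line. For real use, its body could be
# something like (untested assumption):
#   twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$1" |
#     jq -r '.next_cursor, .ids[]'
fetch_page() {
  case "$1" in
    -1)  printf '%s\n' 111 1001 1002 ;;  # first page points at cursor 111
    111) printf '%s\n' 0 1003 ;;         # last page: next_cursor is 0
  esac
}

# Follow cursors from -1 until next_cursor is 0 (or missing),
# sleeping between calls to stay under the rate limit.
collect_ids() {
  local pause="${1:-60}" cursor=-1 page
  while [ -n "$cursor" ] && [ "$cursor" != "0" ] && [ "$cursor" != "null" ]; do
    page=$(fetch_page "$cursor")
    cursor=$(printf '%s\n' "$page" | head -n 1)
    printf '%s\n' "$page" | tail -n +2
    if [ -n "$cursor" ] && [ "$cursor" != "0" ] && [ "$cursor" != "null" ]; then
      sleep "$pause"
    fi
  done
}
```

The default pause of 60 seconds corresponds to the 15 calls per 15-minute window mentioned above; `collect_ids 0` runs the stubbed demo without waiting.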
In any case, you can download the friend IDs I've fetched for our current legislators in this zip file.
The rest of this script assumes you have a subfolder named friend_ids with text files named after the respective lowercased-Twitter handles, e.g.
friend_ids/
|__aaronschock.txt
|__andercrenshaw.txt
An aside: some random statistics
To find the total number of users who Congressmembers follow on Twitter (based on the zipped snapshot of their Twitter friend ids):
cat friend_ids/*.txt | wc -l # 755918
To find the number of unique users followed by Congress:
cat friend_ids/*.txt | sort | uniq | wc -l # 380599
Interested in finding out how many members of Congress follow you? If you've downloaded the zip of Congressmembers' friend_ids (and have access to the t and csvfix tools, of course):
# assuming you don't know your Twitter ID offhand, use `t user` to retrieve it,
# and then, csvfix to grab it from the first column
# (insert your username instead of mine)
my_twitter_id=$(t user dancow --csv | csvfix order -smq -f 1 | tail -n 1)
cat friend_ids/*.txt | grep -c $my_twitter_id
# 4
Use grep -l to find out who exactly follows you:
grep -l $(t user dancow --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/darrellissa.txt
# friend_ids/peterroskam.txt
# friend_ids/repgaramendi.txt
# friend_ids/reppeterdefazio.txt
# God
grep -l $(t user God --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/lorettasanchez.txt
# friend_ids/repdianadegette.txt
# friend_ids/repjustinamash.txt
# friend_ids/repsandylevin.txt

# Harvard
grep -l $(t user Harvard --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/chakafattah.txt
# friend_ids/nikiinthehouse.txt
# friend_ids/repbobbyscott.txt
# friend_ids/repkarenbass.txt
# friend_ids/senatorshaheen.txt
# friend_ids/senschumer.txt

# Stanford
grep -l $(t user Stanford --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/dorismatsui.txt
# friend_ids/joaquincastrotx.txt
# friend_ids/repannaeshoo.txt
# friend_ids/repbecerra.txt
# friend_ids/repgaramendi.txt
# friend_ids/repspeier.txt
# friend_ids/reptimryan.txt
# friend_ids/senfeinstein.txt

# Lots of Onion fans
cat friend_ids/*.txt | grep -c $(t user TheOnion --csv | csvfix order -smq -f 1 | tail -n 1)
# 45
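One caveat with the grep calls above: plain grep matches substrings, so a short ID such as 123 would also match a line containing 1234. Here is a minimal sketch of a whole-line lookup, assuming each friend_ids file holds one numeric ID per line; the function name who_follows is made up for illustration:

```shell
# Print the handle (file name minus directory and .txt) of every account
# whose friend list contains exactly this ID.
# -x matches whole lines only, -F treats the ID as a literal string.
who_follows() {
  local id="$1" dir="${2:-friend_ids}"
  grep -lxF -- "$id" "$dir"/*.txt | sed -E 's#.*/##; s#\.txt$##'
}
```

For example, `who_follows "$my_twitter_id"` would list the handles rather than the file paths.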
What Twitter accounts are most popular among Republicans versus Democrats?
This next bit of code is exceptionally gross looking, but only because I wanted to throw everything into as few lines as possible. Given the current-legislators.csv file, it filters by party (D), then selects the Twitter usernames of the filtered legislators. This list of usernames is fed by process substitution to grep -f, which uses those usernames to find all the matching friend_ids/*.txt files. Each of those text files contains a list of Twitter IDs (numbers, not readable usernames), and so sort | uniq -c | sort -rn is used to count up the most frequently occurring Twitter IDs for the given party of legislators. I save the results into two temp files, /tmp/democrat-friends.csv and /tmp/republican-friends.csv:
grep -f <(csvfix find -f 7 -s D < current-legislators.csv | csvfix order -smq -f 22 | grep '[A-Za-z]' | tr '[:upper:]' '[:lower:]') \
  <(ls friend_ids/*.txt) | xargs cat | sort | uniq -c | sort -rn |
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/democrat-friends.csv

# Now the Republicans
grep -f <(csvfix find -f 7 -s R < current-legislators.csv | csvfix order -smq -f 22 | grep '[A-Za-z]' | tr '[:upper:]' '[:lower:]') \
  <(ls friend_ids/*.txt) | xargs cat | sort | uniq -c | sort -rn |
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/republican-friends.csv
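Because grep -f treats each handle as a pattern, a handle that happens to be a substring of another file name would pull in that file too (a hypothetical repjohn would also match repjohnson.txt). The sketch below is a more verbose alternative, not the method used above: it reads handles on standard input, looks up each one's file by exact name, and emits the same ID,count format. The function name count_friends_for is made up for illustration.

```shell
# Read Twitter handles on stdin (one per line), look up each handle's
# friend_ids file by exact lowercased name, and print "ID,count" rows,
# most-followed IDs first (ties broken by ascending ID).
count_friends_for() {
  local dir="${1:-friend_ids}" handle file
  tr '[:upper:]' '[:lower:]' | while IFS= read -r handle; do
    file="$dir/$handle.txt"
    if [ -n "$handle" ] && [ -f "$file" ]; then
      cat "$file"
    fi
  done | sort | uniq -c | sort -k1,1nr -k2,2n | awk '{ print $2 "," $1 }'
}
```

It could be fed from the same csvfix pipeline, for instance `csvfix find -f 7 -s D < current-legislators.csv | csvfix order -smq -f 22 | count_friends_for friend_ids`.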
Let's filter this list to accounts followed by more than 10 legislators in total:
echo "ID,democrat_friends,republican_friends" > friends_by_party.csv
csvfix join -f 1:1 /tmp/democrat-friends.csv /tmp/republican-friends.csv | csvfix find -smq -if '($2 + $3) > 10' >> friends_by_party.csv
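For readers without csvfix, the join-then-filter step can be approximated with standard tools. This is a rough sketch rather than an equivalent: it assumes both inputs are simple ID,count files with no quoting, and join_and_filter is a made-up name. Like a plain inner join, it drops IDs that appear in only one of the two files.

```shell
# Join two "ID,count" files on ID and keep rows whose combined count
# exceeds a threshold. Output columns: ID,first_count,second_count.
join_and_filter() {
  local first="$1" second="$2" min="${3:-10}" a b
  a=$(mktemp)
  b=$(mktemp)
  # join needs both inputs sorted on the join field in the same collation
  LC_ALL=C sort -t, -k1,1 "$first" > "$a"
  LC_ALL=C sort -t, -k1,1 "$second" > "$b"
  LC_ALL=C join -t, "$a" "$b" | awk -F, -v min="$min" '($2 + $3) > min'
  rm -f "$a" "$b"
}
```

Called as `join_and_filter /tmp/democrat-friends.csv /tmp/republican-friends.csv 10`, it prints rows in ID order rather than by count.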
Finally, we take the IDs in this sheet and pass them to t users, which will do a batch lookup of these 4,400+ IDs. We then use the csvfix join command to combine the friends_by_party data with the Twitter profile data.
csvfix join -f 1:1 \
  friends_by_party.csv \
  <(csvfix order -smq -f 1 friends_by_party.csv | xargs t users --id --csv) > friends_by_party_profiles.csv
I've posted a truncated result on this Google Spreadsheet. The adjusted_democrat_friends and adjusted_ratio reflect that there are currently 300 Republicans to 236 Democrats:
csvfix order -f 7 < current-legislators.csv | grep 'R' | wc -l
csvfix order -f 7 < current-legislators.csv | grep 'D' | wc -l
And so I've multiplied the democrat_friends column by a factor of 1.2, as some Twitter accounts have more Republican followers simply because there are more Republicans. In the case of the New York Times (@nytimes), which has 131 Democrat followers to 134 Republican followers, the adjusted number of Democrat followers is 131 × 1.2, or about 157.
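The adjustment is simple enough to apply with awk. In the sketch below, the factor defaults to the 1.2 used here (the exact ratio 300/236 is closer to 1.27, so it is left as a parameter). The definition of adjusted_ratio as adjusted Democrat friends divided by Republican friends is my assumption, since the text does not spell it out, and the function name add_adjusted_columns is made up for illustration.

```shell
# Read "ID,democrat_friends,republican_friends" CSV (with header) on stdin
# and append adjusted_democrat_friends and adjusted_ratio columns.
# Assumption: adjusted_ratio = adjusted_democrat_friends / republican_friends.
add_adjusted_columns() {
  awk -F, -v OFS=, -v factor="${1:-1.2}" '
    NR == 1 { print $0, "adjusted_democrat_friends", "adjusted_ratio"; next }
    {
      adj = $2 * factor
      # leave the ratio blank when there are no Republican friends
      ratio = ($3 > 0) ? sprintf("%.2f", adj / $3) : ""
      print $0, sprintf("%.1f", adj), ratio
    }'
}
```

For the @nytimes counts above (131 and 134), this gives an adjusted Democrat count of 157.2 and a ratio of 1.17.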
Check out the Google Spreadsheet:
The top 25 Twitter accounts by ratio of Democrat to Republican followers:
The top 25 Twitter accounts by ratio of Republican to Democrat followers:
Comparing media outlets
A quick lookup of which media outlets are followed by Congressmembers' Twitter accounts.
If you don't feel like running all the code yourself to get the results, you can download my copy of friends_by_party_profiles.csv, which again, is limited to the 4,000+ Twitter users followed by at least 10 Congressmembers.
Note: because I didn't do a search for exact names, some terms (such as cbs) returned a lot of affiliated accounts. I trimmed the list down to 100 entities that were interesting to me, and of course, the csvfix find search expressions were just news orgs off the top of my head, so this isn't an absolute top-of-all-media list. For example, @rollcallpols has 226 total followers, the same as @FoxNews, but I arbitrarily trimmed it since @rollcall is already one of the top items. And then I got bored, so the trimming isn't consistent.
csvfix order -fn "Screen name,democrat_friends,republican_friends" < friends_by_party_profiles.csv | csvfix find -f 1 \
  -ei "nytimes" -ei "washingtonpost" -ei "politico" -ei "wsj" \
  -ei "foxnews" -ei "thehill" -ei "newshour" -ei "npr" -ei 'huffingtonpost' \
  -ei "cspan" -ei "cnn" -ei "msnbc" -ei "latimes" -ei 'rollcall' \
  -ei "\btime\b" -ei "propublica" -ei "newsweek" -ei "newyorker" \
  -ei "theatlantic" -ei "qz" -ei "reuters" -ei "\bap\b" \
  -ei "abc" -ei "nbc" -ei "cbs" -ei "pbs" -ei "usatoday" -ei 'slate' \
  -ei "bbc" -ei "guardian" -ei "forbes" -ei "theeconomist" -ei "rollingstone" \
  -ei "propublica" -ei "voxdotcom" -ei "thisisfusion" -ei "reddit" \
  -ei "gawker" -ei "buzzfeed" -ei "vicenews" -ei "the_intercept" \
  -ei "digg" -ei "thedailyshow" -ei "lastweektonight" -ei "stephenathome" \
  -ei "theonion" -ei "conanobrien" -ei "espn" |
  csvfix eval -e '($2 + $3)' |
  csvfix sort -smq -f 4:DN
|Screen name||Democrat friends||Republican friends||Total Congressfriends|