Instantly share code, notes, and snippets.

Embed
What would you like to do?
Who Congress Follows on Twitter data

Find out who Congress follows on Twitter using the command line

Part of a lesson for the Stanford Journalism Computational Methods in the Civic Sphere

This is a short tutorial on how to use command-line tools, including csvfix and t, the command-line Twitter interface, to access and parse data from the Sunlight Foundation and Twitter. The end goal of this exercise is to gather who everyone in Congress follows (friends, in the parlance of Twitter), and then count up the common friends to find out which Twitter accounts are most followed by members of Congress.

Here's a screenshot of the result, after it's been imported into Google Spreadsheets. Note that it's sorted by total number of followers, not necessarily the most followed by Congress:

img

If you don't care about the code, you can visit the Google Spreadsheet here, which contains an excerpt of the data (top 600 accounts by total follower count).

The code in this tutorial is meant to be brief and expects that you have some familiarity with the basic Unix tools (such as grep and sort), as well as the concept of pipes and processes.

Getting legislator data from the Sunlight Foundationz

The Sunlight Foundation has a convenient spreadsheet of legislators, their party affiliation, and social media handles.

# save to legislators.csv
curl -s -O http://unitedstates.sunlightfoundation.com/legislators/legislators.csv

`

The sheet contains past legislators, so filter it by the "in_office" column, which is at position 10, and save it to a new file named current-legislators.csv

csvfix find -f 10 -s 1 < legislators.csv > current-legislators.csv

Then we extract the Twitter handles (column 22) and send it off to the t users command

csvfix order -smq -f 22 < current-legislators.csv | grep '[A-z]' | 
  xargs t users --csv > legislators-twitter-profiles.csv

Batch retrieval of friend IDs

At this point, you'll have to use a more low-level API to do the batch lookup of friend ids as t followings will exceed the rate limit for a few of the legislators, such as @DarrellIssa with 30,000+ followings/friends. I recommend the Ruby Twitter gem, or, if you want to stay at the command line, twurl.

For example, here's an example of twurl wrapped in a while loop. The Twitter API endpoint, friends/ids returns up to 5,000 Twitter account IDs at a time. For accounts with 5,000+ friends, the response includes a next_cursor value, which you assign to the cursor parameter in the API call.

When that next_cursor is 0, you've reached the end of the list of friend IDs. For accounts with lots of friends, you might want to add a sleep interval (the Twitter API lets you make 15 calls per 15-minute window).

username=DarrellIssa
next_cursor=-1
while [[ $next_cursor -ne 0 && $next_cursor != "" && $next_cursor != 'null' ]]; do 
  json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
  if [[ $? != 0 || $(echo $json | jq 'has("errors")') == 'true' ]]; then 
    next_cursor=0
    echo "errors: $(echo $json | jq '.errors[0] .message')"
  else  # just exist if there's an error
    echo $json | jq '.ids[]'
    next_cursor=$(echo $json | jq -r '.next_cursor')
  fi
done

In any case, you can download the friend IDs I've fetched for our current legislators in this zip file.

The rest of this script assumes you have a subfolder named friend_ids with text files named after the respective lowercased-Twitter handles, e.g.

   friend_ids/
     |__aaronschock.txt
     |__andercrenshaw.txt

An aside: some random statistics

To find the total number of users who Congressmembers follow on Twitter (based on the zipped snapshot of their Twitter friend ids):

cat friend_ids/*.txt | wc -l
# 755918

To find the number of unique users followed by Congress:

cat friend_ids/*.txt | sort | uniq | wc -l
# 380599

Interested in finding out how many members of Congress follow you? If you've downloaded the zip of Congressmembers' friend_ids (and have access to the t and csvfix tools, of course):

# assuming you don't know your Twitter ID offhand, use `t user` to retrieve it,
# and then, csvfix to grab it from the first column
# (insert your username instead of mine)
my_twitter_id=$(t user dancow --csv | csvfix order -smq -f 1 | tail -n 1)
cat friend_ids/*.txt | grep -c $my_twitter_id
# 4

Alternatively, use grep -l to find out who exactly follows you:

grep -l $(t user dancow --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/darrellissa.txt
# friend_ids/peterroskam.txt
# friend_ids/repgaramendi.txt
# friend_ids/reppeterdefazio.txt

Not too many followers of @God (the Twitter account, I mean). More followers of @Stanford than @Harvard, though @TheOnion has more Congressional fans than those colleges combined.

# God
grep -l $(t user God --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/lorettasanchez.txt
# friend_ids/repdianadegette.txt
# friend_ids/repjustinamash.txt
# friend_ids/repsandylevin.txt

# Harvard
grep -l $(t user Harvard --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt 
# friend_ids/chakafattah.txt
# friend_ids/nikiinthehouse.txt
# friend_ids/repbobbyscott.txt
# friend_ids/repkarenbass.txt
# friend_ids/senatorshaheen.txt
# friend_ids/senschumer.txt

# Stanford
grep -l $(t user Stanford --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt 
# friend_ids/dorismatsui.txt
# friend_ids/joaquincastrotx.txt
# friend_ids/repannaeshoo.txt
# friend_ids/repbecerra.txt
# friend_ids/repgaramendi.txt
# friend_ids/repspeier.txt
# friend_ids/reptimryan.txt
# friend_ids/senfeinstein.txt

# Lots of Onion fans
cat friend_ids/*.txt | grep -c $(t user TheOnion --csv | csvfix order -smq -f 1 | tail -n 1)
# 45

What Twitter accounts are most popular among Republicans versus Democrats?

This next bit of code is exceptionally gross looking, but only because I wanted to throw everything into as few lines as possible. What it does is given the current-legislators.csv file, filters by party (R or D), then selects the Twitter username of the filtered legislators. This list of usernames is fed by process to grep -f, which uses those usernames to find all the friend_ids/*.txt files. Each of those text files contains a list of Twitter IDs (numbers, not readable usernames), and so sort | uniq -c | sort -rn is used to count up the most frequently occuring Twitter IDs for the given party of legislators. I save the results into two temp files, /tmp/democrat-friends.csv and /tmp/republican-friends.csv

grep -f <(csvfix find -f 7 -s D < current-legislators.csv | 
  csvfix order -smq -f 22 | grep '[A-z]' | tr [:upper:] [:lower:]) \
  <(ls friend_ids/*.txt) |
  xargs cat | sort | uniq -c | sort -rn | 
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/democrat-friends.csv

# Now the Republicans
grep -f <(csvfix find -f 7 -s R < current-legislators.csv | 
  csvfix order -smq -f 22 | grep '[A-z]' | tr [:upper:] [:lower:]) \
  <(ls friend_ids/*.txt) |
  xargs cat | sort | uniq -c | sort -rn | 
  sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/republican-friends.csv

Let's filter this list to accounts followed by at least 10 legislators:

echo "ID,democrat_friends,republican_friends" > friends_by_party.csv
csvfix join -f 1:1 /tmp/democrat-friends.csv /tmp/republican-friends.csv | 
  csvfix find -smq -if '($2 + $3) > 10' >> friends_by_party.csv

Finally we take the IDs in this sheet and pass them to t users, which will do a batch lookup of these 4,400+ IDs. We then use the csvfix join command to combine the friends_by_party data with the Twitter profile data.

csvfix join -f 1:1 \
  friends_by_party.csv \
  <(csvfix order -smq -f 1 friends_by_party.csv | xargs t users --id --csv) > friends_by_party_profiles.csv

I've posted a truncated result on this Google Spreadsheet. The adjusted_democrat_friends and adjusted_ratio reflect that there are currently 300 Republicans to 236 Democrats:

csvfix order -f 7 < current-legislators.csv | grep 'R' | wc -l
csvfix order -f 7 < current-legislators.csv | grep 'D' | wc -l

And so I've multiplied the democrat_friends column by a factor of 1.2, as some Twitter accounts have more Republican followers simply because there are more Republicans. In the case of the New York Times (@nytimes), which has 131 Democrat followers to 134 Republican followers, the adjusted number of Democrat followers is 157.2.

Check out the Google Spreadsheet:

https://docs.google.com/spreadsheets/d/1cjRJyrPYj8KAhUrot8ubPOWgbAwm0OqIgPtsswJ2sjM/edit#gid=1358907576

The top 25 Twitter accounts by ratio of Democrat to Republican followers:

img

The top 25 Twitter accounts by ratio of Republican to Democrat followers:

img

Comparing media outlets

A quick lookup of which media outlets are followed by Congressmember twitter accounts.

If you don't feel like running all the code yourself to get the results, you can download my copy of friends_by_party_profiles.csv, which again, is limited to the 4,000+ Twitter users followed by at least 10 Congressmembers.

Note: because I didn't do a search for exact names, some terms (such as cbs) returned a lot of affiliated accounts. I trimmed the list down to 100 entities that were interesting to me, and of course, the csvfix/find search expressions were just news orgs off the top of my head, so this isn't an absolute top-of-all-media list (for example, @rollcallpols has 226 total followers, the same as @FoxNews, but I arbitrarily trimmed it since @rollcall is already one of the top items. And then I got bored and so the trimming isn't consistent).

csvfix order -fn "Screen name,democrat_friends,republican_friends" < friends_by_party_profiles.csv | 
  csvfix find -f 1 \
  -ei "nytimes" -ei "washingtonpost"  -ei "politico"  -ei "wsj"  \
  -ei "foxnews" -ei "thehill"  -ei "newshour"  -ei "npr"  -ei 'huffingtonpost' \
  -ei "cspan"  -ei "cnn"  -ei "msnbc"  -ei "latimes"  -ei 'rollcall' \
  -ei "\btime\b"  -ei "propublica"  -ei "newsweek"  -ei "newyorker"  \
  -ei "theatlantic"  -ei "qz"  -ei "reuters"  -ei "\bap\b"  \
  -ei "abc" -ei "nbc" -ei "cbs" -ei "pbs" -ei "usatoday" -ei 'slate' \
  -ei "bbc" -ei "guardian" -ei "forbes" -ei "theeconomist" -ei "rollingstone" \
  -ei "propublica"  -ei "voxdotcom"  -ei "thisisfusion"  -ei "reddit"  \
  -ei "gawker"  -ei "buzzfeed"  -ei "vicenews"  -ei "the_intercept"  \
  -ei "digg"  -ei "thedailyshow" -ei "lastweektonight" -ei "stephenathome" \
  -ei "theonion"  -ei "conanobrien" -ei "espn" |
  csvfix eval -e '($2 + $3)' | csvfix sort -smq -f 4:DN

Excerpted results:

Screename Democrat friends Republican friends Total Congressfriends
thehill 143 205 348
cspan 140 190 330
politico 139 190 329
rollcall 138 182 320
washingtonpost 130 171 301
WSJ 110 186 296
nytimes 131 134 265
cnnbrk 108 129 237
CNN 112 124 236
FoxNews 42 184 226
nprpolitics 108 86 194
nprnews 113 79 192
CBSNews 91 101 192
cspanwj 71 114 185
JoeNBC 73 112 185
jparkABC 74 97 171
ABC 84 87 171
TheEconomist 73 98 171
Reuters 75 93 168
USATODAY 69 96 165
BuzzFeedBen 74 88 162
HuffingtonPost 97 57 154
NBCNews 76 77 153
JohnKingCNN 64 80 144
msnbc 92 52 144
CNBC 45 84 129
NewsHour 70 56 126
thehilltweets 48 76 124
NBCNightlyNews 63 59 122
foxnewspolitics 14 98 112
Newsweek 60 49 109
NBCPolitics 59 49 108
CNNPolitics 53 55 108
FoxNewsSunday 17 87 104
WSJPolitics 34 70 104
NYTimeskrugman 84 19 103
BBCWorld 62 40 102
BBCBreaking 51 50 101
PBS 70 30 100
BuzzFeedAndrew 46 51 97
ABCPolitics 28 64 92
latimes 49 42 91
Forbes 31 59 90
Slate 57 29 86
politicoalex 42 40 82
TheAtlantic 55 24 79
NewYorker 52 25 77
WSJopinion 10 66 76
StephenAtHome 56 20 76
NPR 58 16 74
cspanradio 33 39 72
CNNnewsroom 32 39 71
TheDailyShow 52 17 69
CBSThisMorning 29 35 64
ThisWeekABC 38 26 64
CNNSitRoom 31 29 60
BuzzFeed 35 25 60
CBSPolitics 27 31 58
politico2016 22 34 56
CBSEveningNews 26 28 54
foxnewsradio 3 48 51
NBCFirstRead 26 25 51
POLITICOPro 20 30 50
WSJbreakingnews 18 30 48
ProPublica 34 12 46
msnbc_breaking 30 14 44
TheOnion 19 25 44
nbc 20 23 43
guardian 28 13 41
nytimesworld 29 12 41
CNNgo 25 15 40
CNNMoney 21 18 39
foxnewsalert 2 34 36
BuzzFeedNews 18 17 35
CapitolHillCNN 15 20 35
nprscottsimon 18 17 35
BuzzFeedPol 20 14 34
voxdotcom 31 3 34
FoxNewsLive 2 31 33
NPRinskeep 27 4 31
CBSRadioNews 12 19 31
espn 14 17 31
NYTimesDowd 29 1 30
ReutersPolitics 11 19 30
POLITICO_Steve 16 14 30
NBCNewsWorld 18 12 30
ThisIsFusion 24 5 29
msnbcvideo 21 7 28
BBCNews 13 14 27
nytimesbusiness 13 14 27
NPRHealth 19 7 26
CQRollCall 16 9 25
ConanOBrien 14 11 25
nbcwashington 14 10 24
guardiannews 17 5 22
DatelineNBC 11 11 22
nytimeshealth 14 8 22
CSPANBlog 12 9 21
frontlinepbs 16 5 21
RollingStone 11 4 15
Gawker 9 4 13
GuardianUS 5 7 12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment