Part of a lesson for the Stanford Journalism Computational Methods in the Civic Sphere
This is a short tutorial on how to use command-line tools, including csvfix and t, the command-line Twitter interface, to access and parse data from the Sunlight Foundation and Twitter. The end goal of this exercise is to gather who everyone in Congress follows (friends, in the parlance of Twitter), and then count up the common friends to find out which Twitter accounts are most followed by members of Congress.
Here's a screenshot of the result, after it's been imported into Google Spreadsheets. Note that it's sorted by total number of followers, not necessarily the most followed by Congress:
If you don't care about the code, you can visit the Google Spreadsheet here, which contains an excerpt of the data (top 600 accounts by total follower count).
The code in this tutorial is meant to be brief and expects that you have some familiarity with the basic Unix tools (such as grep and sort), as well as the concept of pipes and processes.
The Sunlight Foundation has a convenient spreadsheet of legislators, their party affiliation, and social media handles.
# save to legislators.csv
curl -s -O http://unitedstates.sunlightfoundation.com/legislators/legislators.csv
`
The sheet contains past legislators, so filter it by the "in_office" column, which is at position 10, and save it to a new file named current-legislators.csv
csvfix find -f 10 -s 1 < legislators.csv > current-legislators.csv
Then we extract the Twitter handles (column 22) and send it off to the t users
command
csvfix order -smq -f 22 < current-legislators.csv | grep '[A-z]' |
xargs t users --csv > legislators-twitter-profiles.csv
At this point, you'll have to use a more low-level API to do the batch lookup of friend ids as t followings
will exceed the rate limit for a few of the legislators, such as @DarrellIssa with 30,000+ followings/friends. I recommend the Ruby Twitter gem, or, if you want to stay at the command line, twurl.
For example, here's an example of twurl
wrapped in a while
loop. The Twitter API endpoint, friends/ids returns up to 5,000 Twitter account IDs at a time. For accounts with 5,000+ friends, the response includes a next_cursor
value, which you assign to the cursor
parameter in the API call.
When that next_cursor
is 0, you've reached the end of the list of friend IDs. For accounts with lots of friends, you might want to add a sleep
interval (the Twitter API lets you make 15 calls per 15-minute window).
username=DarrellIssa
next_cursor=-1
while [[ $next_cursor -ne 0 && $next_cursor != "" && $next_cursor != 'null' ]]; do
json=$(twurl "/1.1/friends/ids.json?screen_name=$username&cursor=$next_cursor")
if [[ $? != 0 || $(echo $json | jq 'has("errors")') == 'true' ]]; then
next_cursor=0
echo "errors: $(echo $json | jq '.errors[0] .message')"
else # just exist if there's an error
echo $json | jq '.ids[]'
next_cursor=$(echo $json | jq -r '.next_cursor')
fi
done
In any case, you can download the friend IDs I've fetched for our current legislators in this zip file.
The rest of this script assumes you have a subfolder named friend_ids with text files named after the respective lowercased-Twitter handles, e.g.
friend_ids/
|__aaronschock.txt
|__andercrenshaw.txt
To find the total number of users who Congressmembers follow on Twitter (based on the zipped snapshot of their Twitter friend ids):
cat friend_ids/*.txt | wc -l
# 755918
To find the number of unique users followed by Congress:
cat friend_ids/*.txt | sort | uniq | wc -l
# 380599
Interested in finding out how many members of Congress follow you? If you've downloaded the zip of Congressmembers' friend_ids (and have access to the t and csvfix tools, of course):
# assuming you don't know your Twitter ID offhand, use `t user` to retrieve it,
# and then, csvfix to grab it from the first column
# (insert your username instead of mine)
my_twitter_id=$(t user dancow --csv | csvfix order -smq -f 1 | tail -n 1)
cat friend_ids/*.txt | grep -c $my_twitter_id
# 4
Alternatively, use grep -l
to find out who exactly follows you:
grep -l $(t user dancow --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/darrellissa.txt
# friend_ids/peterroskam.txt
# friend_ids/repgaramendi.txt
# friend_ids/reppeterdefazio.txt
Not too many followers of @God (the Twitter account, I mean). More followers of @Stanford than @Harvard, though @TheOnion has more Congressional fans than those colleges combined.
# God
grep -l $(t user God --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/lorettasanchez.txt
# friend_ids/repdianadegette.txt
# friend_ids/repjustinamash.txt
# friend_ids/repsandylevin.txt
# Harvard
grep -l $(t user Harvard --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/chakafattah.txt
# friend_ids/nikiinthehouse.txt
# friend_ids/repbobbyscott.txt
# friend_ids/repkarenbass.txt
# friend_ids/senatorshaheen.txt
# friend_ids/senschumer.txt
# Stanford
grep -l $(t user Stanford --csv | csvfix order -smq -f 1 | tail -n 1) friend_ids/*.txt
# friend_ids/dorismatsui.txt
# friend_ids/joaquincastrotx.txt
# friend_ids/repannaeshoo.txt
# friend_ids/repbecerra.txt
# friend_ids/repgaramendi.txt
# friend_ids/repspeier.txt
# friend_ids/reptimryan.txt
# friend_ids/senfeinstein.txt
# Lots of Onion fans
cat friend_ids/*.txt | grep -c $(t user TheOnion --csv | csvfix order -smq -f 1 | tail -n 1)
# 45
This next bit of code is exceptionally gross looking, but only because I wanted to throw everything into as few lines as possible. What it does is given the current-legislators.csv
file, filters by party (R
or D
), then selects the Twitter username of the filtered legislators. This list of usernames is fed by process to grep -f
, which uses those usernames to find all the friend_ids/*.txt
files. Each of those text files contains a list of Twitter IDs (numbers, not readable usernames), and so sort | uniq -c | sort -rn
is used to count up the most frequently occuring Twitter IDs for the given party of legislators. I save the results into two temp files, /tmp/democrat-friends.csv
and /tmp/republican-friends.csv
grep -f <(csvfix find -f 7 -s D < current-legislators.csv |
csvfix order -smq -f 22 | grep '[A-z]' | tr [:upper:] [:lower:]) \
<(ls friend_ids/*.txt) |
xargs cat | sort | uniq -c | sort -rn |
sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/democrat-friends.csv
# Now the Republicans
grep -f <(csvfix find -f 7 -s R < current-legislators.csv |
csvfix order -smq -f 22 | grep '[A-z]' | tr [:upper:] [:lower:]) \
<(ls friend_ids/*.txt) |
xargs cat | sort | uniq -c | sort -rn |
sed -E 's/ *([0-9]+) +([0-9]+)/\2,\1/' > /tmp/republican-friends.csv
Let's filter this list to accounts followed by at least 10 legislators:
echo "ID,democrat_friends,republican_friends" > friends_by_party.csv
csvfix join -f 1:1 /tmp/democrat-friends.csv /tmp/republican-friends.csv |
csvfix find -smq -if '($2 + $3) > 10' >> friends_by_party.csv
Finally we take the IDs in this sheet and pass them to t users
, which will do a batch lookup of these 4,400+ IDs. We then use the csvfix join command to combine the friends_by_party
data with the Twitter profile data.
csvfix join -f 1:1 \
friends_by_party.csv \
<(csvfix order -smq -f 1 friends_by_party.csv | xargs t users --id --csv) > friends_by_party_profiles.csv
I've posted a truncated result on this Google Spreadsheet. The adjusted_democrat_friends and adjusted_ratio reflect that there are currently 300 Republicans to 236 Democrats:
csvfix order -f 7 < current-legislators.csv | grep 'R' | wc -l
csvfix order -f 7 < current-legislators.csv | grep 'D' | wc -l
And so I've multiplied the democrat_friends column by a factor of 1.2
, as some Twitter accounts have more Republican followers simply because there are more Republicans. In the case of the New York Times (@nytimes), which has 131
Democrat followers to 134
Republican followers, the adjusted number of Democrat followers is 157.2
.
Check out the Google Spreadsheet:
The top 25 Twitter accounts by ratio of Democrat to Republican followers:
The top 25 Twitter accounts by ratio of Republican to Democrat followers:
A quick lookup of which media outlets are followed by Congressmember twitter accounts.
If you don't feel like running all the code yourself to get the results, you can download my copy of friends_by_party_profiles.csv, which again, is limited to the 4,000+ Twitter users followed by at least 10 Congressmembers.
Note: because I didn't do a search for exact names, some terms (such as cbs
) returned a lot of affiliated accounts. I trimmed the list down to 100 entities that were interesting to me, and of course, the csvfix/find
search expressions were just news orgs off the top of my head, so this isn't an absolute top-of-all-media list (for example, @rollcallpols has 226 total followers, the same as @FoxNews, but I arbitrarily trimmed it since @rollcall is already one of the top items. And then I got bored and so the trimming isn't consistent).
csvfix order -fn "Screen name,democrat_friends,republican_friends" < friends_by_party_profiles.csv |
csvfix find -f 1 \
-ei "nytimes" -ei "washingtonpost" -ei "politico" -ei "wsj" \
-ei "foxnews" -ei "thehill" -ei "newshour" -ei "npr" -ei 'huffingtonpost' \
-ei "cspan" -ei "cnn" -ei "msnbc" -ei "latimes" -ei 'rollcall' \
-ei "\btime\b" -ei "propublica" -ei "newsweek" -ei "newyorker" \
-ei "theatlantic" -ei "qz" -ei "reuters" -ei "\bap\b" \
-ei "abc" -ei "nbc" -ei "cbs" -ei "pbs" -ei "usatoday" -ei 'slate' \
-ei "bbc" -ei "guardian" -ei "forbes" -ei "theeconomist" -ei "rollingstone" \
-ei "propublica" -ei "voxdotcom" -ei "thisisfusion" -ei "reddit" \
-ei "gawker" -ei "buzzfeed" -ei "vicenews" -ei "the_intercept" \
-ei "digg" -ei "thedailyshow" -ei "lastweektonight" -ei "stephenathome" \
-ei "theonion" -ei "conanobrien" -ei "espn" |
csvfix eval -e '($2 + $3)' | csvfix sort -smq -f 4:DN
Excerpted results:
Screename | Democrat friends | Republican friends | Total Congressfriends |
---|---|---|---|
thehill | 143 | 205 | 348 |
cspan | 140 | 190 | 330 |
politico | 139 | 190 | 329 |
rollcall | 138 | 182 | 320 |
washingtonpost | 130 | 171 | 301 |
WSJ | 110 | 186 | 296 |
nytimes | 131 | 134 | 265 |
cnnbrk | 108 | 129 | 237 |
CNN | 112 | 124 | 236 |
FoxNews | 42 | 184 | 226 |
nprpolitics | 108 | 86 | 194 |
nprnews | 113 | 79 | 192 |
CBSNews | 91 | 101 | 192 |
cspanwj | 71 | 114 | 185 |
JoeNBC | 73 | 112 | 185 |
jparkABC | 74 | 97 | 171 |
ABC | 84 | 87 | 171 |
TheEconomist | 73 | 98 | 171 |
Reuters | 75 | 93 | 168 |
USATODAY | 69 | 96 | 165 |
BuzzFeedBen | 74 | 88 | 162 |
HuffingtonPost | 97 | 57 | 154 |
NBCNews | 76 | 77 | 153 |
JohnKingCNN | 64 | 80 | 144 |
msnbc | 92 | 52 | 144 |
CNBC | 45 | 84 | 129 |
NewsHour | 70 | 56 | 126 |
thehilltweets | 48 | 76 | 124 |
NBCNightlyNews | 63 | 59 | 122 |
foxnewspolitics | 14 | 98 | 112 |
Newsweek | 60 | 49 | 109 |
NBCPolitics | 59 | 49 | 108 |
CNNPolitics | 53 | 55 | 108 |
FoxNewsSunday | 17 | 87 | 104 |
WSJPolitics | 34 | 70 | 104 |
NYTimeskrugman | 84 | 19 | 103 |
BBCWorld | 62 | 40 | 102 |
BBCBreaking | 51 | 50 | 101 |
PBS | 70 | 30 | 100 |
BuzzFeedAndrew | 46 | 51 | 97 |
ABCPolitics | 28 | 64 | 92 |
latimes | 49 | 42 | 91 |
Forbes | 31 | 59 | 90 |
Slate | 57 | 29 | 86 |
politicoalex | 42 | 40 | 82 |
TheAtlantic | 55 | 24 | 79 |
NewYorker | 52 | 25 | 77 |
WSJopinion | 10 | 66 | 76 |
StephenAtHome | 56 | 20 | 76 |
NPR | 58 | 16 | 74 |
cspanradio | 33 | 39 | 72 |
CNNnewsroom | 32 | 39 | 71 |
TheDailyShow | 52 | 17 | 69 |
CBSThisMorning | 29 | 35 | 64 |
ThisWeekABC | 38 | 26 | 64 |
CNNSitRoom | 31 | 29 | 60 |
BuzzFeed | 35 | 25 | 60 |
CBSPolitics | 27 | 31 | 58 |
politico2016 | 22 | 34 | 56 |
CBSEveningNews | 26 | 28 | 54 |
foxnewsradio | 3 | 48 | 51 |
NBCFirstRead | 26 | 25 | 51 |
POLITICOPro | 20 | 30 | 50 |
WSJbreakingnews | 18 | 30 | 48 |
ProPublica | 34 | 12 | 46 |
msnbc_breaking | 30 | 14 | 44 |
TheOnion | 19 | 25 | 44 |
nbc | 20 | 23 | 43 |
guardian | 28 | 13 | 41 |
nytimesworld | 29 | 12 | 41 |
CNNgo | 25 | 15 | 40 |
CNNMoney | 21 | 18 | 39 |
foxnewsalert | 2 | 34 | 36 |
BuzzFeedNews | 18 | 17 | 35 |
CapitolHillCNN | 15 | 20 | 35 |
nprscottsimon | 18 | 17 | 35 |
BuzzFeedPol | 20 | 14 | 34 |
voxdotcom | 31 | 3 | 34 |
FoxNewsLive | 2 | 31 | 33 |
NPRinskeep | 27 | 4 | 31 |
CBSRadioNews | 12 | 19 | 31 |
espn | 14 | 17 | 31 |
NYTimesDowd | 29 | 1 | 30 |
ReutersPolitics | 11 | 19 | 30 |
POLITICO_Steve | 16 | 14 | 30 |
NBCNewsWorld | 18 | 12 | 30 |
ThisIsFusion | 24 | 5 | 29 |
msnbcvideo | 21 | 7 | 28 |
BBCNews | 13 | 14 | 27 |
nytimesbusiness | 13 | 14 | 27 |
NPRHealth | 19 | 7 | 26 |
CQRollCall | 16 | 9 | 25 |
ConanOBrien | 14 | 11 | 25 |
nbcwashington | 14 | 10 | 24 |
guardiannews | 17 | 5 | 22 |
DatelineNBC | 11 | 11 | 22 |
nytimeshealth | 14 | 8 | 22 |
CSPANBlog | 12 | 9 | 21 |
frontlinepbs | 16 | 5 | 21 |
RollingStone | 11 | 4 | 15 |
Gawker | 9 | 4 | 13 |
GuardianUS | 5 | 7 | 12 |