Getting wjchat Tweets with the Twitter API
## Scrape wjchat using the Twitter API
# Twitter API [docs here](http://dev.twitter.com/doc/get/search)
# by @dancow, News App Developer at ProPublica
# Updated Mar. 2, 2011
##
# This tutorial uses basic scripting commands but doesn't take the time
# to explain them. A more explanatory tutorial can be found here:
#
# http://code-dancow.s3.amazonaws.com/beginners.html
#
# This is part 1 of tweet archiving. I hope to have a few more tutorials
# showing how to parse these tweets and mash them up into yummy
# potato-like goodness, with not much more programming sophistication
# than what we'll use in this tutorial.
# The Twitter API: How did I figure this out? I read the Twitter API
# docs. Usually, that's the easiest and most important step, and it
# requires only the ability to read:
# http://dev.twitter.com/doc/get/search
# This is the base URL; we add parameters onto it to tell Twitter
# what we want, including which search term to look for
BASE_URL = "http://search.twitter.com/search.atom"
# We could use .json, but we'll use atom since it kind of looks like HTML
# and you can read it in your browser if you are so inclined
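# For the record, a fully assembled request URL -- using the values we
# define below -- will look like this:
# http://search.twitter.com/search.atom?q=wjchat&rpp=100&page=1&since_id=43113508832948224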
# This was the tweet that kicked off last night's 3/2/2011 wjchat.
# The Twitter API lets you specify a tweet ID and will only return
# tweets posted AFTER that one (the since_id parameter)
FIRST_TWEET_ID = 43113508832948224
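# (That big number is the ID at the end of a tweet's web address, i.e.
#  twitter.com/SOME_USER/status/43113508832948224 -- SOME_USER is a
#  placeholder, since the kickoff tweet's author isn't named above)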
# Twitter returns at most 100 tweets per request (the rpp parameter)
POSTS_PER_PAGE = 100
# The query: wjchat
QUERY = "wjchat"
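# Note: "wjchat" is plain letters, so it can go straight into the URL.
# If your query had spaces or a # sign, you'd need to URL-encode it first,
# e.g. with Ruby's standard CGI library (an extra step we don't need here):
#   require 'cgi'
#   CGI.escape('#wjchat')   # => "%23wjchat"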
# some libraries we need:
require 'open-uri'
# Let's open up a file.
# The second parameter, 'w', specifies that we want to WRITE to this file,
# and it will erase whatever already exists at that filename
output_file = File.open('tweets.xml', 'w')
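# (A side note on style: Ruby also has a block form that closes the file
#  for you when the block ends, i.e.
#    File.open('tweets.xml', 'w') { |output_file| ... }
#  We'll stick with an explicit open and close here so each step is visible.)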
# OK, Twitter's API says we can get at most 1500 tweets per search.
# So, at 100 tweets at a time, we need to call the API 15 times:
for page_number in 1..15
  # First, form the URL we'll use to hit up Twitter's API
  api_call_string = BASE_URL + "?q=#{QUERY}" + "&rpp=#{POSTS_PER_PAGE}" + "&page=#{page_number}" + "&since_id=#{FIRST_TWEET_ID}"
  # fetch that URL and write Twitter's response into our file
  output_file.puts(open(api_call_string).read)
  # output something to screen so we know something happened
  puts "Made call: " + api_call_string
  # Let's sleep a couple of seconds before the next call
  sleep 2.2
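  # (15 calls with a 2.2-second pause after each comes to about 33
  #  seconds of sleeping in total -- a small price for being polite
  #  to Twitter's servers)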
end
# close the file and we're done!
output_file.close
# Admittedly, this 1+ MB XML file isn't the most useful thing by itself,
# but at least all the tweets we want are collected into one file.
# The next tutorial will show how to use the Ruby library Nokogiri
# to make something really interesting out of them:
# http://nokogiri.org/
# Hopefully, installing it is as easy as typing 'gem install nokogiri'.
# If not, you'll have to use the power of Google to remedy
# whatever issues come up.
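As a quick preview of that next step, here's a minimal sketch of what reading the saved file with Nokogiri might look like. It assumes tweets.xml holds Atom <entry> elements whose <title> carries the tweet text, which is what search.atom returns. One caveat: since the script above writes 15 whole feeds back to back into one file, a strict XML parse will likely stop after the first feed's 100 entries, so treat this as a sketch, not the next tutorial's actual code:

require 'nokogiri'

# Parse the file we just saved. Because the scraper wrote 15 complete
# Atom feeds back to back, this parse may only see the first feed's
# 100 entries -- good enough for a peek at the data
doc = Nokogiri::XML(File.read('tweets.xml'))
# Atom puts its elements in a default namespace; stripping namespaces
# keeps the quick-and-dirty search below simple
doc.remove_namespaces!
doc.search('entry').each do |entry|
  puts entry.at('title').text
end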