## Scrape wjchat using Twitter API
# Twitter Search API [docs here](http://dev.twitter.com/doc/get/search)
# by @dancow, News App Developer at ProPublica
# Updated Mar. 2, 2011
##
# This tutorial uses basic scripting commands but doesn't take the time
# to explain them. A more explanatory tutorial can be found here:
#
# http://code-dancow.s3.amazonaws.com/beginners.html
#
# This is part 1 of tweet archiving. I hope to have a few more tutorials
# showing how to parse these tweets and mash them up into yummy
# potato-like goodness, with not much more programming sophistication
# than what we'll use in this tutorial.
# The Twitter API: How did I figure this out? I read the Twitter API
# docs. Usually, this is the easiest and most important step, and it
# requires nothing more than careful reading:
# http://dev.twitter.com/doc/get/search
# This is the base URL; we'll add parameters to it to tell Twitter
# what we want, including which search term to look for
BASE_URL="http://search.twitter.com/search.atom"
# We could use .json, but we'll use atom since it kind of looks like HTML
# and you can read it in your browser if you are so inclined
# This was the tweet that kicked off last night's (3/2/2011) wjchat.
# The Twitter API's since_id parameter lets us start searching from a
# given tweet, returning only tweets posted AFTER it
FIRST_TWEET_ID = 43113508832948224
# Twitter returns at most 100 tweets per API call
POSTS_PER_PAGE = 100
# The query: wjchat
QUERY="wjchat"
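# When these constants get plugged into the API call we build below, the
# resulting URL should look something like this (for page 1):
#   http://search.twitter.com/search.atom?q=wjchat&rpp=100&page=1&since_id=43113508832948224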
# some libraries we need:
require 'open-uri'
# Let's open up a file
# the second parameter, 'w', specifies that we want to WRITE to this file
# and it will erase whatever existed at that filename
output_file = File.open('tweets.xml', 'w')
# OK, Twitter's API says a search will return at most 1500 tweets.
# So, at 100 tweets per call, we need to hit the API 15 times:
for page_number in 1..15
  # OK, first we'll form the URL that we use to hit up Twitter's API
  api_call_string = BASE_URL + "?q=#{QUERY}" + "&rpp=#{POSTS_PER_PAGE}" + "&page=#{page_number}" + "&since_id=#{FIRST_TWEET_ID}"
  # open that URL and write Twitter's response into our output file
  output_file.puts(open(api_call_string).read)
  # print something to the screen so we know something happened
  puts "Made call: " + api_call_string
  # Let's sleep a couple of seconds before the next call, to be polite to Twitter
  sleep 2.2
end
# close the file and we're done!
output_file.close
# Admittedly, this 1+ MB XML file isn't the most useful thing,
# but at least all the tweets we want are organized into one file.
# The next tutorial will be how to use the Ruby library Nokogiri
# to make something really interesting
# http://nokogiri.org/
# Hopefully, it's as easy as typing 'gem install nokogiri'
# If not, you'll have to use the power of Google to remedy
# whatever issues come up
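#
# As a teaser, here's a minimal sketch of what that parsing could look
# like (assuming Nokogiri installs cleanly, and keeping in mind that our
# tweets.xml is really 15 Atom feeds pasted together, so a real script
# would parse each page's response separately):
#
#   require 'nokogiri'
#   doc = Nokogiri::XML(File.read('tweets.xml'))
#   doc.remove_namespaces!
#   doc.css('entry').each do |entry|
#     puts entry.css('title').text
#   end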