Getting wjchat Tweets with the Twitter API
## Scrape wjchat using the Twitter API
# Twitter API [docs here](http://dev.twitter.com/doc/get/search)
# by @dancow, News App Developer at ProPublica
# Updated Mar. 2, 2011
##
# This tutorial uses basic scripting commands but doesn't take the time
# to explain them. A more explanatory tutorial can be found here:
#
# http://code-dancow.s3.amazonaws.com/beginners.html
#
# This is part 1 of tweet archiving. I hope to have a few more tutorials
# showing how to parse these tweets and mash them up into yummy
# potato-like goodness, with not much more programming sophistication
# than what we'll use in this tutorial.
# The Twitter API: How did I figure this out? I read the Twitter API
# docs. Usually, that's the easiest and most important step, and it
# requires only the ability to read:
# http://dev.twitter.com/doc/get/search
# This is the base URL; we add parameters onto it to tell Twitter
# what we want, including which search term to look for
BASE_URL = "http://search.twitter.com/search.atom"
# We could use .json, but we'll use atom since it kind of looks like HTML
# and you can read it in your browser if you are so inclined
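# For the record, a fully assembled request URL -- using the values we
# define below -- will look like this:
# http://search.twitter.com/search.atom?q=wjchat&rpp=100&page=1&since_id=43113508832948224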
# This was the tweet that kicked off last night's 3/2/2011 wjchat.
# The Twitter API lets you specify a tweet ID and will only return
# tweets posted AFTER that one (the since_id parameter)
FIRST_TWEET_ID = 43113508832948224
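# (That big number is the ID at the end of a tweet's web address, i.e.
#  twitter.com/SOME_USER/status/43113508832948224 -- SOME_USER is a
#  placeholder, since the kickoff tweet's author isn't named above)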
# Twitter returns at most 100 tweets per request (the rpp parameter)
POSTS_PER_PAGE = 100
# The query: wjchat
QUERY = "wjchat"
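# Note: "wjchat" is plain letters, so it can go straight into the URL.
# If your query had spaces or a # sign, you'd need to URL-encode it first,
# e.g. with Ruby's standard CGI library (an extra step we don't need here):
#   require 'cgi'
#   CGI.escape('#wjchat')   # => "%23wjchat"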
# some libraries we need:
require 'open-uri'
# Let's open up a file.
# The second parameter, 'w', specifies that we want to WRITE to this file,
# and it will erase whatever already exists at that filename
output_file = File.open('tweets.xml', 'w')
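# (A side note on style: Ruby also has a block form that closes the file
#  for you when the block ends, i.e.
#    File.open('tweets.xml', 'w') { |output_file| ... }
#  We'll stick with an explicit open and close here so each step is visible.)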
# OK, Twitter's API says we can get at most 1500 tweets per search.
# So, at 100 tweets at a time, we need to call the API 15 times:
for page_number in 1..15
  # First, form the URL we'll use to hit up Twitter's API
  api_call_string = BASE_URL + "?q=#{QUERY}" + "&rpp=#{POSTS_PER_PAGE}" + "&page=#{page_number}" + "&since_id=#{FIRST_TWEET_ID}"
  # fetch that URL and write Twitter's response into our file
  output_file.puts(open(api_call_string).read)
  # output something to screen so we know something happened
  puts "Made call: " + api_call_string
  # Let's sleep a couple of seconds before the next call
  sleep 2.2
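  # (15 calls with a 2.2-second pause after each comes to about 33
  #  seconds of sleeping in total -- a small price for being polite
  #  to Twitter's servers)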
end
# close the file and we're done!
output_file.close
# Admittedly, this 1+ MB XML file isn't the most useful thing by itself,
# but at least all the tweets we want are collected into one file.
# The next tutorial will show how to use the Ruby library Nokogiri
# to make something really interesting out of them:
# http://nokogiri.org/
# Hopefully, installing it is as easy as typing 'gem install nokogiri'.
# If not, you'll have to use the power of Google to remedy
# whatever issues come up.
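As a quick preview of that next step, here's a minimal sketch of what reading the saved file with Nokogiri might look like. It assumes tweets.xml holds Atom <entry> elements whose <title> carries the tweet text, which is what search.atom returns. One caveat: since the script above writes 15 whole feeds back to back into one file, a strict XML parse will likely stop after the first feed's 100 entries, so treat this as a sketch, not the next tutorial's actual code:

require 'nokogiri'

# Parse the file we just saved. Because the scraper wrote 15 complete
# Atom feeds back to back, this parse may only see the first feed's
# 100 entries -- good enough for a peek at the data
doc = Nokogiri::XML(File.read('tweets.xml'))
# Atom puts its elements in a default namespace; stripping namespaces
# keeps the quick-and-dirty search below simple
doc.remove_namespaces!
doc.search('entry').each do |entry|
  puts entry.at('title').text
end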