@metade
Created September 14, 2010 18:38
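require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'sqlite_cache'

# Cache fetched pages on disk so repeated runs do not hit the site again.
$cache = SqliteCache.new('my_cache.db')

# Return the body of +url+, fetching it only on a cache miss.
def copen(url)
  $cache.do_cached(url) do
    puts "fetching: #{url}"
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer handles URLs.
    URI.open(url).read
  end
end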

# Write one tab-separated row per fact: year, fact type, text.
File.open('data.tsv', 'w') do |file|
  (1960..1999).each do |year|
    doc = Hpricot(copen("http://www.bbc.co.uk/1xtra/blackhistory/years/#{year}.shtml"))
    (doc/"//div[@class='text-rm']/h5").each do |heading|
      # Classify the section by its heading text (nil if unrecognised).
      fact_type =
        case heading.inner_html
        when /What happened in/ then 'event'
        when /In the music/     then 'music'
        when /Notable releases/ then 'release'
        end

      (heading.parent/"//ul/li").each do |fact|
        # Strip <br> tags and newlines so each fact fits on one TSV line.
        text = fact.inner_html.gsub(%r[<br ?/>], '').gsub("\n", '').strip

        # Releases are grouped under a subheading that precedes each list.
        release_type = nil
        if fact_type == 'release'
          subheading = fact.parent.previous_sibling
          release_type =
            case subheading.inner_html
            when /Single/ then 'single'
            when /Album/  then 'album'
            when /Gramm/  then 'grammy'
            end
        end

        file.puts [year, (release_type || fact_type), text].join("\t")
      end
    end
  end
end