@timothyandrew
Created September 19, 2012 20:53
Scrape (links for) archived episodes of the Moth Podcast from podcast-directory.co.uk
require 'rubygems'
require 'mechanize'
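
# Visit one archive index page and return the mp3 link found on each story page.
def fetch_podcast_links_from_index_page(index_link)
  agent = Mechanize.new
  index_page = agent.get(index_link)
  # Treat links whose text contains ':' as story links.
  story_links = index_page.links.find_all { |l| l.text.include? ':' }
  story_links.map do |link|
    story_page = link.click
    story_page.links.find { |l| l.text.include? 'mp3' }
  end.compact # skip story pages that have no mp3 link
end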

# Append each link's URL to the output file named on the command line.
def write_file_from(links)
  File.open(ARGV[0], 'a') do |f|
    links.each do |link|
      puts "Writing #{link.text}"
      f.puts link.href
    end
  end
end

unless ARGV.length == 1
  puts "usage: ./moth-podcast-scrape.rb [output-file-location]"
  exit 1
end

# Walk archive index pages 1 through 12.
(1..12).each do |i|
  puts '-' * 50
  puts "Processing Page #{i}"
  puts '-' * 50
  index_url = "http://www.podcast-directory.co.uk/podcastarchive/the-moth-podcast-147468/page-#{i}.html"
  links = fetch_podcast_links_from_index_page(index_url)
  write_file_from links
end
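
The two-step link selection in `fetch_podcast_links_from_index_page` can be tried without network access or the mechanize gem. What follows is a minimal sketch, not part of the original script: `Link`, `pick_story_links`, and `pick_mp3_link` are hypothetical names, and the link texts and example.com URLs are made up for illustration. It uses the script's own heuristics (story links have a ':' in their text, download links have 'mp3' in theirs) and shows one way to handle a story page with no mp3 link.

```ruby
# Hypothetical stand-in for Mechanize::Page::Link, which also exposes
# #text and #href.
Link = Struct.new(:text, :href)

# Index pages: keep links whose text contains a colon.
def pick_story_links(links)
  links.select { |l| l.text.include?(':') }
end

# Story pages: first link whose text mentions 'mp3', or nil when none exists.
def pick_mp3_link(links)
  links.find { |l| l.text.include?('mp3') }
end

# Made-up sample data standing in for a fetched index page.
index_links = [
  Link.new('Home', 'http://example.com/'),
  Link.new('Jane Doe: A Made-Up Story', 'http://example.com/story-1.html'),
  Link.new('John Roe: Another One', 'http://example.com/story-2.html')
]

# Made-up sample data standing in for the story pages, keyed by URL.
story_pages = {
  'http://example.com/story-1.html' => [Link.new('Download mp3', 'http://example.com/1.mp3')],
  'http://example.com/story-2.html' => [Link.new('Contact', 'http://example.com/contact')]
}

mp3_links = pick_story_links(index_links)
            .map { |story| pick_mp3_link(story_pages.fetch(story.href)) }
            .compact # drop stories without an mp3 link

mp3_links.each { |l| puts l.href }
```

Without the `compact` step the result would contain a `nil` for the second story, and a later call such as `link.text` on that entry would raise `NoMethodError`.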