@mjc-gh
Last active August 29, 2015 14:02
Fetch all song lyrics by The National from Wikia
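require 'fileutils'
require 'nokogiri'
require 'open-uri'

BASE_URL = 'http://lyrics.wikia.com'

# Kernel#open no longer accepts URLs as of Ruby 3.0; open-uri provides URI.open.
doc = Nokogiri::HTML(URI.open("#{BASE_URL}/The_National"))

doc.css('h2').each do |header|
  # Each album heading is followed by a cover image and then an ordered track list.
  img = header.next_element
  list = img.next_element if img
  next unless list && list.name == 'ol'

  album = header.css('.mw-headline a').text
  FileUtils.mkdir_p "data/#{album}"

  # Number files by track position, starting at 1.
  list.css('li a').each.with_index(1) do |track, number|
    title = track.text
    puts "Fetching #{title}..."

    lyrics = Nokogiri::HTML(URI.open("#{BASE_URL}#{track['href']}")).css('.lyricbox').first
    next unless lyrics # skip pages that have no lyric box

    # Turn <br> into newlines and drop embedded <div>s and HTML comments.
    lyrics.children.each do |child|
      if child.name == 'br'
        child.replace("\n")
      elsif child.name == 'div' || child.comment?
        child.remove
      end
    end

    # Trim surrounding whitespace and trailing spaces or commas on each line.
    text = lyrics.text.strip.gsub(/[ ,]+$/, '')

    File.open("data/#{album}/#{number} - #{title}", 'w') do |file|
      file.puts text
    end
  end
end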
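The script uses the track title directly as a file name, so a title containing a path separator would be read as a subdirectory. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `safe_filename` that is not part of the original script:

```ruby
# Hypothetical helper, not part of the original script: builds a file name
# such as "03 - Title" and replaces characters that are unsafe in paths.
def safe_filename(index, title)
  cleaned = title.strip.gsub(%r{[/\\:]}, '_')
  format('%02d - %s', index, cleaned)
end

puts safe_filename(1, 'Fake Empire')
puts safe_filename(12, 'Part One / Part Two')
```

Zero-padding the track number is a separate, optional choice; it keeps files in album order when a directory is listed alphabetically.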