@zellux
Created July 31, 2013 11:37
Extract words and explanations from the EPUB version of Merriam-Webster's Vocabulary Builder. See http://blog.yxwang.me/notes/languages/vocabulary-builder.html for more details.
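require 'nokogiri'

# Usage: pass the unit number as the only argument.
# Expects the unpacked EPUB pages under html/.
abort "usage: ruby #{$PROGRAM_NAME} UNIT_NUMBER" if ARGV.empty?

unit = ARGV[0].to_i
started = false
titles = []

1.upto(3000) do |i|
  filename = "html/Merriam-Webster_s_Vocabulary_Bu_split_#{'%03d' % i}.html"
  # Skip missing pages instead of crashing when the numbering runs out.
  next unless File.exist?(filename)
  doc = File.open(filename) { |f| Nokogiri::HTML(f) }

  title = doc.css('div span.bold').first
  next if title.nil?
  word = title.content.strip

  if word == "Unit #{unit}"
    started = true
    # Links on the unit's opening page: keep those before the first "Quiz" entry.
    titles = doc.css('div p.calibre36 a').map { |a| a.content.strip }
    quiz_index = titles.find_index { |t| t.start_with?('Quiz') }
    titles = titles[0...quiz_index] if quiz_index
  end
  break if word == "Unit #{unit + 1}"
  next unless started
  # Skip pages whose heading starts with "A.", "B.", "C." or "Quiz".
  next if word.match?(/\A(?:[ABC]\.|Quiz)/)

  begin
    if word.start_with?('Unit ')
      # Page headed by the unit title: the entry sits in the first p.calibre12.
      word = doc.css('div p.calibre12 span').first.content.strip
      explanation = doc.css('div p.calibre12').first.content.strip
    elsif word == titles[-1]
      # Page headed by the last listed title: the entry is the second bold span.
      word = doc.css('div p.calibre34 span.bold')[1].content.strip
      explanation = doc.css('div p.calibre34 br').first.previous.content.strip
    elsif titles.include?(word)
      # Page headed by any other listed title: take the first paragraph.
      explanation = doc.css('div p').first.content.strip
    else
      # Ordinary word page: take the text just before the first <br>.
      explanation = doc.css('div p br').first.previous.content.strip
    end
    puts "#{word}:::#{explanation}"
  rescue StandardError => e
    # Report the page that failed on stderr and stop, keeping stdout clean.
    warn "Failed on #{word.inspect}: #{e.message}"
    break
  end
end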
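
The script prints one `word:::explanation` pair per line. Below is a minimal sketch of how that output could be consumed, assuming the goal is a tab-separated file for a spreadsheet or flashcard tool. The helper names `parse_entry` and `to_tsv` and the sample lines are made up for illustration and are not part of the original gist.

```ruby
# Hypothetical helper: split one "word:::explanation" line into its two parts.
# Returns nil for lines without the ":::" separator (for example, error output).
def parse_entry(line)
  word, explanation = line.chomp.split(':::', 2)
  return nil if word.nil? || word.empty? || explanation.nil?

  [word, explanation]
end

# Hypothetical helper: turn extractor output lines into tab-separated rows,
# collapsing runs of whitespace inside the explanation.
def to_tsv(lines)
  lines.filter_map { |line| parse_entry(line) }
       .map { |word, explanation| "#{word}\t#{explanation.gsub(/\s+/, ' ')}" }
end

# Placeholder input, not real output from the book.
sample = [
  "alpha:::first placeholder explanation\n",
  "beta:::second   placeholder explanation\n",
  "a line without the separator\n"
]

puts to_tsv(sample)
```

Splitting with a limit of 2 keeps any later `:::` inside the explanation intact. `filter_map` needs Ruby 2.7 or newer.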