@dlundy
Last active December 30, 2015 02:09
Quick hack to parse the Folkets Lexikon XML dump into tab-separated word pairs for import into Anki
# http://folkets-lexikon.csc.kth.se/folkets/om.en.html
# http://folkets-lexikon.csc.kth.se/folkets/folkets_en_sv_public.xml
# http://folkets-lexikon.csc.kth.se/folkets/folkets_sv_en_public.xml
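require "rexml/document"
require "cgi"

# Parse the English-to-Swedish dump; change both file names to convert the sv_en dump instead.
# The block form closes the input file once the document has been parsed.
doc = File.open("folkets_en_sv_public.xml") { |input| REXML::Document.new(input) }

# The output name matches the direction of the input dump (en -> sv).
File.open("folkets_en_sv.txt", "w") do |out|
  doc.elements.each("*/word") do |element|
    value = element.attributes["value"]
    translations = []
    element.elements.each("translation") do |child|
      # double unescape because stuff isn't escaped properly
      translations << CGI.unescapeHTML(CGI.unescapeHTML(child.attributes["value"]))
    end
    # filter out anything without a translation
    unless translations.empty?
      # one line per word: headword, tab, translations joined with " / "
      out.puts value + "\t" + translations.join(" / ")
    end
  end
end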
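
To try the conversion without downloading the full dump, here is a minimal sketch that runs the same logic on an inline sample. The sample XML is made up for illustration: the `dictionary` root name and the three entries are assumptions, and only the `word`/`translation` elements with `value` attributes come from the script above. `word_pairs` is a hypothetical helper name, not part of the original gist.

```ruby
require "rexml/document"
require "cgi"

# Hypothetical sample shaped like the elements the script reads.
# The root element name and the entries are assumptions for illustration.
SAMPLE_XML = <<~XML
  <dictionary>
    <word value="dog">
      <translation value="hund"/>
      <translation value="jycke"/>
    </word>
    <word value="quote">
      <translation value="&amp;quot;citat&amp;quot;"/>
    </word>
    <word value="orphan"/>
  </dictionary>
XML

# Returns one "headword<TAB>translation / translation" string per word,
# skipping words that have no translation children.
def word_pairs(xml)
  doc = REXML::Document.new(xml)
  lines = []
  doc.elements.each("*/word") do |word|
    translations = []
    word.elements.each("translation") do |t|
      # REXML decodes one level of entities; the double unescape
      # cleans up values that were escaped more than once.
      translations << CGI.unescapeHTML(CGI.unescapeHTML(t.attributes["value"]))
    end
    next if translations.empty?
    lines << "#{word.attributes["value"]}\t#{translations.join(" / ")}"
  end
  lines
end

word_pairs(SAMPLE_XML).each { |line| puts line }
```

With this sample, the word without a translation is dropped and the over-escaped quotes come out as plain `"` characters. The resulting tab-separated lines are the kind of plain text file that Anki's text import accepts, with the headword as the first field and the joined translations as the second.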