@veer66
Last active August 29, 2015 14:00
Parse GCIDE, extract headwords and their parts of speech, and save them to GDBM
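require "nokogiri"
require "gdbm"

# Extracts headwords (<ent>) and part-of-speech labels (<pos>) from the GCIDE
# files CIDE.A .. CIDE.Z and stores them in a GDBM database.
class LiPosFromGcideExtractor
  def parse_each_file(filename)
    File.open(filename, "r:ISO-8859-1") do |file|
      # Chunks are separated by blank lines; keep those containing a line
      # that starts with "<" or "[" followed by a word character.
      chunks = file.read
                   .split(/\n\n/)
                   .select { |chunk| chunk =~ /^[<\[]\w/ }
      chunks.each do |chunk|
        doc = Nokogiri::XML(chunk)
        ent = doc.css("ent").map(&:text).join(" ")
        pos = doc.css("pos").map(&:text).join(" ")
        next if pos.empty? || ent.empty?

        # Store the labels without periods, e.g. "v. t." becomes "v t".
        @db[ent] = pos.split(/\s+/).map { |p| p.gsub(/\./, "") }.join(" ")
      end
    end
  end

  def extract
    @db = GDBM.new("gcide_li_pos.db")
    ("A".."Z").each do |letter|
      parse_each_file("CIDE.#{letter}")
    end
    @db.close
  end
end

LiPosFromGcideExtractor.new.extract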
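
The two string-handling steps in the script (the chunk filter and the part-of-speech cleanup) can be tried on their own, without Nokogiri or GDBM. The sketch below is only an illustration: the helper names `entry_chunks` and `normalize_pos` are hypothetical (the script inlines both steps in `parse_each_file`), and the sample text is made up rather than taken from a GCIDE file.

```ruby
# Hypothetical helpers; the original script inlines these steps.

# Split text on blank lines and keep chunks containing a line that starts
# with "<" or "[" followed by a word character (same regex as the script).
def entry_chunks(text)
  text.split(/\n\n/).select { |chunk| chunk =~ /^[<\[]\w/ }
end

# Join the part-of-speech labels, split on whitespace, and drop periods.
def normalize_pos(labels)
  labels.join(" ").split(/\s+/).map { |p| p.gsub(/\./, "") }.join(" ")
end

# Made-up sample: two marked-up chunks around one plain-text chunk.
sample = [
  "<p><ent>Abandon</ent> <pos>v. t.</pos></p>",
  "plain text paragraph",
  "<p><ent>Abandon</ent> <pos>n.</pos></p>"
].join("\n\n")

puts entry_chunks(sample).length        # prints 2
puts normalize_pos(["v. t.", "n."])     # prints "v t n"
```

Because every `pos` element in a chunk is joined before splitting, a chunk with several labels ends up as one space-separated value, which is the form stored under the headword key.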