@dustalov
Last active December 28, 2015 18:38
Extract texts from the OpenCorpora XML dump.
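#!/usr/bin/env ruby
# encoding: utf-8

require 'rubygems'
require 'nokogiri'
require 'csv'

# Output directory for the per-text files and the index.
Dir.mkdir 'opencorpora' unless File.directory? 'opencorpora'

buf, flag = '', false
parents, children, names = [], {}, {}

# Read the dump line by line, buffering one <text>...</text> element at a time.
File.foreach('annot.opcorpora.xml') do |s|
  s.strip!

  # Skip everything until an opening <text> tag is seen.
  next unless flag ||= s =~ /<text.*>/
  buf << s

  # When the closing </text> tag arrives, parse the buffered element.
  unless flag &&= s !~ /<\/text.*>/
    doc = Nokogiri::XML(buf)

    id = doc.xpath('//text/@id').text.to_i
    parent = doc.xpath('//text/@parent').text.to_i
    paragraphs = doc.xpath('//text/paragraphs/paragraph')
    names[id] = doc.xpath('//text/@name').text

    # A parent of 0 marks a top-level text.
    if parent.zero?
      parents << id
    else
      children[id] = parent
    end

    # Write one paragraph per line, joining its sentences with spaces.
    unless paragraphs.empty?
      File.open('opencorpora/%04d.txt' % id, 'w') do |f|
        paragraphs.each do |paragraph|
          f.puts paragraph.xpath('sentence/source').map(&:text).join(' ')
        end
      end
    end

    puts 'text #%04d is done' % id
    buf.clear
  end
end

# Write an index that maps every child text to its top-level ancestor.
CSV.open('opencorpora/index.csv', 'w') do |csv|
  csv << %w(id parent_id parent_name name)

  children.each do |id, parent_id|
    # Climb the hierarchy until a top-level text is reached; stop on a
    # missing parent so that a broken chain cannot loop forever.
    parent_id = children[parent_id] until parent_id.nil? || parents.include?(parent_id)
    csv << [id, parent_id, names[parent_id], names[id]]
  end
end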
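
The index step climbs the `children` map until it reaches a top-level text. Below is a minimal sketch of that lookup as a standalone helper, with a guard against broken or cyclic parent chains. The name `root_of` and the toy hierarchy are hypothetical and not part of the original script. Whether such chains actually occur in the OpenCorpora dump is not something the script establishes.

```ruby
require 'set'

# Hypothetical helper: follow parent links from +id+ until a top-level
# text is reached. Returns nil if the chain is broken or contains a cycle.
def root_of(id, children, parents)
  seen = Set.new
  current = children[id]
  until current.nil? || parents.include?(current)
    return nil unless seen.add?(current) # already visited: cycle
    current = children[current]
  end
  current
end

# Toy hierarchy (made up for illustration): 1 is top-level,
# 2 is a child of 1, 3 is a child of 2, and 5 points at a missing text 4.
parents  = Set[1]
children = { 2 => 1, 3 => 2, 5 => 4 }

p root_of(3, children, parents)
p root_of(5, children, parents)
```

Keeping `parents` in a `Set` rather than an `Array` makes each membership check constant-time, which may matter when the dump contains many texts.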