@dustalov
Last active April 20, 2016 21:15
Fetch sentences from the Russian National Corpus.
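#!/usr/bin/env ruby
require 'net/http'
require 'uri'
require 'nokogiri'

Example = Struct.new(:text, :source)

def ruscorpora(word)
  uri = URI('http://search.ruscorpora.ru/download-xml.xml')
  uri.query = URI.encode_www_form(req: word, text: 'lexform', mode: 'main', doc_tagging: 'manual')

  # Strip combining acute accents (U+0301, used as stress marks) before parsing.
  doc = Nokogiri::XML(Net::HTTP.get(uri).gsub("\u0301", '')).remove_namespaces!

  # Skip the first row (presumably the header) and take the last cell of each
  # remaining row, which is expected to look like "text [source] [...]".
  rows = doc.xpath('(/Workbook/Worksheet/Table/Row)[position()>1]/Cell[last()]')
  rows.map do |cell|
    md = cell.text.strip.match(/\A(?<text>.+) +\[(?<source>.+)\] +\[.+\]\z/)
    next unless md # skip cells that do not follow the expected layout

    Example.new(
      md[:text].gsub(/ {2,}/, ''),
      # Rewrite a trailing "(1999)" or "(1999-2001)" as ", 1999" or ", 1999-2001".
      md[:source].strip.gsub(/ *\((\d+(|-\d+))\)\z/, ', \1')
    )
  end.compact
end

# ruscorpora('лопата')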
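The cell-parsing step in the script above can be tried offline, without querying the corpus. The sketch below applies the same two regular expressions to a made-up cell string. The names `ParsedExample`, `CELL_PATTERN` and `parse_cell`, as well as the sample text, are hypothetical and only serve the illustration; the real response may format its cells differently.

```ruby
# Hypothetical stand-in for the script's Example struct.
ParsedExample = Struct.new(:text, :source)

# Same pattern as in the script: "text [source] [trailing bracketed field]".
CELL_PATTERN = /\A(?<text>.+) +\[(?<source>.+)\] +\[.+\]\z/

# Parse one cell string; returns nil when the layout does not match.
def parse_cell(raw)
  md = raw.strip.match(CELL_PATTERN)
  return nil unless md

  ParsedExample.new(
    md[:text].gsub(/ {2,}/, ''),
    # Turn a trailing "(1999)" or "(1999-2001)" into ", 1999" or ", 1999-2001".
    md[:source].strip.gsub(/ *\((\d+(|-\d+))\)\z/, ', \1')
  )
end

# Made-up sample data, not an actual corpus entry.
sample = 'He took the shovel. [A. Author. Some Title (1999)] [note]'
example = parse_cell(sample)

puts example.text   # He took the shovel.
puts example.source # A. Author. Some Title, 1999
```

Returning `nil` for a cell that does not match lets the caller decide whether to skip it or report it.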