@alxgsv
Created December 8, 2012 16:03
FineReader HTML post processing
#!/usr/bin/env ruby
# encoding: utf-8
require "rubygems"
require "nokogiri"
require "typograf"
# Read the FineReader export as cp1251 and convert it to UTF-8; File.read also closes the file.
content = File.read(ARGV[0], encoding: "cp1251").encode("UTF-8")
doc = Nokogiri::HTML(content)
result = ""
spans = doc.xpath("//span")
merged_spans = []
previous_dashed = false
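spans.each do |s|
  if merged_spans.empty?
    # The first span has nothing to merge into, so it always starts a new paragraph.
    merged_spans << s.inner_html
  elsif previous_dashed
    # The previous paragraph ended with a hyphenated line break: join without a space.
    merged_spans[-1] += s.inner_html
  elsif s.content =~ /\A\p{Lower}/
    # If a paragraph starts with a lowercase letter, append it to the previous one, separated by a space.
    merged_spans[-1] += " " + s.inner_html
  else
    merged_spans << s.inner_html
  end
  last_span = merged_spans[-1]
  if last_span[-1] == "-"
    # If the paragraph ends with a hyphenated line break, drop the hyphen and remember it.
    merged_spans[-1] = last_span[0..-2]
    previous_dashed = true
  else
    previous_dashed = false
  end
end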
merged_spans.each do |s|
  result += Typograf.process(s)
end
# The block form closes the file, so the output is reliably flushed to disk.
File.open(ARGV[1], "w") { |f| f.write(result) }
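
The merging rules in the script can be tried out without Nokogiri or Typograf. The sketch below is not part of the original gist: `merge_fragments` is a hypothetical helper that works on plain strings instead of `<span>` nodes, and it assumes the same two rules (join after a trailing hyphen, append with a space when a fragment starts with a lowercase letter).

```ruby
# Hypothetical helper (not in the original gist): applies the script's
# merging rules to an array of plain strings.
def merge_fragments(fragments)
  merged = []
  previous_dashed = false

  fragments.each do |text|
    if merged.empty?
      # Nothing to merge into yet, so the first fragment starts a paragraph.
      merged << text.dup
    elsif previous_dashed
      # The previous fragment ended with a hyphen: join without a space.
      merged[-1] += text
    elsif text =~ /\A\p{Lower}/
      # A lowercase start is treated as a continuation: join with a space.
      merged[-1] += " " + text
    else
      merged << text.dup
    end

    # Drop a trailing hyphen and remember it for the next fragment.
    previous_dashed = merged[-1].end_with?("-")
    merged[-1] = merged[-1][0..-2] if previous_dashed
  end

  merged
end

p merge_fragments(["The scan-", "ner split this", "sentence in two.", "New paragraph."])
```

Because the helper takes and returns plain arrays, the heuristics can be checked on a handful of strings before running the full script on a FineReader export.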