Last active
August 29, 2015 14:24
Script to match examiner names in finished English translations to the Japanese kanji names in the corresponding electronic data (電子データ)
#!/usr/bin/env ruby
# encoding: UTF-8
# Matches the examiner name in the Japanese original:
# 特許出願の番号 特願2014-551499
# 起案日 平成27年 6月22日
# 特許庁審査官 衣鳩 文彦 9199 5X00
# 特許出願人代理人 佐伯 義文(外 1名) 様
# 適用条文 第29条第2項、第36条
#
# to the examiner name in the English translation:
# Application Number: 2014-551499
# Drafted: 2015/06/22 (year/month/day)
# Examiner: Fumihiko IBATO 9199 5X00
# Attorney: Yoshifumi SAEKI et al.
# Cited Articles: Article 29, Paragraph 2, Article 36
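require 'yomu'
require 'charlock_holmes'

# Both folders are required: ARGV[0] holds the translated documents,
# ARGV[1] holds the downloaded originals and their index.txt files.
if ARGV.count < 2
  puts "Usage: #{$0} <translations folder> <oadownloads folder>"
  exit 1
end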
# Prints "<English examiner name>, <Japanese family name> <Japanese given name>"
# for one translated document, or nothing if any lookup fails.
def scrape(f)
  data = Yomu.new f
  text = data.text
  app_no = text[/Application Number:\s(\d+-\d+)/, 1]
  htmlfile = get_html(f, app_no)
  return if htmlfile.nil?

  htmlpath = "#{ARGV[1]}/#{htmlfile}"
  return unless File.exist?(htmlpath)

  # The encoding of the original is not known in advance, so detect it and convert to UTF-8.
  html = File.read(htmlpath)
  encoding = CharlockHolmes::EncodingDetector.detect(html)
  return if encoding.nil? or encoding[:encoding].nil?
  hdata = CharlockHolmes::Converter.convert html, encoding[:encoding], 'UTF-8'

  eng_exam = text[/Examiner:\s(\w+\s+\w+)\s*[0-9A-Z]+\s[0-9A-Z]+/m, 1]
  m = hdata.match(/特許庁審査官\p{Z}+(\p{L}+)\p{Z}(\p{L}+)\p{Z}+\p{N}+\p{Z}[\p{N}\p{L}]+/)
  return unless eng_exam and m

  ja_l = m[1] # the family name comes first in the Japanese text
  ja_f = m[2]
  puts "#{eng_exam}, #{ja_l} #{ja_f}"
end
# Looks up app_no in <oadownloads folder>/2015/<four digits>/index.txt and
# returns the file name from the matching entry that contains 拒絶 (rejection).
def get_html(docfilename, app_no)
  return nil unless app_no
  # The four digits following "2015" in the document file name select the subfolder.
  m = docfilename.match(/2015(\d\d\d\d)/)
  return nil if m.nil?
  index = "#{ARGV[1]}/2015/#{m[1]}/index.txt"
  return nil unless File.exist?(index)

  result = nil
  File.read(index).scan(/^#{Regexp.escape(app_no)},.+拒絶.+$/) do |hit|
    # Each index line has three comma-separated fields; keep the file name of the last match.
    fields = Hash[ [:app_number, :oatype, :filename].zip(hit.split(/, /)) ]
    result = fields[:filename]
  end
  result
end
# Start: scan every file under the translations folder whose name ends in "doc" (case-insensitive).
docdir = ARGV[0]
Dir.glob("#{docdir}/**/*doc", File::FNM_CASEFOLD) do |filename|
  scrape(filename)
end
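The two name-matching patterns can be tried without Yomu, charlock_holmes, or any files on disk. The following is a minimal sketch, assuming the header lines look like the samples in the comment block at the top of the script. The helper `examiner_pair` is hypothetical and is not part of the script; it only reuses the same two regular expressions.

```ruby
# encoding: UTF-8

# Hypothetical helper (not in the script): applies the script's two regular
# expressions to already-extracted text and returns the line the script would print.
def examiner_pair(english_text, japanese_text)
  english_pattern  = /Examiner:\s(\w+\s+\w+)\s*[0-9A-Z]+\s[0-9A-Z]+/m
  japanese_pattern = /特許庁審査官\p{Z}+(\p{L}+)\p{Z}(\p{L}+)\p{Z}+\p{N}+\p{Z}[\p{N}\p{L}]+/

  eng = english_text[english_pattern, 1]      # e.g. "Fumihiko IBATO"
  ja  = japanese_text.match(japanese_pattern) # group 1: family name, group 2: given name
  return nil unless eng && ja

  "#{eng}, #{ja[1]} #{ja[2]}"
end

# Sample header lines taken from the comment block of the script.
english_header  = "Examiner: Fumihiko IBATO 9199 5X00"
japanese_header = "特許庁審査官 衣鳩 文彦 9199 5X00"

puts examiner_pair(english_header, japanese_header)
# prints: Fumihiko IBATO, 衣鳩 文彦
```

`\p{Z}` matches any Unicode separator and `\p{N}` any Unicode digit, so the Japanese pattern should also accept full-width spaces and full-width numerals, which may appear in the real data.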