@zakuroishikuro
Last active February 23, 2020 07:12
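An extreme sport: splitting a Japanese address into "prefecture / municipality / everything after" with the shortest possible regular expression. ref: http://qiita.com/zakuroishikuro/items/066421bce820e3c73ce9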
rex = /ごにょごにょ/ # placeholder: your regular expression goes here
p "東京都文京区後楽1丁目3−61".match(rex).captures
#=> ["東京都", "文京区", "後楽1丁目3−61"]
(.+?[都道府県])(.+?[市区町村])(.+)
(...??[都道府県])(.+?[市区町村])(.+)
(...??[都道府県])(.+?市.+?区|.+?[市区町村])(.+)
(...??[都道府県])(.+?郡.+?[町村]|.+?市.+?区|.+?[市区町村])(.+)
(...??[都道府県])((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村|十日町|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松|岩国|田川|大村)市|.+?郡.+?[町村]|.+?市.+?区|.+?[市区町村])(.+)
(...??[都道府県])((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村|十日町|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松|岩国|田川|大村)市|.+?郡(?:玉村|大町|.).*?[町村]|.+?市.+?区|.+?[市区町村])(.+)
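The move from `.+?` to `...??` in the prefecture group can be illustrated with a small sketch. The sample addresses and the variable names `naive` and `fixed` are assumptions chosen for illustration only. With `.+?`, the lazy match stops at the first character in `[都道府県]`, so 京都府 is cut after 都. With `...??`, the group must consume at least two characters before the suffix, which appears to be enough to keep 京都府 intact.

```ruby
# The first two candidates from the list above.
naive = /(.+?[都道府県])(.+?[市区町村])(.+)/
fixed = /(...??[都道府県])(.+?[市区町村])(.+)/

# Hypothetical sample addresses, not taken from KEN_ALL.CSV.
kyoto = "京都府宇治市宇治"
tokyo = "東京都文京区後楽1丁目3−61"

# The lazy `.+?` stops at 都, so the prefecture is split in the wrong place.
p naive.match(kyoto).captures
#=> ["京都", "府宇治市", "宇治"]

# `...??` consumes two characters first, then finds 府 as the suffix.
p fixed.match(kyoto).captures
#=> ["京都府", "宇治市", "宇治"]

# Both patterns agree on an address without this ambiguity.
p naive.match(tokyo).captures == fixed.match(tokyo).captures
#=> true
```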
require 'csv'

# Load the address data (put KEN_ALL.CSV in the same folder as this script)
print "\nParsing KEN_ALL.CSV... "
address_list = []
csv_path = File.expand_path("../KEN_ALL.CSV", __FILE__)
CSV.foreach csv_path, encoding: "Shift_JIS:UTF-8" do |row|
  # Keep only the prefecture, municipality, and town-area name.
  # This CSV's data structure is awful; strictly speaking, town-area names
  # split across rows should be joined, but skipping that is harmless here.
  address_list << row[6..8]
end
puts "done"

# Infinite loop (exit with Control + C)
trap :INT, :exit
loop do
  # Read the regular expression
  puts "\nEnter a regular expression (Control + C to exit):"
  begin
    rex = /#{gets.chomp}/
  rescue RegexpError
    puts "Failed to build the regular expression.", $!.message
    next
  end

  # Collect the match result for each address
  result = address_list.map do |address|
    address.join.match(rex).captures rescue []
  end

  # Judge: the captures must equal the original three columns
  fail_count = 0
  address_list.zip(result).each do |address, match|
    if address != match
      puts "Fail... #{address * ?|}\t(#{match * ?|})"
      fail_count += 1
    end
  end

  # Print the results
  all = address_list.count
  pct = (fail_count.to_f / all * 100).to_i
  puts "\nRegex: ", rex.source
  puts "\nFailures: ", "#{fail_count}/#{all} (#{pct}%)"
end
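The judging step of the script can also be tried without downloading KEN_ALL.CSV. The following is a minimal sketch, assuming a hand-written list of rows in the same three-column shape. The helper name `failures` and the sample rows are hypothetical and not part of the original script. It compares the second and third candidates from the regex list, which differ only in the `.+?市.+?区` alternative for wards inside a city.

```ruby
# Hypothetical helper: returns the rows whose three columns the regex
# fails to reproduce from the joined address string.
def failures(rex, address_list)
  address_list.reject do |address|
    m = address.join.match(rex)
    m && m.captures == address
  end
end

# Hand-written rows (prefecture, municipality, rest); not real KEN_ALL.CSV data.
samples = [
  ["東京都", "文京区", "後楽"],
  ["京都府", "宇治市", "宇治"],
  ["神奈川県", "横浜市中区", "山下町"],
]

simple    = /(...??[都道府県])(.+?[市区町村])(.+)/
with_ward = /(...??[都道府県])(.+?市.+?区|.+?[市区町村])(.+)/

# Without the ward alternative, 横浜市中区 is cut after 市.
p failures(simple, samples)
#=> [["神奈川県", "横浜市中区", "山下町"]]

# With it, all three sample rows round-trip.
p failures(with_ward, samples)
#=> []
```

A list this small only shows the mechanism. The exception list in the longer candidates exists because the full data set contains city names that themselves include 市, 町, or 村, which this sample does not cover.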