Last active
May 15, 2018 03:22
Estimate the prefecture from an address. ref: https://qiita.com/junara/items/dc5ab1ef4fb7c8872330
gem install levenshtein
> keyword_prediction = KeywordPrediction.new(path: "./name2prefecture.csv")
=> #<KeywordPrediction:0x00007ff9ca2afb68 @path="./name2prefecture.csv", @keyword_col="keyword", @prediction_col="prefecture_name">
> keyword_prediction.load
=> (output omitted)
> keyword_prediction.predict('高松市')
=> ["香川県"] # Shows the prefecture that matches exactly.
> keyword_prediction.predict('府中市')
=> ["東京都", "広島県"] # When there are several candidates, all of them are shown.
> keyword_prediction.predict('高 松市')
=> ["香川県", "高知県"] # For ambiguous input, candidates are shown in descending order of similarity.
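Going by the class source, `predict` returns an array of hashes (keyed by `'name'`, the prediction column, `'similarity'`, and the keyword column) rather than bare strings, so the lists shown above appear to be a simplified view. A minimal sketch of pulling out just the prefecture names follows. The `results` array is a hand-written stand-in for a real return value, assuming the default column names.

```ruby
# Hypothetical stand-in for what predict('府中市') might return;
# the hash keys follow the class's default column names.
results = [
  { 'name' => '府中市', 'prefecture_name' => '東京都', 'similarity' => 1.0, 'keyword' => '府中市' },
  { 'name' => '府中市', 'prefecture_name' => '広島県', 'similarity' => 1.0, 'keyword' => '府中市' }
]

# Keep only the predicted prefecture names, dropping duplicates.
prefectures = results.map { |result| result['prefecture_name'] }.uniq
p prefectures
```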
require 'csv'
require 'levenshtein'

class KeywordPrediction
  DEFAULT_KEYWORD_COL = 'keyword'
  DEFAULT_PREDICTION_COL = 'prefecture_name'

  attr_accessor :path, :index, :keyword_col, :prediction_col

  def initialize(args = {})
    @path = args[:path] || 'lib/name2prefecture.csv'
    @keyword_col = args[:keyword_col] || DEFAULT_KEYWORD_COL
    @prediction_col = args[:prediction_col] || DEFAULT_PREDICTION_COL
  end

  # Build the index from the given array, or from the CSV file when no array is given.
  def load(index_array: [])
    @index = index_array.length > 0 ? load_array(index_array) : load_csv
  end

  # 1.0 means identical strings; lower values mean less similar.
  def similarity(str1, str2)
    1 - Levenshtein.normalized_distance(str1, str2)
  end

  # Score every index row against str, most similar first (shorter keywords win ties).
  def compare_all(str)
    list = @index.map do |row|
      # Use the configured keyword column rather than a hard-coded 'keyword'.
      row.to_hash.merge({'similarity' => similarity(row[@keyword_col], str)})
    end
    list.sort_by {|h| [-h['similarity'], h['length']]}
  end

  def match(str, similarity = 0)
    # Keep only the best-scoring row for each prefecture
    similarities = {}
    compare_all(str).select do |row|
      if (row['similarity'] >= similarity) && (similarities[row[@prediction_col]].nil? || row['similarity'] > similarities[row[@prediction_col]])
        similarities[row[@prediction_col]] = row['similarity']
        true
      else
        false
      end
    end
  end

  # Return the match(es) that share the highest similarity.
  def predict(str, similarity = 0)
    results = match(str, similarity)
    max_similarity = results.map {|result| result['similarity']}.max
    results.select {|result| result['similarity'] == max_similarity}.map do |result|
      {'name' => str,
       @prediction_col => result[@prediction_col],
       'similarity' => result['similarity'],
       @keyword_col => result[@keyword_col]}
    end
  end

  private

  def load_csv
    index = []
    csv = CSV.read(@path, headers: true)
    csv.each do |row|
      index << row.to_hash.merge('length' => row[@keyword_col].length)
    end
    index
  end

  def load_array(array)
    # Instead of reading a CSV file, you can build an array in Rails with Model.all.map {|d| d.attributes} and pass it in as load(index_array: Model.all.map {|d| d.attributes}).
    array
  end
end
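The `similarity` method delegates to the levenshtein gem. To make the score easier to reason about, here is a pure-Ruby sketch of a normalized edit-distance similarity. The names `edit_distance` and `sketch_similarity` are hypothetical, and dividing by the longer string's length is an assumption about how the gem normalizes, so values may not match `Levenshtein.normalized_distance` exactly.

```ruby
# Pure-Ruby Levenshtein distance (Wagner-Fischer dynamic programming).
# Hypothetical helper for illustration; not part of the gist or the gem.
def edit_distance(a, b)
  a_chars = a.chars
  b_chars = b.chars
  # previous[j] is the distance between the processed prefix of a
  # and the first j characters of b.
  previous = (0..b_chars.length).to_a
  a_chars.each_with_index do |a_char, i|
    current = [i + 1]
    b_chars.each_with_index do |b_char, j|
      cost = a_char == b_char ? 0 : 1
      current << [previous[j + 1] + 1, # deletion
                  current[j] + 1,      # insertion
                  previous[j] + cost   # substitution or match
                 ].min
    end
    previous = current
  end
  previous.last
end

# Assumption: normalize by the length of the longer string.
def sketch_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - edit_distance(a, b).to_f / longest
end

puts sketch_similarity('高松市', '高 松市')
puts sketch_similarity('高知市', '高 松市')
```

Under this assumption, a stray space in a three-character city name costs one edit out of four characters, which is why a misspelled input can still rank the intended city above its neighbours.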
keyword,prefecture_name
高松市,香川県
府中市,東京都
府中市,広島県
高知市,高知県
... (and so on)
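As a quick check on the data shape, the sample rows above can be parsed from a string with the standard `csv` library and grouped by keyword, which shows why one city name can map to more than one prefecture. This is only a sketch: `csv_text` and `by_keyword` are hypothetical names, and the trailing placeholder row is left out.

```ruby
require 'csv'

# Sample rows copied from the CSV above (placeholder row omitted).
csv_text = <<~CSV
  keyword,prefecture_name
  高松市,香川県
  府中市,東京都
  府中市,広島県
  高知市,高知県
CSV

# Parse with a header row and turn each row into a plain Hash.
rows = CSV.parse(csv_text, headers: true).map(&:to_h)

# Map each keyword to the list of prefectures it appears under.
by_keyword = rows.group_by { |row| row['keyword'] }
                 .transform_values { |group| group.map { |row| row['prefecture_name'] } }

p by_keyword['府中市']
```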