@tas50
Created April 14, 2019 17:52
Use Jaro-Winkler similarity to analyze brand name data in OSM (OpenStreetMap)
#!/usr/local/opt/ruby/bin/ruby
require 'json'
require 'jaro_winkler'

# Load every brands file and merge the entries into one hash keyed by "type|name".
files = Dir.glob('brands/**/*.json')
brand_data = {}
files.each { |f| brand_data.merge!(JSON.parse(File.read(f))) }
brands = brand_data.keys

# Load all known names, also keyed by "type|name"; the values are printed next to each match.
name_data = JSON.parse(File.read('dist/names_all.json'))

# Compare every brand against every known name and print the pairs scoring above 0.95.
brands.each do |b|
  _b_type, b_name = b.split('|')
  name_data.each_key do |n|
    _n_type, n_name = n.split('|')
    # Skip identical names and any entry listed in the brand's 'nomatch' field.
    next if n_name == b_name
    next if brand_data[b].key?('nomatch') && brand_data[b]['nomatch'].include?(n)

    # Despite its name, JaroWinkler.distance returns a similarity score (1.0 means identical).
    score = JaroWinkler.distance(b_name, n_name, ignore_case: true)
    puts "#{b} -> #{n}: #{name_data[n]} at #{score}" if score > 0.95
  end
end
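
To give a feel for what the 0.95 cutoff in the script means, here is a minimal pure-Ruby sketch of the textbook Jaro-Winkler similarity. It is not the jaro_winkler gem's implementation, and the gem may differ in details such as thresholds or edge cases. The method names `jaro_similarity` and `jaro_winkler_similarity` and the `prefix_weight` parameter are made up for this illustration.

```ruby
# Textbook Jaro similarity: a sketch for illustration, not the jaro_winkler gem's code.
def jaro_similarity(s1, s2)
  return 1.0 if s1 == s2
  return 0.0 if s1.empty? || s2.empty?

  a = s1.chars
  b = s2.chars
  # Two characters match only if they are equal and at most `window` positions apart.
  window = [[a.length, b.length].max / 2 - 1, 0].max
  a_matched = Array.new(a.length, false)
  b_matched = Array.new(b.length, false)
  matches = 0

  a.each_with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, b.length - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != ch

      a_matched[i] = true
      b_matched[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  # Matched characters that line up in a different order count as half a transposition each.
  a_seq = a.each_index.select { |i| a_matched[i] }.map { |i| a[i] }
  b_seq = b.each_index.select { |j| b_matched[j] }.map { |j| b[j] }
  transpositions = a_seq.zip(b_seq).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / a.length + m / b.length + (m - transpositions) / m) / 3.0
end

# Winkler adjustment: boost the Jaro score for a shared prefix of up to four characters.
# prefix_weight is a hypothetical parameter name; 0.1 is the commonly used value.
def jaro_winkler_similarity(s1, s2, prefix_weight: 0.1, ignore_case: false)
  s1, s2 = s1.downcase, s2.downcase if ignore_case
  jaro = jaro_similarity(s1, s2)
  prefix = s1.chars.zip(s2.chars).take(4).take_while { |x, y| x == y }.length
  jaro + prefix * prefix_weight * (1.0 - jaro)
end

puts jaro_winkler_similarity('MARTHA', 'MARHTA').round(4) # → 0.9611
```

Because the shared-prefix boost pushes near-identical names toward 1.0, a cutoff of 0.95 tends to surface names that differ only by a small edit such as swapped or missing characters, which is the kind of near-duplicate the script is looking for.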