@bilbof
Created October 22, 2019 14:27
De-duplicate queries w/counts
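require 'csv'

# Needs an in.csv with query and count columns and no header row
# (Integer raises on a non-numeric count). Writes the summed counts to out.csv.

# Lower-case the query, treat '+' as a space and trim surrounding whitespace.
def normalise_query(query)
  query.downcase.tr('+', ' ').strip # .split(' ').sort.join(' ')
end

# Sum the counts of all rows whose queries normalise to the same string.
totals = CSV.foreach('in.csv').each_with_object(Hash.new(0)) do |(query, count), hsh|
  hsh[normalise_query(query)] += Integer(count, 10)
end

# One output row per normalised query: [query, total count].
CSV.open('out.csv', 'wb') { |csv| totals.each { |row| csv << row } }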
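
To see what the script does without creating in.csv, here is a minimal sketch that runs the same aggregation on CSV text held in memory. The `dedupe` helper and the sample rows are hypothetical and only for illustration; they are not part of the original gist.

```ruby
require 'csv'

# Same normalisation as the script: lower-case, '+' becomes a space, trim.
def normalise_query(query)
  query.downcase.tr('+', ' ').strip
end

# Hypothetical helper (not in the original): sums counts per normalised
# query for CSV text passed as a string instead of being read from in.csv.
def dedupe(csv_text)
  CSV.parse(csv_text).each_with_object(Hash.new(0)) do |(query, count), totals|
    totals[normalise_query(query)] += Integer(count, 10)
  end
end

# Made-up sample rows: two spellings of one query, plus a reordered variant.
sample = "Red+Shoes,3\nred shoes ,2\nshoes red,1\n"
result = dedupe(sample)
```

With this sample, "Red+Shoes" and "red shoes " collapse into a single "red shoes" entry with a count of 5, while "shoes red" stays separate with a count of 1. Enabling the commented-out `.split(' ').sort.join(' ')` step would make the normalisation ignore word order as well, so all three rows would merge into "red shoes" with a count of 6.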