Skip to content

Instantly share code, notes, and snippets.

@iaingray
Last active August 29, 2015 14:02
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save iaingray/31ec172408bc409727e0 to your computer and use it in GitHub Desktop.
Save iaingray/31ec172408bc409727e0 to your computer and use it in GitHub Desktop.
Quick and dirty script to decode url-encoded CIDs from redshift/pg database
#db settings contains params for @conn
require_relative 'db_settings.rb'
require 'pg'
require 'open-uri'
conn = PG::Connection.new(:dbname => @database, :host => @host , :port => @port, :user => @user, :password => @password )
decoded_cids = {}
sql = {
:get_cids => 'SELECT raw_cid FROM util.cids WHERE decoded_cid IS NULL',
:create_temp_table => 'CREATE TEMPORARY TABLE #tmp_decoded_cids (raw_cid varchar (3000), decoded_cid varchar (3000))',
:populate_tmp_table => 'INSERT INTO #tmp_decoded_cids VALUES',
:count_undecoded_cids => 'SELECT COUNT(*) FROM util.cids WHERE decoded_cid IS NULL',
:update_from_tmp => 'UPDATE util.cids SET decoded_cid = #tmp_decoded_cids.decoded_cid FROM #tmp_decoded_cids WHERE cids.raw_cid = #tmp_decoded_cids.raw_cid'
}
#cleans utf-8 strings by decoding then re-encoding them
def clean_string (new_value)
# Converting ASCII-8BIT to UTF-8 based domain-specific guesses
if new_value.is_a? String
begin
# Try it as UTF-8 directly
cleaned = new_value.dup.force_encoding('UTF-8')
unless cleaned.valid_encoding?
# Some of it might be old Windows code page
cleaned = new_value.encode( 'UTF-8', 'Windows-1252' )
end
new_value = cleaned
rescue EncodingError
# Force it to UTF-8, throwing out invalid bits
new_value.encode!( 'UTF-8', invalid: :replace, undef: :replace )
end
end
end
conn.exec (sql[:create_temp_table])
undecoded_cid_count_start = conn.exec(sql[:count_undecoded_cids]).getvalue(0,0)
puts "#{undecoded_cid_count_start} undecoded cids in database"
#make a value list of all decoded cids
i=0
conn.exec(sql[:get_cids]).each do |cid|
raw_cid = cid['raw_cid']
decoded_cid = clean_string(URI::decode(raw_cid))
sql[:populate_tmp_table] << "('#{raw_cid}', '#{decoded_cid}'),"
i = i+1
end
if(i>0) then
#remove trailing comma
sql[:populate_tmp_table].chop!
puts "Updating table with decoded cids..."
conn.exec (sql[:populate_tmp_table]) #insert values into temp table
conn.exec (sql[:update_from_tmp])
puts "...done"
undecoded_cid_count_end = conn.exec(sql[:count_undecoded_cids]).getvalue(0,0)
total_cids_decoded = undecoded_cid_count_start.to_i - undecoded_cid_count_end.to_i
puts "#{total_cids_decoded} decoded cids updated"
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment