@thattommyhall
Last active December 15, 2015 10:39
Fixing UTF-8 issues in Redshift imports with Ruby 1.9.3 and Elastic MapReduce. The reducer is pretty dumb; I tried to get it working with the identity reducer, but it didn't work. You might want to add some more cleanup in either the mapper or the reducer to trim text fields that are too big, make sure number fields are exported correctly, NaN vs nu…
./elastic-mapreduce --create --stream \
--input s3n://YOUR_BUCKET/PATH_TO_FILES/ \
--mapper s3n://YOUR_BUCKET/utf8-cleanup.rb \
--reducer s3n://YOUR_BUCKET/utf8-cleanup-reducer.rb \
--output s3n://YOUR_BUCKET/PATH_TO_FILES_CLEANED \
--bootstrap-action "s3n://YOUR_BUCKET/ruby193.sh" \
--debug \
--num-instances 20
#!/bin/bash
# Bootstrap action (ruby193.sh): installs the ruby1.9.1-full package on each node.
sudo apt-get install -y ruby1.9.1-full
exit
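#!/usr/bin/ruby1.9.1
# encoding=utf-8
# Reducer (utf8-cleanup-reducer.rb): drops the random key column that the
# mapper prepended and emits the rest of each line unchanged.
ARGF.each do |line|
  begin
    parts = line.split("\t")
    puts parts[1..-1].join("\t")
  rescue
    # Skip any line that cannot be processed.
  end
end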
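The description suggests adding cleanup that trims text fields that are too big. A minimal sketch of how that could look in the reducer follows. `MAX_FIELD_BYTES`, `trim_to_bytes`, and `strip_key_and_trim` are hypothetical names, the 256-byte limit is an assumption rather than anything from the gist, and the input is assumed to be valid UTF-8 already.

```ruby
# Hypothetical limit, not from the original gist: set it to the byte length
# of the narrowest text column in the target table.
MAX_FIELD_BYTES = 256

# Trim a string to at most max_bytes bytes without cutting a multi-byte
# UTF-8 character in half. Assumes the input is valid UTF-8.
def trim_to_bytes(text, max_bytes = MAX_FIELD_BYTES)
  return text if text.bytesize <= max_bytes
  trimmed = text.byteslice(0, max_bytes)
  # Drop trailing bytes until what remains is valid UTF-8 again.
  trimmed = trimmed.byteslice(0, trimmed.bytesize - 1) until trimmed.valid_encoding?
  trimmed
end

# Strip the leading key column (as the reducer does) and trim every
# remaining field. The -1 limit keeps trailing empty fields.
def strip_key_and_trim(line, max_bytes = MAX_FIELD_BYTES)
  fields = line.chomp.split("\t", -1)
  fields.drop(1).map { |field| trim_to_bytes(field, max_bytes) }.join("\t")
end
```

Inside the reducer loop, `puts strip_key_and_trim(line)` could take the place of the split-and-join lines. The limit is counted in bytes rather than characters so that multi-byte text is measured by its encoded size.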
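#!/usr/bin/ruby1.9.1
# Mapper (utf8-cleanup.rb): meant to replace invalid UTF-8 byte sequences, and
# any character other than tab, newline and U+0020..U+FFFF, with U+FFFD. Each
# line is prefixed with a random key (0-99) so lines spread across reducers.
ARGF.each do |line|
  begin
    line.encode!('UTF-8', 'UTF-8', :invalid => :replace, :replace => "\uFFFD")
    puts "#{rand(100)}\t" + line.gsub(/[^\t\n\u0020-\uffff]/, "\uFFFD")
  rescue
    # Skip any line that still cannot be processed.
  end
end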
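If a newer Ruby is available on the cluster, `String#scrub` (Ruby 2.1 and later) is another way to express the mapper's cleanup step. The sketch below assumes such a Ruby, which is not what this gist uses. `clean_line` and `REPLACEMENT` are hypothetical names, and the character class is copied from the mapper.

```ruby
# U+FFFD REPLACEMENT CHARACTER, the same substitute the mapper uses.
REPLACEMENT = "\uFFFD"

# Return a cleaned copy of line:
#   1. treat the bytes as UTF-8 and replace invalid sequences,
#   2. replace every character other than tab, newline and U+0020..U+FFFF
#      (so control characters and characters above U+FFFF are replaced too).
def clean_line(line)
  text = line.dup.force_encoding(Encoding::UTF_8)
  text = text.scrub(REPLACEMENT)
  text.gsub(/[^\t\n\u0020-\uffff]/, REPLACEMENT)
end
```

In the mapper loop this could be used as `puts "#{rand(100)}\t" + clean_line(line)`. Keeping the cleanup in a function of its own makes it easy to check against sample lines before starting a job flow.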