Last active
December 15, 2015 10:39
Fixing UTF-8 issues in Redshift imports with Ruby 1.9.3 and Elastic MapReduce.
The reducer is pretty dumb: I tried to get this working with the identity reducer, but it didn't work. You might want to add some more cleanup in either the mapper or the reducer to trim text fields that are too big, make sure number fields are exported correctly, NaN vs nu…
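One of the extra cleanups mentioned above, trimming text fields that are too big, could look roughly like the sketch below. It is not part of the original scripts: `trim_field`, `trim_line` and `MAX_FIELD_BYTES` are hypothetical names, and the 256-byte limit is a placeholder for whatever size your target column actually has. It assumes the input is already valid UTF-8 and tab-separated.

```ruby
# encoding: utf-8

# Hypothetical limit; replace with the byte size of the target column.
MAX_FIELD_BYTES = 256

# Truncate a UTF-8 string to at most max_bytes bytes without cutting a
# multi-byte character in half.
def trim_field(field, max_bytes = MAX_FIELD_BYTES)
  return field if field.bytesize <= max_bytes
  cut = field.byteslice(0, max_bytes)
  # If the cut landed inside a character, drop bytes until it is valid again.
  cut = cut.byteslice(0, cut.bytesize - 1) until cut.valid_encoding?
  cut
end

# Apply the limit to every tab-separated field of one line.
def trim_line(line, max_bytes = MAX_FIELD_BYTES)
  fields = line.chomp.split("\t", -1) # -1 keeps trailing empty fields
  fields.map { |f| trim_field(f, max_bytes) }.join("\t")
end
```

In the mapper or the reducer you would call something like `puts trim_line(line)` instead of printing the line directly. The limit is counted in bytes rather than characters, on the assumption that the column size is a byte length.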
# Launch the streaming job (replace YOUR_BUCKET and the paths with your own).
./elastic-mapreduce --create --stream \
  --input s3n://YOUR_BUCKET/PATH_TO_FILES/ \
  --mapper s3n://YOUR_BUCKET/utf8-cleanup.rb \
  --reducer s3n://YOUR_BUCKET/utf8-cleanup-reducer.rb \
  --output s3n://YOUR_BUCKET/PATH_TO_FILES_CLEANED \
  --bootstrap-action "s3n://YOUR_BUCKET/ruby193.sh" \
  --debug \
  --num-instances 20
#!/bin/bash
# Bootstrap action (ruby193.sh in the command above): installs the
# ruby1.9.1-full package on every node.
sudo apt-get install -y ruby1.9.1-full
exit
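#!/usr/bin/ruby1.9.1
# encoding=utf-8
# Reducer (utf8-cleanup-reducer.rb in the command above): drops the random key
# that the mapper prepended and prints the rest of the line.
ARGF.each do |line|
  begin
    parts = line.split("\t")
    puts parts[1..-1].join("\t")
  rescue => e
    # Skip lines that cannot be processed, but leave a trace on stderr.
    $stderr.puts "skipped line: #{e.message}"
  end
end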
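A slightly more careful way to strip the key is sketched below. `strip_key` is a hypothetical helper, not part of the original reducer. It relies on `split` with a limit of 2, so everything after the first tab is passed through untouched, including trailing empty fields. It also passes lines without a tab through unchanged, which differs from the reducer above (that one prints an empty line in this case).

```ruby
# Remove the leading key, i.e. everything up to and including the first tab.
# Lines that contain no tab are returned unchanged.
def strip_key(line)
  _key, rest = line.split("\t", 2)
  rest.nil? ? line : rest
end
```

Inside the reducer loop this would be used as `print strip_key(line)`, since the line still carries its own newline.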
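#!/usr/bin/ruby1.9.1
# Mapper (utf8-cleanup.rb in the command above): replaces invalid or unwanted
# characters with U+FFFD and prefixes each line with a random key (0-99).
ARGF.each do |line|
  begin
    line.encode!('UTF-8', 'UTF-8', :invalid => :replace, :replace => "\uFFFD")
    # Replace control characters and anything outside U+0020..U+FFFF.
    puts "#{rand(100)}\t" + line.gsub(/[^\t\n\u0020-\uffff]/, "\uFFFD")
  rescue => e
    # Skip lines that cannot be processed, but leave a trace on stderr.
    $stderr.puts "skipped line: #{e.message}"
  end
end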
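One caveat, stated as an assumption rather than something verified on the original cluster: on Ruby 1.9.3, `encode` from UTF-8 to UTF-8 appears to be treated as a no-op, so invalid bytes may only surface as an exception in `gsub` and the line gets skipped. A common workaround is to round-trip through UTF-16, which forces every byte to be checked. The sketch below shows that variant; `scrub_utf8` and `clean_line` are hypothetical names, and the character filter is the same one the mapper uses.

```ruby
# encoding: utf-8

# Replace invalid UTF-8 byte sequences with U+FFFD by transcoding to UTF-16
# and back again.
def scrub_utf8(raw)
  text = raw.dup.force_encoding('UTF-8')
  utf16 = text.encode('UTF-16BE', :invalid => :replace, :undef => :replace,
                                  :replace => "\uFFFD")
  utf16.encode('UTF-8')
end

# Scrub invalid bytes, then apply the mapper's filter: control characters and
# anything outside U+0020..U+FFFF become U+FFFD (tab and newline are kept).
def clean_line(raw)
  scrub_utf8(raw).gsub(/[^\t\n\u0020-\uffff]/, "\uFFFD")
end
```

In the mapper, the two lines inside `begin` would then become `puts "#{rand(100)}\t" + clean_line(line)`.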