@RobinDaugherty
Created August 2, 2013 04:12
Uses https://github.com/brianmario/charlock_holmes to detect encoding of each line of input, converting to UTF-8 for output.
#!/usr/bin/env ruby
#
# Converts character encoding of STDIN on a line-by-line basis.
# Properly-encoded UTF-8 is sent to STDOUT.
# Informative messages are sent to STDERR.
#
require 'charlock_holmes'
desired_encoding = 'UTF-8'
line_count = 0
$stdin.each do |line|
  line_count += 1
  line.strip!
  next if line.empty?
  detection = CharlockHolmes::EncodingDetector.detect(line)
  if detection && detection[:type] == :text
    if detection[:encoding] != desired_encoding
      $stderr.puts "Line #{line_count} converted from #{detection[:encoding]} (#{detection[:confidence]}% confidence) #{detection}"
      $stdout.puts CharlockHolmes::Converter.convert(line, detection[:encoding], desired_encoding)
    else
      $stdout.puts line
    end
  elsif detection && detection[:type] == :binary
    # Binary data is passed through unchanged.
    $stderr.puts "Line #{line_count} is detected as binary, not text. (#{detection[:confidence]}% confidence) #{detection}"
    $stdout.puts line
  else
    # No usable detection result (nil or an unexpected type); pass the line through unchanged.
    $stderr.puts "Line #{line_count} COULD NOT BE IDENTIFIED."
    $stdout.puts line
  end
end
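
For comparison, the conversion step on its own can be sketched with Ruby's standard library, assuming the source encoding of a line is already known (for example, taken from a detection result). This is not how the script above does it: it swaps CharlockHolmes::Converter.convert for String#encode, and the helper name to_utf8 is hypothetical. Encoding names reported by a detector may not all be names that Ruby recognizes, so treat this as an illustration only.

```ruby
# Hypothetical helper, not part of charlock_holmes: re-encode one line to
# UTF-8 when its source encoding is already known.
def to_utf8(line, source_encoding)
  # Relabel the raw bytes as source_encoding (the bytes are not changed),
  # then transcode, asking encode to substitute a replacement character
  # for bytes it cannot convert rather than raise.
  line.dup.force_encoding(source_encoding)
      .encode('UTF-8', invalid: :replace, undef: :replace)
end

latin1_bytes = "caf\xE9".b # the word "café" as ISO-8859-1 bytes
puts to_utf8(latin1_bytes, 'ISO-8859-1')
```

The helper works on a copy, so the caller's original bytes stay available for logging or for a second attempt with a different encoding.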