Created August 2, 2013 04:12
Uses https://github.com/brianmario/charlock_holmes to detect the encoding of each line of input and converts it to UTF-8 for output.
#!/usr/bin/env ruby
#
# Converts character encoding of STDIN on a line-by-line basis.
# Properly-encoded UTF-8 is sent to STDOUT.
# Informative messages are sent to STDERR.
#
require 'charlock_holmes'

desired_encoding = 'UTF-8'
line_count = 0

$stdin.each do |line|
  line_count += 1
  # Leading/trailing whitespace is removed; blank lines are skipped entirely.
  line.strip!
  next if line.length == 0

  detection = CharlockHolmes::EncodingDetector.detect(line)
  if detection && detection[:type] == :text
    if detection[:encoding] != desired_encoding
      $stderr.puts "Line #{line_count} converted from #{detection[:encoding]} (#{detection[:confidence]}% confidence) #{detection}"
      $stdout.puts CharlockHolmes::Converter.convert line, detection[:encoding], desired_encoding
    else
      # Already in the desired encoding: pass through unchanged.
      $stdout.puts line
    end
  elsif detection && detection[:type] == :binary
    # Binary content is passed through unchanged.
    $stderr.puts "Line #{line_count} is detected as binary, not text. (#{detection[:confidence]}% confidence) #{detection}"
    $stdout.puts line
  else
    # No detection result (nil) or an unrecognised type: pass through unchanged.
    $stderr.puts "Line #{line_count} COULD NOT BE IDENTIFIED."
    $stdout.puts line
  end
end
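To make the conversion step concrete, here is a minimal sketch of what turning one line into UTF-8 amounts to at the byte level. It is not the gist's method: it uses Ruby's built-in `String#encode` instead of `CharlockHolmes::Converter`, and the source encoding is hard-coded to ISO-8859-1 rather than detected. The helper name `to_utf8` and the sample line are hypothetical.

```ruby
# Hypothetical helper (not part of the gist): tags raw bytes with a known
# source encoding, then transcodes them to UTF-8 with the standard library.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# "café" as ISO-8859-1 bytes: the "é" is the single byte 0xE9,
# which is not valid UTF-8 on its own.
latin1_line = "caf\xE9".b

utf8_line = to_utf8(latin1_line, 'ISO-8859-1')

# Show the resulting bytes: "é" becomes the two-byte sequence C3 A9.
puts utf8_line.bytes.map { |b| format('%02X', b) }.join(' ')  # prints "63 61 66 C3 A9"
puts utf8_line.valid_encoding?                                # prints "true"
```

In the script above, the source encoding comes from the detector's `:encoding` value instead of being fixed, which is why each conversion is logged to STDERR together with its confidence.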