Created
May 27, 2012 00:28
alegomes/2795743
Converting unknown charset file to UTF-8
I had a dataset that was not UTF-8, so I had to find out which charset it used. The 'file' command didn't help:
$ file file_name.csv
file_name.csv: Non-ISO extended-ASCII C++ program text, with very long lines, with CRLF line terminators
So I wrote this bash snippet to figure out its encoding.
First, I converted the file from every charset that 'iconv' supports:
$ for f in $(iconv -l); do echo "Converting $f ..."; iconv -f "$f" -t UTF-8 < file_name.csv > "file_name.$f.csv"; done
Then I searched for the converted files containing some known word:
$ grep -li "são paulo" file_name.*.csv
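The convert-then-search approach can also be folded into a single function that avoids writing one output file per charset. This is only a sketch: the name `guess_charset` and its optional list of candidate charsets are made up for illustration and are not part of the original gist.

```shell
# Hypothetical helper (not from the gist): print every candidate charset
# under which FILE converts cleanly to UTF-8 and contains WORD.
# Usage: guess_charset FILE WORD [CHARSET...]
# The body runs in a subshell so its variables do not leak to the caller.
guess_charset() (
  input=$1
  word=$2
  shift 2
  # With no explicit candidates, fall back to everything iconv lists.
  # Some iconv versions append "//" to each name, so strip that.
  if [ "$#" -eq 0 ]; then
    set -- $(iconv -l | sed 's,//, ,g')
  fi
  tmp=$(mktemp) || exit 1
  for charset in "$@"; do
    # Keep the charset only if the conversion succeeds AND the
    # converted text contains the known word (case-insensitive).
    if iconv -f "$charset" -t UTF-8 < "$input" > "$tmp" 2>/dev/null &&
       grep -qi -- "$word" "$tmp"; then
      echo "$charset"
    fi
  done
  rm -f "$tmp"
)

# Example calls (file name as in the gist):
#   guess_charset file_name.csv "são paulo"
#   guess_charset file_name.csv "são paulo" ISO-8859-1 CP1252 UTF-16
```

Expect several charsets to be printed for the same file, since many encodings share the same byte for common accented letters. The list narrows the search rather than giving a single answer.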
I had a similar issue: one row contained a character that caused errors like this when selected from a PostgreSQL table set to SQL_ASCII. This was during the exploratory phase of upgrading from an old, unmaintained PostgreSQL, so I was able to extract the row from the original database.
ERROR: invalid byte sequence for encoding "UTF8": 0x96
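If the offending data is sitting in a text file rather than a database, one way to find which lines trigger this kind of error is to ask iconv to convert each line from UTF-8 to UTF-8 and report the lines it rejects. This is a minimal sketch; the function name `find_invalid_utf8_lines` is an assumption for illustration.

```shell
# Hypothetical helper: print the line numbers in FILE that are not valid UTF-8.
# iconv exits non-zero when its input contains an invalid byte sequence.
# Usage: find_invalid_utf8_lines FILE
find_invalid_utf8_lines() (
  n=0
  # The "|| [ -n "$line" ]" part also handles a last line without a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    if ! printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1; then
      echo "$n"
    fi
  done < "$1"
)

# Example call:
#   find_invalid_utf8_lines orig
```

This starts one iconv process per line, so it is slow on large files, but it is enough to find a handful of bad rows.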
Here's how I narrowed down the possible codesets.
# First, create a file "orig" with the mystery text.
# Then look at the file and find a useful bit of non-gibberish.
cat -v orig
# Set a variable containing that non-gibberish string.
# This will make later steps easier.
# You could alternately set this to the empty string, but then you'll have to do more work with your brain.
known_string=XyZpDq
# Eliminate the impossible and the gibberish.
for codeset in $(iconv -l |sed 's,//$,,') # you may have to adjust the sed depending on what version of iconv you're using :(
do
echo "$codeset"
if ! iconv -f "$codeset" -t UTF-8 < orig > "x.$codeset"
then
echo "impossible codeset: $codeset"
rm -f "x.$codeset"
continue # skip the grep below: the output file no longer exists
fi
if ! grep -q "$known_string" "x.$codeset"
then
echo "gibberish: $codeset"
rm -f "x.$codeset"
fi
done
# Now look for likely characters.
grep "$known_string" x.*
# Copy/paste the likely character into this line.
# Hopefully you'll have enough context to make a good guess.
weird_char='–'
# List matches.
grep "$known_string" x.* |grep "$weird_char"
# List codesets that are possible.
grep "$known_string" x.* |grep "$weird_char" |sed 's/:.*// ; s/^x\.//'
# Finally, give a stern lecture to somebody.
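When the error message names the offending byte, as with 0x96 above, a quick cross-check is to decode just that one byte under each surviving candidate. The sketch below assumes a hypothetical helper named `decode_byte`. CP1252 is used as the example because byte 0x96 is the en dash (U+2013) in Windows-1252, whereas in ISO-8859-1 the same byte is an invisible control character.

```shell
# Hypothetical helper: decode a single byte under a candidate codeset.
# Usage: decode_byte BYTE CODESET   (BYTE as a number, e.g. 0x96)
decode_byte() (
  # printf accepts 0x.. constants for numeric arguments; turn the value
  # into a three-digit octal escape so printf can emit the raw byte.
  octal=$(printf '%03o' "$1") || exit 1
  printf "\\$octal" | iconv -f "$2" -t UTF-8
)

# Example: compare a few candidates for the byte from the error message.
#   for c in CP1252 ISO-8859-1 CP850; do
#     printf '%s: ' "$c"; decode_byte 0x96 "$c"; echo
#   done
```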
this answer could be useful for someone
Hi,
iconv -l returned strings such as "UTF7//", so I wrote this to eliminate them. It could be useful to someone: