@alegomes
Created May 27, 2012 00:28
Converting unknown charset file to UTF-8
I had a dataset that was not UTF-8, so I had to find out which charset it used. The 'file' command didn't help me out.
$ file file_name.csv
file_name.csv: Non-ISO extended-ASCII C++ program text, with very long lines, with CRLF line terminators
So, I made this bash script to figure out its encoding:
First, I converted the file to UTF-8 once for every source encoding that 'iconv' knows about:
$ for f in $(iconv -l); do echo "Converting $f ..."; iconv -f "$f" -t UTF-8 < file_name.csv > "file_name.$f.csv"; done
Then, I searched the converted files for a known word and listed the names of the files that contain it:
$ grep -li "são paulo" file_name.*.csv
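The two steps above can also be folded into a single loop that reports candidate encodings directly instead of leaving one output file per encoding. This is only a minimal sketch, assuming a POSIX shell with `iconv`, `grep`, and `mktemp` available; the function name `candidate_encodings` and its optional list of encodings are hypothetical and not part of the original script.

```shell
# candidate_encodings FILE PHRASE [ENCODING...]
# Prints each source encoding under which FILE converts to UTF-8 without
# error AND the converted text contains PHRASE (case-insensitive).
# With no ENCODING arguments, every name reported by `iconv -l` is tried.
candidate_encodings() (
  file=$1
  phrase=$2
  shift 2
  if [ "$#" -eq 0 ]; then
    # One name per word; drop the trailing "//" that some iconv builds print.
    set -- $(iconv -l | tr -s ' ,' '\n\n' | sed 's,//*$,,')
  fi
  tmp=$(mktemp "${TMPDIR:-/tmp}/enc.XXXXXX") || exit 1
  for enc in "$@"; do
    # Only grep the result when iconv itself reported success.
    if iconv -f "$enc" -t UTF-8 < "$file" > "$tmp" 2>/dev/null &&
       grep -qiF -- "$phrase" "$tmp"; then
      printf '%s\n' "$enc"
    fi
  done
  rm -f "$tmp"
)
```

For example, `candidate_encodings file_name.csv "são paulo"` tries everything, while `candidate_encodings file_name.csv "são paulo" ISO-8859-1 WINDOWS-1252` limits the search. Several encodings usually match, because many of them map common accented letters to the same bytes.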
@freejoe76

Hi,

`iconv -l` returned strings such as "UTF7//", so I wrote this to strip the trailing slashes. It could be useful to someone:

for f in $(iconv -l); do echo "Converting ${f%//} ..."; iconv -f "${f%//}" -t UTF-8 < badfile.xml > "goodfile.${f%//}.xml"; done
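
The `${f%//}` expansion used above removes one trailing `//` from the value if it is present and leaves the value untouched otherwise. A tiny sketch to illustrate it; the helper name `strip_iconv_suffix` is hypothetical and only wraps the expansion:

```shell
# strip_iconv_suffix NAME
# Prints NAME with one trailing "//" removed, if present.
strip_iconv_suffix() {
  printf '%s\n' "${1%//}"
}
```

So `strip_iconv_suffix 'UTF7//'` prints `UTF7`, and a name without the suffix passes through unchanged.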

@bugi
bugi commented Dec 4, 2015

I had a similar issue: one row contained a character that produced errors like the one below when selected from a pg table set to SQL_ASCII. This was during the exploratory phase of upgrading from an old, unmaintained pg, so I was able to extract the row from the original database.

ERROR: invalid byte sequence for encoding "UTF8": 0x96
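
An error like this can be reproduced outside the database by asking `iconv` to read the extracted text as UTF-8. A minimal sketch, assuming an `iconv` that exits non-zero on invalid input (GNU libc and GNU libiconv both do); the function name `is_valid_utf8` is hypothetical:

```shell
# is_valid_utf8 FILE
# Succeeds (exit status 0) only if FILE decodes cleanly as UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}
```

Usage would look like `is_valid_utf8 orig || echo "orig is not valid UTF-8"`. A lone 0x96 byte, as in the error above, is never valid UTF-8 on its own.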

Here's how I narrowed down the possible codesets.

# First, create a file "orig" with the mystery text.
# Then look at the file and find a useful bit of non-gibberish.
cat -v orig

# Set a variable containing that non-gibberish string.
# This will make later steps easier.
# You could alternately set this to the empty string, but then you'll have to do more work with your brain.
known_string=XyZpDq

# Eliminate the impossible and the gibberish.
for codeset in $(iconv -l |sed 's,//$,,')  # you may have to adjust the sed depending on what version of iconv you're using :(
do
  echo "$codeset"
  if ! iconv -f "$codeset" -t UTF-8 < orig > "x.$codeset"
  then
    echo "impossible codeset: $codeset"
    rm -f "x.$codeset"
    continue  # the output file is gone, so skip the gibberish check
  fi
  if ! grep -q -- "$known_string" "x.$codeset"
  then
    echo "gibberish: $codeset"
    rm -f "x.$codeset"
  fi
done

# Now look for likely characters.
grep "$known_string" x.*

# Copy/paste the likely character into this line.
# Hopefully you'll have enough context to make a good guess.
weird_char=''

# List matches.
grep "$known_string" x.* |grep "$weird_char"

# List codesets that are possible.
grep "$known_string" x.* |grep "$weird_char" |sed 's/:.*// ; s/^x\.//'

# Finally, give a stern lecture to somebody.
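
As a complement to `cat -v orig`, it can help to list exactly which non-ASCII byte values occur in the mystery text, since those are the only bytes whose meaning differs between most candidate codesets. A minimal sketch using standard `tr`, `od`, `sed`, and `sort`; the function name `non_ascii_bytes` is hypothetical:

```shell
# non_ascii_bytes FILE
# Prints each distinct byte value >= 0x80 found in FILE, as two hex digits,
# one per line, sorted.
non_ascii_bytes() {
  LC_ALL=C tr -d '\000-\177' < "$1" |  # keep only bytes above 0x7f
    od -An -v -tx1 |                   # dump them as hex pairs
    tr -s ' ' '\n' |                   # one hex pair per line
    sed '/^$/d' |                      # drop empty lines left by od's padding
    sort -u
}
```

Running `non_ascii_bytes orig` on a file whose only odd byte is the one from the error above would print `96`. Each value can then be looked up in the code charts of the codesets that survived the loop.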

@sumonst21
this answer could be useful for someone