@alegomes
Created May 27, 2012 00:28
Converting unknown charset file to UTF-8
I had a dataset that was not UTF-8, so I had to find out which charset it used. The 'file' command did not help:
$ file file_name.csv
file_name.csv: Non-ISO extended-ASCII C++ program text, with very long lines, with CRLF line terminators
So, I made this bash script to figure out its encoding:
First, I converted the file to UTF-8 once for every source encoding that 'iconv' supports:
$ for f in $(iconv -l); do echo "Converting $f ..."; iconv -f "$f" -t UTF-8 < file_name.csv > "file_name.$f.csv"; done
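The loop above can trip over encoding names that are awkward in file names (some iconv builds print names with a trailing "//" or with embedded slashes), and it keeps output even when a conversion fails. Below is a minimal sketch of a more defensive variant. The function name convert_all, its arguments, and the output naming are assumptions made for illustration, not part of the original script.

```shell
# Hypothetical helper: convert_all INPUT OUTDIR [ENCODING...]
# Converts INPUT to UTF-8 once per source encoding and keeps only the
# conversions that iconv accepts. Without explicit encodings, it tries
# everything reported by `iconv -l`.
convert_all() {
  input=$1
  outdir=$2
  shift 2
  if [ "$#" -eq 0 ]; then
    # Some iconv builds append "//" or "," to names; strip both.
    set -- $(iconv -l | tr -d ',' | sed 's|//$||')
  fi
  mkdir -p "$outdir"
  stem=$(basename "$input" .csv)
  for enc in "$@"; do
    # Replace characters that are unsafe in file names.
    safe=$(printf '%s' "$enc" | tr '/:' '__')
    out="$outdir/$stem.$safe.csv"
    # Remove the output when iconv rejects the input for this encoding.
    if ! iconv -f "$enc" -t UTF-8 < "$input" > "$out" 2>/dev/null; then
      rm -f "$out"
    fi
  done
}
```

For example, `convert_all file_name.csv candidates` would fill a candidates directory with one file per encoding that decoded without error. Many single-byte encodings accept any input, so a successful conversion alone does not identify the right charset.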
Then, I searched for the converted files that contain a known word spelled correctly:
$ grep -li "são paulo" *
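Since the encoding is embedded in each output file name, the search result can be reduced to a plain list of candidate encodings. The sketch below assumes files named like base.ENCODING.csv, as produced by the conversion loop, with a base name that contains no dots. The function name list_matching_encodings is hypothetical.

```shell
# Hypothetical helper: list_matching_encodings WORD FILE...
# Prints the ENCODING part of every FILE (named base.ENCODING.csv)
# whose content contains WORD, ignoring case.
list_matching_encodings() {
  word=$1
  shift
  # grep -l prints only the names of matching files; sed then strips
  # the leading "base." and the trailing ".csv".
  grep -li -- "$word" "$@" | sed -e 's/^[^.]*\.//' -e 's/\.csv$//'
}
```

Run from the directory that holds the converted files, for example `list_matching_encodings "são paulo" file_name.*.csv`. Several related encodings may match, because they often map the same bytes to the same accented letters.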
@sumonst21

this answer could be useful for someone