@esamson
Created July 12, 2012 09:23
Use chardet to guess a file's encoding and then iconv to convert the file to UTF-8
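#!/usr/bin/env python
"""Guess a file's encoding with chardet, then convert it to UTF-8 in place with iconv."""
import os
import subprocess
import sys

import chardet

orig = sys.argv[1]
# Read raw bytes; chardet works on bytes, not on decoded text.
with open(orig, 'rb') as f:
    rawdata = f.read()
enc = chardet.detect(rawdata)['encoding']
if enc is None:
    sys.exit("{0}: could not guess encoding".format(orig))
# Collapse byte-order-specific names (e.g. UTF-16LE) so iconv reads the BOM itself.
if enc.upper().startswith('UTF-16'):
    enc = 'UTF-16'
# Files that are already UTF-8 or plain ASCII need no conversion.
if enc.lower() not in ('utf-8', 'ascii'):
    print("{0}: {1}".format(enc, orig))
    utf8 = orig + '.utf8'
    # Passing arguments as a list avoids shell quoting problems in file names.
    with open(utf8, 'wb') as out:
        subprocess.check_call(['iconv', '-f', enc, '-t', 'UTF-8', orig], stdout=out)
    os.rename(utf8, orig)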
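
The conversion step can also be done without shelling out to iconv. The following is a minimal sketch, not part of the original gist: `to_utf8` is a hypothetical helper that takes the raw bytes and the guessed encoding name, and relies only on Python's built-in codecs. A strict decode fails loudly when the bytes are not valid in the guessed encoding, which is a cheap sanity check on the guess.

```python
def to_utf8(raw, enc):
    """Decode raw bytes using the guessed encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration. Raises UnicodeDecodeError if the
    bytes are not valid in `enc`.
    """
    # As in the script, collapse UTF-16LE/UTF-16BE to UTF-16 so the codec
    # reads the BOM and picks the byte order itself.
    if enc.upper().startswith('UTF-16'):
        enc = 'UTF-16'
    return raw.decode(enc).encode('utf-8')
```

A successful decode does not prove the guess was right: single-byte encodings such as Latin-1 accept any byte sequence, so a wrong guess there still produces output, just with the wrong characters.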

esamson commented Feb 13, 2013

The guessing part really didn't work out so well in practice. It worked well enough for UTF-16 but not for other encodings.
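
One way to limit the damage from a bad guess is to check the confidence score that `chardet.detect` returns alongside the encoding, and to skip the file when the score is low. This is only a sketch: `trusted_encoding` is a hypothetical helper, and the 0.9 threshold is an arbitrary assumption that would need tuning against real files.

```python
def trusted_encoding(result, min_confidence=0.9):
    """Return the guessed encoding only if the detector was confident enough.

    `result` is a dict shaped like the return value of chardet.detect(),
    with 'encoding' and 'confidence' keys. Returns None when there is no
    guess or the confidence is below `min_confidence` (an assumed threshold).
    """
    enc = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    if enc is None or confidence < min_confidence:
        return None
    return enc
```

In the script this would replace the direct `['encoding']` lookup, for example `enc = trusted_encoding(chardet.detect(rawdata))`, leaving the file untouched when the result is `None`.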
