@pkqk
Created March 4, 2011 18:08
utf8 it the fuck up
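import sys


def force_utf8(fragment):
    """Decode bytes to text: UTF-8 first, then patched UTF-8, then Latin-1."""
    try:
        return fragment.decode('utf8')
    except UnicodeDecodeError:
        try:
            # Restore a right double quote (U+201D) whose last byte became '?'.
            return fragment.replace(b'\xe2\x80?', b'\xe2\x80\x9d').decode('utf8')
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a code point, so this cannot fail.
            return fragment.decode('latin1')


# Usage: python SCRIPT INFILE OUTFILE (tab-separated input, UTF-8 output)
with open(sys.argv[1], 'rb') as infile:
    with open(sys.argv[2], 'wb') as outfile:
        for line in infile:
            # Decode each tab-separated field on its own so one bad field
            # does not force the whole line through the Latin-1 fallback.
            fragments = line.split(b"\t")
            outfile.write(u"\t".join(force_utf8(frag) for frag in fragments).encode('utf8'))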
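A quick way to see which branch of the fallback chain handles what is to feed it one input per branch. This is only a sketch: `force_utf8` is restated with bytes literals so the block runs on its own, and the `samples` dictionary and its contents are made-up inputs for illustration, not data from the original script.

```python
def force_utf8(fragment):
    """Fallback chain from the script, restated so this block is self-contained."""
    try:
        return fragment.decode('utf8')
    except UnicodeDecodeError:
        try:
            # Patch a right double quote (U+201D) whose last byte became '?'.
            return fragment.replace(b'\xe2\x80?', b'\xe2\x80\x9d').decode('utf8')
        except UnicodeDecodeError:
            # Latin-1 accepts any byte sequence.
            return fragment.decode('latin1')


# Hypothetical sample inputs, one per branch of the chain.
samples = {
    'valid utf8': b'caf\xc3\xa9',                    # decodes on the first try
    'damaged quote': b'\xe2\x80\x9cquoted\xe2\x80?',  # needs the byte patch
    'latin1': b'caf\xe9',                            # falls through to Latin-1
}

results = {name: force_utf8(raw) for name, raw in samples.items()}
for name in sorted(results):
    print(name, repr(results[name]))
```

Because Latin-1 never raises, the last branch also swallows input that is neither UTF-8 nor Latin-1, so the chain always returns something but cannot tell you when it guessed wrong.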