Skip to content

Instantly share code, notes, and snippets.

@dnng
Created October 13, 2017 20:04
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dnng/503188c8a70a0ef598404913caa90f89 to your computer and use it in GitHub Desktop.
Save dnng/503188c8a70a0ef598404913caa90f89 to your computer and use it in GitHub Desktop.
Fix file encoding when file has unicode code points written as strings (eg: "a string with u00e1 or \u00d1 code point as string in it")
from re import compile, findall, split
with open('original_file', 'r') as of:
with open('fixed_file', 'w') as ff:
for line in of:
# looks for substrings that are either 'u00fo' or '\u00fo'
pattern = compile(r'\\?u00..')
code_points = findall(pattern, line)
pieces = split(pattern, line)
res = ''
for idx, cp in enumerate(code_points):
res += pieces[idx] + chr(int(cp[1::] if cp[0] == 'u' else cp[2::], 16))
else:
res += pieces[-1]
ff.write(res)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment