@simonseo
Created February 13, 2019 08:42
Fixes JSON files whose UTF-8 text was wrongly escaped byte by byte as Latin-1 characters (mojibake), e.g. "é" stored as \u00c3\u00a9
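import json
import re
from functools import partial

input_filename = "message.json"
output_filename = "message_new.json"

# Turn each \u00XX escape with XX >= 0x80 back into the raw byte XX, so runs of
# such escapes become the UTF-8 byte sequences they were meant to be. ASCII
# escapes such as \u0022 or \u000a are left alone; unescaping them would
# produce invalid JSON.
fix_mojibake_escapes = partial(
    re.compile(rb'\\u00([89a-f][\da-f])').sub,
    lambda m: bytes.fromhex(m.group(1).decode()))

with open(input_filename, 'rb') as binary_data:
    repaired = fix_mojibake_escapes(binary_data.read())
data = json.loads(repaired.decode('utf8'))

with open(output_filename, 'w', encoding='utf-8') as outfile:
    json.dump(data, outfile, ensure_ascii=False)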
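
To see what the substitution does without touching any files, here is a minimal in-memory sketch. The sample payload and the `name` key are made up for illustration, and the pattern only targets escapes for bytes 0x80 to 0xff, which is an assumption of this sketch rather than something the gist states.

```python
import json
import re

# Hypothetical sample: "café" whose UTF-8 bytes for "é" (c3 a9) were each
# written out as a \u00XX escape, as if they were Latin-1 characters.
broken = rb'{"name": "caf\u00c3\u00a9"}'

# Assumption of this sketch: only escapes for bytes >= 0x80 are rewritten.
MOJIBAKE_ESCAPE = re.compile(rb'\\u00([89a-f][\da-f])')


def fix_mojibake_escapes(raw: bytes) -> bytes:
    """Replace each matched \\u00XX escape with the single raw byte XX."""
    return MOJIBAKE_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode()), raw)


# Loading the file as-is yields two wrong characters instead of one "é".
naive = json.loads(broken.decode('utf8'))
assert naive["name"] == "cafÃ©"

# After the substitution the raw bytes are valid UTF-8 again.
repaired = fix_mojibake_escapes(broken)
data = json.loads(repaired.decode('utf8'))
assert data["name"] == "café"
```

Working on the raw bytes before parsing means the fix applies to every string in the file at once, keys included, without having to walk the parsed structure.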