@ynkdir
Created March 12, 2011 16:29
codecs backslashreplace for decode()
# codecs backslashreplace for decode()
# encoding: utf-8
from __future__ import unicode_literals

import codecs
import sys

# Keep a reference to the built-in handler so non-decode errors still use it.
_backslashreplace_errors = codecs.backslashreplace_errors


def backslashreplace_errors(exc):
    if isinstance(exc, UnicodeDecodeError):
        if sys.version_info[0] >= 3:
            # Python 3: iterating over bytes yields ints.
            tohex = lambda c: "\\x{0:02x}".format(c)
        else:
            # Python 2: iterating over a byte string yields 1-character strings.
            tohex = lambda c: "\\x{0:02x}".format(ord(c))
        # Escape every undecodable byte, then resume decoding at exc.end.
        u = "".join(tohex(c) for c in exc.object[exc.start:exc.end])
        return (u, exc.end)
    return _backslashreplace_errors(exc)


def test():
    # Overrides the built-in "backslashreplace" handler for the whole process.
    codecs.register_error("backslashreplace", backslashreplace_errors)
    u = "あいうえお"
    s = u.encode("utf-8")
    x = s.decode("ascii", "backslashreplace")
    print(x)


if __name__ == "__main__":
    test()
@galtgendo

This code seems to produce inconsistent results on Python 3.4.1 for byte strings with embedded null bytes.
Of the following:

  • b"nm\xf6p\x00p"
  • b"nm\xf6p\x00pj"
  • b"nm\xf6p\x00pjk"

only the longest gives the expected result for decode('utf-8', 'backslashreplace').
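A minimal sketch for reproducing the check above, assuming Python 3 only. The handler name `backslashreplace_decode` and the helper `check` are hypothetical, chosen so the built-in `backslashreplace` handler is left untouched; the "expected" value assumes that only `\xf6` is invalid UTF-8 in these inputs and that the null byte and ASCII letters decode unchanged. It does not explain the 3.4.1 behaviour; it only makes the comparison easy to run on a given interpreter.

```python
import codecs


def backslashreplace_decode(exc):
    """Replace undecodable bytes with \\xNN escapes (Python 3 only)."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return ("".join("\\x{0:02x}".format(b) for b in bad), exc.end)
    raise exc


# Hypothetical handler name; the built-in "backslashreplace" is not overridden.
codecs.register_error("backslashreplace_decode", backslashreplace_decode)

# The three byte strings from the report above.
SAMPLES = [b"nm\xf6p\x00p", b"nm\xf6p\x00pj", b"nm\xf6p\x00pjk"]


def check(samples=SAMPLES):
    """Return (raw, decoded, matches_expected) for each sample."""
    results = []
    for raw in samples:
        got = raw.decode("utf-8", "backslashreplace_decode")
        # Assumption: only byte 2 (\xf6) is invalid; the rest is plain ASCII.
        expected = "nm\\xf6" + raw[3:].decode("ascii")
        results.append((raw, got, got == expected))
    return results


if __name__ == "__main__":
    for raw, got, ok in check():
        print(repr(raw), "->", repr(got), "OK" if ok else "MISMATCH")
```

Note that since Python 3.5 the built-in `backslashreplace` handler also works for decoding, so on newer interpreters the custom handler may not be needed at all.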