@mahmoud
Created April 21, 2015 04:43
str/unicode encoding kwarg causes exceptions
The encoding keyword argument to the Python 3 str() and Python 2 unicode() constructors needlessly constrains the practical use of these core types.
Looking at common usage, both these constructors' primary mode is to convert various objects into text:
>>> str(2)
'2'
But adding an encoding yields:
>>> str(2, encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to str: need bytes, bytearray or buffer-like object, int found
While the error message is clear enough to an experienced developer, is the error necessary at all? Even harmlessly requesting a str from a str is punished when an encoding is given, yet leaving off the encoding works fine:
>>> str('hi', encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding str is not supported
>>> str('hi')
'hi'
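To see the two modes side by side, the short Python 3 script below runs each kind of argument through str() with and without an encoding. The helper name try_str is made up for this illustration, and the bytes rows are an addition that the transcripts above do not show.

```python
def try_str(*args, **kwargs):
    """Hypothetical helper: return str(...) or the name of the exception it raises."""
    try:
        return str(*args, **kwargs)
    except (TypeError, UnicodeDecodeError) as exc:
        return type(exc).__name__


# Each row pairs a description with what str() does for that call.
results = {
    "int, no encoding": try_str(2),
    "int, encoding": try_str(2, encoding="utf8"),
    "str, no encoding": try_str("hi"),
    "str, encoding": try_str("hi", encoding="utf8"),
    "bytes, no encoding": try_str(b"hi"),
    "bytes, encoding": try_str(b"hi", encoding="utf8"),
}

for label, outcome in results.items():
    print(f"{label:20} -> {outcome!r}")
```

Under a default Python 3 interpreter, only the bytes call with an encoding decodes anything. The bytes call without one falls back to the repr-style text "b'hi'", and the int and str calls with an encoding raise TypeError.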
Merging and simplifying the two modes of these constructors would yield much more predictable results for experienced and beginning Pythonistas alike. The encoding argument should be ignored if the primary argument is already a unicode/str instance or is a non-string object, and consulted only if the primary argument is a bytestring. Bytestrings already have a .decode() method; a second, more obscure version of it isn't necessary.
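As a rough sketch of those rules in Python 3 terms, the function below dispatches on the argument's type. The name lenient_str and the 'utf-8' default are assumptions made for illustration; this is not an existing or proposed API.

```python
def lenient_str(obj, encoding="utf-8", errors="strict"):
    """Hypothetical str() variant: consult encoding only for bytestrings."""
    if isinstance(obj, (bytes, bytearray)):
        # Bytestrings are the only inputs that actually need decoding.
        return str(obj, encoding=encoding, errors=errors)
    # Text and non-string objects ignore encoding/errors entirely.
    return str(obj)


print(lenient_str(2, encoding="utf8"))
print(lenient_str("hi", encoding="utf8"))
print(lenient_str(b"caf\xc3\xa9", encoding="utf8"))
```

With this shape, passing an encoding never raises for text or non-string arguments; it only matters when there are bytes to decode.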
Furthermore, despite the core nature and widespread usage of these types, changing this behavior should break very little existing code or established understanding. unicode() and str() will simply behave as expected more often, returning text versions of the arguments passed to them.
Appendix: To demonstrate the expected behavior of the proposed unicode/str, here is a code snippet we've employed to sanely and safely get a text version of an arbitrary object:
def to_unicode(obj, encoding='utf8', errors='strict'):
    # the encoding default should look at sys's value
    try:
        return unicode(obj)
    except UnicodeDecodeError:
        return unicode(obj, encoding=encoding, errors=errors)