@abadger
Created June 28, 2019 22:00
When are native literal strings safe?
Why do we have unadorned string literals (native strings) in our codebase?
Doesn't that put us in danger of UnicodeError exceptions?
(1) Your codebase should be using text by default. At the borders, you convert
strings from other APIs into text and then use text throughout, only
converting to bytes (or native strings) when those types are needed for
another, outside API.
(2) On Python2, text can be safely combined with (or compared to) text [1]_. Bytes
can be combined with bytes. And ascii-only bytes can be combined with text.
(3) On Python2, native strings are bytes so they follow the same rules as bytes:
Safe to combine native strings with bytes. Only safe to combine ascii-only
native strings with text.
(4) On Python3, text can be safely combined with text. Bytes can be combined
with bytes. Bytes and text can **never** be safely combined without an
explicit conversion of one value or the other.
(5) On Python3, native strings are text so they follow the same rules as text:
Only safe to combine native strings with text.
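Point (1) can be sketched with a pair of conversion helpers. The names
``to_text`` and ``to_bytes``, the utf-8 default, and the sample values are
assumptions made for illustration; they are not taken from this note.

```python
def to_text(value, encoding='utf-8', errors='strict'):
    """Return value as a text string, decoding it if it is a byte string."""
    # On Python2, bytes is an alias for str, so native strings are decoded too.
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


def to_bytes(value, encoding='utf-8', errors='strict'):
    """Return value as a byte string, encoding it if it is a text string."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding, errors)


# Inbound border: whatever the outside API handed us becomes text.
b_raw = b'caf\xc3\xa9'  # stand-in for bytes read from a file or socket
name = to_text(b_raw)

# Inside the codebase: text combined with text.
greeting = u'Hello, ' + name

# Outbound border: convert only because the outside API needs bytes.
b_payload = to_bytes(greeting)
```

Both helpers only handle string types; anything else would need its own
handling before reaching them.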
If you understand all of the above, you'll find that the combinations which are
safe on both Python2 and Python3 are: text with text, bytes with bytes, and
**ascii-only** native strings with text. That last one is safe because native
strings are text on Python3 and ascii-only byte strings are safe to combine
with text on Python2.
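The rules above can be checked with a small experiment. This is only a sketch:
``try_combine`` is a hypothetical helper that reports the exception type
instead of letting it propagate, and the sample values are made up.

```python
def try_combine(left, right):
    """Return left + right, or the name of the exception the attempt raised."""
    try:
        return left + right
    except (TypeError, UnicodeError) as exc:
        return type(exc).__name__


text = u'caf\xe9'
b_ascii = b'abc'
b_utf8 = b'caf\xc3\xa9'
native_ascii = 'abc'  # ascii-only native string literal

# Safe on both Python2 and Python3.
text_with_text = try_combine(text, u'!')
bytes_with_bytes = try_combine(b_ascii, b_utf8)
native_with_text = try_combine(native_ascii, text)

# Unsafe: non-ascii bytes with text.
# Python2 attempts an implicit ascii decode and raises UnicodeDecodeError.
# Python3 refuses to mix the two types and raises TypeError.
bytes_with_text = try_combine(text, b_utf8)
```

Note that ``try_combine(text, b_ascii)`` succeeds on Python2 but still fails on
Python3, which is why ascii-only bytes are not in the portable safe subset.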
.. [1] "Combined with" includes ``str.join()``, %-formatted strings, and
   concatenation with ``+``. ``str.format()`` can also be used, but you need
   to understand how it behaves to use it safely. The other methods always
   convert the byte string to a text string using the ascii encoding.
   ``str.format()`` converts its arguments to the type of the string that it
   is a method of.
.. seealso:: https://anonbadger.wordpress.com/2016/01/05/python2-string-format-and-unicode/
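One way to sidestep the ``str.format()`` subtlety is to make the template and
every argument text before formatting. A minimal sketch: ``format_text`` is a
hypothetical helper, and decoding byte strings as utf-8 is an assumption.

```python
def format_text(template, *args):
    """Format a template after decoding any byte-string pieces to text."""
    def as_text(value):
        # Byte strings (including Python2 native strings) are decoded
        # explicitly instead of relying on an implicit ascii conversion.
        if isinstance(value, bytes):
            return value.decode('utf-8')
        return value

    return as_text(template).format(*[as_text(arg) for arg in args])


message = format_text('Path: {0}', b'/tmp/caf\xc3\xa9')
```

Because the template is text by the time ``format()`` is called, the result is
text on both Python2 and Python3.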
So, some examples:
This is safe to do::

    # pathname is a text string, per guideline (1) above.
    filenames = ('/path/one', '/path/two')
    if pathname in filenames:
        print('We are inside a recognized directory')

Following our coding guidelines (bullet point 1 in our list above), pathname
contains a text string. On Python2, the values in filenames are safely
converted to text strings for the comparison because they only contain ascii
characters. On Python3, the values in filenames are already text strings, so
the comparison needs no conversion and is safe.
This is unsafe to do::

    import os

    filenames = os.listdir('.')
    if u'one' in filenames:
        print('Directory contains a recognized file')

In this example, filenames is getting native strings from an outside API.
We can't control whether there are non-ascii characters in the filenames there.
So when we check to see if u'one' is one of the filenames, we are in danger of
a UnicodeError on Python2. That's because the filenames on Python2 are byte
strings. So, in the comparison, Python2 will attempt to convert each one into
a text string to match u'one'. In doing so, it will use the ascii encoding.
A non-ascii filename fails to decode in this case: depending on the operation,
Python2 raises a UnicodeError or emits a UnicodeWarning and treats the two
values as unequal.
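A safer variant converts the filenames to text at the border before comparing.
This is a sketch under assumptions: ``to_text`` and ``directory_contains`` are
hypothetical names, filenames are assumed to be utf-8, and undecodable bytes
are replaced rather than raising.

```python
import os
import shutil
import tempfile


def to_text(value, encoding='utf-8', errors='replace'):
    """Decode byte strings to text; pass text strings through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


def directory_contains(directory, name):
    """Return True if the text string name is an entry in directory."""
    # os.listdir() hands back native strings here; convert them right away.
    filenames = [to_text(entry) for entry in os.listdir(directory)]
    return name in filenames


# Small demonstration in a throwaway directory.
tmpdir = tempfile.mkdtemp()
try:
    open(os.path.join(tmpdir, 'one'), 'w').close()
    found = directory_contains(tmpdir, u'one')
    missing = directory_contains(tmpdir, u'two')
finally:
    shutil.rmtree(tmpdir)
```

With ``errors='replace'`` a filename that is not valid utf-8 never matches, but
it also never triggers an implicit ascii decode inside the comparison.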
So, similar to how we use a ``b_`` prefix when we want a variable to hold a byte
string, a variable which holds native strings needs to be prefixed with ``n_``
when we can't rule out the variable holding non-ascii characters. In practice,
the easiest rule to follow is: if you're setting the variable to a string
literal which only contains ascii characters, you are safe. If you set the
variable to a string literal with non-ascii characters *or* you set the
variable to a native string from a function call, then the variable should be
prefixed with ``n_`` to warn that you have to think about the corner cases
when combining it with other non-native variables.
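The naming convention might look like this in practice. The variable names
and the choice of ``os.getcwd()`` as the native-string source are illustrative
assumptions, as is decoding with utf-8.

```python
import os

# ascii-only string literal: safe to leave unprefixed.
prefix = 'Working in: '

# Native string from a function call: bytes on Python2, text on Python3.
# The n_ prefix warns that it may hold non-ascii characters.
n_cwd = os.getcwd()

# Byte string: always carries the b_ prefix.
b_marker = b'\xe2\x9c\x93'

# Convert prefixed values to text before combining them with anything else.
cwd = n_cwd.decode('utf-8', 'replace') if isinstance(n_cwd, bytes) else n_cwd
marker = b_marker.decode('utf-8')

message = prefix + cwd + u' ' + marker
```

The unprefixed names (``prefix``, ``cwd``, ``marker``, ``message``) can be
combined freely; only the ``n_`` and ``b_`` values need a conversion first.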