See also Proposal #1
- Introduce typing.Text for text data in Python 2+3
- bytes, str, unicode, typing.Text in type hints mean whatever they mean at runtime for Python 2 or 3
- Allow str -> unicode and unicode -> str promotions for Python 2
- Type checking for Python 2 and Python 3 actually finds most text/binary errors
- A few false negatives for Python 2 are not worth special handling besides possible ad-hoc handling of non-ASCII literal conversions
If you want your code to be Python 2+3 compatible:
- Write text/binary type hints in 2+3 compatible comments
- Use typing.Text for text data, bytes for binary data
- Use str only for rare cases of "native strings"
- Don't use unicode since it's absent in Python 3
- Run a type checker for both Python 2 and Python 3
The semantics of the types bytes, str, unicode, typing.Text and the type
checking rules for them should match the runtime behavior of these types in
Python 2 and Python 3, depending on whether the checker runs in Python 2 or
Python 3 mode. Using the runtime semantics for the types is easy to understand,
and it still catches most errors. The Python 2+3 compatibility mode is just the
sum of Python 2 and Python 3 warnings.
Type checkers should promote str/bytes to unicode/Text and unicode/Text to
str/bytes for Python 2. Most text/binary conversion errors can be found by
running a type checker for Python 2 and for Python 3.
The typing.Text type is a Python 2+3 compatible type for text data. It's
defined as follows:

    import sys

    if sys.version_info < (3,):
        Text = unicode
    else:
        Text = str
For a Python 2+3 compatible type for binary data, use bytes, which is available
in both Python 2 and 3.
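As a minimal sketch of how these two types might appear in 2+3 compatible comment-style hints: the function names encode_name and decode_name and the choice of UTF-8 are illustrative assumptions, not part of the proposal.

```python
from typing import Text


def encode_name(name):
    # type: (Text) -> bytes
    """Convert text to binary data with an explicit encoding."""
    return name.encode('utf-8')


def decode_name(data):
    # type: (bytes) -> Text
    """Convert binary data back to text with an explicit decoding."""
    return data.decode('utf-8')


# Explicit conversions round-trip for non-ASCII text on both Python versions.
round_trip = decode_name(encode_name(u'привет'))
```

Because every conversion here is explicit, neither the Python 2 nor the Python 3 rules described in this proposal should produce a warning for these functions.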
In Python 2 text data is implicitly converted to binary data and vice versa
using the ASCII encoding. A UnicodeEncodeError or a UnicodeDecodeError is raised
only if the data isn't ASCII-compatible. As a result, many programs aren't
well-tested regarding non-ASCII data handling.
In Python 3 implicitly converting text data to binary data always raises a TypeError.
A type checker run in the Python 3 mode will find most of Python 2 implicit conversion errors.
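The difference in runtime behavior can be observed with a small experiment. This is only a sketch: the helper name try_concat is hypothetical, and concatenation stands in for any operation that mixes text and binary data.

```python
def try_concat(text, data):
    """Return the name of the exception raised by text + data, or None."""
    try:
        text + data
    except (TypeError, UnicodeError) as exc:
        return type(exc).__name__
    return None


# Python 2: None (the ASCII bytes are decoded implicitly).
# Python 3: 'TypeError'.
ascii_result = try_concat(u'abc', b'def')

# Python 2: 'UnicodeDecodeError' (the byte 0xff is not ASCII).
# Python 3: 'TypeError'.
non_ascii_result = try_concat(u'abc', b'\xff')
```

In Python 2 the failure depends on the data, which is why it tends to slip through tests that only use ASCII input. In Python 3 it depends only on the types.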
In order to be Python 2+3 compatible a program has to pass both Python 2 and Python 3 type checking. In other words, the warnings found in the Python 2+3 compatible mode are a simple sum of Python 2 warnings and Python 3 warnings.
Here is a table of types whose values are compatible at runtime. Columns are the expected types, rows are the actual types:

            | Text  | bytes | str   | unicode
    --------+-------+-------+-------+---------
    Text    | . .   | * F   | * .   | . F
    bytes   | * F   | . .   | . F   | * F
    str     | * .   | . F   | . .   | * F
    unicode | . F   | * F   | * F   | . F

Each cell contains two characters: the result in Python 2 and in Python 3 respectively. Abbreviations:
- "." means the types are compatible
- "F" means the types are not compatible
- "*" means the types are compatible, ignoring implicit ASCII conversions
At runtime in Python 2 str is compatible with unicode and vice versa
(ignoring possible implicit ASCII conversion errors).
Using unicode in Python 3 is always an error since there is no unicode name
in Python 3.
As you can see from the table above, many implicit ASCII conversion errors in a Python 2 program can be found just by running a type checker in the Python 3 mode.
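One way to check this claim is to encode the runtime table as data and query it. The names TYPES, RUNTIME_TABLE and cells_matching below are hypothetical and exist only for this sketch.

```python
TYPES = ('Text', 'bytes', 'str', 'unicode')

# Rows are actual types, columns follow TYPES (expected types).
# Each cell is two characters: the Python 2 result, then the Python 3 result.
RUNTIME_TABLE = {
    'Text':    ('..', '*F', '*.', '.F'),
    'bytes':   ('*F', '..', '.F', '*F'),
    'str':     ('*.', '.F', '..', '*F'),
    'unicode': ('.F', '*F', '*F', '.F'),
}


def cells_matching(table, cell):
    """Return sorted (actual, expected) pairs whose table entry equals cell."""
    return sorted(
        (actual, expected)
        for actual, row in table.items()
        for expected, value in zip(TYPES, row)
        if value == cell
    )


# Implicit ASCII conversions in Python 2 that Python 3 mode rejects.
caught_by_python3 = cells_matching(RUNTIME_TABLE, '*F')
# Implicit ASCII conversions in Python 2 that Python 3 mode accepts.
missed_by_python3 = cells_matching(RUNTIME_TABLE, '*.')
```

Under this encoding six of the eight cells marked "*" are rejected in Python 3 mode, and the two remaining cells are the conversions between Text and str.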
The only problematic conversions that may result in errors are Text to str
and vice versa in Python 2.
Example 1. Text to str

    from typing import Any

    def foo(obj, x):
        # type: (Any, str) -> Any
        return getattr(obj, x)

    foo(..., u'привет')  # False negative: no warning, but fails for non-ASCII in Python 2

Example 2. str to Text

    from typing import Any, Text

    def foo(x):
        # type: (Text) -> Any
        return u'Привет, ' + x

    foo('Мир')  # False negative: no warning, but fails for non-ASCII in Python 2

For non-ASCII text literals passed to functions that expect Text or str in
Python 2 a type checker can analyze the contents of the literal and show
additional warnings based on this information. For non-ASCII data coming from
sources other than literals this check would be more complicated.
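A rough sketch of the literal analysis is shown here. It assumes Python 3.8+ and the standard ast module, and it is not how any existing type checker implements the check. The function name non_ascii_string_literals is hypothetical, and the sketch only locates non-ASCII text literals; it does not check which parameter type they are passed to.

```python
import ast


def non_ascii_string_literals(source):
    """Return sorted (line number, value) pairs for non-ASCII text literals."""
    found = []
    for node in ast.walk(ast.parse(source)):
        # ast.Constant covers string literals in Python 3.8+.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if any(ord(ch) > 127 for ch in node.value):
                found.append((node.lineno, node.value))
    return sorted(found)


sample = "foo(obj, u'привет')\nfoo(obj, u'upper')\n"
flagged = non_ascii_string_literals(sample)
```

A real check would combine this with the inferred parameter types, so that only literals crossing a Text/str boundary in Python 2 mode are reported.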
To summarize, with this type compatibility table in place, a type checker run for both Python 2 and Python 3 is able to find almost all errors related to text and binary data except for a few text to "native string" conversions and vice versa in Python 2.
Mypy promotes str to unicode for Python 2, but it doesn't promote unicode to
str. Here is an example of a Python 2 program that is correct given the runtime
type compatibility semantics shown in the table above, but is incorrect for
Mypy:

    from typing import Any

    def foo(obj, x):
        # type: (Any, str) -> Any
        return getattr(obj, x)

    foo({}, u'keys')  # False positive: Mypy warns, but ASCII text works in Python 2

Here is the type compatibility table for the current version of Mypy:

            | Text  | bytes | str   | unicode
    --------+-------+-------+-------+---------
    Text    | . .   | F F   | F .   | . F
    bytes   | * F   | . .   | . F   | * F
    str     | * .   | . F   | . .   | * F
    unicode | . F   | F F   | F F   | . F

Running the Mypy type checker in Python 2 mode and Python 3 mode for the same
program would find almost all implicit ASCII conversion errors except for str
to Text conversions.
To summarize, the current Mypy type compatibility table covers almost all text and binary data handling errors when used for both Python 2 and Python 3. But it doesn't notice errors in "native string" to text conversions in Python 2 and produces false warnings for text to "native string" conversions in Python 2.
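The difference between the two tables can be made explicit by comparing them cell by cell. This is a sketch only: the names RUNTIME_TABLE, MYPY_TABLE and table_differences are hypothetical, and the data is copied from the two tables in this document.

```python
TYPES = ('Text', 'bytes', 'str', 'unicode')

# Rows are actual types, columns follow TYPES (expected types).
# Each cell is two characters: the Python 2 result, then the Python 3 result.
RUNTIME_TABLE = {
    'Text':    ('..', '*F', '*.', '.F'),
    'bytes':   ('*F', '..', '.F', '*F'),
    'str':     ('*.', '.F', '..', '*F'),
    'unicode': ('.F', '*F', '*F', '.F'),
}
MYPY_TABLE = {
    'Text':    ('..', 'FF', 'F.', '.F'),
    'bytes':   ('*F', '..', '.F', '*F'),
    'str':     ('*.', '.F', '..', '*F'),
    'unicode': ('.F', 'FF', 'FF', '.F'),
}


def table_differences(first, second):
    """Return sorted (actual, expected, first cell, second cell) tuples
    for every cell where the two tables disagree."""
    return sorted(
        (actual, expected, a, b)
        for actual in first
        for expected, a, b in zip(TYPES, first[actual], second[actual])
        if a != b
    )


differences = table_differences(RUNTIME_TABLE, MYPY_TABLE)
```

With the data above, the tables disagree in four cells, all with Text or unicode as the actual type and bytes or str as the expected type. In each of them only the Python 2 result changes, from "*" to "F", which corresponds to the missing unicode to str promotion.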
Can you summarize the definition of typing.Text in the TL;DR section?