vlasovskikh/text-binary-hints-2-3.md Secret

## text-binary-hints-2-3.md

      
Proposal #2

See also Proposal #1

TL;DR

- Introduce typing.Text for text data in Python 2+3
- bytes, str, unicode, and typing.Text in type hints mean whatever they
  mean at runtime in Python 2 or 3
- Allow str -> unicode and unicode -> str promotions for Python 2
- Type checking for both Python 2 and Python 3 finds most text/binary
  errors
- The few remaining false negatives in Python 2 are not worth special
  handling, apart from possible ad-hoc checks of non-ASCII literal conversions

Summary for Python users

If you want your code to be Python 2+3 compatible:

- Write text/binary type hints in 2+3 compatible comments
  - Use typing.Text for text data, bytes for binary data
  - Use str only for rare cases of "native strings"
  - Don't use unicode since it's absent in Python 3
- Run a type checker for both Python 2 and Python 3

Summary for authors of type checkers

The semantics of the types bytes, str, unicode, and typing.Text, and the type
checking rules for them, should match the runtime behavior of these types in
Python 2 or Python 3, depending on the mode the checker runs in. Runtime
semantics are easy to understand and still catch most errors. The Python 2+3
compatibility mode is just the sum of the Python 2 and Python 3 warnings.
Type checkers should promote str/bytes to unicode/Text and
unicode/Text to str/bytes for Python 2. Most text/binary conversion
errors can be found by running a type checker for both Python 2 and Python 3.

typing.Text: Python 2+3 compatible type for text data

The typing.Text type is a Python 2+3 compatible type for text data. It's
defined as follows:
import sys

if sys.version_info < (3,):
    Text = unicode
else:
    Text = str

For a Python 2+3 compatible type for binary data, use bytes, which is available
in both 2 and 3.
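
As a minimal sketch of how these two types might appear in 2+3 compatible comment-style hints, consider the helpers below. The names `to_bytes` and `to_text` are hypothetical and not part of any library. They convert between text and binary data explicitly, so no implicit ASCII conversion is involved in either Python version.

```python
from typing import Text


def to_bytes(value, encoding='utf-8'):
    # type: (Text, str) -> bytes
    """Explicitly encode text data into binary data."""
    return value.encode(encoding)


def to_text(data, encoding='utf-8'):
    # type: (bytes, str) -> Text
    """Explicitly decode binary data into text data."""
    return data.decode(encoding)


# Escapes are used instead of a non-ASCII literal so that the file needs
# no source encoding declaration in Python 2.
original = u'\u043f\u0440\u0438\u0432\u0435\u0442'
round_tripped = to_text(to_bytes(original))
```

The `encoding` parameter is annotated as str because it is a "native string" in both versions.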

Implicit text/binary conversions

In Python 2 text data is implicitly converted to binary data and vice versa
using the ASCII encoding. A UnicodeEncodeError or a UnicodeDecodeError is
raised only if the data isn't ASCII-compatible. As a result, many programs
aren't well-tested with regard to non-ASCII data handling.
In Python 3 there are no implicit conversions between text and binary data:
operations that mix them, such as concatenation, raise a TypeError.
A type checker run in Python 3 mode will find most Python 2 implicit
conversion errors.
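
The difference can be observed with a short sketch. The function name `mix_text_and_bytes` is hypothetical. It concatenates text and binary data and reports either the result type or the exception that was raised. The Python 2 outcomes noted in the comments follow from the implicit ASCII conversion described in this section.

```python
def mix_text_and_bytes(text, data):
    """Concatenate text and binary data and report the outcome.

    Returns the name of the result type on success, or the name of the
    exception class on failure.
    """
    try:
        result = text + data
    except (TypeError, UnicodeError) as exc:
        return type(exc).__name__
    return type(result).__name__


# Python 2: 'unicode' (the bytes are implicitly decoded as ASCII)
# Python 3: 'TypeError'
ascii_outcome = mix_text_and_bytes(u'abc', b'def')

# Python 2: 'UnicodeDecodeError' (0xd0 is not an ASCII byte)
# Python 3: 'TypeError'
non_ascii_outcome = mix_text_and_bytes(u'abc', b'\xd0\x9c')
```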

Checking for Python 2+3 compatibility

In order to be Python 2+3 compatible a program has to pass both Python 2 and
Python 3 type checking. In other words, the warnings found in the Python 2+3
compatible mode are a simple sum of Python 2 warnings and Python 3 warnings.
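
As a rough illustration, assume each checker run yields its warnings as (line, message) pairs. This representation and the sample messages are hypothetical; real type checkers have their own output formats. The combined 2+3 report could then be sketched as follows:

```python
def combined_warnings(py2_warnings, py3_warnings):
    """Merge warnings from both modes, dropping duplicates, sorted by line."""
    return sorted(set(py2_warnings) | set(py3_warnings))


# Hypothetical warnings from the two runs over the same file.
py2 = [(3, 'unknown name: foo')]
py3 = [(3, 'unknown name: foo'), (7, 'expected bytes, got str')]
report = combined_warnings(py2, py3)
```

A set union is used here so that a warning reported in both modes appears only once. That is one possible reading of "sum", not the only one.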

Runtime type compatibility

Here is a table of types whose values are compatible at runtime. Columns are
the expected types, rows are the actual types:
        | Text  | bytes | str   | unicode
--------+-------+-------+-------+---------
Text    |  . .  |  * F  |  * .  |  . F
bytes   |  * F  |  . .  |  . F  |  * F
str     |  * .  |  . F  |  . .  |  * F
unicode |  . F  |  * F  |  * F  |  . F

Each cell contains two characters: the result in Python 2 and in Python 3
respectively. Abbreviations:

. — types are compatible
F — types are not compatible
* — types are compatible, ignoring implicit ASCII conversions

At runtime in Python 2 str is compatible with unicode and vice versa
(ignoring possible implicit ASCII conversion errors).
Using unicode in Python 3 is always an error since there is no unicode name
in Python 3.
As the table shows, many implicit ASCII conversion errors in a Python 2
program can be found just by running a type checker in Python 3 mode: most
cells marked * for Python 2 are marked F for Python 3.
The only conversions that neither mode flags are Text to str and str to Text:
they are fully compatible in Python 3 and fail in Python 2 only for non-ASCII
data.
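
To make the table easier to query, here is a minimal sketch that encodes it as a dictionary. The names `RUNTIME_TABLE`, `is_compatible`, and `undetected_conversions` are illustrative assumptions. The last function looks for cells that depend on implicit ASCII conversion in Python 2 yet are fully compatible in Python 3, so neither mode would report them.

```python
# Rows are actual types, inner keys are expected types.
# Each cell holds two marks: the first for Python 2, the second for Python 3.
RUNTIME_TABLE = {
    'Text':    {'Text': '..', 'bytes': '*F', 'str': '*.', 'unicode': '.F'},
    'bytes':   {'Text': '*F', 'bytes': '..', 'str': '.F', 'unicode': '*F'},
    'str':     {'Text': '*.', 'bytes': '.F', 'str': '..', 'unicode': '*F'},
    'unicode': {'Text': '.F', 'bytes': '*F', 'str': '*F', 'unicode': '.F'},
}


def is_compatible(actual, expected, major):
    """Return True unless the cell is marked 'F' for the given major version."""
    mark = RUNTIME_TABLE[actual][expected][0 if major == 2 else 1]
    return mark != 'F'


def undetected_conversions():
    """Cells marked '*' for Python 2 and '.' for Python 3."""
    return sorted(
        (actual, expected)
        for actual, row in RUNTIME_TABLE.items()
        for expected, marks in row.items()
        if marks == '*.'
    )
```

With the table as given, `undetected_conversions()` yields only the Text/str pair in both directions.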

Example 1. Text to str

from typing import Any

def foo(obj, x):
    # type: (Any, str) -> Any
    return getattr(obj, x)

foo(..., u'привет')  # Not reported, but fails for non-ASCII in Python 2

Example 2. str to Text

from typing import Any, Text

def foo(x):
    # type: (Text) -> Any
    return u'Привет, ' + x

foo('Мир')  # Not reported, but fails for non-ASCII in Python 2

When a non-ASCII text literal is passed to a function that expects Text or str
in Python 2, a type checker can analyze the contents of the literal and show an
additional warning. For non-ASCII data coming from sources other than literals,
this check would be more complicated.
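
The literal check itself could be as simple as the sketch below. The function name `is_ascii_literal` is hypothetical, and how a type checker would obtain the literal's value from its syntax tree is left out.

```python
def is_ascii_literal(value):
    """Return True if a text or bytes literal contains only ASCII data."""
    if isinstance(value, bytes):
        # bytearray yields integers in both Python 2 and Python 3.
        return all(byte < 128 for byte in bytearray(value))
    return all(ord(char) < 128 for char in value)


safe = is_ascii_literal(u'upper')
unsafe = is_ascii_literal(u'\u043f\u0440\u0438\u0432\u0435\u0442')
```

A checker following this approach would warn only when the function returns False for a literal that crosses the Text/str boundary in Python 2.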
To summarize, with this type compatibility table in place, a type checker run
for both Python 2 and Python 3 is able to find almost all errors related to
text and binary data except for a few text to "native string" conversions and
vice versa in Python 2.

Current Mypy type compatibility (non-runtime semantics)

Mypy promotes str to unicode for Python 2, but it doesn't promote unicode to
str. Here is an example of a Python 2 program that is correct under the
runtime type compatibility semantics shown in the table above, but is rejected
by Mypy:
from typing import Any

def foo(obj, x):
    # type: (Any, str) -> Any
    return getattr(obj, x)

foo('', u'upper')  # False positive warning in Mypy for ASCII in Python 2

Here is the type compatibility table for the current version of Mypy:
        | Text  | bytes | str   | unicode
--------+-------+-------+-------+---------
Text    |  . .  |  F F  |  F .  |  . F
bytes   |  * F  |  . .  |  . F  |  * F
str     |  * .  |  . F  |  . .  |  * F
unicode |  . F  |  F F  |  F F  |  . F

Running the Mypy type checker in Python 2 mode and Python 3 mode for the same
program would find almost all implicit ASCII conversion errors except for str
to Text conversions.
To summarize, the current Mypy type compatibility table covers almost all text
and binary data handling errors when used for both Python 2 and Python 3. But
it doesn't notice errors in "native string" to text conversions in Python 2 and
produces false warnings for text to "native string" conversions in Python 2.
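
The two tables can also be compared mechanically. The sketch below encodes both as lists of marks and collects the cells where they disagree. The names `TYPES`, `RUNTIME`, `MYPY`, and `differing_cells` are illustrative assumptions.

```python
TYPES = ['Text', 'bytes', 'str', 'unicode']

# Rows are actual types; each list follows TYPES as the expected types.
# Each cell holds two marks: the first for Python 2, the second for Python 3.
RUNTIME = {
    'Text':    ['..', '*F', '*.', '.F'],
    'bytes':   ['*F', '..', '.F', '*F'],
    'str':     ['*.', '.F', '..', '*F'],
    'unicode': ['.F', '*F', '*F', '.F'],
}
MYPY = {
    'Text':    ['..', 'FF', 'F.', '.F'],
    'bytes':   ['*F', '..', '.F', '*F'],
    'str':     ['*.', '.F', '..', '*F'],
    'unicode': ['.F', 'FF', 'FF', '.F'],
}


def differing_cells():
    """List (actual, expected, runtime marks, Mypy marks) where tables differ."""
    return [
        (actual, expected, RUNTIME[actual][i], MYPY[actual][i])
        for actual in TYPES
        for i, expected in enumerate(TYPES)
        if RUNTIME[actual][i] != MYPY[actual][i]
    ]
```

All four differing cells have Text or unicode as the actual type and bytes or str as the expected type, and they differ only in the Python 2 mark. This matches the missing unicode to str promotion.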