gustavla/numpy-format-proposal-v1.md

Supporting __format__

Reference: Format Specification Mini-Language.
Overview

Quick demo of desired behavior and motivation:
>>> x = np.array([100.002, 1.2])
>>> print('{:8.2f}'.format(x))
[  100.00     1.20]

Currently, the user has to do np.set_printoptions to adjust these options,
with this particular format not immediately supported without a custom
formatter:
>>> np.set_printoptions(formatter={'all': lambda x: format(x, '8.2f')})
>>> print(x)
[  100.00     1.20]

The user also has to revert changes with another set_printoptions call.
Personally, I would find it very useful to be able to quickly change the
printing behavior, which this would allow. Here is an example that I often deal
with:
[[  9.99640122e-01   2.17222256e-03  -1.24228450e-02  -5.61696924e-03
    7.06837666e-04]
 [ -2.48842046e-03   1.02228184e+00   7.16419216e-03  -1.33046711e-02
   -2.40332264e-03]
 [ -9.66609653e-03  -2.78424918e-02   9.99769143e-01  -1.34152343e-03
    2.11504829e-03]
 [ -7.08603279e-03  -7.57130700e-03  -1.64586641e-02   9.87134490e-01
   -2.06328584e-03]
 [  2.64709841e-02   1.75090350e-02  -1.06997156e-02   1.09245948e-02
    9.93078570e-01]]

After tweaking print options it can become much clearer what I'm working with:
[[ 1.  0. -0. -0.  0.]
 [-0.  1.  0. -0. -0.]
 [-0. -0.  1. -0.  0.]
 [-0. -0. -0.  1. -0.]
 [ 0.  0. -0.  0.  1.]]

Note, this is particularly relevant now that we have f-strings in Python 3.6,
e.g. f'{x:.1g}'.
Guiding principles

My suggestion follows a few main principles:


Map: To the extent possible, it should behave as if the format spec was
applied element-wise. That is, '{:6.2f}'.format(x) should produce a
numpy-style array grid with the first element formatted roughly according to
'{:6.2f}'.format(x[0]), and so forth. There will be times this should not be
strictly true to accommodate established numpy defaults (in this particular
case, it should behave as '{: 6.2f}'.format(x[0]), since numpy by default
leaves space for negative signs).
This principle also informs us of what data types should support which
format types and what errors should be raised if not. For instance, if you
use '{:.2f}' with np.int64(1024), it should print 1024.00 without
complaining. However, if you try to use '{:4d}' with np.float64, it
should raise a ValueError.
The implied format types (e.g. '{:.2}') should be appropriate for the type.
Note, in Python, the implied types are:
string -> 's'
integer -> 'd'
float -> unique behavior (similar to 'g', but not quite)

I will describe below what I think the implied types should be for numpy
types.
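The built-in behavior that the map principle borrows from can be checked directly. This is only a sketch: plain Python scalars stand in for numpy elements here (my assumption for illustration), since the proposed numpy behavior does not exist yet.

```python
# Built-in formatting rules that the map principle would carry over.
# Plain Python scalars stand in for numpy elements (an assumption).

# 'f' accepts integers and formats them as floats.
int_as_float = format(1024, '.2f')

# 'd' rejects floats with a ValueError.
try:
    format(1.5, '4d')
    float_as_int_error = None
except ValueError as exc:
    float_as_int_error = exc

# Implied types when the type character is omitted.
implied_str = format('abcdef', '.2')      # behaves like 's': slicing
implied_int = format(42, '5')             # behaves like 'd': padding
implied_float = format(1234.5678, '.3')   # similar to 'g', but not identical

print(int_as_float, repr(implied_str), repr(implied_int), implied_float)
```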


Align: A strict interpretation of the map principle would suggest that
a very simple solution like this would do:
def __format__(self, fmt):
    formatter = {'all': lambda x: format(x, fmt)}
    return np.array2string(self, formatter=formatter)

This actually works in many cases, but decimal alignment is often incorrect.
For correct alignment, sometimes all elements must be considered, which is
why numpy has a set of custom formatters that are passed all the data first:
>>> x = np.array([100.002, 1.2])
>>> formatter = numpy.core.arrayprint.FloatFormat(x, 6, True)
>>> formatter(x[0])
' 100.002'
>>> formatter(x[1])
'   1.2  '

The strings are aligned at the decimal point. I think a reasonable
implementation is to parse fmt in __format__ and set up the appropriate
already existing numpy formatters. Some completely new formatters will need
to be written in order to support all options available to Python's
built-in types (e.g. '{:x}' for hexadecimal), and some lesser-used
features might require extending existing formatters (e.g. alignments and
fill), if they are deemed important enough to support.
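Parsing fmt could look roughly like the sketch below. It follows the grammar of the standard format-spec mini-language; the helper name parse_format_spec and the field names are hypothetical, and nothing here is existing numpy API.

```python
import re

# Grammar of the standard mini-language:
# [[fill]align][sign][#][0][width][grouping][.precision][type]
_SPEC_RE = re.compile(
    r'(?:(?P<fill>.)?(?P<align>[<>=^]))?'
    r'(?P<sign>[-+ ])?'
    r'(?P<alternate>#)?'
    r'(?P<zero>0)?'
    r'(?P<width>\d+)?'
    r'(?P<grouping>[_,])?'
    r'(?:\.(?P<precision>\d+))?'
    r'(?P<type>[bcdeEfFgGnosxX%])?',
    re.DOTALL,
)


def parse_format_spec(spec):
    """Split a format spec into named fields (hypothetical helper).

    Fields that are absent from the spec come back as None.
    """
    match = _SPEC_RE.fullmatch(spec)
    if match is None:
        raise ValueError('Invalid format specifier: {!r}'.format(spec))
    return match.groupdict()


fields = parse_format_spec('.>-.1f')
print(fields['fill'], fields['align'], fields['sign'],
      fields['precision'], fields['type'])
```

An array-level __format__ could then pick a numpy formatter based on the type field and pass the remaining fields along as options.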


Respect current behavior: Empty format specs ('{}' and '{:}')
should work as they already do, and adding additional specs should gently
modify this behavior.


2/3 Consistency: Python 2 and 3 have different formatting behaviors (examples
below), so one big question is: should numpy adopt the local dialect, or
should it always be consistent between 2 and 3? I vote for
consistency (as much as possible), even though this would break what some
might consider a consistent and reasonable formatting behavior in Python
2.7. I think this is one of the most important points to discuss. I am also not
a Python 2 user, so I do not have terribly strong opinions about what happens there.


No extensions (in this proposal): It would be possible to extend the formatting
language when it makes sense for numpy. For instance, one could dream up
something like '{:[+][-]B}'.format(np.array([True, False, True])) resulting in
'[+ - +]'. However, I defer all such considerations and focus on what
Python users intuitively expect to work.


Python 2.7

In 2.7, all numpy arrays can currently be formatted with format type 's' (or
implicitly with ''), where a precision means string slicing and a length
adds padding:
'{:.5}'.format(x) == str(x)[:5]             # slice
'{:25}'.format(x) == str(x).ljust(25)       # pad

This is followed by Python lists/tuples too, so it is consistent and
correct behavior for Python 2.7. However, in numpy it can cause subtle
confusion not quite possible with lists/tuples (even though array scalars are
rare):
>>> '{:.3}'.format(1234.5678)
'1.23e+03'
>>> '{:.3}'.format(np.array(1234.5678))
'123'

Note, if we break this behavior (which I am in favor of), this functionality is
still readily available (even in Python 3) with more explicit syntax:
>>> '{!s:.5}'.format(x)
>>> '{!s:25}'.format(x)

Brandon Rhodes has some great arguments for breaking with this on the GitHub
issue (#5543), saying that numpy
arrays should not conform to the behavior of lists/tuples, since their very
existence is based on breaking free from this in many ways. I agree with him.
Python 3

Currently Python 3 throws a TypeError (stating __format__ is not implemented)
for all non-empty format specs with a numpy array. This means there are no
conflicts in Python 3 for implementing this.
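The Python 3 situation, and the explicit '!s' route mentioned for slicing and padding, can be reproduced without numpy. In this sketch the class NoFormat is a hypothetical stand-in for a type that does not implement __format__; it is not how ndarray is actually defined.

```python
class NoFormat:
    """Hypothetical stand-in for a type without its own __format__."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return str(self.data)


def try_format(template, obj):
    """Return the formatted string, or the TypeError that was raised."""
    try:
        return template.format(obj)
    except TypeError as exc:
        return exc


x = NoFormat([1, 2, 3])
empty_spec = try_format('{}', x)        # works, falls back to str(x)
direct_spec = try_format('{:.5}', x)    # TypeError in Python 3
sliced = try_format('{!s:.5}', x)       # same as str(x)[:5]
padded = try_format('{!s:12}', x)       # same as str(x).ljust(12)

print(repr(empty_spec), type(direct_spec).__name__, repr(sliced), repr(padded))
```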
Suggested behavior

I walk through my suggested behavior for various numpy arrays.
Array scalars

Numpy array scalars (e.g. np.array(1.23)) should always format identically to
their closest Python scalar equivalents.
Float arrays

Alignments

It does not really hurt to support [<=^], but since numbers are most sensibly
aligned by the decimal point, I think this can be very low priority and raise
ValueError in the initial implementation. The value '>' is default and should
be allowed to be explicitly specified (unless fill is deemed unnecessary),
since it is required if you want to define a fill character. Note, the longest
element decides length, so in numpy it can all of a sudden make sense to define
a fill character without defining a min length, e.g.:
.>-.1f      [....12.0 300212.4 ...123.5]

Note: What I mean by this notation is that if you call
'{:.>-.1f}'.format(x), where x in this case might be
np.array([12, 300212.412, 123.51]), you get the output string on the right
(shown without quotation marks).
As to whether or not fill is important enough to support, I'm on the fence. If
you are using fill, you are probably looking to make a nice-looking table, in
which case you will probably want to change the look of the surrounding square
brackets as well. At this point, you are better off writing a custom loop
instead or perhaps putting it into a pandas DataFrame. If fill is not
supported, I think it should raise ValueError if specified.
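To make the fill example concrete, here is a minimal sketch in which the longest element decides the common width. A plain list stands in for the array, and the function name and bracket layout are my assumptions, chosen to reproduce the example output above.

```python
def format_row_with_fill(values, fill='.', precision=1):
    """Right-align 'f'-formatted cells to the widest one (sketch only).

    Corresponds to a spec such as '.>-.1f': sign '-' means no space is
    reserved for a sign when every value is non-negative.
    """
    cells = [format(v, '.{}f'.format(precision)) for v in values]
    width = max(len(c) for c in cells)  # the longest element decides
    return '[' + ' '.join(c.rjust(width, fill) for c in cells) + ']'


row = format_row_with_fill([12, 300212.412, 123.51])
print(row)
```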
Sign

All signs should be supported ([+- ]), but unlike Python where '-' is the
default, the default for numpy should be ' ' (leave space for negative
signs), since this is already the default behavior in numpy. For '-', if all
values are non-negative, it should not leave any unnecessary leading spaces.
However, if there is at least one negative value, '-' must fall back to
behaving like ' ' to align correctly.
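The sign rules can be sketched as follows. This is an illustration under my own assumptions (plain lists instead of arrays, a hypothetical function name, and a bracket layout inferred from the examples in this proposal), not an implementation of numpy's formatters.

```python
def format_float_row(values, sign='', precision=2):
    """Format floats with 'f', applying the proposed sign rules (sketch)."""
    effective = sign or ' '             # proposed numpy default is ' '
    if effective == '-' and any(v < 0 for v in values):
        effective = ' '                 # '-' falls back to ' ' to align
    spec = '{}.{}f'.format(effective, precision)
    cells = [format(float(v), spec) for v in values]
    width = max(len(c) for c in cells)
    return '[' + ' '.join(c.rjust(width) for c in cells) + ']'


print(format_float_row([1, 45.678]))             # default sign
print(format_float_row([1, 45.678], sign='+'))
print(format_float_row([1, 45.678], sign='-'))   # no leading spaces needed
print(format_float_row([1, -50], sign='-'))      # negative element present
```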
Extras

Extras ([#_,0]) can be optionally supported and result in ValueError until they
are. It is also a question whether or not they should be supported in Python
versions where they are not supported. For instance, '_' was introduced in
3.6, but could technically be back-ported (although it will likely be more work
implementing it).
Types

Some examples:
.2f         [  1.00  45.68]
+.2f        [ +1.00 +45.68]
-.2f        [ 1.00 45.68]
-.2f        [  1.00 -50.00]          # note, the neg element changes behavior
10.2e       [  1.00e+02   1.20e+00]
.2%         [ 123.00% 4567.80%]
.2g         [  1.    45.68]          # 2 indicates decimals (not significant digits)
#.2g        [  1.00  45.68]
.2g         [ 1.00e+05  1.00e+00]    # if max(abs(x))/min(abs(x)) > 1000
.2          [ 1.00e+05  1.00e+00]    # same as .2g

Note, 'f' and '%' are straightforward and well-defined and should correspond to
current numpy as much as possible. The type 'e' should use a consistent number
of exponent digits across elements (and use a minimum of 2):
10.2e       [  1.23e+200   1.00e+000]
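One way to get a consistent number of exponent digits is to format each element with the built-in 'e' and then zero-pad the exponents to the longest one. The sketch below assumes finite values and returns the individual cells rather than a full row; the function name is hypothetical.

```python
def format_exponential(values, precision=2, min_exp_digits=2):
    """Format with 'e' and equalize exponent digits (sketch, finite values)."""
    parts = []
    for v in values:
        mantissa, exponent = format(v, '.{}e'.format(precision)).split('e')
        # exponent looks like '+200' or '-05': sign first, then digits
        parts.append((mantissa, exponent[0], exponent[1:]))
    digits = max(min_exp_digits, max(len(d) for _, _, d in parts))
    return [m + 'e' + s + d.zfill(digits) for m, s, d in parts]


cells = format_exponential([1.23e200, 1.0])
print(cells)
```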

For 'g', we can't stick to built-in semantics, since it decides format based
on the length of the single result. The precision also refers to significant
digits, which is hard to synchronize across multiple elements:
'{:#.5g}'.format(10.0)   == '10.000'
'{:#.5g}'.format(100.0)  == '100.00'
'{:#.5g}'.format(1000.0) == '1000.0'

Instead, I suggest completely different semantics for 'g', closer to the
default numpy behavior. That is, it is equivalent to 'e' when
max(abs(x))/min(abs(x)) > 1000 and otherwise similar to 'f' but without
trailing zeros (unless alternate form is specified with '#'). Note, that the
precision thus refers to decimals and not significant digits. I also suggest
that the implicit type '' should fall back to 'g'. I do not think the
distinction between 'g' and '' for Python floats is relevant to numpy, but
feel free to argue otherwise.
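The suggested 'g' semantics can be sketched like this, using a plain list as a stand-in for the array. The function name, the handling of zeros (ignored when computing the ratio), and the bracket layout are my assumptions for illustration.

```python
def format_proposed_g(values, precision=2, alternate=False):
    """Sketch of the proposed array-level 'g' (not built-in 'g')."""
    magnitudes = [abs(v) for v in values if v != 0]
    wide_range = bool(magnitudes) and max(magnitudes) / min(magnitudes) > 1000
    if wide_range:
        # Equivalent to 'e' when the dynamic range is large.
        cells = [format(float(v), ' .{}e'.format(precision)) for v in values]
    else:
        # '#' guarantees a decimal point, even with precision 0.
        cells = [format(float(v), ' #.{}f'.format(precision)) for v in values]
    width = max(len(c) for c in cells)
    cells = [c.rjust(width) for c in cells]   # decimal points now line up
    if not wide_range and not alternate:
        # Drop trailing zeros but keep the decimal point and the alignment.
        cells = [c.rstrip('0').ljust(width) for c in cells]
    return '[' + ' '.join(cells) + ']'


print(format_proposed_g([1, 45.678]))
print(format_proposed_g([1, 45.678], alternate=True))
print(format_proposed_g([1e5, 1.0]))
```

Note that precision is passed straight to 'f' or 'e', which is what makes it mean decimals rather than significant digits.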
Another difference between Python 'g' and numpy 'g' is that numpy always
shows decimal points for floats, while Python omits a trailing decimal point:
'{:g}'.format(1)   == '1'
'{:.0g}'.format(1) == '1'
'{:.1g}'.format(1) == '1'
'{:.0f}'.format(1) == '1'
'{:.1f}'.format(1) == '1.0'

To be consistent with current numpy behavior, it should only remove trailing
zeros and not a trailing decimal point:
g           [ 1.  0.]
.0g         [ 1.  0.]
.1g         [ 1.  0.]
.0f         [ 1.  0.]
.1f         [ 1.0  0.0]
#.1g        [ 1.0  0.0]

This would mean it is impossible to get it to omit the decimal point. Opinions?
The types 'F', 'E', 'G' should work as expected. I defer the
locale-sensitive 'n' for now.
Complex arrays

Complex types should behave similarly to floats and precision should be applied
to both Re and Im. Note that alignment applies to both decimal points:
>>> print(np.array([[10+1.2345j, 300j]]).T)
[[ 10.  +1.2345j]
 [  0.+300.j    ]]

However, + and j are not aligned and instead are always tightly coupled with
Im on either side. Note also that space requirements are decided individually
for Re and Im (2 and 3, respectively, in this example). Same as for floats, I
suggest keeping this default behavior for 'g' and '', while 'f' should align j but
not necessarily +:
.3f         [[ 10.000  +1.235j]
             [  0.000+300.000j]]
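A minimal sketch of the 'f' layout for complex values: Re and Im get their widths independently, the sign stays attached to Im, and j ends up aligned because every Im cell has the same width. Python complex numbers in a list stand in for the array, and the function name is hypothetical.

```python
def format_complex_column(values, precision=3):
    """Format complex values with 'f', one string per row (sketch)."""
    re_cells = [format(z.real, ' .{}f'.format(precision)) for z in values]
    im_cells = [format(z.imag, '+.{}f'.format(precision)) for z in values]
    re_width = max(len(c) for c in re_cells)   # decided by Re alone
    im_width = max(len(c) for c in im_cells)   # decided by Im alone
    return [r.rjust(re_width) + i.rjust(im_width) + 'j'
            for r, i in zip(re_cells, im_cells)]


rows = format_complex_column([10 + 1.5j, 300j])
for row in rows:
    print('[' + row + ']')
```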

Zero-padding is not allowed for Python complex values, so numpy should probably
throw the same error:
020f        ValueError

If we choose to support fill, it will have some visible gaps:
.> 20.3     [[....... 10.  +1.235j]
             [........ 0.+300.j   ]]

One gap appears because of ' ' (which could have been implicit here), leaving
a leading space for a negative (this is consistent with Python behavior).
Another gap appears between Re and Im on the first row, and a third after the
j on the second row. This is probably not a very common use case, so if it
causes implementation headaches, it might be worth shelving.
Integer arrays

Alignments

Again, low priority. Raise ValueError until supported.
Sign

All signs should be supported ([+- ]) and '-' should be default (consistent
with Python and numpy), although '-' should still fall back to ' ' if at least
one visible negative element exists.
Extras

Similar situation as for floats.
Types

Most of these are straightforward, but one important decision is whether or not
prefixes in the alternate forms '#' (0b/0x/0o) should be aligned. Compare:
#b          [  0b10 0b1000]
#b          [0b0010 0b1000]  (prefix aligned)

On a related note, leading zeros are specified with a '0' before the length
specification, since '0>' does not handle prefixes correctly:
#010b       [0b00001100 0b00001101]
0>#10b      [00000b1100 00000b1101]

In Python, the use of the leading zero means length has to be specified, since
implicit length is always tight. However, in numpy, since length is dictated by
multiple elements, it would be useful to be able to specify leading zeros
without specifying length. Without breaking from Python standards, I think the
correct way to do this would be to specify a redundant 0 minimum length
(resulting in two consecutive zeros):
+b          [   +12 -10000]
+0b         [   +12 -10000]   # min length 0 (this is a no-op)
+00b        [+00012 -10000]   # leading 0's and min length 0
0>+b        [000+12 -10000]   # leading 0's, but incorrect w.r.t. signs and prefixes

This presents an argument for why prefixes should not by default be aligned,
since there is an option to align them, while the standard formatting language
does not offer an intuitive way to un-align them:
Option 1: Do not align prefixes by default
#b          [  0b10 0b1000]
#00b        [0b0010 0b1000]

Option 2: Align prefixes by default
???         [  0b10 0b1000]
#b          [0b0010 0b1000]

The '???' would have to be non-standard, which is why I argue for Option 1.
If un-aligned prefixes are not at all important, then of course Option 2 is
better. However, I think aligned prefixes can look dense and it makes it harder
to spot small values. Also, since the complex j did not need to align,
perhaps base prefixes do not need to either.
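Option 1 can be sketched on top of the built-in zero flag, which already places zeros after the prefix (unlike '0>' used as a fill). A plain list of ints stands in for the array; the function name and the zero_fill parameter (playing the role of '#00b') are hypothetical.

```python
def format_binary_row(values, zero_fill=False):
    """Format ints as prefixed binary (sketch of Option 1).

    zero_fill=False corresponds to '#b' (prefixes not aligned),
    zero_fill=True corresponds to '#00b' (leading zeros, prefixes aligned).
    """
    cells = [format(v, '#b') for v in values]
    width = max(len(c) for c in cells)       # the longest element decides
    if zero_fill:
        # The built-in '0' flag pads after the sign and the 0b prefix.
        cells = [format(v, '#0{}b'.format(width)) for v in values]
    else:
        cells = [c.rjust(width) for c in cells]
    return '[' + ' '.join(cells) + ']'


# Built-in behavior for a single value: zero flag versus '0' as fill.
zero_flag = format(12, '#010b')
zero_as_fill = format(12, '0>#10b')

unaligned = format_binary_row([2, 8])
aligned = format_binary_row([2, 8], zero_fill=True)
print(zero_flag, zero_as_fill, unaligned, aligned)
```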
My examples have focused on binary, but all other types should work
analogously:
02x         [00 ff 2f ea]
2x          [  0 fff]        # min lengths can be overruled
#6o         [   0o0   0o77]

The type 'c' is the only tricky one, since it raises some questions of how to
deal with unicode and alignment. I think many of these issues are shared with
strings (which I will get to). One solution is simply to print them as if they
were strings of length 1, presented without alignment:
c           ['a' '\n' '\x00' 'È' '中']

The input to this is np.array([97, 10, 0, 200, 20013]).
In Python 2.7, things get more complicated since the target string can be both
bytes and unicode. The behavior of 'c' is also modulo 256, with values in [128,
256) after the modulo failing when the formatted string is unicode (and
\x??-encoded otherwise).
It might make sense to continue this modulo-256 semantics in numpy under Python
2.7, so that it is consistent with built-in semantics. However, should
it really fail with a UnicodeDecodeError for half the values? I suggest both
byte and unicode strings in Python 2.7 use pure ascii and \x??-encodings where
necessary. For instance, the above example would be:
c           ['a' '\n' '\x00' '\xc8' '-']  # (in 2.7)

However, a case can also be made that the last one should be something like
'\u4e2d'. This would not be a valid Python 2.7 string (since \u requires a
unicode string), although this is not a repr representation, so it is not like
it needs to eval correctly. A final thought: Are the quotation marks necessary?
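The Python 3 side of the 'c' example can be reproduced with built-ins, using repr for each element as suggested. The modulo step at the end only mimics the Python 2.7 arithmetic described above; it is not Python 2.7 code, and the row layout is my assumption.

```python
codes = [97, 10, 0, 200, 20013]

# Python 3: 'c' turns each integer into the character with that code point;
# repr adds the quotation marks and escape sequences.
cells = [repr(format(code, 'c')) for code in codes]
row = '[' + ' '.join(cells) + ']'
print(row)

# Arithmetic behind the Python 2.7 example: values are taken modulo 256.
last_mod_256 = codes[-1] % 256
last_char = chr(last_mod_256)
print(last_mod_256, repr(last_char))
```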
Boolean arrays

Alignments

Low priority.
Sign

N/A. Raise ValueError.
Extras

N/A. Raise ValueError.
Types

Booleans in Python are implicitly treated as numbers for formatting
purposes, and the default type is 'd' whenever the format spec is non-empty:
04          0001
04d         0001
04.1f       01.0
s           ValueError

In numpy, I think the default should be identical to the current printing
default, which is to write out True/False. The user can specify extra padding
if they want, although this is not an important feature:
8           [    True    False]
08          ValueError

Note that the format type here is '', and there is no equivalent explicit
counterpart in Python formatting. As I mentioned earlier, it would be possible
to make such an extension, such as introducing the type 'B' to explicitly mean
boolean. However, I do not propose anything like that here. The behavior is thus:
            [ True False]
8           [    True    False]
8d          [       1        0]
f           [1.00000000 0.00000000]   # depending on default precision (see at end)
g           [1. 0.]
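A sketch of the boolean behavior for the empty type and for 'd', with a plain list standing in for the array. The function name and parameters are hypothetical, and only these two types are covered.

```python
def format_bool_row(values, width=0, type_char=''):
    """Format booleans as True/False ('' type) or as 1/0 ('d') (sketch)."""
    if type_char == '':
        cells = [str(bool(v)) for v in values]
    elif type_char == 'd':
        cells = [format(int(v), 'd') for v in values]
    else:
        raise ValueError('type not covered by this sketch: {!r}'.format(type_char))
    # The requested width is a minimum; the longest element can overrule it.
    width = max(width, max(len(c) for c in cells))
    return '[' + ' '.join(c.rjust(width) for c in cells) + ']'


print(format_bool_row([True, False]))
print(format_bool_row([True, False], width=8))
print(format_bool_row([True, False], width=8, type_char='d'))
```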

String arrays

Strings in numpy use repr-based formatting for elements, with quotation marks
and escape sequences. You cannot correctly align repr strings (for instance
'\x00' is one character but displayed using 4), so numpy uses no alignment for
strings. It is hard to extend this, so I suggest adding minimal functionality
here.
Alignments

N/A. Raise ValueError.
Sign

N/A. Raise ValueError.
Extras

N/A. Raise ValueError.
Types

The only well-defined and useful feature that I can think of is string slicing:
.2s         ['A' 'Fo']   # input is np.array(['A', 'Foo Bar'])

It would be displayed exactly as if each element had been sliced before
being printed. This might be useful for numpy arrays with extremely long strings.
Optionally, cut-off strings could be indicated with some form of ellipsis (for
instance ['A' 'Fo'+]).
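String slicing is simple enough to sketch in a couple of lines. A list of str stands in for the string array, the function name is hypothetical, and the optional ellipsis marker is left out.

```python
def format_string_row(values, precision):
    """Slice each string to `precision` characters, then repr it (sketch)."""
    return '[' + ' '.join(repr(s[:precision]) for s in values) + ']'


row = format_string_row(['A', 'Foo Bar'], 2)   # corresponds to '.2s'
print(row)
```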
Object arrays

Disallow any non-empty format specs with a ValueError. Empty format spec should
continue to work as it already does.
Other types

I wrote up some thoughts on np.datetime64 and np.timedelta64, but I'll skip
that for now in the interest of not making this too long.
Settings and defaults

Python defaults to precision 6:
f           1.234568

While numpy defaults to precision 8:
f           [ 1.23456789]

Unless explicitly specified, numpy should use the set_printoptions defaults
when appropriate. Here is a list of options that I think __format__ should use:
precision       yes, but only if precision is omitted in the format spec
threshold       yes
edgeitems       yes
linewidth       yes
suppress        yes for 'g' with floats
nanstr          yes
infstr          yes
formatter       no

If a formatter is set, it should be used for empty format specs. However, if
any format spec is defined (even just '{:g}'), it should ignore any formatter
that may be configured.
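The precision rule (use the configured default only when the spec omits a precision) could be sketched like this for float specs. The helper name is hypothetical, the default of 8 is the numpy value quoted above, and the regex is a simplification that only handles float-type specs.

```python
import re

# head = everything before the precision; type = optional float type char.
_FLOAT_SPEC = re.compile(
    r'(?P<head>.*?)(?P<precision>\.\d+)?(?P<type>[eEfFgG%]?)',
    re.DOTALL,
)


def with_default_precision(spec, default_precision=8):
    """Insert a default precision if the float spec has none (sketch)."""
    match = _FLOAT_SPEC.fullmatch(spec)
    if match.group('precision'):
        return spec                      # explicit precision wins
    return '{}.{}{}'.format(match.group('head'), default_precision,
                            match.group('type'))


python_default = format(1.23456789, 'f')
numpy_like = format(1.23456789, with_default_precision('f'))
print(python_default, numpy_like)
```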