@gustavla
Last active March 2, 2017 04:52

Supporting __format__

Reference: Format Specification Mini-Language.

Overview

Quick demo of desired behavior and motivation:

>>> x = np.array([100.002, 1.2])
>>> print('{:8.2f}'.format(x))
[  100.00     1.20]

Currently, the user has to do np.set_printoptions to adjust these options, with this particular format not immediately supported without a custom formatter:

>>> np.set_printoptions(formatter={'all': lambda x: format(x, '8.2f')})
>>> print(x)
[  100.00     1.20]

The user also has to revert changes with another set_printoptions call. Personally, I would find it very useful to be able to quickly change the printing behavior, which this would allow. Here is an example that I often deal with:

[[  9.99640122e-01   2.17222256e-03  -1.24228450e-02  -5.61696924e-03
    7.06837666e-04]
 [ -2.48842046e-03   1.02228184e+00   7.16419216e-03  -1.33046711e-02
   -2.40332264e-03]
 [ -9.66609653e-03  -2.78424918e-02   9.99769143e-01  -1.34152343e-03
    2.11504829e-03]
 [ -7.08603279e-03  -7.57130700e-03  -1.64586641e-02   9.87134490e-01
   -2.06328584e-03]
 [  2.64709841e-02   1.75090350e-02  -1.06997156e-02   1.09245948e-02
    9.93078570e-01]]

After tweaking print options it can become much clearer what I'm working with:

[[ 1.  0. -0. -0.  0.]
 [-0.  1.  0. -0. -0.]
 [-0. -0.  1. -0.  0.]
 [-0. -0. -0.  1. -0.]
 [ 0.  0. -0.  0.  1.]]
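
One way to get this kind of view without leaving the global print options changed is to save and restore them around the call. The helper below is a minimal sketch: `format_with_options` is a hypothetical name, and `precision=0` with `suppress=True` is an assumption about which options give the compact output above.

```python
import numpy as np

def format_with_options(arr, **options):
    """Return str(arr) under temporary print options, then restore the old ones."""
    saved = np.get_printoptions()
    try:
        np.set_printoptions(**options)
        return str(arr)
    finally:
        # Restore whatever was configured before, even if formatting fails.
        np.set_printoptions(**saved)

# A near-identity matrix, similar in spirit to the example above.
m = np.eye(5) + 1e-3 * np.arange(25).reshape(5, 5)
text = format_with_options(m, precision=0, suppress=True)
print(text)
```

This is exactly the bookkeeping that a working `'{:.0f}'.format(m)` would make unnecessary.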

Note, this is particularly relevant now that we have f-strings in Python 3.6, e.g. f'{x:.1g}'.

Guiding principles

My suggestion follows a few main principles:

  • Map: To the extent possible, it should behave as if the format spec was applied element-wise. That is, '{:6.2f}'.format(x) should produce a numpy-style array grid with the first element formatted roughly according to '{:6.2f}'.format(x[0]), and so forth. There will be times this should not be strictly true to accommodate established numpy defaults (in this particular case, it should behave as '{: 6.2f}'.format(x[0]), since numpy by default leaves space for negative signs).

    This principle also informs us of which data types should support which format types and what errors should be raised otherwise. For instance, if you use '{:.2f}' with np.int64(1024), it should print 1024.00 without complaining. However, if you try to use '{:4d}' with np.float64, it should raise a ValueError.

    The implied format types (e.g. '{:.2}') should be appropriate for the type. Note, in Python, the implied types are:

    string -> 's'
    integer -> 'd'
    float -> unique behavior (similar to 'g', but not quite)
    

    I will describe below what I think the implied types should be for numpy types.

  • Align: A strict interpretation of the map principle would suggest that a very simple solution like this would do:

    def __format__(self, fmt):
        formatter = {'all': lambda x: format(x, fmt)}
        return np.array2string(self, formatter=formatter)
    

    This actually works in many cases, but decimal alignment is often incorrect. For correct alignment, sometimes all elements must be considered, which is why numpy has a set of custom formatters that are passed all the data first:

    >>> x = np.array([100.002, 1.2])
    >>> formatter = np.core.arrayprint.FloatFormat(x, 6, True)
    >>> formatter(x[0])
    ' 100.002'
    >>> formatter(x[1])
    '   1.2  '
    

    The strings are aligned at the decimal point. I think a reasonable implementation is to parse fmt in __format__ and set up the appropriate existing numpy formatters. Some completely new formatters will need to be written in order to support all options available to Python's built-in types (e.g. '{:x}' for hexadecimal), and some lesser-used features might require extending existing formatters (e.g. alignments and fill), if they are deemed important enough to support.

  • Respect current behavior: Empty format specs ('{}' and '{:}') should work as they already do, and adding additional specs should gently modify this behavior.

  • 2/3 Consistency: Python 2 and 3 have different formatting behaviors (examples below), so one big question is: Should numpy try to adopt the local dialect, or should numpy always be consistent between 2 and 3? I vote for consistency (as much as possible), even though this would break what some might consider a consistent and reasonable formatting behavior in Python 2.7. I think this is one of the most important points to discuss. I am also not a Python 2 user, so I do not have terribly strong opinions about what happens there.

  • No extensions (in this proposal): It would be possible to extend the formatting language when it makes sense for numpy. For instance, one could dream up something like '{:[+][-]B}'.format(np.array([True, False, True])) resulting in '[+ - +]'. However, I defer all such considerations and focus on what Python users expect will work intuitively.
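
To make the "parse fmt, then map and align" idea concrete, here is a minimal pure-Python sketch that works on a plain list instead of an array. The names `parse_spec` and `format_elementwise` are hypothetical, the regular expression is my own transcription of the standard format spec grammar, and rejecting every alignment except '>' is only one possible reading of the principles above.

```python
import re

# [[fill]align][sign][#][0][width][grouping][.precision][type]
_SPEC = re.compile(
    r'(?:(?P<fill>.)?(?P<align>[<>=^]))?'
    r'(?P<sign>[-+ ])?'
    r'(?P<alt>#)?'
    r'(?P<zero>0)?'
    r'(?P<width>\d+)?'
    r'(?P<grouping>[_,])?'
    r'(?:\.(?P<precision>\d+))?'
    r'(?P<type>[bcdeEfFgGnosxX%])?'
    r'\Z'
)

def parse_spec(fmt):
    """Split a format spec into its named parts (missing parts are None)."""
    match = _SPEC.match(fmt)
    if match is None:
        raise ValueError('Invalid format specifier: {!r}'.format(fmt))
    return match.groupdict()

def format_elementwise(values, fmt):
    """Map: format every element with fmt. Align: pad to the longest result."""
    spec = parse_spec(fmt)
    if spec['align'] not in (None, '>'):
        raise ValueError('alignment {!r} is not supported in this sketch'
                         .format(spec['align']))
    cells = [format(v, fmt) for v in values]   # assumes a non-empty sequence
    width = max(len(c) for c in cells)
    return '[' + ' '.join(c.rjust(width) for c in cells) + ']'

print(format_elementwise([100.002, 1.2], '8.2f'))   # [  100.00     1.20]
```

A real implementation would pick a numpy formatter based on the parsed fields rather than calling the built-in format on each element, which is what gives decimal-point alignment.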

Python 2.7

In 2.7, all numpy arrays can currently be formatted with format type 's' (or implicitly with ''), where a precision means string slicing and a width adds padding:

'{:.5}'.format(x) == str(x)[:5]             # slice
'{:25}'.format(x) == str(x).ljust(25)       # pad

This is followed by Python lists/tuples too, so it is consistent and correct behavior for Python 2.7. However, in numpy it can cause subtle confusion not quite possible with lists/tuples (even though array scalars are rare):

>>> '{:.3}'.format(1234.5678)
'1.23e+03'
>>> '{:.3}'.format(np.array(1234.5678))
'123'

Note, if we break this behavior (which I am in favor of), this functionality is still readily available (even in Python 3) with more explicit syntax:

>>> '{!s:.5}'.format(x)
>>> '{!s:25}'.format(x)
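
The explicit conversion can be checked on any object. The snippet below is a small illustration under Python 3; it uses a plain list rather than an array so that it does not assume any particular numpy behavior.

```python
x = [1.5, 2.25, 3.0]

# '!s' calls str(x) first; the spec is then applied to the resulting string.
sliced = '{!s:.5}'.format(x)   # precision 5 keeps the first five characters
padded = '{!s:25}'.format(x)   # width 25 pads on the right (strings align left)

assert sliced == str(x)[:5]
assert padded == str(x).ljust(25)
```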

Brandon Rhodes has some great arguments for breaking with this on the GitHub issue (#5543), saying that numpy arrays should not conform to the behavior of lists/tuples, since their very existence is based on breaking free from this in many many ways. I agree with him.

Python 3

Currently Python 3 throws a TypeError (stating __format__ is not implemented) for all non-empty format specs with a numpy array. This means there are no conflicts in Python 3 for implementing this.

Suggested behavior

I walk through my suggested behavior for various numpy arrays.

Array scalars

Numpy array scalars (e.g. np.array(1.23)) should always format identically to their closest Python scalar equivalents.

Float arrays

Alignments

It does not really hurt to support [<=^], but since numbers are most sensibly aligned by the decimal point, I think this can be very low priority and raise ValueError in the initial implementation. The value '>' is default and should be allowed to be explicitly specified (unless fill is deemed unnecessary), since it is required if you want to define a fill character. Note, the longest element decides length, so in numpy it can all of a sudden make sense to define a fill character without defining a min length, e.g.:

.>-.1f      [....12.0 300212.4 ...123.5]

Note: What I mean when I write this is that if you call '{:.>-.1f}'.format(x), where x in this case might be np.array([12, 300212.412, 123.51]), you get the output string on the right (shown without quotation marks).
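
The "longest element decides length" rule can be sketched in a few lines of pure Python. `format_with_fill` is a hypothetical helper that takes the element format and the fill character separately instead of parsing a full spec.

```python
def format_with_fill(values, elem_fmt, fill=' '):
    """Format each element, then right-justify all of them to the longest one."""
    cells = [format(v, elem_fmt) for v in values]
    width = max(len(c) for c in cells)   # no explicit min length is needed
    return '[' + ' '.join(c.rjust(width, fill) for c in cells) + ']'

print(format_with_fill([12, 300212.412, 123.51], '.1f', fill='.'))
# [....12.0 300212.4 ...123.5]
```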

As to whether or not fill is important enough to support, I'm on the fence. If you are using fill, you are probably looking to make a nice-looking table, in which case you will probably want to change the look of the surrounding square brackets as well. At this point, you are better off writing a custom loop instead or perhaps putting it into a pandas DataFrame. If fill is not supported, I think it should raise ValueError if specified.

Sign

All signs should be supported ([+- ]), but unlike Python where '-' is the default, the default for numpy should be ' ' (leave space for negative signs), since this is already the default behavior in numpy. For '-', if all values are non-negative, it should not leave any unnecessary leading spaces. However, if there is at least one negative value, '-' must fall back to behaving like ' ' to align correctly.
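
The fallback rule for '-' can be written down directly. This is a sketch on plain Python floats with hypothetical helper names; it fixes the type to 'f' and does not treat negative zero specially.

```python
def effective_sign(values, sign=' '):
    """Return the sign option to use per element under the proposed rule."""
    if sign not in ('+', '-', ' '):
        raise ValueError('unknown sign option: {!r}'.format(sign))
    # '-' only stays tight when no element needs a minus sign.
    if sign == '-' and any(v < 0 for v in values):
        return ' '
    return sign

def format_signed(values, sign=' ', precision=2):
    """Apply the effective sign element-wise and pad to a common width."""
    spec = '{}.{}f'.format(effective_sign(values, sign), precision)
    cells = [format(v, spec) for v in values]
    width = max(len(c) for c in cells)
    return '[' + ' '.join(c.rjust(width) for c in cells) + ']'

print(format_signed([1, 45.68], sign='-'))   # [ 1.00 45.68]
print(format_signed([1, -50], sign='-'))     # [  1.00 -50.00]
```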

Extras

Extras ([#_,0]) can be optionally supported and result in ValueError until they are. It is also a question whether or not they should be supported in Python versions where they are not supported. For instance, '_' was introduced in 3.6, but could technically be back-ported (although it will likely be more work implementing it).

Types

Some examples:

.2f         [  1.00  45.68]
+.2f        [ +1.00 +45.68]
-.2f        [ 1.00 45.68]
-.2f        [  1.00 -50.00]          # note, the neg element changes behavior
10.2e       [  1.00e+02   1.20e+00]
.2%         [ 123.00% 4567.80%]
.2g         [  1.    45.68]          # 2 indicates decimals (not significant digits)
#.2g        [  1.00  45.68]
.2g         [ 1.00e+05  1.00e+00]    # if max(abs(x))/min(abs(x)) > 1000
.2          [ 1.00e+05  1.00e+00]    # same as .2g

Note, 'f' and '%' are straightforward and well-defined and should correspond to current numpy as much as possible. The type 'e' should use a consistent number of exponent digits across elements (with a minimum of 2):

10.2e       [  1.23e+200   1.00e+000]
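
A consistent exponent width can be obtained by formatting each element with the built-in 'e' type and then zero-padding the exponent digits to the longest one. `format_exp` is a hypothetical name, and the sketch assumes finite values (inf and nan have no exponent part).

```python
def format_exp(values, precision=2, min_exp_digits=2):
    """Format with 'e' and give every element the same number of exponent digits."""
    parts = []
    for v in values:
        mantissa, exponent = format(v, '.{}e'.format(precision)).split('e')
        parts.append((mantissa, exponent[0], exponent[1:]))   # sign, then digits
    digits = max(min_exp_digits, max(len(d) for _, _, d in parts))
    return ['{}e{}{}'.format(m, s, d.zfill(digits)) for m, s, d in parts]

print(format_exp([1.23e200, 1.0]))   # ['1.23e+200', '1.00e+000']
```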

For 'g', we can't stick to built-in semantics, since it decides format based on the length of the single result. The precision also refers to significant digits, which is hard to synchronize across multiple elements:

'{:#.5g}'.format(10.0)   == '10.000'
'{:#.5g}'.format(100.0)  == '100.00'
'{:#.5g}'.format(1000.0) == '1000.0'

Instead, I suggest completely different semantics for 'g', closer to the default numpy behavior. That is, it is equivalent to 'e' when max(abs(x))/min(abs(x)) > 1000 and otherwise similar to 'f' but without trailing zeros (unless alternate form is specified with '#'). Note, that the precision thus refers to decimals and not significant digits. I also suggest that the implicit type '' should fall back to 'g'. I do not think the distinction between 'g' and '' for Python floats is relevant to numpy, but feel free to argue otherwise.
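
The suggested 'g' semantics can be sketched on a plain list of floats. `format_g` is a hypothetical name; the default precision of 8 mirrors numpy's default, and ignoring zeros when computing the ratio is my assumption, since the rule above does not say what happens when min(abs(x)) is 0.

```python
def format_g(values, precision=8, alternate=False):
    """Proposed array-wide 'g': 'e' for wide ranges, otherwise trimmed 'f'."""
    nonzero = [abs(v) for v in values if v != 0]
    if nonzero and max(nonzero) / min(nonzero) > 1000:
        return [format(v, '.{}e'.format(precision)) for v in values]
    cells = []
    for v in values:
        text = format(v, '.{}f'.format(precision))
        if '.' not in text:          # precision 0 yields no decimal point
            text += '.'
        if not alternate:            # '#' keeps the trailing zeros
            text = text.rstrip('0')
        cells.append(text.split('.'))
    # Align at the decimal point: pad integer parts left, fractions right.
    int_width = max(len(i) for i, _ in cells)
    frac_width = max(len(f) for _, f in cells)
    return [i.rjust(int_width) + '.' + f.ljust(frac_width) for i, f in cells]

print(format_g([1.0, 45.68], precision=2))   # [' 1.  ', '45.68']
print(format_g([1e5, 1.0], precision=2))     # ['1.00e+05', '1.00e+00']
```

Sign handling and the surrounding brackets are left out here to keep the range test and the trimming step visible.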

Another difference between Python 'g' and numpy 'g' is that numpy always shows decimal points for floats, while Python omits a trailing decimal point:

'{:g}'.format(1)   == '1'
'{:.0g}'.format(1) == '1'
'{:.1g}'.format(1) == '1'
'{:.0f}'.format(1) == '1'
'{:.1f}'.format(1) == '1.0'

To be consistent with current numpy behavior, it should only remove trailing zeros and not a trailing decimal point:

g           [ 1.  0.]
.0g         [ 1.  0.]
.1g         [ 1.  0.]
.0f         [ 1.  0.]
.1f         [ 1.0  0.0]
#.1g        [ 1.0  0.0]

This would mean it is impossible to get it to omit the decimal point. Opinions?

The types 'F', 'E', 'G' should work as expected. I defer the locale-sensitive 'n' for now.

Complex arrays

Complex types should behave similarly to floats and precision should be applied to both Re and Im. Note that alignment applies to both decimal points:

>>> print(np.array([[10+1.2345j, 300j]]).T)
[[ 10.  +1.2345j]
 [  0.+300.j    ]]

However, + and j are not aligned and instead are always tightly coupled with Im on either side. Note also that space requirements are decided individually for Re and Im (2 and 3, respectively, in this example). Same as for floats, I suggest keeping this default behavior for 'g' and '', while 'f' should align j but not necessarily +:

.3f         [[ 10.000  +1.235j]
             [  0.000+300.000j]]

Zero-padding is not allowed for Python complex values, so numpy should probably throw the same error:

020f        ValueError
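
A sketch of the 'f' layout for complex values: Re and Im are formatted separately, each padded to its own column width, and the j is appended last so that it lines up. `format_complex_f` is a hypothetical name, and the leading sign space of the array printout is omitted. The last lines confirm that Python itself rejects zero-padding for complex values.

```python
def format_complex_f(values, precision=2):
    """Format Re and Im in separate right-justified columns; j ends up aligned."""
    re_cells = [format(z.real, '.{}f'.format(precision)) for z in values]
    im_cells = [format(z.imag, '+.{}f'.format(precision)) for z in values]
    re_width = max(len(c) for c in re_cells)
    im_width = max(len(c) for c in im_cells)
    return [r.rjust(re_width) + i.rjust(im_width) + 'j'
            for r, i in zip(re_cells, im_cells)]

rows = format_complex_f([10 + 1.2345j, 300j])
print(rows)   # ['10.00  +1.23j', ' 0.00+300.00j']

try:
    format(1 + 2j, '020f')
except ValueError as exc:
    print('ValueError:', exc)
```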

If we choose to support fill, it will have some visible gaps:

.> 20.3     [[....... 10.  +1.235j]
             [........ 0.+300.j   ]]

One gap appears because of ' ' (which could have been implicit here), leaving a leading space for a negative sign (this is consistent with Python behavior). Another gap appears between Re and Im on the first row, and a third after the j on the second row. This is probably not a very common use case, so if it causes implementation headaches, it might be worth just shelving it.

Integer arrays

Alignments

Again, low priority. Raise ValueError until supported.

Sign

All signs should be supported ([+- ]) and '-' should be the default (consistent with Python and numpy). However, '-' should still fall back to ' ' if at least one visible negative element exists.

Extras

Similar situation as for floats.

Types

Most of these are straightforward, but one important decision is whether or not prefixes in the alternate forms '#' (0b/0x/0o) should be aligned. Compare:

#b          [  0b10 0b1000]
#b          [0b0010 0b1000]  (prefix aligned)

On a related note, leading zeros are specified with a '0' before the length specification, since '0>' does not handle prefixes correctly:

#010b       [0b00001100 0b00001101]
0>#10b      [00000b1100 00000b1101]

In Python, the use of the leading zero means length has to be specified, since implicit length is always tight. However, in numpy, since length is dictated by multiple elements, it would be useful to be able to specify leading zeros without specifying length. Without breaking from Python standards, I think the correct way to do this would be to specify a redundant 0 minimum length (resulting in two consecutive zeros):

+d          [   +12 -10000]
+0d         [   +12 -10000]   # min length 0 (this is a no-op)
+00d        [+00012 -10000]   # leading 0's and min length 0
0>+d        [000+12 -10000]   # leading 0's, but incorrect w.r.t. signs and prefixes

This presents an argument for why prefixes should not by default be aligned, since there is an option to align them, while the standard formatting language does not offer an intuitive way to un-align them:

Option 1: Do not align prefixes by default

#b          [  0b10 0b1000]
#00b        [0b0010 0b1000]

Option 2: Align prefixes by default

???         [  0b10 0b1000]
#b          [0b0010 0b1000]

The '???' would have to be non-standard, which is why I argue for Option 1. If un-aligned prefixes are not at all important, then of course Option 2 is better. However, I think aligned prefixes can look dense and make it harder to spot small values. Also, since the complex j did not need to align, perhaps base prefixes do not need to either.
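
Both layouts are easy to produce from Python's built-in integer formatting once the common width is known. The sketch below uses a hypothetical `format_binary` helper with an `align_prefix` flag standing in for the choice between '#b' and '#00b' under Option 1.

```python
def format_binary(values, align_prefix=False):
    """Format integers as prefixed binary with a common width."""
    cells = [format(v, '#b') for v in values]
    width = max(len(c) for c in cells)
    if align_prefix:
        # A zero flag plus width pads with zeros after the '0b' prefix.
        cells = [format(v, '#0{}b'.format(width)) for v in values]
    else:
        # Plain right-justification leaves the prefix next to the digits.
        cells = [c.rjust(width) for c in cells]
    return '[' + ' '.join(cells) + ']'

print(format_binary([2, 8]))                      # [  0b10 0b1000]
print(format_binary([2, 8], align_prefix=True))   # [0b0010 0b1000]
```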

My examples have focused on binary, but all other types should work analogously:

02x         [00 ff 2f ea]
2x          [  0 fff]        # min lengths can be overruled
#6o         [   0o0   0o77]

The type 'c' is the only tricky one, since it raises questions about how to deal with unicode and alignment. I think many of these issues are shared with strings (which I will get to). One solution is simply to print them as if they were strings of length 1, presented without alignment:

c           ['a' '\n' '\x00' 'È' '中']

The input to this is np.array([97, 10, 0, 200, 20013]).

In Python 2.7, things get more complicated since the target string can be both bytes and unicode. The behavior of 'c' is also modulo 256, with values in [128, 256) after the modulo failing when the formatted string is unicode (and \x??-encoded otherwise).

It might make sense to continue these modulo-256 semantics in numpy under Python 2.7, so that it is consistent with built-in semantics. However, should it really fail with a UnicodeDecodeError for half the values? I suggest both byte and unicode strings in Python 2.7 use pure ascii and \x??-encodings where necessary. For instance, the above example would be:

c           ['a' '\n' '\x00' '\xc8' '-']  # (in 2.7)
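
Both variants can be emulated under Python 3, which may help when comparing them side by side. `format_chars` and its `py27_style` flag are hypothetical; the 2.7 branch only imitates the modulo-256 and pure-ascii suggestion and is not actual Python 2.7 code.

```python
def format_chars(codes, py27_style=False):
    """Render integer code points as quoted, unaligned single characters."""
    if py27_style:
        # Wrap modulo 256, then escape anything outside ascii as \x??.
        cells = [ascii(chr(c % 256)) for c in codes]
    else:
        # repr keeps printable unicode and escapes control characters.
        cells = [repr(chr(c)) for c in codes]
    return '[' + ' '.join(cells) + ']'

codes = [97, 10, 0, 200, 20013]
print(format_chars(codes))                    # ['a' '\n' '\x00' 'È' '中']
print(format_chars(codes, py27_style=True))   # ['a' '\n' '\x00' '\xc8' '-']
```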

However, a case can also be made that the last one should be something like '\u4e2d'. This would not be a valid Python 2.7 string (since \u requires a unicode string), although this is not a repr representation, so it is not like it needs to eval correctly. A final thought: Are the quotation marks necessary?

Boolean arrays

Alignments

Low priority.

Sign

N/A. Raise ValueError.

Extras

N/A. Raise ValueError.

Types

Booleans in Python are implicitly treated as numbers for formatting purposes, and the default type is 'd' whenever the format spec is non-empty:

04          0001
04d         0001
04.1f       01.0
s           ValueError

In numpy, I think the default should be identical to the current printing default, which is to write out True/False. The user can specify extra padding if they want, although this is not an important feature:

8           [    True    False]
08          ValueError

Note that the format type here is '', and there is no equivalent explicit counterpart in Python formatting. As I mentioned earlier, it would be possible to make an extension, such as introducing the type 'B' to explicitly mean boolean. However, I do not propose anything like that here. The behavior is thus:

            [ True False]
8           [    True    False]
8d          [       1        0]
f           [1.00000000 0.00000000]   # depending on default precision (see at end)
g           [1. 0.]
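
The split between the implicit type (True/False) and explicit numeric types can be sketched as follows. `format_bools` is a hypothetical helper that only recognizes an empty spec, a bare width, or a spec that Python's int formatting accepts; it uses Python's default precision, not numpy's, so 'f' would not reproduce the row above.

```python
def format_bools(values, spec=''):
    """Implicit type prints True/False; any explicit type formats as a number."""
    if spec == '' or spec.isdigit():
        if spec.startswith('0'):
            raise ValueError('zero padding is not defined for True/False output')
        width = int(spec) if spec else 0
        cells = [str(bool(v)).rjust(width) for v in values]
    else:
        cells = [format(int(v), spec) for v in values]
    common = max(len(c) for c in cells)
    return '[' + ' '.join(c.rjust(common) for c in cells) + ']'

print(format_bools([True, False]))         # [ True False]
print(format_bools([True, False], '8'))    # [    True    False]
print(format_bools([True, False], '8d'))   # [       1        0]
```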

String arrays

Strings in numpy use repr-based formatting for elements, with quotation marks and escape sequences. You cannot correctly align repr strings (for instance '\x00' is one character but displayed using 4), so numpy uses no alignment for strings. It is hard to extend this, so I suggest adding minimal functionality here.

Alignments

N/A. Raise ValueError.

Sign

N/A. Raise ValueError.

Extras

N/A. Raise ValueError.

Types

The only well-defined and useful feature that I can think of is string slicing:

.2s         ['A' 'Fo']   # input is np.array(['A', 'Foo Bar'])

It would be displayed exactly as if each element had been sliced before being printed. This might be useful for numpy arrays with extremely long strings. Optionally, cut-off strings could be indicated with some form of ellipsis (for instance ['A' 'Fo'+]).
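
As a sketch of this minimal feature set, the hypothetical helper below accepts only an empty spec, 's', or a precision with an optional 's', and raises ValueError for everything else. The ellipsis marker is not implemented.

```python
import re

def format_strings(values, spec=''):
    """Slice each string to the given precision, then show its repr unaligned."""
    match = re.fullmatch(r'(?:\.(\d+))?s?', spec)
    if match is None:
        raise ValueError('unsupported format spec for strings: {!r}'.format(spec))
    limit = int(match.group(1)) if match.group(1) else None   # None keeps all
    return '[' + ' '.join(repr(s[:limit]) for s in values) + ']'

print(format_strings(['A', 'Foo Bar'], '.2s'))   # ['A' 'Fo']
```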

Object arrays

Disallow any non-empty format specs with a ValueError. Empty format spec should continue to work as it already does.

Other types

I wrote up some thoughts on np.datetime64 and np.timedelta64, but I'll skip that for now in the interest of not making this too long.

Settings and defaults

Python defaults to precision 6:

f           1.234568

While numpy defaults to precision 8:

f           [ 1.23456789]

Unless explicitly specified, numpy should use the set_printoptions defaults when appropriate. Here is a list of options that I think __format__ should use:

precision       yes, but only if precision is omitted in the format spec
threshold       yes
edgeitems       yes
linewidth       yes
suppress        yes for 'g' with floats
nanstr          yes
infstr          yes
formatter       no

If a formatter is set, it should be used for empty format specs. However, if any format spec is defined (even just '{:g}'), it should ignore any formatter that may be configured.
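
The two rules with conditions attached (precision and formatter) can be sketched as small lookups. Both helper names are hypothetical, and the options are passed in as a plain dict; in practice that dict would presumably come from np.get_printoptions().

```python
import re

def resolve_precision(fmt, printoptions):
    """Use the precision from the spec if present, else the configured default."""
    match = re.search(r'\.(\d+)', fmt)
    if match:
        return int(match.group(1))
    return printoptions['precision']

def resolve_formatter(fmt, printoptions):
    """A configured formatter only applies when the format spec is empty."""
    return printoptions.get('formatter') if fmt == '' else None

options = {'precision': 8, 'formatter': {'all': str}}   # stand-in for real options
print(resolve_precision('.2f', options))   # 2
print(resolve_precision('f', options))     # 8
print(resolve_formatter('g', options))     # None
```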

@endolith

endolith commented Mar 2, 2017

Is there a way to make it print every value of the array instead of omitting the middle? http://stackoverflow.com/a/1988024/125507
