Reference: Format Specification Mini-Language.
Quick demo of desired behavior and motivation:
>>> x = np.array([100.002, 1.2])
>>> print('{:8.2f}'.format(x))
[  100.00     1.20]
Currently, the user has to call np.set_printoptions to adjust these options,
and this particular format is not immediately supported without a custom
formatter:
>>> np.set_printoptions(formatter={'all': lambda x: format(x, '8.2f')})
>>> print(x)
[  100.00     1.20]
The user also has to revert changes with another set_printoptions
call.
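Until such support exists, the save-and-restore dance can at least be made safe. A minimal sketch using only the existing set_printoptions/get_printoptions API:

```python
import numpy as np

x = np.array([100.002, 1.2])

# Save the current print options, apply a temporary formatter, then
# restore, so later prints are not silently affected.
saved = np.get_printoptions()
try:
    np.set_printoptions(formatter={'all': lambda v: format(v, '8.2f')})
    s = np.array2string(x)
    print(s)
finally:
    np.set_printoptions(**saved)
```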
Personally, I would find it very useful to be able to quickly change the
printing behavior, which this would allow. Here is an example that I often deal
with:
[[  9.99640122e-01   2.17222256e-03  -1.24228450e-02  -5.61696924e-03
    7.06837666e-04]
 [ -2.48842046e-03   1.02228184e+00   7.16419216e-03  -1.33046711e-02
   -2.40332264e-03]
 [ -9.66609653e-03  -2.78424918e-02   9.99769143e-01  -1.34152343e-03
    2.11504829e-03]
 [ -7.08603279e-03  -7.57130700e-03  -1.64586641e-02   9.87134490e-01
   -2.06328584e-03]
 [  2.64709841e-02   1.75090350e-02  -1.06997156e-02   1.09245948e-02
    9.93078570e-01]]
After tweaking print options it can become much clearer what I'm working with:
[[ 1. 0. -0. -0. 0.]
[-0. 1. 0. -0. -0.]
[-0. -0. 1. -0. 0.]
[-0. -0. -0. 1. -0.]
[ 0. 0. -0. 0. 1.]]
Note, this is particularly relevant now that we have f-strings in Python 3.6,
e.g. f'{x:.1g}'.
My suggestion follows a few main principles:
- Map: To the extent possible, it should behave as if the format spec were
applied element-wise. That is, '{:6.2f}'.format(x) should produce a
numpy-style array grid with the first element formatted roughly according to
'{:6.2f}'.format(x[0]), and so forth. There will be times this should not be
strictly true in order to accommodate established numpy defaults (in this
particular case, it should behave as '{: 6.2f}'.format(x[0]), since numpy by
default leaves space for negative signs). This principle also tells us which
data types should support which format types, and what errors should be
raised if they do not. For instance, if you use '{:.2f}' with np.int64(1024),
it should print 1024.00 without complaining. However, if you try to use
'{:4d}' with np.float64, it should raise a ValueError. The implied format
types (e.g. '{:.2}') should be appropriate for the type. Note, in Python,
the implied types are: string -> 's', integer -> 'd', float -> unique
behavior (similar to 'g', but not quite). I will describe below what I think
the implied types should be for numpy types.
- Align: A strict interpretation of the map principle would suggest that a
very simple solution like this would do:
def __format__(self, fmt):
    formatter = {'all': lambda x: format(x, fmt)}
    return np.array2string(self, formatter=formatter)
This actually works in many cases, but decimal alignment is often incorrect.
For correct alignment, sometimes all elements must be considered, which is
why numpy has a set of custom formatters that are passed all the data first:
>>> x = np.array([100.002, 1.2])
>>> formatter = numpy.core.arrayprint.FloatFormat(x, 6, True)
>>> formatter(x[0])
' 100.002'
>>> formatter(x[1])
'   1.2  '
The strings are aligned at the decimal point. I think a reasonable
implementation is to parse fmt in __format__ and set up the appropriate
already existing numpy formatters. Some completely new formatters will need
to be written in order to support all options available to Python's built-in
types (e.g. '{:x}' for hexadecimal), and some lesser-used features might
require extending existing formatters (e.g. alignment and fill), if they are
deemed important enough to support.
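To make the trade-off concrete, here is a runnable sketch of the naive element-wise __format__ (the class name FormattedArray is just for illustration, not a proposed API); it maps the spec per element via array2string but, as noted, does nothing for decimal alignment:

```python
import numpy as np

class FormattedArray(np.ndarray):
    """Illustrative ndarray subclass: applies the format spec element-wise."""
    def __format__(self, fmt):
        if not fmt:
            return str(self)  # respect current behavior for empty specs
        return np.array2string(
            self, formatter={'all': lambda v: format(v, fmt)})

x = np.array([100.002, 1.2]).view(FormattedArray)
print('{:8.2f}'.format(x))  # works, but alignment is per-element only
```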
- Respect current behavior: Empty format specs ('{}' and '{:}') should work
as they already do, and adding additional specs should gently modify this
behavior.
- 2/3 Consistency: Python 2 and 3 have different formatting behaviors
(examples below), so one big question is: should numpy try to adopt the
local dialect, or should numpy always be consistent between 2 and 3? I vote
for consistency (as much as possible), even though this would break what
some might consider a consistent and reasonable formatting behavior in
Python 2.7. I think this is one of the most important points to discuss. I
am also not a Python 2 user, so I do not have terribly strong opinions about
what happens there.
- No extensions (in this proposal): It would be possible to extend the
formatting language when it makes sense for numpy. For instance, one could
dream up something like '{:[+][-]B}'.format(np.array([True, False, True]))
resulting in '[+ - +]'. However, I defer all such considerations and focus
on what Python users expect will work intuitively.
In 2.7, all numpy arrays can currently be formatted with format type 's'
(or implicitly with '') and a precision to mean string slicing, or a length
to add padding:
'{:.5}'.format(x) == str(x)[:5] # slice
'{:25}'.format(x) == str(x).ljust(25) # pad
Python lists/tuples follow this too, so it is consistent and correct behavior for Python 2.7. However, in numpy it can cause subtle confusion not quite possible with lists/tuples (even though array scalars are rare):
>>> '{:.3}'.format(1234.5678)
'1.23e+03'
>>> '{:.3}'.format(np.array(1234.5678))
'123'
Note, if we break this behavior (which I am in favor of), this functionality is still readily available (even in Python 3) with more explicit syntax:
>>> '{!s:.5}'.format(x)
>>> '{!s:25}'.format(x)
Brandon Rhodes has some great arguments for breaking with this on the GitHub issue (#5543), saying that numpy arrays should not conform to the behavior of lists/tuples, since their very existence is based on breaking free from this in many many ways. I agree with him.
Currently, Python 3 throws a TypeError (stating __format__ is not
implemented) for all non-empty format specs with a numpy array. This means
there are no conflicts in Python 3 for implementing this.
Below, I walk through my suggested behavior for various numpy arrays.
Numpy array scalars (e.g. np.array(1.23)) should always format identically
to their closest Python scalar equivalents.
It does not really hurt to support [<=^], but since numbers are most
sensibly aligned by the decimal point, I think this can be very low priority
and raise ValueError in the initial implementation. The value '>' is the
default and should be allowed to be explicitly specified (unless fill is
deemed unnecessary), since it is required if you want to define a fill
character. Note, the longest element decides the length, so in numpy it can
all of a sudden make sense to define a fill character without defining a
minimum length, e.g.:
.>-.1f [....12.0 300212.4 ...123.5]
Note: What I mean when I write this is that if you call
'{:.>-.1f}'.format(x), where x in this case might be
np.array([12, 300212.412, 123.51]), you get the output string on the right
(shown without quotation marks).
As to whether or not fill is important enough to support, I'm on the fence. If you are using fill, you are probably looking to make a nice-looking table, in which case you will probably want to change the look of the surrounding square brackets as well. At this point, you are better off writing a custom loop instead or perhaps putting it into a pandas DataFrame. If fill is not supported, I think it should raise ValueError if specified.
All signs should be supported ([+- ]), but unlike Python where '-' is the
default, the default for numpy should be ' ' (leave space for negative
signs), since this is already the default behavior in numpy. For '-', if
all values are non-negative, it should not leave any unnecessary leading
spaces. However, if there is at least one negative value, '-' must fall
back to behaving like ' ' to align correctly.
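The proposed fallback can be sketched as a small helper (effective_sign is a hypothetical name for illustration, not numpy API):

```python
import numpy as np

def effective_sign(x, requested):
    """Hypothetical rule: '-' degrades to ' ' when any element is
    negative, so positive elements leave room for the minus sign."""
    if requested == '-' and np.any(np.asarray(x) < 0):
        return ' '
    return requested

print(effective_sign(np.array([1.0, 45.678]), '-'))  # '-': all non-negative
print(effective_sign(np.array([1.0, -50.0]), '-'))   # ' ': fall back
```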
Extras ([#_,0]) can be optionally supported and raise ValueError until they
are. It is also a question whether they should be supported in Python
versions that lack them. For instance, '_' was introduced in 3.6, but could
technically be back-ported (although it will likely be more work to
implement).
Some examples:
.2f [ 1.00 45.68]
+.2f [ +1.00 +45.68]
-.2f [ 1.00 45.68]
-.2f [ 1.00 -50.00] # note, the neg element changes behavior
10.2e [ 1.00e+02 1.20e+00]
.2% [ 123.00% 4567.80%]
.2g [ 1. 45.68] # 2 indicates decimals (not significant digits)
#.2g [ 1.00 45.68]
.2g [ 1.00e+05 1.00e+00] # if max(abs(x))/min(abs(x)) > 1000
.2 [ 1.00e+05 1.00e+00] # same as .2g
Note, 'f' and '%' are straightforward and well-defined and should correspond
to current numpy behavior as much as possible. The type 'e' should use a
consistent number of exponent digits (with a minimum of 2):
10.2e [ 1.23e+200 1.00e+000]
For 'g', we can't stick to built-in semantics, since it decides the format
based on the length of the single result. The precision also refers to
significant digits, which is hard to synchronize across multiple elements:
'{:#.5g}'.format(10.0) == '10.000'
'{:#.5g}'.format(100.0) == '100.00'
'{:#.5g}'.format(1000.0) == '1000.0'
Instead, I suggest completely different semantics for 'g', closer to the
default numpy behavior. That is, it is equivalent to 'e' when
max(abs(x))/min(abs(x)) > 1000 and otherwise similar to 'f' but without
trailing zeros (unless alternate form is specified with '#'). Note that the
precision thus refers to decimals and not significant digits. I also suggest
that the implicit type '' should fall back to 'g'. I do not think the
distinction between 'g' and '' for Python floats is relevant to numpy, but
feel free to argue otherwise.
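The proposed 'g' rule can be sketched as follows (g_style is a hypothetical helper, not numpy API; it ignores alignment and skips zeros in the ratio test for simplicity):

```python
import numpy as np

def g_style(x, precision=2, alt=False):
    """Hypothetical sketch of the proposed 'g': exponential notation when
    the dynamic range exceeds 1000, else fixed notation without trailing
    zeros (kept when alternate form '#' is requested via alt=True)."""
    vals = np.asarray(x, dtype=float)
    ax = np.abs(vals)
    ax = ax[ax != 0]  # ignore zeros when measuring dynamic range
    use_exp = ax.size > 0 and ax.max() / ax.min() > 1000
    out = []
    for v in vals:
        s = format(v, '.{}{}'.format(precision, 'e' if use_exp else 'f'))
        if not use_exp and not alt:
            s = s.rstrip('0')  # strip trailing zeros, keep decimal point
        out.append(s)
    return '[' + '  '.join(out) + ']'

print(g_style([1.0, 45.678]))  # '[1.  45.68]'
print(g_style([1e5, 1.0]))     # '[1.00e+05  1.00e+00]'
```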
Another difference between Python 'g' and numpy 'g' is that numpy always
shows decimal points for floats, while Python omits a trailing decimal
point:
'{:g}'.format(1) == '1'
'{:.0g}'.format(1) == '1'
'{:.1g}'.format(1) == '1'
'{:.0f}'.format(1) == '1'
'{:.1f}'.format(1) == '1.0'
To be consistent with current numpy behavior, it should only remove trailing zeros and not a trailing decimal point:
g [ 1. 0.]
.0g [ 1. 0.]
.1g [ 1. 0.]
.0f [ 1. 0.]
.1f [ 1.0 0.0]
#.1g [ 1.0 0.0]
This would mean it is impossible to get it to omit the decimal point. Opinions?
The types 'F', 'E', and 'G' should work as expected. I defer the
locale-sensitive 'n' for now.
Complex types should behave similarly to floats, and precision should be
applied to both Re and Im. Note that alignment applies to both decimal
points:
>>> print(np.array([[10+1.2345j, 300j]]).T)
[[ 10. +1.2345j]
 [  0.+300.j   ]]
However, + and j are not aligned and instead are always tightly coupled
with Im on either side. Note also that space requirements are decided
individually for Re and Im (2 and 3, respectively, in this example). Same
as for floats, I suggest keeping this default behavior for 'g' and '',
while 'f' should align j but not necessarily +:
.3f [[ 10.000  +1.235j]
     [  0.000+300.000j]]
Zero-padding is not allowed for Python complex values, so numpy should probably throw the same error:
020f ValueError
If we choose to support fill, it will have some visible gaps:
.> 20.3 [[....... 10. +1.235j]
         [........ 0.+300.j  ]]
One gap appears because of ' ' (which could have been implicit here),
leaving a leading space for a negative sign (this is consistent with Python
behavior). Another gap appears between Re and Im on the first row, and a
third after the j on the second row. This is probably not a very common use
case, so if it causes implementation headaches, it might be worth just
shelving it.
Again, low priority. Raise ValueError until supported.
All signs should be supported ([+- ]) and '-' should be the default
(consistent with Python and numpy), although '-' should still fall back to
' ' if at least one visible negative element exists.
Similar situation as for floats.
Most of these are straightforward, but one important decision is whether or
not prefixes in the alternate form '#' (0b/0x/0o) should be aligned.
Compare:
#b [ 0b10 0b1000]
#b [0b0010 0b1000] (prefix aligned)
On a related note, leading zeros are specified with a '0' before the length
specification, since '0>' does not handle prefixes correctly:
#010b [0b00001100 0b00001101]
0>#10b [00000b1100 00000b1101]
In Python, the use of the leading zero means length has to be specified, since implicit length is always tight. However, in numpy, since length is dictated by multiple elements, it would be useful to be able to specify leading zeros without specifying length. Without breaking from Python standards, I think the correct way to do this would be to specify a redundant 0 minimum length (resulting in two consecutive zeros):
+b [ +12 -10000]
+0b [ +12 -10000] # min length 0 (this is a no-op)
+00b [+00012 -10000] # leading 0's and min length 0
0>+b [000+12 -10000] # leading 0's, but incorrect w.r.t. signs and prefixes
This presents an argument for why prefixes should not be aligned by
default: there is an option to align them, while the standard formatting
language does not offer an intuitive way to un-align them:
Option 1: Do not align prefixes by default
#b [ 0b10 0b1000]
#00b [0b0010 0b1000]
Option 2: Align prefixes by default
??? [ 0b10 0b1000]
#b [0b0010 0b1000]
The '???' would have to be non-standard, which is why I argue for Option 1.
If un-aligned prefixes are not at all important, then of course Option 2 is
better. However, I think aligned prefixes can look dense, and they make it
harder to spot small values. Also, since the complex j did not need to
align, perhaps base prefixes do not need to either.
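Option 1's two modes can be sketched with a hypothetical helper (pad_binary is illustrative only, and assumes non-negative values for brevity):

```python
def pad_binary(values, zero_fill=False):
    """Hypothetical sketch: '#b' pads with spaces (Option 1 default),
    while '#00b' zero-fills after the '0b' prefix so prefixes align.
    Non-negative values only, for brevity."""
    body = [format(v, '#b') for v in values]
    width = max(len(s) for s in body)
    if zero_fill:
        # insert zeros after the prefix, never before it
        body = [s[:2] + s[2:].zfill(width - 2) for s in body]
    else:
        body = [s.rjust(width) for s in body]
    return '[' + ' '.join(body) + ']'

print(pad_binary([0b10, 0b1000]))                  # '[  0b10 0b1000]'
print(pad_binary([0b10, 0b1000], zero_fill=True))  # '[0b0010 0b1000]'
```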
My examples have focused on binary, but all other integer presentation types should work analogously:
02x [00 ff 2f ea]
2x [ 0 fff] # min lengths can be overruled
#6o [ 0o0 0o77]
The type 'c' is the only tricky one, since it raises some questions of how
to deal with unicode and alignment. I think many of these issues are shared
with strings (which I will get to); one solution is simply to print the
characters as if they were strings of length 1, presented without
alignment:
c ['a' '\n' '\x00' 'È' '中']
The input to this is np.array([97, 10, 0, 200, 20013]).
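In Python 3, this proposed 'c' behavior amounts to taking the repr of each code point's one-character string:

```python
import numpy as np

# Sketch of the proposed 'c' output (Python 3): each integer becomes the
# repr of its one-character string, with no alignment.
x = np.array([97, 10, 0, 200, 20013])
s = '[' + ' '.join(repr(chr(int(v))) for v in x) + ']'
print(s)  # ['a' '\n' '\x00' 'È' '中']
```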
In Python 2.7, things get more complicated, since the target string can be
both bytes and unicode. The behavior of 'c' is also modulo 256, with values
in [128, 256) after the modulo failing when the formatted string is unicode
(and \x??-encoded otherwise). It might make sense to continue these
modulo-256 semantics in numpy under Python 2.7, so that it is consistent
with built-in semantics. However, should it really fail with a
UnicodeDecodeError for half the values? I suggest both byte and unicode
strings in Python 2.7 use pure ascii and \x?? encodings where necessary.
For instance, the above example would be:
c ['a' '\n' '\x00' '\xc8' '-'] # (in 2.7)
However, a case can also be made that the last one should be something like
'\u4e2d'. This would not be a valid Python 2.7 string (since \u requires a
unicode string), although this is not a repr representation, so it is not
like it needs to eval correctly. A final thought: are the quotation marks
necessary?
Low priority.
N/A. Raise ValueError.
N/A. Raise ValueError.
Booleans in Python are implicitly treated as numbers for formatting
purposes, and the default type is 'd' whenever the format spec is
non-empty:
04 0001
04d 0001
04.1f 01.0
s ValueError
In numpy, I think the default should be identical to the current printing
default, which is to write out True/False. The user can specify extra
padding if they want, although this is not an important feature:
8 [ True False]
08 ValueError
Note that the format type here is '', and there is no equivalent explicit
counterpart in Python formatting. As I mentioned earlier, it would be
possible to make such an extension, such as introducing the type 'B' to
explicitly mean boolean. However, I do not propose anything like that here.
The behavior is thus:
[ True False]
8 [ True False]
8d [ 1 0]
f [1.00000000 0.00000000] # depending on default precision (see at end)
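That dispatch can be sketched with a hypothetical helper (format_bool_array is illustrative only; a width-only spec keeps True/False, a typed spec delegates to numeric formatting):

```python
import numpy as np

def format_bool_array(x, spec=''):
    """Hypothetical sketch: empty/width-only specs keep True/False,
    while a typed spec ('d', 'f', 'g', ...) treats booleans as numbers."""
    if spec and spec[-1].isalpha():
        # integer presentation types get ints, the rest get floats
        body = [format(int(v) if spec[-1] in 'bcdoxX' else float(v), spec)
                for v in x]
    else:
        width = int(spec) if spec else 5  # 5 = len('False')
        body = [str(bool(v)).rjust(width) for v in x]
    return '[' + ' '.join(body) + ']'

print(format_bool_array(np.array([True, False])))        # '[ True False]'
print(format_bool_array(np.array([True, False]), '8d'))
```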
g [1. 0.]
Strings in numpy use repr-based formatting for elements, with quotation marks
and escape sequences. You cannot correctly align repr strings (for instance
'\x00'
is one character but displayed using 4), so numpy uses no alignment for
strings. It is hard to extend this, so I suggest adding minimal functionality
here.
N/A. Raise ValueError.
N/A. Raise ValueError.
N/A. Raise ValueError.
The only well-defined and useful feature that I can think of is string
slicing:
.2s ['A' 'Fo'] # input is np.array(['A', 'Foo Bar'])
It would be displayed exactly as if each element had been sliced before
being printed. This might be useful for numpy arrays with extremely long
strings. Optionally, cut-off strings could be indicated with some form of
ellipsis (for instance ['A' 'Fo'+]).
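Element-wise slicing like '.2s' can already be emulated with a custom formatter; this is a sketch of the proposed behavior, not existing format-spec support:

```python
import numpy as np

# Emulate the proposed '.2s' by slicing each element before repr-printing.
x = np.array(['A', 'Foo Bar'])
s = np.array2string(x, formatter={'all': lambda v: repr(str(v)[:2])})
print(s)  # ['A' 'Fo']
```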
Disallow any non-empty format specs with a ValueError. Empty format spec should continue to work as it already does.
I wrote up some thoughts on np.datetime64 and np.timedelta64, but I'll skip
that for now in the interest of not making this too long.
Python defaults to precision 6:
f 1.234568
While numpy defaults to precision 8:
f [ 1.23456789]
Unless explicitly specified, numpy should use the set_printoptions defaults
when appropriate. Here is a list of options that I think __format__ should
use:
precision yes, but only if precision is omitted in the format spec
threshold yes
edgeitems yes
linewidth yes
suppress yes for 'g' with floats
nanstr yes
infstr yes
formatter no
If a formatter is set, it should be used for empty format specs. However,
if any format spec is defined (even just '{:g}'), it should ignore any
formatter that may be configured.