Partially-structured thoughts in response to https://numpy.org/neps/nep-0041-improved-dtype-support.html
Quoting @teoliphant from the mailing list
But, this is the right way to connect the data type system with the rest of Python typing. NumPy's current dtypes are currently analogous to Python 1's user-defined classes. In Python 1 all user-defined classes were instances of a single Class Type at the C-level, just like currently all NumPy dtypes are instances of a single Dtype "Type" in Python.
I'm not really following here. In Python 3.x, all user-defined classes are instances of `type` at the Python level.
At the C level, almost all classes have the layout of `PyTypeObject` (note that while the layout `PyHeapTypeObject` is sometimes used, this layout is not a different Python type!).
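A quick way to check this from Python itself. This is a minimal sketch: `Plain` is a made-up class, and the bit value used for `Py_TPFLAGS_HEAPTYPE` (`1 << 9`) is taken from CPython's headers, so it is CPython-specific.

```python
class Plain:
    pass

# Every class, user-defined or built-in, is an instance of `type`.
assert type(Plain) is type
assert isinstance(int, type)
assert isinstance(type, type)

# CPython-specific: heap-allocated classes use the PyHeapTypeObject layout,
# flagged by Py_TPFLAGS_HEAPTYPE, yet they are still plain `type` instances.
PY_TPFLAGS_HEAPTYPE = 1 << 9
plain_is_heap_type = bool(Plain.__flags__ & PY_TPFLAGS_HEAPTYPE)
int_is_heap_type = bool(int.__flags__ & PY_TPFLAGS_HEAPTYPE)
```

The layout differs between `Plain` and `int`, but `type(Plain)` and `type(int)` are the same object.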
Shifting Dtypes to be true types (by making them instances of a single low-level MetaType) is (IMHO) exactly the right approach.
Perhaps it's worth asking ourselves exactly what a C-level type instance is for in Python. To be clear, that refers to `PyTypeObject`s which are not instances of `type`. I'd say they:
- Define some metadata about the type (`tp_name`, `tp_flags`, `tp_dict`)
- Provide a table of behaviors, including:
  - fast-paths for magic methods (`tp_as_number`, etc)
  - arbitrary methods with string names (`tp_methods` etc)
- ... for operating on a single object pointer whose data contains the type itself (`self->ob_type`)
- ... which is allocated and deallocated using hooks (`tp_alloc`) and metadata (`tp_basicsize`) in the type
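Several of these slots are visible from Python, which makes the list above easy to poke at. A small sketch, using a made-up class `Point`; the slot-to-attribute correspondence in the comments is approximate:

```python
class Point:
    def norm(self):
        return 0.0

# Metadata about the type.
name = Point.__name__            # roughly tp_name
flags = Point.__flags__          # tp_flags
namespace = Point.__dict__       # tp_dict (exposed as a read-only proxy)

# Allocation metadata.
basicsize = Point.__basicsize__  # tp_basicsize

# Each instance carries a pointer back to its type (ob_type).
p = Point()
assert type(p) is Point
assert p.__class__ is Point
```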
Let's compare that to today's `dtype`s, which:
- Define some metadata about the type (`kind`, `type_num`, ...)
- Provide a table of behaviors, including:
  - fast-paths for "magic" methods (`f`)
- ... for operating on either:
  - contiguous/strided 1D arrays of object data
  - single pointers to object data

  with the dtype itself stored elsewhere (`arr->descr`)
- ... which is allocated using metadata (`itemsize`, `alignment`)
At a glance it already seems like we're close to the python model, without changing anything at all.
But it's also worth noting that right now, data described by a `dtype` and data described by a `type` do not seem to be interchangeable. The former is all about `void *` pointers to data of arbitrary structure, while the latter is about `PyObject *` pointers to data whose head is a `PyObject`. So if anything, we should conclude that `isinstance(dtype(np.int32), type)` should be `False` (note: this stance would change if we decided to try and unify dtypes with scalar types, but right now we are not trying to do that).
Now let's look at how Python uses metatypes at the C level. This happens in only one place, in `_ctypes.c`, where the following metatypes are defined:
- `CDataType`
  - This isn't actually a real metaclass for some reason, but can be thought of as a base class of the following metaclasses
  - Provides static methods
  - Provides operator overloads for `ctype * n` etc
- `PyCStructType_Type`, `UnionType_Type`
  - defines a custom `setattr` to handle `_fields_` class attributes
- `PyCPointerType_Type`
  - adds a `set_type` class method
- `PyCSimpleType_Type`
  - overloads the `from_param` class method
- `PyCArrayType_Type`
- `PyCFuncPtrType_Type`
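These metatypes can be observed from Python through the public `ctypes` module. A short sketch; it only relies on documented `ctypes` behaviour (`c_int * n`, `_length_`, `_type_`) and avoids depending on the metatypes' exact names:

```python
import ctypes

# The class of a ctypes class is a metatype, not plain `type`.
simple_meta = type(ctypes.c_int)
struct_meta = type(ctypes.Structure)
assert simple_meta is not type
assert issubclass(simple_meta, type)
assert issubclass(struct_meta, type)

# `ctype * n` is an operator overload living on the metatype:
# it builds a new array *type* rather than multiplying a value.
IntArray3 = ctypes.c_int * 3
assert isinstance(IntArray3, type)
length = IntArray3._length_
element_type = IntArray3._type_
```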
Clearly then, metatypes are the right solution for providing static methods. Almost all of the magic here is in `tp_new` for these types. TODO: look at `tp_new`.
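The same pattern can be written in pure Python. The sketch below is only an analogy, not how `_ctypes.c` is implemented; `ToyCType`, `toy_int` and `describe` are hypothetical names:

```python
class ToyCType(type):
    # Operator overload on the class object itself, in the spirit of `c_int * 3`.
    def __mul__(cls, length):
        return ToyCType(f"{cls.__name__}_Array_{length}", (cls,), {"_length_": length})

    # Reachable from the class, but not from its instances.
    def describe(cls):
        return f"<toy ctype {cls.__name__}>"


class toy_int(metaclass=ToyCType):
    pass


arr_type = toy_int * 3
description = toy_int.describe()
visible_on_instance = hasattr(toy_int(), "describe")
```

Methods defined on the metatype behave like static methods of the class without leaking into instances.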
Now, we just looked at metatypes, `type(some_type)`.
Everything we learnt here we should apply to `type(some_dtype)`.
Crucial point: meta-dtypes are just dtype subclasses, not meta-types.
Let's build a toy metaclass:
class MyMetaType(type):
    pass

class MyClass(metaclass=MyMetaType):
    pass
and then do some comparisons:
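- Function hook names are C-level slots in:

  |                  | scalar types               | array types                      |
  |------------------|----------------------------|----------------------------------|
  | python object o1 | `type`                     | `np.dtype`                       |
  | C decl           | `PyTypeObject PyType_Type` | `PyTypeObject PyArrayDescr_Type` |
  | invariant        | `isinstance(o1, type)`     | `isinstance(o1, type)`           |

- C static methods are stored in:

  |                  | scalar types                   | array types                              |
  |------------------|--------------------------------|------------------------------------------|
  | python object o2 | `MyMetaType`                   | `np.integral_dtype`                      |
  | C decl           | `PyTypeObject MyMetaType_Type` | `PyTypeObject PyArrayIntegralDescr_Type` |
  | invariant        | `issubclass(o2, o1)`           | `issubclass(o2, o1)`                     |

- Function hook values and allocation settings are stored by:

  |                  | scalar types                     | array types                                |
  |------------------|----------------------------------|--------------------------------------------|
  | python object o3 | `MyClass`                        | `np.dtype(int)`                            |
  | C decl           | `MyMetaType_Object MyClass_Type` | `PyArrayIntegralDescr_Object my_int_dtype` |
  | invariant        | `isinstance(o3, o1)`             | `isinstance(o3, o1)`                       |
  | sometimes...     | `isinstance(o3, o2)`             | `isinstance(o3, o2)`                       |

- Instances of the type are:

  |                  | scalar types            | array types                    |
  |------------------|-------------------------|--------------------------------|
  | python object o4 | `MyClass()`             | `np.empty(..., np.dtype(int))` |
  | C decl           | `MyClass_Object my_obj` | `np_int my_int_element`        |
  | invariant        | `o4.__class__ == o3`    | `o4.dtype == o3`               |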
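The invariants in the "scalar types" column can be checked directly with the toy metaclass. This sketch covers only that column; the "array types" column describes a proposed design and is not checked here.

```python
class MyMetaType(type):
    pass

class MyClass(metaclass=MyMetaType):
    pass

o1, o2, o3, o4 = type, MyMetaType, MyClass, MyClass()

assert isinstance(o1, type)   # `type` is itself a type
assert issubclass(o2, o1)     # a metatype is a subclass of `type`
assert isinstance(o3, o1)     # a class is an instance of `type`...
assert isinstance(o3, o2)     # ...and, here, of its metatype
assert o4.__class__ is o3     # an instance points back at its class
```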
Another observation: `class` definitions do not have to define `type`s:
# np.dtype
class NotAMetaClass:
    def __init__(self, name, bases, dict):
        self.__dict__ = dict
        # set after replacing __dict__, so the name is not discarded
        self.name = name
        # satisfy the compiler's `__class__` cell (writable since Python 3.7)
        dict['__classcell__'].cell_contents = self

# my_custom_dtype
class NotAType(metaclass=NotAMetaClass):
    # not tied to `self` conventions, can take anything here
    def foo(arr):
        return __class__.bar
    bar = 'baz'

some_arr = ...
print(NotAType.foo(some_arr))
# baz
So we can support new dtypes defined using the `class` statement even if `issubclass(np.dtype, type)` is false.
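The same mechanism extends to inheritance: when a `class` statement lists a base that is not a type, Python picks the metaclass from `type(base)`. A minimal sketch, reusing the idea of `NotAMetaClass` with made-up names `Base` and `Child`:

```python
class NotAMetaClass:
    def __init__(self, name, bases, namespace):
        self.__dict__ = namespace
        self.name = name
        self.bases = bases


class Base(metaclass=NotAMetaClass):
    bar = 'baz'


# No explicit metaclass here: it is derived from type(Base).
class Child(Base):
    qux = 1


assert not isinstance(Child, type)
assert isinstance(Child, NotAMetaClass)
assert Child.bases == (Base,)
```

Note that nothing is inherited automatically: `Child` only sees `bar` if the `__init__` above chooses to copy it from `Base`.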
from typing import Callable

# a sketch of what `np.dtype` itself could look like
class dtype(object):
    # C level slots in PyArray_Descr today, perhaps with some removed
    type_num: int = -1  # -1 means "not set"
    __common_dtype__: Callable[["dtype", "dtype"], "dtype"] = None  # None stands in for a NULL slot

    # or __new__, doesn't matter for this example
    def __init__(dt, name, bases, dict):
        # in C, this is really just direct initialization of slots
        dt.__dict__ = dict
        dt.name = name
        # trick for python `__class__` magic; the cell only exists if a method uses `__class__`
        if '__classcell__' in dict:
            dict['__classcell__'].cell_contents = dt
        if bases:
            # inherit slots from the base class
            b, = bases
            if dt.type_num == -1:
                dt.type_num = b.type_num
            if dt.__common_dtype__ is None:
                dt.__common_dtype__ = b.__common_dtype__

    def __call__(dt, value):
        return np.asarray(value, dtype=dt)
meta-dtypes:
class integral_dtype(np.dtype):
    # C level slots in PyArrayIntegralDescr_Object, which start with `PyArrayDescr_Object`
    endianess: str = None  # a `char` in C; None means "not set"
    signed: bool = None
    # C storage in PyArrayIntegralDescr_Type, which starts with `PyType_Type`
    _lookup_dict: dict = {}

    def __init__(dt, name, bases, dict):
        # meta-dtypes are regular types so super works just fine!
        super().__init__(name, bases, dict)
        if bases:
            # inherit slots from the base class
            b, = bases
            # make sure to inherit the new slots
            if dt.endianess is None:
                dt.endianess = b.endianess
            if dt.signed is None:
                dt.signed = b.signed
meta-dtype instantiations: just plain old dtypes
class integer(metaclass=integral_dtype):
    # this provides default values for the slots in the meta-dtype integral_dtype
    def __common_dtype__(dt, other_dt):
        if not isinstance(other_dt, integral_dtype):
            return NotImplemented
        if dt.signed == other_dt.signed:
            return integral_dtype._lookup_dict[max(dt.itemsize, other_dt.itemsize), dt.signed]
        else:
            return integral_dtype._lookup_dict[max(dt.itemsize, other_dt.itemsize) + 1, True]
class uint8(integer):
    type_num = 1  # this slot is from PyArrayDescr
    # these slots are from PyArrayIntegralDescr_Object
    signed = False
    endianess = '='

# keyed by (itemsize, signed), matching the lookups in __common_dtype__
integral_dtype._lookup_dict[1, False] = uint8

assert not isinstance(uint8, type)
assert isinstance(uint8, np.dtype)
assert isinstance(uint8, integral_dtype)
dtype subclassing means "fill in slots from my parent" just like it does in Python. This is handled by `dtype.__init__`:
class uint8_non_native(uint8):
    endianess = 'S'
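The pieces above can be tried out without touching NumPy. What follows is a minimal pure-Python sketch under stated assumptions: `ToyDtype`, `ToyIntegralDtype`, `_slots` and `_lookup` are hypothetical names, the `type_num` values are arbitrary, and the promotion rule (mixed signedness promotes to a signed type twice as wide) is an assumption made for the example, not NumPy's actual behaviour.

```python
class ToyDtype:
    """Hypothetical stand-in for the proposed np.dtype; not a subclass of type."""
    # Names of the "slots" that a dtype inherits from its base dtype.
    _slots = ('type_num', 'itemsize', '__common_dtype__')

    def __init__(dt, name, bases, namespace):
        dt.__dict__ = namespace
        dt.name = name
        if bases:
            (base,) = bases  # single inheritance only
            # Collect slot names from the meta-dtype and all of its parents.
            for klass in type(dt).__mro__:
                for slot in vars(klass).get('_slots', ()):
                    if slot not in namespace and hasattr(base, slot):
                        setattr(dt, slot, getattr(base, slot))


class ToyIntegralDtype(ToyDtype):
    """Hypothetical meta-dtype: an ordinary subclass of ToyDtype."""
    _slots = ('signed', 'endianess')
    _lookup = {}  # (itemsize, signed) -> dtype


class integer(metaclass=ToyIntegralDtype):
    def __common_dtype__(dt, other):
        if not isinstance(other, ToyIntegralDtype):
            return NotImplemented
        size = max(dt.itemsize, other.itemsize)
        if dt.signed == other.signed:
            return ToyIntegralDtype._lookup[size, dt.signed]
        # Assumed rule: mixed signedness promotes to a signed type twice as wide.
        return ToyIntegralDtype._lookup[size * 2, True]


class uint8(integer):
    type_num, itemsize, signed, endianess = 1, 1, False, '='

class int8(integer):
    type_num, itemsize, signed, endianess = 2, 1, True, '='

class int16(integer):
    type_num, itemsize, signed, endianess = 3, 2, True, '='

for registered in (uint8, int8, int16):
    ToyIntegralDtype._lookup[registered.itemsize, registered.signed] = registered


class uint8_non_native(uint8):
    endianess = 'S'


assert not isinstance(uint8, type)
assert isinstance(uint8, ToyDtype) and isinstance(uint8, ToyIntegralDtype)
# Slots not set in the subclass body were filled in from the parent.
assert uint8_non_native.signed is False and uint8_non_native.itemsize == 1
```

Because `__common_dtype__` lives in each dtype's own `__dict__`, it is a plain function rather than a bound method, so the dtype is passed explicitly: `uint8.__common_dtype__(uint8, int8)`.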
dtype instantiation (not to be confused with invoking `np.dtype` itself) could be made to mean array-with-this-dtype instantiation:
arr = uint8_non_native([1, 2, 3])
assert arr.dtype == uint8_non_native
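How calling a dtype could produce an array can be illustrated without NumPy. In this sketch `ToyArray` and `ToyDtype` are hypothetical stand-ins for `np.ndarray` and the proposed `np.dtype`; the point is only that `__call__` on the dtype's class makes each dtype callable.

```python
class ToyArray:
    """Hypothetical stand-in for np.ndarray."""
    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype


class ToyDtype:
    """Hypothetical stand-in for np.dtype; its instances are the dtypes."""
    def __init__(dt, name, bases, namespace):
        dt.__dict__ = namespace
        dt.name = name

    # Calling a dtype builds an array of that dtype.
    def __call__(dt, values):
        return ToyArray(values, dtype=dt)


class uint8(metaclass=ToyDtype):
    pass


arr = uint8([1, 2, 3])
assert arr.dtype is uint8
```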