Partially-structured thoughts in response to https://numpy.org/neps/nep-0041-improved-dtype-support.html


Quoting @teoliphant from the mailing list

But, this is the right way to connect the data type system with the rest of Python typing.  NumPy's current dtypes are currently analogous to Python 1's user-defined classes.  In Python 1 all user-defined classes were instances of a single Class Type at the C-level, just like currently all NumPy dtypes are instances of a single Dtype "Type" in Python.

I'm not really following here. In python 3.x, all user-defined classes are instances of type at the python level. At the C level, almost all classes have the layout of PyTypeObject (note that while the extended PyHeapTypeObject layout is sometimes used, that layout does not correspond to a different python type!).
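
For example, a class created with a class statement and a built-in type report the same python-level type; this is plain CPython behavior, nothing numpy-specific:

class Foo:                 # defined at runtime, so it uses the PyHeapTypeObject layout at the C level
    pass

print(type(Foo) is type)   # True -- same python type as static, built-in types
print(type(int) is type)   # True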

Shifting Dtypes to be true types (by making them instances of a single low-level MetaType) is (IMHO) exactly the right approach.

Perhaps it's worth asking ourselves exactly what a c-level type instance is for in python. To be clear, that means the C struct PyTypeObject, not python-level instances of type. I'd say they:

  • Define some metadata about the type (tp_name, tp_flags, tp_dict)
  • Provide a table of behaviors, including:
    • fast-paths for magic methods (tp_as_number, etc)
    • arbitrary methods with string names (tp_methods etc)
  • ... for operating on a single object pointer whose data contains the type itself (self->ob_type)
  • ... which is allocated and deallocated using hooks (tp_alloc) and metadata (tp_basicsize) in the type
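
A small python-level illustration of the "table of behaviors" point: special methods are looked up through the type's slots, not on the instance, so overriding one on an instance has no effect. (Plain python, nothing numpy-specific.)

class Num:
    def __init__(self, v):
        self.v = v
    def __add__(self, other):
        # defining __add__ in the class body fills the number-protocol slot on the *type*
        return Num(self.v + other.v)

a, b = Num(1), Num(2)
a.__add__ = lambda other: 'ignored'   # stored on the instance, not in the type's table
print((a + b).v)                      # 3 -- special methods dispatch through type(a)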

Let's compare that to today's dtypes, which:

  • Define some metadata about the type (kind, type_num, ...)

  • Provide a table of behaviors, including:

    • fast-paths for "magic" methods (the f slot, a PyArray_ArrFuncs table)
  • ... for operating on either:

    • contiguous/strided 1D arrays of object data
    • single pointers to object data

    with the dtype itself stored elsewhere (arr->descr)

  • ... which is allocated using metadata (itemsize, alignment)

At a glance it already seems like we're close to the python model, without changing anything at all.
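
A minimal sketch of how those pieces already surface at the python level today (the exact num value is platform-dependent):

import numpy as np

dt = np.dtype(np.int32)
# metadata slots are visible from python
print(dt.kind, dt.num, dt.itemsize, dt.alignment)   # e.g. i 5 4 4

arr = np.arange(3, dtype=dt)
# the behavior table is exercised indirectly, e.g. when casting strided element data
print(arr.astype(np.float64))   # [0. 1. 2.]
# the descriptor is stored on the array (arr->descr), not alongside each element
print(arr.dtype == dt)          # True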

But it's also worth noting that right now, it seems that data described by a dtype and data described by a type are not interchangeable. The former is all about void * data of arbitrary structure, while the latter is about PyObject * pointers whose data begins with a PyObject header. So if anything, we should conclude that isinstance(dtype(np.int32), type) should be False (note: this stance would change if we decided to try and unify dtypes with scalar types, but right now we are not trying to do that).
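
Concretely, that is already the status quo today:

import numpy as np

dt = np.dtype(np.int32)
print(isinstance(dt, type))        # False -- a dtype describes raw void * element data
print(isinstance(np.int32, type))  # True  -- the scalar type describes full python objects
print(dt.type is np.int32)         # True  -- related, but deliberately not the same object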

Now let's look at how python uses metatypes at the C level. This happens in only one place, in _ctypes.c, where the following metatypes are defined:

  • CDataType

    • This isn't actually a real metaclass for some reason, but can be thought of as a base class of the following metaclasses
    • Provides static methods
    • Provides operator overloads for ctype * n etc
  • PyCStructType_Type, UnionType_Type

    • defines a custom setattr to handle _fields_ class attributes
  • PyCPointerType_Type

    • adds a set_type class method
  • PyCSimpleType_Type

    • overloads the from_param class method
  • PyCArrayType_Type

  • PyCFuncPtrType_Type

Clearly then, metatypes are the right solution for providing static methods. Almost all of the magic here is in tp_new for these types. TODO: look at tp_new.
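
These mechanics are observable from python; a small sketch (the PyCSimpleType name is a CPython implementation detail):

import ctypes

# the metatype of a ctypes type is visible from python
print(type(ctypes.c_int).__name__)          # 'PyCSimpleType'

# the `ctype * n` operator overload lives on the metatype, so it acts on the type object itself
IntArray4 = ctypes.c_int * 4
print(IntArray4.__name__)                   # 'c_int_Array_4'
print(issubclass(IntArray4, ctypes.Array))  # True

# likewise the from_param class method mentioned above
print(hasattr(ctypes.c_int, 'from_param'))  # True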

Now, we just looked at metatypes, type(some_type). Everything we learnt here we should apply to type(some_dtype).

Crucial point: meta-dtypes are just dtype subclasses, not meta-types.

Comparing scalar types and array types in more depth

Let's build a toy metaclass:

class MyMetaType(type):
    pass
class MyClass(metaclass=MyMetaType):
    pass

and then do some comparisons:

  • Function hook names are C-level slots in:

                        scalar types               array types
      python object o1  type                       np.dtype
      C decl            PyTypeObject PyType_Type   PyTypeObject PyArrayDescr_Type
      invariant         isinstance(o1, type)       isinstance(o1, type)

  • C static methods are stored in:

                        scalar types                   array types
      python object o2  MyMetaType                     np.integral_dtype
      C decl            PyTypeObject MyMetaType_Type   PyTypeObject PyArrayIntegralDescr_Type
      invariant         issubclass(o2, o1)             issubclass(o2, o1)

  • Function hook values and allocation settings are stored by:

                        scalar types                     array types
      python object o3  MyClass                          np.dtype(int)
      C decl            MyMetaType_Object MyClass_Type   PyArrayIntegralDescr_Object my_int_dtype
      invariant         isinstance(o3, o1)               isinstance(o3, o1)
      sometimes...      isinstance(o3, o2)               isinstance(o3, o2)

  • Instances of the type are:

                        scalar types            array types
      python object o4  MyClass()               np.empty(..., np.dtype(int))
      C decl            MyClass_Object my_obj   np_int my_int_element
      invariant         o4.__class__ == o3      o4.dtype == o3
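
The scalar-type column of these tables can be checked directly; the array-type column only partially, since the meta-dtype layer (np.integral_dtype above) is hypothetical. A sketch of what can be checked today:

import numpy as np

class MyMetaType(type):       # the toy metaclass from above
    pass

class MyClass(metaclass=MyMetaType):
    pass

# scalar types
o1, o2, o3 = type, MyMetaType, MyClass
o4 = MyClass()
assert isinstance(o1, type) and issubclass(o2, o1)
assert isinstance(o3, o1) and isinstance(o3, o2)
assert o4.__class__ is o3

# array types (only the rows that exist today)
d1, d3 = np.dtype, np.dtype(int)
d4 = np.empty(3, d3)
assert isinstance(d1, type)   # np.dtype itself is a normal python type
assert isinstance(d3, d1)
assert d4.dtype == d3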

classes needn't be types

Another observation: class definitions do not have to define types:

# np.dtype
class NotAMetaClass:
    def __init__(self, name, bases, dict):
        # set __dict__ first, so that the name attribute below is not thrown away
        self.__dict__ = dict
        self.name = name
        # fill the compiler-created cell so that `__class__` works inside methods
        dict['__classcell__'].cell_contents = self

# my_custom_dtype
class NotAType(metaclass=NotAMetaClass):
    # not tied to `self` conventions, can take anything here
    def foo(arr):
        return __class__.bar
    bar = 'baz'

some_arr = ...
print(NotAType.foo(some_arr))
# baz

So we can support new dtypes defined using the class statement even if issubclass(np.dtype, type) is false.

What this could mean for dtypes

class dtype(object):
    # C level slots in PyArray_Descr today, perhaps with some removed
    type_num: int
    __common_dtype__: Callable["dtype(dtype...)"]

    # or __new__, doesn't matter for this example
    def __init__(dt, name, bases, dict):
        # in C, this is really just direct initialization of slots
        dt.__dict__ = dict
        dt.name = name

        # trick for python `__class__` magic
        dict['__classcell__'].cell_contents = dt

        if bases:
            # inherit slots from the base class
            b, = bases
            if dt.type_num == -1:
                dt.type_num = b.type_num
            if dt.__common_dtype__ == NULL:
                dt.__common_dtype__ = b.__common_dtype__
                
    def __call__(dt, value):
        return np.asarray(value, dtype=dt)

meta-dtypes:

class integral_dtype(np.dtype):
    # C level slots in PyArrayIntegralDescr_Object, which start with `PyArrayDescr_Object`
    endianness: char
    signed: bool
    
    # C storage in PyArrayIntegralDescr_Type, which starts with `PyType_Type`
    _lookup_dict: dict

    def __init__(dt, name, bases, dict):
        # meta-dtypes are regular classes, so super() works just fine!
        super().__init__(name, bases, dict)
        if bases:
            # inherit slots from the base class
            b, = bases
            # make sure to inherit the new slots too
            if dt.endianness is None:
                dt.endianness = b.endianness

meta-dtype instantiations: just plain old dtypes

class integer(metaclass=integral_dtype):
    # this provides default values for the slots in the meta-dtype integral_dtype
    def __common_dtype__(dt, other_dt):
        if not isinstance(other_dt, integral_dtype):
            return NotImplemented
        if dt.signed == other_dt.signed:
            return integral_dtype._lookup_dict[max(dt.itemsize, other_dt.itemsize), dt.signed]
        else:
            return integral_dtype._lookup_dict[max(dt.itemsize, other_dt.itemsize) + 1, True]

class uint8(integer):
    type_num = 1  # this slot is from PyArrayDescr
    # these slots are from PyArrayIntegralDescr_Object
    signed = False
    endianness = '='
integral_dtype._lookup_dict[1, False] = uint8

assert not isinstance(uint8, type)
assert isinstance(uint8, np.dtype)
assert isinstance(uint8, integral_dtype)

dtype subclassing means "fill in slots from my parent" just like it does in python. This is handled by dtype.__init__

class uint8_non_native(uint8):
    endianness = 'S'

dtype instantiation (not to be confused with invoking np.dtype itself) could be made to mean array-with-this-dtype instantiation

arr = uint8_non_native([1, 2, 3])
assert arr.dtype == uint8_non_native