Partially-structured thoughts in response to https://numpy.org/neps/nep-0041-improved-dtype-support.html


Quoting @teoliphant from the mailing list

But, this is the right way to connect the data type system with the rest of Python typing.  NumPy's current dtypes are currently analogous to Python 1's user-defined classes.  In Python 1 all user-defined classes were instances of a single Class Type at the C-level, just like currently all NumPy dtypes are instances of a single Dtype "Type" in Python.

I'm not really following here. In python 3.x, all user-defined classes are instances of type at the python level. At the C level, almost all classes have the layout of either PyTypeObject or PyHeapTypeObject (note that while the PyHeapTypeObject layout is sometimes used, it is not a different python type!).

Shifting Dtypes to be true types (by making them instances of a single low-level MetaType) is (IMHO) exactly the right approach.

Perhaps it's worth asking ourselves exactly what a C-level type instance is for in python. To be clear, that refers to PyTypeObject structs, not to instances of type. I'd say they:

  • Define some metadata about the type (tp_name, tp_flags, tp_dict)
  • Provide a table of behaviors, including:
    • fast-paths for magic methods (tp_as_number, etc)
    • arbitrary methods with string names (tp_methods etc)
  • ... for operating on a single object pointer whose data contains a pointer to the type itself (self->ob_type)
  • ... which is allocated and deallocated using hooks (tp_alloc) and metadata (tp_basicsize) in the type
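
To make that concrete, here is how those C-level pieces surface at the python level today (a rough illustration using only existing builtins, nothing new):

x = 3
t = type(x)                # what self->ob_type points at in C
print(t.__name__)          # tp_name
print(t.__add__(x, 4))     # the tp_as_number->nb_add fast path: 7
print(t.bit_length(x))     # a tp_methods entry looked up by name: 2
print(t.__basicsize__)     # tp_basicsize, consulted by tp_alloc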

Let's compare that to today's dtypes, which:

  • Define some metadata about the type (kind, type_num, ...)
  • Provide a table of behaviors, including:
    • fast-paths for "magic" methods (the f slot)
  • ... for operating on either:
    • contiguous/strided 1D arrays of object data
    • single pointers to object data
    with the dtype itself stored elsewhere (arr->descr)
  • ... which is allocated using metadata (itemsize, alignment)

At a glance it already seems like we're close to the python model, without changing anything at all.
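
For reference, most of that dtype metadata is already visible from python (this uses only existing NumPy attributes; the behavior table f is only reachable from C):

import numpy as np

dt = np.dtype(np.int32)
print(dt.kind, dt.num)            # metadata: 'i' and the type_num
print(dt.itemsize, dt.alignment)  # allocation metadata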

But it's also worth noting that right now, it seems that data described by a dtype and data described by a type are not interchangeable. The former is all about void * data of arbitrary structure, while the latter is all about PyObject * pointers whose pointed-to data begins with a PyObject header. So if anything, we should conclude that isinstance(dtype(np.int32), type) should be False (note: this stance would change if we decided to try and unify dtypes with scalar types, but right now we are not trying to do that).
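
That conclusion matches the status quo, which is easy to check:

import numpy as np

assert not isinstance(np.dtype(np.int32), type)
assert isinstance(np.dtype(np.int32), np.dtype)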

Now let's look at how python uses metatypes at the C level. This happens in only one place, in _ctypes.c, where the following metatypes are defined:

  • CDataType
    • This isn't actually a real metaclass for some reason, but can be thought of as a base class of the following metaclasses
    • Provides static methods
    • Provides operator overloads for ctype * n etc
  • PyCStructType_Type, UnionType_Type
    • defines a custom setattr to handle _fields_ class attributes
  • PyCPointerType_Type
    • adds a set_type class method
  • PyCSimpleType_Type
    • overloads the from_param class method
  • PyCArrayType_Type
  • PyCFuncPtrType_Type

Clearly then, metatypes are the right solution for providing static methods. Almost all of the magic here is in tp_new for these types. TODO: look at tp_new.
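
Some of that machinery is visible from python without reading _ctypes.c (this uses only existing ctypes behavior):

import ctypes

print(type(ctypes.c_int))           # the C-level metatype: <class '_ctypes.PyCSimpleType'>
IntArray4 = ctypes.c_int * 4        # `ctype * n`, an operator overload on the metatype
print(IntArray4)                    # a brand new array type
print(ctypes.c_int.from_param(3))   # a classmethod provided through the metatype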

Now, we just looked at metatypes, type(some_type). Everything we learnt here we should apply to type(some_dtype).

Crucial point: meta-dtypes are just dtype subclasses, not meta-types.

Comparing scalar types and array types in more depth

Let's build a toy metaclass:

class MyMetaType(type):
    pass
class MyClass(metaclass=MyMetaType):
    pass

and then do some comparisons:

  • Function hook names are C-level slots in:

                     scalar types                    array types
      python object  o1 = type                       o1 = np.dtype
      C decl         PyTypeObject PyType_Type        PyTypeObject PyArrayDescr_Type
      invariant      isinstance(o1, type)            isinstance(o1, type)

  • C static methods are stored in:

                     scalar types                    array types
      python object  o2 = MyMetaType                 o2 = np.integral_dtype
      C decl         PyTypeObject MyMetaType_Type    PyTypeObject PyArrayIntegralDescr_Type
      invariant      issubclass(o2, o1)              issubclass(o2, o1)

  • Function hook values and allocation settings are stored by:

                     scalar types                    array types
      python object  o3 = MyClass                    o3 = np.dtype(int)
      C decl         MyMetaType_Object MyClass_Type  PyArrayIntegralDescr_Object my_int_dtype
      invariant      isinstance(o3, o1)              isinstance(o3, o1)
      sometimes...   isinstance(o3, o2)              isinstance(o3, o2)

  • Instances of the type are:

                     scalar types                    array types
      python object  o4 = MyClass()                  o4 = np.empty(..., np.dtype(int))
      C decl         MyClass_Object my_obj           np_int my_int_element
      invariant      o4.__class__ == o3              o4.dtype == o3
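
The rows that exist today can be checked directly; the np.integral_dtype / PyArrayIntegralDescr pieces are hypothetical, so the o2 row on the array side is skipped:

import numpy as np

class MyMetaType(type):
    pass
class MyClass(metaclass=MyMetaType):
    pass

# scalar-type column
o1, o2, o3, o4 = type, MyMetaType, MyClass, MyClass()
assert isinstance(o1, type) and issubclass(o2, o1)
assert isinstance(o3, o1) and isinstance(o3, o2)
assert o4.__class__ == o3

# array-type column (existing rows only)
p1, p3 = np.dtype, np.dtype(int)
p4 = np.empty(3, p3)
assert isinstance(p1, type)
assert isinstance(p3, p1)
assert p4.dtype == p3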

classes needn't be types

Another observation: class definitions do not have to define types:

# np.dtype
class NotAMetaClass:
    def __init__(self, name, bases, dict):
        # adopt the class body's namespace directly, then record the name
        self.__dict__ = dict
        self.name = name
        # trick for the python `__class__` magic: point the implicit closure
        # cell at this instance (cell_contents is writable since python 3.7)
        dict['__classcell__'].cell_contents = self

# my_custom_dtype
class NotAType(metaclass=NotAMetaClass):
    # not tied to `self` conventions, can take anything here
    def foo(arr):
        return __class__.bar
    bar = 'baz'

some_arr = ...
print(NotAType.foo(some_arr))
# baz
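
And the object produced by that class statement is not a type at all:

assert not isinstance(NotAType, type)
assert isinstance(NotAType, NotAMetaClass)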

So we can support new dtypes defined using the class statement even if issubclass(np.dtype, type) is false.

What this could mean for dtypes

class dtype(object):
    # C level slots in PyArray_Descr today, perhaps with some removed
    type_num: int
    __common_dtype__: Callable["dtype(dtype...)"]

    # or __new__, doesn't matter for this example
    def __init__(dt, name, bases, dict):
        # in C, this is really just direct initialization of slots
        dt.__dict__ = dict
        dt.name = name

        # trick for python `__class__` magic
        dict['__classcell__'].cell_contents = dt

        if bases:
            # inherit slots from the base class
            b, = bases
            if dt.type_num == -1:
                dt.type_num = b.type_num
            if dt.__common_dtype__ == NULL:  # i.e. not filled in at the C level
                dt.__common_dtype__ = b.__common_dtype__

    def __call__(dt, value):
        return np.asarray(value, dtype=dt)

meta-dtypes:

class integral_dtype(np.dtype):
    # C level slots in PyArrayIntegralDescr_Object, which starts with `PyArrayDescr_Object`
    endianness: char
    signed: bool

    # C storage in PyArrayIntegralDescr_Type, which starts with `PyType_Type`
    _lookup_dict: dict

    def __init__(dt, name, bases, dict):
        # meta-dtypes are regular types so super works just fine!
        super().__init__(name, bases, dict)
        if bases:
            # inherit slots from the base class
            b, = bases
            # make sure to inherit the new slots too
            if dt.endianness is None:
                dt.endianness = b.endianness

meta-dtype instantiations: just plain old dtypes

class integer(metaclass=integral_dtype):
    # this provides default values for the slots in the meta-dtype integral_dtype
    def __common_dtype__(dt, other_dt):
        if not isinstance(other_dt, integral_dtype):
            return NotImplemented
        if dt.signed == other_dt.signed:
            return integral_dtype._lookup_dict[max(dt.itemsize, other_dt.itemsize), dt.signed]
        else:
            return integral_dtype._lookup_dict[max(dt.itemsize, other_dt.itemsize) + 1, True]

class uint8(integer):
    type_num = 1  # this slot is from PyArrayDescr
    # these slots are from PyArrayIntegralDescr_Object
    signed = False
    endianness = '='
integral_dtype._lookup_dict[1, False] = uint8

assert not isinstance(uint8, type)
assert isinstance(uint8, np.dtype)
assert isinstance(uint8, integral_dtype)

dtype subclassing means "fill in slots from my parent" just like it does in python. This is handled by dtype.__init__

class uint8_non_native(uint8):
    endianness = 'S'  # 'S' as in dtype.newbyteorder('S'), i.e. byte-swapped

dtype instantiation (not to be confused with invoking np.dtype itself) could be made to mean array-with-this-dtype instantiation

arr = uint8_non_native([1, 2, 3])
assert arr.dtype == uint8_non_native