The problem we’re trying to solve is a common one: giving Python objects a schema. Of course we could use plain dictionaries, but that quickly becomes unergonomic. Until recently, the default idiom was to do one of three things:
- Write plain Python classes, with all the __init__ boilerplate that entails.
- Use NamedTuples, which only work if your objects are immutable.
- Use an ORM (like Django's), which only makes sense if you also need the other features of an ORM (db storage, CRUD, ACID transactions under concurrency, etc.), i.e. in a web app.
pydantic and data classes are simple, modern solutions to this problem.
Let’s say we’re writing code that needs to represent simple shapes in a 2D grid. The underlying data is relatively simple. For example, a circle is defined by three numbers, the (x, y) coordinates of its center and its radius r.
Let's look at different ways of implementing the same thing.
If you like it more verbose:
point = {
    "x": 12.34,
    "y": 98.76,
}

circle = {
    "center": {
        "x": 12.34,
        "y": 98.76,
    },
    "radius": 10.0,
}
And if you like it more terse:
point = (12.34, 98.76) # [0] is x, [1] is y
circle = ((12.34, 98.76), 10.0) # [0] is point, [1] is radius
Here there’s barely any enforcement of anything. All sorts of clearly wrong values will fly through and silently produce wrong results.
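To make that concrete, here is a small sketch (the bad values and the area function are made up for illustration) of how structurally wrong data sails through construction and only fails, or silently misbehaves, far from the actual bug:

```python
# With bare dicts/tuples, nothing checks the shape or types of the data.
# Both of these "work" at construction time:
bad_circle = {"center": {"x": 12.34}, "radius": "ten"}  # missing y, string radius
bad_point = (12.34, 98.76, 0.0)                         # extra coordinate, unnoticed

def area(circle):
    # Blows up only when called, with a TypeError pointing here,
    # not at the line where the bad dict was built.
    return 3.14159 * circle["radius"] ** 2
```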
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
Circle = namedtuple('Circle', ['center', 'radius'])
p = Point(12.34, 98.76)
p.x # => 12.34
p.y # => 98.76
C = Circle(center=p, radius=10.0) # can specify named fields
C.center.x # => 12.34
Here, for example, a mistake like p = Point(1, 2, 3) would be caught, but p = Point(1, [2, 3]) would not: field names and arity are enforced, field types are not.
class Point:
    def __init__(self, x, y):
        self.x = float(x)
        self.y = float(y)

class Circle:
    def __init__(self, center, radius):
        assert isinstance(center, Point)
        self.center = center
        self.radius = float(radius)
Usage is then exactly like with NamedTuples (that's the appeal of NamedTuples). The main improvement in terms of typing and schema is that we can write custom validation logic in __init__, but at the cost of all this boilerplate.
If you happen to need an ORM for database storage, then this is a simple, solved problem. For example, in Django you'd say:

from django.db import models

class Point(models.Model):
    x = models.FloatField(...)
    y = models.FloatField(...)

class Circle(models.Model):
    radius = models.FloatField(...)
    center = models.ForeignKey(Point, ...)
Then the magic of the base class is to treat these class attributes as the specification for instance attributes and add initialization code and validation accordingly. You'd use it like so:
p = Point(x=12.34, y=98.76)
C = Circle(center=p, radius=10.0)
The problem here, of course, is what to do in all the cases where we need type enforcement/validation outside of web apps!
Thanks to Python 3.6+ type hints, we now have a few more concise, and overall better, solutions for this problem:
- pydantic: works with Python 3.6
- data classes: were introduced in the stdlib in Python 3.7. They are very close in spirit and usage to pydantic (and to a good extent interoperable), but pydantic still provides more features than the stdlib Data Classes.
They both use the general idea of ORMs: "fields" are declared as class attributes, and a base class or decorator provides the validation/enforcement logic on instances.
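As a toy illustration of that pattern (a deliberately minimal sketch, not how pydantic or dataclasses are actually implemented), a base class can read the declared annotations and enforce them on construction:

```python
class SchemaBase:
    """Toy base class: treats class-level annotations as the schema."""
    def __init__(self, **fields):
        for name, typ in self.__class__.__annotations__.items():
            if name not in fields:
                raise TypeError(f"missing field: {name}")
            value = fields[name]
            if not isinstance(value, typ):
                raise TypeError(
                    f"{name} must be {typ.__name__}, got {type(value).__name__}"
                )
            setattr(self, name, value)

class Point(SchemaBase):
    x: float
    y: float

p = Point(x=12.34, y=98.76)   # ok
# Point(x="oops", y=0.0)      # would raise TypeError
```

Real libraries handle far more (defaults, nesting, coercion, serialization), but the core move is the same: class attributes declare the fields, shared machinery enforces them on every instance.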
Here's the same code as above, using pydantic:
from pydantic import BaseModel

class Point(BaseModel):
    x: float
    y: float

class Circle(BaseModel):
    center: Point
    radius: float

p = Point(x=12.34, y=98.76)
C = Circle(center=p, radius=10.0)
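Unlike the hand-rolled class, pydantic actually checks (and, where sensible, coerces) values at construction time. A quick sketch of what that buys you:

```python
from pydantic import BaseModel, ValidationError

class Point(BaseModel):
    x: float
    y: float

# Values are validated and coerced at construction:
p = Point(x="12.34", y=98.76)  # the string "12.34" is coerced to a float
assert p.x == 12.34

# Structurally wrong values are rejected immediately, at the source of the bug:
try:
    Point(x="not a number", y=0.0)
except ValidationError:
    print("rejected: x must be a float")
```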
The Data Class implementation is almost identical to the pydantic one: just swap the pydantic base class BaseModel for the class decorator @dataclass.
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Circle:
    center: Point
    radius: float

p = Point(x=12.34, y=98.76)
C = Circle(center=p, radius=10.0)
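One caveat worth knowing, and part of why pydantic offers more than the stdlib: the type hints on a Data Class are not enforced at runtime. A small sketch of the difference:

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

# The annotations are read by static tools (mypy, IDEs), but at runtime
# a dataclass accepts any value without complaint:
p = Point(x="not a number", y=98.76)
print(p.x)  # 'not a number'
```

The same constructor call on the pydantic model above would raise a validation error.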