@amirkdv
Last active December 3, 2021 16:23
Intro: Dataclasses and Pydantic

Python Objects with a Schema: Data Classes and pydantic

Motivation

The problem we’re trying to solve is a common one: giving Python objects a schema. Of course we can use plain dictionaries, but that quickly becomes unergonomic. Until fairly recently, the default idiom was to do one of three things:

  • Plain Python classes with __init__ boilerplate.
  • NamedTuples, which only work if your objects are immutable.
  • An ORM (like django’s), which only makes sense if you also need the other features of an ORM (db storage, CRUD, ACID transactions under concurrency, etc.), i.e. in a web app.

pydantic and data classes are simple, modern solutions to this problem.

A Toy Problem

Let’s say we’re writing code that needs to represent simple shapes in a 2D grid. The underlying data is relatively simple. For example, a circle is defined by three numbers, the (x, y) coordinates of its center and its radius r.

Let's look at different ways of implementing the same thing.

The Old Ways

(1) tuples and dicts

If you like it more verbose:

point = {
  "x": 12.34,
  "y": 98.76,
}

circle = {
  "center": {
    "x": 12.34,
    "y": 98.76
  },
  "radius": 10.0,
}

And if you like it more terse:

point = (12.34, 98.76)             # [0] is x, [1] is y
circle = ((12.34, 98.76), 10.0)    # [0] is point, [1] is radius

Here there’s barely any enforcement of anything. All sorts of clearly wrong values will fly through and silently produce wrong results.
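
To make this concrete, here is a small sketch of a bad value slipping through. The diameter helper and the bad inputs are made up for illustration; they are not part of the original example.

```python
def diameter(circle):
    # Hypothetical helper: expects the dict layout shown above.
    return 2 * circle["radius"]


good = {"center": {"x": 12.34, "y": 98.76}, "radius": 10.0}
bad = {"center": {"x": 12.34, "y": 98.76}, "radius": "10"}  # radius is a string

print(diameter(good))  # 20.0
print(diameter(bad))   # 1010 (string repetition, no error raised)

# The terse tuple form is just as permissive: swapped fields are still a valid tuple.
point = (98.76, 12.34)  # meant to be (x, y), but nothing can tell
```

Nothing fails at the point where the bad value is created; the wrong result only shows up later, far from its cause.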

(2) NamedTuples

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
Circle = namedtuple('Circle', ['center', 'radius'])

p = Point(12.34, 98.76)
p.x # => 12.34
p.y # => 98.76

C = Circle(center=p, radius=10.0) # can specify named fields
C.center.x # => 12.34

Here, for example, a mistake like p = Point(1, 2, 3) is caught (wrong number of fields), but p = Point(1, [2, 3]) is not, because field types are never checked.
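
A quick sketch of both cases, plus the immutability mentioned in the motivation. The variable names holding the caught exceptions are only there for illustration.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

# Wrong number of fields: rejected at construction time.
arity_error = None
try:
    Point(1, 2, 3)
except TypeError as exc:
    arity_error = exc

# Wrong type of field: accepted without complaint.
p = Point(1, [2, 3])
print(p.y)  # [2, 3]

# Instances are immutable: assigning to a field raises AttributeError.
mutation_error = None
try:
    p.x = 5
except AttributeError as exc:
    mutation_error = exc
```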

(3) Plain Old Classes

class Point:
    def __init__(self, x, y):
        self.x = float(x)
        self.y = float(y)

class Circle:
    def __init__(self, center, radius):
        assert isinstance(center, Point)
        self.center = center
        self.radius = float(radius)

Usage is then exactly like with NamedTuples (that's the appeal of NamedTuples). The main improvement in terms of typing and schema is that we can write custom validation logic in __init__. The cost is all this boilerplate, which has to be repeated for every field of every class.
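
The sketch below repeats the Point class from above to show what the hand-written __init__ buys us and what is still missing. The instance names a and b are arbitrary.

```python
class Point:
    def __init__(self, x, y):
        self.x = float(x)
        self.y = float(y)


a = Point(1, 2)
b = Point(1, 2)

# float() coerces compatible input and rejects the rest.
print(a.x)  # 1.0
error = None
try:
    Point("abc", 2)
except ValueError as exc:
    error = exc

# Without a hand-written __eq__, instances compare by identity,
# and without __repr__ they print as an opaque "<... Point object at 0x...>".
print(a == b)  # False
```

So validation is covered, but equality, a readable repr, and similar conveniences would each need more boilerplate.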

(*) ORMs

If you happen to need an ORM for database storage anyway, then this is a solved problem. For example, in django you'd write:

from django.db import models

class Point(models.Model):
    x = models.FloatField(...)
    y = models.FloatField(...)


class Circle(models.Model):
    radius = models.FloatField(...)
    center = models.ForeignKey(Point, ...)

Then the magic of the base class is to treat these class attributes as the specification for instance attributes and add initialization code and validation accordingly. You'd use it like so:

p = Point(x=12.34, y=98.76)
C = Circle(center=p, radius=10.0)

The problem here, of course, is what to do in all the cases where we need type enforcement/validation outside of web apps!

New Ways

Thanks to Python 3.6+ type hints, we now have a few more concise, and overall better, solutions for this problem:

  • pydantic: works with Python 3.6
  • data classes: added to the stdlib in Python 3.7. They are very close in spirit and usage to pydantic (and to a good extent interoperable), but pydantic still provides more features than the stdlib data classes.

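They both borrow the general idea of ORMs: "fields" are declared as class attributes, and a base class (pydantic) or decorator (data classes) turns those declarations into initialization logic for instances. One difference to keep in mind: pydantic validates values against the declared types at runtime, while stdlib data classes generate the boilerplate but do not enforce the annotations themselves.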
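
To demystify the idea, here is a toy version of such a decorator. It is a sketch for illustration only: the name schema and its strict isinstance check are assumptions of this example, not how pydantic or data classes are actually implemented.

```python
def schema(cls):
    """Toy decorator (hypothetical): derive __init__ from class annotations."""
    fields = dict(cls.__annotations__)  # e.g. {"x": float, "y": float}

    def __init__(self, **kwargs):
        if set(kwargs) != set(fields):
            raise TypeError(f"expected fields {sorted(fields)}, got {sorted(kwargs)}")
        for name, expected_type in fields.items():
            value = kwargs[name]
            # Strict check for illustration only: an int is not accepted as a float.
            if not isinstance(value, expected_type):
                raise TypeError(f"{name} must be {expected_type.__name__}")
            setattr(self, name, value)

    cls.__init__ = __init__
    return cls


@schema
class Point:
    x: float
    y: float


p = Point(x=12.34, y=98.76)

try:
    Point(x="12.34", y=98.76)
except TypeError as exc:
    print(exc)  # x must be float
```

The class body only declares the fields; everything else is generated from the annotations.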

(4) pydantic

Here's the same code as above, using pydantic:

from pydantic import BaseModel

class Point(BaseModel):
    x: float
    y: float


class Circle(BaseModel):
    center: Point
    radius: float

p = Point(x=12.34, y=98.76)
C = Circle(center=p, radius=10.0)
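
What the base class adds becomes visible once we pass in imperfect data. The sketch below assumes pydantic is installed and uses its default (non-strict) behaviour; the input values are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class Point(BaseModel):
    x: float
    y: float


class Circle(BaseModel):
    center: Point
    radius: float


# Compatible input is coerced to the declared type.
p = Point(x="12.34", y=98)
print(p.x, p.y)  # 12.34 98.0

# Nested models can be built from plain dicts.
c = Circle(center={"x": 0, "y": 0}, radius=10)
print(type(c.center).__name__)  # Point

# Incompatible input raises ValidationError.
error = None
try:
    Point(x="not a number", y=0)
except ValidationError as exc:
    error = exc
```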

(5) Data Classes

The Data Class implementation is almost identical to the pydantic one: just swap the pydantic base class BaseModel for the class decorator @dataclass.

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float


@dataclass
class Circle:
    center: Point
    radius: float

p = Point(x=12.34, y=98.76)
C = Circle(center=p, radius=10.0)
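
One practical difference from pydantic is worth a sketch: the decorator generates __init__, __repr__ and __eq__, but it does not check values against the annotations. CheckedPoint below is a hypothetical variant showing one way to add checks by hand via __post_init__.

```python
from dataclasses import dataclass


@dataclass
class Point:
    x: float
    y: float


# Generated __repr__ and __eq__ come for free.
print(Point(1.0, 2.0))                     # Point(x=1.0, y=2.0)
print(Point(1.0, 2.0) == Point(1.0, 2.0))  # True

# Annotations are not enforced at runtime: this is accepted as-is.
p = Point(x="not a number", y=None)


@dataclass
class CheckedPoint:  # hypothetical variant with manual validation
    x: float
    y: float

    def __post_init__(self):
        # Runs at the end of the generated __init__; float() raises on bad input.
        self.x = float(self.x)
        self.y = float(self.y)


error = None
try:
    CheckedPoint(x="not a number", y=0)
except ValueError as exc:
    error = exc
```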