Introspection in SQLAlchemy:
Reflecting upon the Magnum Opus

The layering of orthogonal concepts within SQLAlchemy lends itself to deep introspection. These capabilities can be used for a variety of purposes including debugging and concise expression of programmatic intent. The detailed introspection API added in version 0.8 can be very useful in several scenarios. Previously, while these introspection capabilities were available, they were mostly undocumented and without official support. We'll cover some deeper parts of this API through the investigation of an application bug and the addition of a common feature. First, though, it might be best to glance at the surface.

Transmutation Ingredients

SQLAlchemy is a comprehensive database interface tool that is split into several components. The most obvious distinction is between 'SQLAlchemy Core' and the 'SQLAlchemy ORM'. Both the Core and ORM themselves are greatly subdivided into several layers, though the primary focus in this article is the ORM's internals. In addition, it's important to note the separation of the ORM from the declarative extension. The declarative extension adds the declarative base class and other niceties, but ultimately it is just another layer.

Session

A primary focal point of SQLAlchemy is the fabled "db session". This object is the key to interacting with the ORM (during model usage, rather than creation) since nearly all of the heavy lifting is done in a way that is rooted to a single Session. This Session does several things that are mostly behind the scenes, but all ORM object instances ultimately hold a reference back to it.

The Session object is responsible for storing and synchronizing the in-memory Python object instances with the database's current state. One important shortcut (normally) taken by SQLAlchemy is to assume that all interaction with the session takes place in the context of a transaction. This allows SQLAlchemy to batch updates, maintain its identity map, and issue queries that return accurate results while only communicating with the database when needed.
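As a minimal sketch of that identity-map behavior (the MyObj model and its attributes are invented for illustration; assumes SQLAlchemy with an in-memory SQLite database), querying the same row twice through one Session yields the very same Python object:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy >= 1.4
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class MyObj(Base):
    __tablename__ = 'my_obj'
    id = Column(Integer, primary_key=True)
    first_attr = Column(String(50))

engine = create_engine('sqlite://')  # in-memory database
Base.metadata.create_all(engine)
db_session = sessionmaker(bind=engine)()

db_session.add(MyObj(id=1, first_attr='initial'))
db_session.commit()

# Two queries for the same primary key return the *same* instance,
# because the Session's identity map deduplicates rows per transaction.
a = db_session.query(MyObj).filter_by(id=1).one()
b = db_session.query(MyObj).filter_by(id=1).one()
assert a is b
```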

Flushing

In common use of SQLAlchemy, communication with the database is delayed until "needed". In particular, this means that

```python
inst = db_session.query(MyObj).get(1)
inst.first_attr = "hello"
```

does not execute an UPDATE statement. Instead, the new value of 'first_attr' is recorded in a "need to flush" area and sent in an UPDATE statement when a flush occurs. Flushes are either explicit (session.flush()) or automatic (run before each query, including SELECT queries). In addition, a flush is always executed before a commit. Autoflush exists to ensure that changing an object and then querying for it returns the correct result, since before the flush the database is unaware of in-memory modifications. In other words, if one ran the above code and then ran db_session.query(MyObj).filter_by(first_attr="hello") with autoflush=False, the object would not be returned; with autoflush=True, a .flush() call would be executed first, allowing the DB to notice that this object meets the criteria.
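The autoflush behavior can be demonstrated directly. This sketch assumes SQLAlchemy with an in-memory SQLite database; MyObj is a hypothetical mapped class with an id and a first_attr column:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy >= 1.4
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class MyObj(Base):
    __tablename__ = 'my_obj'
    id = Column(Integer, primary_key=True)
    first_attr = Column(String(50))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
db_session = sessionmaker(bind=engine)()
db_session.add(MyObj(id=1, first_attr='initial'))
db_session.commit()

inst = db_session.query(MyObj).filter_by(id=1).one()
inst.first_attr = 'hello'

# With autoflush suppressed, the pending UPDATE is not sent, so the
# database still holds the old value and the query finds nothing.
with db_session.no_autoflush:
    found = db_session.query(MyObj).filter_by(first_attr='hello').all()
assert found == []

# With autoflush on (the default), the query triggers a flush first,
# so the database now reflects the in-memory modification.
found = db_session.query(MyObj).filter_by(first_attr='hello').all()
assert found == [inst]
```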

InstanceState

Every model instance has an associated InstanceState instance, which is the actual store for these values. In particular, the current (possibly not-yet-flushed) values live in the InstanceState's .dict attribute, while the previously flushed values (*if they were loaded*) are recorded per modified attribute in .committed_state (a somewhat confusing name). The official API for accessing this data, however, is the History interface. This interface presents the old and new values in a much more convenient way, and is reached via the inspection API.

istate = inspect(inst) returns an InstanceState. istate.attrs returns a "namespace" (a dict-like object) mapping attribute names to AttributeState instances. Each AttributeState exposes a history attribute, which returns the History object; this is the "official" interface to the old and new pre-flush values.
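Putting those pieces together (again with a hypothetical MyObj model and an in-memory SQLite database; assumes SQLAlchemy is installed):

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy >= 1.4
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class MyObj(Base):
    __tablename__ = 'my_obj'
    id = Column(Integer, primary_key=True)
    first_attr = Column(String(50))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
db_session = sessionmaker(bind=engine)()
db_session.add(MyObj(id=1, first_attr='initial'))
db_session.commit()

inst = db_session.query(MyObj).filter_by(id=1).one()
inst.first_attr = 'changed'

istate = inspect(inst)              # InstanceState
astate = istate.attrs.first_attr    # AttributeState
hist = astate.history               # History(added, unchanged, deleted)
assert hist.has_changes()
assert list(hist.added) == ['changed']    # pending, not yet flushed
assert list(hist.deleted) == ['initial']  # the previously loaded value
```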

Alchemical Calcination 1

In resolving bugs, one must first investigate and determine their cause. In a bug I resolved recently, a logically unchanged object was causing SQLAlchemy to emit an UPDATE statement, which in turn caused the database to bump a last-modified timestamp column. In this case, an application of inspect(), InstanceState, AttributeState, and History just before db_session.commit() was very useful in spotting the issue:

```python
>>> dict([(k, v.history) for k, v in inspect(model_instance).attrs.items() if v.history.has_changes()])
{u'location_id': History(added=['2'], unchanged=(), deleted=[2L])}
```

Given a model instance, we inspect() it, which returns an InstanceState instance. This tells us about the state of the object in its session (pending, detached, etc.) and has details about its attributes. Accessing the attrs attribute returns a "namespace", which behaves more or less like a dict. Its keys are the names of persisted attributes for our instance, and its values are AttributeState objects. The AttributeState object's history attribute gives us access to a History object, which records changes that have not yet been persisted to the database. In particular, it is these History objects that contain the details of state that is pending but not yet persisted via a flush operation.

It is worthwhile to note that this history API is generally only useful pre-flush, because it is during the flush that UPDATE or INSERT statements are issued and the history is reset. That being said, the above could integrate quite nicely with a session before_flush listener (or a simple breakpoint).
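Such a listener might look like the following sketch (listener and model names are invented; assumes SQLAlchemy with an in-memory SQLite database). It records each dirty object's pending history before the flush issues its statements:

```python
from sqlalchemy import Column, Integer, String, create_engine, event, inspect
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base  # SQLAlchemy >= 1.4
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class MyObj(Base):
    __tablename__ = 'my_obj'
    id = Column(Integer, primary_key=True)
    first_attr = Column(String(50))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
db_session = sessionmaker(bind=engine)()
db_session.add(MyObj(id=1, first_attr='initial'))
db_session.commit()

captured = []

@event.listens_for(db_session, 'before_flush')
def capture_changes(session, flush_context, instances):
    # At this point the history is still intact; the flush has not yet run.
    for obj in session.dirty:
        changes = {k: v.history for k, v in inspect(obj).attrs.items()
                   if v.history.has_changes()}
        if changes:
            captured.append((type(obj).__name__, changes))

inst = db_session.query(MyObj).filter_by(id=1).one()
inst.first_attr = 'updated'
db_session.commit()

assert captured[0][0] == 'MyObj'
assert 'first_attr' in captured[0][1]
```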

Alchemical Multiplication 2

Serialization is a common function added to many declarative base object implementations. Often it will take the name of .as_dict(), .as_json(), or even .__getstate__() for Base classes that would like to support the pickle protocol. Unfortunately, several implementations fall short of achieving various desired outcomes. For example, one may want to serialize an object to json for display on the frontend. However, as soon as different users have different logical "attribute level" permissions to view fields (eg, 'owner', 'salary', or 'home_address'), this one size fits all approach can fall short. In addition, there are several other decisions to make - often an object has dependent children (say, a user has multiple phone numbers). In the json representation, it may be convenient to return the attribute 'phones' as a list of numbers rather than deal with an entirely separate UserPhone object on the frontend. In short, there's no one size fits all solution.

That being said, here's my one size fits all solution. It inspects an object instance and returns a serialized dict. The function is recursive by default, though that can be disabled. Many-to-many relationships are followed and returned as dicts or as a list of ids (depending on arguments). In addition, it takes a filter_func that is called twice per dumped object: once with a dict of attributes (before hitting the database), which can whitelist or add attributes to return, and a second time with the loaded attribute values. This allows a clean logical dump with appropriate filtering based on where it's called.

```python
>>> dump(model_instance)
{'id': 1, 'attr_a': 'a', 'attr_b': 'b'}

>>> dump(model_instance, include_relationships=True)
{'id': 1, 'attr_a': 'a', 'attr_b': 'b', 'foos': [{'id': 1, 'bar': 123}, {'id': 2, 'bar': 456}]}
```
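For example, a filter_func enforcing a simple attribute whitelist might look like this sketch (the PUBLIC_FIELDS set and the public_filter name are invented for illustration):

```python
# Hypothetical whitelist of attributes safe to expose to the frontend.
PUBLIC_FIELDS = {'id', 'attr_a'}

def public_filter(when, obj_type, obj, dct):
    if when == 'before_getattr':
        # Drop non-whitelisted attributes before they are ever loaded,
        # so no extra SQL is issued for values we would discard anyway.
        return {k: v for k, v in dct.items() if k in PUBLIC_FIELDS}
    # 'after_getattr': loaded values pass through unchanged here, but
    # this is where value-level redaction could happen.
    return dct

# Usage (sketch): dump(model_instance, filter_func=public_filter)
```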

Footnotes

  1. http://en.wikipedia.org/wiki/Calcination#Alchemy

  2. http://en.wikipedia.org/wiki/Multiplication_%28alchemy%29

```python
import logging
log = logging.getLogger(__name__)

from sqlalchemy import inspect
from sqlalchemy.orm.attributes import QueryableAttribute
from sqlalchemy.ext.associationproxy import AssociationProxy
from sqlalchemy.ext.declarative import DeclarativeMeta
from sqlalchemy.orm.properties import ColumnProperty, RelationshipProperty
from sqlalchemy.orm.util import identity_key
from sqlalchemy.orm.interfaces import MANYTOMANY  #, MANYTOONE, ONETOMANY


class DepthExceededException(Exception):
    pass


class MemoSeenException(Exception):
    pass


def _is_model_instance(instance):
    """Return True if instance is a SQLAlchemy model instance."""
    model_class = type(instance)
    if isinstance(model_class, DeclarativeMeta):
        # instance is an instance of an instance of DeclarativeMeta :)
        return True
    return False


def _extract_pk(ident_key):
    """Return the primary key portion from an SQLAlchemy identity key.

    Because there can be a composite primary key, this method may return a tuple.
    """
    pk = ident_key[1]
    if len(pk) == 1:
        return pk[0]
    return pk


def _get_class_sa_attribute_dict(obj):
    """Return the 'interesting' objects of a mapped class.

    Specifically, this returns a dict with attr_name as keys and a
    ColumnProperty, RelationshipProperty, or AssociationProxy object as
    values. The dict's values are either SA properties or SA attributes.

    The purpose of this function is to provide input for dump() and
    filter_func() to allow them to conditionally decide how to behave.
    """
    mapper = inspect(type(obj))
    ret_attrs = {}
    for attr_name, attr in mapper.all_orm_descriptors.items():
        if isinstance(attr, QueryableAttribute):
            mapper_prop = attr.property
            if isinstance(mapper_prop, ColumnProperty):
                # normal column
                ret_attrs[attr_name] = mapper_prop
            elif isinstance(mapper_prop, RelationshipProperty):
                # relationship
                ret_attrs[attr_name] = mapper_prop
            else:
                # unknown property type
                pass
        #elif isinstance(attr, hybrid_property):
        #elif isinstance(attr, hybrid_method):
        elif isinstance(attr, AssociationProxy):
            ret_attrs[attr_name] = attr
        else:
            pass  # unknown attribute type
    return ret_attrs


#TODO: Add option to strip the foreign key back to the caller. That is, if you
#      call this on parent, for each child, strip the 'parent_id' field.
def dump(obj, filter_func=None, include_relationships=False, m2m_as_ids=False, depth=10, memo=None):
    """Return a mapped instance's column attributes and their current values as a dict.

    Additionally, if include_relationships == True:

      - Relationship attributes are returned as dicts, obtained by recursively
        calling this function.
      - If m2m_as_ids == True, the values of relationships with a secondary
        table are instead returned as a list of ids to the remote table
        (ignoring the secondary).
      - In addition, the values of AssociationProxy properties are returned.

    This function calls a filter_func, if supplied, that can filter the
    attributes of an instance both before and after the getattr() calls.
    Therefore, one can avoid traversing relationships (and thus issuing sql
    queries whose results would not be used), or filter the output after that
    data has been retrieved.

    The calling convention for the filter_func is:

        def filter_func(when, obj_type, obj, dct):
            return dct

    It will be called with when='before_getattr' and when='after_getattr', and
    must return a dict.

    If this function would recurse on the same object, it instead elects not
    to expand the relationship. For example, when one dumps 'parent', the
    'child' objects will not get the parent backref expanded.
    """
    if depth < 0:
        raise DepthExceededException()
    memo = memo or set()
    if obj in memo:
        raise MemoSeenException()
    memo.add(obj)

    def _dump_or_scalar(value):
        if not _is_model_instance(value):
            return value  # not a db instance
        return dump(value, filter_func, include_relationships, m2m_as_ids, depth - 1, memo=memo)

    ret_attrs = _get_class_sa_attribute_dict(obj)
    if filter_func:
        # Filter before touching attributes to prevent extra sql queries.
        # If the filter_func wants to touch an attribute, that's fine.
        ret_attrs = filter_func('before_getattr', type(obj), obj, ret_attrs)
        if ret_attrs is None:
            log.warn('%s should return a dict, not None', filter_func)
    ret = {}
    for attr_name, attr_or_prop in ret_attrs.iteritems():
        if isinstance(attr_or_prop, ColumnProperty):
            ret[attr_name] = _dump_or_scalar(getattr(obj, attr_name))
        elif include_relationships and isinstance(attr_or_prop, RelationshipProperty):
            rprop = attr_or_prop
            try:
                if not rprop.uselist:
                    ret[attr_name] = _dump_or_scalar(getattr(obj, attr_name))
                else:
                    if m2m_as_ids and rprop.direction is MANYTOMANY:
                        # requested ids only
                        ret[attr_name] = [_extract_pk(identity_key(instance=x))
                                          for x in getattr(obj, attr_name)]
                    else:
                        ret[attr_name] = [_dump_or_scalar(x) for x in getattr(obj, attr_name)]
            except (DepthExceededException, MemoSeenException):
                pass
        elif include_relationships and isinstance(attr_or_prop, AssociationProxy):
            v = getattr(obj, attr_name)
            if attr_or_prop.collection_class:
                # Grab the data now so DB access isn't needed later
                v = attr_or_prop.collection_class(v)
            ret[attr_name] = _dump_or_scalar(v)
        elif isinstance(attr_or_prop, RelationshipProperty):
            pass
        elif isinstance(attr_or_prop, AssociationProxy):
            pass
        else:
            raise NotImplementedError("Unknown property type %r" % (attr_or_prop,))
    if filter_func:
        # Filter after touching attributes if the filter_func wants to inspect them.
        ret = filter_func('after_getattr', type(obj), obj, ret)
        if ret is None:
            log.warn('%s should return a dict, not None', filter_func)
    return ret
```