jmandel/data_models_and_coding_systems.md

## data_models_and_coding_systems.md

      
It's a mistake to abstract Coding Systems from Data Models

JCM, 12/27/2011
How models are "supposed to work"

In theory, clinical modellers erect an abstraction barrier to keep the details of coding systems out of their data models.  According to this theory:


"Data Models" describe classes or types of data, along with the properties that class members can/should/must assign.  For instance, let's take a model for "Prescription" specifying that each Prescription has a required start date, optional end date, and a required "drug" field.


"Coding Systems" are separately-maintained sets of values that a data model can "bind" to a given property.  These are often large, multi-purpose systems maintained by teams of analysts working for a major healthcare organization, government, or SDO.  For instance:


RxNorm is a database published by the National Library of Medicine which includes, among other things, codes ("concept identifiers") for each drug that can be prescribed in the USA -- concepts like "20 mg loratadine tablet" or "40 mg/mL amoxicillin suspension".


SNOMED CT is a herculean effort, now maintained by IHTSDO, that specifies over 300,000 concepts covering every area of clinical medicine and related fields.


The idea is that models (and instances) can "refer" to particular coding systems in particular contexts, so that a Prescription data model might specify that the valid drug codes for a Prescription are "the 'clinical drug' concepts from RxNorm".  And if we have a true abstraction barrier, we can swap out one coding system and replace it with another, while keeping our data model constant. For example, if I get bored with RxNorm, I can just say "use the same Prescription data model, but assign each Prescription's 'drug' to a SNOMED drug code."
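
To make the theory concrete, here is a minimal Python sketch of a Prescription model with a swappable binding. The names `Coding`, `Prescription`, and `check_binding`, the system labels, and the code value are all hypothetical placeholders of mine, not part of any standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Coding:
    """A code plus the coding system it comes from."""
    system: str  # e.g. "RxNorm" or "SNOMED CT" (labels are illustrative)
    code: str    # placeholder values below, not real identifiers


@dataclass
class Prescription:
    """The model from the text: required start, optional end, required drug."""
    start_date: date
    drug: Coding
    end_date: Optional[date] = None


def check_binding(rx: Prescription, bound_system: str) -> bool:
    """In the idealized theory, a binding only names the allowed system."""
    return rx.drug.system == bound_system


rx = Prescription(start_date=date(2011, 12, 27),
                  drug=Coding("RxNorm", "EXAMPLE-CODE-1"))
print(check_binding(rx, "RxNorm"))     # True
print(check_binding(rx, "SNOMED CT"))  # False
```

On this view, swapping coding systems means changing one string and leaving `Prescription` untouched.
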
What happens in practice

Just how a data model refers to a drug code is a matter of some difficulty.  Usually a data model will want to say something more specific than "use a SNOMED code."  So our Prescription model might say something like "use a SNOMED code that descends from SNOMED:373873005 (the code for 'drug product')".  This degree of specificity is extremely important and practical, because it prevents me from creating a Prescription instance whose "drug" is assigned to SNOMED:257522005 (the code for 'Recreational watercraft user' -- ?!).
As we begin to get more sophisticated, we'll want to say more than just "Prescriptions must use a SNOMED drug product code."  We'll want to add in further constraints like "... and that drug product code should not be a descendant of SNOMED:102272007 (the code for 'Veterinary drug')."
What happened here?  Our data model, which was originally designed to be agnostic about coding systems, now needs a strategy to specify complex SNOMED expressions as constraints on which codes are allowed in which slots.  Perhaps we can just describe these constraints in human-readable text, and assert that they must be followed -- which is common enough in practice!  But then we have no way to automate the validation of instance data.  So perhaps we can try to keep a formal interface, but abstract it by inventing a constraint language for the job, and applying that constraint language to SNOMED.  But at the very least, our language will need to be designed with an understanding of SNOMED in mind...
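
To see how quickly the constraint logic becomes SNOMED-shaped, here is a rough Python sketch of the two rules above ("descends from drug product" and "does not descend from Veterinary drug"). The three numeric codes are the ones quoted in the text; the `EX:` codes and every parent edge in `PARENTS` are invented for illustration and do not reflect the real SNOMED CT hierarchy.

```python
from typing import Dict, List, Set

# Toy is-a edges (child -> parents). Only the numeric codes come from the
# text; the "EX:" codes and all edges are assumptions made for this example.
PARENTS: Dict[str, List[str]] = {
    "EX:human-tablet": ["373873005"],
    "102272007": ["373873005"],
    "EX:vet-dewormer": ["102272007"],
    "257522005": ["EX:person"],
}


def ancestors(code: str) -> Set[str]:
    """Return every code reachable by following parent edges upward."""
    seen: Set[str] = set()
    stack = list(PARENTS.get(code, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def valid_drug_code(code: str) -> bool:
    """Descends from 'drug product' but not from 'Veterinary drug'."""
    up = ancestors(code)
    return "373873005" in up and "102272007" not in up


print(valid_drug_code("EX:human-tablet"))  # True
print(valid_drug_code("EX:vet-dewormer"))  # False
print(valid_drug_code("257522005"))        # False
```

Even this toy validator has to know that codes form an is-a hierarchy and how to walk it, which is exactly the kind of coding-system knowledge the abstraction barrier was supposed to hide.
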
As it turns out, coding systems also influence data models in subtler ways. For example, take the principle of data normalization:  when we repeat information in multiple places, we give ourselves the opportunity to get out of sync.  When there are conflicts (as there inevitably will be), whom do we trust?  Applying this principle to our Prescription data model:  how should we represent the dose form and strength of a Prescription's drug?  It's tempting to include these as attributes on the Prescription object itself, e.g. by including properties for "dose form" and "strength."  But there are a lot of details to get right: for instance, what's the appropriate strength for a 5/500 mg Vicodin tablet?  Do we need to represent two ingredients, hydrocodone and acetaminophen, each with its own strength?
Making coding systems work for you...

And then we have an insight:  this information already exists in the RxNorm database.  After all, RxNorm isn't just a list of codes with string values:  it's also a treasure trove of structured data specifying every aspect of every drug, including ingredients, strengths, dose forms, pill sizes, colors, and shapes, brand names, generic manufacturers, and a host of other useful information. And it's actively maintained by the National Library of Medicine: errors corrected, new drugs added, old drugs retired!  Clearly we don't want to recapitulate all these details in our own Prescription data model, when we're already referencing a code from RxNorm.  So why not just say:  we'll point to an RxNorm code for each Prescription's drug element, and then our model can concentrate on the Prescription-y stuff that isn't in RxNorm (the start date, provider signature, dosage schedule, and so on).  Now we're really getting some value from RxNorm and the good folks at the National Library of Medicine.
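
As a sketch of what this outsourcing might look like in Python: the Prescription stores only a drug code and asks the coding system for everything else. `DRUG_FACTS` is a hypothetical stand-in for a lookup against RxNorm, and its key is a placeholder rather than a real RxNorm concept identifier; the single entry mirrors the 5/500 mg Vicodin example from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DrugFacts:
    """Details the coding system already maintains for a drug concept."""
    dose_form: str
    ingredients: Tuple[Tuple[str, str], ...]  # (ingredient name, strength)


# Hypothetical stand-in for RxNorm; the key is a placeholder code.
DRUG_FACTS: Dict[str, DrugFacts] = {
    "EXAMPLE-RXNORM-CODE": DrugFacts(
        dose_form="tablet",
        ingredients=(("hydrocodone", "5 mg"), ("acetaminophen", "500 mg")),
    ),
}


@dataclass
class Prescription:
    """Only the Prescription-y fields live here; drug details are looked up."""
    start_date: date
    drug_code: str
    end_date: Optional[date] = None

    def drug_facts(self) -> DrugFacts:
        return DRUG_FACTS[self.drug_code]


rx = Prescription(start_date=date(2011, 12, 27),
                  drug_code="EXAMPLE-RXNORM-CODE")
for name, strength in rx.drug_facts().ingredients:
    print(name, strength)
```

Because dose form and strength are never copied onto the Prescription, there is only one place for them to be right or wrong.
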
... breaks the abstraction barrier!

But what have we done?  We've violated our abstraction barrier by making good use of a coding system!  At this point our data model actually outsources part of its job to the coding system, which means more normalized data, and fewer opportunities to get out of sync. But in the process, we've become opinionated:  we've planted our feet and said "Our Prescription data model uses RxNorm codes to identify drugs."  And suddenly it's easy to tell which fields we can outsource (i.e. fields RxNorm already covers) and what sorts of constraints we need to represent (i.e. those needed to constrain RxNorm).  And developers who want to work with these models know how to get started (i.e. by learning about RxNorm).  These are all really good things!
This is good.

Sure, enlightenment came at the cost of our abstraction barrier. And given the facts, I'd argue that the trade-off is worthwhile. Even more:  I'd argue there was no trade-off in the first place: we only thought our data models were cleanly separated from our coding systems.  In fact, they were intertwined all along.  But this wasn't obvious up front, and working through some examples helps develop an intuition about why.