@lepfhty
Created June 28, 2013 22:47
Data Ninja Tagging


Observation Records

Use Cases

Case 1: Simple Observation

A single observation record should have the following:

  1. a code or ID to identify the type of observation
  2. a human-readable label or description
  3. a measured or observed value
  4. units associated with the measurement
{
  cd: 1,
  desc: "Heart Rate",
  result: 65,
  units: "bpm"
}

Case 2: Multiple Values in a Single Record (e.g., ISM blood pressure)

The field value1 contains Systolic Blood Pressure and value2 contains Diastolic Blood Pressure.

{
  cd: 2,
  desc: "BP",
  value1: 120,
  value2: 90,
  value1units: "mmHg",
  value2units: "mmHg"
}

Case 3: Special Parsing Required (e.g., Cedars base excess)

The string NEG 12 should evaluate to the integer -12

{
  cd: 3,
  desc: "Base Excess - Arterial",
  result: "NEG 12",
  units: "mEq/L"
}
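
A minimal sketch of this parsing rule, assuming the only special form is a leading `NEG ` prefix. The helper name `parseNegPrefixed` is hypothetical and not part of the original design.

```javascript
// Hypothetical helper: convert strings such as "NEG 12" to the number -12.
// Assumes the only special form is a leading "NEG " prefix.
function parseNegPrefixed(value) {
  if (typeof value === 'number') return value; // already numeric, nothing to parse
  return Number(String(value).trim().replace(/^NEG\s+/, '-'));
}

console.log(parseNegPrefixed('NEG 12')); // prints -12
console.log(parseNegPrefixed('7'));      // prints 7
```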

Case 4: Special Parsing for Multiple Values (e.g., Cedars blood pressure)

The string 120/90 should produce the integer 120 for systolic and 90 for diastolic.

{
  cd: 4,
  desc: "Blood Pressure",
  result: "120/90",
  units: "mmHg"
}
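
One way the split could be expressed, as a sketch only. The helper name `parseBloodPressure` and the choice to return `null` for unexpected input are assumptions made for illustration.

```javascript
// Hypothetical helper: split a "systolic/diastolic" string into two numbers.
// Returns null when the input is not in the expected "number/number" form.
function parseBloodPressure(value) {
  const parts = String(value).split('/');
  if (parts.length !== 2) return null;
  if (parts.some((p) => p.trim() === '')) return null; // reject "120/" or "/90"
  const systolic = Number(parts[0]);
  const diastolic = Number(parts[1]);
  if (Number.isNaN(systolic) || Number.isNaN(diastolic)) return null;
  return { systolic, diastolic };
}

console.log(parseBloodPressure('120/90')); // { systolic: 120, diastolic: 90 }
```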

Case 5: Observation Metadata within the Record (e.g., ISM temperature site)

A temperature observation has an associated site or route of measurement.

{
  cd: 5,
  desc: "Temperature",
  value1: "oral",
  value2: 37,
  value1units: null,
  value2units: "Celsius"
}

Case 6: Units Value Conversion (e.g., temperature)

The units are given in Fahrenheit but are expected as Celsius.

{
  cd: 6,
  desc: "Temperature",
  result: 98.6,
  units: "Fahrenheit"
}

The units are given as a fraction but are expected as a percentage (integer between 0 and 100).

{
  cd: 7,
  desc: "FiO2",
  result: 0.22,
  units: "_"
}
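
The two conversions above can be sketched as follows. The helper names are hypothetical, and rounding the percentage to an integer is an assumption based on the "integer between 0 and 100" wording.

```javascript
// Hypothetical conversion helpers for the two records above.
function fahrenheitToCelsius(fahrenheit) {
  return (fahrenheit - 32) * 5 / 9;
}

// Rounds to the nearest integer so that 0.22 becomes 22 despite
// floating-point error in the multiplication.
function fractionToPercent(fraction) {
  return Math.round(fraction * 100);
}

console.log(fahrenheitToCelsius(98.6).toFixed(1)); // prints 37.0
console.log(fractionToPercent(0.22));              // prints 22
```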

Case 7: Units String Normalization

The units are listed as MMOL/L but should be mEq/L. Similarly, mm HG should be normalized to mmHg, and ug/kg/hr to mcg/kg/hr.

{
  cd: 8,
  desc: "Base Excess",
  result: 12,
  units: "MMOL/L"
}
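
A sketch of units normalization using a lookup table built from the examples in the text. The names `UNIT_ALIASES` and `normalizeUnits` are hypothetical, and the MMOL/L mapping simply follows the text for this observation; it may not be appropriate for every kind of measurement.

```javascript
// Hypothetical alias table; keys are upper-cased with whitespace removed.
const UNIT_ALIASES = {
  'MMOL/L': 'mEq/L',
  'MMHG': 'mmHg',
  'UG/KG/HR': 'mcg/kg/hr',
};

// Returns the normalized units string, or the input unchanged when no alias exists.
function normalizeUnits(units) {
  const key = String(units).replace(/\s+/g, '').toUpperCase();
  return Object.prototype.hasOwnProperty.call(UNIT_ALIASES, key)
    ? UNIT_ALIASES[key]
    : units;
}

console.log(normalizeUnits('MMOL/L')); // prints mEq/L
console.log(normalizeUnits('mm HG'));  // prints mmHg
```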

Term Generation

Term Generation is the process of calculating aggregate statistics, called "Terms", from the raw data of observational Records. Terms are the basic unit of information for the Data Ninja application.

Terms can be generated from a simple configuration file that indicates the properties of the Term and the values that should be aggregated.

Term Configuration:

{
  collection: // name of the collection of records
  srcid:      // string identifier of the source
  termidkey:  // key of term identifier
  namekey:    // key of human-readable name
  unitskey:   // key of units of measure
  valuekey:   // key of observation values that should be aggregated
}

Let's follow Case 1 through the TermGen process:

// 1. Define a Term Configuration
{
  collection: 'events',
  srcid:      'sitex_events',
  termidkey:  'cd',
  namekey:    'desc',
  unitskey:   'units',
  valuekey:   'result'
}

// 2. Map individual records into a MappedRecord:
{
  srcid: 'sitex_events',
  termid: 1,
  name: 'Heart Rate',
  units: 'bpm',
  value: 65
}

// 3. Reduce (aggregate) individual records into a CountRecord:
{
  srcid: 'sitex_events',
  termid: 1,
  name: 'Heart Rate',
  units: 'bpm',
  value: 65,
  count: 123
}

// 4. Reduce CountRecords into a Term:
{
  srcid: 'sitex_events',
  termid: 1,
  name: 'Heart Rate',
  units: 'bpm',
  values: [60,61,62,63,64,65,...],
  counts: [100,110,115,120,121,123,...]
}

// 5. Some final steps to calculate statistics and save metadata.
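
Steps 2 through 4 can be sketched in memory as follows. This is an illustration under assumptions, not the actual TermGen implementation: the function names are hypothetical, the map and reduce phases are collapsed into one loop, and values are assumed to be numeric so they can be sorted.

```javascript
// Step 2: map a raw record to a MappedRecord using the Term Configuration.
function mapRecord(config, record) {
  return {
    srcid: config.srcid,
    termid: record[config.termidkey],
    name: record[config.namekey],
    units: record[config.unitskey],
    value: record[config.valuekey],
  };
}

// Steps 3 and 4: count identical values, then collect them into one Term
// per (srcid, termid, name, units) combination.
function generateTerms(config, records) {
  const terms = new Map();
  for (const record of records) {
    const m = mapRecord(config, record);
    const key = JSON.stringify([m.srcid, m.termid, m.name, m.units]);
    if (!terms.has(key)) {
      terms.set(key, { srcid: m.srcid, termid: m.termid, name: m.name, units: m.units, counts: new Map() });
    }
    const counts = terms.get(key).counts;
    counts.set(m.value, (counts.get(m.value) || 0) + 1);
  }
  return [...terms.values()].map(({ counts, ...term }) => {
    const values = [...counts.keys()].sort((a, b) => a - b); // assumes numeric values
    return { ...term, values, counts: values.map((v) => counts.get(v)) };
  });
}

const config = {
  collection: 'events', srcid: 'sitex_events', termidkey: 'cd',
  namekey: 'desc', unitskey: 'units', valuekey: 'result',
};
const records = [
  { cd: 1, desc: 'Heart Rate', result: 65, units: 'bpm' },
  { cd: 1, desc: 'Heart Rate', result: 60, units: 'bpm' },
  { cd: 1, desc: 'Heart Rate', result: 65, units: 'bpm' },
];
console.log(generateTerms(config, records));
// one Term with values [60, 65] and counts [1, 2]
```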

Case 2 (and Case 5) follow a similar process, except that they require two separate runs of the TermGen process, one for each value field. Even though the Terms resulting from the first and second runs are similar, they are persisted separately, each with a different ObjectID.

Extensions

Handling Cases 3, 4, 6, and 7 requires the definition and evaluation of special parsing functions. We can extend the Term Configuration to include these function definitions as Strings that can be evaluated as JavaScript code.

Evaluating Units and Value Functions

Term Configuration:

{
  // same as above
  unitsfunction: // "function(unitsString) { return newUnitsString; }"
  valuefunction: // "function(valueString) { return newValueNumberOrString; }"
}

These functions are evaluated during Step 2 (map individual records) of the TermGen process. The unitsfunction receives the value of the Record property named by unitskey; likewise, the valuefunction receives the value of the property named by valuekey. The remaining steps of TermGen proceed normally.
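
A sketch of how the function strings might be evaluated during the map step. The `compile` helper is hypothetical; it uses the standard `Function` constructor, which executes arbitrary code, so the sketch assumes the Term Configuration comes from a trusted source.

```javascript
// Turn a function-definition string into a callable function.
// When no function is configured, the value is passed through unchanged.
function compile(source) {
  if (!source) return (x) => x;
  return new Function('return (' + source + ');')();
}

// Step 2 of TermGen with the optional unitsfunction and valuefunction applied.
function mapRecord(config, record) {
  const unitsFn = compile(config.unitsfunction);
  const valueFn = compile(config.valuefunction);
  return {
    srcid: config.srcid,
    termid: record[config.termidkey],
    name: record[config.namekey],
    units: unitsFn(record[config.unitskey]),
    value: valueFn(record[config.valuekey]),
  };
}

// Case 3: "NEG 12" should become -12.
const config = {
  collection: 'events', srcid: 'sitex_events', termidkey: 'cd',
  namekey: 'desc', unitskey: 'units', valuekey: 'result',
  valuefunction: "function(v){ return Number(v.replace(/^NEG /, '-')); }",
};
const mapped = mapRecord(config, {
  cd: 3, desc: 'Base Excess - Arterial', result: 'NEG 12', units: 'mEq/L',
});
console.log(mapped.value); // prints -12
```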

Updating a Term Configuration

A typical scenario may play out as follows:

  1. A Term Configuration is defined.
  2. Terms are generated.
  3. A Term is viewed and discovered to require special parsing of its values.

At this point, we would like a way to update the TermConfig and re-generate the Term with correctly parsed values. Thus the following properties of TermConfig may be redefined:

  • unitskey
  • unitsfunction
  • valuekey
  • valuefunction

When only valuefunction is redefined on TermX, we can simply recompute the histogram values and statistics. (This may even be done as a "preview" in the browser without ever being persisted on the server.)

When valuekey is redefined on TermX, we can look up all Records whose termidkey property equals TermX.termid. Then we run the TermGen process on just these Records to recompute TermX (replacing the former TermX).

When unitskey or unitsfunction is redefined on TermX, the newly generated TermX may collide with an existing Term. Since Terms are distinct by srcid, termid, name, and units, TermX with new units may no longer be distinct. Modifying unitskey or unitsfunction should result in merging the recomputed Term with existing Terms.
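
The merge could look roughly like the sketch below, assuming a Term carries parallel `values` and `counts` arrays as in Step 4 of TermGen. The function name `mergeTerms` is hypothetical, and how statistics are recomputed afterwards is left out.

```javascript
// Hypothetical merge of two Terms that share srcid, termid, name, and units.
// Counts for values present in both histograms are added together.
function mergeTerms(a, b) {
  const counts = new Map();
  for (const term of [a, b]) {
    term.values.forEach((value, i) => {
      counts.set(value, (counts.get(value) || 0) + term.counts[i]);
    });
  }
  const values = [...counts.keys()].sort((x, y) => x - y); // assumes numeric values
  return { ...a, values, counts: values.map((v) => counts.get(v)) };
}

const existing = { srcid: 'sitex_events', termid: 8, name: 'Base Excess', units: 'mEq/L', values: [10, 12], counts: [1, 2] };
const recomputed = { srcid: 'sitex_events', termid: 8, name: 'Base Excess', units: 'mEq/L', values: [12, 14], counts: [3, 4] };
console.log(mergeTerms(existing, recomputed));
// values [10, 12, 14] with counts [1, 5, 4]
```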

Duplicate and Modify a Term

In dealing with Case 4, we cannot simply update the valuefunction to generate multiple values (and multiple Terms). There should be an operation to "Modify a Copy" of an existing Term. This operation would copy TermX's Config and allow the user to modify both TermX and the copy to generate new Terms.
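
As a rough illustration of "Modify a Copy", the sketch below clones a Term Configuration and overrides only the valuefunction. The helper name `modifyCopy` is hypothetical.

```javascript
// Hypothetical "Modify a Copy": clone a Term Configuration and override fields.
// The original configuration is left untouched.
function modifyCopy(config, overrides) {
  return { ...config, ...overrides };
}

// Case 4: the original Term extracts the systolic value...
const systolicConfig = {
  collection: 'events', srcid: 'sitex_events', termidkey: 'cd',
  namekey: 'desc', unitskey: 'units', valuekey: 'result',
  valuefunction: "function(v){return Number(v.split('/')[0]);}",
};
// ...and the copy extracts the diastolic value from the same field.
const diastolicConfig = modifyCopy(systolicConfig, {
  valuefunction: "function(v){return Number(v.split('/')[1]);}",
});
console.log(diastolicConfig.valuekey); // prints result
```

Because Terms are distinct by srcid, termid, name, and units, the two resulting Terms would presumably also need something to tell them apart; the text leaves that detail open.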

Extra Metadata on Elements (nice to have)

To facilitate the process of Tagging Records (next section), it would be convenient to attach some extra metadata to each Element. Perhaps TAG and GROUPS (explained in next section) can be specified as Element metadata.

Tagging Records

Tagging Records is the process of marking individual observation Records with metadata, such that a data-driven application can identify and utilize the Records.

The Tagging process should be able to operate independently of Data Ninja, since Tagging is a requirement between a data application and the Record datastore. Here, we define a Basic Tagging process and a Data Ninja Tagging process.

Basic Tagging

We want to define a file format to hold tagging information so the Basic Tagging process can be run multiple times using only this file, which we call a "TagMap". The simplest such format is CSV (and maybe JSON in the future).

The fields of the CSV file are similar to the basic TermConfig. Here is a TagMap for Cases 1 and 2:

COLLECTION,TERMIDKEY,TERMID,UNITSKEY,VALUEKEY,TAG
events,cd,1,units,result,HR
events,cd,2,value1units,value1,SBP
events,cd,2,value2units,value2,DBP

The simplicity of the Basic TagMap allows users to manually create and edit this file using any text editor or spreadsheet editor.

An optional GROUPS field allows an application to assign one or more categories or groupings to each tag. When a tag belongs to multiple groups, the field contains the group names separated by pipes.

A Basic TagMap with groups:

COLLECTION,TERMIDKEY,TERMID,UNITSKEY,VALUEKEY,TAG,GROUPS
events,cd,1,units,result,HR,Vitals
events,cd,2,value1units,value1,SBP,BP|Vitals
events,cd,2,value2units,value2,DBP,BP|Vitals
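
Reading such a file could be sketched as follows. The function name `parseTagMap` is hypothetical, and the parser assumes no quoted fields or embedded commas, which holds for a Basic TagMap like the one above.

```javascript
// Parse a Basic TagMap: a header row followed by one mapping per line.
// Field names are lower-cased; GROUPS is split on "|" into an array.
function parseTagMap(csvText) {
  const lines = csvText.trim().split(/\r?\n/);
  const header = lines[0].split(',').map((h) => h.trim().toLowerCase());
  return lines.slice(1).map((line) => {
    const cells = line.split(',');
    const row = {};
    header.forEach((name, i) => { row[name] = (cells[i] || '').trim(); });
    row.groups = row.groups ? row.groups.split('|') : []; // optional field
    return row;
  });
}

const tagMapCsv = [
  'COLLECTION,TERMIDKEY,TERMID,UNITSKEY,VALUEKEY,TAG,GROUPS',
  'events,cd,1,units,result,HR,Vitals',
  'events,cd,2,value1units,value1,SBP,BP|Vitals',
  'events,cd,2,value2units,value2,DBP,BP|Vitals',
].join('\n');
const rows = parseTagMap(tagMapCsv);
console.log(rows[1].tag, rows[1].groups); // SBP [ 'BP', 'Vitals' ]
```

Note that every field, including TERMID, is read as a string, so a comparison against a numeric `cd` value needs a conversion.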

A tag operation on a single Record will add an array property to the Record containing all tags. After Basic Tagging, Records from Cases 1 and 2 appear as follows:

// Case 1: Basic Tagged Record
{
  cd: 1,
  desc: "Heart Rate",
  result: 65,
  units: "bpm",
  tags: [{
    units: "bpm",
    value: 65,
    tagvalue: "HR",
    groups: [ "Vitals" ]
  }]
}

// Case 2: Basic Tagged Record
{
  cd: 2,
  desc: "BP",
  value1: 120,
  value2: 90,
  value1units: "mmHg",
  value2units: "mmHg",
  tags: [{
    units: "mmHg",
    value: 120,
    tagvalue: "SBP",
    groups: [ "BP", "Vitals" ]
  },{
    units: "mmHg",
    value: 90,
    tagvalue: "DBP",
    groups: [ "BP", "Vitals" ]
  }]
}
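
The tag operation itself might be sketched like this. The function name `tagRecord` is hypothetical, TagMap rows are assumed to be plain objects with lower-cased field names, and the Record is assumed to have already been read from the row's collection.

```javascript
// Apply every matching TagMap row to a record and return a tagged copy.
// TERMID is compared as a string because CSV fields carry no type.
function tagRecord(record, tagMapRows) {
  const tags = tagMapRows
    .filter((row) => String(record[row.termidkey]) === String(row.termid))
    .map((row) => ({
      units: record[row.unitskey],
      value: record[row.valuekey],
      tagvalue: row.tag,
      groups: row.groups || [],
    }));
  return { ...record, tags };
}

const tagMapRows = [
  { collection: 'events', termidkey: 'cd', termid: '1', unitskey: 'units', valuekey: 'result', tag: 'HR', groups: ['Vitals'] },
  { collection: 'events', termidkey: 'cd', termid: '2', unitskey: 'value1units', valuekey: 'value1', tag: 'SBP', groups: ['BP', 'Vitals'] },
  { collection: 'events', termidkey: 'cd', termid: '2', unitskey: 'value2units', valuekey: 'value2', tag: 'DBP', groups: ['BP', 'Vitals'] },
];
const tagged = tagRecord(
  { cd: 2, desc: 'BP', value1: 120, value2: 90, value1units: 'mmHg', value2units: 'mmHg' },
  tagMapRows
);
console.log(tagged.tags.map((t) => t.tagvalue + '=' + t.value)); // [ 'SBP=120', 'DBP=90' ]
```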

Data Ninja Tagging

Data Ninja allows several extensions to the Basic Tagging process. These include:

  • Scoping by the research Dataset
  • Identifying by the Element name and definition
  • Applying value functions
  • Applying units functions

The Data Ninja TagMap has all the data and fields of the Basic TagMap, but adds a few more fields:

  • UNITSFUNCTION - the unitsfunction of a Term
  • UNITS - the final evaluated units of a Term
  • VALUEFUNCTION - the valuefunction of a Term
  • DATASETID - the Dataset lineageid scope of this mapping
  • DATASETNAME - (optional) the Dataset name
  • ELEMENTID - the Element lineageid of this mapping
  • ELEMENTNAME - (optional) the Element name

The ELEMENTID field in Data Ninja TagMap serves the same purpose as the TAG field in Basic TagMap. The user may edit the Data Ninja TagMap to add the application-specific tags and groups.

A sample Data Ninja TagMap:

COLLECTION,TERMIDKEY,TERMID,UNITSKEY,VALUEKEY,TAG,GROUPS,UNITSFUNCTION,UNITS,VALUEFUNCTION,DATASETID,DATASETNAME,ELEMENTID,ELEMENTNAME
events,cd,1,units,result,HR,Vitals,,bpm,,(ObjectID),VPS,(ObjectID),Heart Rate
events,cd,2,value1units,value1,SBP,BP|Vitals,,mmHg,,(ObjectID),VPS,(ObjectID),Systolic Blood Pressure
events,cd,2,value2units,value2,DBP,BP|Vitals,,mmHg,,(ObjectID),VPS,(ObjectID),Diastolic Blood Pressure
events,cd,3,units,result,BE,Blood Gases|Labs,,mEq/L,"function(v){return Number(v.replace(/^NEG /,'-'));}",(ObjectID),VPS,(ObjectID),Base Excess
events,cd,4,units,result,SBP,BP|Vitals,,mmHg,function(v){return Number(v.split('/')[0]);},(ObjectID),VPS,(ObjectID),Systolic Blood Pressure
events,cd,4,units,result,DBP,BP|Vitals,,mmHg,function(v){return Number(v.split('/')[1]);},(ObjectID),VPS,(ObjectID),Diastolic Blood Pressure
events,cd,5,value1units,value1,Temp Route,Vitals,,,,(ObjectID),VPS,(ObjectID),Temperature Route
events,cd,5,value2units,value2,Temp,Vitals,,Celsius,,(ObjectID),VPS,(ObjectID),Temperature
events,cd,6,units,result,Temp,Vitals,function(v){return 'Celsius';},Celsius,function(v){return (v-32)*5/9;},(ObjectID),VPS,(ObjectID),Temperature
events,cd,7,units,result,FiO2,Vitals,function(v){return '%';},%,function(v){return v*100;},(ObjectID),VPS,(ObjectID),FiO2
events,cd,8,units,result,BE,Blood Gases|Labs,function(v){return 'mEq/L';},mEq/L,,(ObjectID),VPS,(ObjectID),Base Excess

Records tagged with Data Ninja Tagging appear as follows:

// Case 3
{
  cd: 3,
  desc: "Base Excess - Arterial",
  result: "NEG 12",
  units: "mEq/L",
  tags: [{
    units: "mEq/L",
    value: -12,
    tagvalue: "BE",
    groups: [ "Blood Gases", "Labs" ],
    datasetid: <ObjectID>,
    datasetname: "VPS",
    elementid: <ObjectID>,
    elementname: "Base Excess Arterial"
  }]
}

// Case 4
{
  cd: 4,
  desc: "Blood Pressure",
  result: "120/90",
  units: "mmHg",
  tags: [{
    units: "mmHg",
    value: 120,
    tagvalue: "SBP",
    groups: [ "BP", "Vitals" ],
    datasetid: <ObjectID>,
    datasetname: "VPS",
    elementid: <ObjectID>,
    elementname: "Systolic Blood Pressure"
  },{
    units: "mmHg",
    value: 90,
    tagvalue: "DBP",
    groups: [ "BP", "Vitals" ],
    datasetid: <ObjectID>,
    datasetname: "VPS",
    elementid: <ObjectID>,
    elementname: "Diastolic Blood Pressure"
  }]
}

// Case 5 is like Case 2

// Case 6
{
  cd: 6,
  desc: "Temperature",
  result: 98.6,
  units: "Fahrenheit",
  tags: [{
    units: "Celsius",
    value: 37,
    tagvalue: "Temp",
    groups: [ "Vitals" ],
    datasetid: <ObjectID>,
    datasetname: "VPS",
    elementid: <ObjectID>,
    elementname: "Temperature"
  }]
}

// Case 7
{
  cd: 8,
  desc: "Base Excess",
  result: 12,
  units: "MMOL/L",
  tags: [{
    units: "mEq/L",
    value: 12,
    tagvalue: "BE",
    groups: [ "Blood Gases", "Labs" ],
    datasetid: <ObjectID>,
    datasetname: "VPS",
    elementid: <ObjectID>,
    elementname: "Base Excess"
  }]
}
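
Data Ninja Tagging can be sketched by combining the tag operation with the units and value functions. The names `compile` and `tagRecordDataNinja` are hypothetical, the `'dataset-1'` and `'element-*'` strings stand in for real ObjectIDs, and function strings are evaluated with the standard `Function` constructor, so the TagMap is assumed to be trusted.

```javascript
// Turn a function string into a callable; pass values through when absent.
function compile(source) {
  return source ? new Function('return (' + source + ');')() : (x) => x;
}

// Tag a record using Data Ninja TagMap rows (objects with lower-cased fields).
function tagRecordDataNinja(record, rows) {
  const tags = rows
    .filter((row) => String(record[row.termidkey]) === String(row.termid))
    .map((row) => ({
      units: compile(row.unitsfunction)(record[row.unitskey]),
      value: compile(row.valuefunction)(record[row.valuekey]),
      tagvalue: row.tag,
      groups: row.groups || [],
      datasetid: row.datasetid,
      datasetname: row.datasetname,
      elementid: row.elementid,
      elementname: row.elementname,
    }));
  return { ...record, tags };
}

// Case 4: one record, two TagMap rows, two tags.
const base = { termidkey: 'cd', termid: '4', unitskey: 'units', valuekey: 'result',
  groups: ['BP', 'Vitals'], datasetid: 'dataset-1', datasetname: 'VPS' };
const dataNinjaRows = [
  { ...base, tag: 'SBP', elementid: 'element-sbp', elementname: 'Systolic Blood Pressure',
    valuefunction: "function(v){return Number(v.split('/')[0]);}" },
  { ...base, tag: 'DBP', elementid: 'element-dbp', elementname: 'Diastolic Blood Pressure',
    valuefunction: "function(v){return Number(v.split('/')[1]);}" },
];
const taggedBp = tagRecordDataNinja(
  { cd: 4, desc: 'Blood Pressure', result: '120/90', units: 'mmHg' },
  dataNinjaRows
);
console.log(taggedBp.tags.map((t) => t.tagvalue + '=' + t.value)); // [ 'SBP=120', 'DBP=90' ]
```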

Data Ninja should provide a utility that generates a TagMap from a Dataset. Each row in the TagMap CSV file should correspond to a single Term. A Term may be mapped to multiple Elements. An application needing tags has the following options:

  1. Use the Dataset lineageid and Element lineageid as the tag (recommended). Groups would have to be implemented within the application.
  2. Manually edit the TagMap CSV, adding TAG and GROUPS values to every row.
  3. Use a script to perform option #2.

Written with StackEdit.
