Describing Self-Descriptive Data with DFDL and Apache Daffodil

Self-descriptive data can be difficult to process since the format of the data is not fixed, but is instead described by metadata. The Data Format Description Language (DFDL) and Apache Daffodil are powerful tools that can describe and parse a wide variety of data, but some self-descriptive data formats can still prove to be a challenge, particularly when they are logically self-descriptive. Below we detail what DFDL is, what Apache Daffodil is, and a generic approach to using them to describe and parse complex self-descriptive data.

Introduction to DFDL

The Data Format Description Language (DFDL) is a specification, developed by the Open Grid Forum, that is capable of describing many data formats, including both textual and binary, scientific and numeric, legacy and modern, commercial record-oriented, and many industry and military standards. It defines a language that is a subset of W3C XML Schema to describe the logical format of the data, along with annotations within the schema to describe the physical representation.

As a small example of how one might describe a data format using DFDL, imagine we had this data:

foo:5,bar:-7.1E2

This data format could be described logically as two values named "Foo" and "Bar" with integer and float types, respectively. The physical properties of this format could be described as standard text numbers with US-ASCII encoding that are tagged or "initiated" with the strings "foo:" and "bar:" and are separated by a comma. A DFDL schema that describes both the logical and physical properties of this data format looks like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <!-- DFDL default properties here -->
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format
        representation="text"
        textNumberRep="standard"
        encoding="ascii"
        lengthKind="delimited"
        ... />
    </xs:appinfo>
  </xs:annotation>

  <xs:element name="Values">
    <xs:complexType>
      <xs:sequence dfdl:separator=",">
        <xs:element name="Foo" type="xs:int" dfdl:initiator="foo:" />
        <xs:element name="Bar" type="xs:float" dfdl:initiator="bar:" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

Ignoring the dfdl:format annotation and the dfdl: attributes, this is a standard XML schema defining a Values element that contains a sequence of two elements named Foo and Bar, with types xs:int and xs:float, respectively. This closely matches the logical description of the data format from above. DFDL annotations are then applied to this logical description to define the physical properties of the format, such as ASCII-encoded text, delimited lengths (i.e. lengths determined by scanning for tokens in the data), and elements separated by a comma with appropriate initiator tags.

Introduction to Apache Daffodil

Apache Daffodil is an open source project undergoing Apache incubation that implements the DFDL specification to read these DFDL schemas and parse data to an "infoset". This infoset can have different representations, but is most commonly projected into XML or JSON. This allows the use of well-established XML or JSON technologies and libraries to consume, inspect, and manipulate fixed-format data. Using the DFDL schema description from above, Daffodil would parse the data to an infoset that looks like the following, projected into XML:

<Values>
  <Foo>5</Foo>
  <Bar>-710.0</Bar>
</Values>

Or like this projected into JSON:

{
  "Values": {
    "Foo": 5,
    "Bar": -710.0
  }
}

Note that the result contains only the logical and canonicalized information from the data. The fact that the data originally contained commas and initiators, or that a number was in scientific notation, is no longer relevant. This provides a considerable benefit to consumers of this data, since they no longer need to understand the specifics of the format, but instead only need a DFDL description and a way to ingest XML or JSON.

Apache Daffodil is also capable of serializing or "unparsing" an XML/JSON infoset back to the original data format using the same DFDL schema. This serialization includes creating the physical properties such as terminators, separators, and original numeric format according to the schema.

Self-Descriptive Data Formats

Most data formats have some kind of self-descriptive elements. For example, some formats include a field that defines the length or number of repetitions of another field. Other examples are a magic number to determine the endianness of the data, or delimiters defined at the beginning of the data. In these cases, the self-descriptive properties are physical, such as length, occurrence, byte order, and delimiters. DFDL is fully capable of describing these self-descriptive physical properties using the "DFDL expression language", which is based on XPath. For example, a payload with a self-descriptive length might be described in DFDL like this:

<xs:element name="Length" type="xs:int" dfdl:length="4" />
<xs:element name="Payload" type="xs:hexBinary" dfdl:length="{ ../Length }" >

In this case, a 4-byte integer is parsed and the resulting value is used, via the DFDL expression language, as the explicit length of the following Payload element (this assumes dfdl:lengthKind="explicit" is in effect for these elements).

However, some data formats have self-descriptive properties of the logical format, such as element names, order, and types. DFDL generally cannot easily support these formats since logical properties must be hardcoded in the DFDL schema. Although this proves to be a challenge to describe in DFDL, techniques do exist to support this. The following sections describe how one might handle logically self-descriptive data formats using DFDL and the Apache Daffodil Java API.

Example: TypedCSV

Since self-descriptive formats tend to be fairly complex due to their self-descriptive nature, we're going to define a new, simple-ish, data format called TypedCSV (TCSV) to use as an example. This format is similar to comma-separated value (CSV) data formats where the first row is the header, the remaining rows are the actual data, and columns and rows are separated by commas and newlines. To make things a bit more interesting and different from CSV, the header row of TCSV also includes the type of each column in addition to its title. An example of TCSV data looks like this:

string:Name,int:Age,float:Height
Dan,35,6.25
Chuck,50,5.25
Bob,30,5.50
Faythe,25,5.9
Alice,20,5.75
Eve,40,5.88

Note how this data is logically self-descriptive. In order to fully parse and validate all the rows and fields within the rows, we must use the name and type information provided in the header.

Although relatively basic, this is very similar to how many logically self-descriptive formats work. They generally begin with metadata that describes the actual format, followed by a payload that abides by that format. In TCSV, the metadata is the header row and the payload is all the subsequent rows.

Plan of Attack

Using the Apache Daffodil Java API and DFDL, we can describe and parse this and other kinds of logically self-descriptive data formats with the following technique, which we cover in more detail in the sections below.

  1. Describe the self-description with a DFDL schema
  2. Parse the self-description to an infoset using Apache Daffodil
  3. Transform that infoset to a DFDL schema that describes the payload
  4. Using the generated DFDL schema, parse the payload to an infoset


The following sections contain snippets of DFDL schemas, XSLT files, and Java code. The complete files, including error checking and helper functions, are available in the selfDescriptiveData/ directory in the OpenDFDL/examples repository.

1. Describe the Self-Description with a DFDL Schema

In our TypedCSV format, the self-description is the first line in the data. In our example above, this is:

string:Name,int:Age,float:Height

But we could imagine there being any number of columns with different names and types. We can generically describe the logical format of this header row as an unbounded repetition of columns, where each column is made up of a type and a title, both with string types. We can describe the physical properties as having the type and title separated by a colon, each column separated by a comma, and the whole TCSV header row terminated by a newline (represented as %NL; in DFDL). This generic TCSV header can then be described with this DFDL schema:

<xs:element name="TCSVHeader" dfdl:terminator="%NL;">
  <xs:complexType>
    <xs:sequence dfdl:separator=",">
      <xs:element name="Column" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence dfdl:separator=":">
            <xs:element name="Type" type="xs:string" />
            <xs:element name="Title" type="xs:string" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Complete DFDL schema available on GitHub: tcsvHeader.dfdl.xsd

2. Parse the Self-Description to an Infoset Using Apache Daffodil

We now want to parse only the self-description metadata using the Daffodil Java API and the DFDL schema we just created. The first thing we must do is compile the DFDL schema into a DataProcessor. The DataProcessor is what performs the actions to parse the data and build the infoset.

Compiler c = Daffodil.compiler();
ProcessorFactory pf = c.compileSource(tcsvHeaderSchema);
DataProcessor headerDP = pf.onPath("/");

Complete Java code for this and the below snippets available on GitHub: TypedCSV.java
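
The complete example also checks these objects for errors. A minimal sketch of checking for schema compilation problems using Daffodil's diagnostics API might look like this (the handling shown is illustrative):

// If schema compilation failed, report each diagnostic rather than continuing.
if (pf.isError()) {
    for (Diagnostic diag : pf.getDiagnostics()) {
        System.err.println(diag.getMessage());
    }
}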

We then wrap our input data InputStream with a Daffodil InputSourceDataInputStream. This class allows Daffodil to remember where one parse completes so that another parse can start where it left off, which we'll need later.

InputStream is = inputData.openStream();
InputSourceDataInputStream dis = new InputSourceDataInputStream(is);

Before we can parse our header, the last thing we must do is define how we want to output the infoset. In this example, we'll output it as a JDOM Document by using the JDOMInfosetOutputter.

JDOMInfosetOutputter headerOutputter = new JDOMInfosetOutputter();

We are now ready to parse the self-description using the DataProcessor and get the result:

ParseResult headerPR = headerDP.parse(dis, headerOutputter);
Document headerDoc = headerOutputter.getResult();
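
The same diagnostics pattern applies to the parse result; a brief sketch, mirroring the error checking in the complete code:

// If the header data did not match the schema, report the parse diagnostics.
if (headerPR.isError()) {
    headerPR.getDiagnostics().forEach(d -> System.err.println(d.getMessage()));
}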

The resulting JDOM Document looks like this:

<TCSVHeader>
  <Column>
    <Type>string</Type>
    <Title>Name</Title>
  </Column>
  <Column>
    <Type>int</Type>
    <Title>Age</Title>
  </Column>
  <Column>
    <Type>float</Type>
    <Title>Height</Title>
  </Column>
</TCSVHeader>

Note that the physical aspects of the metadata (e.g. colons, commas, newlines) do not appear in the infoset, leaving us with only the logical values that we actually want and need.

3. Transform that Infoset to a DFDL Schema that Describes the Payload

Based on the resulting XML, we now want to create a new DFDL schema that describes the payload. Various technologies can accomplish this, but eXtensible Stylesheet Language Transformations (XSLT) are particularly well suited for transforming XML to a DFDL schema. Two XSLT templates are needed to do that. The first creates the main body of the DFDL schema:

<xsl:template match="/">

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

    <!-- DFDL default properties here ... -->

    <xs:element name="TCSVPayload">
      <xs:complexType>
        <xs:sequence dfdl:separator="%NL;">
         <xs:element name="Row" maxOccurs="unbounded">
            <xs:complexType>
              <xs:sequence dfdl:separator=",">
                <xsl:apply-templates />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>

</xsl:template>

Complete XSLT file available on GitHub: transformHeader.xslt

This template creates a DFDL schema that describes everything about a TypedCSV payload except for the specific fields. This includes a main root element called TCSVPayload, an unbounded sequence of Rows with each Row separated by a newline (%NL;), and each field inside the Row separated by a comma.

A second template is used to describe how to transform each of the Columns with Title and Type information into DFDL schema elements that make up the fields of the Row:

<xsl:template match="/TCSVHeader/Column">

  <xs:element>
    <xsl:attribute name="name">
      <xsl:value-of select="Title" />
    </xsl:attribute>
    <xsl:attribute name="type">
      <xsl:value-of select="concat('xs:', Type)" />
    </xsl:attribute>
  </xs:element>

</xsl:template>

For each Column element in the header XML, this template creates a new schema xs:element and sets the name and type attributes based on the Title and Type values from the infoset. For example, the template would convert this XML:

<Column>
  <Type>bar</Type>
  <Title>Foo</Title>
</Column>

to this DFDL schema element:

<xs:element name="Foo" type="xs:bar" />

Note that Daffodil uses the type information in the DFDL schema to determine how to parse the actual payload fields. For example, a field with type xs:int will parse and validate differently than a field with type xs:float.

With the XSLT defined, we can use the following code to transform our JDOM Document to a generated DFDL schema:

XSLTransformer headerTR = new XSLTransformer(headerXSLT.openStream());
Document payloadSchemaDoc = headerTR.transform(headerDoc);

The resulting generated DFDL schema looks like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

  <!-- DFDL default properties here ... -->

  <xs:element name="TCSVPayload">
    <xs:complexType>
      <xs:sequence dfdl:separator="%NL;">
        <xs:element name="Row" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence dfdl:separator=",">
              <xs:element name="Name" type="xs:string" />
              <xs:element name="Age" type="xs:int" />
              <xs:element name="Height" type="xs:float" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

This DFDL schema now exactly describes the rest of the TCSV data using the appropriate names and types defined in the self-description.

4. Using the Generated DFDL Schema, Parse the Payload to an Infoset

With this new DFDL schema we are now ready to parse the payload. This is done by compiling the generated DFDL schema into a new DataProcessor:

Compiler c = Daffodil.compiler();
ProcessorFactory pf = c.compileSource(payloadDFDLSchema);
DataProcessor payloadDP = pf.onPath("/");
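
One detail worth noting: compileSource expects a URI pointing at a schema file, while the XSLT transform produced a JDOM Document (payloadSchemaDoc). A minimal way to bridge the two, shown purely as a sketch (the complete TypedCSV.java may handle this differently), is to serialize the Document to a temporary file and use that file's URI as payloadDFDLSchema:

// Sketch: write the generated schema Document to a temporary file so that
// Daffodil's compileSource, which takes a URI, can load it.
Path schemaFile = Files.createTempFile("payloadSchema", ".dfdl.xsd");
try (Writer writer = Files.newBufferedWriter(schemaFile)) {
    new XMLOutputter(Format.getPrettyFormat()).output(payloadSchemaDoc, writer);
}
URI payloadDFDLSchema = schemaFile.toUri();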

Again, we need to define how we want the resulting infoset to be output. As before, we will output to a JDOM Document:

JDOMInfosetOutputter payloadOutputter = new JDOMInfosetOutputter();

And finally we can parse the remaining data from the same InputSourceDataInputStream used previously. Since we have not touched this input stream since parsing the header, the next parse will begin where the previous parse left off, which is right where the payload begins:

ParseResult payloadPR = payloadDP.parse(dis, payloadOutputter);
Document payloadDoc = payloadOutputter.getResult();
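
To inspect this result as text, the JDOM Document can be serialized with JDOM's XMLOutputter (an illustrative step, not one of the original snippets):

// Illustrative: pretty-print the parsed payload infoset as XML text.
System.out.println(new XMLOutputter(Format.getPrettyFormat()).outputString(payloadDoc));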

The resulting XML looks like this:

<TCSVPayload>
  <Row>
    <Name>Dan</Name>
    <Age>35</Age>
    <Height>6.25</Height>
  </Row>
  <Row>
    <Name>Chuck</Name>
    <Age>50</Age>
    <Height>5.25</Height>
  </Row>
  ...
  ...
  <Row>
    <Name>Alice</Name>
    <Age>20</Age>
    <Height>5.75</Height>
  </Row>
  <Row>
    <Name>Eve</Name>
    <Age>40</Age>
    <Height>5.88</Height>
  </Row>
</TCSVPayload>

Notice that, just like with the parsed header, the commas and newlines are all gone, making it easy to find and consume the actual payload data we need. Additionally, because the generated schema defines the names and types, the resulting XML includes the appropriate names, and Daffodil will have validated the payload fields according to the types.

The result is that we now have two chunks of XML, one representing the header and one representing the payload.

At this point, we could perform queries and transformations on this XML, or simply ingest it into a big data system for analysis.
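
For example, a quick JDOM2 XPath query over the payload Document can pull out specific rows; the query and names below are purely illustrative:

// Illustrative: select every Row whose Age is greater than 30 and print its Name.
XPathExpression<Element> olderThan30 =
    XPathFactory.instance().compile("/TCSVPayload/Row[Age > 30]", Filters.element());
for (Element row : olderThan30.evaluate(payloadDoc)) {
    System.out.println(row.getChildText("Name"));
}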

Addendum: Unparsing Self-Descriptive Data

As briefly mentioned in the introduction, Daffodil is capable of serializing, or "unparsing" in DFDL-speak, an infoset back to the original data format. A common use case for this is to first parse data to XML, filter or anonymize sensitive or personally identifiable information using common XML technologies, and then unparse the transformed XML back to the original format. This provides an easy way to transform or filter any data format using commonly available, well-understood, and well-tested XML technologies, as long as a DFDL schema exists for that data format.

As before, XSLT is a powerful tool for transforming XML, so we can define XSLT to filter our payload XML infoset before unparsing.

One simple transformation is to remove an entire Row. The below XSLT template removes all Row elements (and their children) that have a child element called Name with a value of "Eve".

<xsl:template match="Row[Name='Eve']" />

Another transformation might be to sort the rows. To do this, we can define an XSLT template that sorts all the immediate children of the TCSVPayload element (i.e. all the Row elements) based on the value of the Name element in each Row. The result is that all rows will be sorted by name.

<xsl:template match="/TCSVPayload">
  <xsl:copy>
    <xsl:apply-templates>
      <xsl:sort select="Name" />
    </xsl:apply-templates>
  </xsl:copy>
</xsl:template>
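
Note that for these templates to produce a complete document that can be unparsed, the full stylesheet also needs to copy through everything not explicitly matched, typically with a standard XSLT identity template (one matching "@*|node()" that copies each node and applies templates to its contents); this is presumably handled in the complete transformPayload.xslt linked below.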

Complete XSLT file available on GitHub: transformPayload.xslt

With these templates defined in an XSLT file, we can transform our XML using the JDOM API:

XSLTransformer payloadTR = new XSLTransformer(payloadXSLT.openStream());
Document payloadDocTR = payloadTR.transform(payloadDoc);

With the payload infoset transformed, we can now unparse our infosets back to the original data format. We must first define an output:

WritableByteChannel output = Channels.newChannel(outputStream);
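
Here outputStream can be any java.io.OutputStream (a FileOutputStream writing to disk, for example); Channels.newChannel simply wraps it as the WritableByteChannel that Daffodil's unparse calls expect.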

We want to first unparse the original header XML, which can be done by defining a JDOMInfosetInputter to provide the header JDOM Document to Daffodil.

JDOMInfosetInputter headerInputter = new JDOMInfosetInputter(headerDoc);

We can then use the same header DataProcessor that we used for parsing, but this time for unparsing:

UnparseResult headerUR = headerDP.unparse(headerInputter, output);

Finally, we can unparse the transformed payload infoset Document to the same output using a similar technique, with the same payload DataProcessor used for parsing:

JDOMInfosetInputter payloadInputter = new JDOMInfosetInputter(payloadDocTR);
UnparseResult payloadUR = payloadDP.unparse(payloadInputter, output);
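
As with parsing, the returned UnparseResult objects (headerUR and payloadUR) expose isError() and getDiagnostics(), so the same kind of error checking mentioned earlier can be applied before trusting the output.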

The resulting output looks like this:

string:Name,int:Age,float:Height
Alice,20,5.75
Bob,30,5.50
Chuck,50,5.25
Dan,35,6.25
Faythe,25,5.9

Notice that because of the transformation, the rows are all sorted by name and the "Eve" row has been removed, without the need to write any of the deserialization or serialization code that one would traditionally have to write. All we needed to define was a DFDL schema, some XSLT transformations, and some Java code to tie it all together.
