Skip to content

Instantly share code, notes, and snippets.

@paulfitz
Created December 21, 2010 13:51
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save paulfitz/749949 to your computer and use it in GitHub Desktop.
Save paulfitz/749949 to your computer and use it in GitHub Desktop.
This is format specification for table differences
default:
asciidoc tdiff_spec_draft.txt
tDiff Format Specification
==========================
:Author: Paul Fitzpatrick, Joe Panico
:Email: paulfitz@alum.mit.edu, diffkit-user@googlegroups.com
:Date: 2011-01-12
:Revision: 0.2
[abstract]
Purpose
-------
This document defines the tDiff format; a text encoding for describing
the differences between two tables of data.
tDiff output is intended for two audiences: end-users who are looking
for an easily readable report that describes the differences between
two tables; and tools that will interpret tDiff output in order to
perform further processing or analysis (e.g. "patch" tools).
Scope
-----
tDiff should be able to describe the differences between *any types*
of data tables. That includes: tables that have an identifiable unique
key (e.g. RDBMS tables with primary key) and tables that don't (RDBMS
tables without primary key, or heaps); tables where row ordering is
meaningful, whether explicit or implied, such as spreadsheets or CSV
files; and tables where row ordering has no inherent meaning (RDBMS
tables). Further, tDiff should be able to represent the differences
between two different *types* of table, e.g. an RDBMS table versus a
spreadsheet table.
Limitations
-----------
tDiff is designed to described the differences in *content* between
two tables-- it does not have any facilities for describing the
differences in *structure* between two tables. A future revision will
expand the specification to include structure or schema information.
tDiff is designed to described text (character) content. It has no
facilities to describe differences in content that does not have a
natural default character representation. A future revision may expand
the specification to include arbitrary binary data.
tDiff can represent the differences between exactly *two* tables. It
cannot represent an n-way compare where n > 2. It also cannot
represent the differences between collections of tables, such
as multiple-sheet spreadsheets.
Representation
--------------
tDiff documents use the UTF-8 character encoding.
Top-level structure
-------------------
tDiff documents comprise any number of <<Comment,comment blocks>>,
<<Control,control blocks>>, and
<<Hunk,diff hunks>>, interleaved in any order.
Each diff hunk describes a related set of differences
between two tables. Each hunk could stand on its own as
an independent tDiff document. When there is a choice
in how to decompose differences between two tables as
a sequence of hunks,
generators are encouraged to choose a decomposition that
minimizes ordering effects between hunks.
.Document example
----
# tdiff version 0.2
/*
* this comment could describe the whole document
*/
/*
* this comment could describe the following hunk
*/
* |bridge=Brooklyn|designer:'J.A.Roebling'|length:1595|
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
* |bridge=Queensborough|designer:'Palmer & Hornbostel'|length:1182|
/*
* this comment could describe the following hunk, even
* though it's not immediately adjacent to the hunk (there is
* an intervening newline).
*/
- |bridge=Spamspan|designer:'S.Spamington'|length:10000|
+ |bridge=Manhattan|designer:'G.Lindenthal'|length:1470|
/*
* final, wrap up, comment
*/
----
[[Comment]]
Comment blocks
--------------
Comment blocks are delimited using: /* */ (C style). Any content can
occur within a comment block.
.Example comment
----
/* This is an example single-line comment */
/* This is an
example multi-line
comment */
----
[[Control]]
Control blocks
--------------
Control blocks begin with "# ", and are delimited by a newline or
linefeed. Control blocks may hold meta information about diffs,
or environmental information that might be useful to an interpreter.
Apart from the special <<Header,header control block>>, they
lie outside of the scope of this specification.
[[Header]]
Header
------
A tdiff document should begin with a special
<<Control, control block>> called the header.
The header begins with the characters "# tdiff". It is there
to aid in rapid identification of tdiff documents. Documents
*without* this block should be accepted by interpreters, but
this block should always be included by generators.
.Example header
----
# tdiff version 0.2
----
[[Hunk]]
Hunk
----
A hunk is a series of one or more adjacent <<Line,diff lines>>,
optionally preceeded by a <<Column,column line>>, where each
diff line represents the
differences between the source tables for a single row. The
lines within a hunk should be separated by only the newline
characters that terminate each diff line, so that they all appear as
adjacent lines within a text editor. Within a tDiff document,
hunks are delimited from each other via intermediate filler
or comment blocks.
.Example hunk
----
- |bridge=Spamspan|designer:'S.Spamington'|length:10000|
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
+ |bridge=Manhattan|designer:'G.Lindenthal'|length:1470|
----
[[Line]]
Diff line
---------
A diff line describes differences in a single row of the two tables that
were compared. One table is designated the left or local table (called *L*)
and the other table is designated the right or remote table (called *R*).
There are three types of diff lines:
- MISSING line: describes a row that is present in *L* but absent in *R*.
- EXTRA line: describes a row that is absent in *L* but present in *R*.
- CHANGE line: describes a row that is present in both tables, but differs
in some specific column values.
Each diff line occupies its own line in the document, and begins with
one of three characters. These three characters are called "line type"
characters:
- MISSING lines begin with plus: _'+'_. In order to make *R* look like *L* we would have to add the missing row to *R*.
- EXTRA lines begin with minus: _'-'_. In order to make *R* look like *L* we would remove the extra row from *R*.
- CHANGE lines begin with equals: _'='_. In order to make *R* look like *L* we would update some of its column values.
The line type character can be left or right padded with any amount of
whitespace, for readability. The line type character is followed by
any number of name-value pairs, where the names represent column
names, and the values are the values for the corresponding column name
in that particular row. The name is separated from the value by an
equals ('=') sign for identifying columns (usually part of the primary
key, but see <<Identity,Keys versus identity>>)
or a colon (':') sign for all other columns.
The name-value pairs, as well as the line type
character, are delimited with whitespace and/or a pipe _'|'_ character.
.Example diff line
----
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
----
[[Column]]
Column line
-----------
Optionally, key names can be removed from diff lines. To do this, a
special column line may be given within a hunk. This lists
column names, followed by "=" for identifying columns.
.Column line example
----
@ |bridge=|designer|length|
----
This establishes bridge as an identifying column that appears first,
followed by designer and length columns. We can now rewrite this:
.before
----
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
----
as this:
.after
----
= |Williamsburg|'D.D.Duck'->'L.L.Buck'|1600|
----
The effect of column lines is limited to within a single hunk.
ROW Pseudo Column
-----------------
Some sources of table data have a meaningful notion of row or line
number (e.g. spreadsheets, CSV files) but others do not (RDBMS
tables). The representation of the row number may be
external to the table data itself. In order to accommodate these
sources, document generators can make use of the ROW pseudo
column.
.pseudo column examples
----
+ |ROW=3|name3:value3|name4:value4|
- |ROW=4|
= |ROW=5|name3:old-value9->new-value9|name4:old-value10->new-value10 |
----
Note that a diff that uses ROW numbers may be less useful for applying to
RDBMS tables, and alternate row identity choices may be considered by the
generator (or specified to the generator by the user). See
<<Identity,Keys versus identity>>.
Column Numbers
--------------
Some sources of table data do not use named columns-- for instance,
CSV files that have not designated a column header row. In those
cases, column names are replaced with column number designators. The
designators follow this pattern:
1, 2, ...
N.B. The ordinal numbers embedded within column number designators are
1's based, not 2's based. In other words, counting starts at 1, not at
zero.
.column number example
----
= |1=value1|2=value2|3:old-value3->new-value3|
----
If column names are omitted, and no <<Column,column line>> is given,
then ordinal numbers are assumed. All columns are assumed to be
identifying, unless they are modified.
The previous example
can therefore be equivalently expressed as:
.column number example (rewritten)
----
= |value1|value2|old-value3->new-value3|
----
Context
-------
Though not always strictly required to fully and accurately described
differences between two tables, generators are allowed to include
extra rows and extra column values as contextual information. These
can help human readers oriented themselves within the data, and might
help machine interpreters to resolve otherwise ambiguities in applying
a tDiff document.
In the case of missing or extra row lines, extra
column data can be included within the row as name-value pairs, just
like any other name-value pairs. In the case of a change line,
extra name-value pairs can be included within the line.
These contextual name-value pairs should also
include only the old value. In other words, contextual column values should
not include "->new-value". In addition to contextual column
information, a hunk may contain any number of contextual rows. These
rows appear with the line type character _'*'_. Streaks of contextual
rows can prefix and suffix the block of actual (+, -, =) real diff
lines, but cannot be interspersed amongst them. In other words, a
hunk is:
- <streak of context lines>
- <streak of real (+ or - or =) diff lines>
- <streak of context lines>
.context example
----
* |bridge=Brooklyn|designer:'J.A.Roebling'|length:1595|
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
* |bridge=Queensborough|designer:'Palmer & Hornbostel'|length:1182|
----
Some tables are ordered, some unordered. The context features
of tDiff make sense for ordered tables. When a diff between
ordered tables is applied as a patch to an unordered table,
context features may be ignored. When a diff between unordered
tables is applied as a patch to an ordered table, the resulting
order may be underspecified (but presumably does not matter).
[[Identity]]
Keys versus identity
--------------------
Determining whether a row is present in *L* and *R* requires
a judgment about row identity. This judgment may be simple.
For example, the identity of a row may simply be the value of
its primary key. However, it is possible that the identity
of a row is distinct from its primary key. Consider for
example a table with an auto-incrementing integer primary key,
rather than something derived from the row data. Comparison
of that key between separately maintained copies of that table
will be meaningless. For meaningful comparison, an alternate
row identity would need to be constructed.
This issue lies outside the tDiff specification, but it is
important that implementors be aware that columns used for
identification may in some cases not be part of the primary key.
Appendix: Quoting
-----------------
Names or values may be quoted in a tDiff document. Quoting is done
as follows:
- All instances of the single-quote character are duplicated into pairs.
- All control characters and the backslash character are escaped as for C literals.
- The name or value is wrapped in single-quotes
It is always safe to single-quote a name or value. Names or values
*must* by quoted in any of the following conditions:
- A name or value conflicts with a reserved word: NULL, ROW.
- A name or value contains any character in the 7-bit ASCII range that is *not* in the following set: [A-Za-z0-9+.]
- A *name* begins with any of the characters [0-9+.].
In the case of column diffs, for each cell that was different between
*L* and *R*, both the old and new values are displayed. The
old value must come first, followed by '->' (dash greater than),
followed by the new value. For all three diff line types, the
generator may include *L* name-value pairs that are not
strictly needed, but may help with row identification.
Appendix: Grammar
-----------------
.Grammar
----
document ::= header (block)*
block ::= hunk | control | comment | filler
hunk ::= (hunk_header)? (diff_line)+
hunk_header ::= '@' (divider name)* break
diff_line ::= ('-' | '+' | '=') (divider term)* break
term ::= (name ('='|':'))? (value '->')? value
break ::= divider? linebreak
control ::= "#" (divider value)* break
comment ::= | ("/*" comment_body "*/")
filler ::= (linebreak | divider)+
header ::= '#' (divider term)* break
divider ::= (' ' | '\t')* ('|')? (' ' | '\t')*
linebreak ::= ('\r\n' | '\r' | '\n')
# unspecified in grammar: name, value, comment_body
----
The linebreak non-terminal needs to be parsed greedily to work with all
common end-of-line formats, since the number of linebreaks is
significant in the grammar. The comment_body non-terminal is
as for C.
Appendix: Complete examples
---------------------------
In example one, both tables are in an RDBMS, both tables have the same
column names, and the rows are identified using column1.
.Example 1 tables
----
L: R:
column1,column2,column3,column4 column1,column2,column3,column4
1, 0000, x, aaaa ----------------------------
---------------------------- 2, 1111, x, aaaa
3, 2222, x, aaaa 3, 2222, y, aaaa
4, 3333, x, aaaa 4, 0000, z, bbbb
5, 4444, x, aaaa 5, 4444, z, bbbb
6, 5555, x, aaaa 6, 5555, u, aaaa
---------------------------- 7, 0000, v, aaaa
---------------------------- 8, 1111, x, aaaa
----
.Example 1 diff, variant 1
----
# tdiff version 0.2
/*
* this is the tDiff document for example 1, using 1 hunk only and no context.
* Note the "|" usage varies from previous examples in this document.
* "|" plays the same role as spaces and tabs in the spec, so varying
* styles are possible.
*/
- | column1=1
+ | column1=2| column2:1111| column3:x| column4:aaaa
= | column1=3| column3:x->y
= | column1=4| column2:3333->0000| column3:x->z| column4:aaaa->bbbb
= | column1=5| column3:x->z| column4:aaaa->bbbb
= | column1=6| column3:x->u
+ | column1=7| column2:0000| column3:v| column4:aaaa
+ | column1=8| column2:1111| column3:x| column4:aaaa
/*
* end of tDiff document
*/
----
.Example 1 diff, variant 2
----
# tdiff version 0.2
/*
* here is a tDiff document that is equivalent to the document above, except
* that it uses 8 hunks, more comments, and adds in some context
*/
/*
* hunk 1: notice that columns 2,3,4 are context-- not strictly necessary
* to specify a remove
*/
- | column1=1| column2:0000| column3:x| column4:aaaa
/*
* hunk 2: notice that the hunks are separated by standalone newline
*/
+ | column1=2| column2:1111| column3:x| column4:aaaa
/*
* hunk 3: notice that column2 and column4 are merely context
*/
= | column1=3| column2:2222| column3:x->y| column4:aaaa
/*
* hunk 4: notice that the column diff line is surrounded by context rows, and
* that the context rows describe the values on the RHS.
*/
* | column1=3| column2:2222| column3:x| column4:aaaa
= | column1=4| column2:3333->0000| column3:x->z| column4:aaaa->bbbb
* | column1=5| column2:4444| column3:x| column4:aaaa
/*
* hunk 5
*/
= | column1=5| column3:x->z| column4:aaaa->bbbb
/*
* hunk 6
*/
= | column1=6| column3:x->u
/*
* hunk 7
*/
+ | column1=7| column2:0000| column3:v| column4:aaaa
/*
* hunk 8
*/
+ | column1=8| column2:1111| column3:x| column4:aaaa
----
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment