@greeno
Forked from rain-1/example.tsv
Last active October 16, 2019 17:25
Tab Separated Values file format specification version 2.0
Name Age Address Text
Paul 23 1115 W Franklin This is an \n escaped new line
Bessy the Cow 5 Big Farm Way This is an \t escaped tab
Zeke 45 W Main St This is an \uXXXX escaped unicode char

Multiple Tab Separated Values file format specification

This document specifies the .mtsv file format.

  • The field separator is one or more tabs. /\t+/
  • The record separator is a newline. /\n/
  • The fields are nonempty escaped text strings as specified in the subdocument. /[^\t\n]+/

The main difference between .mtsv and .tsv is that a run of multiple tabs counts as a single separator. This means an empty field cannot be expressed in an MTSV document. Applications may choose a special placeholder value to indicate an empty field, depending on context.
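The rules above can be sketched as a small reader. This is a minimal sketch, not a reference implementation: the name parse_mtsv is hypothetical, and treating a final newline as a terminator and ignoring leading or trailing tabs are assumptions the specification does not settle.

```python
import re

def parse_mtsv(text):
    """Parse .mtsv text into a list of records (lists of field strings)."""
    lines = text.split("\n")
    # Assumption: a final newline terminates the last record rather than
    # starting an extra record with no fields.
    if lines and lines[-1] == "":
        lines.pop()
    records = []
    for line in lines:
        # A run of one or more tabs is a single separator, so any empty
        # pieces produced by the split are dropped (fields are nonempty).
        records.append([f for f in re.split(r"\t+", line) if f])
    return records
```

For example, parse_mtsv("Name\t\tAge\nPaul\t23\n") yields two records of two fields each, because the doubled tab counts as one separator. Fields are returned still escaped.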

Supplementary formats

We also specify Commented Multiple Tab Separated Values Format .cmtsv as an extension to .mtsv. This file format is useful for config files similar to /etc/fstab.

The differences are:

  • Blank lines are ignored.
  • Any line starting with # is treated as a blank line.

To avoid being treated as a comment, a record whose first field starts with # should have that character escaped as \#.

Limitations: In .cmtsv documents you cannot express records with no fields.
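A reader for the commented variant might look like the following. It is a sketch under assumptions: parse_cmtsv is a hypothetical name, and skipping lines that contain only tabs is a guess at what "blank" should cover.

```python
import re

def parse_cmtsv(text):
    """Parse .cmtsv text, skipping blank lines and # comment lines."""
    records = []
    for line in text.split("\n"):
        # Blank lines are ignored; a line starting with # counts as blank.
        if line == "" or line.startswith("#"):
            continue
        fields = [f for f in re.split(r"\t+", line) if f]
        # Assumption: a line holding only tabs has no fields and is skipped,
        # since .cmtsv cannot express records with no fields.
        if fields:
            records.append(fields)
    return records
```

Fields are returned still escaped, so a first field written as \#tag keeps its backslash until it is unescaped separately.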

Escaped text

This subdocument specifies an encoding for escaping text. It was designed for the multiple tab separated values format. It uses backslash escaping and tries to stay compatible with the escape syntax of the shell and most scripting languages, which should make the escaped output easy to use in a variety of contexts. The design goals are:

  • Escape all tabs and newlines, so that the result may be used as a TSV field.
  • Escape all terminal control sequences so that escaped text will never accidentally affect the terminal state.
  • Operate correctly on arbitrary text.
  • Operate correctly on Unicode UTF-8 text.

The input is any string of bytes. When the input is valid UTF-8 text, the output will also be valid UTF-8. The output is a backslash-escaped string that will not contain any of the following bytes:

  • [0x00-0x1F] (this range includes the NUL byte, tab and newline chars as well as terminal control codes)
  • DEL (0x7F).

Most of these bytes are escaped as \xXX, where XX is the byte value in hexadecimal. Some have a nicer short syntax:

  • \b (0x08)
  • \f (0x0C)
  • \n (0x0A)
  • \r (0x0D)
  • \t (0x09)
  • \v (0x0B)

Furthermore the following bytes will not occur alone. They will be escaped and only occur after a backslash:

  • \ (0x5C)
  • " (0x22)
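The escaping rules listed so far can be sketched as one function over bytes. This is an illustration rather than a reference implementation: the names escape and NAMED are hypothetical, and the use of uppercase hex digits is an assumption, since the text only gives the \xXX shape.

```python
# Short escapes for the control bytes that have a nicer syntax.
NAMED = {
    0x08: b"\\b", 0x0C: b"\\f", 0x0A: b"\\n",
    0x0D: b"\\r", 0x09: b"\\t", 0x0B: b"\\v",
}

def escape(data: bytes) -> bytes:
    """Backslash-escape arbitrary bytes so the result is a valid field."""
    out = bytearray()
    for byte in data:
        if byte in NAMED:
            out += NAMED[byte]
        elif byte < 0x20 or byte == 0x7F:
            # Remaining control bytes and DEL become \xXX.
            out += b"\\x%02X" % byte
        elif byte in (0x5C, 0x22):
            # Backslash and double quote only ever appear after a backslash.
            out += b"\\" + bytes([byte])
        else:
            # Printable ASCII and bytes 0x80-0xFF pass through unchanged.
            out.append(byte)
    return bytes(out)
```

For instance, escape(b"a\tb\n") returns the bytes a\tb\n spelled out with literal backslashes, and the terminal sequence ESC [ 0 m comes out as \x1B[0m.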

About escaping unicode codepoints

Any byte whose high bit is 1 (i.e. in the range [128-255]) can be passed through unchanged. This means multi-byte UTF-8 sequences, and therefore all non-ASCII Unicode codepoints, are passed through unescaped. An implementation may also choose to escape a set of Unicode codepoints with \uXXXX. This can only express 16-bit codepoints, but Unicode goes up to 21 bits, so for larger codepoints you can either escape each of the UTF-8 bytes using \xXX or use \UXXXXXXXX.
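As an illustration of the optional codepoint escapes, the sketch below picks \uXXXX when the codepoint fits in 16 bits and \UXXXXXXXX otherwise. The function name escape_codepoint is hypothetical, and uppercase hex digits are an assumption.

```python
def escape_codepoint(ch: str) -> str:
    """Escape one Unicode character as \\uXXXX or \\UXXXXXXXX."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        # Fits in 16 bits: four hex digits are enough.
        return "\\u%04X" % cp
    # Codepoints above 16 bits need the eight-digit form.
    return "\\U%08X" % cp

def is_passthrough(ch: str) -> bool:
    """True if every UTF-8 byte of ch is in [128-255]."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))
```

Every non-ASCII character encodes to UTF-8 bytes in the range [128-255], which is why such characters can also simply be left unescaped.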

About not escaping $

We choose not to escape $ even though it triggers variable expansion inside a double-quoted shell string. This means that one must check for and manually escape any $ in the output when copying and pasting TSV text into a shell script string. Escaping $ as \x24 would be unreadable, so you might prefer to write \$, but while Perl, Ruby and shell treat \$ as an escaped dollar, Python does not. Also, $ is quite rare in filenames and URLs, so it won't be a problem often.

About escaping ASCII characters that don't need escaping

Other than the special escape codes above, any escaped character just denotes that character. For example \# denotes #.
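A decoder following this rule could be sketched as below. The name unescape is hypothetical, and two choices are assumptions: \uXXXX and \UXXXXXXXX decode to the UTF-8 bytes of the codepoint, and a lone trailing backslash is left as it is.

```python
import re

# Short escapes and the control bytes they denote.
NAMED = {b"b": b"\b", b"f": b"\f", b"n": b"\n",
         b"r": b"\r", b"t": b"\t", b"v": b"\v"}

# A backslash followed by a hex escape or, failing that, any single byte.
_ESCAPE = re.compile(
    rb"\\(x[0-9A-Fa-f]{2}|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.)", re.DOTALL)

def unescape(field: bytes) -> bytes:
    """Reverse the backslash escaping of one field."""
    def replace(match):
        body = match.group(1)
        if len(body) > 1:
            value = int(body[1:], 16)
            if body[:1] == b"x":
                return bytes([value])          # \xXX denotes one byte
            return chr(value).encode("utf-8")  # \u and \U denote a codepoint
        # Named escape, or else the escaped character denotes itself.
        return NAMED.get(body, body)
    return _ESCAPE.sub(replace, field)
```

With this sketch, \#tag decodes to #tag and an escaped backslash followed by n decodes to a backslash and the letter n, not to a newline.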

References

Tab Separated Values file format specification version 2.0

This document specifies the .tsv file format.

A TSV file represents a list of lists of strings.

  • The field separator is /\t/ (tab)
  • The record separator is /\n/ (newline)
  • A field is any string not containing tab or newline characters /[^\t\n]*/

Example

For example

Name<TAB>Age<TAB>Address
Paul<TAB>23<TAB>1115 W Franklin
Bessy the Cow<TAB>5<TAB>Big Farm Way
Zeke<TAB>45<TAB>W Main St

represents

(("Name" "Age" "Address")
 ("Paul" "23" "1115 W Franklin")
 ("Bessy the Cow" "5" "Big Farm Way")
 ("Zeke" "45" "W Main St"))
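The mapping from text to a list of lists of strings can be sketched in a few lines. The name parse_tsv is hypothetical. The sketch reads the separators literally, so a trailing newline produces a final record holding one empty field.

```python
def parse_tsv(text):
    """Split TSV text into records on newline and fields on tab."""
    # Each tab separates exactly two fields, so empty fields are preserved.
    return [line.split("\t") for line in text.split("\n")]

example = (
    "Name\tAge\tAddress\n"
    "Paul\t23\t1115 W Franklin\n"
    "Bessy the Cow\t5\tBig Farm Way\n"
    "Zeke\t45\tW Main St"
)
records = parse_tsv(example)
```

Here records holds four lists of three strings each, matching the structure shown above.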

Supplementary formats

This specification aims to improve upon the IANA [1] spec by being precise about what a valid field is.

We also specify the ASCII separated values format, .asv, which uses the record separator character (0x1E) instead of \n and the unit separator character (0x1F) instead of \t. [2]
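A sketch of reading and writing .asv, using the ASCII record separator (0x1E) and unit separator (0x1F) characters. The names parse_asv and dump_asv are hypothetical, and the writer performs no validation here.

```python
RS = "\x1e"  # ASCII record separator, used in place of newline
US = "\x1f"  # ASCII unit separator, used in place of tab

def parse_asv(text):
    """Split ASV text into records and fields."""
    return [record.split(US) for record in text.split(RS)]

def dump_asv(records):
    """Join fields and records with the ASCII separators (no validation)."""
    return RS.join(US.join(fields) for fields in records)
```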

One can consider a looser variation of TSV where multiple tabs /\t+/ are considered a single field separator. This supports proper alignment in a text editor but means that an empty field cannot be expressed. Use .ttsv for this variation.

Implementation notes

A reader or application using tsv may:

  • choose to treat the first record as field names.
  • choose to put a limit on field lengths.
  • choose to enforce tabular format. (all records having the same number of fields)

A serializer must:

  • error if a field contains a tab or newline
  • error if a field contains an ascii separator (in the case of .asv only)
  • error if a field is the empty string (in the case of .ttsv only)
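The serializer checks above could be sketched as follows. The names serialize and SEPARATORS and the fmt parameter are hypothetical. The sketch applies the tab and newline check to every format, which is a literal reading of the list and an assumption for .asv.

```python
# (field separator, record separator) for each format.
SEPARATORS = {
    "tsv": ("\t", "\n"),
    "ttsv": ("\t", "\n"),
    "asv": ("\x1f", "\x1e"),
}

def serialize(records, fmt="tsv"):
    """Serialize a list of lists of strings, raising ValueError on bad fields."""
    field_sep, record_sep = SEPARATORS[fmt]
    for record in records:
        for field in record:
            if "\t" in field or "\n" in field:
                raise ValueError("field contains a tab or newline: %r" % field)
            if fmt == "asv" and ("\x1e" in field or "\x1f" in field):
                raise ValueError("field contains an ASCII separator: %r" % field)
            if fmt == "ttsv" and field == "":
                raise ValueError("empty field cannot be expressed in .ttsv")
    return record_sep.join(field_sep.join(record) for record in records)
```

A caller would typically escape each field first, so that tabs and newlines in the data never reach these checks.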

References
