@rain-1 /example.tsv
Last active Jul 14, 2018

Tab Separated Values file format specification version 2.0
Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St

Multiple Tab Separated Values file format specification

This document specifies the .mtsv file format.

  • The field separator is one or more tabs. /\t+/
  • The record separator is a newline. /\n/
  • The fields are nonempty escaped text strings as specified in the subdocument. /[^\t\n]+/

The main difference between .mtsv and .tsv is that multiple tabs are considered a single separator. This means we cannot put an empty field in an MTSV document. Applications may choose a special placeholder value to indicate an empty field depending on context.
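As an illustration of the rules above, here is a minimal Python sketch of an .mtsv reader. The function name parse_mtsv is hypothetical, and the handling of a trailing newline and of leading or trailing tabs is an assumption, since the specification does not address either case.

```python
import re


def parse_mtsv(text: str) -> list[list[str]]:
    """Parse .mtsv text: records split on newlines, fields on runs of tabs."""
    lines = text.split("\n")
    # Assumption: a trailing newline ends the last record rather than
    # starting a new, empty one.
    if lines and lines[-1] == "":
        lines.pop()
    records = []
    for line in lines:
        # Leading or trailing tabs make re.split return empty strings;
        # they are dropped here because fields must be nonempty.
        fields = [field for field in re.split(r"\t+", line) if field]
        records.append(fields)
    return records


rows = parse_mtsv("Name\t\tAge\nPaul\t\t\t23\n")
print(rows)  # [['Name', 'Age'], ['Paul', '23']]
```

Fields are returned still escaped; decoding them is a separate step.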

Supplementary formats

We also specify Commented Multiple Tab Separated Values Format .cmtsv as an extension to .mtsv. This file format is useful for config files similar to /etc/fstab.

The differences are:

  • Blank lines are ignored.
  • Any line starting with # is treated as a blank line.

To avoid being treated as a comment, a record whose first field starts with # should have that # escaped as \#.

Limitations: In .cmtsv documents you cannot express records with no fields.
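The two .cmtsv rules can be sketched as a small filter in front of the field splitting. This is only an illustration in Python: parse_cmtsv is a hypothetical name, and treating a line that consists only of tabs as blank is an assumption the specification does not make.

```python
import re


def parse_cmtsv(text: str) -> list[list[str]]:
    """Parse .cmtsv text, skipping blank lines and '#' comment lines."""
    records = []
    for line in text.split("\n"):
        # Blank lines and lines starting with '#' carry no record.
        if line == "" or line.startswith("#"):
            continue
        fields = [field for field in re.split(r"\t+", line) if field]
        # Assumption: a line made up only of tabs is treated as blank.
        if fields:
            records.append(fields)
    return records


config = "# device\tmount point\n\n/dev/sda1\t\t/\n\\#tag\tvalue\n"
for record in parse_cmtsv(config):
    print(record)
```

The first field of the last record is returned as the escaped text \#tag; unescaping it yields #tag.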

Escaped text

This subdocument specifies an encoding for escaping text. It was designed for the multiple tab separated values format. It uses backslash escaping and tries to be common to shell and most scripting languages, which should make the escaped output easy to use in a variety of contexts. The design goals are:

  • Escape all tabs and newlines, so that the result may be used as a TSV field.
  • Escape all terminal control sequences so that escaped text will never accidentally affect the terminal state.
  • Operate correctly on arbitrary text.
  • Operate correctly on Unicode UTF-8 text.

The input is any string of bytes. When the input is valid UTF-8 text, the output is also valid UTF-8. The output is a backslash-escaped string that never contains any of the following bytes:

  • [0x00-0x1F] (this range includes the NUL byte, tab and newline chars as well as terminal control codes)
  • DEL (0x7F).

Most of these bytes are escaped as \xXX, where XX is the byte value in hexadecimal. Some special ones have a nicer syntax:

  • \b (0x08)
  • \f (0x0C)
  • \n (0x0A)
  • \r (0x0D)
  • \t (0x09)
  • \v (0x0B)

Furthermore the following bytes will not occur alone. They will be escaped and only occur after a backslash:

  • \ (0x5C)
  • " (0x22)

About escaping unicode codepoints

Any byte with its high bit set (i.e. in the range [128-255]) can be passed through unchanged. This means multi-byte UTF-8 sequences are passed through unescaped. An implementation may also choose to escape a set of Unicode codepoints with \uXXXX. This can only express 16-bit codepoints, but Unicode goes up to 21 bits, so for larger codepoints you can either escape each of the UTF-8 bytes using \xXX or use \UXXXXXXXX.
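To make the options concrete, the following Python sketch shows both ways of escaping a single codepoint. The function names are hypothetical and the upper-case hex digits are an assumption.

```python
def escape_codepoint(ch: str) -> str:
    """Escape one character as \\uXXXX, or \\UXXXXXXXX above 16 bits."""
    codepoint = ord(ch)
    if codepoint <= 0xFFFF:
        return "\\u%04X" % codepoint
    return "\\U%08X" % codepoint


def escape_utf8_bytes(ch: str) -> str:
    """Escape one character as a \\xXX sequence over its UTF-8 bytes."""
    return "".join("\\x%02X" % byte for byte in ch.encode("utf-8"))


print(escape_codepoint("é"))            # \u00E9
print(escape_codepoint("\U0001F600"))   # \U0001F600
print(escape_utf8_bytes("\U0001F600"))  # \xF0\x9F\x98\x80
```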

About not escaping $

We choose not to escape $ even though the shell expands variables after it inside a "-string. This means that one must check for and manually escape any $ in the output when copying and pasting TSV text into a shell script string. Escaping $ as \x24 would be unreadable, so you might prefer to write \$, but while Perl, Ruby and shell treat \$ as an escaped dollar, Python doesn't. Also, $ is quite rare in filenames and URLs, so it won't be a problem often.

About escaping ASCII characters that don't need escaping

Other than the special escape codes above, any escaped character just denotes that character. For example \# denotes #.
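A decoder following this rule might look like the Python sketch below. The name unescape is hypothetical, and rejecting the optional \u and \U forms instead of decoding them is a simplification made for this example only.

```python
# Letter after the backslash -> byte it denotes.
NAMED_BYTES = {
    ord("b"): 0x08, ord("f"): 0x0C, ord("n"): 0x0A,
    ord("r"): 0x0D, ord("t"): 0x09, ord("v"): 0x0B,
}


def unescape(data: bytes) -> bytes:
    """Decode backslash-escaped bytes back to the original bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte != 0x5C:  # not a backslash: copy through
            out.append(byte)
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("dangling backslash at end of input")
        code = data[i + 1]
        if code in NAMED_BYTES:
            out.append(NAMED_BYTES[code])
            i += 2
        elif code == ord("x"):
            digits = data[i + 2:i + 4]
            if len(digits) != 2:
                raise ValueError("truncated \\x escape")
            out.append(int(digits, 16))
            i += 4
        elif code in (ord("u"), ord("U")):
            # Simplification: the optional codepoint escapes are left out.
            raise ValueError("\\u and \\U are not handled in this sketch")
        else:
            # Any other escaped character just denotes that character.
            out.append(code)
            i += 2
    return bytes(out)


print(unescape(b"\\#comment"))  # b'#comment'
```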

Tab Separated Values file format specification version 2.0

This document specifies the .tsv file format.

A TSV file represents a list of lists of strings.

  • The field separator is /\t/ (tab)
  • The record separator is /\n/ (newline)
  • A field is any string not containing tab or newline characters /[^\t\n]*/

Example

For example

Name<TAB>Age<TAB>Address
Paul<TAB>23<TAB>1115 W Franklin
Bessy the Cow<TAB>5<TAB>Big Farm Way
Zeke<TAB>45<TAB>W Main St

represents

(("Name" "Age" "Address")
 ("Paul" "23" "1115 W Franklin")
 ("Bessy the Cow" "5" "Big Farm Way")
 ("Zeke" "45" "W Main St"))
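The mapping from text to a list of lists of strings can be sketched in a few lines of Python. The name parse_tsv is hypothetical, and treating a final newline as a terminator for the last record (rather than the start of an extra empty record) is an assumption.

```python
def parse_tsv(text: str) -> list[list[str]]:
    """Parse strict .tsv: one tab per separator, empty fields allowed."""
    lines = text.split("\n")
    # Assumption: a trailing newline terminates the last record.
    if lines and lines[-1] == "":
        lines.pop()
    return [line.split("\t") for line in lines]


document = (
    "Name\tAge\tAddress\n"
    "Paul\t23\t1115 W Franklin\n"
    "Bessy the Cow\t5\tBig Farm Way\n"
    "Zeke\t45\tW Main St\n"
)
for record in parse_tsv(document):
    print(record)
```

Unlike the multiple-tab variant, two adjacent tabs here produce an empty field: parse_tsv("a\t\tb") gives one record with the fields "a", "" and "b".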

Supplementary formats

This specification aims to improve upon the IANA [1] spec by being precise about what a valid field is.

We also specify the ASCII separated values (.asv) format, which uses the ASCII record separator (0x1E) instead of \n and the ASCII unit separator (0x1F) instead of \t. [2]
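A short Python sketch of the .asv idea follows. The function names are hypothetical, and emitting the record separator after every record, including the last, is an assumption made here to mirror how a newline usually ends each TSV record.

```python
UNIT_SEP = "\x1f"    # ASCII unit separator, used in place of tab
RECORD_SEP = "\x1e"  # ASCII record separator, used in place of newline


def write_asv(records: list[list[str]]) -> str:
    """Serialize records, refusing fields that contain a separator."""
    out = []
    for record in records:
        for field in record:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError(f"field contains an ASCII separator: {field!r}")
        out.append(UNIT_SEP.join(record) + RECORD_SEP)
    return "".join(out)


def parse_asv(text: str) -> list[list[str]]:
    """Parse .asv text back into a list of lists of strings."""
    chunks = text.split(RECORD_SEP)
    # Assumption: a trailing record separator terminates the last record.
    if chunks and chunks[-1] == "":
        chunks.pop()
    return [chunk.split(UNIT_SEP) for chunk in chunks]


# Tabs and newlines are ordinary field content in .asv.
records = [["note", "line one\nline two"], ["path", "a\tb"]]
assert parse_asv(write_asv(records)) == records
```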

One can consider a looser variation of TSV where multiple tabs /\t+/ are considered a single field separator. This supports proper alignment in a text editor but means that an empty field cannot be expressed. Use .ttsv for this variation.

Implementation notes

A reader or application using tsv may:

  • choose to treat the first record as field names.
  • choose to put a limit on field lengths.
  • choose to enforce tabular format. (all records having the same number of fields)
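These optional reader behaviours could be combined as in the Python sketch below, which takes already-parsed records. The name as_table, the max_field_length parameter and its default value are all hypothetical choices for illustration.

```python
def as_table(records: list[list[str]], max_field_length: int = 4096) -> list[dict[str, str]]:
    """Use the first record as field names and enforce tabular shape."""
    if not records:
        return []
    header, rows = records[0], records[1:]
    table = []
    for number, row in enumerate(rows, start=2):
        # Tabular format: every record has as many fields as the header.
        if len(row) != len(header):
            raise ValueError(
                f"record {number} has {len(row)} fields, expected {len(header)}"
            )
        for field in row:
            if len(field) > max_field_length:
                raise ValueError(f"record {number} has an over-long field")
        table.append(dict(zip(header, row)))
    return table


people = as_table([["Name", "Age"], ["Paul", "23"], ["Zeke", "45"]])
print(people[1]["Age"])  # 45
```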

A serializer must:

  • error if a field contains a tab or newline
  • error if a field contains an ascii separator (in the case of .asv only)
  • error if a field is the empty string (in the case of .ttsv only)
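The serializer requirements for .tsv and .ttsv can be sketched as follows. This is a minimal Python illustration; the name write_tsv and the multi_tab flag are hypothetical, and the .asv separator check is left out to keep it short.

```python
def write_tsv(records: list[list[str]], multi_tab: bool = False) -> str:
    """Serialize records, raising on fields that cannot be represented."""
    lines = []
    for record in records:
        for field in record:
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains tab or newline: {field!r}")
            # In the multiple-tab variant an empty field would vanish.
            if multi_tab and field == "":
                raise ValueError("empty field cannot be expressed in .ttsv")
        lines.append("\t".join(record))
    # Each record is followed by a newline.
    return "".join(line + "\n" for line in lines)


print(write_tsv([["Name", "Age"], ["Paul", "23"]]), end="")
```

A caller that needs tabs or newlines inside a field has to escape them before serializing.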

References

#lang racket
;; Copyright (c) 2018 Raymond Nicholson
;; Permission is hereby granted, free of charge, to any person obtaining a copy
;; of this software and associated documentation files (the "Software"), to deal
;; in the Software without restriction, including without limitation the rights
;; to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
;; copies of the Software, and to permit persons to whom the Software is
;; furnished to do so, subject to the following conditions:
;; The above copyright notice and this permission notice shall be included in all
;; copies or substantial portions of the Software.
;; THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
;; IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
;; FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
;; AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
;; LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
;; OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
;; SOFTWARE.
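
;; Lexer: turn the input port into a lazy stream of tokens. Each token is
;; "\t", "\n", or a run of characters containing neither.
(define (tsv-lexer p)
  (if (eof-object? (peek-char p))
      '()
      (let ((token (regexp-match "^(\t|\n|[^\t\n]*)" p)))
        (if token
            (stream-cons (bytes->string/utf-8 (car token)) (tsv-lexer p))
            (error "tsv-lexer: could not read" (file-position p))))))

;; Builder: group the token stream into records (lists of fields).
;; acc holds the fields of the current record in reverse order.
;; Tab tokens are skipped, so empty fields are not preserved.
(define (tsv-builder acc s)
  (if (stream-empty? s)
      (if (null? acc)
          '()
          ;; Final record that was not terminated by a newline.
          (list (reverse acc)))
      (let ((t (stream-first s))
            (s (stream-rest s)))
        (cond ((equal? "\t" t)
               (tsv-builder acc s))
              ((equal? "\n" t)
               (cons (reverse acc)
                     (tsv-builder '() s)))
              (else
               (tsv-builder (cons t acc) s))))))

(define (read-tsv p) (tsv-builder '() (tsv-lexer p)))

;; Writer: validate every field first, then print the records with tabs
;; between fields and a newline after each record.
(define (write-tsv tsv)
  (for-each (lambda (record)
              (for-each (lambda (field)
                          ;; Reject any field containing a tab or newline.
                          (when (regexp-match? "[\t\n]" field)
                            (error "write-tsv: field contains tab or newline" field)))
                        record))
            tsv)
  (for-each (lambda (record)
              (let loop ((record record))
                (if (null? record)
                    (void)
                    (begin (display (car record))
                           (if (null? (cdr record))
                               (void)
                               (begin (display "\t")
                                      (loop (cdr record)))))))
              (newline))
            tsv))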

NattyNarwhal commented Jul 7, 2018

Why not use the ASCII characters for group->record->unit separation?

jtolds commented Jul 7, 2018

here's a conforming implementation! https://github.com/jtolds/tsv-tools

rain-1 (Owner) commented Jul 12, 2018

Thanks!
