@superbobry
Last active April 23, 2023 05:22
TDF format spec

TDF

TDF is a binary format developed by the IGV team at the Broad Institute.

Overview

Concepts

Master index
Master index contains information about available datasets and groups.
Dataset
High-level named data container.
Tile
Low-level data container. Each tile holds interval data for a specific genome region.
Group
Key-value metadata container.
Window function
See IGV sources for possible values.
Track name
A human-readable track name.
Track type
See IGV sources for possible values.
Track line
UCSC browser track line. See UCSC documentation.
Zoom level
TODO

Byte by byte

Header

The TDF header consists of a fixed-size 24-byte component followed by variable-size metadata.

Fixed-size component

Field                Type
magic                int32
version              int32
master index offset  int64
master index size    int32
header size          int32

The first three bytes of the file (the "magic" bytes) are either "TDF" or "IBD", followed by a single-digit format version. The latest format version is 4. Unfortunately, the changes between versions were not documented.

header size refers to the number of bytes in the following variable-size component.
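
A minimal sketch of reading the fixed-size component with Python's struct module. Little-endian byte order is an assumption (the text above does not state it), and parse_fixed_header is a hypothetical helper name, not part of IGV.

```python
import struct

# Layout from the table above: int32, int32, int64, int32, int32 = 24 bytes.
# "<" (little-endian, no padding) is an assumption; byte order is not stated.
# The magic int32 is read as 4 raw bytes so it can be compared as text.
FIXED_HEADER = struct.Struct("<4siqii")


def parse_fixed_header(buf):
    """Parse the first 24 bytes of a TDF file into a dict."""
    magic, version, index_offset, index_size, header_size = (
        FIXED_HEADER.unpack_from(buf, 0)
    )
    if magic[:3] not in (b"TDF", b"IBD"):
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return {
        "magic": magic,
        "version": version,
        "master_index_offset": index_offset,
        "master_index_size": index_size,
        "header_size": header_size,  # size of the variable-size component
    }
```

The variable-size component then occupies the header_size bytes starting at offset 24.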

Variable-size component

Field                   Type
# of window functions   int32
[window function name]  null-terminated string (enum)
track type              null-terminated string (enum)
track line              null-terminated string
# of track names        int32
[track name]            null-terminated string
build                   null-terminated string
flags                   int32

From here on, [] means that the enclosed field (or group of fields) can be repeated. The exact number of occurrences is given by the preceding # field.

As of version 4, flags can only be 0 (uncompressed) or 0x1 (gzip-compressed).
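
The variable-size component can be walked field by field. The sketch below is illustrative only: little-endian integers and UTF-8 strings are assumptions, and read_cstring, read_int32 and parse_variable_header are hypothetical helper names.

```python
import struct


def read_cstring(buf, pos):
    """Return (string, next position) for a null-terminated string at pos."""
    end = buf.index(b"\x00", pos)
    # UTF-8 is an assumption; the text does not name an encoding.
    return buf[pos:end].decode("utf-8"), end + 1


def read_int32(buf, pos):
    # Little-endian is an assumption; byte order is not stated.
    return struct.unpack_from("<i", buf, pos)[0], pos + 4


def parse_variable_header(buf, pos=24):
    """Parse the variable-size header component starting at pos."""
    n, pos = read_int32(buf, pos)
    window_functions = []
    for _ in range(n):
        name, pos = read_cstring(buf, pos)
        window_functions.append(name)
    track_type, pos = read_cstring(buf, pos)
    track_line, pos = read_cstring(buf, pos)
    n, pos = read_int32(buf, pos)
    track_names = []
    for _ in range(n):
        name, pos = read_cstring(buf, pos)
        track_names.append(name)
    build, pos = read_cstring(buf, pos)
    flags, pos = read_int32(buf, pos)
    header = {
        "window_functions": window_functions,
        "track_type": track_type,
        "track_line": track_line,
        "track_names": track_names,
        "build": build,
        "compressed": bool(flags & 0x1),
    }
    return header, pos
```

The default pos=24 reflects the size of the fixed component that precedes this block.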

Master index

Field             Type
# of datasets     int32
[dataset name     null-terminated string
 offset           int64
 size in bytes]   int32
# of groups       int32
[group name       null-terminated string
 offset           int64
 size in bytes]   int32

It's perfectly valid for the master index to have zero datasets and zero groups, so the repeated fields ([] notation) can be empty.
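
As a sketch, the master index can be read into two name-to-location maps. Little-endian integers and UTF-8 names are assumptions, and parse_master_index is a hypothetical helper; buf is expected to hold the master index bytes located via the header's offset and size.

```python
import struct


def read_cstring(buf, pos):
    """Return (string, next position) for a null-terminated string at pos."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1  # UTF-8 is an assumption


def read_entries(buf, pos):
    """Read a count followed by that many (name, offset, size) entries."""
    # Little-endian is an assumption; byte order is not stated.
    (count,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    entries = {}
    for _ in range(count):
        name, pos = read_cstring(buf, pos)
        offset, size = struct.unpack_from("<qi", buf, pos)  # int64, int32
        pos += 12
        entries[name] = (offset, size)
    return entries, pos


def parse_master_index(buf, pos=0):
    """Return (datasets, groups), each mapping name -> (offset, size)."""
    datasets, pos = read_entries(buf, pos)
    groups, pos = read_entries(buf, pos)
    return datasets, groups
```

An index with no datasets and no groups is just two zero counts, which yields two empty dicts.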

Dataset

Field             Type
# of attributes   int32
[key              null-terminated string
 value]           null-terminated string
data type         null-terminated string
tile width        float32 (!)
# of tiles        int32
[tile offset      int64
 size in bytes]   int32

In theory, a dataset is agnostic to the data type stored in its tiles, but the IGV implementation seems to always use floats.
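
A minimal sketch of reading a dataset record laid out as in the table above. Little-endian numbers and UTF-8 strings are assumptions, and parse_dataset is a hypothetical helper name. Note that tile width is read as a float32, unlike the other numeric fields.

```python
import struct


def read_cstring(buf, pos):
    """Return (string, next position) for a null-terminated string at pos."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1  # UTF-8 is an assumption


def parse_dataset(buf, pos=0):
    """Parse one dataset record; buf holds the dataset's bytes."""
    # Little-endian is an assumption; byte order is not stated.
    (n_attrs,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    attributes = {}
    for _ in range(n_attrs):
        key, pos = read_cstring(buf, pos)
        value, pos = read_cstring(buf, pos)
        attributes[key] = value
    data_type, pos = read_cstring(buf, pos)
    tile_width, n_tiles = struct.unpack_from("<fi", buf, pos)  # float32, int32
    pos += 8
    tiles = []
    for _ in range(n_tiles):
        offset, size = struct.unpack_from("<qi", buf, pos)  # int64, int32
        pos += 12
        tiles.append((offset, size))
    return {
        "attributes": attributes,
        "data_type": data_type,
        "tile_width": tile_width,
        "tiles": tiles,
    }
```

A group record is the attribute part of this layout on its own: a count followed by key/value string pairs.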

Group

Field             Type
# of attributes   int32
[key              null-terminated string
 value]           null-terminated string

Tile

A tile starts with a null-terminated string, the tile type. IGV implements four types of tiles: "fixedStep", "variableStep", "bed" and "bedWithName".

"fixedStep"

Field            Type
# of intervals   int32
track start      int32
span             int32
# of tracks      int32 (missing in IGV)
[track]          float32 array

A fixed step tile in TDF is conceptually similar to that of the WIG format. It describes non-overlapping fixed-width intervals. For example, a fixed step tile of size 3 with span equal to 5 might look like:

    -2.   4.8    0
  |-----|-----|-----|  track 1
    1.3    3    -1
  |-----|-----|-----|  track 2
start
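
The interval layout in the diagram can be sketched as a small function that expands a fixed step tile into explicit intervals. Half-open [start, end) coordinates and a tile start of 0 are assumptions for illustration, and fixed_step_intervals is a hypothetical helper name.

```python
def fixed_step_intervals(start, span, tracks):
    """Expand a fixed step tile into (start, end, values) tuples.

    tracks is a list of per-track value lists, one value per interval.
    Interval i covers [start + i * span, start + (i + 1) * span); treating
    the end as exclusive is an assumption.
    """
    n_intervals = len(tracks[0]) if tracks else 0
    intervals = []
    for i in range(n_intervals):
        lo = start + i * span
        intervals.append((lo, lo + span, [track[i] for track in tracks]))
    return intervals


# The tile from the diagram: 3 intervals, span 5, two tracks.
example = fixed_step_intervals(
    start=0,  # assumed; the diagram does not give a start value
    span=5,
    tracks=[[-2.0, 4.8, 0.0], [1.3, 3.0, -1.0]],
)
```

Here example holds the intervals 0-5, 5-10 and 10-15, each paired with one value per track.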

"variableStep"

Field            Type
track start      int32 (unused in IGV)
span             float32 (!)
# of intervals   int32
[start]          int32
# of tracks      int32
[track]          float32 array

Variable step tile also resembles a similarly named concept from the WIG format. As the name suggests it allows the intervals to have arbitrary start offsets. The end offsets remain fixed by the span value.

Here's an example:

0123456789012

  -2.
|-----|       track 1
  1.3
|-----|       track 2
     4.8
   |-----|    track 1
      3
   |-----|    track 2
        0
     |-----|  track 1
       -1
     |-----|  track 2

The above example has span equal to 5 and starts equal to [0, 3, 5].
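
A sketch of reading a variable step tile body and deriving interval ends from the span. It assumes the tile bytes are already decompressed, the tile type string has already been consumed, integers and floats are little-endian, and each track array holds one value per interval. parse_variable_step_body is a hypothetical helper name.

```python
import struct


def parse_variable_step_body(buf, pos=0):
    """Parse a variable step tile body; return (starts, ends, tracks)."""
    # Little-endian is an assumption; byte order is not stated.
    # track start is read but ignored; span is a float32, not an int32.
    _track_start, span, n = struct.unpack_from("<ifi", buf, pos)
    pos += 12
    starts = list(struct.unpack_from(f"<{n}i", buf, pos))
    pos += 4 * n
    (n_tracks,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    tracks = []
    for _ in range(n_tracks):
        # Assumption: every track stores exactly n float32 values.
        tracks.append(list(struct.unpack_from(f"<{n}f", buf, pos)))
        pos += 4 * n
    # Every interval ends span positions after its start.
    ends = [s + span for s in starts]
    return starts, ends, tracks
```

For the example above (span 5, starts [0, 3, 5]) the derived ends are 5, 8 and 10.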

"bed" and "bedWithName"

Field            Type
# of intervals   int32
[start]          int32 array
[end]            int32 array
# of tracks      int32
[track]          float32 array
[name]           null-terminated string (only for "bedWithName")

A bed tile allows for intervals with arbitrary start and end offsets. Tiles with type "bedWithName" can also label each interval with a string.
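
A sketch of reading a complete bed tile, including the leading tile type string. It assumes already-decompressed tile bytes, little-endian numbers, UTF-8 strings, one float32 per interval in each track array, and one name per interval for "bedWithName". parse_bed_tile is a hypothetical helper name.

```python
import struct


def read_cstring(buf, pos):
    """Return (string, next position) for a null-terminated string at pos."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1  # UTF-8 is an assumption


def parse_bed_tile(buf, pos=0):
    """Parse a "bed" or "bedWithName" tile starting at its type string."""
    tile_type, pos = read_cstring(buf, pos)
    if tile_type not in ("bed", "bedWithName"):
        raise ValueError(f"not a bed tile: {tile_type!r}")
    # Little-endian is an assumption; byte order is not stated.
    (n,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    starts = list(struct.unpack_from(f"<{n}i", buf, pos))
    pos += 4 * n
    ends = list(struct.unpack_from(f"<{n}i", buf, pos))
    pos += 4 * n
    (n_tracks,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    tracks = []
    for _ in range(n_tracks):
        # Assumption: every track stores exactly n float32 values.
        tracks.append(list(struct.unpack_from(f"<{n}f", buf, pos)))
        pos += 4 * n
    names = None
    if tile_type == "bedWithName":
        # Assumption: exactly one name per interval.
        names = []
        for _ in range(n):
            name, pos = read_cstring(buf, pos)
            names.append(name)
    return {"type": tile_type, "starts": starts, "ends": ends,
            "tracks": tracks, "names": names}
```

For plain "bed" tiles the names entry stays None, since the name strings are only present for "bedWithName".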

Queries

TODO
