Skip to content

Instantly share code, notes, and snippets.

@superbobry
Last active April 23, 2023 05:22
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save superbobry/c67614cbfe2a15d35d5c to your computer and use it in GitHub Desktop.
Save superbobry/c67614cbfe2a15d35d5c to your computer and use it in GitHub Desktop.
TDF format spec

TDF

TDF is a binary format developed by the IGV team at Broad Institute.

Overview

Concepts

Master index

Master index contains information about available datasets and groups.

Dataset

High-level named data container.

Tile

Low-level data container. Each tile holds interval data for a specific genome region.

Group

Key-value metadata container.

Window function

See IGV sources for possible values.

Track name

A human-readable track name.

Track type

See IGV soures for possible values.

Track line

UCSC browser track line. See UCSC documentation.

Zoom level

TODO

Byte-to-byte

Header

TDF header consists of fixed-size 24 byte component and variable size metadata.

Fixed-size component

Field Type
magic int32
version int32
master index offset int64
master index size int32
header size int32

The first three bytes of the file (aka "magic" bytes) can be either "TDF" or "IBD" followed by a single-digit format version. The latest format version is 4. Unforunately between-version changes were not documented.

header size refers to the number of bytes in the following variable-size component.

Variable-size component

Field Type
# of window functions int32
[window function name] null-terminated string (enum)
track type null-terminated string (enum)
track line null-terminated string
# of track names int32
[track name] null-terminated string
build null-terminated string
flags int32

Hereinafter [] mean that the field can be repeated multiple times. The exact number of occurences is given in the preceeding # field.

As of version 4 flags can only carry 0 (uncompressed) or 0x1 (gzip-compressed).

Master index

Field Type
# of datasets int32
[dataset name null-terminated string

offset

int64

size in bytes]

int32
# of groups int32
[group name null-terminated string

offset

int64

size in bytes]

int32

It's perfectly valid for the master index to have zero datasets and groups, thus the repeated fields ([] notation) can be empty.

Dataset

Field Type
# of attributes int32
[key null-terminated string

value]

null-terminated string
data type null-terminated string
tile width float32 (!)
# of tiles int32
[tile offset int64

size in bytes]

int32

In theory dataset is abstract wrt to the data type stored in the tiles, but IGV implementation seems to always use floats.

Group

Field Type
# of attributes int32
[key null-terminated string

value]

null-terminated string

Tile

A tile starts with a null-terminated string --- tile type. IGV implements four types of tiles: "fixedStep", "variableStep", "bed" and "bedWithNames".

"fixedStep"

Field Type
# of intervals int32
track start int32
span int32
# of tracks int32 (missing in IGV)
[track] float32 array

Fixed step tile in TDF is conceptually similar to that of the WIG format. It describes non-overlapping fixed-with intervals. For example, a fixed step tile of size 3 with span equal to 5 might look like:

-2.   4.8    0
----- track 1

1.3 3 -1

----- track 2

start

"variableStep"

Field Type
track start int32 (unused in IGV)
span float32 (!)
# of intervals int32
[start] int32
# of tracks int32
[track] float32 array

Variable step tile also resembles a similarly named concept from the WIG format. As the name suggests it allows the intervals to have arbitrary start offsets. The end offsets remain fixed by the span value.

Here's an example:

0123456789012

  -2.
|-----|       track 1
  1.3
|-----|       track 2
     4.8
   |-----|    track 1
      3
   |-----|    track 2
        0
     |-----|  track 1
       -1
     |-----|  track 2

The above example has span equal to 5 and starts equal to [0, 3, 5].

"bed" and "bedWithName"

Field Type
# of intervals int32
[start] int32 array
[end] int32 array
# of tracks int32
[track] float32 array
[name] null-terminated string (only for "bedWithName")

Bed tile allows for intervals with arbitrary start and end offsets. Tiles with type "bedWithName" can also label each interval with an string.

Queries

TODO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment