Skip to content

Instantly share code, notes, and snippets.

@brentp
Last active February 8, 2017 17:22
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save brentp/57850c10342c21e9b60fa2f83999464f to your computer and use it in GitHub Desktop.
Save brentp/57850c10342c21e9b60fa2f83999464f to your computer and use it in GitHub Desktop.
biogo/hts cram proposal

Proposal: CRAM (v3) format support in GO

Last updated: Feb. 8, 2017

Abstract

CRAM format encodes genomic alignments to a reference. It has better lossless (and optional lossy) compression compared to existing BAM format. It is also becoming more widely used. We propose the addition of a CRAM Reader to biogo/hts under hts/cram.

Background

As more projects move to whole-genome sequencing across large cohorts, drive space to store alignments becomes a concern. This is the reason for CRAM which has a much smaller footprint than BAM. This is done largely through encoding difference to a reference rather than saving the full sequence and by integer compression schemes.

The functionality to be implemented will be driven by the specification: https://samtools.github.io/hts-specs/CRAMv3.pdf but limited to the types observed in the wild and implemented and used in htslib

High-Level API

The hts/cram Reader should match the hts/bam API as closely as is reasonable.

import "biogo/hts/cram"

var cr *cram.CRAM 
var err error

cr, err = cram.NewReader(rdr, io.Reader, rd int, reference *sam.Reference)
cr.Omit(bam.AllVariableLengthData)

var hdr *cram.Header = cr.Header()

var rec *sam.Record
rec, err = cr.Read()

The difference from the hts/bam API is the requirement of the reference argument to the constructor. Note that extracting the sequence is costly, especially in CRAM. While the Omit method in hts/bam provides a global level of control over if this cost is incurred, we may wish to add a finer level Record.Sequence(r *sam.Reference) method so that the user has full control over exactly when to incur this cost.

If the sam.Reference must be passed to the Sequence() method, then the cram.NewReader function would not need that value.

We will need to understand how this would work with the regional query method in hts/bam that gets a slice of []bgzf.Chunk to be sent to a bam.Iterator.

the cram.Header will extend sam.Header like:

type Header struct {
    *sam.Header
    ...
}

TODO: what else is needed in the cram Header.

Internals

Codecs

The CRAM specification lists a number of codecs. However, we will limit to those that are used in htslib.github Namely, those are :

  • gzip
  • bzip2
  • lzma
  • rANS
  • huffman in single-code-only mode (?)

because at least gzip and bzip2 are already available in the go standard library as io.Readers all of these should be implemented as io.Reader. This would be one of the first things to be implemented as the API and the need are clear.

itf8

itf8 is a central data type in CRAM. For these, we will define:

type itf8 []uint8

we can follow the implementation here

with a signature like:

func (i itf8) get32() (read int, val int32){...}
func (i itf8) get64() (read int, val int64){...}

where read indicates how far to advance a pointer in the []byte.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment