Last updated: Feb. 8, 2017
The CRAM format encodes genomic alignments relative to a reference.
It offers better lossless (and optional lossy) compression than the existing BAM format, and it is
becoming more widely used. We propose the addition of a CRAM Reader to biogo/hts under hts/cram.
As more projects move to whole-genome sequencing across large cohorts, the disk space needed to store alignments becomes a concern. CRAM addresses this with a much smaller footprint than BAM, achieved largely by encoding differences from a reference rather than storing the full sequence, and by using integer compression schemes.
The functionality to be implemented will be driven by the specification (https://samtools.github.io/hts-specs/CRAMv3.pdf), but limited to the types observed in the wild and implemented and used in htslib.
The hts/cram Reader should match the hts/bam API as closely as is reasonable.
import "biogo/hts/cram"
var cr *cram.CRAM
var err error
cr, err = cram.NewReader(rdr, rd, reference) // rdr io.Reader, rd int, reference *sam.Reference
cr.Omit(bam.AllVariableLengthData)
var hdr *cram.Header = cr.Header()
var rec *sam.Record
rec, err = cr.Read()
The difference from the hts/bam API is the requirement of the reference argument to the constructor.
Note that extracting the sequence is costly, especially in CRAM. While the Omit method in hts/bam
provides a global level of control over whether this cost is incurred, we may wish to add a finer-grained
Record.Sequence(r *sam.Reference) method so that the user has full control over exactly when
to incur this cost.
If the sam.Reference must be passed to the Sequence() method, then the cram.NewReader function would not
need that value.
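The per-record design can be sketched with stand-in types. In the minimal sketch below, `record`, `reference`, and `substitution` are hypothetical placeholders, not the hts/sam or hts/cram types, and only base substitutions are modelled (no indels or clipping). It illustrates one possible behaviour: the record keeps only its differences from the reference and pays the reconstruction cost on the first call to `Sequence`, caching the result afterwards.

```go
package main

// reference is a hypothetical stand-in holding the bases of one reference sequence.
type reference struct {
	seq []byte
}

// substitution is one base in the read that differs from the reference.
type substitution struct {
	offset int  // 0-based offset within the read
	base   byte // base observed in the read
}

// record is a hypothetical stand-in for a decoded CRAM record.
type record struct {
	pos    int            // 0-based start on the reference
	length int            // read length
	subs   []substitution // differences from the reference
	seq    []byte         // cached sequence, nil until first requested
}

// Sequence reconstructs the read bases from ref on first use and caches them.
// It returns nil if no reference is given or the record does not fit within it.
func (r *record) Sequence(ref *reference) []byte {
	if r.seq != nil {
		return r.seq
	}
	if ref == nil || r.pos < 0 || r.length < 0 || r.pos+r.length > len(ref.seq) {
		return nil
	}
	seq := make([]byte, r.length)
	copy(seq, ref.seq[r.pos:r.pos+r.length])
	// Apply the stored differences on top of the copied reference bases.
	for _, s := range r.subs {
		if s.offset >= 0 && s.offset < r.length {
			seq[s.offset] = s.base
		}
	}
	r.seq = seq
	return seq
}
```

With this shape, a caller that never asks for the sequence never touches the reference, which is the finer level of control described above.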
We will need to understand how this would work with the regional query method in hts/bam
that gets a []bgzf.Chunk to be sent to a bam.Iterator.
The cram.Header will extend sam.Header by embedding it, like:
type Header struct {
*sam.Header
...
}
TODO: what else is needed in the cram Header?
The CRAM specification lists a number of codecs. However, we will limit ourselves to those that are used in htslib, namely:
- gzip
- bzip2
- lzma
- rANS
- huffman in single-code-only mode (?)
Because at least gzip and bzip2 are already available in the Go standard library as io.Readers, all of these
should be implemented as io.Readers. This would be one of the first things to be implemented, as the
API and the need are clear.
itf8 is a central data type in CRAM. For these, we will define:
type itf8 []uint8
We can follow the implementation here
with signatures like:
func (i itf8) get32() (read int, val int32){...}
func (i itf8) get64() (read int, val int64){...}
where read indicates how far to advance a pointer in the []byte.
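A minimal sketch of get32 is given here, assuming the ITF8 layout described in the CRAM v3 specification: the number of leading 1 bits in the first byte gives the number of additional bytes, and in the 5-byte form only the low 4 bits of the last byte are used. Returning read == 0 for truncated input is an assumption of this sketch, not something the proposal specifies.

```go
package main

// itf8 is a byte slice positioned at the start of an ITF8-encoded integer.
type itf8 []uint8

// get32 decodes one ITF8 value. read is the number of bytes consumed, or 0
// if the slice is too short to hold the complete value.
func (i itf8) get32() (read int, val int32) {
	if len(i) == 0 {
		return 0, 0
	}
	b0 := i[0]
	// The leading bits of the first byte determine the encoded length.
	switch {
	case b0 < 0x80: // 0xxxxxxx
		read = 1
	case b0 < 0xc0: // 10xxxxxx
		read = 2
	case b0 < 0xe0: // 110xxxxx
		read = 3
	case b0 < 0xf0: // 1110xxxx
		read = 4
	default: // 1111xxxx
		read = 5
	}
	if len(i) < read {
		return 0, 0
	}
	var v uint32
	switch read {
	case 1:
		v = uint32(b0)
	case 2:
		v = uint32(b0&0x3f)<<8 | uint32(i[1])
	case 3:
		v = uint32(b0&0x1f)<<16 | uint32(i[1])<<8 | uint32(i[2])
	case 4:
		v = uint32(b0&0x0f)<<24 | uint32(i[1])<<16 | uint32(i[2])<<8 | uint32(i[3])
	case 5:
		// Only the low 4 bits of the fifth byte carry data.
		v = uint32(b0&0x0f)<<28 | uint32(i[1])<<20 | uint32(i[2])<<12 |
			uint32(i[3])<<4 | uint32(i[4]&0x0f)
	}
	return read, int32(v)
}
```

A caller would advance its offset by read after each call and treat read == 0 as a signal that more input is needed or the data is corrupt.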