Skip to content

Instantly share code, notes, and snippets.

@dalexander
Last active December 30, 2015 20:09
Show Gist options
  • Save dalexander/7879314 to your computer and use it in GitHub Desktop.
Save dalexander/7879314 to your computer and use it in GitHub Desktop.
Alignment file format proposal

Custom Alignnment File Format Proposal

Here's a starting point for a file format, we can discuss/negotiate.

Let's use the extension .aln for a text file like this:

## Custom Alignment Format v0.1
## nAlns: 1500
## contigBounds: [0, 9999]
## baseCallProvider: /mnt/secondary/Smrtpipe/martin/prod/static/analysisJob/202/202093/data/filtered_subreads.fasta
## pulseFeatureProvider: /mnt/secondary/Smrtpipe/martin/prod/static/analysisJob/202/202093/input.fofn
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54499/0_4551@[3000,7000]
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54499/4598_8997@[7005,3004]
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54500/219_6202@[0, 6000]
...

Note:

  • The fofn file referenced is a "file-of-filenames" file, listing the paths of the bas/bax files used in the relevant runs. These will be used to look up the additional pulse features (QVs) measured from the instrument run that Quiver needs.
  • Intervals [tStart, tEnd] are 0-based and end-inclusive. Thus, intervals [0,4] and [4,0] both include target base positions {0,1,2,3,4}, but the latter refers to the reverse strand.
  • Intervals with tStart > tEnd are understood as being aligned to the reverse strand of the genome.
    For example, if the read bases <x1, x2, x3> are aligned to [0, 5] while the read bases <y1, y2, y3> are aligned to [5, 0], the genomic "pileup" picture will look like this
         0  1  2  3  4  5
         [x1,   x2,   x3]
         [y3*, y2*,  y1*]

where * denotes complementation.

Running quiver using the custom file format

$ quiver Chr1.aln -o Chr1-consensus.fa -o Chr1-consensus.fq

Note that no reference FASTA is required.

Regions of the genome with no coverage or where consensus cannot be called will be no-called with consensusBase, consensusConfidence = ('N'/0)

Caveats

Running quiver in this way will be a bit slower (perhaps 2x) than running from a genome-order-sorted cmp.h5 due to the added I/O overhead (random seeks) in the bas files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment