dalexander/AlignmentFormat.md

## AlignmentFormat.md

      
    Raw
  

              AlignmentFormat.md
            
          
    Custom Alignnment File Format Proposal

Here's a starting point for a file format, we can discuss/negotiate.
Let's use the extension .aln for a text file like this:
## Custom Alignment Format v0.1
## nAlns: 1500
## contigBounds: [0, 9999]
## baseCallProvider: /mnt/secondary/Smrtpipe/martin/prod/static/analysisJob/202/202093/data/filtered_subreads.fasta
## pulseFeatureProvider: /mnt/secondary/Smrtpipe/martin/prod/static/analysisJob/202/202093/input.fofn
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54499/0_4551@[3000,7000]
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54499/4598_8997@[7005,3004]
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54500/219_6202@[0, 6000]
...
Note:

The fofn file referenced is a "file-of-filenames" file, listing the paths of the bas/bax files
used in the relevant runs.  These will be used to look up the additional pulse features (QVs)
measured from the instrument run that Quiver needs.
Intervals [tStart, tEnd] are 0-based and end-inclusive.  Thus, intervals [0,4] and [4,0]
both include target base positions {0,1,2,3,4}, but the latter refers to the reverse strand.
Intervals with tStart > tEnd are understood as being aligned to the reverse strand of the genome.

For example, if the read bases <x1, x2, x3> are aligned to [0, 5] while the read bases
<y1, y2, y3> are aligned to [5, 0], the genomic "pileup" picture will look like this

         0  1  2  3  4  5
         [x1,   x2,   x3]
         [y3*, y2*,  y1*]
where * denotes complementation.
Running quiver using the custom file format

$ quiver Chr1.aln -o Chr1-consensus.fa -o Chr1-consensus.fq
Note that no reference FASTA is required.
Regions of the genome with no coverage or where consensus cannot be called will be no-called with consensusBase, consensusConfidence = ('N'/0)
Caveats

Running quiver in this way will be a bit slower (perhaps 2x) than running from a genome-order-sorted cmp.h5 due to the added I/O overhead (random seeks) in the bas files.