Here's a starting point for a file format, we can discuss/negotiate.
Let's use the extension .aln
for a text file like this:
## Custom Alignment Format v0.1
## nAlns: 1500
## contigBounds: [0, 9999]
## baseCallProvider: /mnt/secondary/Smrtpipe/martin/prod/static/analysisJob/202/202093/data/filtered_subreads.fasta
## pulseFeatureProvider: /mnt/secondary/Smrtpipe/martin/prod/static/analysisJob/202/202093/input.fofn
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54499/0_4551@[3000,7000]
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54499/4598_8997@[7005,3004]
m131123_175609_42129_c100606812550000001823110906171463_s1_p0/54500/219_6202@[0, 6000]
...
Note:
- The
fofn
file referenced is a "file-of-filenames" file, listing the paths of the bas/bax files used in the relevant runs. These will be used to look up the additional pulse features (QVs) measured from the instrument run that Quiver needs. - Intervals
[tStart, tEnd]
are 0-based and end-inclusive. Thus, intervals[0,4]
and[4,0]
both include target base positions{0,1,2,3,4}
, but the latter refers to the reverse strand. - Intervals with
tStart > tEnd
are understood as being aligned to the reverse strand of the genome.
For example, if the read bases<x1, x2, x3>
are aligned to[0, 5]
while the read bases<y1, y2, y3>
are aligned to[5, 0]
, the genomic "pileup" picture will look like this
0 1 2 3 4 5
[x1, x2, x3]
[y3*, y2*, y1*]
where *
denotes complementation.
$ quiver Chr1.aln -o Chr1-consensus.fa -o Chr1-consensus.fq
Note that no reference FASTA is required.
Regions of the genome with no coverage or where consensus cannot be called will be no-called with consensusBase, consensusConfidence = ('N'/0)
Running quiver
in this way will be a bit slower (perhaps 2x) than running from a genome-order-sorted cmp.h5
due to the added I/O overhead (random seeks) in the bas files.