In pbcore for 2.2 I'm introducing a new class, FastaTable, which gives easy random access FASTA reading. It requires a FASTA index (.fai) file sitting next to the FASTA on the filesystem, and it requires a constant wrapping length in the FASTA (note that these requirements are already fulfilled by all PacBio reference repository FASTAs).
Internally the class works by mmap'ing the file contents into virtual address space and doing a little math based on the FASTA index to fulfill sequence slice requests. Memory is only allocated when sequence is actually requested.
Using FastaTable is very easy, and it confers at least one major advantage: you do not need to actually load the FASTA contigs into memory, which saves both memory usage and startup time for many applications (especially for scenarios like human amplicons where we only map to a fraction of a very large genome).
>>> from pbcore.io import FastaTable
Loading and basic scanning of contigs is instantaneous, even for human-size references---FASTA contents are not actually loaded:
>>> ft = FastaTable("~/Data/fluidigm_amplicons/MET_EGFR_Full_Genes.fasta")
>>> ft[:]
[<FastaTableRecord: EGFR>, <FastaTableRecord: MET>]
Some basic operations:
>>> met = ft[1] # Random access indexing by position in file...
>>> met = ft["MET"] # ... or by name!
>>> met.sequence # `sequence` is not a string, but is sliceable to strings;
<pbcore.io.FastaTable.MmappedFastaContig at 0x106e80050>
>>> len(met.sequence) # even getting the length of contig sequence doesn't require loading
150000 # anything from the FASTA file
>>> met.sequence[0:20] # use [:] to get full contig as string if desired
'TTAATGGAATTGTTGGGTTT'
-
As with FastaReader, you can iterate through the contigs. But with FastaTable, the iteration can be done repeatedly---it is not a single-shot iterator.
-
Superficially, the API is the same. The primary difference is that when you say
contig.sequence
you don't get a Python string; rather, you get a string proxy object that grabs the appropriate string from the file when you slice it. This difference is likely to be invisible to most users. If you do actually need to access the entire contig as a string, just do:
>>> contig.sequence[:]
- The
FastaTableContig
class does not have an MD5 field. This probably doesn't affect anyone other than me.
.fai files are tiny files containing the file offset locations of the contigs.
Note that our reference uploader already ensures constant line lengths and puts .fai files in the right place, so everything will work transparently if you construct a FastaTable for a FASTA file living in the ref repo.
If you have your own FASTA file, you can make a .fai by
% samtools faidx myfile.fasta