dalexander/FastaTable.md

## FastaTable.md

      
    Raw
  

              FastaTable.md
            
          
    pbcore in 2.2: new class FastaTable

In pbcore for 2.2 I'm introducing a new class, FastaTable, which gives
easy random access FASTA reading.  It requires a FASTA index (.fai)
file sitting next to the FASTA on the filesystem, and it requires a
constant wrapping length in the FASTA (note that these requirements
are already fulfilled by all PacBio reference repository FASTAs).
Internally the class works by mmap'ing the file contents into virtual
address space and doing a little math based on the FASTA index to
fulfill sequence slice requests.  Memory is only allocated when
sequence is actually requested.
Using FastaTable is very easy, and it confers at least one major
advantage: you do not need to actually load the FASTA contigs into
memory, which saves both memory usage and startup time for many
applications (especially for scenarios like human amplicons where we
only map to a fraction of a very large genome).
Using FastaTable: examples

>>> from pbcore.io import FastaTable

Loading and basic scanning of contigs is instantaneous, even for
human-size references---FASTA contents are not actually loaded:
>>> ft = FastaTable("~/Data/fluidigm_amplicons/MET_EGFR_Full_Genes.fasta")
>>> ft[:]
[<FastaTableRecord: EGFR>, <FastaTableRecord: MET>]

Some basic operations:
>>> met = ft[1]                         # Random access indexing by position in file...
>>> met = ft["MET"]                     # ... or by name!

>>> met.sequence                        # `sequence` is not a string, but is sliceable to strings;
<pbcore.io.FastaTable.MmappedFastaContig at 0x106e80050>

>>> len(met.sequence)                   # even getting the length of contig sequence doesn't require loading
150000                                  # anything from the FASTA file

>>> met.sequence[0:20]                  #  use [:] to get full contig as string if desired
'TTAATGGAATTGTTGGGTTT'

Comparison with FastaReader


As with FastaReader, you can iterate through the contigs.  But
with FastaTable, the iteration can be done repeatedly---it is not
a single-shot iterator.


Superficially, the API is the same.  The primary difference is
that when you say contig.sequence you don't get a Python
string; rather, you get a string proxy object that grabs the
appropriate string from the file when you slice it.  This
difference is likely to be invisible to most users.
If you do actually need to access the entire contig as a string,
just do:


>>> contig.sequence[:]


The FastaTableContig class does not have an MD5 field.  This
probably doesn't affect anyone other than me.

On those .fai files

.fai files are tiny files containing the file offset locations of the
contigs.
Note that our reference uploader already ensures constant line lengths
and puts .fai files in the right place, so everything will work
transparently if you construct a FastaTable for a FASTA file living in
the ref repo.
If you have your own FASTA file, you can make a .fai by
% samtools faidx myfile.fasta