Skip to content

Instantly share code, notes, and snippets.

@dalexander
Last active December 22, 2015 01:09
Show Gist options
  • Save dalexander/6394707 to your computer and use it in GitHub Desktop.
Save dalexander/6394707 to your computer and use it in GitHub Desktop.
FastaTable

pbcore in 2.2: new class FastaTable

In pbcore for 2.2 I'm introducing a new class, FastaTable, which gives easy random access FASTA reading. It requires a FASTA index (.fai) file sitting next to the FASTA on the filesystem, and it requires a constant wrapping length in the FASTA (note that these requirements are already fulfilled by all PacBio reference repository FASTAs).

Internally the class works by mmap'ing the file contents into virtual address space and doing a little math based on the FASTA index to fulfill sequence slice requests. Memory is only allocated when sequence is actually requested.

Using FastaTable is very easy, and it confers at least one major advantage: you do not need to actually load the FASTA contigs into memory, which saves both memory usage and startup time for many applications (especially for scenarios like human amplicons where we only map to a fraction of a very large genome).

Using FastaTable: examples

>>> from pbcore.io import FastaTable

Loading and basic scanning of contigs is instantaneous, even for human-size references---FASTA contents are not actually loaded:

>>> ft = FastaTable("~/Data/fluidigm_amplicons/MET_EGFR_Full_Genes.fasta")
>>> ft[:]
[<FastaTableRecord: EGFR>, <FastaTableRecord: MET>]

Some basic operations:

>>> met = ft[1]                         # Random access indexing by position in file...
>>> met = ft["MET"]                     # ... or by name!

>>> met.sequence                        # `sequence` is not a string, but is sliceable to strings;
<pbcore.io.FastaTable.MmappedFastaContig at 0x106e80050>

>>> len(met.sequence)                   # even getting the length of contig sequence doesn't require loading
150000                                  # anything from the FASTA file

>>> met.sequence[0:20]                  #  use [:] to get full contig as string if desired
'TTAATGGAATTGTTGGGTTT'

Comparison with FastaReader

  1. As with FastaReader, you can iterate through the contigs. But with FastaTable, the iteration can be done repeatedly---it is not a single-shot iterator.

  2. Superficially, the API is the same. The primary difference is that when you say contig.sequence you don't get a Python string; rather, you get a string proxy object that grabs the appropriate string from the file when you slice it. This difference is likely to be invisible to most users. If you do actually need to access the entire contig as a string, just do:

>>> contig.sequence[:]
  1. The FastaTableContig class does not have an MD5 field. This probably doesn't affect anyone other than me.

On those .fai files

.fai files are tiny files containing the file offset locations of the contigs.

Note that our reference uploader already ensures constant line lengths and puts .fai files in the right place, so everything will work transparently if you construct a FastaTable for a FASTA file living in the ref repo.

If you have your own FASTA file, you can make a .fai by

% samtools faidx myfile.fasta
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment