Skip to content

Instantly share code, notes, and snippets.

@moonwatcher
Last active Apr 29, 2021
Embed
What would you like to do?
Pheniqs BADAS talk

Pheniqs 2.1.0 BADAS talk

NYU CGSB NY / April 29 2021 11:00

What is Pheniqs and why do you need it?

Pheniqs is a fast and accurate high-throughput sequencing barcode classifier.

  • It can convert sequence reads to and from FASTQ, SAM, BAM and CRAM formats, decode barcodes and manipulate read layout in a single process.
  • It is very accurate, much more than traditional minimum distance based methods that simply compare the number of mismatches between two sequences. More specifically is has lower False Discovery Rate and at the same time a much lower Miss Rate. In simple terms it misclassifies less reads, introduces less noise and at the same time recovers more reads.

transform patterns This plot compares false discovery rates and miss rates between Pheniqs and traditional minimum distance decoding. The X axis shows increasing average error rates. In the error range relevant to most Illumina runs, which is between 1 in 10000 and 1 in 1000, Pheniqs has lower false discovery rates than MDD while still maintaining a much lower Miss Rate.

  • It is very fast. A benchmark of decoding 94 standard Illumina dual indexed barcodes in more than 11.5 billion reads on a CPU with 14 cores completed in just over 3 hours. deML for comparison, another statistical classifier that is still much faster than python implementations, took almost 80 hours to complete the same task.

  • It reports a decoding error probability for every barcode so you can take those metrics into account in your downstream analysis.

  • It is actively developed and can be extended with new decoding algorithms to match new experimental designs. This is an open invitation for everyone to help us make it better by telling us what features you are missing.

  • It can handle complex barcode designs with a flexible syntax that can address any number of barcodes anywhere on the read. That means no need for pre or post processing.

transform patterns In this diagram you can see how Pheniqs uses a syntax similar to python array slicing to address sample, cellular and molecular barcodes on a read made out of 2 biological segments and two index segments and produde a SAM output.

Installing with conda

Pheniqs can be installed in many ways. You can install it with conda, you can use the pheniqs-build-api to build your own portable binary or you can build it from source. You can have a look here for details.

But the good people at the hpc have created a module for us on greene so we will use this for this demo because it takes exactly two seconds to use it :)

module load pheniqs/2.1.0

Example files

In this tutorial we will be using a small toy example with 2500 reads. The reads are pair-ended single indexed and were produced by a MiSeq instrument: Two biological segments that are 51 nucleotides long and a single index that is 8 nucleotides long. You are probably not used to seeing Illumina output this way since samples are often demultiplexed by the core facility but the raw output from the sequencer includes each segment in a gzip compressed fastq file with full read quality data.

You can get the three gzip compressed FASTQ files here: BDGGG_s01.fastq.gz, BDGGG_s02.fastq.gz, BDGGG_s03.fastq.gz.

Format conversions and interleaving

  • Both input and output can be FASTQ or SAM/BAM/CRAM in either split or interleaved lacing.
  • Input format is automatically detected from file content, not from the extension, so files without an extension or even with a wrong extension are fine (though still not good practice).
  • Output format defaults to uncompressed SAM on stdout.
  • Output format is determined by the output file extension: fastq, fastq.gz (gzip compressed FASTQ), sam, bam or cram.
  • When output is -, or more explicitly /dev/stdout, you must specify the output format with a query parameter: -?format=fastq, -?format=fastq&compression=gz (gzip compressed FASTQ), -?format=sam (not usually necessery since this is the default), -?format=bam, -?format=cram.
  • You can also control compression parameters in a similar fashion, see the manual for details.

Lets see some basic commands with Pheniqs..

  • View a gzip compressed FASTQ file as SAM on stdout

pheniqs mux --input BDGGG_s01.fastq.gz|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --input BDGGG_s01.fastq.gz       VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG

Currently Pheniqs has only one sub command called mux but there will be more soon. The only argument we use here is -i/--input. Most command line arguments have both a short form and a long form. The short form is easier to use when typing interactively but the long form is better when documenting things because it makes the command more easy to read. Short form arguments can be aggregated, for instance -s -i input.fastq is the same as -si input.fastq. This is all just standard POSIX stuff which Pheniqs tries to adhere to as much as possible.

  • View a gzip compressed FASTQ file as FASTQ on stdout. Here you can see what you can expect to find in those 3 files.

pheniqs mux --input BDGGG_s01.fastq.gz --output "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG

pheniqs mux --input BDGGG_s02.fastq.gz --output "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<

pheniqs mux --input BDGGG_s01.fastq.gz --output "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
  • Interleave the 3 gzip compressed FASTQ files into one interleaved uncompressed FASTQ file on stdout

pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output "/dev/stdout?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

-i/--input can show up multiple times on the command line. Notice that the order matters...

pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s03.fastq.gz -i BDGGG_s02.fastq.gz --output "/dev/stdout?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GGACTCCT
+
B@CCCFC<
  • Interleave the 3 gzip compressed FASTQ files into one interleaved SAM output on stdout

pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output "/dev/stdout?format=sam"|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --input BDGGG_s01.fastq.gz --input BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output /dev/stdout?format=sam      VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:3  TC:i:3
  • Ok, now lets interleave the 3 gzip compressed FASTQ files into one interleaved CRAM formatted file

pheniqs mux -i BDGGG_s01.fastq.gz -i BDGGG_s02.fastq.gz -i BDGGG_s03.fastq.gz --output BDGGG.cram

When done, Pheniqs will give you a JSON formatted report on stderr. The report will provide statistics on both input and output as well as separate statistics for both overall reads and reads passing quality control (Pass Filter or PF reads). The documentation on the website describes what each variable means. You can also redirect the report to a file with -R/--report. Since no sample classification was requested all reads were directed to the "unclassified" read group.

{
    "incoming": {
        "count": 2500,
        "pf count": 2430,
        "pf fraction": 0.972
    },
    "sample": {
        "classified count": 0,
        "classified fraction": 0.0,
        "classified pf fraction": 0.0,
        "count": 2500,
        "index": 0,
        "pf classified count": 0,
        "pf classified fraction": 0.0,
        "pf count": 2430,
        "pf fraction": 0.972,
        "unclassified": {
            "ID": "undetermined",
            "PU": "undetermined",
            "count": 2500,
            "index": 0,
            "pf count": 2430,
            "pf fraction": 0.972,
            "pf pooled fraction": 1.0,
            "pooled fraction": 1.0
        }
    }
}

To inspect your CRAM file you can, again, use Pheniqs. Essentially this is a format conversion from CRAM to SAM... CRAM is the state-of-the-art compressed version of BAM. It is basically a super compressed SAM file. If you want to archive your data this can save you a LOT of space.

pheniqs mux -i BDGGG.cram|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux -i BDGGG.cram    VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<
M02455:162:000000000-BDGGG:1:1101:10000:10630   76      *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

Or in an interleaved FASTQ

pheniqs mux -i BDGGG.cram -o "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

Oh wait! This is probably not what you expected?! Pheniqs thinks every one of the records in your file is a new read, when in fact every three records are 3 segments of the same read. To give you absolute control, Pheniqs by default, expects you to explicitly describe the interleaving nature of the input file. You can use the -s/--sense-input flag so ask Pheniqs to try and guess the interleaving nature of the file.

pheniqs mux -si BDGGG.cram|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux -si BDGGG.cram   VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:3
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:3  TC:i:3

Om this makes more sense... Here it is in interleaved FASTQ

pheniqs mux -si BDGGG.cram -o "-?format=fastq"|less

@M02455:162:000000000-BDGGG:1:1101:10000:10630 1:N:0:
CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA
+
CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG
@M02455:162:000000000-BDGGG:1:1101:10000:10630 2:N:0:
GGACTCCT
+
B@CCCFC<
@M02455:162:000000000-BDGGG:1:1101:10000:10630 3:N:0:
GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA
+
CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG

Here are the sizes of the files in the different formats.

size type
959K fastq
177K fastq.gz
1.1M sam
177K bam
113K cram

You can see how valuable compression can be. bam and cram files not only provide better compression they can also be read and written faster. The differences can become more pronounced with bigger files. Since Pheniqs can convert between the formats in memory you can store data on disk in a cram file and use Pheniqs to feed FASTQ to tools that accept FASTQ from standard input.

Read manipulation

Pheniqs can also parse and manipulate the content of the read. It uses a syntax similar to python array slicing to tokenize every read segment and assemble new segments from the tokens. The syntax for manipulating the read structure is too complex to provide on a command line, you will need a JSON config file.

Let's start with a simple theoretical example. Suppose the first 8 nucleotides on the first segment are actually a cellular barcode and you want to separate them into a different segment. We create a configuration file that describes the layout:

{
  "template": {
    "transform": {
      "token": [ "0:8:", "1:0:", "0:0:8", "2:0:" ]
    }
  }
}

pheniqs mux --config manipulate.json -si BDGGG.cram |less

M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       TAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     FGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     FI:i:1  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       GGACTCCT        B@CCCFC<        FI:i:2  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   13      *       0       0       *       *       0       0       CTAAGAAA        CCCCCGGG        FI:i:3  TC:i:4
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     FI:i:4  TC:i:4

We used the cram file as input since its faster to read and its also easier to feed Pheniqs with an interleaved file.

Since Pheniqs follows the python array slicing syntax it uses a zero based coordinate system. That means counters start from zero and are inclusive of start coordinate and exclusive of end coordinate. It might sound complicated but it actually makes arithmatics very easy. for instance the first 3 nucleotides on the first read segment are "0:0:3".

A token is made of 3 colon separated integers. The first is the mandatory zero based input segment index. The second is an inclusive start coordinate and defaults to 0 if omitted. The third is an exclusive end coordinate. If the end coordinate is omitted the token spans to the end of the segment. start coordinate and end coordinate can take positive or negative values to access the segment from either the 5’ (left) or 3’ (right) end and mimic the python array slicing syntax. The two colons are always mandatory.

The template configuration element controls what the output read will look like. In this case we asked it to have 4 segments. The first output segment is "0:8:", or cycles 8 to end of input segment 0. Pheniqs can validate your config file and describe, in words, what the conversion from input to output will be. here is the description given by pheniqs mux --config manipulate.json -si BDGGG.cram --validate

Output transform

    Output segment cardinality                  4

    Token No.0
        Length        variable
        Pattern       0:8:
        Description   cycles 8 to end of input segment 0

    Token No.1
        Length        variable
        Pattern       1::
        Description   cycles 0 to end of input segment 1

    Token No.2
        Length        8
        Pattern       0::8
        Description   cycles 0 to 8 of input segment 0

    Token No.3
        Length        variable
        Pattern       2::
        Description   cycles 0 to end of input segment 2

    Assembly instruction
        Append token 0 of input segment 0 to output segment 0
        Append token 1 of input segment 1 to output segment 1
        Append token 2 of input segment 0 to output segment 2
        Append token 3 of input segment 2 to output segment 3

Classifying sample barcodes

Now assume that the 8 nucleotides on the second out of three segments in the input read is actually the sample barcode. That would be the standard Illumina protocol for a single indexed pair-end run. We want to classify the reads by that barcode. If sequencers never made mistakes and reads were perfect that would be a very easy task since you would just need to compare two strings with 8 characters. In reality sequencers make mistakes, between 1 in 1000 and 1 in 10000 on most Illumina sequencers.

Most tools that classify barcodes ignore the quality score the sequencer provides and simply look for either a perfect string match or at most 1 mismatch, if the barcodes in your set are different enough from each other to tolerate the mismatch. This method is called minimum distance decoding, abbreviated MDD, and Pheniqs can do that if you insist. You just set the decoder property "algorithm": "mdd".

But Pheniqs can do something much smarter, it can compute the posterior probability of the correct barcode being decoded given the list of possible barcodes (see the paper for all the mathematical nitty gritty) and make decisions based on that. It can even report that probability for every read in a standard SAM auxiliary field. Think of it as a quality score for decoding the barcode. Except its not just a quality score, its an actual probability! if you had two, independent, barcodes (for instance a sample and cellular barcode) the probability that the entire classification is correct is simply the product of the two! We call this a Phred Adjusted Maximum Likelihood Decoder, abbreviated PAMLD.

PAMLD has much less false negatives, a LOT less than MDD. You can expect it to recover a lot of the reads that MDD is missing. How many depends on how confident you want Pheniqs to be, which you specify in the confidence threshold parameter. But PAMLD does another cool thing, it filters noise. It can tell if a read is more likely to be noise than an actual read. For instance a PhiX read that you spiked into the mixture with your pooled libraries or some human DNA that found its way into your sample (how did that get here?!). Benchmarks on simulated data tell us that it is VERY good at picking up noise.

Let's look at a config file for decoding the sample barcode.

{
    "input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    },
    "sample": {
        "algorithm": "pamld",
        "confidence threshold": 0.99,
        "noise": 0.01,
        "transform": {
            "token": [ "1::8" ]
        },
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [ "AGGCAGAA" ],
                "concentration": 1
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [ "CGTACTAG" ],
                "concentration": 1
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [ "GGACTCCT" ],
                "concentration": 1
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [ "TAAGGCGA" ],
                "concentration": 1
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [ "TCCTGAGC" ],
                "concentration": 1
            }
        }
    }
}

Since we don't want the sample barcode to be part of the output read we skip that segment in the template transform, we only want the first and last segment in the output.

Next we declare a sample instruction in the config file. The transform in the sample instruction tells Pheniqs the barcode is on the first 8 nucleotides of the second input segment. The codec section is where you specify details on the barcodes you expect to find in your data. You naturally specify the barcode sequence, or multiple sequences if for instance you use dual indexing, but really you can have any number of segments in your barcodes, as long as their length and structure matches what you specified in the transform. Pheniqs has good error handling, if your config file has an error it will tell you exactly what went wrong.

For instance if in the first barcode you only specified 7 nucleotides:

    "@AGGCAGAA": {
        "LB": "trinidad 5",
        "barcode": [ "AGGCAGA" ],
        "concentration": 1
    },
Configuration error : sample decoder : expected 8 but found 7 nucleotides in segment 0 of barcode @AGGCAGAA

or if you copy/pasted your barcodes wrong and had the first one specified twice:

    "@AGGCAGAA": {
        "LB": "trinidad 5",
        "barcode": [ "CGTACTAG" ],
        "concentration": 1
    },
    "@CGTACTAG": {
        "LB": "trinidad 4",
        "barcode": [ "CGTACTAG" ],
        "concentration": 1
    },
Configuration error : sample decoder : duplicate barcode sequence CGTACTAG

We specified the input in the config file instead of the command line, so we don't have to keep typing it. We also specified how to parse the input explicitly, instead of using the --sense-input, by specifying the interleaved input file 3 times, telling Pheniqs to pull three records for every read, or that it should expect to find 3 consecutive segments for every read in the file. being able to explicitly specify how to parse the input allows you to deal with malformed files (lets face it, bioinformatics isn't perfect).

Notice we specified a 0.99 confidence threshold, so any barcode with a posterior decoding probability lower than that will be marked as failing quality control and moved to the unclassified (or undetermined) group. confidence threshold defaults to 0.95 but it represents a tradeoff between false positives and false negatives and the ideal value depends on your data and your barcode set, and frankly, what you want. We found 0.95 to be a fair tradeoff but if you prefer to loose more reads but be super confident about them you can set higher values and if you feel that your downstream analysis will be able to detect false positives and really want to recover as many as possible reads you can set lower values.

PAMLD's Bayesian model depends on a set of prior probabilities, basically how much noise (the noise variable) and how much of each sample (the concentration variable in each barcode) you expect to find in your data. Noise defaults to 0.01, which is 1%, which is the amount of PhiX textbook Illumina runs have spiked in. But if you are doing anything other than a vanilla Illumina protocol you probably have more. if you leave out the concentration they all default to 1, essentially forming a uniform distribution. Pheniqs will normalize what ever you specify in concentration so that they all sum up to 1 - noise. For instance if you specify 1 in all your barcodes and 0.01 in noise and you had 5 samples they will each be assumed to have a 0.99 / 5 = 0.198 prior. What we have in the above config file could have just beed left out since its the default.

If you have a good guess you can specify your priors in the config. We will see later that Pheniqs can give you a pretty good estimate for them.

First lets see how the output looks

pheniqs mux --config decode.json |less
@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined PU:undetermined
@RG     ID:AGGCAGAA     BC:AGGCAGAA     LB:trinidad 5   PU:AGGCAGAA
@RG     ID:CGTACTAG     BC:CGTACTAG     LB:trinidad 4   PU:CGTACTAG
@RG     ID:GGACTCCT     BC:GGACTCCT     LB:trinidad 9   PU:GGACTCCT
@RG     ID:TAAGGCGA     BC:TAAGGCGA     LB:trinidad 1   PU:TAAGGCGA
@RG     ID:TCCTGAGC     BC:TCCTGAGC     LB:trinidad 8   PU:TCCTGAGC
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode.json     VN:2.1.0
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.0616e-06
M02455:162:000000000-BDGGG:1:1101:10005:7907    589     *       0       0       *       *       0       0       CCGTTTGAATGTTGACGGGATGAACATAATAAGCAATGACGGCAGCAATAA     CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGG     RG:Z:undetermined       BC:Z:ATGTTTAT   QT:Z:@----6,,
M02455:162:000000000-BDGGG:1:1101:10005:7907    653     *       0       0       *       *       0       0       GAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCG     CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGG     RG:Z:undetermined       BC:Z:ATGTTTAT   QT:Z:@----6,,

Pheniqs has declared a read group for every sample barcode and one for the undetermined. The BC and QT tags hold the values for the uncorrected sample barcode and the quality scores for it. RG is a reference to the Read Group ID as declared in the header. Those are all standard tags, you will find them in the SAM specifications. XB is the special Pheniqs sauce, it is the probability that the sample barcode decoding was incorrect, the error probability. In the second read (remember, this is an interleaved file, every two records are a read. You can see they have the same read ID) decoding failed. BC and QT are there but RG points to undetermined and there is no XB. If you will also notice the flag field is different since the bit marking this read as failing QC is on.

You can add Read Group metadata to your config file and it will end up where you expect it to. You already saw an example of that with the LB tag. for instance we add some global ones so they are added to all Read Groups.

{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [ "BDGGG.cram", "BDGGG.cram", "BDGGG.cram" ],
    .
    .
    .
}

and we get a richer header

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00    PI:300  PL:ILLUMINA     PM:miseq        PU:undetermined SM:trinidad
@RG     ID:AGGCAGAA     BC:AGGCAGAA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 5   PI:300  PL:ILLUMINA     PM:miseq        PU:AGGCAGAA     SM:trinidad
@RG     ID:CGTACTAG     BC:CGTACTAG     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 4   PI:300  PL:ILLUMINA     PM:miseq        PU:CGTACTAG     SM:trinidad
@RG     ID:GGACTCCT     BC:GGACTCCT     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 9   PI:300  PL:ILLUMINA     PM:miseq        PU:GGACTCCT     SM:trinidad
@RG     ID:TAAGGCGA     BC:TAAGGCGA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 1   PI:300  PL:ILLUMINA     PM:miseq        PU:TAAGGCGA     SM:trinidad
@RG     ID:TCCTGAGC     BC:TCCTGAGC     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 8   PI:300  PL:ILLUMINA     PM:miseq        PU:TCCTGAGC     SM:trinidad
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode.json     VN:2.1.0

Estimating the priors

Suppose you have no idea what the priors are. You THINK you add the same amount of each pooled library but who know what actually happened. You also have no idea how much noise is there...

You make an initial Pheniqs run with your best guess, or with the defaults, and specify the --prior command line argument and provide a path to a new config file. Pheniqs will use the run to estimate the relative proportions of barcodes and noise in your data and write a new config file that contains the estimated priors. Since you don't really want the actual output reads from this run you can save some time by using another nifty Pheniqs feature: you specify the output to be /dev/null basically telling Pheniqs you don't want the output. That will save quite a bit of time since Pheniqs won't waste time formatting the output or writing it anywhere.

pheniqs mux --config decode.json --prior decode_with_prior.json --output /dev/null 

when its done running you will find a new config file at decode_with_prior.json.

{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [
        "BDGGG.cram",
        "BDGGG.cram",
        "BDGGG.cram"
    ],
    "output": [
        "/dev/null"
    ],
    "prior adjusted job url": "decode_with_prior.json?format=json",
    "sample": {
        "algorithm": "pamld",
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [
                    "AGGCAGAA"
                ],
                "concentration": 0.186575935178905
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [
                    "CGTACTAG"
                ],
                "concentration": 0.198520089969607
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [
                    "GGACTCCT"
                ],
                "concentration": 0.211699846980038
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [
                    "TAAGGCGA"
                ],
                "concentration": 0.219937195111557
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [
                    "TCCTGAGC"
                ],
                "concentration": 0.168041901882987
            }
        },
        "confidence threshold": 0.99,
        "noise": 0.015225030876904,
        "transform": {
            "token": [
                "1::8"
            ]
        }
    },
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    }
}

Notice the concentration and noise parameters have been adjusted to reflect what Pheniqs guessed from the initial run. The actual methodology for this prior estimate is described in the paper and in the manual. Notice the method is recursive since it uses your initial values as the base for the estimation. you can potentially run it a second time and get an even more accurate estimate but you also run the risk of over training your priors, but this is a different discussion... our benchmarks on synthetic data showed that a single estimation that started from a uniform distribution on data that was VERY non uniform still yielded a significant improvement in both precision and recall, and was just shy of using the real (and obviously unknown when your data is real and not simulated) priors.

one small minor detail is that since you executed Pheniqs with a null output your new config also specifies a null output. you can either edit the config or override it on the command line when you use this new config.

pheniqs mux --config decode_with_prior.json -o -|less

@HD     VN:1.0  SO:unknown      GO:query
@RG     ID:undetermined CN:CGSB DT:2018-02-25T07:00:00+00:00    PI:300  PL:ILLUMINA     PM:miseq        PU:undetermined SM:trinidad
@RG     ID:AGGCAGAA     BC:AGGCAGAA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 5   PI:300  PL:ILLUMINA     PM:miseq        PU:AGGCAGAA     SM:trinidad
@RG     ID:CGTACTAG     BC:CGTACTAG     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 4   PI:300  PL:ILLUMINA     PM:miseq        PU:CGTACTAG     SM:trinidad
@RG     ID:GGACTCCT     BC:GGACTCCT     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 9   PI:300  PL:ILLUMINA     PM:miseq        PU:GGACTCCT     SM:trinidad
@RG     ID:TAAGGCGA     BC:TAAGGCGA     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 1   PI:300  PL:ILLUMINA     PM:miseq        PU:TAAGGCGA     SM:trinidad
@RG     ID:TCCTGAGC     BC:TCCTGAGC     CN:CGSB DT:2018-02-25T07:00:00+00:00    LB:trinidad 8   PI:300  PL:ILLUMINA     PM:miseq        PU:TCCTGAGC     SM:trinidad
@PG     ID:pheniqs      PN:pheniqs      CL:pheniqs mux --config decode_with_prior.json -o -     VN:2.1.0-6-gb006bb7a4990e86e62e4f0085e405662e88daef6
M02455:162:000000000-BDGGG:1:1101:10000:10630   77      *       0       0       *       *       0       0       CTAAGAAATAGACCTAGCAGCTAAAAGAGGGTATCCTGAGCCTGTCTCTTA     CCCCCGGGFGGGAFDFGFGGFGFGFGGGGGGGDEFDFFGGFEFGCFEFGEG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.10298e-06
M02455:162:000000000-BDGGG:1:1101:10000:10630   141     *       0       0       *       *       0       0       GCTCAGGATACCCTCTTTTAGCTGCTAGGTCTATTTCTTAGCTGTCTCTTA     CCCCCGGGGGGGGGGGGGGGGGGGF<FGGGGGGGGGGGGFGFGGGGGGGGG     RG:Z:GGACTCCT   BC:Z:GGACTCCT   QT:Z:B@CCCFC<   XB:f:1.10298e-06

Splitting output

Interleaved files have many advantages: single input and single output allow you to use standard POSIX streams and avoid storing intermediate files. Its faster to read and write. It keeps experimental directory structures simpler and makes archives easier to browse. But sometimes you need to split your reads over multiple files, either by read segment or by barcode. Pheniqs can do that.

In the same fashion that the input directive allowed you to dictate multiple files to pull read segments from the output directive allows you to split the segments over multiple files.

Read segment Splitting

If your output read has multiple segments you can either specify just one output, in which case all segments will be interleaved into that one output. Or you can specify as many outputs as you have output segments and the segments will be written to the seperate files.

either on the command line

pheniqs mux --config decode.json --prior decode_with_prior.json --output first_segment.fastq.gz -o second_segment.bam 

or in the config file

{
    "output": [
        "first_segment.fastq.gz",
        "second_segment.bam"
    ]
}

Each output file is evaluated independently so they don't have to have the same format!

Barcode Splitting

You can also split reads into multiple files by barcodes. Either sample, cellular or molecular barcodes as long as you only split by one decoder. To do that you specify the output directive in the config section for each barcode.

{
    "CN": "CGSB",
    "DT": "2018-02-25T07:00:00+00:00",
    "PI": "300",
    "PL": "ILLUMINA",
    "PM": "miseq",
    "SM": "trinidad",
    "input": [
        "BDGGG.cram",
        "BDGGG.cram",
        "BDGGG.cram"
    ],
    "output": [
        "BDGGG_default.bam"
    ],
    "prior adjusted job url": "decode_with_prior.json?format=json",
    "sample": {
        "algorithm": "pamld",
        "codec": {
            "@AGGCAGAA": {
                "LB": "trinidad 5",
                "barcode": [
                    "AGGCAGAA"
                ],
                "concentration": 0.186575935178905,
                "output": [
                    "BDGGG_AGGCAGAA.bam"
                ]
            },
            "@CGTACTAG": {
                "LB": "trinidad 4",
                "barcode": [
                    "CGTACTAG"
                ],
                "concentration": 0.198520089969607,
                "output": [
                    "BDGGG_CGTACTAG_s01.fastq.gz",
                    "BDGGG_CGTACTAG_s02.fastq"
                ]
            },
            "@GGACTCCT": {
                "LB": "trinidad 9",
                "barcode": [
                    "GGACTCCT"
                ],
                "concentration": 0.211699846980038
            },
            "@TAAGGCGA": {
                "LB": "trinidad 1",
                "barcode": [
                    "TAAGGCGA"
                ],
                "concentration": 0.219937195111557
            },
            "@TCCTGAGC": {
                "LB": "trinidad 8",
                "barcode": [
                    "TCCTGAGC"
                ],
                "concentration": 0.168041901882987
            }
        },
        "confidence threshold": 0.99,
        "noise": 0.015225030876904,
        "transform": {
            "token": [
                "1::8"
            ]
        }
    },
    "template": {
        "transform": {
            "token": [
                "0::",
                "2::"
            ]
        }
    }
}

You can still set a global output and data from any barcode that has not declared its own output directive will be routed to the global output. You can have different layouts for each barcode. When we said Pheniqs is flexible, we meant it ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment