@chrisamiller
Last active November 10, 2023 23:31
Long Read Alignment

Let's start by looking at some outputs of long read sequencing from the Oxford Nanopore (ONT) platform. These are sequences from the K562 cell line, prepared with the ONT cDNA sequencing kit (poly-A selected). Off the machine, the data consist of FAST5 or POD5 files, which are compressed representations of the raw signal. These are subsequently run through a basecalling algorithm (such as Dorado) to generate FASTQ files.
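As a rough sketch of that basecalling step (not run as part of this exercise - the model name and directory names below are illustrative assumptions, and Dorado's flags can differ between versions):

```shell
# Hypothetical basecalling sketch: convert POD5 signal data to FASTQ with Dorado.
# "hac" (high-accuracy model shorthand) and the directory names are placeholders.
dorado basecaller hac pod5_dir/ --emit-fastq > basecalled.fastq

# Compress for downstream tools
gzip basecalled.fastq
```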

The choice of basecalling algorithm and parameters goes pretty deep, so we'll assume that reasonable choices have been made. For simplicity, we've also subset the data to include just small portions of the genome, including a few genes of interest.

Go ahead and pull down this fastq file:

wget https://storage.googleapis.com/bfx_workshop_tmp/k562_ont_raw.fastq.gz

To start, let's go ahead and trim the data with the pychopper tool from ONT. It's used to identify, orient, trim, and un-fuse reads. We don't have a local install of this tool, so let's use docker to run it:

docker run -it -v $(pwd -P):/data ontresearch/wf-isoforms
cd /data

pychopper -t 8 -m edlib -k PCS111 -r pychopper_report.pdf -S pychopper_stats.txt k562_ont_raw.fastq.gz | gzip > k562_ont.trimmed.fastq.gz

Use exit to get out of the docker container when trimming is complete. Now we can look at the data in a few different ways:

  1. Is the number of reads the same in the two files? Use command line utilities to answer.

  2. How many bases were removed from the fastq file? (hint - don't include non-sequence lines)

  3. Look at the pychopper stats file. How many reads were removed? Then open the output PDF report in your browser. Roughly how many bases were trimmed/removed, on average?
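For the first two questions, one possible approach is sketched below - on a tiny synthetic fastq rather than the real files, so the numbers here are purely illustrative:

```shell
# Build a tiny two-read fastq so the commands below have something to run on
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n' | gzip > demo.fastq.gz

# A fastq record is 4 lines, so the read count is the line count divided by 4
echo "reads: $(( $(zcat demo.fastq.gz | wc -l) / 4 ))"

# Total bases: sum the lengths of the sequence lines only (line 2 of every record)
zcat demo.fastq.gz | awk 'NR % 4 == 2 { total += length($0) } END { print "bases: " total }'
```

Run the same two counts on both the raw and trimmed fastq files, then take the difference. (On macOS, `zcat` on gzip files may need to be `gunzip -c`.)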

In this case, our data has been pre-selected from certain regions of the genome, and so essentially all of the reads are of decent quality and usable, but in some real-life runs, a substantial fraction of them might not be!

Sequence QC

Next, let's do some basic QC on this fastq using a tool called NanoPlot.

mkdir nanoplot
NanoPlot --fastq k562_ont.trimmed.fastq.gz --prefix nanoplot/

Open the NanoPlot-report.html file and note that our reads here are quite long compared to short read data - over 1000 bp long. ONT typically advertises reads reaching tens of thousands of base pairs, though - why isn't this the case here?

Answer

This data is created from RNA, which means that the lengths of the molecules depend on the lengths of the transcripts, which are typically not tens of thousands of bases long.

Scrolling down, you can see this length, along with other stats, represented graphically with embedded dynamic plots, or you can retrieve simple png images from the output folder. Also note the mean read quality, which is substantially lower than what we'd see with Illumina short-read sequencing.
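Those quality scores come from the Phred-encoded string on the fourth line of each fastq record: Q = -10·log10(error probability), so Q10 means roughly a 10% chance the base is wrong. A small decoding sketch (the quality string below is made up):

```shell
# Decode a made-up Phred+33 quality string: 'I' = Q40, '?' = Q30, '+' = Q10
qual="II??++"

total=0
for (( i = 0; i < ${#qual}; i++ )); do
  c=${qual:$i:1}
  # printf '%d' "'X" prints the ASCII code of X; subtract 33 for Phred+33 encoding
  q=$(( $(printf '%d' "'$c") - 33 ))
  echo "base $(( i + 1 )): Q$q"
  total=$(( total + q ))
done

# Integer mean of the per-base scores
mean_q=$(( total / ${#qual} ))
echo "mean quality: Q$mean_q"
```

Note that many QC tools (including NanoPlot's underlying math) average the error probabilities rather than the Phred scores themselves, which gives a lower - and arguably more honest - summary number.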

Looking at the read length plots, the distribution looks a little multimodal. In a real experiment that would be odd, but it's nothing to worry about here - it's an artifact of using only a very small amount of data.

Aligning the reads

Next up, let's align this data, using a tool called minimap2. It's generally the go-to aligner for long-read data, whether from RNA or DNA.

Just like with short-read DNA data, we'll need a reference genome to align against. Pull down the human build 38 reference and untar it:

wget https://storage.googleapis.com/bfx_workshop_tmp/ref_genome_sm.tar
tar -xvf ref_genome_sm.tar

Your command for running the alignment should look like this:

minimap2 -ax splice -uf ref_genome.fa.gz k562_ont.trimmed.fastq.gz >k562_ont.sam

While that's running, a few things to notice here:

  1. We don't need to create an index ahead of time for minimap2. It builds its index of the reference genome very quickly and efficiently compared to short-read aligners like bwa.

  2. We're using the -x splice preset, which indicates that it should be doing spliced alignment. This handles RNAseq data, which has to account for introns interrupting the reads. We'll talk about spliced alignment in more detail during the RNAseq portion of this course. We're also using the -uf flag, which helps it find exon boundaries more effectively by looking at the sequence context.
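To make "spliced alignment" concrete: in the output SAM, an intron shows up as an N operation in the read's CIGAR string, which skips reference bases without consuming any read bases. A sketch of walking a CIGAR to get the reference span (the CIGAR below is made up, not from this dataset):

```shell
# Hypothetical CIGAR for a two-exon spliced read:
# 50 aligned bases, a 2000 bp intron skip (N), then 50 more aligned bases.
cigar="50M2000N50M"

# Add up the operations that consume reference bases (M, D, N, =, X)
ref_span=$(echo "$cigar" | awk '{
  span = 0
  while (match($0, /[0-9]+[MIDNSHP=X]/)) {
    len = substr($0, RSTART, RLENGTH - 1) + 0
    op  = substr($0, RSTART + RLENGTH - 1, 1)
    if (op ~ /[MDN=X]/) span += len
    $0 = substr($0, RSTART + RLENGTH)
  }
  print span
}')
echo "reference span: $ref_span"
```

A 100 bp read can therefore span thousands of reference bases - which is exactly what you'll see when these reads cross introns in IGV.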

Now we've got an aligned SAM file, and you can look at it with less and see that it follows the familiar format: a header, followed by alignments. As before, we'll want to convert, sort, and index it to make it useful for downstream steps.

samtools view -Sb k562_ont.sam | samtools sort >k562_ont.bam
samtools index k562_ont.bam
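When you inspect individual records (e.g. with samtools view), column 2 is a bitwise FLAG, and long-read spliced alignment frequently produces secondary and supplementary records worth recognizing. A quick sketch of decoding a flag with shell arithmetic (the flag value is just an example):

```shell
# Example FLAG: 272 = 256 (secondary alignment) + 16 (reverse strand)
flag=272

# Test individual bits defined by the SAM specification
if [ $(( flag & 16 )) -ne 0 ]; then echo "read aligned to the reverse strand"; fi
if [ $(( flag & 256 )) -ne 0 ]; then echo "secondary alignment"; fi
if [ $(( flag & 2048 )) -ne 0 ]; then echo "supplementary alignment"; fi
```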

At this point, let's load this up in IGV and get a feel for what we've got. Let's grab an Illumina bam, generated from the same cell line, for comparison:

wget https://storage.googleapis.com/bfx_workshop_tmp/k562_illumina.bam

Open both the Illumina and ONT bams in IGV, using the Human (hg38) reference genome. Navigate to the U2AF1 gene and answer some questions:

  1. How many exons are spanned by a typical short read? How about a typical long read? Which has more even coverage over the exons of these genes?

  2. Right click on the Gene track and choose "Expanded". Can you match up individual long reads with full-length transcripts? What about short reads? How might this affect your ability to understand alternative splicing?

  3. Zoom into an exon, until you can see basepair level changes. Which data has a higher error rate? Are the classes of errors you see between the two platforms different? What kinds of analyses would be limited by this error rate?

  4. Now navigate to the DNMT3A gene and zoom in on the 3' (left) end. What do you notice about the reads and the coverage? Do these look like full-length transcripts to you?

The longest stable isoform of the DNMT3A gene is 9421 bp long. Can you think of any reason why the length of the transcripts might impact the coverage and bias of the gene?

Answer

This data is created from RNA, which is a much less stable molecule than DNA. Because RNA can fragment (during extraction and library preparation), very few full-length reads were generated. Since we're poly-A selecting - pulling down molecules based on that tail - we end up with substantial 3' bias in the long read data, especially on longer transcripts.

Compare this to the short read data. Which is more useful in this situation? What kinds of biases do this fragmentation and selection introduce when estimating gene expression or transcript abundance?
