Nanopore basecalling on Google Colab


NOTE: this whole idea is the brainchild of Jürgen Hench. He got it up and running and posted about it here. I am merely wrapping the idea in a hopefully easy-to-follow set of instructions for people to test themselves.


This notebook describes processing of Nanopore sequencing data (fast5 files) in a Google Colab interactive notebook environment. This is made possible by utilising the GPU-enabled runtime available via Colab.

Before we get started there are some important points to consider.

Caveats

Some things to note before proceeding:

  • you will need an ONT community forum account to download Guppy, so make sure you have one and can access the downloads section
  • this is a cloud-based approach, meaning all data will be located in some cloud instance somewhere (I use Google Drive in this example). This may not be appropriate for the data you have, so consider it carefully before uploading any of your data.
  • this is currently a free service; it may be removed at any stage, completely at Google's discretion
    • as part of this, Google has the right to monitor usage and may throttle or deny allocation of resources to users that are running constantly
    • the amount and type of allocated resources can and likely will change. The current GPU instances use GPUs that work with Guppy, and the available disk is about 64 GB, but this can change
  • runtime disconnection is a thing: if the notebook is idle for too long you'll be disconnected
  • it is possible to run out of memory/RAM
  • there is no guarantee that the GPU hardware will be available when you want to use it
    • the GPU that you get allocated might not be compatible with Guppy. For example, in one instance I was assigned a Tesla K80. This is a Kepler-based card and doesn't meet the Guppy requirement of CUDA compute capability >= 6.0. This is the error that I received:
[guppy/error] *common::LoadModuleFromFatbin: Loading fatbin file shared.fatbin failed with: CUDA error at /builds/ofan/ont_core_cpp/ont_core/common/cuda_common.cpp:54: CUDA_ERROR_NO_BINARY_FOR_GPU
  • neither I nor ONT take any responsibility - you're on your own! :)

A note of interest from the Google Colab FAQ:

"The types of GPUs that are available in Colab vary over time. This is necessary for Colab to be able to provide access to these resources for free. The GPUs available in Colab often include Nvidia K80s, T4s, P4s and P100s. There is no way to choose what type of GPU you can connect to in Colab at any given time. Users who are interested in more reliable access to Colab’s fastest GPUs may be interested in Colab Pro."

So there are 4 different GPUs on offer, and it's essentially a 'lottery' as to which you get assigned - though it will likely be one of the less powerful options. Here is an overview of these GPUs with respect to which "work" with Guppy:

  • Nvidia K80 - not compatible with Guppy
    • Kepler 2.0 microarchitecture
      • Year of release = 2014
    • CUDA Compute = 3.7
    • 2496 x2 CUDA cores (essentially a dual GPU)
  • Nvidia P4 - compatible with Guppy
    • Pascal microarchitecture
      • Year of release = 2016
    • CUDA Compute = 6.1
    • 2560 CUDA cores
  • Nvidia P100 - compatible with Guppy
    • Pascal microarchitecture
      • Year of release = 2016
    • CUDA Compute = 6.0
    • 3584 CUDA cores
  • Nvidia T4 - compatible with Guppy
    • Turing microarchitecture
      • Year of release = 2018
    • CUDA Compute = 7.5
    • 2560 CUDA cores

So of the 4 types of GPU currently available via the free tier of Google Colab, the Nvidia K80 is the only one which will not work with Guppy as it is currently implemented. If you end up with an instance with a K80 then there is no point continuing, and you can try again later - the snippet below lets you check before you commit. If you sign up for the Pro version of Google Colab (9.99 USD per month) then you get priority access to better GPUs - food for thought.
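
A minimal sketch for that check - nvidia-smi ships with the Colab GPU runtime (juhench makes a similar suggestion in the comments below):

%%shell
# print the model of the allocated GPU; if it reports a Tesla K80,
# factory reset the runtime (Runtime menu) and try your luck again
nvidia-smi --query-gpu=name --format=csv,noheader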


Initiate GPU runtime

The first thing is to make sure the runtime is set to use a GPU. To do this is pretty simple:

  • go to the Runtime menu
  • select the Change runtime type option
  • make sure the Hardware accelerator is set to GPU

Check the presence of a GPU

Once the above is set up you should be able to run the below code block. If successful you should see something like /device:GPU:0 as the output. This means that the GPU is available for use.

import tensorflow as tf
tf.test.gpu_device_name()       # this will tell you device number (should be 0 with a single GPU)

import torch
torch.cuda.get_device_name(0)   # this will tell you the name/model of the GPU
'Tesla T4'

Download Guppy

You will need an account on the ONT community forum here in order to reach the download section and grab a copy of Guppy.

Once you have access, navigate to the 'Software Downloads' section of the ONT community forum and you will see a listing for Guppy. I recommend grabbing the pre-compiled binaries, i.e. the version listed as Linux x64-bit GPU; it should have a file name similar to ont-guppy_X.X.X_linux64.tar.gz, where the X's denote the version number. Copy the link to this download and paste it into the code block below, i.e. replace the section [paste_guppy_link_here].

Run the code block and Guppy will be downloaded.

%%shell
GuppyBinary="[paste_guppy_link_here]"
wget "$GuppyBinary"
...
...
...
Resolving americas.oxfordnanoportal.com (americas.oxfordnanoportal.com)... 96.126.99.215
Connecting to americas.oxfordnanoportal.com (americas.oxfordnanoportal.com)|96.126.99.215|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 637723012 (608M) [application/x-tar]
Saving to: ‘ont-guppy_4.5.3_linux64.tar.gz’

ont-guppy_4.5.3_lin 100%[===================>] 608.18M  44.7MB/s    in 14s     

2021-04-14 10:29:27 (42.0 MB/s) - ‘ont-guppy_4.5.3_linux64.tar.gz’ saved [637723012/637723012]
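
Optionally, you can sanity-check the download before unpacking - a quick sketch (the wildcard assumes the default filename; adjust if yours differs):

%%shell
ls -lh ont-guppy_*_linux64.tar.gz                                      # size should match the 'Length' wget reported
tar -tzf ont-guppy_*_linux64.tar.gz > /dev/null && echo "archive OK"   # a corrupt download will fail here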

Extract the compressed Guppy binaries

Before we can use the Guppy binaries we need to extract the archive we downloaded. Replace the version number in the code block below with the one you downloaded, then run it. Using version 4.5.3 as an example:

%%shell
tar -xzvf ont-guppy_4.5.3_linux64.tar.gz
ont-guppy/bin/
ont-guppy/bin/guppy_basecall_client
ont-guppy/bin/guppy_basecall_server
ont-guppy/bin/guppy_basecaller
ont-guppy/bin/guppy_basecaller_supervisor
ont-guppy/data/
ont-guppy/data/YHR174W.fasta
ont-guppy/data/adapter_scaling_dna_r10.3_min.jsn
ont-guppy/data/adapter_scaling_dna_r10.3_prom.jsn
ont-guppy/data/adapter_scaling_dna_r9.4.1_min.jsn
ont-guppy/data/adapter_scaling_dna_r9.4.1_prom.jsn
ont-guppy/data/certs-bundle.crt
ont-guppy/data/dna_r10.3_450bps_fast.cfg
ont-guppy/data/dna_r10.3_450bps_fast_prom.cfg
ont-guppy/data/dna_r10.3_450bps_hac.cfg
ont-guppy/data/dna_r10.3_450bps_hac_prom.cfg
ont-guppy/data/dna_r10.3_450bps_modbases_5mc_hac_prom.cfg
ont-guppy/data/dna_r10_450bps_fast.cfg
ont-guppy/data/dna_r10_450bps_hac.cfg
ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
ont-guppy/data/dna_r9.4.1_450bps_fast_prom.cfg
ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
ont-guppy/data/dna_r9.4.1_450bps_hac_prom.cfg
ont-guppy/data/dna_r9.4.1_450bps_hac_prom_fw205.cfg
ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg
ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac_prom.cfg
ont-guppy/data/dna_r9.4.1_450bps_sketch.cfg
ont-guppy/data/dna_r9.5_450bps.cfg
ont-guppy/data/lambda_3.6kb.fasta
ont-guppy/data/lampore_analysis-2.0.0-py3-none-any.whl
ont-guppy/data/mismatch_matrix.txt
ont-guppy/data/rna_r9.4.1_70bps_fast.cfg
ont-guppy/data/rna_r9.4.1_70bps_fast_prom.cfg
ont-guppy/data/rna_r9.4.1_70bps_hac.cfg
ont-guppy/data/rna_r9.4.1_70bps_hac_prom.cfg
ont-guppy/data/template_r10.3_450bps_fast.jsn
ont-guppy/data/template_r10.3_450bps_fast_prom.jsn
ont-guppy/data/template_r10.3_450bps_hac.jsn
ont-guppy/data/template_r10.3_450bps_hac_prom.jsn
ont-guppy/data/template_r10.3_450bps_modbases_5mc_hac_prom.jsn
ont-guppy/data/template_r10_450bps_fast.jsn
ont-guppy/data/template_r10_450bps_hac.jsn
ont-guppy/data/template_r9.4.1_450bps_fast.jsn
ont-guppy/data/template_r9.4.1_450bps_fast_prom.jsn
ont-guppy/data/template_r9.4.1_450bps_hac.jsn
ont-guppy/data/template_r9.4.1_450bps_hac_prom.jsn
ont-guppy/data/template_r9.4.1_450bps_hac_prom_fw205.jsn
ont-guppy/data/template_r9.4.1_450bps_modbases_5mc_hac.jsn
ont-guppy/data/template_r9.4.1_450bps_modbases_5mc_hac_prom.jsn
ont-guppy/data/template_r9.4.1_450bps_sketch.jsn
ont-guppy/data/template_r9.5_450bps_5mer_raw.jsn
ont-guppy/data/template_rna_r9.4.1_70bps_fast.jsn
ont-guppy/data/template_rna_r9.4.1_70bps_fast_prom.jsn
ont-guppy/data/template_rna_r9.4.1_70bps_hac.jsn
ont-guppy/data/template_rna_r9.4.1_70bps_hac_prom.jsn
ont-guppy/bin/
ont-guppy/bin/guppy_aligner
ont-guppy/bin/minimap2
ont-guppy/lib/
ont-guppy/lib/MINIMAP2_LICENSE
ont-guppy/lib/libont_minimap2.so.2
ont-guppy/lib/libont_minimap2.so.2.17.2
ont-guppy/bin/
ont-guppy/bin/Nanopore Product Terms and Conditions (28 November 2018).pdf
ont-guppy/bin/THIRD_PARTY_LICENSES
ont-guppy/bin/
ont-guppy/bin/guppy_barcoder
ont-guppy/data/
ont-guppy/data/barcoding/
ont-guppy/data/barcoding/4x4_mismatch_matrix.txt
ont-guppy/data/barcoding/5x5_mismatch_matrix.txt
ont-guppy/data/barcoding/5x5_mismatch_matrix_simple.txt
ont-guppy/data/barcoding/barcode_arrs_16s.cfg
ont-guppy/data/barcoding/barcode_arrs_dual_nb24_pcr96.cfg
ont-guppy/data/barcoding/barcode_arrs_lwb.cfg
ont-guppy/data/barcoding/barcode_arrs_multivirus1.cfg
ont-guppy/data/barcoding/barcode_arrs_multivirus8.cfg
ont-guppy/data/barcoding/barcode_arrs_nb12.cfg
ont-guppy/data/barcoding/barcode_arrs_nb13-24.cfg
ont-guppy/data/barcoding/barcode_arrs_nb24.cfg
ont-guppy/data/barcoding/barcode_arrs_nb96.cfg
ont-guppy/data/barcoding/barcode_arrs_ncov8.cfg
ont-guppy/data/barcoding/barcode_arrs_ncov96.cfg
ont-guppy/data/barcoding/barcode_arrs_pcr12.cfg
ont-guppy/data/barcoding/barcode_arrs_pcr96.cfg
ont-guppy/data/barcoding/barcode_arrs_rab.cfg
ont-guppy/data/barcoding/barcode_arrs_rbk.cfg
ont-guppy/data/barcoding/barcode_arrs_rbk096.cfg
ont-guppy/data/barcoding/barcode_arrs_rbk4.cfg
ont-guppy/data/barcoding/barcode_arrs_rlb.cfg
ont-guppy/data/barcoding/barcode_arrs_vmk.cfg
ont-guppy/data/barcoding/barcode_arrs_vmk2.cfg
ont-guppy/data/barcoding/barcode_score_vs_classification.png
ont-guppy/data/barcoding/barcodes_masked.fasta
ont-guppy/data/barcoding/configuration.cfg
ont-guppy/data/barcoding/configuration_dual.cfg
ont-guppy/data/barcoding/multivirus_targets.fasta
ont-guppy/data/barcoding/ncov_targets.fasta
ont-guppy/data/barcoding/nw_barcoding_grid.png
ont-guppy/lib/
ont-guppy/lib/libvbz_hdf_plugin.so
ont-guppy/lib/libvbz_hdf_plugin.so.1
ont-guppy/lib/libvbz_hdf_plugin.so.1.0.0

Check Guppy version

We should now be able to run the Guppy binaries we downloaded. They are located in ./ont-guppy/bin. The below code block should run guppy_basecaller and report the version of the software.

%%shell
./ont-guppy/bin/guppy_basecaller --version
: Guppy Basecalling Software, (C) Oxford Nanopore Technologies, Limited. Version 4.5.3+0ab5ebb
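
If typing the ./ont-guppy/bin/ prefix gets tedious, you can put the binaries on your PATH - but note that each %%shell cell runs in its own fresh shell, so the export does not persist between cells:

%%shell
export PATH="$PWD/ont-guppy/bin:$PATH"   # only lasts for this cell
guppy_basecaller --version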

Mount your Google Drive

By mounting your Google Drive you will be able to upload fast5 files for processing, with the output written back to the same location within Drive.

The below chunk performs the mounting. You will be asked to authenticate; just follow the instructions and things should go pretty smoothly.

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
Mounted at /content/gdrive

For this example I created a directory within My Drive called ONT and then within this folder another directory called example_data. I then uploaded a few fast5 files to this location.
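
If you want to replicate that layout from within the notebook rather than via the Drive web interface, one line does it (a sketch - adjust the directory names to suit):

%%shell
mkdir -p gdrive/MyDrive/ONT/example_data   # -p creates both ONT and example_data in one go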

We can check that the mounted drive and files are identified in the notebook environment below.

%%shell
ls gdrive/MyDrive/ONT/example_data
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_0.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_10.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_11.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_12.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_13.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_14.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_15.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_16.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_17.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_18.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_19.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_1.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_20.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_2.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_3.fast5
PAD42977_84899b42a6019949c8f43092626c45d9beac7752_4.fast5

Looks good! We can see a list of fast5 files.

Basecalling with Guppy

Now for the fun part!

With all of the above working, we can now basecall our data. First we will set a few variables. The below code block creates shell variables for the input and output locations, the Guppy basecaller binary, and several model configuration files for basecalling (i.e. fast, hac and modified bases).

Fast calling model

Once we're happy with these variables we can put together the Guppy command to start basecalling. Below is a fairly simple run using the fast model, with the parameters adjusted slightly for the compute environment.

Run this block and hopefully you'll see basecalling kick off. If so, that's all there is to it. :)

%%shell
inputPath="gdrive/MyDrive/ONT/example_data"
outputPath="gdrive/MyDrive/ONT/example_data"
guppy_bc=./ont-guppy/bin/guppy_basecaller                               # set guppy_basecaller binary location
guppy_cfg_fast=./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg              # fast model calling
guppy_cfg_hac=./ont-guppy/data/dna_r9.4.1_450bps_hac.cfg                # high accuracy calling
guppy_cfg_mod=./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg   # base modification calling

$guppy_bc -i $inputPath -s $outputPath  \
--recursive \
--config $guppy_cfg_fast \
--gpu_runners_per_device 16 \
--cpu_threads_per_caller 2 \
--device cuda:0
ONT Guppy basecalling software version 4.5.3+0ab5ebb
config file:        ./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
model file:         /content/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
input path:         gdrive/MyDrive/ONT/example_data
save path:          gdrive/MyDrive/ONT/example_data
chunk size:         2000
chunks per runner:  160
minimum qscore:     7
records per file:   4000
num basecallers:    4
gpu device:         cuda:0
kernel path:        
runners per device: 16
Found 16 fast5 files to process.
Init time: 696 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 101781 ms, Samples called: 1912424322, samples/s: 1.87896e+07
Finishing up any open output files.
Basecalling completed successfully.
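
Guppy writes its fastq output plus a sequencing_summary.txt into the save path (depending on version and filtering options, the fastq files may land in pass/fail subdirectories). A quick sketch to eyeball the results and count the called reads, assuming the paths used above:

%%shell
outputPath="gdrive/MyDrive/ONT/example_data"
ls "$outputPath"                           # fastq output and summary alongside the input fast5 files
# fastq records are 4 lines each, so total lines / 4 = read count
find "$outputPath" -name "*.fastq" -exec cat {} + | awk 'END{print NR/4, "reads"}'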

HAC model run

This basecalling run performs high accuracy calling. I was actually very surprised by the speed of the GPU that generated this output (an Nvidia T4) - going by the samples/s figures here and in the fast run above, hac ran only about 4x slower than the fast model. I feel it would be a decent option if you wanted to turn around a small amount of data using the hac model.

The below code block will perform hac:

%%shell
inputPath="gdrive/MyDrive/ONT/example_data"
outputPath="gdrive/MyDrive/ONT/example_data"
guppy_bc=./ont-guppy/bin/guppy_basecaller                               # set guppy_basecaller binary location
guppy_cfg_fast=./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg              # fast model calling
guppy_cfg_hac=./ont-guppy/data/dna_r9.4.1_450bps_hac.cfg                # high accuracy calling
guppy_cfg_mod=./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg   # base modification calling

$guppy_bc -i $inputPath -s $outputPath  \
--recursive \
--config $guppy_cfg_hac \
--gpu_runners_per_device 16 \
--cpu_threads_per_caller 2 \
--device cuda:0
ONT Guppy basecalling software version 4.5.3+0ab5ebb
config file:        ./ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file:         /content/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path:         gdrive/MyDrive/ONT/example_data
save path:          gdrive/MyDrive/ONT/example_data
chunk size:         2000
chunks per runner:  512
minimum qscore:     9
records per file:   4000
num basecallers:    4
gpu device:         cuda:0
kernel path:        
runners per device: 16
Found 16 fast5 files to process.
Init time: 1864 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 409705 ms, Samples called: 1904252640, samples/s: 4.64786e+06
Finishing up any open output files.
Basecalling completed successfully.

Modified base run

If you are interested in exploring base modifications then you can provide the appropriate model configuration file and let it run. Again, I was quite surprised by the speed in this cloud instance using an Nvidia T4 - good stuff.

Run the below code block for base modification enabled calling:

%%shell
inputPath="gdrive/MyDrive/ONT/example_data"
outputPath="gdrive/MyDrive/ONT/example_data"
guppy_bc=./ont-guppy/bin/guppy_basecaller                               # set guppy_basecaller binary location
guppy_cfg_fast=./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg              # fast model calling
guppy_cfg_hac=./ont-guppy/data/dna_r9.4.1_450bps_hac.cfg                # high accuracy calling
guppy_cfg_mod=./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg   # base modification calling

$guppy_bc -i $inputPath -s $outputPath  \
--recursive \
--config $guppy_cfg_mod \
--gpu_runners_per_device 16 \
--cpu_threads_per_caller 2 \
--device cuda:0
ONT Guppy basecalling software version 4.5.3+0ab5ebb
config file:        ./ont-guppy/data/dna_r9.4.1_450bps_modbases_5mc_hac.cfg
model file:         /content/ont-guppy/data/template_r9.4.1_450bps_modbases_5mc_hac.jsn
input path:         gdrive/MyDrive/ONT/example_data
save path:          gdrive/MyDrive/ONT/example_data
chunk size:         2000
chunks per runner:  512
minimum qscore:     9
records per file:   4000
num basecallers:    4
gpu device:         cuda:0
kernel path:        
runners per device: 16
Found 16 fast5 files to process.
Init time: 1820 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 406350 ms, Samples called: 1904252640, samples/s: 4.68624e+06
Finishing up any open output files.
Basecalling completed successfully.

Final thoughts

Well, that is really all there is to it: cloud-based, GPU-accelerated basecalling on the free tier of Google Colab is not just possible, it's actually quite usable! Again, a massive thanks to Jürgen Hench, who put in all the hard work and created the initial post explaining that this was a possibility.

Moving forward, it would be interesting to see how the paid tiers perform; the Pro version of Google Colab is only 9.99 USD per month and can be cancelled at any time. I might clock up a month or two and try a little benchmarking. It would also be very useful to examine other cloud-based options, e.g. AWS with GPU-enabled instances. The prices of instances with decent GPUs are dropping rather quickly, which is quite exciting.

Happy GPU basecalling everyone!


juhench commented Apr 14, 2021

@sirselim: Thanks a lot for wrapping this up so nicely! Well done!

@sirselim (Author)

@juhench - no worries at all, thanks for providing all the background and testing. There is already one person on the ONT community forum who is using it for GPU basecalling because she doesn't have access to anything else - so in my mind it's already a very worthwhile exercise.


juhench commented Apr 14, 2021

A short note on checking the model of the currently supplied GPU in your Colab session: create a cell that says
!nvidia-smi
This will output comprehensive GPU information. I always include this one-liner in the first cell of a GPU playground on Colab; it saves you from surprises.

One thing that I find useful with Colab is the ability to create a working pipeline template, share it with someone, and test your own software deployment scripts. Of course, one could alternatively create virtual machines, but it is much less tedious with Colab. It is also a good way to have your students play around with Guppy-GPU and experience the speed, i.e. why it is worth all the hassle with CUDA etc. Too bad there is no JetsonAGX-Colab :) (yet?)

@sirselim (Author)

!nvidia-smi is a nice idea; I was using:

import torch
torch.cuda.get_device_name(0)   # this will tell you the name/model of the GPU


husamia commented Apr 15, 2021

I tested the latest Guppy version with the latest bonito model, which takes the longest to run, and it worked.

ONT Guppy basecalling software version 4.5.3+0ab5ebb
config file:        ./ont-guppy/data/res_dna_r941_min_crf_v032.cfg
model file:         /content/ont-guppy/data/res_dna_r941_min_crf_v032.jsn
input path:         gdrive/MyDrive/ONT/
save path:          gdrive/MyDrive/ONT/
chunk size:         720
chunks per runner:  320
minimum qscore:     7
records per file:   4000
num basecallers:    4
gpu device:         cuda:0
kernel path:        
runners per device: 16
Found 2 fast5 files to process.
Init time: 11653 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 602385 ms, Samples called: 537848519, samples/s: 892865

@sirselim (Author)

@husamia - very nice! Which of the GPUs was this using? Did you modify any of the parameters?


husamia commented Apr 16, 2021

I got the free Tesla. I downloaded the model file from rerio and copied it into the data folder. Other than that, everything is the same.

@ramongallego

Works like a charm - tried it with the new Guppy 5 and the sup model for a 10.3 flowcell, and it processed ~1M samples/sec. Can I ask what --device cuda:0 does?

@sirselim (Author)

@ramongallego - great to hear!

The --device argument tells Guppy which GPU to use. If there are multiple GPUs in a system you can select any combination of them: 0 is always the first GPU, 1 the second, 2 the third, and so on. You can address them manually with '--device', or set it to 'auto' and Guppy will try to locate the cards automatically.
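
As a minimal sketch, reusing the example paths and fast config from earlier in the gist, the auto form looks like this:

%%shell
# same run as the fast-model example above, but letting Guppy find the GPU itself
./ont-guppy/bin/guppy_basecaller \
  -i gdrive/MyDrive/ONT/example_data \
  -s gdrive/MyDrive/ONT/example_data \
  --config ./ont-guppy/data/dna_r9.4.1_450bps_fast.cfg \
  --device auto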

@miquelupf

Dear Sirselim,

I used your scripts to basecall several large datasets - I estimate around 200 gigabytes of fast5 files. This would have taken about a month of basecalling on a MinION Mk1C with the high accuracy method.
I have no idea about programming, but it was very straightforward to follow and succeed with.
I see a limitation in Google Colab disconnecting when idle, and in the maximum runtime allowed; the Pro version improves this slightly, but the big difference comes with Pro+.
With a Tesla T4 GPU I could basecall 1260 fast5 files in 6h20', which compared to the MinION is extremely fast.

Thanks a lot for your service to the community!



vebaev commented Jun 3, 2023

Really works like a charm: compared to the 60 CPU cores on my server, which reached 10% of my data in 24h, on the T4 it was like 50 min!
I set up Amphetamine on my Mac to keep it awake and simulate mouse movements; let's see, I hope it will not disconnect, as it will probably take 7-8h…

After 6h (70%) I was disconnected with a message saying I had reached the GPU limit, and I cannot run and --resume basecalling; there is probably a cool-down time…


C-young-maker commented Aug 19, 2023

It worked so well!! I got lucky with the A100 :) I was processing 300 GB of data sequenced using the P2 (ONT).
I would also recommend doing this so it doesn't disconnect while waiting for your run to finish:

Set a JavaScript interval to click on the connect button every 60 seconds.

Open developer-settings (in your web-browser) with Ctrl+Shift+I then click on console tab and type this on the console prompt. (for mac press Option+Command+I)

function ClickConnect(){
  document.querySelector("colab-connect-button").click()
  console.log("Clicked on connect button");
}
setInterval(ClickConnect, 60000)

https://stackoverflow.com/questions/57113226/how-can-i-prevent-google-colab-from-disconnecting

@russellsmithies

Hi Miles, are you going to start on dorado benchmarking soon?
:-)
