alxndrdiaz/blobtoolkit_gsoc2022.md

## blobtoolkit_gsoc2022.md

      
Google Summer of Code 2022 Project Report


Project: Conversion of the BlobToolKit pipeline to Nextflow
GitHub repository: https://github.com/sanger-tol/blobtoolkit
Organization: Wellcome Sanger Tree Of Life
Mentors:  Richard Challis, Priyanka Surana, Sujai Kumar
Project admin: Matthieu Muffato
Contributor: Alexander Ramos

1. Project description

The BlobToolKit pipeline, implemented with Snakemake, is designed to obtain quality metrics for eukaryotic genome assemblies. BlobToolKit can also detect contamination, that is, DNA fragments in the genome assembly that belong to a different species. Converting this pipeline to Nextflow will allow (1) its integration into the Tree of Life programme’s bioinformatics infrastructure, and (2) the release of an open source nf-core pipeline that can be used by the bioinformatics community.

2. Modules

A module is a Nextflow script that takes a set of inputs, performs a specific task, and can then pass its outputs to another module. Modules are kept in two locations: in nf-core ("A community effort to collect a curated set of analysis pipelines built using Nextflow") and locally in the pipeline. nf-core modules are peer-reviewed and must fully follow the nf-core guidelines. Dedicated nf-core tooling allows importing them into Nextflow pipelines as dependencies. Local modules, on the other hand, typically live in one pipeline only and may voluntarily follow the nf-core guidelines. They perform tasks that are specific to the pipeline and would not be used in other pipelines.

Merged PRs in nf-core modules repository:

PR #1927 entrezdirect/esearch: queries NCBI databases.
PR #1833 entrezdirect/esummary: fetches the output from a search in a NCBI database using a unique identifier.
PR #1926 entrezdirect/xtract: transforms the results from a query in a NCBI database to a tabular format.
PR #1866 goat/taxonsearch: queries metadata for a taxon from the Genomes on a Tree (GoaT) project.
PR #2099: fixes the bug described in nf-core issue #2012. This module is required in the busco_subworkflow to fetch metadata for a taxon, and the fix allows the nf-core goat/taxonsearch module to be used in the pipeline.

3. Subworkflows

Subworkflows are Nextflow scripts that perform a task that cannot be accomplished by a single module. At the outer level, a subworkflow takes a set of inputs and passes a set of outputs to another subworkflow. Internally, it connects those inputs to modules, whose outputs are connected to other modules, and so forth. Modules are the building blocks of subworkflows, and subworkflows are the building blocks of the main workflow. In this project, each subworkflow was developed in a separate branch of the BlobToolKit repository, and a PR was created for each of them.

List of merged PRs in blobtoolkit repository:

PR #27: io, local modules: input_tol. Takes a ToL ID and project name as input. Creates the samplesheet with aligned CRAM files and links to the unmasked (or masked) genome file and its fai index. Additionally, users can provide their own CSV samplesheet containing input genome FASTA file identifiers and their paths.

4. Unfinished work

4.1 PRs that were not merged

List of PRs that were not merged in blobtoolkit repository (descriptions are the same as in BlobToolKit sub-pipelines):


PR #28: busco_subworkflow, local modules: goat_taxon_search, extract_busco_genes. Fetches BUSCO lineages for the specified taxon, then runs BUSCO using specific and basal lineages (archaea_odb10, bacteria_odb10, and eukaryota_odb10). Counts BUSCOs in 1 kb windows for each contig. Finally, it runs a diamond blastp search of BUSCO gene models for basal lineages against the UniProt reference proteomes.

Related to PR #2099, described above: remove the local module goat_taxon_search.nf, install the nf-core goat/taxonsearch module, and then use the splitCsv operator to convert the module's output .tsv file into a list of BUSCO lineages, instead of doing this inside the local module.
For EXTRACT_BUSCO_GENES, generate the list of paths to the BUSCO tables (full_table.tsv files) for archaea_odb10, bacteria_odb10, and eukaryota_odb10. This should be changed in the script subworkflows/local/busco_diamond_blastp.nf, because the code currently used to get the paths to these tables is not working.
The full_table.tsv for the first lineage in the lineages list is an input for the diamond_blastx subworkflow, so it should also be an output channel of this subworkflow; see the comment in the PR.
Make sure that the modules EXTRACT_BUSCO_GENES and DIAMOND_BLASTP are working.
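
The "counts BUSCOs in 1 kb windows for each contig" step can be illustrated with a minimal Python sketch. This is not the pipeline's code: it assumes a BUSCO v5-style full_table.tsv (tab-separated, with the columns Busco id, Status, Sequence, Gene Start, ...), assigns each gene to the window containing its start coordinate, and uses a hypothetical function name.

```python
from collections import Counter

WINDOW = 1000  # window size in bp (1 kb, as in the description above)


def count_buscos_per_window(table_lines, window=WINDOW):
    """Count BUSCO genes per (contig, window index).

    Assumes BUSCO v5-style full_table.tsv columns:
    Busco id, Status, Sequence, Gene Start, Gene End, ...
    A gene is assigned to the window that contains its start coordinate.
    """
    counts = Counter()
    for line in table_lines:
        # Skip blank lines and the commented header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Missing BUSCOs carry no sequence or coordinates.
        if len(fields) < 5 or fields[1] == "Missing":
            continue
        contig, start = fields[2], int(fields[3])
        counts[(contig, start // window)] += 1
    return counts


# Made-up example rows, for illustration only.
example_table = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength",
    "busco_1\tComplete\tcontig_1\t150\t900\t+\t500.0\t250",
    "busco_2\tComplete\tcontig_1\t980\t1500\t-\t300.0\t170",
    "busco_3\tMissing",
    "busco_4\tFragmented\tcontig_1\t1200\t1800\t+\t100.0\t90",
    "busco_5\tDuplicated\tcontig_2\t10\t400\t+\t200.0\t120",
]
example_counts = count_buscos_per_window(example_table)
for (contig, index), n in sorted(example_counts.items()):
    print(contig, index * WINDOW, (index + 1) * WINDOW, n)
```

In the pipeline itself this logic lives in the blobtoolkit scripts shipped in the Docker image, so the sketch only shows what the per-window count means.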


PR #30: diamondblastx_subworkflow, local modules: chunk_fasta_by_busco, unchunk_blastx. Runs a diamond blastx search of assembly contigs against the UniProt reference proteomes. Contigs are split into chunks to allow distribution-based taxrules. Contigs over 1 Mb are subsampled by retaining only the most BUSCO-dense 100 kb region from each chunk.

This subworkflow requires a BUSCO table from the busco_subworkflow described above.
Local modules require the Docker image containing the blobtoolkit scripts. It would be a good idea to test each of them separately before testing the subworkflow.
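
The idea of keeping only the most BUSCO-dense 100 kb region of a chunk can be sketched as follows. This is a hedged illustration rather than the chunk_fasta_by_busco implementation: the function name, the 10 kb sliding step and the rule of keeping the first best window on ties are all assumptions made for the example.

```python
from bisect import bisect_left


def densest_region(gene_starts, chunk_start, chunk_end,
                   region=100_000, step=10_000):
    """Return (start, end) of the region-sized window inside
    [chunk_start, chunk_end) that contains the most BUSCO gene starts.

    The sliding step and the "first best window wins" tie-break are
    assumptions of this sketch. Candidate windows start every `step` bp,
    so a tail shorter than `step` is never used as a window start.
    """
    # A chunk no longer than the region is kept whole.
    if chunk_end - chunk_start <= region:
        return chunk_start, chunk_end
    starts = sorted(gene_starts)
    best_start, best_count = chunk_start, -1
    pos = chunk_start
    while pos + region <= chunk_end:
        # Number of gene starts in the half-open interval [pos, pos + region).
        count = bisect_left(starts, pos + region) - bisect_left(starts, pos)
        if count > best_count:
            best_start, best_count = pos, count
        pos += step
    return best_start, best_start + region


# Made-up gene start positions on a 1 Mb chunk, for illustration only.
example_region = densest_region(
    [5_000, 250_000, 260_000, 270_000, 900_000], 0, 1_000_000
)
print(example_region)
```

Sorting the gene starts once and using binary search keeps each window count cheap, which matters when many chunks are scanned.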


PR #31: blastn_subworkflow, local modules: chunk_fasta_by_busco, get_nohit_list, extract_nohit_fasta, run_blastn, unchunk_blastn. Runs an NCBI blastn search of assembly contigs with no diamond blastx match against the NCBI nt database.

Use seqtk/subseq nf-core module instead of extract_nohit_fasta.nf, check comment in the PR.
Use blast/blastn nf-core module instead of run_blastn.nf, check comment in the PR.
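
Selecting the contigs with no diamond blastx match amounts to a set difference between the assembly's sequence IDs and the query IDs in the blastx results. The sketch below is not the get_nohit_list module. It assumes tab-separated blastx output with the query ID in the first column and that query IDs equal the FASTA sequence IDs, and the function names are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Return sequence IDs (first word of each header line), in file order."""
    return [
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    ]


def nohit_ids(fasta_lines, blastx_lines):
    """Return IDs of sequences that never appear as a blastx query.

    Assumes tab-separated output whose first column is the query ID and
    that query IDs match the FASTA sequence IDs exactly.
    """
    hit = {line.split("\t")[0] for line in blastx_lines if line.strip()}
    return [seq_id for seq_id in fasta_ids(fasta_lines) if seq_id not in hit]


# Made-up records, for illustration only.
example_fasta = [">contig_1 example", "ACGT", ">contig_2", "GGCC", ">contig_3", "TTAA"]
example_blastx = ["contig_2\tprotein_A\t85.0\t120"]
example_nohits = nohit_ids(example_fasta, example_blastx)
print(example_nohits)
```

The resulting ID list is the kind of input a tool such as seqtk subseq takes to extract the matching sequences from the FASTA file.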


4.2 Other


There is another bug related to the BUSCO module: nf-core modules issue #2089. Once the bug is fixed, the module should be updated; see the nf-core documentation on how to update modules. To have a working BUSCO module in the meantime, it was modified locally in the busco_subworkflow branch.


Create and merge PRs for each of the following subworkflows; see the BTK Pipeline v3.0.0 repository and the project proposal:

subworkflow minimap: this can be done using the minimap2_index and minimap2_align nf-core modules.
subworkflow cov_stats: calculates coverage in 1 kb windows using the mosdepth nf-core module.
subworkflows chunk_stats, window_stats, blobtools and view: these can be implemented by converting each Snakemake rule to a Nextflow module, and will require the Docker image containing the blobtools and btk pipeline scripts.
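
As a rough illustration of what "coverage in 1 kb windows" means for cov_stats, the sketch below averages per-base depths over fixed-size windows. In the pipeline this would be done by mosdepth on alignment files. The function name and the in-memory list of depths are assumptions for the example.

```python
def window_means(depths, window=1000):
    """Mean depth per fixed-size window; the last window may be shorter."""
    means = []
    for i in range(0, len(depths), window):
        block = depths[i:i + window]
        means.append(sum(block) / len(block))
    return means


# Made-up per-base depths for a 1.5 kb contig, for illustration only.
example_means = window_means([10] * 1000 + [20] * 500)
print(example_means)
```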


Scaled testing. For diamond blastp and diamond blastx, use the available reduced diamond databases. Use NCBI taxon IDs instead of binomial names in the goat/taxonsearch module.


Write pipeline documentation.


About me

My name is Alexander Ramos. I am interested in bioinformatics and computational biology, and I have experience working with RNA-Seq datasets, genomes and metagenomes. You can find me on GitHub or Twitter.