- Project: Conversion of the BlobToolKit pipeline to Nextflow
- GitHub repository: https://github.com/sanger-tol/blobtoolkit
- Organization: Wellcome Sanger Tree Of Life
- Mentors: Richard Challis, Priyanka Surana, Sujai Kumar
- Project admin: Matthieu Muffato
- Contributor: Alexander Ramos
The BlobToolKit pipeline, implemented with Snakemake, is designed to obtain quality metrics for eukaryotic genome assemblies. BlobToolKit can also detect contamination, i.e. DNA fragments in the genome assembly that belong to a different species. Converting this pipeline to Nextflow will allow: (1) its integration into the Tree of Life programme’s bioinformatics infrastructure, and (2) an open-source nf-core pipeline that can be used by the bioinformatics community.
A module is a Nextflow script that takes a set of inputs, performs a specific task, and can then pass its outputs to another module. Modules are kept in two locations: nf-core ("A community effort to collect a curated set of analysis pipelines built using Nextflow") and locally. nf-core modules are peer-reviewed and must fully follow the nf-core guidelines; dedicated nf-core tooling allows importing them into Nextflow pipelines as dependencies. Local modules, on the other hand, typically live in one pipeline only and may optionally follow the nf-core guidelines. They perform tasks that are specific to the pipeline and would not be used in other pipelines.
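For illustration, a minimal local module might look like the following sketch (the process name, inputs, and script are hypothetical, not taken from the pipeline):

```nextflow
// Hypothetical local module: counts the sequences in a FASTA file.
process COUNT_SEQUENCES {
    tag "$meta.id"
    container 'ubuntu:20.04'

    input:
    tuple val(meta), path(fasta)

    output:
    tuple val(meta), path("*.count.txt"), emit: count

    script:
    """
    grep -c '^>' $fasta > ${meta.id}.count.txt
    """
}
```

The `input` and `output` blocks are what let another module consume this one's results: the emitted channel can be wired directly into the next process.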
Merged PRs in nf-core modules repository:
- PR #1927 entrezdirect/esearch: queries NCBI databases.
- PR #1833 entrezdirect/esummary: fetches the output from a search in an NCBI database using a unique identifier.
- PR #1926 entrezdirect/xtract: transforms the results from a query in an NCBI database to a tabular format.
- PR #1866 goat/taxon_search: Queries metadata for a taxon from the project Genomes on a Tree (GOAT).
- PR #2099 Fixes the bug described in nf-core issue #2012. This module is required in the busco_subworkflow to fetch metadata for a taxon. The fix allows the pipeline to use the nf-core goat/taxonsearch module.
Subworkflows are Nextflow scripts that perform a task that cannot be accomplished by a single module. At the outer level, a subworkflow takes a set of inputs and passes a set of outputs to another subworkflow. Internally, it connects those inputs to modules, whose outputs are connected to other modules, and so forth. Modules are the building blocks of subworkflows, and subworkflows are the building blocks of the main workflow. In this project, each subworkflow was developed in a separate branch of the BlobToolKit repository and a PR was created for each of them.
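As a sketch, a subworkflow that chains two modules could look like this (the module names and channel names are hypothetical, not the pipeline's actual code):

```nextflow
// Hypothetical subworkflow chaining two local modules.
include { MODULE_A } from '../../modules/local/module_a'
include { MODULE_B } from '../../modules/local/module_b'

workflow EXAMPLE_SUBWORKFLOW {
    take:
    fasta      // channel: [ val(meta), path(fasta) ]

    main:
    MODULE_A ( fasta )                    // first step
    MODULE_B ( MODULE_A.out.result )      // consumes MODULE_A's output

    emit:
    result = MODULE_B.out.result          // passed on to the next subworkflow
}
```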
List of merged PRs in blobtoolkit repository:
- PR #27: io, local modules: input_tol. Takes a ToL ID and project name as input. Creates the samplesheet with aligned CRAM files, links to the unmasked (or masked) genome file and its fai index. Additionally, users can provide their own CSV samplesheet containing input genome FASTA file identifiers and their paths.
List of PRs that were not merged in the blobtoolkit repository (descriptions are the same as in the BlobToolKit sub-pipelines):
- PR #28: busco_subworkflow, local modules: goat_taxon_search, extract_busco_genes. Fetches BUSCO lineages for the specified taxon, then runs BUSCO using specific and basal lineages (archaea_odb10, bacteria_odb10, and eukaryota_odb10). Counts BUSCOs in 1kb windows for each contig. Finally it runs a diamond blastp search of BUSCO gene models for basal lineages against the UniProt reference proteomes.
  - Related to the PR described above (PR #2099): remove the local module `goat_taxon_search.nf`, then install the nf-core goat/taxonsearch module, and finally use the splitCsv operator to convert the module `output.tsv` file to a list of BUSCO lineages instead of doing it from the local module.
  - For `EXTRACT_BUSCO_GENES`, generate the list of paths to BUSCO tables (`full_table.tsv` files) for `archaea_odb10`, `bacteria_odb10`, and `eukaryota_odb10`. This should be modified in the script `subworkflows/local/busco_diamond_blastp.nf`. The code used to get the paths to these tables is not working.
  - The `full_table.tsv` for the first lineage in the lineages list is an input for the `diamond_blastx` subworkflow and should be an output channel of this subworkflow as well; see comment in the PR.
  - Make sure that the modules `EXTRACT_BUSCO_GENES` and `DIAMOND_BLASTP` are working.
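The splitCsv conversion mentioned above might be sketched as follows (the output channel and column name are assumptions, not the actual pipeline code):

```nextflow
// Hypothetical: convert the goat/taxonsearch TSV output into a list of BUSCO lineages.
GOAT_TAXONSEARCH.out.taxonsearch
    .splitCsv ( header: true, sep: '\t' )    // one item per TSV row
    .map { row -> row.busco_lineage }        // column name assumed
    .collect ()                              // gather all lineages into one list
    .set { ch_busco_lineages }
```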
- PR #30: diamondblastx_subworkflow, local modules: chunk_fasta_by_busco, unchunk_blastx. Runs a diamond blastx search of assembly contigs against the UniProt reference proteomes. Contigs are split into chunks to allow distribution-based taxrules. Contigs over 1Mb are subsampled by retaining only the most BUSCO-dense 100 kb region from each chunk.
  - This subworkflow requires a BUSCO table from the `busco_subworkflow` described above.
  - The local modules require the Docker image containing the blobtoolkit scripts; it would be a good idea to test each of them separately before testing the subworkflow.
- PR #31: blastn_subworkflow, local modules: chunk_fasta_by_busco, get_nohit_list, extract_nohit_fasta, run_blastn, unchunk_blastn. NCBI blastn search of assembly contigs with no diamond blastx match against the NCBI nt database.
  - Use the seqtk/subseq nf-core module instead of `extract_nohit_fasta.nf`; check comment in the PR.
  - Use the blast/blastn nf-core module instead of `run_blastn.nf`; check comment in the PR.
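The first of these replacements could be sketched roughly like this (the module interface shown is an assumption; check the actual nf-core seqtk/subseq inputs):

```nextflow
// Hypothetical wiring of the nf-core seqtk/subseq module in place of extract_nohit_fasta.
include { SEQTK_SUBSEQ } from '../../modules/nf-core/seqtk/subseq/main'

workflow {
    sequences  = Channel.fromPath ( params.fasta )       // assembly contigs
    nohit_list = Channel.fromPath ( params.nohit_list )  // IDs with no blastx hit
    SEQTK_SUBSEQ ( sequences, nohit_list )               // extracts the no-hit sequences
}
```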
- There is another bug related to the BUSCO module: nf-core modules issue #2089. Once the bug is fixed, the module should be updated; see the nf-core documentation on how to update modules. In order to have a working BUSCO module, it was modified locally in the busco_subworkflow branch.
- Create and merge the following PRs for each subworkflow; see the BTK Pipeline v3.0.0 repository and project proposal:
  - Subworkflow minimap: this can be done using the minimap2_index and minimap2_align nf-core modules.
  - Subworkflow cov_stats: calculate coverage in 1kb windows using the mosdepth nf-core module.
  - Subworkflows chunk_stats, window_stats, blobtools, and view: these can be implemented by converting each Snakemake rule to a Nextflow module; they will require the Docker image containing the blobtools and btk pipeline scripts.
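For the minimap subworkflow, a possible starting point is sketched below (the module signatures are assumptions and may differ from the current nf-core versions):

```nextflow
// Hypothetical minimap subworkflow built from the two nf-core modules.
include { MINIMAP2_INDEX } from '../../modules/nf-core/minimap2/index/main'
include { MINIMAP2_ALIGN } from '../../modules/nf-core/minimap2/align/main'

workflow MINIMAP {
    take:
    fasta   // channel: [ val(meta), path(fasta) ]
    reads   // channel: [ val(meta), path(reads) ]

    main:
    MINIMAP2_INDEX ( fasta )
    // bam_format = true; the remaining flags are assumptions.
    MINIMAP2_ALIGN ( reads, fasta, true, false, false )

    emit:
    bam = MINIMAP2_ALIGN.out.bam    // aligned reads for coverage calculation
}
```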
- Scaled testing. For `diamond blastp` and `diamond blastx`, use the available reduced diamond databases. Use NCBI taxon IDs in the `goat/taxonsearch` module instead of binomial names.
- Write pipeline documentation.
My name is Alexander Ramos and I am interested in bioinformatics and computational biology. You can find me on GitHub or Twitter. I have experience working with RNA-Seq datasets, genomes, and metagenomes.