@evanroyrees · Created July 7, 2021
Autometa dev walkthrough

Autometa pipeline walkthrough using Nextflow

NOTE: These instructions are for working from the dev branch of KwanLab/Autometa

Overview

  1. Install the Autometa environment and commands
  2. Configure Nextflow so Autometa commands can be run through your scheduler
  3. Configure run parameters (set the metagenome filepath and output directories)
  4. Run the Autometa pipeline using Nextflow

Running Autometa using Nextflow and Docker

cd $HOME
git clone --branch dev https://github.com/KwanLab/Autometa
cd Autometa
# NOTE: For a list of all available make options just type `make` with no arguments
# Build Autometa image (requires docker)
# This will create the docker image --> jason-c-kwan/autometa:dev
make image
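
After the build finishes, you can confirm the tagged image exists with the standard Docker CLI (a quick sanity check; an empty table means the build failed):

# List the image built by `make image`
docker image ls jason-c-kwan/autometa:dev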

Running Autometa using Nextflow without Docker

cd $HOME
git clone --branch dev https://github.com/KwanLab/Autometa
cd Autometa
# NOTE: For a list of all available make options just type `make` with no arguments
# Create conda env for Autometa (will create a conda env named autometa)
make create_environment
# Activate the Autometa conda environment
conda activate autometa
# Install Autometa commands within the environment
make install
# hmmpress markers for single-copy marker gene guided binning
DB_DIR="$HOME/Autometa/autometa/databases"
hmmpress -f "${DB_DIR}/markers/bacteria.single_copy.hmm" \
  && hmmpress -f "${DB_DIR}/markers/archaea.single_copy.hmm" \
  && autometa-config --section databases --option base --value "${DB_DIR}" \
  && echo "databases base directory set to ${DB_DIR}/"

NOTE: After make install you will have access to all of the autometa commands. For more information on these commands, see the step-by-step tutorial in the documentation.
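
As a quick check that the entry points installed correctly, you can call one of them; autometa-config is used above, so it should now be on your PATH (assuming the standard --help flag):

# If this prints a usage message, the install worked.
autometa-config --help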

Configure nextflow with your 'executor'

For Nextflow to run the Autometa pipeline through a job scheduler (e.g. SLURM), you will need to update the respective 'profile' section in Nextflow's config file. Each profile may be configured with any available scheduler, as noted in the Nextflow executors docs. By default, Nextflow uses your local computer as the executor. The next section briefly walks through configuring Nextflow to run with the SLURM job scheduler.

SLURM

NOTE: Run sinfo to see which SLURM partitions are available (ours are depicted below).

[Image: SLURM partition information output from the sinfo command]

You will need to edit $HOME/Autometa/nextflow.config:

// Find this section of code in nextflow.config
  }
  slurm {
    process.executor = "slurm"
    // queue is the SLURM partition to use; set it with the queue directive.
    process.queue = "queue" // <<-- change this to whatever your partition is called
    // See https://www.nextflow.io/docs/latest/executor.html#slurm for more details.
  }

More parameters available for the SLURM executor are listed in the Nextflow executor docs for SLURM.
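
For example, a profile can also set per-process resources or pass extra scheduler flags; the directives below are standard Nextflow process settings, but the values are illustrative placeholders to adapt:

  slurm {
    process.executor = "slurm"
    process.queue = "queue" // your partition name
    // Optional, illustrative values: cap time/memory per process.
    process.time = "24h"
    process.memory = "16 GB"
    // Extra sbatch flags can be passed verbatim via clusterOptions.
    process.clusterOptions = "--nice=100"
  }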

Set Autometa parameters

You can use/alter the default template parameters config file here: $HOME/Autometa/nextflow/parameters.config

Data Inputs

NOTE: Data inputs must be wrapped in 'single quotes' or "double quotes"

Example directory structure for results

data="$HOME/autometa_results"
mkdir -p "${data}/raw"
cp path/to/your/final.contigs.fa "${data}/raw/."
params.metagenome = "$HOME/autometa_results/raw/final.contigs.fa" // <<-- Path to your metagenome
params.interim = "$HOME/autometa_results/interim" // <<-- Directory where interim results will be stored (created if it does not exist)
params.processed = "$HOME/autometa_results/processed" // <<-- Directory where final results will be stored (created if it does not exist)
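
With the commands and parameters above, the results tree would look roughly like this (interim/ and processed/ are created by the pipeline):

autometa_results/
├── raw/
│   └── final.contigs.fa
├── interim/      # intermediate results
└── processed/    # final results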

Database directory setup

The database directory must contain the following:

  • diamond formatted nr file => nr.dmnd

    • Perform the following:
    # Download nr.gz
    wget -O $HOME/Autometa/autometa/databases/ncbi/nr.gz ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
    # Set the number of threads you have available:
    num_threads=4
    # Format with diamond
    diamond makedb --in $HOME/Autometa/autometa/databases/ncbi/nr.gz --db $HOME/Autometa/autometa/databases/ncbi/nr -p $num_threads
  • Files extracted from the taxdump.tar.gz tarball

    wget -O $HOME/Autometa/autometa/databases/ncbi/taxdump.tar.gz ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
    cd $HOME/Autometa/autometa/databases/ncbi/
    tar -xvzf taxdump.tar.gz
    cd -
  • prot.accession2taxid.gz

    wget -O $HOME/Autometa/autometa/databases/ncbi/prot.accession2taxid.gz ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
params.ncbi_database = "$HOME/Autometa/autometa/databases/ncbi"  // <<-- Update this path to folder with all NCBI databases (You will NOT need to update this if you followed the downloads from above)

You may also find links to the above database files in the Autometa databases documentation.
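
Before launching a run, it can be worth verifying the expected files are in place. A minimal sketch, assuming the download/extract steps above (names.dmp and nodes.dmp are extracted from taxdump.tar.gz):

NCBI="$HOME/Autometa/autometa/databases/ncbi"
# Check a core subset of the required files; print any that are missing.
for f in nr.dmnd names.dmp nodes.dmp prot.accession2taxid.gz; do
    [ -f "${NCBI}/${f}" ] || echo "MISSING: ${NCBI}/${f}"
done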

Runtime Parameters

params.cpus = 2  // <<-- Number of CPUs each job uses

Autometa Parameters (Located in $HOME/Autometa/nextflow/parameters.config)

// Metagenome Length filtering
params.length_cutoff = 3000 // <<-- Smallest contig you want binned (3000 is default)
// Kmer counting/normalization/embedding
params.kmer_size = 5
params.kmer_norm_method = "am_clr" // choices: "am_clr" (default), "clr", "ilr"
params.kmer_pca_dimensions = 50
params.kmer_embed_method = "bhsne" // choices: "sksne", "bhsne" (default), "umap"
params.kmer_embed_dimensions = 2
// Binning parameters
params.kingdom = "bacteria" // choices: "bacteria", "archaea"
params.classification_kmer_pca_dimensions = 50
params.clustering_method = "dbscan" // choices: "dbscan", "hdbscan"
params.binning_starting_rank = "superkingdom" // choices: "superkingdom", "phylum", "class", "order", "family", "genus", "species"
params.classification_method = "decision_tree" // choices: "decision_tree", "random_forest"
params.completeness = 20.0 // Will keep clusters over 20% complete
params.purity = 95.0 // Will keep clusters over 95% pure
params.cov_stddev_limit = 25.0 // Will keep clusters with coverage std. dev. below 25%
params.gc_stddev_limit = 5.0 // Will keep clusters with GC% std. dev. below 5%

Final run command

NOTE: This is run from within the Autometa directory. If you would like to run the workflow outside of the Autometa directory, you will need to supply an additional configuration argument to Nextflow that holds your executor configuration.

Inside of the Autometa directory

NOTE: Nextflow will find the nextflow.config file in the current directory, so the executor configuration is available by default.

# main.nf holds the main logic of the autometa workflow.
# -profile slurm is only needed if you configured SLURM or another executor profile.
# -c supplies the parameters configuration.
# -w is the working directory where nextflow intermediate/tmp dirs and files are written.
nextflow run $HOME/Autometa/main.nf \
    -profile slurm \
    -c $HOME/Autometa/nextflow/parameters.config \
    -w $HOME/autometa_results/work
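
If a run is interrupted, it can be restarted from Nextflow's cached results by adding the standard -resume flag (generic Nextflow behavior, not Autometa-specific):

nextflow run $HOME/Autometa/main.nf \
    -profile slurm \
    -c $HOME/Autometa/nextflow/parameters.config \
    -w $HOME/autometa_results/work \
    -resume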

Outside of the Autometa directory

# The first -c supplies the executor configuration, the second -c your parameters configuration.
# -w is the working directory where nextflow intermediate/tmp dirs and files are written.
# Available profiles mentioned above are slurm, chtc and standard (default).
nextflow run $HOME/Autometa/main.nf \
    -c $HOME/Autometa/nextflow.config \
    -c </path/to/your/parameters.config> \
    -w </path/to/nextflow/work/directory> \
    -profile <profile to use>