GDKO/CGP-Pipeline.md

## CGP-Pipeline.md

      
    Raw
  

              CGP-Pipeline.md
            
          
    Preface

This is my recommended pipeline for assembly and annotation of small eukaryotic genomes (50 - 500 Mb).
All small scripts are available at CGP-scripts. For the programs a link is provided.
Please cite if you found the pipeline useful!

Contents


1. Raw Data QC

1.1 Set the initial directory structure
1.2 Raw data QC
1.3 Trimming
1.4 Count kmers
1.5 Error-Correct the Reads


2. Preliminary Assembly QC

2.1 Single-end (SE) Assembly

2.1.1 Using CLC
2.1.2 Using Velvet and Bowtie2


2.2 Plot the insert size distribution
2.3 Blobology

2.3.1 Search NCBI nt with blast
2.3.2 Search uniref with diamond
2.3.3 Make the TAGC plot


2.4 Remove contamination

2.4.1 Contaminants have a reference assembly in NCBI
2.4.2 Contaminants do not have a reference assembly in NCBI


3. Assembly
4. Evaluate Assembly
5. Annotation

5.1 Mask the repeats with RepeatModeler and RepeatMasker
5.2 Annotation Pipeline if we have RNASeq data

5.2.1 Align RNASeq with STAR
5.2.2 Run BRAKER
5.2.3 Add UTR info to the GFF3 file


5.3 Annotation Pipeline if we do not have RNASeq data

5.3.1 Create profile with BUSCO or CEGMA 
5.3.2 Create GeneMark hmm file
5.3.3 Run MAKER2
5.3.4 Run Augustus


PART 1: Raw Data QC

We assume all the files are at raw_data and that files from the same insert size library are concatenated.
1.1 Set the initial directory structure

pwd will refer to top directory [top_dir]
mkdir dataqc trimmed kmer_count blessed blobplot e_coli cleaned preassembly
1.2 Raw data QC

We will use FastQC
fastqc \
--nogroup \ # Do not group the positions of the read
-t 2 \      # No. of threads, change to be equal to no. of files
[top_dir]/raw_data/[files] # Eg. *sanfastq.gz

Inspect visually the produced output
1.3 Trimming

We will use skewer
cd [top_dir]/trimmed

skewer \
-n \         # Remove reads with a lot of N’s
-q 20 \      # Trim reads until Q20 is reached
-l 51 \      # Minimum read length allowed
-t 16 \      # No. of threads
-m pe \      # Mode Paired-End
-o prefix \      # Prefix of output files
[top_dir]/raw_data/[pe1] [top_dir]/raw_data/[pe2]  # One pair at a time

1.4 Count kmers

We will use kmc
cd [top_dir]/kmer_count

readlink -f [top_dir]/trimmed/*.fastq > files.lst  # We want all files

kmc \
-k19 \       # Kmer size (19)
-fq \        # Files are in fastq
-m100 \      # Memory to use (100G)
-t16 \       # No. of threads (16)
-cs500 \     # Maximum value of counter (500)
@files.lst \ # Files list
k19PE \      # Output prefix
./          # Output directory (current)

kmc_tools histogram \
k19PE \   # Input file prefix
-ci3 \    # Exclude kmers less than 3
histo.txt # output file

Generate a plot with kmer_histo.R
Rscript kmer_histo.R histo.txt

Inspect visually the produced output (if as the one below then we are okay!)

1.5 Error-Correct the Reads

We use BLESS

Kmer of size 21 should be fine for these genomes. We don’t want a small kmer, because it will correct erroneously true kmers, nor a large kmer because it will correct less false kmers. Based on the program specifications the following formula must be true U/(4^k) < 10^-4 where U=No. of unique kmers, and k=kmer size. When running the program it will display the number of unique kmers so this can be tested.

cd [top_dir]/blessed

bless \
-kmerlength 21 \ # Use Kmer size of 21
-smpthread 16 \  # No. of threads
-max_mem 100 \   # Max memory allowed (100G)
-prefix [prefix] \ # Output prefix (e.g blessed)
-read1 [top_dir]/trimmed/[trimmed-pe1.fq] \
-read2 [top_dir]/trimmed/[trimmed-pe2.fq]

We can rerun the kmer_counting step just to check. Normally, it should be similar to the one before.
PART 2: Preliminary Assembly QC

2.1 Single-end (SE) Assembly

2.1.1 Using CLC

Assemble SE
cd [top_dir]/blobplot

clc_assembler \
-v \                      # Verbose
-o clc_se_contigs.fasta \ # Output file
-q [top_dir]/blessed/*fastq      # We are using the libraries as SE

Map the reads back to get coverage
clc_mapper \
-o [lib.cas] \ # Output cas file
-d [clc_se_contigs.fasta] # The clc assembly
-l 0.9 \       # at least 90% of the read should map
-s 0.9 \       # with at least 90% identity
-q -i [top_dir]/blessed/[pe1.fastq] \
[top_dir]/blessed/[pe2.fastq]  # Each lib alone

# For clc version 5 and later
clc_mapper \
-o [lib.cas] \ # Output cas file
-d [clc_se_contigs.fasta] # The clc assembly
-l 0.9 \       # at least 90% of the read should map
-s 0.9 \       # with at least 90% identity
-q -p fb ss 50 500 -i [top_dir]/blessed/[pe1.fastq] \
[top_dir]/blessed/[pe2.fastq]  # Each lib alone    

Create the insert size distribution file with clc_insert.pl
clc_insert.pl \
-c [lib.cas] \   # The cas file
-i \             # Output the insert sizes
-o [lib] \       # Output prefix
-lib [lib]       # Lib name

2.1.2 Using Velvet and Bowtie2

Assemble SE
cd [top_dir]/blobplot

velveth \ 
AssemSE61 \                            # Output Directory
61 \                                   # Kmer Size
-short -fastq [top_dir]/blessed/*.fastq \     # Libraries as SE

velvetg AssemSE61

Map Reads back with bowtie2
cd AssemSE61

bowtie2-build contigs.fa contigs

bowtie2 \
-x contigs \                    # Reference
-p 16 \                         # No. of threads
-S [lib.sam] \                  # Sam file
-1 [top_dir]/blessed/[pe1.fastq] \  # Each library as pairs
-2 [top_dir]/blessed/[pe2.fastq]

Create the insert size distribution with sam_insert.pl
sam_insert.pl \
-s [lib.sam] \    # The sam file
-f [contigs.fa] \ # The assembly file
-i \              # Output the insert sizes
-o [lib] \        # Output prefix
-l [lib]          # Lib name

2.2 Plot the insert size distribution

We can now plot the insert size distribution with plot_insert_freq_txt_binned.R
plot_insert_freq_txt_binned.R [lib.insert.freq.txt]  # For clc

plot_insert_freq_txt_binned.R [lib.pairing.hist.txt] # For velvet

For example a 600 bp insert size library


If the insert size is less than 2 times the read length, then we can merge the reads with usearch -fastq_mergepairs

2.3 Blobology

2.3.1 Search NCBI nt with blast

blastn \
-task megablast \
-query [assembly_se.fasta] \
-db nt \
-culling_limit 2 \
-out [assembly_se_nt.blastn] \
-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
-num_threads 16 \
-evalue 1e-25 

2.3.2 Search uniref with diamond

Make the uniref database searchable by diamond (required once)
diamond makedb \
--threads 16 \
--in uniref90.fasta \
--db uniref90.0.7.9 \
--block-size 10

Search the database
diamond blastx \
--max-target-seqs 25 \
--verbose \
-c 1 \
--threads 16 \
-q [assembly_se.fasta] \
-d uniref90.0.7.9 \
-a assembly_se_uniref

Convert the output to TAGC readable with daa_to_tagc.pl, uniref100.taxlist from here. Requires diamond in PATH
daa_to_tagc.pl uniref100.taxlist [assembly_se_uniref.daa]

2.3.3 Make the TAGC plot

Create the BlobDB JSON file
blobtools create \
-i [assembly_se.fasta] \
--cas [lib1.cas] --cas [lib2.cas] \ # if clc, or -y velvet if velvet
-t [assembly_se_uniref.daa.tagc] \
-t [assembly_se_nt.blastn] \
--nodes [path_to/nodes.dmb] \       # nodes.dmp (required once)
--names [path_to/names.dmb]         # names.dmp (required once)

Generate a table
blobtools view -i [BlobDB] --hits --rank all > [BlobDB.table]

Create the plot
blobtools blobplot -i [BlobDB]

We can now inspect the image

2.4 Remove contamination

2.4.1 Contaminants have a reference assembly in NCBI

Most likely we will have E.coli contamination, and we can use the already sequenced genomes to remove the contaminated reads.
cd [top_dir]/e_coli

Download the fasta file from the E_coli_strains_folder in github
Build the database with bowtie2
bowtie2-build E_coli_strains.fna E_coli_strains

Map reads to the database
bowtie2 \
-x E_coli_strains \
-1 [top_dir]/blessed/[pe1.fastq] \
-2 [top_dir]/blessed/[pe2.fastq] \
-p 32 \
-S [top_dir]/cleaned/[Lib.sam]

Retain unmapped reads with keep_unmapped_reads.pl
cd [top_dir]/cleaned

keep_unmapped_reads.pl [Lib.sam] >[clean_1.fq] 2>[clean_2.fq]

2.4.2 Contaminants do not have a reference assembly in NCBI

To get a complete list of contigs that are contaminants we need to perform an assembly with PE information. I recommend running SPAdes because it uses multiple kmers to assemble
cd [top_dir]
mkdir spades_cr
cd spades_cr

Assemble with SPAdes
# For reads of 125 bp these kmer sizes are ok
spades.py \
--only-assembler \
--careful \
-k 21,33,55,77,99 \
-m 450 \
-t 16 \
-o [output_dir] \
--pe1-1 [top_dir]/blessed/[pe1.fastq] \
--pe1-2 [top_dir]/blessed/[pe2.fastq]

Rerun TAGC blob plot pipeline of the new assembly [spades_scaffolds.fna] and use the coverage from the SPAdes assembly -y spades
We use the txt file output of blobtools view to identify the contigs which are contaminants. For example if we have a Firmicute we run
grep "tax=Firmicutes" [BlobDB.table] | \
grep -v Nematoda | grep -v Arthropoda | grep -v Chordata | \
cut -f 1 > Firmicutes.list

We map reads back to the assembly with bowtie2 (similar as before)
We extract the reads that do not map to contaminants with groc_sam.pl
groc_sam.pl Firmicutes.list [Lib.sam] # Reads will be in read_?.fq files

Now we have the clean reads, and we can do one more round of TAGC plotting to check!

Most likely most of the contaminants will be gone, although a few will remain. We can remove them later down the line based on coverage and gc.

PART 3: Assembly

3.1 Pre-Assembly QC

3.1.1 Estimating optimal kmer-size with Kmergenie

We can estimate the optimal kmer-size for single kmer assemblers (i.e Velvet) and the expected genome size
# Add all the read file paths to a single file called read_files.list
kmergenie read_files.list


Check the kmergenie website for potential outputs and their explanation

3.1.2 Creating a report with Preqc

For this we need SGA, and the precomputed preqc files from CGP-scripts
sga preprocess --pe-mode 1 [clean_raw_reads_1.fq] [clean_raw_reads_2.fq] > [species.fastq]
sga index -a ropebwt --no-reverse -t 8 [species.fastq]
sga preqc -t 8 [species.fastq] > [species.preqc]
sga-preqc-report.py [species.preqc] [preqc_files/*.preqc]


Check the paper for detailed explanations of the plots in the report

3.2 Assembly software

There are multiple assemblers that can work with short-read data, and depending on the input files we may need to run a different assembler, but the following two will mostly work in every case.
3.2.1 Velvet

We can use the kmer_size from kmergenie, or try several ones.
velveth [output_dir] [kmer_size] -fastq -shortPaired -separate \
[clean_raw_reads_1.fq] [clean_raw_reads_2.fq]

velvetg [output_dir] -exp_cov auto -cov_cutoff auto

3.2.2 SPAdes

# For reads of 125 bp these kmer sizes are ok
spades.py \
--only-assembler \
--careful \
-k 21,33,55,77,99 \
-m 450 \
-t 16 \
-o [output_dir] \
--pe1-1 [clean_raw_reads_1.fq] \
--pe1-2 [clean_raw_reads_2.fq]

3.3 Scaffold with RNASeq data

We can scaffold the genome assembly using a transcriptome assembly with SCUBAT2
# Blast Transcriptome to the assembly with BLAST
blastn -query [transcriptome] -db [assembly] -num_threads 8 -outfmt 5 -evalue 1e-25 -out [blastn.xml]

# Run SCUBAT, with max intron size allowed of 20000
SCUBAT_v2.py -b [blastn.xml] -f [assembly] -max 20000

PART 4: Evaluate Assembly

We can evaluate our assemblies using the following metrics.
Calculate the likelihood of the assembly with CGAL
Identify the number of core eukaryotic genes present with CEGMA or BUSCO
Check assembly contiguity with REAPR
Search for the mitochondrion scaffold(s) in our assembly with BLAST
PART 5: Annotation

Rename the assembly scaffolds to something sensible with assembly.rename.sh (requires fastaqual_select.pl in path)
assembly.rename.sh [assembly] [scaffold_name] # eg nCe.2.0

5.1 Mask the repeats with RepeatModeler and RepeatMasker

mkdir RepeatModeler
cd RepeatModeler

#Build the database
BuildDatabase -name [species_name] -engine ncbi ../[renamed_assembly]

# De novo model the repeats
RepeatModeler -engine ncbi -pa 16 -database [species_name]

# Extract repeats from Rhabditida
# queryRepeatDatabase.pl inside RepeatMasker/utils
queryRepeatDatabase.pl -species rhabditida | grep -v "Species:" > Rhabditida.repeatmasker

# Concatenate the 2 files
cat RM*/consensi.fa.classified Rhabditida.repeatmasker > all.repeats

# Mask the assembly
RepeatMasker -lib all.repeats -pa 16 ../[renamed_assembly]

5.2 Annotation Pipeline if we have RNASeq data

5.2.1 Align RNASeq with STAR

# Generate genome index
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir [genome_dir] \
--genomeFastaFiles [assembly.masked]

# Map the reads
STAR --runThreadN 8 --genomeDir [genome_dir] --outSAMtype BAM Unsorted \
--twopassMode Basic --readFilesCommand zcat \
--readFilesIn [RNASeq_1.sanfastq.gz] [RNASeq_2.sanfastq.gz] 

5.2.2 Run BRAKER

braker.pl \
--workingdir=[output_dir] \
--species=[species_name] \
--cores=8 \
--genome=[assembly.masked] \
--bam=Aligned.out.bam

5.2.3 Add UTR info to the GFF3 file

Section under testing
5.3 Annotation Pipeline if we do not have RNASeq data

5.3.1 Create profile with BUSCO or CEGMA

Use BUSCO to create Augustus profile (new) or CEGMA to create SNAP profile.
5.3.1.1 Train Augustus with BUSCO

Run BUSCO
python run_BUSCO.py -i [assembly.masked] -o [busco_euk] -l [eukaryota_odb9 dir] -m genome -c 16 -sp caenorhabditis --long

Rename the parameter files
# Files will be in run_busco_euk/augustus_output/retraining_parameters
# Rename the folder to [species_name] 
mv retraining_parameters [species_name]
# Rename the files to [species_name] from [Busco_name] (e.g just before _metapars)
cd [species_name]
for a in * ; do rename 's/[Busco_name]/[species_name]/' $a ; done
for a in * ; do sed -i 's/[Busco_name]/[species_name]/' $a ; done
# Move the folder to augustus config/species

5.3.1.2 Train SNAP with CEGMA

Run CEGMA
cegma -g [assembly.masked] -T 8

Create the SNAP hmm file
cegma2zff [output.cegma.gff] [assembly.masked] # (from MAKER software)
fathom genome.ann genome.dna -categorize 1000
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl [Cv] . > [Cv.cegmasnap.hmm]

5.3.2 Create GeneMark hmm file

Run GeneMark, the output will be named gmhmm.mod.
gmes_petap.pl --ES --max_intron 20000 --cores 8 --sequence [assembly.masked]

5.3.3 Run MAKER2

Create the config files
maker -CTL

Edit maker_opts.ctl and the change the following
genome=[assembly.masked]
protein=[swissprot.fasta]
model_org=                      # Change value to blank
repeat_protein=                 # Change value to blank
snaphmm=[Cv.cegmasnap.hmm]      # If step 5.3.1.2
gmhmm=[genemark.mod]
augustus_species=[species_name] # If step 5.3.1.1
est2genome=1
protein2genome=1
pred_stats=1
keep_preds=1
single_exon=1

Run MAKER2 and Extract the gff3 fasta files
mpiexec -n 16 maker   # No. of threads

gff3_merge -d [datastore_index.log]

5.3.4 Run Augustus

Create the training gff for Augustus (maker2zff from MAKER, zff2gff3.pl from SNAP)
maker2zff -c 0 -e 0 [maker.gff3] 

zff2gff3.pl genome.ann | perl -plne 's/\t(\S+)$/\t\.\t$1/' >augustus.train.gff3

Run Augustus
autoAug.pl \
--species=[species_name] \
--genome=[assembly.masked] \
--trainingset=augustus.train.gff3 \
--useexisting -v --singleCPU \
--optrounds=3 &>augustus.out.txt