This is my recommended pipeline for assembly and annotation of small eukaryotic genomes (50 - 500 Mb).
All small scripts are available at CGP-scripts. For the programs a link is provided.
Please cite if you found the pipeline useful!
- 1. Raw Data QC
- 2. Preliminary Assembly QC
- 3. Assembly
- 4. Evaluate Assembly
- 5. Annotation
We assume all the files are at raw_data
and that files from the same insert size library are concatenated.
pwd
will refer to top directory [top_dir]
mkdir dataqc trimmed kmer_count blessed blobplot e_coli cleaned preassembly
We will use FastQC
fastqc \
--nogroup \ # Do not group the positions of the read
-t 2 \ # No. of threads, change to be equal to no. of files
[top_dir]/raw_data/[files] # Eg. *sanfastq.gz
Inspect visually the produced output
We will use skewer
cd [top_dir]/trimmed
skewer \
-n \ # Remove reads with a lot of N’s
-q 20 \ # Trim reads until Q20 is reached
-l 51 \ # Minimum read length allowed
-t 16 \ # No. of threads
-m pe \ # Mode Paired-End
-o prefix \ # Prefix of output files
[top_dir]/raw_data/[pe1] [top_dir]/raw_data/[pe2] # One pair at a time
We will use kmc
cd [top_dir]/kmer_count
readlink -f [top_dir]/trimmed/*.fastq > files.lst # We want all files
kmc \
-k19 \ # Kmer size (19)
-fq \ # Files are in fastq
-m100 \ # Memory to use (100G)
-t16 \ # No. of threads (16)
-cs500 \ # Maximum value of counter (500)
@files.lst \ # Files list
k19PE \ # Output prefix
./ # Output directory (current)
kmc_tools histogram \
k19PE \ # Input file prefix
-ci3 \ # Exclude kmers less than 3
histo.txt # output file
Generate a plot with kmer_histo.R
Rscript kmer_histo.R histo.txt
Inspect visually the produced output (if as the one below then we are okay!)
We use BLESS
Kmer of size 21 should be fine for these genomes. We don’t want a small kmer, because it will correct erroneously true kmers, nor a large kmer because it will correct less false kmers. Based on the program specifications the following formula must be true U/(4^k) < 10^-4 where U=No. of unique kmers, and k=kmer size. When running the program it will display the number of unique kmers so this can be tested.
cd [top_dir]/blessed
bless \
-kmerlength 21 \ # Use Kmer size of 21
-smpthread 16 \ # No. of threads
-max_mem 100 \ # Max memory allowed (100G)
-prefix [prefix] \ # Output prefix (e.g blessed)
-read1 [top_dir]/trimmed/[trimmed-pe1.fq] \
-read2 [top_dir]/trimmed/[trimmed-pe2.fq]
We can rerun the kmer_counting step just to check. Normally, it should be similar to the one before.
2.1.1 Using CLC
Assemble SE
cd [top_dir]/blobplot
clc_assembler \
-v \ # Verbose
-o clc_se_contigs.fasta \ # Output file
-q [top_dir]/blessed/*fastq # We are using the libraries as SE
Map the reads back to get coverage
clc_mapper \
-o [lib.cas] \ # Output cas file
-d [clc_se_contigs.fasta] # The clc assembly
-l 0.9 \ # at least 90% of the read should map
-s 0.9 \ # with at least 90% identity
-q -i [top_dir]/blessed/[pe1.fastq] \
[top_dir]/blessed/[pe2.fastq] # Each lib alone
# For clc version 5 and later
clc_mapper \
-o [lib.cas] \ # Output cas file
-d [clc_se_contigs.fasta] # The clc assembly
-l 0.9 \ # at least 90% of the read should map
-s 0.9 \ # with at least 90% identity
-q -p fb ss 50 500 -i [top_dir]/blessed/[pe1.fastq] \
[top_dir]/blessed/[pe2.fastq] # Each lib alone
Create the insert size distribution file with clc_insert.pl
clc_insert.pl \
-c [lib.cas] \ # The cas file
-i \ # Output the insert sizes
-o [lib] \ # Output prefix
-lib [lib] # Lib name
Assemble SE
cd [top_dir]/blobplot
velveth \
AssemSE61 \ # Output Directory
61 \ # Kmer Size
-short -fastq [top_dir]/blessed/*.fastq \ # Libraries as SE
velvetg AssemSE61
Map Reads back with bowtie2
cd AssemSE61
bowtie2-build contigs.fa contigs
bowtie2 \
-x contigs \ # Reference
-p 16 \ # No. of threads
-S [lib.sam] \ # Sam file
-1 [top_dir]/blessed/[pe1.fastq] \ # Each library as pairs
-2 [top_dir]/blessed/[pe2.fastq]
Create the insert size distribution with sam_insert.pl
sam_insert.pl \
-s [lib.sam] \ # The sam file
-f [contigs.fa] \ # The assembly file
-i \ # Output the insert sizes
-o [lib] \ # Output prefix
-l [lib] # Lib name
We can now plot the insert size distribution with plot_insert_freq_txt_binned.R
plot_insert_freq_txt_binned.R [lib.insert.freq.txt] # For clc
plot_insert_freq_txt_binned.R [lib.pairing.hist.txt] # For velvet
For example a 600 bp insert size library
If the insert size is less than 2 times the read length, then we can merge the reads with usearch -fastq_mergepairs
2.3 Blobology
blastn \
-task megablast \
-query [assembly_se.fasta] \
-db nt \
-culling_limit 2 \
-out [assembly_se_nt.blastn] \
-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
-num_threads 16 \
-evalue 1e-25
2.3.2 Search uniref with diamond
Make the uniref database searchable by diamond (required once)
diamond makedb \
--threads 16 \
--in uniref90.fasta \
--db uniref90.0.7.9 \
--block-size 10
Search the database
diamond blastx \
--max-target-seqs 25 \
--verbose \
-c 1 \
--threads 16 \
-q [assembly_se.fasta] \
-d uniref90.0.7.9 \
-a assembly_se_uniref
Convert the output to TAGC readable with daa_to_tagc.pl, uniref100.taxlist from here. Requires diamond in PATH
daa_to_tagc.pl uniref100.taxlist [assembly_se_uniref.daa]
Create the BlobDB JSON file
blobtools create \
-i [assembly_se.fasta] \
--cas [lib1.cas] --cas [lib2.cas] \ # if clc, or -y velvet if velvet
-t [assembly_se_uniref.daa.tagc] \
-t [assembly_se_nt.blastn] \
--nodes [path_to/nodes.dmb] \ # nodes.dmp (required once)
--names [path_to/names.dmb] # names.dmp (required once)
Generate a table
blobtools view -i [BlobDB] --hits --rank all > [BlobDB.table]
Create the plot
blobtools blobplot -i [BlobDB]
Most likely we will have E.coli contamination, and we can use the already sequenced genomes to remove the contaminated reads.
cd [top_dir]/e_coli
Download the fasta file from the E_coli_strains_folder in github
Build the database with bowtie2
bowtie2-build E_coli_strains.fna E_coli_strains
Map reads to the database
bowtie2 \
-x E_coli_strains \
-1 [top_dir]/blessed/[pe1.fastq] \
-2 [top_dir]/blessed/[pe2.fastq] \
-p 32 \
-S [top_dir]/cleaned/[Lib.sam]
Retain unmapped reads with keep_unmapped_reads.pl
cd [top_dir]/cleaned
keep_unmapped_reads.pl [Lib.sam] >[clean_1.fq] 2>[clean_2.fq]
To get a complete list of contigs that are contaminants we need to perform an assembly with PE information. I recommend running SPAdes because it uses multiple kmers to assemble
cd [top_dir]
mkdir spades_cr
cd spades_cr
Assemble with SPAdes
# For reads of 125 bp these kmer sizes are ok
spades.py \
--only-assembler \
--careful \
-k 21,33,55,77,99 \
-m 450 \
-t 16 \
-o [output_dir] \
--pe1-1 [top_dir]/blessed/[pe1.fastq] \
--pe1-2 [top_dir]/blessed/[pe2.fastq]
Rerun TAGC blob plot pipeline of the new assembly [spades_scaffolds.fna] and use the coverage from the SPAdes assembly -y spades
We use the txt file output of blobtools view
to identify the contigs which are contaminants. For example if we have a Firmicute we run
grep "tax=Firmicutes" [BlobDB.table] | \
grep -v Nematoda | grep -v Arthropoda | grep -v Chordata | \
cut -f 1 > Firmicutes.list
We map reads back to the assembly with bowtie2 (similar as before)
We extract the reads that do not map to contaminants with groc_sam.pl
groc_sam.pl Firmicutes.list [Lib.sam] # Reads will be in read_?.fq files
Now we have the clean reads, and we can do one more round of TAGC plotting to check!
Most likely most of the contaminants will be gone, although a few will remain. We can remove them later down the line based on coverage and gc.
3.1.1 Estimating optimal kmer-size with Kmergenie
We can estimate the optimal kmer-size for single kmer assemblers (i.e Velvet) and the expected genome size
# Add all the read file paths to a single file called read_files.list
kmergenie read_files.list
Check the kmergenie website for potential outputs and their explanation
3.1.2 Creating a report with Preqc
For this we need SGA, and the precomputed preqc files from CGP-scripts
sga preprocess --pe-mode 1 [clean_raw_reads_1.fq] [clean_raw_reads_2.fq] > [species.fastq]
sga index -a ropebwt --no-reverse -t 8 [species.fastq]
sga preqc -t 8 [species.fastq] > [species.preqc]
sga-preqc-report.py [species.preqc] [preqc_files/*.preqc]
Check the paper for detailed explanations of the plots in the report
There are multiple assemblers that can work with short-read data, and depending on the input files we may need to run a different assembler, but the following two will mostly work in every case.
3.2.1 Velvet
We can use the kmer_size from kmergenie, or try several ones.
velveth [output_dir] [kmer_size] -fastq -shortPaired -separate \
[clean_raw_reads_1.fq] [clean_raw_reads_2.fq]
velvetg [output_dir] -exp_cov auto -cov_cutoff auto
3.2.2 SPAdes
# For reads of 125 bp these kmer sizes are ok
spades.py \
--only-assembler \
--careful \
-k 21,33,55,77,99 \
-m 450 \
-t 16 \
-o [output_dir] \
--pe1-1 [clean_raw_reads_1.fq] \
--pe1-2 [clean_raw_reads_2.fq]
We can scaffold the genome assembly using a transcriptome assembly with SCUBAT2
# Blast Transcriptome to the assembly with BLAST
blastn -query [transcriptome] -db [assembly] -num_threads 8 -outfmt 5 -evalue 1e-25 -out [blastn.xml]
# Run SCUBAT, with max intron size allowed of 20000
SCUBAT_v2.py -b [blastn.xml] -f [assembly] -max 20000
We can evaluate our assemblies using the following metrics.
Calculate the likelihood of the assembly with CGAL
Identify the number of core eukaryotic genes present with CEGMA or BUSCO
Check assembly contiguity with REAPR
Search for the mitochondrion scaffold(s) in our assembly with BLAST
Rename the assembly scaffolds to something sensible with assembly.rename.sh (requires fastaqual_select.pl in path)
assembly.rename.sh [assembly] [scaffold_name] # eg nCe.2.0
5.1 Mask the repeats with RepeatModeler and RepeatMasker
mkdir RepeatModeler
cd RepeatModeler
#Build the database
BuildDatabase -name [species_name] -engine ncbi ../[renamed_assembly]
# De novo model the repeats
RepeatModeler -engine ncbi -pa 16 -database [species_name]
# Extract repeats from Rhabditida
# queryRepeatDatabase.pl inside RepeatMasker/utils
queryRepeatDatabase.pl -species rhabditida | grep -v "Species:" > Rhabditida.repeatmasker
# Concatenate the 2 files
cat RM*/consensi.fa.classified Rhabditida.repeatmasker > all.repeats
# Mask the assembly
RepeatMasker -lib all.repeats -pa 16 ../[renamed_assembly]
5.2.1 Align RNASeq with STAR
# Generate genome index
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir [genome_dir] \
--genomeFastaFiles [assembly.masked]
# Map the reads
STAR --runThreadN 8 --genomeDir [genome_dir] --outSAMtype BAM Unsorted \
--twopassMode Basic --readFilesCommand zcat \
--readFilesIn [RNASeq_1.sanfastq.gz] [RNASeq_2.sanfastq.gz]
5.2.2 Run BRAKER
braker.pl \
--workingdir=[output_dir] \
--species=[species_name] \
--cores=8 \
--genome=[assembly.masked] \
--bam=Aligned.out.bam
Section under testing
Use BUSCO to create Augustus profile (new) or CEGMA to create SNAP profile.
Run BUSCO
python run_BUSCO.py -i [assembly.masked] -o [busco_euk] -l [eukaryota_odb9 dir] -m genome -c 16 -sp caenorhabditis --long
Rename the parameter files
# Files will be in run_busco_euk/augustus_output/retraining_parameters
# Rename the folder to [species_name]
mv retraining_parameters [species_name]
# Rename the files to [species_name] from [Busco_name] (e.g just before _metapars)
cd [species_name]
for a in * ; do rename 's/[Busco_name]/[species_name]/' $a ; done
for a in * ; do sed -i 's/[Busco_name]/[species_name]/' $a ; done
# Move the folder to augustus config/species
Run CEGMA
cegma -g [assembly.masked] -T 8
Create the SNAP hmm file
cegma2zff [output.cegma.gff] [assembly.masked] # (from MAKER software)
fathom genome.ann genome.dna -categorize 1000
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl [Cv] . > [Cv.cegmasnap.hmm]
5.3.2 Create GeneMark hmm file
Run GeneMark, the output will be named gmhmm.mod
.
gmes_petap.pl --ES --max_intron 20000 --cores 8 --sequence [assembly.masked]
5.3.3 Run MAKER2
Create the config files
maker -CTL
Edit maker_opts.ctl
and the change the following
genome=[assembly.masked]
protein=[swissprot.fasta]
model_org= # Change value to blank
repeat_protein= # Change value to blank
snaphmm=[Cv.cegmasnap.hmm] # If step 5.3.1.2
gmhmm=[genemark.mod]
augustus_species=[species_name] # If step 5.3.1.1
est2genome=1
protein2genome=1
pred_stats=1
keep_preds=1
single_exon=1
Run MAKER2 and Extract the gff3 fasta files
mpiexec -n 16 maker # No. of threads
gff3_merge -d [datastore_index.log]
5.3.4 Run Augustus
Create the training gff for Augustus (maker2zff
from MAKER, zff2gff3.pl
from SNAP)
maker2zff -c 0 -e 0 [maker.gff3]
zff2gff3.pl genome.ann | perl -plne 's/\t(\S+)$/\t\.\t$1/' >augustus.train.gff3
Run Augustus
autoAug.pl \
--species=[species_name] \
--genome=[assembly.masked] \
--trainingset=augustus.train.gff3 \
--useexisting -v --singleCPU \
--optrounds=3 &>augustus.out.txt