Skip to content

Instantly share code, notes, and snippets.

@zjlahey
Forked from darencard/maker_genome_annotation.md
Last active December 16, 2020 16:22
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save zjlahey/3c400c3039eef674e335d3d850ad595f to your computer and use it in GitHub Desktop.
Save zjlahey/3c400c3039eef674e335d3d850ad595f to your computer and use it in GitHub Desktop.
In-depth description of running MAKER for genome annotation.

Genome Annotation with MAKER

The following pipeline was put together by Daren Card, the link for which can be found here.

I am in the process of modifying his protocol to annotate the genome of the parasitic wasp Trissolcus basalis (Hymenoptera: Proctotrupomorpha: Platygastroidea).

Software & Data

Software prerequisites:

  1. RepeatModeler version 2.0.1 (06-Aug-2020).
  2. RepeatMasker version 4.1.0 (06-Aug-2020). 2a. RepBase (version 20170127).
  3. MAKER version 3.01.03 (07-Aug-2020).
  4. Augustus version 3.3.3.
  5. BUSCO version 4.1.2 (conda install -c bioconda -c conda-forge busco=4.1.2 augustus=3.3.3).
  6. SNAP version 2013-02-16.
  7. BEDtools version 2.29.0.

Raw data/resources:

  1. (genome) OSU_Tbal_v1.1.fasta: Our best Trissolcus basalis genome assembly produced from 5 plates of 454 reads, one lane of 50 bp illumina reads (unpaired), one gb of long reads (Oxford Nanopore), and RNA-seq data from pooled tissues (used for scaffolding). This assembly differs from the version submitted to NCBI (OSU_Tbal_v1.0.fasta) by the removal of the two mitochondrial contigs and contigs shorter than 2000 bp.
  2. (est) TB_FA_trinity.fasta.transdecoder.cds; TB_FB_trinity.fasta.transdecoder.cds; TB_FW_trinity.fasta.transdecoder.cds; TB_MA_trinity.fasta.transdecoder.cds; TB_MB_trinity.fasta.transdecoder.cds; TB_MW_trinity.fasta.transdecoder.cds: coding sequences from tissue-specific transcriptome assemblies annotated with TransDecoder.
  3. (altest) Coding sequences from the recently annotated genome assembly of Telenomus remus Nixon (Huayan Chen, unpublished data).
  4. (protein) TB_FA_trinity.fasta.transdecoder.pep; TB_FB_trinity.fasta.transdecoder.pep; TB_FW_trinity.fasta.transdecoder.pep; TB_MA_trinity.fasta.transdecoder.pep; TB_MB_trinity.fasta.transdecoder.pep; TB_MW_trinity.fasta.transdecoder.pep: protein sequences from tissue-specific transcriptome assemblies annotated with TransDecoder. These probably weren't necessary since the coding sequences were supplied, but I included them just to be thorough.
  5. (protein) All 170 arthropod proteomes in OrthoDB10v1 (accessed 06-Aug-2020).
  6. (protein) Swiss-Prot protein sequences from UniProt (accessed 08-Aug-2020).

Repeat Annotation

1. De Novo Repeat Identification

The first step of any genome annotation project is to identify repetitive elements. Existing libraries from RepBase or from internal efforts are great, but it is also important to identify repeats de novo from your reference genome using RepeatModeler. The time required to perform this task is variable: smaller genomes, less time.

BuildDatabase -name Trissolcus_basalis -engine ncbi OSU_Tbal_v1.1.fasta

RepeatModeler -pa 8 \ #number of cores/4 (each rmblast process utilizes 4 cores; I'm using a 32-core machine)
-engine ncbi \ #optional (ncbi is the default engine, but I like to flag it just to be sure)
-database Trissolcus_basalis 2>&1 | tee trissolcus_basalis_OSU_Tbal_v1.1_repeatmodeler.log

2. Full Repeat Annotation

Depending on the species, the de novo library can be fed right into MAKER; however, you should perform at least one additional round of annotation using curated repeats from RepBase. This process is described below.

cd RepeatMasker
RepeatMasker -pa 8 -e ncbi \
-gccalc \ #RepeatMasker calculates the GC content even for batch files/small seqs
-a \ #writes alignments in .align output file
-s \ #slow search; 0-5% more sensitive, 2-3 times slower than default
-gff \ #creates an additional Gene Feature Finding format output file
-lib Trissolcus_basalis-families.fa \ #RepeatModeler output
-dir OSU_Tbal_v1.1_repeatmasker_round1 \
$PATH/OSU_Tbal_v1.1.fasta

Then the masked FASTA from this search can be used as input for the next search, using the arthropoda library from RepBase. I also normally rename the outputs after each round so they are more representative of what they contain.

RepeatMasker -pa 8 -e ncbi -gccalc -a -s \
-species arthropoda \ #taxon name must be in the NCBI Taxonomy Database
-dir OSU_Tbal_v1.1_repeatmasker_round2_arthropoda_mask \
$PATH/OSU_Tbal_v1.1_repeatmasker_round1/OSU_Tbal_v1.1.fasta.masked

Finally, results from each round must be analyzed together to produce the final repeat annotation:

mkdir full_mask
cd full_mask
gunzip -d $PATH/RepMasker_OSU_Tbal_v1.1_repeatmasker_round[12].cat.gz
cat $PATH/RepMasker_OSU_Tbal_v1.1_repeatmasker_round[12].cat > OSU_Tbal_v1.1_full_mask.cat
ProcessRepeats -a -species arthropoda OSU_Tbal_v1.1_full_mask.cat

Finally, in order to feed these repeats into MAKER properly, we must separate out the complex repeats (more info on this below). The rmOutToGFF3.pl file is included with RepeatMasker.

# create GFF3
rmOutToGFF3.pl full_mask/OSU_Tbal_v1.1_full_mask.out > full_mask/OSU_Tbal_v1.1_full_mask.out.gff3
# isolate complex repeats
grep -v -e "Satellite" -e ")n" -e "-rich" OSU_Tbal_v1.1_full_mask.out.gff3 \
  > OSU_Tbal_v1.1_full_mask.out.complex.gff3
# reformat to work with MAKER
cat OSU_Tbal_v1.1_full_mask.out.complex.gff3 | \
  perl -ane '$id; if(!/^\#/){@F = split(/\t/, $_); chomp $F[-1];$id++; $F[-1] .= "\;ID=$id"; $_ = join("\t", @F)."\n"} print $_' \
  > OSU_Tbal_v1.1_full_mask.out.complex.reformat.gff3

Now we have the prerequisite repeat data for running MAKER.

3. Initial MAKER Analysis

MAKER is pretty easy to get going and relies on properly completed control files. These can be generated by issuing the command maker -CTL. The only control file we will be using for the first round is the maker_opts.ctl file. In this first round, we will provide the data files from the repeat annotation (rm_gff), sex- and tissue-specific transcriptome assemblies of Trissolcus basalis (est), and the OrthoDB and Swiss-Prot protein sequences (protein). We will also set the model_org to 'simple' so that only simple repeats are annotated (along with RepeatRunner). Here is the full control file for reference, with unused parameters commented out (#).

cat round1_maker_opts.ctl

#-----Genome (these are always required)
genome=/home/mbd-johnson-group/tbal_genome_annotation/assembly_file/OSU_Tbal_v1.1.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/TB_MW_trinity.fasta.transdecoder.cds:TB_MW_CDS,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/TB_MB_trinity.fasta.transdecoder.cds:TB_MB_CDS,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/TB_FW_trinity.fasta.transdecoder.cds:TB_FW_CDS,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/TB_FB_trinity.fasta.transdecoder.cds:TB_FB_CDS,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/TB_FA_trinity.fasta.transdecoder.cds:TB_FA_CDS,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/TB_MA_trinity.fasta.transdecoder.cds:TB_MA_CDS #set of ESTs or assembled mRNA-seq in fasta format
altest=/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/telenomus_remus_EVM.final.gene.gff3.cds:TELREM_CDS #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/home/mbd-johnson- group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/TB_FA_trinity.fasta.transdecoder.pep:TB_FA_PEP,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/TB_MB_trinity.fasta.transdecoder.pep:TB_MB_PEP,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/TB_FB_trinity.fasta.transdecoder.pep:TB_FB_PEP,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/telenomus_remus_EVM.final.gene.gff3.pep:TELREM_PEP,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/TB_FW_trinity.fasta.transdecoder.pep:TB_FW_PEP,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/uniprot_swissprot_08Aug2020.fasta:SWISS_PROT,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/odb10v1_arthropods_170taxa_prot.fasta:odb10v1,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/TB_MW_trinity.fasta.transdecoder.pep:TB_MW_PEP,/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/TB_MA_trinity.fasta.transdecoder.pep:TB_MA_PEP  #protein sequence file in fasta format (i.e. from mutiple organisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=simple #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/home/mbd-johnson-group/tbal_genome_annotation/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff=/home/mbd-johnson-group/tbal_genome_annotation/RepeatMasker/full_mask_rounds123/OSU_Tbal_v1.1_full_mask.out.complex.reformat.gff3 #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=1 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=1 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
snoscan_meth= #-O-methylation site fileto have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap=0 #allowed gene overlap fraction (value from 0 to 1, blank for default)

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=4 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20 #minimum intron length (used for alignment polishing)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP=/home/lahey.18/raid/maker_tmp #specify a directory other than the system default temporary directory for temporary files

Specifying the GFF3 annotation file for the annotated complex repeats (rm_gff) has the effect of hard masking these repeats so that they do not confound our ability to identify coding genes. We let MAKER identify simple repeats internally, since it will soft mask these, allowing them to be available for gene annotation. This isn't a typical approach but has to be done if you want to do more than one succeessive round of RepeatMasker. Daren Card verified this would work with the MAKER maintainers here.

Given MAKER will be using BLAST to align transcripts and proteins to the genome, this will take up to a couple days with 30 cores. Speed is a product of the resources you allow (more cores == faster) and the assembly quality (smaller, less contiguous scaffolds == longer). We conclude by assembling together the GFF and FASTA outputs.

cd $PATH/round_1_OSU_Tbal_v1.1.maker.output
gff3_merge -s -d round_1_OSU_Tbal_v1.1_master_datastore_index.log > round_1_OSU_Tbal_v1.1.all.maker.gff
fasta_merge -d round_1_OSU_Tbal_v1.1_master_datastore_index.log
# GFF w/o the sequences
gff3_merge -n -s -d round_1_OSU_Tbal_v1.1_master_datastore_index.log > round_1_OSU_Tbal_v1.1.all.maker.noseq.gff

4. Training Gene Prediction Software

Besides mapping the empirical transcript and protein evidence to the reference genome and repeat annotation (not much of this in our example, given we've done so much up front), the most important product of this MAKER run is the gene models. These are what is used for training gene prediction software like Augustus and SNAP.

SNAP

SNAP is pretty quick and easy to train. Issuing the following commands will perform the training. It is best to put some thought into what kind of gene models you use from MAKER. In this case, we use models with an AED of 0.25 or better and a length of 50 or more amino acids, which helps get rid of junky models.

mkdir snap
mkdir snap/round1_OSU_Tbal_v1.1
cd snap/round1_OSU_Tbal_v1.1
# export 'confident' gene models from MAKER and rename to something meaningful
maker2zff -x 0.25 -l 50 $PATH/round_1_OSU_Tbal_v1.1.all.maker.gff
rename genome round_1_OSU_Tbal_v1.1.zff.length50_aed0.25 *
# gather some stats and validate
$PATH/fathom round_1_OSU_Tbal_v1.1.zff.length50_aed0.25.ann round_1_OSU_Tbal_v1.1.zff.length50_aed0.25.dna -gene-stats > gene-stats.log 2>&1
$PATH/fathom round_1_OSU_Tbal_v1.1.zff.length50_aed0.25.ann round_1_OSU_Tbal_v1.1.zff.length50_aed0.25.dna -validate > validate.log 2>&1
# collect the training sequences and annotations, plus 1000 surrounding bp for training
$PATH/fathom round_1_OSU_Tbal_v1.1.zff.length50_aed0.25.ann round_1_OSU_Tbal_v1.1.zff.length50_aed0.25.dna -categorize 1000 > categorize.log 2>&1
$PATH/fathom uni.ann uni.dna -export 1000 -plus > uni-plus.log 2>&1
# create the training parameters
mkdir params
cd params
$PATH/forge ../export.ann ../export.dna > ../forge.log 2>&1
cd ..
# assembly the HMM
$PATH/hmm-assembler.pl round_1_OSU_Tbal_v1.1.zff.length50_aed0.25 params > round_1_OSU_Tbal_v1.1.zff.length50_aed0.25.hmm
Augustus

Training Augustus is a more laborious process. Luckily, the recent release of BUSCO provides a nice pipeline for performing the training, while giving you an idea of how good your annotation already is. If you don't want to go this route, there are scripts provided with Augustus to perform the training. First, the Parallel::ForkManager module for Perl is required to run BUSCO with more than one core. You can easily install it before the first time you use BUSCO by running sudo apt-get install libparallel-forkmanager-perl.

This probably isn't an ideal training environment but appears to work well. First, we must put together training sequences using the gene models we created in our first run of MAKER. We do this by issuing the following command to excise the regions that contain mRNA annotations based on our initial MAKER run (with 1000bp on each side).

awk -v OFS="\t" '{ if ($3 == "mRNA") print $1, $4, $5 }' $PATH/round_1_OSU_Tbal_v1.1.all.maker.noseq.gff | \
  awk -v OFS="\t" '{ if ($2 < 1000) print $1, "0", $3+1000; else print $1, $2-1000, $3+1000 }' | \
  bedtools getfasta -fi $PATH/OSU_Tbal_v1.1.fasta - -fo round_1_OSU_Tbal_v1.1.all.maker.transcripts1000.fasta

There are some important things to note based on this approach. First is that you will likely get warnings from BEDtools that certain coordinates could not be used to extract FASTA sequences. This is because the end coordinate of a transcript plus 1000 bp is beyond the total length of a given scaffold. This script does account for transcripts being within the beginning 1000bp of the scaffold, but there was no easy way to do the same with transcrpts within the last 1000bp of the scaffold. This is okay, however, as we still end up with sequences from thousands of gene models and BUSCO will only be searching for a small subset of genes itself.

While we've only provided sequences from regions likely to contain genes, we've totally eliminated any existing annotation data about the starts/stops of gene elements. Augustus would normally use this as part of the training process. However, BUSCO will essentially do a reannotation of these regions using BLAST and built-in HMMs for a set of conserved genes (hundreds to thousands). This has the effect of recreating some version of our gene models for these conserved genes. We then leverage the internal training that BUSCO can perform (the --long argument) to optimize the HMM search model to train Augustus and produce a trained HMM for MAKER. Here is the command we use to perform the Augustus training inside BUSCO.

# create conda environment named busco
conda create --name busco
# activate the conda environment you just created
conda activate busco
# install busco and augustus into activated environment
conda install -c bioconda -c conda-forge busco=4.1.2 augustus=3.3.3
# run busco
busco -i $PATH/round_1_OSU_Tbal_v1.1.all.maker.transcripts1000.fasta -o round_1_OSU_Tbal_v1.1_maker_transcripts1000 \
-l hymenoptera -m genome -c 24 --long --augustus_species nasonia --augustus_parameters='progress=true'

In this case, we are using the Hymenoptera set of conserved genes (N = 5991 genes), so BUSCO will try to identify those gene using BLAST and an initial HMM model for each that comes stocked within BUSCO. We specify the -m genome option since we are giving BUSCO regions that include more than just transcripts. The initial HMM model we'll use is from Nasonia vitripennis (--augustus_species nasonia), which is a reasonably close species. Finally, the --long option tells BUSCO to use the initial gene models it creates to optimize the HMM settings of the raw nasonia HMM, thus training it for our use on Trissolcus basalis.

Once BUSCO is complete, it will give you an idea of how complete your annotation is (though be cautious, because we haven't filtered away known alternative transcripts that will be binned as duplicates). We need to do some post-processing of the HMM models to get them ready for MAKER. First, we'll rename the files within run_tb_rnd1_maker/augustus_output/retraining_paramters.

cd $HOME/busco/round_1_OSU_Tbal_v1.1_maker_transcripts1000/run_hymenoptera_odb10/augustus_output/retraining_parameters/BUSCO_round_1_OSU_Tbal_v1.1_maker_transcripts1000
rename BUSCO_round_1_OSU_Tbal_v1.1_maker_transcripts1000 Trissolcus_basalis *

We also need to rename the files cited within certain HMM configuration files.

sed -i 's/BUSCO_round_1_OSU_Tbal_v1.1_maker_transcripts1000/Trissolcus_basalis/g' Trissolcus_basalis_parameters.cfg
sed -i 's/BUSCO_round_1_OSU_Tbal_v1.1_maker_transcripts1000/Trissolcus_basalis/g' Trissolcus_basalis_parameters.cfg.orig1

Finally, we must copy these into the $AUGUSTUS_CONFIG_PATH species HMM location so they are accessible by Augustus and MAKER.

# may need to sudo
mkdir $AUGUSTUS_CONFIG_PATH/species/Trissolcus_basalis
cp Trissolcus_basalis*  $AUGUSTUS_CONFIG_PATH/species/Trissolcus_basalis/

5. MAKER With Ab Initio Gene Predictors

Now let's run a second round of MAKER, but this time we will have SNAP and Augustus run within MAKER to help create more sound gene models. MAKER will use the annotations from these two prediction programs when constructing its models. Before running, let's first recycle the mapping of empicial evidence we have from the first MAKER round, so we don't have to perform all the BLASTs, etc. again.

# transcript alignments
awk '{ if ($2 ~ "est2genome") print $0 }' round_1_OSU_Tbal_v1.1.all.maker.noseq.gff > round_1_OSU_Tbal_v1.1.all.maker.est2genome.gff
# cdna alignments (Telenomus remus transcripts)
awk '{ if ($2 ~ "cdna2genome") print $0 }' round_1_OSU_Tbal_v1.1.all.maker.noseq.gff > round_1_OSU_Tbal_v1.1.all.maker.cdna2genome.gff
# protein alignments
awk '{ if ($2 ~ "protein2genome") print $0 }' round_1_OSU_Tbal_v1.1.all.maker.noseq.gff > round_1_OSU_Tbal_v1.1.all.maker.protein2genome.gff
# repeat alignments
awk '{ if ($2 ~ "repeat") print $0 }' round_1_OSU_Tbal_v1.1.all.maker.noseq.gff > round_1_OSU_Tbal_v1.1.all.maker.repeats.gff

Then we will modify the previous control file, removing the FASTA sequences files to map and replacing them with the GFFs (est_gff, altest_gff, protein_gff, and rm_gff, respectively. We can also specify the path to the SNAP HMM and the species name for Augustus, so that these gene prediciton programs are run. We will also switch est2genome and protein2genome to 0 so that gene predictions are based on the Augustus and SNAP gene models. Here is the full version of this control file.

cat round2_maker_opts.ctl

#-----Genome (these are always required)
genome=/home/lahey.18/mbd/tbal_genome_annotation/assembly_file/OSU_Tbal_v1.1.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=
altest=
est_gff=/home/lahey.18/mbd/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/round_1_OSU_Tbal_v1.1.all.maker.est2genome.gff #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff=/home/lahey.18/mbd/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/est/round_1_OSU_Tbal_v1.1.all.maker.cdna2genome.gff #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=
protein_gff=/home/lahey.18/mbd/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/prot/round_1_OSU_Tbal_v1.1.all.maker.protein2genome.gff  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff=/home/lahey.18/mbd/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/repeats/round_1_OSU_Tbal_v1.1.all.maker.repeats.gff #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm=/home/lahey.18/mbd/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/snap/round1_OSU_Tbal_v1.1/round_1_OSU_Tbal_v1.1.zff.length50_aed0.25.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=Trissolcus_basalis #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
snoscan_meth= #-O-methylation site fileto have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
allow_overlap= #allowed gene overlap fraction (value from 0 to 1, blank for default)

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=400000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20 #minimum intron length (used for alignment polishing)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP=/home/lahey.18/raid/maker_tmp #specify a directory other than the system default temporary directory for temporary files

Then we can run MAKER, substituting this new control file, and summarize the output, as we did before.

6. Iteratively Running MAKER to Improve Annotation

One of the beauties of MAKER is that it can be run iteratively, using the gene models from the one round to train ab initio software to improve the inference of gene models in the next round. Essentially, all one has to do is repeat steps 4 and 5 to perform another round of annotation. The MAKER creators/maintainers recommend at least a couple rounds of ab initio software training and MAKER annotation (i.e., 3 rounds total) and returns start to diminish (at differing rates) thereafter. One needs to be careful not to overtrain Augustus and SNAP, so more rounds isn't necessarily always better. Below are a few ways of evaluating your gene models after successive rounds of MAKER to identify when you have sound models.

A. Count the number of gene models and the gene lengths after each round.

cat round_2_OSU_Tbal_v1.1.all.maker.gff | awk '{ if ($3 == "gene") print $0 }' | awk '{ sum += ($5 - $4) } END { print NR, sum / NR }'

B. Visualize the AED distribution. AED ranges from 0 to 1 and quantifies the confidence in a gene model based on empirical evidence. Basically, the lower the AED, the better a gene model is likely to be. Ideally, 95% or more of the gene models will have an AED of 0.5 or better in the case of good assemblies. You can use this AED_cdf_generator.pl script to help with this.

perl AED_cdf_generator.pl -b 0.025 round_2_OSU_Tbal_v1.1.all.maker.gff

C. Run BUSCO one last time using the species Augustus HMM and take a look at the results (this will be quick since we are not training Augustus). Also, only include the transcript sequences (not the 1000 bp on each side) and be sure to take the best (i.e., longest) transcript for each gene so you aren't artificially seeding duplicates. You can also run it on the best protein sequence per gene instead. Your command will be some derivative of the following:

conda activate busco
busco -i $PATH/round_2_OSU_Tbal_v1.1.all.maker.transcripts1000.fasta -o round_2_OSU_Tbal_v1.1_maker_transcripts1000 \
-l hymenoptera -m genome -c 24 --long --augustus_species Trissolcus_basalis --augustus_parameters='progress=true'

D. Visualize the gene models from Augustus, SNAP, and MAKER using a genome browser. JBrowse is a good option for this. You can essentially follow this guide to get this started. A helpful resource is this gff2jbrowse.pl script, which automates adding tracks to the browser based on the GFF output of your MAKER run. It is best to use 5-10 longer, gene dense scaffolds and visually inspect them. When SNAP and Augustus are well trained, their models should overlap pretty closely with the final MAKER models. Moreover, there will be spurious hits from SNAP and Augustus, but they are usually short, 1-2 exon annotations and don't have empirical support. You'll get a sense of a good annotation with some experience. Also, it is possible SNAP won't produce good results, depending on your organism, which the MAKER folks have pointed out in the past (Augustus usually does pretty well).

7. Downstream Processing and Homology Inference

After running MAKER one now has protein models, but that isn't all together very useful. First, the MAKER default names are long, ugly, and likely difficult for programs to parse. Moreover, even if they were named "gene1", etc. that doesn't tell you anything about what the genes actually are. Therefore, it is necessary to do some downstream processing of the MAKER output and to use homology searches against existing databases to annotate more functional information about genes.

A. First, let's rename the IDs that MAKER sets by default for genes and transcripts. MAKER comes with some scripts to do just this and to swap them out in the GFF and FASTA output (instructions for generated are above). The commands below first create custom IDs and store them as a table, and then use that table to rename the original GFF and FASTA files (they are overwritten, but it is possible to regenerate the raw ones again).

# create naming table (there are additional options for naming beyond defaults)
maker_map_ids --prefix TRIBAS_ --justify 5 round_2_OSU_Tbal_v1.1.all.maker.gff > round_2_OSU_Tbal_v1.1.all.maker.name.map
# replace names in GFF files
map_gff_ids round_2_OSU_Tbal_v1.1.all.maker.name.map round_2_OSU_Tbal_v1.1.all.maker.gff
map_gff_ids round_2_OSU_Tbal_v1.1.all.maker.name.map round_2_OSU_Tbal_v1.1.all.maker.noseq.gff
# replace names in FASTA headers
map_fasta_ids round_2_OSU_Tbal_v1.1.all.maker.name.map round_2_OSU_Tbal_v1.1.all.maker.transcripts.fasta
map_fasta_ids round_2_OSU_Tbal_v1.1.all.maker.name.map round_2_OSU_Tbal_v1.1.all.maker.proteins.fasta

B. Assign putative gene functions using protein sequences from one of two UniProt databases: (1) Swiss-Prot or (2) UniRef. These are the only databases that will work with the maker_functional_gff and maker_functional_fasta scripts required to parse the fasta headers.

Note: the ability to parse the UniRef fastas correctly was added to the above scripts in MAKER version 3.01.03. I am not aware if the MAKER version 2 scripts are able to parse anything other than the Swiss-Prot fasta headers.

Your choice of database depends, in large part, on how much time you have to run blastp. If you decide to use one of the UniRef databases, I suggest narrowing your search close to your taxon of interest. I chose the UniRef50 Arthropoda database for my annotation (2,104,840 clusters), but any of the UniRef databases will work (50, 90, or 100).

# make blast database
mkdir blastdbs
cd blastdbs
makeblastdb -in uniref50_arthropoda_6656_18Aug2020.fasta -input_type fasta -dbtype prot

# run blastp
blastp -db $PATH/uniref50_arthropoda_6656_18Aug2020.fasta \
-query $PATH/round_2_OSU_Tbal_v1.1.all.maker.proteins.fasta \
-out round_2_OSU_Tbal_v1.1.all.maker.proteins.arthropoda_uniref50.blastp \
-evalue 1e-6 -outfmt 6 -max_hsps 1 -max_target_seqs 1 -num_threads 30

# add protein homology data from blastp output to gff
maker_functional_gff $PATH/uniref50_arthropoda_6656_18Aug2020.fasta \
$PATH/round_2_OSU_Tbal_v1.1.all.maker.proteins.arthropoda_uniref50.blastp \
$PATH/round_2_OSU_Tbal_v1.1.all.maker.gff \
round_2_OSU_Tbal_v1.1.all.maker.functional.uniref50_arthropoda.gff

# add protein homology data from blastp output to fastas
# protein fasta
maker_functional_fasta $PATH/uniref50_arthropoda_6656_18Aug2020.fasta \
$PATH/round_2_OSU_Tbal_v1.1.all.maker.proteins.arthropoda_uniref50.blastp \
$PATH/round_2_OSU_Tbal_v1.1.all.maker.proteins.fasta \
round_2_OSU_Tbal_v1.1.all.maker.proteins.functional.uniref50_arthropoda.fasta

# transcript fasta
maker_functional_fasta $PATH/uniref50_arthropoda_6656_18Aug2020.fasta \
$PATH/round_2_OSU_Tbal_v1.1.all.maker.proteins.arthropoda_uniref50.blastp \
$PATH/round_2_OSU_Tbal_v1.1.all.maker.transcripts.fasta \
round_2_OSU_Tbal_v1.1.all.maker.transcripts.functional.uniref50_arthropoda.fasta
@niconm89
Copy link

niconm89 commented Dec 4, 2020

Hi Zachary, thank you! This is very useful (I'm also working with arthropods).
I have one question with the final repeat annotation file you created. When I look the content of the file, I see that the third column it is only present 'dispersed_repeat' instead of match/match_part. Do you know why or how to change this? Because MAKER fails when I put the repeat gff as input. Holt (maker developer) told me that's the reason.

Thank you!
Nicolás

@zjlahey
Copy link
Author

zjlahey commented Dec 8, 2020

Hi Nicolás,

What error message are you getting? MAKER runs fine for me when I give it the OSU_Tbal_v1.1_full_mask.out.complex.reformat.gff3 as input. The output of the first round results in a repeat gff that looks like this:

scaffold_4215   repeat_gff:repeatmasker match   1       111     310     -       .       ID=scaffold_4215:hit:454207:1.3.0.0;Name=52533;Target=52533 2 51 +
scaffold_4215   repeat_gff:repeatmasker match_part      1       111     310     -       .       ID=scaffold_4215:hsp:1739484:1.3.0.0;Parent=scaffold_4215:hit:454207:1.3.0.0;Target=52533 2 51 +
scaffold_4215   repeatmasker    match   370     415     20      +       .       ID=scaffold_4215:hit:454208:1.3.0.0;Name=species:A-rich|genus:Low_complexity;Target=species:A-rich|genus:Low_complexity 1 48 +
scaffold_4215   repeatmasker    match_part      370     415     20      +       .       ID=scaffold_4215:hsp:1739485:1.3.0.0;Parent=scaffold_4215:hit:454208:1.3.0.0;Target=species:A-rich|genus:Low_complexity 1 48 +

@niconm89
Copy link

niconm89 commented Dec 9, 2020

Hi Zachary, thanks for your reply.

The error is:

processing all repeats
in cluster::shadow_cluster...
...finished clustering.
doing repeat masking
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Did not specify a Hit End or Hit Begin
STACK: Error::throw
STACK: Bio::Root::Root::throw /gpfs3/well/bsg/revale/miniconda3/envs/maker/lib/site_perl/5.26.2/Bio/Root/Root.pm:447
STACK: Bio::Search::HSP::GenericHSP::_subject_seq_feature /gpfs3/well/bsg/revale/miniconda3/envs/maker/lib/site_perl/5.26.2/Bio/Search/HSP/GenericHSP.pm:1603
STACK: Bio::Search::HSP::GenericHSP::hit /gpfs3/well/bsg/revale/miniconda3/envs/maker/lib/site_perl/5.26.2/Bio/Search/HSP/GenericHSP.pm:987
STACK: repeat_mask_seq::separate_types /gpfs3/well/bsg/revale/software/maker/3.01.03/bin/../lib/repeat_mask_seq.pm:307
STACK: repeat_mask_seq::mask_chunk /gpfs3/well/bsg/revale/software/maker/3.01.03/bin/../lib/repeat_mask_seq.pm:191
STACK: Process::MpiChunk::_go /gpfs3/well/bsg/revale/software/maker/3.01.03/bin/../lib/Process/MpiChunk.pm:762
STACK: Process::MpiChunk::run /gpfs3/well/bsg/revale/software/maker/3.01.03/bin/../lib/Process/MpiChunk.pm:340
STACK: Process::MpiChunk::run_all /gpfs3/well/bsg/revale/software/maker/3.01.03/bin/../lib/Process/MpiChunk.pm:356
STACK: Process::MpiTiers::run_all /gpfs3/well/bsg/revale/software/maker/3.01.03/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /gpfs3/well/bsg/revale/software/maker/3.01.03/bin/../lib/Process/MpiTiers.pm:287
STACK: /well/bsg/revale/software/maker/3.01.03/bin/maker:680

--> rank=NA, hostname=compd003.hpc.in.bmrc.ox.ac.uk
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:scaffold2|size426038
ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:scaffold2|size426038
examining contents of the fasta file and run log

Now I see that your *reformat.gff3 file has 'match' and 'match_part' in the third column. The gff3 file I generated only has 'dispersed_repeat' on the third column. I have revised all steps but I did not find anything weird. I will take another look, maybe the engine (like ncbi, rmblast, etc) is the problem.

If you have any ideas, please let me know.
Best,

Nicolás

@zjlahey
Copy link
Author

zjlahey commented Dec 9, 2020

Hi Nicolás,

When I run this block of code:

# create GFF3
rmOutToGFF3.pl full_mask/OSU_Tbal_v1.1_full_mask.out > full_mask/OSU_Tbal_v1.1_full_mask.out.gff3
# isolate complex repeats
grep -v -e "Satellite" -e ")n" -e "-rich" OSU_Tbal_v1.1_full_mask.out.gff3 \
  > OSU_Tbal_v1.1_full_mask.out.complex.gff3
# reformat to work with MAKER
cat OSU_Tbal_v1.1_full_mask.out.complex.gff3 | \
  perl -ane '$id; if(!/^\#/){@F = split(/\t/, $_); chomp $F[-1];$id++; $F[-1] .= "\;ID=$id"; $_ = join("\t", @F)."\n"} print $_' \
  > OSU_Tbal_v1.1_full_mask.out.complex.reformat.gff3

I end up with the OSU_Tbal_v1.1_full_mask.out.complex.reformat.gff3 file which looks like this:

##gff-version 3
##sequence-region scaffold_0 1 244633
scaffold_0      RepeatMasker    dispersed_repeat        7052    7160    314     -       .       Target=rnd-1_family-119 31 141;ID=1
scaffold_0      RepeatMasker    dispersed_repeat        9566    9639    255     +       .       Target=rnd-1_family-128 53 125;ID=2
scaffold_0      RepeatMasker    dispersed_repeat        9725    9853    226     +       .       Target=rnd-1_family-128 57 164;ID=3
scaffold_0      RepeatMasker    dispersed_repeat        9855    9922    348     -       .       Target=rnd-1_family-128 1 72;ID=4
scaffold_0      RepeatMasker    dispersed_repeat        11177   11220   251     +       .       Target=rnd-5_family-4081 251 295;ID=5
scaffold_0      RepeatMasker    dispersed_repeat        11211   11555   1619    -       .       Target=rnd-5_family-4450 33 368;ID=6
scaffold_0      RepeatMasker    dispersed_repeat        11660   11720   304     +       .       Target=rnd-5_family-4081 596 660;ID=7
scaffold_0      RepeatMasker    dispersed_repeat        13740   13809   368     +       .       Target=rnd-5_family-585 1 67;ID=8

What does your *.complex.reformat.gff3 file look like?

Zach

@niconm89
Copy link

niconm89 commented Dec 9, 2020

Hi Zach, I obtain that exact format and third column. But then I get the error I showed below. It seems that MAKER expect a two-level gff annotation file (like match/match_part).

Could you use that gff (OSU_Tbal_v1.1_full_mask.out.complex.reformat.gff3) without problems?

Nico

@zjlahey
Copy link
Author

zjlahey commented Dec 9, 2020

Yes, MAKER runs fine when I supply it with the OSU_Tbal_v1.1_full_mask.out.complex.reformat.gff3 file.

Here's the repeat section of the maker_opts.ctl file that specifies the gff3 file above:

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff=/home/lahey.18/mbd/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/inputs/repeats/round_1_OSU_Tbal_v1.1.all.maker.repeats.gff #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

and here's the start of the MAKER run log file that shows it's working correctly:

STATUS: Now running MAKER...
examining contents of the fasta file and run log



--Next Contig--

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: scaffold_4
Length: 178610
#---------------------------------------------------------------------


setting up GFF3 output and fasta chunks
doing repeat masking
running  repeat masker.
#--------- command -------------#
Widget::RepeatMasker:
cd /home/lahey.18/raid/maker_tmp/maker_C4Z67W; /home/mbd-johnson-group/tbal_genome_annotation/RepeatMasker/RepeatMasker /home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/round_1_OSU_Tbal_v1.1_maker_annotation/round_1_OSU_Tbal_v1.1.maker.output/round_1_OSU_Tbal_v1.1_datastore/FD/27/scaffold_4//theVoid.scaffold_4/0/scaffold_4.0.simple.rb -dir /home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/round_1_OSU_Tbal_v1.1_maker_annotation/round_1_OSU_Tbal_v1.1.maker.output/round_1_OSU_Tbal_v1.1_datastore/FD/27/scaffold_4//theVoid.scaffold_4/0 -pa 4 -lib /home/lahey.18/raid/maker_tmp/maker_C4Z67W/simple.lib
#-------------------------------#
A data structure will be created for you at:
/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/round_1_OSU_Tbal_v1.1_maker_annotation/round_1_OSU_Tbal_v1.1.maker.output/round_1_OSU_Tbal_v1.1_datastore

To access files for individual sequences use the datastore index:
/home/mbd-johnson-group/tbal_genome_annotation/maker/OSU_Tbal_v1.1_maker_annotation/round_1_OSU_Tbal_v1.1_maker_annotation/round_1_OSU_Tbal_v1.1.maker.output/round_1_OSU_Tbal_v1.1_master_datastore_index.log

I'm not sure what's going wrong for you. Can you copy and paste the first 10 lines of your complex.reformat.gff3 file?

Zach

@niconm89
Copy link

niconm89 commented Dec 9, 2020

Thanks, Zach!

I have the same repeat masking configuration in the maker_opts.ctl.

Here are the first lines of the complex.reformat.gff3 file:

##gff-version 3
##sequence-region Backbone_11 1 9531409
Backbone_11	RepeatMasker	dispersed_repeat	29	130	228	+	.	Target=rnd-6_family-13457 925 1026;ID=1
Backbone_11	RepeatMasker	dispersed_repeat	232	692	3604	-	.	Target=rnd-1_family-70 1 466;ID=2
Backbone_11	RepeatMasker	dispersed_repeat	685	964	2299	-	.	Target=rnd-1_family-70 1670 1949;ID=3
Backbone_11	RepeatMasker	dispersed_repeat	1045	1592	4381	-	.	Target=rnd-1_family-70 1119 1669;ID=4
Backbone_11	RepeatMasker	dispersed_repeat	1690	1960	6708	-	.	Target=rnd-1_family-70 1260 1662;ID=5
Backbone_11	RepeatMasker	dispersed_repeat	2014	2857	6708	-	.	Target=rnd-1_family-70 1 1259;ID=6
Backbone_11	RepeatMasker	dispersed_repeat	2917	3275	5070	+	.	Target=rnd-1_family-82 1 361;ID=7
Backbone_11	RepeatMasker	dispersed_repeat	3313	3676	5070	+	.	Target=rnd-1_family-82 362 729;ID=8
Backbone_11	RepeatMasker	dispersed_repeat	3754	3802	235	+	.	Target=TAHRE 2173 2221;ID=9
Backbone_11	RepeatMasker	dispersed_repeat	3916	3997	549	+	.	Target=rnd-1_family-103 1 82;ID=10

Nico

@zjlahey
Copy link
Author

zjlahey commented Dec 9, 2020

Hi Nico,

Are you running MAKER on a university cluster or your desktop computer? Your file is formatted correctly, so the gff3 file is not the issue. I recommend reinstalling BioPerl. Here's a link from the MAKER google forum which looks very, very similar to the same issue that you are having.

Also, when I first started using the program, I could never get it to run properly on the university cluster. MPI would never work. We have a 32 core machine in our lab that I used to run MAKER but without MPI. I was still able to utilize all of the cores by following the steps in the PARALLELIZED DE NOVO GENOME ANNOTATION WITHOUT MPI section of Campbell et al. 2014. If you can't access the PDF, shoot me an email (lahey.18@osu.edu) and I'll send you a copy.

Zach

@niconm89
Copy link

Hi Zach, sorry for the late response, and thank you very much for taking some time to help me. I tried maker on both university cluster and desktop machine. Some weird errors appeared in RepBase even when I installed it on repeatmasker. Leaving in blank the model_org= parameter, I solved this. BUT then the error I asked you before showed again. I tried with this but nothing change. I also tried maker 2 & 3. I will keep looking for the reason.

Best,

Nico

@zjlahey
Copy link
Author

zjlahey commented Dec 11, 2020

Hi Nico,

If you send me your log files, I would be happy to look at them for you!

Zach

@niconm89
Copy link

Ok, I appreciate it. Here is the first part of the log, I stop the process when I saw the first error in repeat masking. Let me know if you need more information.

STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
STATUS: Setting up database for any GFF3 input...
A data structure will be created for you at:
/home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore

To access files for individual sequences use the datastore index:
/home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_master_datastore_index.log

STATUS: Now running MAKER...
examining contents of the fasta file and run log



--Next Contig--

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: scaffold1|size657019
Length: 657019
#---------------------------------------------------------------------


setting up GFF3 output and fasta chunks
doing repeat masking
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.0
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.0 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.0.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.1
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.1 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.1.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.2
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.2 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.2.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.3
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.3 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.3.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.4
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.4 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.4.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.5
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.5 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.5.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.6
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.6 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.6.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.7
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.7 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.7.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.8
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.8 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.8.repeatrunner
#-------------------------------#
deleted:0 hits
doing blastx repeats
formating database...
#--------- command -------------#
Widget::formater:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/makeblastdb -dbtype prot -in /tmp/maker_dfhOnw/0/blastprep/te_proteins%2Efasta.mpi.10.9
#-------------------------------#
running  blast search.
#--------- command -------------#
Widget::blastx:
/home/nmoreyra/Software/miniconda3/envs/MAKER/bin/blastx -db /tmp/maker_dfhOnw/te_proteins%2Efasta.mpi.10.9 -query /tmp/maker_dfhOnw/0/scaffold1%7Csize657019.0 -num_alignments 10000 -num_descriptions 10000 -evalue 1e-06 -dbsize 300 -searchsp 500000000 -num_threads 1 -seg yes -soft_masking true -lcase_masking -show_gis -out /home/nmoreyra/Data/MAKER_test/Dbuz-jz3/D.buzzatii_jz3_v01.maker.output/D.buzzatii_jz3_v01_datastore/FE/93/scaffold1%7Csize657019//theVoid.scaffold1%7Csize657019/0/scaffold1%7Csize657019.0.te_proteins%2Efasta.repeatrunner.temp_dir/te_proteins%2Efasta.mpi.10.9.repeatrunner
#-------------------------------#
deleted:0 hits
collecting blastx repeatmasking
processing all repeats
doing repeat masking

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Did not specify a Hit End or Hit Begin
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/nmoreyra/Software/miniconda3/envs/orthomcl/lib/perl5/site_perl/5.22.0/Bio/Root/Root.pm:449
STACK: Bio::Search::HSP::GenericHSP::_subject_seq_feature /home/nmoreyra/Software/miniconda3/envs/orthomcl/lib/perl5/site_perl/5.22.0/Bio/Search/HSP/GenericHSP.pm:1603
STACK: Bio::Search::HSP::GenericHSP::hit /home/nmoreyra/Software/miniconda3/envs/orthomcl/lib/perl5/site_perl/5.22.0/Bio/Search/HSP/GenericHSP.pm:987
STACK: repeat_mask_seq::separate_types /home/nmoreyra/Software/maker2/bin/../lib/repeat_mask_seq.pm:307
STACK: repeat_mask_seq::mask_chunk /home/nmoreyra/Software/maker2/bin/../lib/repeat_mask_seq.pm:191
STACK: Process::MpiChunk::_go /home/nmoreyra/Software/maker2/bin/../lib/Process/MpiChunk.pm:763
STACK: Process::MpiChunk::run /home/nmoreyra/Software/maker2/bin/../lib/Process/MpiChunk.pm:341
STACK: Process::MpiChunk::run_all /home/nmoreyra/Software/maker2/bin/../lib/Process/MpiChunk.pm:357
STACK: Process::MpiTiers::run_all /home/nmoreyra/Software/maker2/bin/../lib/Process/MpiTiers.pm:287
STACK: Process::MpiTiers::run_all /home/nmoreyra/Software/maker2/bin/../lib/Process/MpiTiers.pm:287
STACK: /home/nmoreyra/Software/maker2/bin/maker:683
-----------------------------------------------------------
--> rank=NA, hostname=andromeda
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:scaffold1|size657019

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:scaffold1|size657019

examining contents of the fasta file and run log



--Next Contig--

#---------------------------------------------------------------------
Now starting the contig!!
SeqID: scaffold2|size426038
Length: 426038

@niconm89
Copy link

Hi again, Zach!
I tried to use another repeat gff file and worked!
We are reannotating the genome right now, so we set rm_gff with the gff file obtain at this step in the first annotation. So we think that is probably because of the format of the complex.reformat.gff3 file.

Do you know if is possible to obtain the hit information to convert the file format to a two level format (eg. match/match_part)?

Nico

@zjlahey
Copy link
Author

zjlahey commented Dec 16, 2020

Hi Nico,

I'm happy to hear that the program is finally working for you! I'm still really confused as to why that rm gff doesn't work for you and why it works for me. It probably is possible to convert the file to the format you mentioned, but I honestly don't know. The person to contact would be Daren Card, since he's the one who originally put the pipeline together. If you do contact him, please let me know what he says.

Zach

@niconm89
Copy link

Hi Zach, thank you very much! You are right, I will write to him and then let you know what he says as well.

Best regards,

Nico

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment