Juke34/annotation_using_MAKER.md

## annotation_using_MAKER.md

      
Making a good genome annotation using MAKER3


Foreword

Making a good annotation isn't easy, MAKER or not MAKER!

Here I explain the main steps to avoid the pitfalls and make a good annotation using the MAKER annotation tool.
Information on commands and protocols can also be found here: https://nbisweden.github.io/workshop-genome_annotation_elixir/exercises
You will need AGAT and GAAS (available in Bioconda or as containers) and Nextflow.
Steps from assembly verification to submission of annotations to the public archives

Here's the recipe I used, which gives good results
1) Sanitize your assembly

Make sure your genome assembly in FASTA format doesn't contain anything that could cause problems
(e.g. Ns at the beginning or end of a sequence, which will prevent you from submitting to a public archive;
IUPAC ambiguity codes, which are a problem for MAKER; lowercase nucleotides, which MAKER could interpret as repeats; etc.).
Run BUSCO on the assembly so that you can compare the result with the BUSCO you will run on your final annotation.
This will help you gauge the quality of your annotation.
We've developed this Nextflow pipeline for this step: https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/annotation_preprocessing
If you want to skip this pipeline, run gaas_fasta_purify.pl from GAAS (available in conda) and run BUSCO manually on your assembly.
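To get a feel for what "sanitizing" looks for, here is a minimal Python sketch that scans FASTA text for the problems listed above: Ns at the ends of a sequence, characters other than A/C/G/T/N, and lowercase nucleotides. It is only an illustration; the function names and the example sequences are made up here, and it is not what gaas_fasta_purify.pl or the Nextflow pipeline actually runs.

```python
from collections import Counter


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_sequence(seq):
    """Count the potential problems found in one sequence."""
    upper = seq.upper()
    counts = Counter(upper)
    return {
        "leading_n": len(upper) - len(upper.lstrip("N")),
        "trailing_n": len(upper) - len(upper.rstrip("N")),
        "lowercase": sum(1 for base in seq if base.islower()),
        # Anything other than A, C, G, T or N (IUPAC ambiguity codes, gaps, ...).
        "non_acgtn": sum(n for base, n in counts.items() if base not in "ACGTN"),
    }


# Made-up example: scaffold_1 has every problem, scaffold_2 is clean.
example = ">scaffold_1 demo\nNNACGTacgtRYACGTN\n>scaffold_2\nACGTACGT\n"
report = {name: check_sequence(seq) for name, seq in read_fasta(example)}
for name, problems in report.items():
    print(name, problems)
```

Any non-zero count is a hint that the assembly should go through the purification step before you start MAKER.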
2) Repeats

If no repeat library exists for your organism, you need to create one.

See my protocol as a minimum: https://www.biostars.org/p/411101/
Or see the MAKER protocol for something more advanced: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced

3) Gather Evidence

Collect all the information you can: ESTs, transcriptomes from your organism assembled de novo and genome-guided (at worst from a closely related organism), and proteins.
For proteins, you need to balance quality and quantity. If you have the time and the right machines, you can use the whole of UniProt,
but in a normal case you'll optimize your set: take all the reviewed proteins from UniProt, plus the unreviewed proteins
from all the genomes of a branch of the tree of life that's not too big and that contains your species,
plus potentially a reference proteome from a distant but well-studied species (e.g. human, mouse, Arabidopsis thaliana).

4) First MAKER RUN (annotation evidence-based)

This first MAKER run will mask the repeats, align the evidence (ESTs, transcripts, proteins), build an annotation based on that evidence, and annotate tRNA and rRNA.
You must set up the MAKER parameter file so that it performs all these tasks. Use the sanitized assembly you got out of step 1)!
Once the run has finished, you may run gaas_maker_check_progress.sh and/or gaas_maker_check_progress_deeply.sh from GAAS to be sure the MAKER run is really complete. Then run gaas_maker_merge_outputs_from_datastore.pl to gather all the results, with statistics, in one place. The maker_mix.gff file you get is a mix of all the GFF data produced by MAKER; the same data are also available split by type of source in the output of this script.
5) Ab-initio training

Train the ab initio tools, at least Augustus (SNAP rarely helps; GeneMark depends on the case).

To train Genemark I've put the protocol here: https://www.biostars.org/p/420356/
To train Augustus I've put the protocol here: https://www.biostars.org/p/394385/#9542648 otherwise we've developed this nextflow pipeline to do it automatically (it also trains snap): https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/abinitio_training

6) Ab-initio evidence-driven annotation

With MAKER there are two modes, which I call normal and fused.
For both, your working directory must be the same as in step 4). (You can make a copy of the MAKER parameter file to keep track of the one you used for the evidence-based annotation, because you will have to modify the parameters for this run.)
You will relaunch MAKER with the ab initio tools activated, reusing the outputs of step 4), such as the repeats and evidence alignments, thanks to this parameter block:
#-----Re-annotation Using MAKER Derived GFF3
maker_gff=maker_mix_from_evidence-build.gff #MAKER derived GFF3 file
est_pass=1 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=1 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=1 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=1 #use repeats in maker_gff: 1 = yes, 0 = no

In this example, the maker_mix_from_evidence-build.gff file contains all the information in GFF format (repeats, gene models, aligned evidence, etc.). I generate it using gaas_maker_merge_outputs_from_datastore.pl, which you can find in GAAS (Bioconda).
The other options in this block specify what you want to reuse from this GFF file.

Consequently, you can remove the corresponding options in the other parts of the MAKER parameter file (those for the repeat library, transcripts, proteins, ...), because there's no need to redo the work already done in step 4).
For this build, of course, you need to activate your ab initio tools by providing their HMM profiles.
For fused mode, you need an evidence-build annotation (if you follow this protocol, you've already made it in step 4)). In contrast to normal mode, this approach gives you pure evidence-based predictions in addition to the ab-initio evidence-driven predictions: it keeps evidence-based gene models in loci where there is no ab initio prediction. When an evidence-based prediction competes with an ab initio prediction, MAKER chooses the most appropriate one.
For this approach, simply add to the options listed above:
#-----Re-annotation Using MAKER Derived GFF3
...
model_pass=1 #use gene models in maker_gff: 1 = yes, 0 = no
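
To make the difference between the two modes explicit, here is a small Python sketch that writes the re-annotation block for either mode. The helper name reannotation_block is hypothetical, and writing model_pass=0 for normal mode is my assumption (the text above only says to add model_pass=1 for fused mode); the option names themselves are the ones shown in the block above.

```python
def reannotation_block(maker_gff, fused=False):
    """Return the re-annotation block of the MAKER parameter file as text.

    fused=False -> "normal" mode: reuse repeats and evidence alignments only.
    fused=True  -> "fused" mode: also reuse the evidence-based gene models.
    """
    options = [
        ("maker_gff", maker_gff),
        ("est_pass", 1),
        ("altest_pass", 1),
        ("protein_pass", 1),
        ("rm_pass", 1),
        # Assumption: normal mode leaves model_pass at 0.
        ("model_pass", 1 if fused else 0),
    ]
    lines = ["#-----Re-annotation Using MAKER Derived GFF3"]
    lines += [f"{key}={value}" for key, value in options]
    return "\n".join(lines)


normal_block = reannotation_block("maker_mix_from_evidence-build.gff")
fused_block = reannotation_block("maker_mix_from_evidence-build.gff", fused=True)
print(fused_block)
```

The two blocks differ only in the model_pass line, which is the whole point: everything else about the second run stays the same.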

Once the run has finished, you may run gaas_maker_check_progress.sh and/or gaas_maker_check_progress_deeply.sh from GAAS to be sure the MAKER run is really complete. Then run gaas_maker_merge_outputs_from_datastore.pl to gather all the results, with statistics, in one place. The maker_mix.gff file you get is a mix of all the GFF data produced by MAKER; the same data are also available split by type of source in the output of this script.
7) Retrieve lost gene models

Despite the fused approach (see step 6), some gene models can be lost; this step allows you to rescue some of them.
It consists of adding to the reference annotation (which comes from either step 4 or step 6, depending on whether you want something very conservative or not) the gene models that are present only in another gene build. Use the agat_sp_complement_annotations.pl script for this. It adds all non-overlapping genes from a target build to the reference build.
Completing abinitio_build.gff with evidence_build.gff does not give the same result as completing evidence_build.gff with abinitio_build.gff. Depending on the case, one may be better than the other (you can try it both ways and run BUSCO to see which one looks better).
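Why the order matters can be seen with a toy model of the "add non-overlapping genes" idea. The Python sketch below is a simplification under my own assumptions (genes reduced to intervals, strand ignored, made-up coordinates and names); it is not how agat_sp_complement_annotations.pl is implemented.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    seqid: str
    start: int  # 1-based, inclusive, as in GFF
    end: int
    name: str


def overlaps(a, b):
    """True if two genes share at least one base on the same sequence."""
    return a.seqid == b.seqid and a.start <= b.end and b.start <= a.end


def complement(reference, target):
    """Return the reference genes plus the target genes overlapping no reference gene."""
    added = [g for g in target if not any(overlaps(g, r) for r in reference)]
    return list(reference) + added


# Made-up gene builds: ev1 and ab1 overlap, ev2 and ab2 are unique to their build.
evidence_build = [Gene("chr1", 100, 500, "ev1"), Gene("chr1", 2000, 2600, "ev2")]
abinitio_build = [Gene("chr1", 400, 900, "ab1"), Gene("chr1", 5000, 5800, "ab2")]

evidence_completed = [g.name for g in complement(evidence_build, abinitio_build)]
abinitio_completed = [g.name for g in complement(abinitio_build, evidence_build)]
print(evidence_completed)
print(abinitio_completed)
```

At the overlapping locus, the reference always wins: the first call keeps ev1, the second keeps ab1. That is why the two directions give different annotations and why it is worth comparing both with BUSCO.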
8) Annotate ncRNA using RFAM

I use rfam and the script gaas_rfam2grid.pl available in GAAS, and I report the result with agat_sq_rfam_analyzer.pl available in AGAT.
9) Check annotation

Run BUSCO on the proteins of the final annotation obtained in 8). To do so, extract the proteins with agat_sp_extract_sequences.pl using the GFF and the assembly FASTA file. If the BUSCO result is not good, e.g. too fragmented, you have to review some parameters, such as the maximum intron length. If you have too many or too few genes (with a good or a bad BUSCO), you can adjust whether MAKER keeps pure ab initio predictions (those without any supporting evidence).
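To compare the BUSCO result of the annotation with the one from the assembly (step 1), you can parse the one-line summaries and look at the differences. The Python sketch below assumes the usual one-line notation of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:..; the two summary strings are made-up values for illustration, not real results.

```python
import re


def parse_busco_summary(line):
    """Extract the C/S/D/F/M percentages from a BUSCO one-line summary."""
    pairs = re.findall(r"\b([CSDFM]):(\d+(?:\.\d+)?)%", line)
    return {key: float(value) for key, value in pairs}


# Made-up summaries: one for the assembly, one for the annotated proteins.
assembly = parse_busco_summary("C:95.0%[S:93.0%,D:2.0%],F:3.0%,M:2.0%,n:255")
annotation = parse_busco_summary("C:90.0%[S:88.0%,D:2.0%],F:7.0%,M:3.0%,n:255")

# Positive values mean the annotation has more of that category than the assembly.
delta = {key: round(annotation[key] - assembly[key], 1) for key in assembly}
print(delta)
```

In this made-up case the annotation loses complete BUSCOs mostly to the fragmented category, which is the situation where reviewing the intron length parameters is worth a try.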
10) Functional annotation

If you are happy with your final annotation, you can make your functional annotation with InterProScan and BLAST against UniProt.
Protocol here: https://nbisweden.github.io/workshop-genome_annotation_elixir/labs/functional_annotation
11) Submission to public archive

Prepare your data for submission.
Protocol here: https://nbisweden.github.io/workshop-genome_annotation_elixir/labs/submission
This step may force you to slightly modify your annotation, so the idea is to check that it passes all the checks of the EMBL validator.
You will have to re-run the statistics and BUSCO so that your results are in sync with what you publish!
12) Gather all results

Create a folder where you will gather all results and important information:

Get the coding gene annotation from step 11. Run agat_sp_statistics.pl and agat_sp_functional_statistics.pl on it. (Step 11 may have slightly modified the gene models you had, so use that annotation.)
Get the tRNA GFF file from step 6 and run agat_sp_statistics.pl on it.
Get the rRNA GFF file from step 6 and run agat_sp_statistics.pl on it.
Get the repeat GFF file from step 6 and run agat_sq_repeats_analyzer.pl on it.
Get the ncRNA GFF file from step 8 and run agat_sq_rfam_analyzer.pl on it.
Copy the MAKER parameter files you used, to keep track of what has been done (both for the evidence-based and the ab-initio evidence-driven annotation).
Copy your original assembly file and the one out of step 1), which must be the one you used for the annotation. Run gaas_fasta_statistics.pl on it.
Copy the BUSCO results you got on your assembly and on your final annotation (step 11).

Miscellaneous


For steps 7/9/10/11 you can limit yourself to the GFF/FASTA of the coding genes and keep the rest separate.
For steps 4/6/7 it is best to load the results in a genome browser to visually judge the differences and see whether everything went well. It can help you understand whether parameters such as the maximum intron length are well chosen.
By the way, if you have transcriptomes you can calculate the maximum intron length with agat_sp_manage_introns.pl. The value found is the one to put in the split_hit option of MAKER (don't forget that max_dna_len must be at least three times greater than the split_hit value).
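
The arithmetic behind that last point can be sketched in a few lines of Python. This is only an illustration of the reasoning (intron length from exon coordinates, then the three-times rule); the function names and coordinates are made up, and it does not reproduce what agat_sp_manage_introns.pl reports.

```python
def intron_lengths(exons):
    """Return the intron lengths of one transcript.

    exons is a list of (start, end) pairs, 1-based and inclusive as in GFF.
    """
    ordered = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def lengths_are_consistent(split_hit, max_dna_len):
    """Check that max_dna_len is at least three times the split_hit value."""
    return max_dna_len >= 3 * split_hit


# Made-up transcripts aligned to the genome.
transcripts = {
    "tx1": [(100, 200), (1201, 1300), (1501, 1600)],
    "tx2": [(5000, 5100), (25101, 25200)],
}
longest_intron = max(
    length for exons in transcripts.values() for length in intron_lengths(exons)
)
print(longest_intron)
print(lengths_are_consistent(split_hit=longest_intron, max_dna_len=100000))
```

In practice you would probably not take the single longest intron blindly, since one aberrant alignment can inflate it; looking at the distribution first is a reasonable precaution.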