@GDKO
Last active March 1, 2016 11:26
Eukaryotic gene prediction on a bacterial genome

The goal is to check whether ab initio gene finders run with eukaryotic parameters will predict genes with "real" introns in a bacterial genome.

Download the E. coli genome from NCBI

http://www.ncbi.nlm.nih.gov/nuccore/296142109

Save it as E_coli_BL21.fa and rename the FASTA header to >E_coli
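The header rename can be done in any editor. As a minimal sketch, assuming a single-record FASTA file, it could look like the following; the function name `rename_fasta_header` and the example record are illustrative and not part of the original workflow.

```python
def rename_fasta_header(fasta_text: str, new_id: str) -> str:
    """Replace the header of a single-record FASTA string with '>new_id'."""
    lines = fasta_text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not start with a FASTA header")
    # Sequence lines are kept untouched; only the first line changes.
    return "\n".join([f">{new_id}"] + lines[1:]) + "\n"


# Illustrative record, not the real NCBI entry.
example = ">some_accession Escherichia coli, complete genome\nACGTACGT\nTTGA\n"
renamed = rename_fasta_header(example, "E_coli")
print(renamed, end="")
```

A short, fixed header matters here because the later steps write the literal sequence name E_coli into the interval files.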

Run CEGMA

cegma -g E_coli_BL21.fa -T 8 -o e_coli

Completeness report

#      Statistics of the completeness of the genome based on 248 CEGs      #

              #Prots  %Completeness  -  #Total  Average  %Ortho

  Complete       34       13.71      -    42     1.24     23.53

   Group 1        7       10.61      -     9     1.29     28.57
   Group 2        7       12.50      -     8     1.14     14.29
   Group 3        9       14.75      -    11     1.22     22.22
   Group 4       11       16.92      -    14     1.27     27.27

   Partial       35       14.11      -    45     1.29     28.57

   Group 1        8       12.12      -    12     1.50     50.00
   Group 2        7       12.50      -     8     1.14     14.29
   Group 3        9       14.75      -    11     1.22     22.22
   Group 4       11       16.92      -    14     1.27     27.27
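The percentages in the report follow directly from the counts. The sketch below reproduces the "Complete" rows, assuming group sizes of 66, 56, 61 and 65 CEGs; these sizes are inferred from the reported percentages (they sum to 248) and are not stated in the report itself.

```python
TOTAL_CEGS = 248  # from the report header

# Assumption: group sizes inferred from the reported percentages.
GROUP_SIZES = {"Group 1": 66, "Group 2": 56, "Group 3": 61, "Group 4": 65}

# "#Prots" column of the "Complete" block in the report.
complete_found = {"Group 1": 7, "Group 2": 7, "Group 3": 9, "Group 4": 11}


def percent(found: int, total: int) -> float:
    """Percentage rounded to two decimals, as printed in the report."""
    return round(100.0 * found / total, 2)


overall = percent(sum(complete_found.values()), TOTAL_CEGS)
per_group = {g: percent(n, GROUP_SIZES[g]) for g, n in complete_found.items()}
print(overall, per_group)
```

In other words, only 34 of the 248 core eukaryotic genes are found as complete models in this bacterial genome.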

Run GeneMark with Eukaryotic parameters

gmes_petap.pl --ES --cores 8 --sequence E_coli_BL21.fa

GeneMark predicts 2034 genes, 797 of which (about 39%) have introns.
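One way such counts could be obtained from the GTF is to count CDS segments per gene: a gene with more than one CDS segment has at least one predicted intron. This is a sketch under that assumption, not necessarily how the numbers above were produced; the toy GTF lines and the `gene_id` attribute layout are illustrative.

```python
import re
from collections import Counter
from typing import Tuple

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def count_genes_with_introns(gtf_text: str) -> Tuple[int, int]:
    """Return (total genes, genes with more than one CDS segment)."""
    cds_per_gene = Counter()
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = GENE_ID.search(fields[8])
        if match:
            cds_per_gene[match.group(1)] += 1
    multi = sum(1 for n in cds_per_gene.values() if n > 1)
    return len(cds_per_gene), multi


# Toy input (hypothetical): gene 1_g has two CDS segments, gene 2_g has one.
toy_gtf = "\n".join([
    'E_coli\tGeneMark.hmm\tCDS\t100\t200\t.\t+\t0\tgene_id "1_g"; transcript_id "1_t";',
    'E_coli\tGeneMark.hmm\tCDS\t300\t400\t.\t+\t1\tgene_id "1_g"; transcript_id "1_t";',
    'E_coli\tGeneMark.hmm\tCDS\t900\t1200\t.\t-\t0\tgene_id "2_g"; transcript_id "2_t";',
])
total, with_introns = count_genes_with_introns(toy_gtf)
print(total, with_introns)

# Share of intron-containing genes reported above.
share = round(100.0 * 797 / 2034, 1)
```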

Extract splice sites

Convert the GeneMark GTF to GFF3 (script from MAKER)

genemark_gtf2gff3 genemark.gtf > genemark.gff3

Add intron features to the GFF3 (GenomeTools)

gt gff3 -addintrons yes -retainids yes genemark.gff3 > genemark_introns.gff3
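Conceptually, an intron is the gap between two consecutive exons of the same transcript. The following sketch illustrates that idea on plain coordinate pairs; it is not the GenomeTools implementation, and the function name `introns_between` is made up for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end), as in GFF3


def introns_between(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:  # skip adjacent or overlapping exons
            introns.append((prev_end + 1, next_start - 1))
    return introns


example_exons = [(300, 400), (100, 200), (500, 650)]
example_introns = introns_between(example_exons)
print(example_introns)
```

For the three example exons this yields two introns, 201 to 299 and 401 to 499.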

Extract splice-site regions (20 bp either side of each intron boundary)

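# Keep intron features only; columns 4, 5 and 7 are start, end and strand
awk -F'\t' '$3 == "intron"' genemark_introns.gff3 | cut -f 4,5,7 | \
perl -ane '
    # STDOUT collects 5-prime (donor) windows, STDERR collects 3-prime (acceptor) windows
    if ($F[2] eq "+") {
        printf STDOUT "E_coli\t%d\t%d\n", $F[0]-20, $F[0]+20;
        printf STDERR "E_coli\t%d\t%d\n", $F[1]-20, $F[1]+20;
    }
    else {
        # Minus strand: the donor is at the intron end; coordinates are written high-to-low
        printf STDERR "E_coli\t%d\t%d\n", $F[0]+20, $F[0]-20;
        printf STDOUT "E_coli\t%d\t%d\n", $F[1]+20, $F[1]-20;
    }' > 5_prime_splice.txt 2>3_prime_splice.txt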
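The window logic of the Perl one-liner can be restated as a small function, which makes the strand handling easier to check. This is a sketch that mirrors the command above; the name `splice_windows` and the example coordinates are illustrative.

```python
from typing import Tuple

Window = Tuple[int, int]


def splice_windows(start: int, end: int, strand: str, flank: int = 20) -> Tuple[Window, Window]:
    """Return (five_prime_window, three_prime_window) for one intron."""
    if strand == "+":
        five = (start - flank, start + flank)
        three = (end - flank, end + flank)
    else:
        # Minus strand: the 5' site sits at the higher coordinate, and the
        # window is written high-to-low, as in the Perl one-liner.
        five = (end + flank, end - flank)
        three = (start + flank, start - flank)
    return five, three


plus_windows = splice_windows(201, 299, "+")
minus_windows = splice_windows(201, 299, "-")
print(plus_windows, minus_windows)
```

Each window spans 41 positions: the boundary base plus 20 bases on either side.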

Extract splice-site sequences (script from BlaxterLab GitHub)

fastaqual_select.pl -f E_coli_BL21.fa -i 5_prime_splice.txt -int > 5_prime_splice.txt.fna
fastaqual_select.pl -f E_coli_BL21.fa -i 3_prime_splice.txt -int > 3_prime_splice.txt.fna
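For the logos to be comparable across strands, minus-strand windows need to be read as the reverse complement. The sketch below shows one way to interpret an interval whose first coordinate is larger than the second. Treating reversed coordinates this way is an assumption made for illustration; it is not confirmed behaviour of fastaqual_select.pl, and `extract_interval` is a hypothetical helper.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_interval(sequence: str, first: int, second: int) -> str:
    """Return a 1-based inclusive interval; reverse-complement if first > second."""
    low, high = min(first, second), max(first, second)
    if low < 1 or high > len(sequence):
        # Windows near the sequence ends can run off the sequence.
        raise ValueError("interval lies outside the sequence")
    forward = sequence[low - 1:high]
    if first <= second:
        return forward
    # Assumption: reversed coordinates mean "read the minus strand".
    return forward.translate(COMPLEMENT)[::-1]


toy_genome = "AACCGGTTAC"
plus_seq = extract_interval(toy_genome, 2, 5)
minus_seq = extract_interval(toy_genome, 5, 2)
print(plus_seq, minus_seq)
```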

Visualise Results

Upload the two .fna files to WebLogo

5' splice logo

3' splice logo
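A sequence logo summarises how conserved each position is across the aligned windows. As a rough numerical counterpart, the sketch below computes the information content per column as 2 bits minus the Shannon entropy. It omits any small-sample correction that logo tools may apply, and the four toy sequences are made up for the example.

```python
import math
from collections import Counter
from typing import List


def information_content(sequences: List[str]) -> List[float]:
    """Bits per column for equal-length DNA sequences: 2 - Shannon entropy."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    bits = []
    for column in zip(*sequences):
        counts = Counter(base.upper() for base in column)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        bits.append(2.0 - entropy)
    return bits


# Toy windows: the first column is fully conserved, the second is not.
toy_windows = ["GT", "GT", "GA", "GC"]
toy_bits = information_content(toy_windows)
print(toy_bits)
```

A fully conserved column scores 2 bits and a column with all four bases at equal frequency scores 0, so genuine splice sites would be expected to show tall letters at the intron boundary.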
