The goal is to check whether ab initio gene finders run with eukaryotic parameters will predict genes with "real" introns in a bacterial genome.
Genome sequence from NCBI (saved as E_coli_BL21.fa): http://www.ncbi.nlm.nih.gov/nuccore/296142109
Rename the FASTA header to >E_coli
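One way to do the rename is a sed substitution on the first line. This is only a sketch: it assumes the downloaded file holds a single record with its header on line 1, and the function name rename_header and the input name downloaded.fa are made up for illustration.

```shell
# rename_header FILE NAME
# Print FILE with its first line replaced by ">NAME".
# Assumption: single-record FASTA whose header is line 1.
# NAME must not contain "/" or "&", which are special to sed here.
rename_header() {
    sed "1s/.*/>$2/" "$1"
}
```

Usage would look like: rename_header downloaded.fa E_coli > E_coli_BL21.fa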
cegma -g E_coli_BL21.fa -T 8 -o e_coli
Completeness report
# Statistics of the completeness of the genome based on 248 CEGs #
          #Prots  %Completeness  -  #Total  Average  %Ortho
Complete      34          13.71  -      42     1.24   23.53
  Group 1      7          10.61  -       9     1.29   28.57
  Group 2      7          12.50  -       8     1.14   14.29
  Group 3      9          14.75  -      11     1.22   22.22
  Group 4     11          16.92  -      14     1.27   27.27
Partial       35          14.11  -      45     1.29   28.57
  Group 1      8          12.12  -      12     1.50   50.00
  Group 2      7          12.50  -       8     1.14   14.29
  Group 3      9          14.75  -      11     1.22   22.22
  Group 4     11          16.92  -      14     1.27   27.27
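The %Completeness values for the Complete and Partial rows match the number of CEGs found divided by the 248 CEGs in the set. A small check of that arithmetic (the function name pct_complete is illustrative only):

```shell
# pct_complete FOUND TOTAL
# Print FOUND as a percentage of TOTAL, rounded to two decimals.
pct_complete() {
    awk -v found="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * found / total }'
}

pct_complete 34 248   # Complete row
pct_complete 35 248   # Partial row
```

So only about 14% of the core eukaryotic genes are reported for this genome.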
gmes_petap.pl --ES --cores 8 --sequence E_coli_BL21.fa
GeneMark predicts 2034 genes, 797 of which have introns.
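Counts like these can be derived from the GTF by counting CDS segments per gene. The sketch below assumes each CDS segment is one tab-separated line with feature type CDS and a gene_id "..." attribute in column 9; check this against the actual genemark.gtf before relying on it. The function name count_intron_genes is made up.

```shell
# count_intron_genes GTF
# Count genes, and genes with more than one CDS segment (i.e. with introns).
count_intron_genes() {
    awk -F '\t' '
        $3 == "CDS" {
            # pull the gene_id value out of the attribute column
            if (match($9, /gene_id "[^"]+"/)) {
                id = substr($9, RSTART + 9, RLENGTH - 10)
                n[id]++
            }
        }
        END {
            for (id in n) { total++; if (n[id] > 1) multi++ }
            printf "%d genes, %d with introns\n", total, multi
        }
    ' "$1"
}
```

Usage would look like: count_intron_genes genemark.gtf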
Convert the GeneMark GTF to GFF3 (script from MAKER)
genemark_gtf2gff3 genemark.gtf > genemark.gff3
Add intron features to the GFF3 (GenomeTools)
gt gff3 -addintrons yes -retainids yes genemark.gff3 > genemark_introns.gff3
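The idea behind adding introns is that an intron fills the gap between two consecutive exons of a transcript. A minimal illustration of that coordinate arithmetic, not the GenomeTools implementation: it assumes the input is the start/end pairs of one transcript's exons, sorted by start, and the function name introns_from_exons is made up.

```shell
# introns_from_exons: read "start<TAB>end" exon pairs on stdin
# (one transcript, sorted by start) and print the intron
# coordinates that lie between consecutive exons.
introns_from_exons() {
    awk -F '\t' '
        NR > 1 && $1 > prev_end + 1 { printf "%d\t%d\n", prev_end + 1, $1 - 1 }
        { prev_end = $2 }
    '
}
```

For exons 100-200, 301-400 and 500-600 this gives introns 201-300 and 401-499.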
Extract splice-site regions (20 bp either side of each intron boundary)
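grep intron genemark_introns.gff3 | cut -f 4,5,7 | \
perl -ane '
    # @F = (intron start, intron end, strand)
    # 5-prime windows go to STDOUT, 3-prime windows go to STDERR
    if ($F[2] eq "+") {
        printf STDOUT "E_coli\t%d\t%d\n", $F[0]-20, $F[0]+20;
        printf STDERR "E_coli\t%d\t%d\n", $F[1]-20, $F[1]+20;
    }
    else {
        # minus strand: the 5-prime site is at the intron end; coordinates are written high-to-low
        printf STDERR "E_coli\t%d\t%d\n", $F[0]+20, $F[0]-20;
        printf STDOUT "E_coli\t%d\t%d\n", $F[1]+20, $F[1]-20;
    }' > 5_prime_splice.txt 2>3_prime_splice.txt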
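To make the window logic easier to check by hand, here is the same arithmetic restated in awk, printing a labelled window per splice site instead of writing two files. The function name splice_windows and the labels 5prime/3prime are illustrative only.

```shell
# splice_windows: read "start<TAB>end<TAB>strand" intron records on stdin
# and print "label<TAB>from<TAB>to" for each splice site.
# On the minus strand the 5-prime site is the intron end, and the
# window is written high-to-low, as in the Perl one-liner.
splice_windows() {
    awk -F '\t' -v flank=20 '
        $3 == "+" {
            printf "5prime\t%d\t%d\n", $1 - flank, $1 + flank
            printf "3prime\t%d\t%d\n", $2 - flank, $2 + flank
            next
        }
        {
            printf "5prime\t%d\t%d\n", $2 + flank, $2 - flank
            printf "3prime\t%d\t%d\n", $1 + flank, $1 - flank
        }
    '
}
```

For a plus-strand intron at 1000-1500 the windows are 980-1020 (5') and 1480-1520 (3'); for the same intron on the minus strand they are 1520-1480 (5') and 1020-980 (3').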
Extract splice-site sequences (script from BlaxterLab github)
fastaqual_select.pl -f E_coli_BL21.fa -i 5_prime_splice.txt -int > 5_prime_splice.txt.fna
fastaqual_select.pl -f E_coli_BL21.fa -i 3_prime_splice.txt -int > 3_prime_splice.txt.fna
Upload both .fna files to WebLogo to draw sequence logos of the 5' and 3' splice sites
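For a quick look without WebLogo, per-position base counts give the same information in table form. This sketch assumes each record in the .fna file has its sequence on a single line and that all sequences have the same length; the function name base_counts is made up.

```shell
# base_counts FASTA
# Print, for every position, how many sequences have A, C, G and T there.
base_counts() {
    awk '
        /^>/ { next }                      # skip header lines
        {
            seq = toupper($0)
            if (length(seq) > width) width = length(seq)
            for (i = 1; i <= length(seq); i++) count[i, substr(seq, i, 1)]++
        }
        END {
            print "pos\tA\tC\tG\tT"
            for (i = 1; i <= width; i++)
                printf "%d\t%d\t%d\t%d\t%d\n", i, count[i, "A"], count[i, "C"], count[i, "G"], count[i, "T"]
        }
    ' "$1"
}
```

Usage would look like: base_counts 5_prime_splice.txt.fna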