Skip to content

Instantly share code, notes, and snippets.

@darencard
Last active April 18, 2023 13:35
Show Gist options
  • Star 8 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save darencard/fcb32168c243b92734e85c5f8b59a1c3 to your computer and use it in GitHub Desktop.
Save darencard/fcb32168c243b92734e85c5f8b59a1c3 to your computer and use it in GitHub Desktop.
Script to produce estimates of gene structure

Please see the most up-to-date version of this protocol on my blog at https://darencard.net/blog/.

Inferring the structure of gene annotations

When annotating genomes it is often desireable to know the overall structure of genes, including information like exon and intron lengths among other metrics. Here is a program genestats that will calculate such measures for a user.

#!/usr/bin/env bash

usage()
{
cat << EOF
genestats
Version 1.0 (10 January, 2018)
License: GNU GPLv2
To report bugs or errors, please contact Daren Card (dcard@uta.edu).
This script is provided as-is, with no support and no guarantee of proper or
desirable functioning.

This script calculates several gene structure measures based on a GFF annotation file.
The bgzip and tabix programs (http://www.htslib.org/doc/tabix.html) and bedtools
(https://bedtools.readthedocs.io/en/latest/) must be installed and in the \$PATH.
The user supplies the GFF3 file as a single argument to the command and the output is a
12 column tab-delimited text written to STDOUT with the following columns:
1. transcript ID
2. transcript sequence length
3. number of exons
4. total exon sequence length
5. number of introns
6. total intron sequence length
7. number of CDS chunks
8. total CDS sequence length
9. number of 5' UTR sequences
10. total 5' UTR sequence length
11. number of 3' UTR sequences
12. total 3' UTR sequence length

To work optimally and produce reliable results, the following tags in column 3 must be
included within the GFF file:
mRNA = transcript
exon = exons
CDS = coding sequence
five_prime_UTR = 5' UTR sequence
three_prime_UTR = 3' UTR sequence

The first three tags should be present in any GFF file but the latter two may not be
depending on the source of the file. In those cases, estimates of UTR features will be
erroneous, but the other features should be reported correctly. Note that temporary files
are created in the working directory as the program runs.

USAGE:
genestats <file.gff>

EOF
}

if [[ -z $1 ]] || [[ $1 == "-h" ]] || [[ $1 == "-help" ]]
then
	usage
	exit 1
fi

# this script basically pastes together 6 queries of the GFF file for each mRNA annotation
# for the mRNA, exons, introns, CDS, and UTRs (x2)
# for each, tabix is used to rapidly pull out the feature from the GFF file and
# the appropriate feature lines are then used to produce counts and sequence lengths
# for the introns, bedtools is used to essentially find the features not matching exons
# (i.e. inverse match)

cat $1 | sort -k1,1 -k4,4n -k5,5n | bgzip -c > $1.gz && \
tabix -p gff $1.gz && \
cat $1 | \
awk '{ if ($3 == "mRNA") print $0 }' | \
awk -F "ID=|;" '{ print $1, $2 }' | \
while read line; do \
parent=`echo $line | awk '{ print $9 }'`; \
chrom=`echo $line | awk '{ print $1 }'`; \
start=`echo $line | awk '{ print $4 }'`; \
end=`echo $line | awk '{ print $5 }'`; \
paste <(echo $parent) \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "mRNA") print $0 }' | \
	awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print sum }') \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "exon") print $0 }' | \
	awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }') \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "mRNA") print $0 }' > mRNA.txt && \
	tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "exon") print $0 }' > exons.txt && \
	bedtools subtract -a mRNA.txt -b exons.txt | \
	awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }' && \
	rm mRNA.txt exons.txt) \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "CDS") print $0 }' | \
	awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }') \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "five_prime_UTR") print $0 }' | \
	awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }') \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "three_prime_UTR") print $0 }' | \
	awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }'); \
done

Let's run this on a test GFF file to demonstrate it. Here is the GFF file we will use, called test.gff.

##gff-version 3
scaffold-162	.	contig	1	2399525	.	.	.	ID=scaffold-162;Name=scaffold-162
scaffold-162	maker	gene	1481900	1501151	.	+	.	ID=BoaCon13386;Name=BoaCon13386;Alias=maker-scaffold-162-augustus-gene-5.4;
scaffold-162	maker	mRNA	1481900	1501151	.	+	.	ID=BoaCon13386-RA;Parent=BoaCon13386;Name=BoaCon13386-RA;Alias=maker-scaffold-162-augustus-gene-5.4-mRNA-1;_AED=0.05;_QI=0|0|0|0.8|0.75|0.6|5|0|297;_eAED=0.25;
scaffold-162	maker	exon	1481900	1481924	.	+	.	ID=BoaCon13386-RA:exon:218;Parent=BoaCon13386-RA;
scaffold-162	maker	exon	1485901	1486158	.	+	.	ID=BoaCon13386-RA:exon:219;Parent=BoaCon13386-RA;
scaffold-162	maker	exon	1490203	1490387	.	+	.	ID=BoaCon13386-RA:exon:220;Parent=BoaCon13386-RA;
scaffold-162	maker	exon	1495786	1495929	.	+	.	ID=BoaCon13386-RA:exon:221;Parent=BoaCon13386-RA;
scaffold-162	maker	exon	1500870	1501151	.	+	.	ID=BoaCon13386-RA:exon:222;Parent=BoaCon13386-RA;
scaffold-162	maker	CDS	1481900	1481924	.	+	0	ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162	maker	CDS	1485901	1486158	.	+	2	ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162	maker	CDS	1490203	1490387	.	+	2	ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162	maker	CDS	1495786	1495929	.	+	0	ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162	maker	CDS	1500870	1501151	.	+	0	ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162	maker	gene	1641672	1650129	.	-	.	ID=BoaCon13387;Name=BoaCon13387;Alias=maker-scaffold-162-augustus-gene-5.5;
scaffold-162	maker	mRNA	1641672	1650129	.	-	.	ID=BoaCon13387-RA;Parent=BoaCon13387;Name=BoaCon13387-RA;Alias=maker-scaffold-162-augustus-gene-5.5-mRNA-1;_AED=0.21;_QI=0|0|0|0.5|0.66|0.75|4|0|130;_eAED=0.06;
scaffold-162	maker	exon	1650055	1650129	.	-	.	ID=BoaCon13387-RA:exon:226;Parent=BoaCon13387-RA;
scaffold-162	maker	exon	1647269	1647309	.	-	.	ID=BoaCon13387-RA:exon:225;Parent=BoaCon13387-RA;
scaffold-162	maker	exon	1644412	1644586	.	-	.	ID=BoaCon13387-RA:exon:224;Parent=BoaCon13387-RA;
scaffold-162	maker	exon	1641672	1641773	.	-	.	ID=BoaCon13387-RA:exon:223;Parent=BoaCon13387-RA;
scaffold-162	maker	CDS	1650055	1650129	.	-	0	ID=BoaCon13387-RA:cds;Parent=BoaCon13387-RA;
scaffold-162	maker	CDS	1647269	1647309	.	-	0	ID=BoaCon13387-RA:cds;Parent=BoaCon13387-RA;
scaffold-162	maker	CDS	1644412	1644586	.	-	1	ID=BoaCon13387-RA:cds;Parent=BoaCon13387-RA;
scaffold-162	maker	CDS	1641672	1641773	.	-	0	ID=BoaCon13387-RA:cds;Parent=BoaCon13387-RA;
scaffold-162	maker	gene	2299416	2309756	.	+	.	ID=BoaCon13393;Name=BoaCon13393;Alias=maker-scaffold-162-augustus-gene-6.11;
scaffold-162	maker	mRNA	2299416	2309756	.	+	.	ID=BoaCon13393-RA;Parent=BoaCon13393;Name=BoaCon13393-RA;Alias=maker-scaffold-162-augustus-gene-6.11-mRNA-1;_AED=0.02;_QI=171|1|1|1|1|1|2|4907|158;_eAED=0.02;
scaffold-162	maker	exon	2299416	2299884	.	+	.	ID=BoaCon13393-RA:exon:227;Parent=BoaCon13393-RA;
scaffold-162	maker	exon	2304671	2309756	.	+	.	ID=BoaCon13393-RA:exon:228;Parent=BoaCon13393-RA;
scaffold-162	maker	five_prime_UTR	2299416	2299586	.	+	.	ID=BoaCon13393-RA:five_prime_utr;Parent=BoaCon13393-RA;
scaffold-162	maker	CDS	2299587	2299884	.	+	0	ID=BoaCon13393-RA:cds;Parent=BoaCon13393-RA;
scaffold-162	maker	CDS	2304671	2304849	.	+	2	ID=BoaCon13393-RA:cds;Parent=BoaCon13393-RA;
scaffold-162	maker	three_prime_UTR	2304850	2309756	.	+	.	ID=BoaCon13393-RA:three_prime_utr;Parent=BoaCon13393-RA;
scaffold-162	maker	gene	2319095	2328392	.	+	.	ID=BoaCon13395;Name=BoaCon13395;Alias=maker-scaffold-162-augustus-gene-6.12;
scaffold-162	maker	mRNA	2319095	2328392	.	+	.	ID=BoaCon13395-RA;Parent=BoaCon13395;Name=BoaCon13395-RA;Alias=maker-scaffold-162-augustus-gene-6.12-mRNA-1;_AED=0.08;_QI=0|1|0.8|1|1|1|5|789|478;_eAED=0.08;
scaffold-162	maker	exon	2319095	2319800	.	+	.	ID=BoaCon13395-RA:exon:229;Parent=BoaCon13395-RA;
scaffold-162	maker	exon	2322272	2322404	.	+	.	ID=BoaCon13395-RA:exon:230;Parent=BoaCon13395-RA;
scaffold-162	maker	exon	2323841	2323973	.	+	.	ID=BoaCon13395-RA:exon:231;Parent=BoaCon13395-RA;
scaffold-162	maker	exon	2325881	2326087	.	+	.	ID=BoaCon13395-RA:exon:232;Parent=BoaCon13395-RA;
scaffold-162	maker	exon	2327346	2328392	.	+	.	ID=BoaCon13395-RA:exon:233;Parent=BoaCon13395-RA;
scaffold-162	maker	CDS	2319095	2319800	.	+	0	ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162	maker	CDS	2322272	2322404	.	+	2	ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162	maker	CDS	2323841	2323973	.	+	1	ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162	maker	CDS	2325881	2326087	.	+	0	ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162	maker	CDS	2327346	2327603	.	+	0	ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162	maker	three_prime_UTR	2327604	2328392	.	+	.	ID=BoaCon13395-RA:three_prime_utr;Parent=BoaCon13395-RA;
scaffold-162	maker	gene	2049228	2051664	.	+	.	ID=BoaCon13389;Name=BoaCon13389;Alias=augustus_masked-scaffold-162-processed-gene-6.2;
scaffold-162	maker	mRNA	2049228	2051664	.	+	.	ID=BoaCon13389-RA;Parent=BoaCon13389;Name=BoaCon13389-RA;Alias=augustus_masked-scaffold-162-processed-gene-6.2-mRNA-1;_AED=0.17;_QI=0|0|0|0.33|1|1|3|0|99;_eAED=0.16;
scaffold-162	maker	exon	2049228	2049258	.	+	.	ID=BoaCon13389-RA:exon:234;Parent=BoaCon13389-RA;
scaffold-162	maker	exon	2049470	2049538	.	+	.	ID=BoaCon13389-RA:exon:235;Parent=BoaCon13389-RA;
scaffold-162	maker	exon	2051465	2051664	.	+	.	ID=BoaCon13389-RA:exon:236;Parent=BoaCon13389-RA;
scaffold-162	maker	CDS	2049228	2049258	.	+	0	ID=BoaCon13389-RA:cds;Parent=BoaCon13389-RA;
scaffold-162	maker	CDS	2049470	2049538	.	+	2	ID=BoaCon13389-RA:cds;Parent=BoaCon13389-RA;
scaffold-162	maker	CDS	2051465	2051664	.	+	2	ID=BoaCon13389-RA:cds;Parent=BoaCon13389-RA;

Now it is very easy to run genestats.

genestats test.gff

Which will write the following output to STDOUT.

BoaCon13386-RA	19251	5	889	4	18354	5	889	0	0	0	0
BoaCon13387-RA	8457	4	389	3	8062	4	389	0	0	0	0
BoaCon13393-RA	10340	2	5553	1	4785	2	475	1	170	1	4906
BoaCon13395-RA	9297	5	2221	4	7068	5	1432	0	0	1	788
BoaCon13389-RA	2436	3	297	2	2135	3	297	0	0	0	0

It is possible to redirect this to an output file, but the user may also wish to calculate some basic statistics like mean exon and intron lengths, like this.

genestats test.gff | \
awk -v OFS="\t" '{ exon_sum += $3; exon_len += $4; intron_sum += $5; intron_len += $6 } END { print exon_sum, exon_len / exon_sum, intron_sum, intron_len / intron_sum }'

Which provides the number of exons, mean exon length, the number of introns, and the mean intron length.

@netocostaferreira
Copy link

Hello @darencard,
I am analyzing a gene family in a legume. I would like to use your tool to extract some information about the genes I'm working on.
The script is already running, but it is giving the error below in the your test file (and in my working file too):

./genestats.sh test.gff3

./genestats.sh: line 73: bedtools: command not found
BoaCon13386-RA 19251 5 889 0 0 5 889 0 0 0 0
./genestats.sh: line 73: bedtools: command not found
BoaCon13387-RA 8457 4 389 0 0 4 389 0 0 0 0
./genestats.sh: line 73: bedtools: command not found
BoaCon13393-RA 10340 2 5553 0 0 2 475 1 170 1 4906
./genestats.sh: line 73: bedtools: command not found
BoaCon13395-RA 9297 5 2221 0 0 5 1432 0 0 1 788
./genestats.sh: line 73: bedtools: command not found
BoaCon13389-RA 2436 3 297 0 0 3 297 0 0 0 0

In the output, the colums '5. number of introns' and '6. total intron sequence length' are always '0'.

Can you help me solve it?
Congratulations for your work!
I thank you in advance.

@Carol-Symbiomics
Copy link

Hi @darencard,

I found your gene_structure_stats script very useful. I would like to know how should I cite it? Can I refer to any particular publication?

Cheers,
Carol

@darencard
Copy link
Author

Hi @Carol-Symbiomics,

No need to cite in any formal way, as this isn't in any paper. Can just provide the URL for the record.

Daren

@cejackson
Copy link

Hi Daren,

Creative script and thanks for sharing, but a small error. If using this for GFF files, which are 1-based coordinates, length calculations should be sum += $5 - $4 + 1 with the added +1.

Otherwise, each exon, etc is 1-base shorter than actual length, making the total length n bases shorter, where n is number of exons.
Also noticeable looking at total CDS lengths, which should multiples of 3, but the off-by-one length errors skew CDS lengths (see your CDS lengths in test output file).

See https://standage.github.io/on-genomic-interval-notation.html for clarification if needed.

Cheers,
Craig

@darencard
Copy link
Author

Thanks Craig, for reporting that issue to me. Those damn genomics coordinates can be such a pain, but looks like you are correct and I can update accordingly.

@pselva7
Copy link

pselva7 commented Feb 25, 2021

Works charm... Helpful!!!

@abacolla
Copy link

Hi Daren:

Thanks for the script. I tried it on the test file and it works fine for me. I now have Ensembl mouse m39 GTF or GFF3 files, how can I use your script with them?

Thanks for your time,
Albino

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment