GFF is a common format for storing genomic feature annotations. In gene annotations, a single transcript is spread over multiple lines, because its exons and CDS segments are separated by gaps (introns) along the genome sequence. So while it is easy to extract the exon and CDS lines, it can be difficult to group them by their parent (e.g., transcript) ID and perform downstream operations. For this reason, even extracting the full CDS sequence from a GFF file, which seems trivial, can be tricky.
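To make the difficulty concrete, here is a minimal Python sketch of what doing this by hand involves: collect the CDS lines, group them by their Parent attribute, sort each group by coordinate, join the pieces, and reverse-complement transcripts on the minus strand. The toy genome, the GFF lines, and the function names are all invented for illustration, and the sketch ignores details a real tool has to handle (phase, partial features, multi-FASTA parsing).

```python
from collections import defaultdict

# Toy data, invented for illustration: one short contig and a few GFF3 lines.
GENOME = {"chr1": "GGATGGCCGTAAGTTTTAACC"}
GFF_LINES = [
    "chr1\ttoy\tCDS\t14\t19\t.\t+\t0\tID=cds2;Parent=tx1",
    "chr1\ttoy\texon\t1\t8\t.\t+\t.\tID=ex1;Parent=tx1",
    "chr1\ttoy\tCDS\t3\t8\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\ttoy\tCDS\t3\t8\t.\t-\t0\tID=cds3;Parent=tx2",
    "chr1\ttoy\tCDS\t14\t19\t.\t-\t0\tID=cds4;Parent=tx2",
]


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def spliced_cds(gff_lines, genome):
    """Group CDS lines by Parent ID and return {parent_id: spliced sequence}."""
    segments = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        seqid, ftype, start, end, strand, attrs = (
            cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6], cols[8],
        )
        if ftype != "CDS":
            continue
        # Attributes look like "ID=cds1;Parent=tx1"; Parent may be a comma list.
        attr_map = dict(a.split("=", 1) for a in attrs.split(";") if "=" in a)
        for parent in attr_map.get("Parent", "").split(","):
            if parent:
                segments[parent].append((seqid, start, end, strand))

    result = {}
    for parent, segs in segments.items():
        segs.sort(key=lambda s: s[1])  # ascending genomic start
        # GFF coordinates are 1-based and inclusive.
        seq = "".join(genome[sid][s - 1:e] for sid, s, e, _ in segs)
        if segs[0][3] == "-":
            seq = reverse_complement(seq)
        result[parent] = seq
    return result


cds = spliced_cds(GFF_LINES, GENOME)
for tx_id, seq in cds.items():
    print(tx_id, seq)
```

Even in this toy form, the bookkeeping (unordered lines, 1-based coordinates, strand handling) is easy to get wrong, which is the motivation for handing the job to a dedicated tool.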
Here we'll overcome this difficulty using the gffread tool. Installation is pretty easy and is documented in the GitHub README. gffread has a lot of options, but here we'll just document one that extracts the spliced CDS for each GFF transcript (-x option). Note that you can do the same thing for exons (-w option) and can also produce the protein sequence (-y option).
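As a rough sketch of how those three options fit together, the Python snippet below assembles a gffread command line and runs it only if the tool and the input files are present. The file names (annotation.gff, genome.fa, cds.fa) and the helper function are placeholders made up for this example; the -g option, which points gffread at the genome FASTA, is not mentioned above but is needed for any sequence output.

```python
import shutil
import subprocess
from pathlib import Path


def gffread_command(gff, genome_fasta, cds_out=None, exon_out=None, protein_out=None):
    """Build a gffread argument list (hypothetical helper, not part of gffread)."""
    cmd = ["gffread", "-g", genome_fasta]
    if cds_out:
        cmd += ["-x", cds_out]      # spliced CDS for each transcript
    if exon_out:
        cmd += ["-w", exon_out]     # spliced exons for each transcript
    if protein_out:
        cmd += ["-y", protein_out]  # protein translation of each CDS
    cmd.append(gff)
    return cmd


# Placeholder file names; substitute your own annotation and genome.
cmd = gffread_command("annotation.gff", "genome.fa", cds_out="cds.fa")
print(" ".join(cmd))

# Only run if gffread is installed and the inputs actually exist.
if shutil.which("gffread") and Path("annotation.gff").exists() and Path("genome.fa").exists():
    subprocess.run(cmd, check=True)
```

The same command can of course be typed directly in a shell; wrapping it in a function just makes it easy to request the exon or protein output alongside the CDS.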
Let's extra