@darencard
Last active February 21, 2017 15:50
Parse transcripts out of an NCBI GFF based on transcript IDs

Script to parse an NCBI GFF based on transcript IDs (e.g., XM_000..). The query IDs are read from the first column of the RNA metadata table and must not include the version suffix (.1, .2, etc.).
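
If a query list carries versioned accessions, the suffix can be stripped before running the script. This is a minimal sketch, assuming one ID per line on standard input; the helper name `strip_version` and the sample accessions are hypothetical and not part of the original script.

```shell
# Hypothetical helper: drop a trailing ".<digits>" version suffix from each ID on stdin.
# IDs without a version suffix pass through unchanged.
strip_version() {
  sed -E 's/\.[0-9]+$//'
}

# Example with made-up accessions (one versioned, one already unversioned).
printf 'XM_000000001.1\nXM_000000002\n' | strip_version
```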

Columns returned:

  1. chromosome/scaffold
  2. start position of transcript
  3. end position of transcript
  4. strand
  5. transcript number
  6. gene number
  7. gene ID (NCBI)
  8. transcript ID (same as query)
  9. transcript ID/version
  10. gene symbol
# For each unversioned transcript ID (column 1 of the metadata table), take the first
# non-exon GFF record that mentions it and reshape that record into one TSV row.
# The GFF is split on tabs so a source column containing spaces cannot shift the type column.
cut -f 1 GCF_000186305.1_Python_molurus_bivittatus-5.0.2_rna_metadata_parsed.tsv | \
while read -r tx; do grep -m 1 -w "$tx" \
<(awk -F '\t' '$3 != "exon"' GCF_000186305.1_Python_molurus_bivittatus-5.0.2_genomic.gff) | \
cut -f 1,4,5,7,9 | awk -v OFS="\t" -F ';' '{ print $1, $2, $3, $6 }' | tr ',' '\t' | \
sed -e 's/ID=//g' -e 's/Parent=//g' -e 's/Dbxref=//g' -e 's/GeneID://g' -e 's/Genbank://g' -e 's/gene=//g' | \
awk -v OFS="\t" -F '.' '{ print $1, $2, $3 }' | \
awk -v OFS="\t" '{ print $1"."$2, $3, $4, $5, $6, $7, $8, $9, $9"."$10, $11 }'; \
done \
> GCF_000186305.1_Python_molurus_bivittatus-5.0.2_genomic_rna_metadata_parsed.tsv
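
The pipeline above picks the gene symbol by position (the sixth `;`-separated attribute), so it depends on the attribute order in column 9. As a possible alternative, not the original method, the sketch below looks up attributes by key in a single awk pass. The function name `parse_tx` and the sample record are hypothetical. Unlike `grep -w`, it only matches the query against the `Genbank:` accession in `Dbxref`, which is an assumption about where the transcript ID appears.

```shell
# Hypothetical alternative: parse column 9 by key instead of by position.
# Usage: parse_tx UNVERSIONED_TRANSCRIPT_ID < file.gff
parse_tx() {
  awk -F '\t' -v OFS='\t' -v tx="$1" '
    $3 == "exon" || /^#/ { next }            # skip exon records and comment lines
    {
      split("", attr)                        # portable way to clear the array
      n = split($9, kv, ";")
      for (i = 1; i <= n; i++) {             # store each key=value attribute
        eq = index(kv[i], "=")
        if (eq > 0) attr[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
      }
      gene_id = ""; acc = ""
      m = split(attr["Dbxref"], xr, ",")     # Dbxref holds comma-separated cross-references
      for (j = 1; j <= m; j++) {
        if (xr[j] ~ /^GeneID:/)  gene_id = substr(xr[j], 8)
        if (xr[j] ~ /^Genbank:/) acc     = substr(xr[j], 9)
      }
      base = acc
      sub(/\.[0-9]+$/, "", base)             # accession without its version suffix
      if (base == tx) {
        # same ten columns as the pipeline above
        print $1, $4, $5, $7, attr["ID"], attr["Parent"], gene_id, base, acc, attr["gene"]
        exit                                 # first match only, like grep -m 1
      }
    }'
}

# Example on a made-up mRNA record.
printf 'NW_000000001.1\tGnomon\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0;Dbxref=GeneID:123,Genbank:XM_000000001.1;Name=XM_000000001.1;gbkey=mRNA;gene=ABC1\n' \
  | parse_tx XM_000000001
```

Called once per query ID this still rescans the GFF each time, just as the original loop does. For large ID lists, loading the IDs into an awk array first would avoid the repeated scans.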