@fjossinet
Created June 16, 2012 18:42
Select CDS by keyword in all E. coli genomes
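#!/bin/bash
# Usage: <script> <keyword>
# Prints the FASTA header of every CDS whose product contains <keyword>,
# across the complete E. coli RefSeq genomes listed by NCBI.
query=$1
if [ -z "$query" ]; then
    echo "Usage: $0 <keyword>" >&2
    exit 1
fi
# Chromosome accessions (NC_* or NZ_*) scraped from the NCBI genome listing.
genome_ids=$(wget -qO - "http://www.ncbi.nlm.nih.gov/genome/genomes/167?&subset=complete&limit=refseq" | grep 'title="chromosome">Chr' | sed -E 's/.+(NC_.+|NZ_.+)/\1/' | cut -d \< -f 1)
for genome_id in $genome_ids
do
    # Fetch the GenBank record of this genome as XML.
    wget -qO - "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$genome_id&rettype=gb&retmode=xml" > genome.xml
    # GI numbers of the CDS features whose product qualifier contains the query.
    gene_ids=$(xmllint --xpath "//GBFeature[GBFeature_key[.='CDS'] and GBFeature_quals/GBQualifier[GBQualifier_name[.='product'] and GBQualifier_value[contains(.,\"$query\")]]]" genome.xml | grep "GI:" | sed -E 's/.+GI:(.+)<.+/\1/')
    for gene_id in $gene_ids
    do
        # Print the FASTA header line without the leading '>'.
        wget -qO - "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$gene_id&rettype=fasta" | grep "^>" | cut -d \> -f 2
    done
done
# -f: do not complain if no genome was fetched and the file was never created.
rm -f genome.xml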
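
The accession-extraction step of the script (the grep, sed, cut pipeline applied to the genome listing) can be tried offline. The sketch below runs the same three filters over a few hand-written table rows. The row markup, the accession values, and the names `sample_html` and `extract_accessions` are assumptions made for illustration; the live NCBI page may be structured differently.

```shell
#!/bin/bash
# Hypothetical sample rows standing in for the NCBI genome listing.
# The real page markup may differ; only the filter logic is taken from the script.
sample_html='<tr><td title="chromosome">Chr</td><td>NC_000913.3</td></tr>
<tr><td title="plasmid">Plsm</td><td>NC_002128.1</td></tr>
<tr><td title="chromosome">Chr</td><td>NZ_CP009273.1</td></tr>'

# Same three filters as the script:
#   grep keeps only chromosome rows,
#   sed drops everything before the last NC_/NZ_ accession on the line,
#   cut trims the markup that follows the accession.
extract_accessions() {
    grep 'title="chromosome">Chr' \
        | sed -E 's/.+(NC_.+|NZ_.+)/\1/' \
        | cut -d '<' -f 1
}

printf '%s\n' "$sample_html" | extract_accessions
```

On this sample the pipeline prints `NC_000913.3` and `NZ_CP009273.1`, one per line. The plasmid row is dropped by the grep, so only chromosome accessions reach the loop that fetches GenBank records.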