@lindenb
Last active December 19, 2015 22:08
NCBI Biosystems to BED: join "ncbi:biosystem2gene", "ncbi:biosystem-label" and "biomart-ensembl:gene"
#!/bin/bash
CURLOPT=" "
# All three inputs are tab-delimited, so every join/sort/cut uses a tab separator.
TAB=$'\t'
# 1. Join biosystems_gene (bsid, gene id, score) with columns 1 and 4 of bsid2info (bsid, name) on the bsid.
# 2. Re-sort on the Entrez gene id and join with the Ensembl BioMart export (chromosome, start, end, entrezgene).
# 3. Reorder to chromosome, start, end, gene id, bsid, score, name; replace spaces with "_";
#    sort by position, compress with bgzip and index with tabix.
LC_ALL=C join -t "${TAB}" -1 1 -2 1 \
  <(curl ${CURLOPT} "ftp://ftp.ncbi.nih.gov/pub/biosystems/biosystems.20130711/biosystems_gene.gz" | gunzip -c | LC_ALL=C sort -t "${TAB}" -k1,1) \
  <(curl ${CURLOPT} "ftp://ftp.ncbi.nih.gov/pub/biosystems/biosystems.20130711/bsid2info.gz" | gunzip -c | cut -d "${TAB}" -f1,4 | LC_ALL=C sort -t "${TAB}" -k1,1) |
LC_ALL=C sort -t "${TAB}" -k2,2 |
LC_ALL=C join -t "${TAB}" -1 4 -2 2 \
  <(curl ${CURLOPT} -d "query=%3CQuery%20virtualSchemaName%3D%22default%22%20formatter%3D%22TSV%22%20header%3D%220%22%20uniqueRows%3D%221%22%20count%3D%22%22%20datasetConfigVersion%3D%220.6%22%3E%3CDataset%20name%3D%22hsapiens_gene_ensembl%22%20interface%3D%22default%22%3E%3CAttribute%20name%3D%22chromosome_name%22%2F%3E%3CAttribute%20name%3D%22start_position%22%2F%3E%3CAttribute%20name%3D%22end_position%22%2F%3E%3CAttribute%20name%3D%22entrezgene%22%2F%3E%3C%2FDataset%3E%3C%2FQuery%3E" "http://www.biomart.org/biomart/martservice/result" | awk -F '\t' '($4!="")' | LC_ALL=C sort -t "${TAB}" -k4,4) \
  - |
awk -F '\t' '{OFS="\t";print $2,$3,$4,$1,$5,$6,$7;}' |
tr " " "_" |
LC_ALL=C sort -t "${TAB}" -k1,1 -k2,2n -k3,3n -k4 |
bgzip -c > ncbibiosystem.bed.gz && tabix -p bed -f ncbibiosystem.bed.gz
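To see what the two joins do without downloading anything, here is a minimal sketch that runs the same sort/join/awk steps on three tiny tables. Every id, name and coordinate in it is made up for illustration, and it writes the intermediate tables to temporary files (an assumption of this sketch, not what the script above does) so that each step can be inspected.

```shell
#!/bin/bash
# Toy version of the two joins; all ids, names and coordinates are invented.
TAB=$(printf '\t')
tmp=$(mktemp -d)

# Stand-in for biosystems_gene: bsid, gene id, score
printf 'b1\tg2\t5\nb1\tg1\t3\nb2\tg2\t1\n' | LC_ALL=C sort -t "$TAB" -k1,1 > "$tmp/gene.tsv"
# Stand-in for columns 1 and 4 of bsid2info: bsid, name
printf 'b2\tCell cycle\nb1\tDNA repair\n' | LC_ALL=C sort -t "$TAB" -k1,1 > "$tmp/info.tsv"
# Stand-in for the BioMart export: chromosome, start, end, gene id
printf 'chr2\t50\t80\tg2\nchr1\t100\t200\tg1\n' | LC_ALL=C sort -t "$TAB" -k4,4 > "$tmp/coord.tsv"

# Join 1: attach the name to each (bsid, gene) pair, then re-sort on the gene id
# because join requires both inputs to be sorted on the join field.
LC_ALL=C join -t "$TAB" -1 1 -2 1 "$tmp/gene.tsv" "$tmp/info.tsv" |
  LC_ALL=C sort -t "$TAB" -k2,2 > "$tmp/bs_gene.tsv"

# Join 2: attach the coordinates, reorder to BED-like columns, sort by position.
result=$(LC_ALL=C join -t "$TAB" -1 4 -2 2 "$tmp/coord.tsv" "$tmp/bs_gene.tsv" |
  awk -F '\t' '{OFS="\t"; print $2,$3,$4,$1,$5,$6,$7}' |
  tr ' ' '_' |
  LC_ALL=C sort -t "$TAB" -k1,1 -k2,2n -k3,3n -k4)

printf '%s\n' "$result"
rm -rf "$tmp"
```

This prints three tab-separated rows, one per (gene, biosystem) pair: the g1 gene with b1/DNA_repair, then the g2 gene twice, once for b1/DNA_repair and once for b2/Cell_cycle. A gene that belongs to several biosystems is therefore repeated with the same coordinates. Once the real ncbibiosystem.bed.gz has been built and indexed, it can be queried by region with tabix, for example `tabix ncbibiosystem.bed.gz 17:7500000-7600000` (the region here is arbitrary).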