Last active
December 21, 2015 09:39
Simple bash script to download publication tables from the UCSC Genome Browser and merge them.
# Download the UCSC publication tables. Requires wget, which can be installed with Homebrew on macOS.
# More information is available here: http://brew.sh/
# -p avoids an error when ../data/ already exists.
mkdir -p ../data/
# Download Files
wget --timestamping --directory-prefix='../data/' 'http://hgdownload.cse.ucsc.edu/goldenPath/hgFixed/database/pubsArticle.txt.gz'
wget --timestamping --directory-prefix='../data/' 'http://hgdownload.cse.ucsc.edu/goldenPath/hgFixed/database/pubsMarkerAnnot.txt.gz'
## wget --timestamping --directory-prefix='../data/' 'http://hgdownload.cse.ucsc.edu/goldenPath/hgFixed/database/pubsSequenceAnnot.txt.gz'
# Unzip Files
gunzip ../data/pubsArticle.txt.gz
gunzip ../data/pubsMarkerAnnot.txt.gz
## gunzip ../data/pubsSequenceAnnot.txt.gz
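# Cut out the needed columns from each file, sort, and remove duplicate rows.
# The files were unzipped into ../data/, so they are read from there.
# Plain uniq keeps one copy of each row; uniq -u would drop every row that has a duplicate.
cut -f 1,5,6,8 ../data/pubsMarkerAnnot.txt | sort -n -k 1 | uniq > pubsMarkerAnnot_cut.txt
cut -f 1,2,3,8 ../data/pubsArticle.txt | sort -n -k 1 | uniq > pubsArticle_cut.txt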
# Join the cut files on the article index (column 1), rearrange the columns, and strip the <B> and </B> tags.
# The awk step also drops the article index.
# join needs the separator as a literal tab character; bash's $'\t' quoting produces one.
# Note: join expects lexically sorted input; the numeric sort above matches that only if all ids have the same number of digits.
join -t $'\t' pubsMarkerAnnot_cut.txt pubsArticle_cut.txt | uniq | awk -F $'\t' '{print $6"\t"$5"\t"$2"\t"$3"\t"$7"\t"$4}' | sed "s/<B>//g;s/<\/B>//g" > pubs_join.txt
# columns:
# pmid
# pmc id
# marker type (gene, snp, band)
# marker name (e.g. BRCA1)
# publication title
# Snippet
# Create the 'unique titles' and 'unique snippets' files.
# sort -u is used because uniq alone only removes adjacent duplicates.
cut -f 5 pubs_join.txt | sort -u > titles_unique.txt
cut -f 6 pubs_join.txt | sort -u > snippets_unique.txt
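The deduplicate-then-join steps are easier to check on a tiny input than on the full UCSC tables. Below is a minimal sketch that runs the same idea on made-up rows. The file names, the scratch directory, and the sample values are all hypothetical and are not real UCSC data. It shows how `uniq -u` differs from `sort -u`, and how `join` behaves with a tab separator. The tab is built with `printf` so the sketch also runs under plain `sh`; in bash, `$'\t'` gives the same character.

```shell
#!/bin/sh
# Toy run of the sort / uniq / join steps on made-up data (not real UCSC rows).
set -eu

workdir="$(mktemp -d)"     # hypothetical scratch directory
tab="$(printf '\t')"       # literal tab; same character as $'\t' in bash

# Two small tab-separated tables keyed on an article id in column 1.
# Article 2 appears twice in markers.txt to show how duplicates are handled.
printf '1\tgene\tBRCA1\n2\tsnp\trs123\n2\tsnp\trs123\n' > "$workdir/markers.txt"
printf '1\tTitle <B>one</B>\n2\tTitle <B>two</B>\n' > "$workdir/articles.txt"

# uniq -u prints only rows that are never repeated; sort -u keeps one copy of each row.
only_unrepeated="$(sort "$workdir/markers.txt" | uniq -u | wc -l | tr -d ' ')"
deduplicated="$(sort -u "$workdir/markers.txt" | wc -l | tr -d ' ')"
echo "uniq -u keeps $only_unrepeated row(s); sort -u keeps $deduplicated row(s)"

# join needs both inputs sorted on the join field and the same field separator.
sort -u "$workdir/markers.txt" > "$workdir/markers_sorted.txt"
sort -u "$workdir/articles.txt" > "$workdir/articles_sorted.txt"
joined="$(join -t "$tab" "$workdir/markers_sorted.txt" "$workdir/articles_sorted.txt" \
  | sed 's/<B>//g; s/<\/B>//g')"
printf '%s\n' "$joined"

rm -rf "$workdir"          # clean up the scratch directory
```

On this sample, `uniq -u` keeps one row (article 2 is lost entirely) while `sort -u` keeps two. That is why the script above deduplicates rather than using `uniq -u`.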
Thanks Dan!