@yannabraham
Last active September 2, 2016 15:02
This script parses the (very useful but broken) list of CD markers and associated genes published by UniProt at http://www.uniprot.org/docs/cdlist.txt
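library(stringr) # provides str_trim()
# download the raw text file, one element per line
screwed <- readLines(con='http://www.uniprot.org/docs/cdlist.txt')
# keep only the table rows; the range is hard-coded and must be adjusted if the file changes
screwed <- screwed[76:521]
# column boundaries: field i spans characters parser[i]+1 to parser[i+1]
parser <- c(0,8,21,29,37,55,1000000L) # use fixed length parsing
# cut each row into its six fields and trim surrounding whitespace
screwed <- lapply(screwed,function(scr) {
  sapply(seq(length(parser)-1),function(i) str_trim(substr(scr,parser[i]+1,parser[i+1])))
})
# stack the rows into a character matrix and name the columns
screwed <- do.call(rbind,screwed)
colnames(screwed) <- c('CD_Number','SwissProt_Name','AC_Number','MIM_Number','Gene_Names','Synonyms')
head(screwed)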
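The fixed-length parsing used above can be tried without network access. The sketch below is a minimal base-R variant of the same idea, not the script itself. The helper name `parse_fixed_width`, the sample rows, and the break positions are all hypothetical and invented for illustration; they are not real UniProt data.

```r
# Cut each line into fields at fixed character positions.
# `breaks` starts at 0 and then lists the last character of each field.
parse_fixed_width <- function(lines, breaks) {
  starts <- head(breaks, -1) + 1   # first character of each field
  ends   <- tail(breaks, -1)       # last character of each field
  fields <- lapply(lines, function(line) {
    # substring() is vectorised over start/end, so one call yields all fields
    trimws(substring(line, starts, ends))
  })
  do.call(rbind, fields)           # one row per input line
}

# Invented sample rows (NOT real UniProt data): fields end at columns 8 and 18.
sample_lines <- c(
  "A1      alpha     first",
  "B22     beta      second"
)
sample_breaks <- c(0, 8, 18, 1000000L)  # a large final value reads to end of line

parsed <- parse_fixed_width(sample_lines, sample_breaks)
colnames(parsed) <- c("id", "name", "note")
print(parsed)
```

A very large final break, as in the original `parser` vector, lets the last field absorb whatever remains on the line, so rows of different lengths need no special handling.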