@mdavy86
Forked from cfljam/new_gist_file.r
Last active August 29, 2015 14:16
Fixing eBrida identifiers

Problem

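Vine identifiers imported from Excel sources are incorrectly formatted for eBrida: the zero after the leading T appears as the letter O, every field is separated by a dot, and the trailing letter is upper case. For example:

Incorrect Excel format conversion    Correct eBrida format
TO8.33.06.15F                     => T08.33-06-15f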
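
To tell which IDs still need fixing, a pattern check can help. This is only a minimal sketch: `isEbridaID` is a hypothetical helper name, and the field widths in the pattern are an assumption inferred from the single example above, not from any eBrida specification.

```r
## Hypothetical helper: TRUE when an ID matches the layout of the
## "Correct eBrida format" example (T, 2 digits, ".", 2 digits, "-",
## 2 digits, "-", 2 digits, one lower-case letter). The field widths
## are an assumption inferred from that single example.
isEbridaID <- function(x) {
  grepl("^T[0-9]{2}\\.[0-9]{2}-[0-9]{2}-[0-9]{2}[a-z]$", x)
}

## The Excel-mangled ID fails the check, the corrected one passes
print(isEbridaID(c("TO8.33.06.15F", "T08.33-06-15f")))
```

If real IDs use other field widths or prefixes, the pattern would need adjusting.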

Loading example IDs

## Example IDs in the incorrect Excel format
x <- c("TO8.33.02.01A",
       "TO8.33.02.01B",
       "TO8.33.02.02A",
       "TO8.33.02.02B")

Fixing the IDs with regular expressions

First approach

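formatIDs <- function(x){
  ## Replace the letter O after the leading T with a zero: TO\\d => T0\\d
  x <- gsub("^(T)O(\\d)", "\\10\\2", x)
  ## Replace the 2nd and 3rd dots with hyphens and lower-case the last field
  ## (\\L in the replacement needs perl=TRUE)
  x <- gsub("(.+?\\.)(.+?)\\.(.+?)\\.(.+)","\\1\\2-\\3-\\L\\4", x, perl=TRUE)
  return(x)
}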

cat("[ MARC IDs ]\n")
print(x)

cat("[ eBrida conversion ]\n")
print(formatIDs(x))
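
To see what each substitution contributes, the two gsub() calls can be run one at a time on the ID from the Problem section. This is an illustrative walk-through only; `id`, `step1` and `step2` are names introduced here.

```r
## The example ID from the Problem section
id <- "TO8.33.06.15F"

## Step 1: "\\10\\2" is backreference 1, a literal 0, then backreference 2,
## so the letter O between T and the first digit becomes a zero
step1 <- gsub("^(T)O(\\d)", "\\10\\2", id)

## Step 2: keep the first dot, turn the next two dots into hyphens and
## lower-case the last field with \\L (which requires perl = TRUE)
step2 <- gsub("(.+?\\.)(.+?)\\.(.+?)\\.(.+)", "\\1\\2-\\3-\\L\\4",
              step1, perl = TRUE)

print(step1)  # "T08.33.06.15F"
print(step2)  # "T08.33-06-15f"
```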

Second approach using strsplit()

formatIDs <- function(x) {
  ## Split each ID on "." into its four fields
  parts <- strsplit(x, "\\.")
  sapply(parts, function(p) {
    ## Convert TO\\d => T0\\d in the first field
    st <- gsub("^(T)O(\\d)", "\\10\\2", p[1])
    ## Lower-case the last field
    en <- tolower(p[4])
    paste0(st, ".", p[2], "-", p[3], "-", en)
  })
}

cat("[ MARC IDs ]\n")
print(x)

cat("[ eBrida conversion ]\n")
print(formatIDs(x))
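
The strsplit() approach relies on every ID having exactly four dot-separated fields. A possible guard, sketched here with a hypothetical helper `hasFourFields` and made-up malformed IDs, is to count the fields before converting:

```r
## Hypothetical check: an ID is usable by the strsplit() approach only
## if splitting on "." yields exactly four fields.
hasFourFields <- function(x) {
  lengths(strsplit(x, ".", fixed = TRUE)) == 4
}

## One well-formed ID plus two made-up malformed ones for illustration
ids <- c("TO8.33.02.01A", "TO8.33.02", "")
ok <- hasFourFields(ids)

## IDs to inspect by hand before converting
print(ids[!ok])
```

Note that `lengths()` assumes R 3.2.0 or later; on older versions `sapply(parts, length)` would be the equivalent.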