Skip to content

Instantly share code, notes, and snippets.

@thegargiulian
Last active November 11, 2019 23:43
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save thegargiulian/320e3d10bb58c6fddf58ccfd778e56f3 to your computer and use it in GitHub Desktop.
Save thegargiulian/320e3d10bb58c6fddf58ccfd778e56f3 to your computer and use it in GitHub Desktop.
Generate "dirty" date data.

"Dirty" Dates

Problem Statement: Generate a large number (~10,000) dirty or broken dates to be parsed into YYYY-MM-DD form that replicate common errors.

Types of errors I frequently run into during analysis:

  • Inconsistent field ordering: sometimes date information is kept in separate fields, sometimes it's stored as DD-MM instead of MM-DD, etc.
  • Inconsistent field values: take, for example, the month of June which could be represented in the following ways: "6", "06", "June", "jun", among others; when working with data across different (natural) languages the number of potential representations increases further.
  • Weird symbols and spacing: somehow there is more whitespace than I think there should be and field separator characters appear in seemingly incorrect places or are used inconsistently across observations. I also often see things like "?" to represent uncertainty in an observation or invalid entries, like "00" or "9999", to represent missingness.

Method

  1. Establish a list of valid dirty numeric field values, valid dirty string field values, and valid symbols (like separators or extra spacing).
  2. Collect all valid field representations from the global environment and separate into year, month, and day categories.
  3. Establish valid ways for combinations of year, month, and day fields to be ordered. For example, some dates might be in YYYY format, while others are MM-YYYY or YYYY-DD-MM, etc.
  4. Generate n dirty dates by randomly selecting a field ordering, randomly selecting a value for each field type, pasting the resulting values together in the specified order, and randomly inserting symbols to further perturb the output.

Example dirty dates generated by dirty-dates.R: Example dirty dates generated by dirty-dates.R

Running the code

This code relies on the contents of the global environment in order to grab field values, which means that it needs to be run sequentially in order to function properly. Ideally the script would be run in batch mode from the command line using R CMD BATCH 2_dirty-dates.R with a file write location specified at the end of the script. Prior to running you'll also want to make sure that you have the stringr package installed. From R you can install stringr using install.packages("stringr").

# library packages
library(stringr)
# define functions
# return a random value for a field appearing in randomly selected `field_combo`,
# based on available `fields` using `lookup` to isolate year, month, and day
get_value <- function(field_combo, fields, lookup) {
fs <- eval(parse(text=sample(fields, 1)))
val <- sample(fs, as.integer(grepl(lookup, field_combo)))
return(val)
}
# randomly insert symbols depending on whether date is numeric or character to further perturb
insert_symbols <- function(string, ordering) {
has_char <- any(str_detect(string, months))
if (!has_char) {
sym <- sample(symbols_numeric, 1)
string <- gsub(" ", sym, string, fixed=TRUE)
} else if (has_char &
(ordering == "ymd" | ordering == "my" | ordering == "ym")) {
sym <- sample(symbols_char, 1)
string <- gsub(" ", sym, string, fixed=TRUE)
}
return(string)
}
# begin main
# valid numeric field values
yyyy <- c(1900:2015, "9999", "199", "199X", "1996?")
y <- 0:99
yy <- ifelse(y < 10, paste0("0", y), y)
m <- 1:12
mm <- ifelse(m < 10, paste0("0", m), m)
d <- 1:31
dd <- c(ifelse(d < 10, paste0("0", d), d), "8a", "1o")
symbols_numeric <- c(".", " ", "-", "/")
# valid string field values
months <- c("January", "Jan.", "ene",
"FEBRUARY", "Feb", "FEBUARY ",
"march", "MAR", "Mars",
"April", " apr.", "ABRIL",
"MAY", "may", "MAI",
"june", "Jun ",
"July", "Jul", "Julliet",
"AUGUST", "Aug.", "Août",
"september", "Sept.", "Sep",
"October", "oct",
"NOVEMBER", "Nov. ",
"december", "Dec", "dic", "DÉCEMBRE")
symbols_char <- c(" ", " ", "")
# grab names of valid fields from global environment, separate into year, month, and day categories
fields <- ls(globalenv())
year_fields <- grep("^y", fields, value=TRUE)
month_fields <- grep("^m", fields, value=TRUE)
day_fields <- grep("^d", fields, value=TRUE)
# establish valid ways for fields to be ordered
valid_field_combos <- c("paste(y_)", "paste(m_, y_)", "paste(y_, m_)",
"paste(m_, d_)", "paste(y_, m_, d_)",
"paste(m_, d_, y_)", "paste(y_, d_, m_)")
order <- c("y", "my", "ym", "md", "ymd", "mdy", "ydm")
f_params <- as.data.frame(cbind(valid_field_combos, order), stringsAsFactors=FALSE)
# generate n dirty dates
n <- 10000
dirty_dates <- rep(NA, n)
for (i in 1:n) {
row_ <- sample(1:nrow(f_params), 1) # randomly select a field ordering
these_fields <- f_params$valid_field_combos[row_]
y_ <- get_value(these_fields, year_fields, "y") # randomly get a year value
m_ <- get_value(these_fields, month_fields, "m") # randomly get a month value
d_ <- get_value(these_fields, day_fields, "d") # randomly get a day value
date_temp <- eval(parse(text=these_fields)) # assemble field values together
date_temp <- ifelse(nchar(date_temp) <= 2, paste0("19", date_temp), date_temp) # pad YY to be 19YY if necessary
dirty_dates[i] <- insert_symbols(date_temp, f_params$order[row_]) # randomly insert symbols to perturb output
}
# specify file write location here!
# done.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment