thegargiulian/1_introduction.md

## 1_introduction.md

      
    Raw
  

              1_introduction.md
            
          
    "Dirty" Dates

Problem Statement: Generate a large number (~10,000) dirty or broken dates to be parsed into YYYY-MM-DD form that replicate common errors.
Types of errors I frequently run into during analysis:

Inconsistent field ordering: sometimes date information is kept in separate fields, sometimes it's stored as DD-MM instead of MM-DD, etc.
Inconsistent field values: take, for example, the month of June which could be represented in the following ways: "6", "06", "June", "jun", among others; when working with data across different (natural) languages the number of potential representations increases further.
Weird symbols and spacing: somehow there is more whitespace than I think there should be and field separator characters appear in seemingly incorrect places or are used inconsistently across observations. I also often see things like "?" to represent uncertainty in an observation or invalid entries, like "00" or "9999", to represent missingness.

Method


Establish a list of valid dirty numeric field values, valid dirty string field values, and valid symbols (like separators or extra spacing).
Collect all valid field representations from the global environment and separate into year, month, and day categories.
Establish valid ways for combinations of year, month, and day fields to be ordered. For example, some dates might be in YYYY format, while others are MM-YYYY or YYYY-DD-MM, etc.
Generate n dirty dates by randomly selecting a field ordering, randomly selecting a value for each field type, pasting the resulting values together in the specified order, and randomly inserting symbols to further perturb the output.

Example dirty dates generated by dirty-dates.R:

Running the code

This code relies on the contents of the global environment in order to grab field values, which means that it needs to be run sequentially in order to function properly. Ideally the script would be run in batch mode from the command line using R CMD BATCH 2_dirty-dates.R with a file write location specified at the end of the script. Prior to running you'll also want to make sure that you have the stringr package installed. From R you can install stringr using install.packages("stringr").

  
## 2_dirty-dates.R
# library packages
library(stringr)

# define functions


# return a random value for a field appearing in randomly selected `field_combo`,
# based on available `fields` using `lookup` to isolate year, month, and day
get_value <- function(field_combo, fields, lookup) {
  fs <- eval(parse(text=sample(fields, 1)))
  val <- sample(fs, as.integer(grepl(lookup, field_combo)))
  return(val)
}


# randomly insert symbols depending on whether date is numeric or character to further perturb
insert_symbols <- function(string, ordering) {
  has_char <- any(str_detect(string, months))
  if (!has_char) {
    sym <- sample(symbols_numeric, 1)
    string <- gsub(" ", sym, string, fixed=TRUE)
  } else if (has_char &
             (ordering == "ymd" | ordering == "my" | ordering == "ym")) {
    sym <- sample(symbols_char, 1)
    string <- gsub(" ", sym, string, fixed=TRUE)
  }
  return(string)
}


# begin main

# valid numeric field values
yyyy <- c(1900:2015, "9999", "199", "199X", "1996?")
y <- 0:99
yy <- ifelse(y < 10, paste0("0", y), y)
m <- 1:12
mm <- ifelse(m < 10, paste0("0", m), m)
d <- 1:31
dd <- c(ifelse(d < 10, paste0("0", d), d), "8a", "1o")
symbols_numeric <- c(".", " ", "-", "/")

# valid string field values
months <- c("January", "Jan.", "ene",
            "FEBRUARY", "Feb", "FEBUARY        ",
            "march", "MAR", "Mars",
            "April", "   apr.", "ABRIL",
            "MAY", "may", "MAI",
            "june", "Jun ",
            "July", "Jul", "Julliet",
            "AUGUST", "Aug.", "Août",
            "september", "Sept.", "Sep",
            "October", "oct",
            "NOVEMBER", "Nov.           ",
            "december", "Dec", "dic", "DÉCEMBRE")
symbols_char <- c(" ", " ", "")

# grab names of valid fields from global environment, separate into year, month, and day categories
fields <- ls(globalenv())
year_fields <- grep("^y", fields, value=TRUE)
month_fields <- grep("^m", fields, value=TRUE)
day_fields <- grep("^d", fields, value=TRUE)

# establish valid ways for fields to be ordered
valid_field_combos <- c("paste(y_)", "paste(m_, y_)", "paste(y_, m_)",
                        "paste(m_, d_)", "paste(y_, m_, d_)",
                        "paste(m_, d_, y_)", "paste(y_, d_, m_)")
order <- c("y", "my", "ym", "md", "ymd", "mdy", "ydm")
f_params <- as.data.frame(cbind(valid_field_combos, order), stringsAsFactors=FALSE)

# generate n dirty dates
n <- 10000
dirty_dates <- rep(NA, n)
for (i in 1:n) {
  row_ <- sample(1:nrow(f_params), 1)                     # randomly select a field ordering
  these_fields <- f_params$valid_field_combos[row_]
  y_ <- get_value(these_fields, year_fields, "y")         # randomly get a year value
  m_ <- get_value(these_fields, month_fields, "m")        # randomly get a month value
  d_ <- get_value(these_fields, day_fields, "d")          # randomly get a day value

  date_temp <- eval(parse(text=these_fields))             # assemble field values together
  date_temp <- ifelse(nchar(date_temp) <= 2, paste0("19", date_temp), date_temp)    # pad YY to be 19YY if necessary

  dirty_dates[i] <- insert_symbols(date_temp, f_params$order[row_])      # randomly insert symbols to perturb output
}

# specify file write location here!

# done.
	# library packages
	library(stringr)

	# define functions


	# return a random value for a field appearing in randomly selected `field_combo`,
	# based on available `fields` using `lookup` to isolate year, month, and day
	get_value <- function(field_combo, fields, lookup) {
	fs <- eval(parse(text=sample(fields, 1)))
	val <- sample(fs, as.integer(grepl(lookup, field_combo)))
	return(val)
	}


	# randomly insert symbols depending on whether date is numeric or character to further perturb
	insert_symbols <- function(string, ordering) {
	has_char <- any(str_detect(string, months))
	if (!has_char) {
	sym <- sample(symbols_numeric, 1)
	string <- gsub(" ", sym, string, fixed=TRUE)
	} else if (has_char &
	(ordering == "ymd" \| ordering == "my" \| ordering == "ym")) {
	sym <- sample(symbols_char, 1)
	string <- gsub(" ", sym, string, fixed=TRUE)
	}
	return(string)
	}


	# begin main

	# valid numeric field values
	yyyy <- c(1900:2015, "9999", "199", "199X", "1996?")
	y <- 0:99
	yy <- ifelse(y < 10, paste0("0", y), y)
	m <- 1:12
	mm <- ifelse(m < 10, paste0("0", m), m)
	d <- 1:31
	dd <- c(ifelse(d < 10, paste0("0", d), d), "8a", "1o")
	symbols_numeric <- c(".", " ", "-", "/")

	# valid string field values
	months <- c("January", "Jan.", "ene",
	"FEBRUARY", "Feb", "FEBUARY ",
	"march", "MAR", "Mars",
	"April", " apr.", "ABRIL",
	"MAY", "may", "MAI",
	"june", "Jun ",
	"July", "Jul", "Julliet",
	"AUGUST", "Aug.", "Août",
	"september", "Sept.", "Sep",
	"October", "oct",
	"NOVEMBER", "Nov. ",
	"december", "Dec", "dic", "DÉCEMBRE")
	symbols_char <- c(" ", " ", "")

	# grab names of valid fields from global environment, separate into year, month, and day categories
	fields <- ls(globalenv())
	year_fields <- grep("^y", fields, value=TRUE)
	month_fields <- grep("^m", fields, value=TRUE)
	day_fields <- grep("^d", fields, value=TRUE)

	# establish valid ways for fields to be ordered
	valid_field_combos <- c("paste(y_)", "paste(m_, y_)", "paste(y_, m_)",
	"paste(m_, d_)", "paste(y_, m_, d_)",
	"paste(m_, d_, y_)", "paste(y_, d_, m_)")
	order <- c("y", "my", "ym", "md", "ymd", "mdy", "ydm")
	f_params <- as.data.frame(cbind(valid_field_combos, order), stringsAsFactors=FALSE)

	# generate n dirty dates
	n <- 10000
	dirty_dates <- rep(NA, n)
	for (i in 1:n) {
	row_ <- sample(1:nrow(f_params), 1) # randomly select a field ordering
	these_fields <- f_params$valid_field_combos[row_]
	y_ <- get_value(these_fields, year_fields, "y") # randomly get a year value
	m_ <- get_value(these_fields, month_fields, "m") # randomly get a month value
	d_ <- get_value(these_fields, day_fields, "d") # randomly get a day value

	date_temp <- eval(parse(text=these_fields)) # assemble field values together
	date_temp <- ifelse(nchar(date_temp) <= 2, paste0("19", date_temp), date_temp) # pad YY to be 19YY if necessary

	dirty_dates[i] <- insert_symbols(date_temp, f_params$order[row_]) # randomly insert symbols to perturb output
	}

	# specify file write location here!

	# done.