Problem Statement: Generate a large number (~10,000) of dirty or broken dates that replicate common errors, to be parsed into YYYY-MM-DD form.
Types of errors I frequently run into during analysis:
- Inconsistent field ordering: sometimes date information is kept in separate fields, sometimes it's stored as DD-MM instead of MM-DD, etc.
- Inconsistent field values: take, for example, the month of June, which could be represented as "6", "06", "June", or "jun", among others; when working with data across different (natural) languages, the number of potential representations increases further.
- Weird symbols and spacing: somehow there is more whitespace than I think there should be and field separator characters appear in seemingly incorrect places or are used inconsistently across observations. I also often see things like "?" to represent uncertainty in an observation or invalid entries, like "00" or "9999", to represent missingness.
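To make the "inconsistent field values" item concrete, here is a minimal base R sketch that maps several spellings of a month onto a two-digit form. The function name `normalize_month` is hypothetical and not part of the script; it only handles English month names and treats out-of-range numbers such as "00" as missing.

```r
# Hypothetical helper, for illustration only (English month names).
normalize_month <- function(x) {
  x <- tolower(trimws(x))
  # Numeric spellings such as "6" or "06" convert directly; names become NA.
  idx <- suppressWarnings(as.integer(x))
  # Fall back to full names ("june"), then to abbreviations ("jun").
  by_name <- match(x, tolower(month.name))
  by_abb  <- match(x, tolower(month.abb))
  idx <- ifelse(is.na(idx), ifelse(is.na(by_name), by_abb, by_name), idx)
  # Values outside 1..12 (e.g. "00" used for missingness) are set to NA.
  idx[!is.na(idx) & (idx < 1L | idx > 12L)] <- NA
  ifelse(is.na(idx), NA_character_, sprintf("%02d", idx))
}

normalize_month(c("6", "06", "June", "jun", "00"))
```

The first four inputs all map to "06", while "00" comes back as `NA`.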
Approach:
- Establish a list of valid dirty numeric field values, valid dirty string field values, and valid symbols (like separators or extra spacing).
- Collect all valid field representations from the global environment and separate into year, month, and day categories.
- Establish valid ways for combinations of year, month, and day fields to be ordered. For example, some dates might be in YYYY format, while others are MM-YYYY or YYYY-DD-MM, etc.
- Generate n dirty dates by randomly selecting a field ordering, randomly selecting a value for each field type, pasting the resulting values together in the specified order, and randomly inserting symbols to further perturb the output.
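The steps above can be sketched roughly as follows. This is a minimal base R illustration, not the script itself: the field values, orderings, symbols, and the names `make_dirty_date`, `perturb`, and `generate_dirty_dates` are all assumptions made for the example, and it keeps its values in local lists rather than collecting them from the global environment.

```r
# Hypothetical field values; the real script's lists are not shown here.
field_values <- list(
  year  = c("2014", "14", "9999"),
  month = c("6", "06", "June", "jun", "00"),
  day   = c("1", "01", "28", "00")
)
separators <- c("-", "/", ".", " ")
symbols    <- c(" ", "?", "-")

# Hypothetical valid field orderings.
orderings <- list(
  c("year"),
  c("month", "year"),
  c("year", "month", "day"),
  c("year", "day", "month")
)

# Insert one randomly chosen symbol at a random position in x.
perturb <- function(x) {
  pos <- sample.int(nchar(x) + 1, 1) - 1  # 0 means "before the first character"
  paste0(substr(x, 1, pos), sample(symbols, 1), substr(x, pos + 1, nchar(x)))
}

make_dirty_date <- function() {
  # Randomly select a field ordering, then one value per field in that order.
  ordering <- orderings[[sample.int(length(orderings), 1)]]
  parts <- vapply(ordering, function(f) sample(field_values[[f]], 1), character(1))
  # Paste the values together with a randomly chosen separator.
  out <- paste(parts, collapse = sample(separators, 1))
  # Perturb roughly a third of the dates with an extra symbol.
  if (runif(1) < 1 / 3) out <- perturb(out)
  out
}

generate_dirty_dates <- function(n) replicate(n, make_dirty_date())

set.seed(1)
dirty_dates <- generate_dirty_dates(10)
print(dirty_dates)
```

Using a single separator per date keeps the sketch short; drawing a separate separator for each gap would reproduce the inconsistent-separator errors described above more faithfully.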
Example dirty dates generated by dirty-dates.R:
This code relies on the contents of the global environment to grab field values, so it must be run sequentially, from top to bottom, in order to function properly. Ideally the script would be run in batch mode from the command line using R CMD BATCH 2_dirty-dates.R, with a file write location specified at the end of the script. Before running, make sure that you have the stringr package installed. From R, you can install stringr using install.packages("stringr").
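As a rough sketch of what "a file write location specified at the end of the script" could look like, the snippet below writes a character vector of dates to a text file, one per line. The vector contents, the variable names, and the output path are placeholders chosen for this example, not values taken from the script.

```r
# Placeholder standing in for the vector of generated dirty dates.
dirty_dates <- c("06/2014", " 2014-?-jun", "9999.00.00")

# Hypothetical write location; replace with the path you actually want.
out_path <- file.path(tempdir(), "dirty-dates.txt")

# Write one dirty date per line.
writeLines(dirty_dates, out_path)
```

Reading the file back with `readLines(out_path)` is a quick way to confirm that leading and trailing whitespace in the dirty dates survived the write.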