Problem Statement: Generate a large number (~10,000) of dirty or broken dates that replicate common errors, to be parsed into YYYY-MM-DD form.
Types of errors I frequently run into during analysis:
- Inconsistent field ordering: sometimes date information is kept in separate fields, sometimes it's stored as DD-MM instead of MM-DD, etc.
- Inconsistent field values: take, for example, the month of June, which could be represented as "6", "06", "June", or "jun", among others; when working with data across different (natural) languages, the number of potential representations increases further.
- Weird symbols and spacing: there is often more whitespace than I would expect, and field separator characters appear in seemingly incorrect places or are used inconsistently across observations. I also often see things like "?" to represent uncertainty in an observation, or invalid entries like "00" or "9999" to represent missingness.
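
One way the error types above could be simulated is sketched below in Python, using only the standard library. This is a minimal sketch under stated assumptions, not a definitive generator: the function names (`random_date`, `dirty_month`, `dirty_date`, `generate`), the date range, the separator set, and the error probabilities are all hypothetical choices made for illustration. It covers English month names only and does not model dates split across separate fields. Each dirty string is paired with its clean YYYY-MM-DD value so a parser can be scored against it.

```python
import random
from datetime import date, timedelta

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]


def random_date(rng, start=date(1950, 1, 1), end=date(2020, 12, 31)):
    """Pick a uniformly random date in [start, end] (range is an assumption)."""
    span = (end - start).days
    return start + timedelta(days=rng.randint(0, span))


def dirty_month(rng, month):
    """Return one of several spellings of a month, e.g. 6, 06, June, jun."""
    name = MONTH_NAMES[month - 1]
    return rng.choice([str(month), f"{month:02d}", name, name[:3].lower()])


def dirty_date(rng, d):
    """Render a date with inconsistent ordering, values, and symbols."""
    year = str(d.year)
    month = dirty_month(rng, d.month)
    day = rng.choice([str(d.day), f"{d.day:02d}"])

    # Occasionally swap in a sentinel for missingness (rates are arbitrary).
    roll = rng.random()
    if roll < 0.05:
        day = "00"
    elif roll < 0.08:
        year = "9999"

    # Inconsistent field ordering: YMD, DMY, or MDY.
    fields = rng.choice([[year, month, day],
                         [day, month, year],
                         [month, day, year]])

    # Inconsistent separators with stray whitespace around them.
    parts = [fields[0]]
    for field in fields[1:]:
        sep = rng.choice(["-", "/", ".", " "])
        pad = " " * rng.randint(0, 2)
        parts.append(pad + sep + pad + field)
    text = "".join(parts)

    # Occasionally mark the observation as uncertain.
    if rng.random() < 0.05:
        text += "?"
    return text


def generate(n=10_000, seed=0):
    """Return n (clean ISO date, dirty string) pairs, reproducible by seed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        d = random_date(rng)
        rows.append((d.isoformat(), dirty_date(rng, d)))
    return rows


if __name__ == "__main__":
    for clean, dirty in generate(n=5):
        print(f"{clean}  <-  {dirty!r}")
```

Seeding a dedicated `random.Random` instance keeps the generated set reproducible, which makes it easier to compare parsers against the same inputs. Note that once a sentinel such as "00" or "9999" replaces a field, the paired clean date is no longer recoverable from the dirty string, so those rows are better treated as expected parse failures.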