@thegargiulian
thegargiulian / 1_introduction.md
Last active January 30, 2020 16:07
Blocking with a human oracle

Creating blocks for a human oracle for record linkage

Sometimes we ask a human oracle to label the match status of record pairs, generating training data for later use in rule-learning record linkage models. 2_oracle-preparation.R has some example code for this task, based on my experience working with human oracles; it uses synthetic record data from the RecordLinkage package in R.

Block size

I aim for blocks of between 150 and 250 records. Anything smaller is of limited utility (you don't want a model learning really intricate rules, because you will end up with too many candidate rules), and anything much larger is difficult for the oracle to work with.
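As a rough illustration of the size check described above, here is a minimal sketch in Python (not the R used in 2_oracle-preparation.R). The function name `block_size_report`, the blocking keys, and the example data are all hypothetical; only the 150 to 250 target range comes from the text.

```python
from collections import Counter

def block_size_report(block_keys, lo=150, hi=250):
    """Count records per block and label each block relative to the target range.

    block_keys: one (hypothetical) blocking key per record.
    lo, hi: target block size range; 150 and 250 are the bounds given in the text.
    """
    sizes = Counter(block_keys)
    report = {}
    for key, n in sizes.items():
        if n < lo:
            report[key] = (n, "too small")
        elif n > hi:
            report[key] = (n, "too large")
        else:
            report[key] = (n, "ok")
    return report

# Made-up example: three blocks of very different sizes.
keys = ["A"] * 200 + ["B"] * 40 + ["C"] * 400
report = block_size_report(keys)
for key, (n, status) in sorted(report.items()):
    print(key, n, status)
```

Blocks flagged as too small could be merged with neighbouring blocks and blocks flagged as too large could be split on a second key; which of these is appropriate depends on the blocking scheme.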

Preparing the data

After you've decided on some blocks, you'll want to prepare the data for the oracle. They'll likely be looking at this in a spreadsheet, and you'll want to make the fields relevant to them easy to access, while still maintaining th

@thegargiulian
thegargiulian / 1_introduction.md
Last active November 11, 2019 23:43
Generate "dirty" date data.

"Dirty" Dates

Problem Statement: Generate a large number (~10,000) of dirty or broken dates that replicate common errors, to be parsed into YYYY-MM-DD form.

Types of errors I frequently run into during analysis:

  • Inconsistent field ordering: sometimes date information is kept in separate fields, sometimes it's stored as DD-MM instead of MM-DD, etc.
  • Inconsistent field values: take, for example, the month of June, which could be represented as "6", "06", "June", or "jun", among others; when working with data across different (natural) languages, the number of potential representations increases further.
  • Weird symbols and spacing: somehow there is more whitespace than I think there should be and field separator characters appear in seemingly incorrect places or are used inconsistently across observations. I also often see things like "?" to represent uncertainty in an observation or invalid entries, like "00" or "9999", to represent missingness.
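The three error types above can be sketched as a small generator. This is a minimal Python sketch under stated assumptions, not the method the gist goes on to describe: the function names, the error probabilities, the separator set, and the 1950 start date are all hypothetical choices made for illustration. Each output pairs the true date in YYYY-MM-DD form with a dirtied string.

```python
import random
from datetime import date, timedelta

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
SEPARATORS = ["-", "/", ".", " "]   # assumed set of field separators
PADS = ["", "", " ", "  "]          # extra whitespace, empty half the time

def month_variants(m):
    """Representations of month m, e.g. 6 -> "6", "06", "June", "jun"."""
    name = MONTH_NAMES[m - 1]
    return [str(m), f"{m:02d}", name, name[:3].lower()]

def dirty_date(d, rng):
    """Return one dirtied string for the true date d."""
    # Inconsistent field values
    month = rng.choice(month_variants(d.month))
    day = rng.choice([str(d.day), f"{d.day:02d}"])
    year = str(d.year)
    # Uncertainty and missingness codes (probabilities are arbitrary)
    r = rng.random()
    if r < 0.05:
        day = "00"
    elif r < 0.08:
        year = "9999"
    elif r < 0.11:
        day = "?"
    # Inconsistent field ordering
    fields = rng.choice([[year, month, day], [day, month, year], [month, day, year]])
    # Inconsistent separators and stray whitespace, chosen per gap
    out = fields[0]
    for field in fields[1:]:
        out += rng.choice(SEPARATORS) + rng.choice(PADS) + field
    return rng.choice(PADS) + out

def generate_dirty_dates(n=10_000, seed=0):
    """Return n (true_iso_date, dirty_string) pairs, reproducible for a given seed."""
    rng = random.Random(seed)
    start = date(1950, 1, 1)
    pairs = []
    for _ in range(n):
        true = start + timedelta(days=rng.randrange(25_000))
        pairs.append((true.isoformat(), dirty_date(true, rng)))
    return pairs

pairs = generate_dirty_dates()
for true, dirty in pairs[:5]:
    print(repr(dirty), "->", true)
```

Keeping the true date alongside each dirty string makes it possible to score a parser against the generated data. Month names in other languages, mentioned above, are not covered by this sketch.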

Method