Sometimes we ask a human oracle to label the match status of record pairs, producing training data for later use in rule-learning record linkage models. 2_oracle-preparation.R has some example code for this task, based on my experience working with human oracles; it uses synthetic record data from the RecordLinkage package in R.
I aim for blocks of between 150 and 250 records. Anything smaller is of limited utility (you don't want the model learning really intricate rules, because you will end up with too many candidate rules), and anything much larger is difficult for the oracle to work with.
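As a rough illustration of that size filter, here is a minimal base R sketch. It is not taken from 2_oracle-preparation.R: the `records` data frame, the `block_key` column, and the block labels are all hypothetical stand-ins for whatever blocking variable you actually use.

```r
# Hypothetical toy data: 1,000 records with a made-up blocking key.
# Block sizes are fixed here so the example is deterministic.
records <- data.frame(
  record_id = seq_len(1000),
  block_key = rep(c("A", "B", "C", "D"), times = c(400, 220, 200, 180)),
  stringsAsFactors = FALSE
)

# Target block size range described in the text.
min_size <- 150
max_size <- 250

# Count records per block, then keep only blocks inside the range.
block_sizes <- table(records$block_key)
in_range <- block_sizes >= min_size & block_sizes <= max_size
usable_blocks <- names(block_sizes)[in_range]

# Records belonging to blocks that are a reasonable size for the oracle.
oracle_candidates <- records[records$block_key %in% usable_blocks, ]
```

In this toy data, block "A" is dropped for being too large and the other three blocks are kept. With real data you would likely also want to look at the blocks that fall outside the range and decide whether a coarser or finer blocking key would bring them into it.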
After you've decided on some blocks, you'll want to prepare the data for the oracle. They'll likely be looking at this in a spreadsheet, and you'll want to make the fields relevant to them easy to access, while still maintaining th