@thegargiulian
Last active January 30, 2020 16:07
Blocking with a human oracle

Creating blocks for a human oracle for record linkage

Sometimes we ask a human oracle to label the match status of record pairs, generating training data for rule-learning record linkage models. 2_oracle-preparation.R has some example code for accomplishing this task, based on my experience working with human oracles; it uses synthetic record data from the RecordLinkage package in R.

Block size

I aim for blocks of between 150 and 250 records. Anything smaller is of limited utility (you don't want a model learning really intricate rules, because you will get too many candidate rules), and anything much larger is difficult for the oracle to work with.
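As a quick way to check candidate blocks against that range, here is a minimal sketch in base R. The helper name `check_block_sizes` and the example key values are made up for illustration; the 150 and 250 defaults come from the range above.

```r
# Hypothetical helper: count records per candidate block and flag blocks
# outside the target size range (150 to 250 by default, per the text above).
check_block_sizes <- function(keys, min_size = 150, max_size = 250) {
  counts <- table(keys)
  sizes <- data.frame(
    block = names(counts),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
  sizes$status <- ifelse(
    sizes$n < min_size, "too small",
    ifelse(sizes$n > max_size, "too large", "ok")
  )
  sizes
}

# Made-up blocking key values, purely for illustration.
keys <- rep(c("a", "b", "c"), times = c(100, 200, 300))
size_check <- check_block_sizes(keys)
print(size_check)
```

Blocks flagged "too large" are candidates for splitting on an additional field; blocks flagged "too small" might be merged with a neighbouring block.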

Preparing the data

After you've decided on some blocks, you'll want to prepare the data for the oracle. They'll likely be looking at this in a spreadsheet, so you'll want to make the fields relevant to them easy to access, while still keeping the information you'll need to ingest the data once it's been through hand matching. Here are some practices I follow when preparing the data for the oracle:

  • Each block of records (cluster) should get a cluster number. I generate this cluster number by hashing the record identifiers of all the records included in the block (I use sha1 hashes). In R, I use the digest package for hashing.
  • Records should be sorted so that likely matches appear near each other. For example, if I have person level data, I might sort the records in ascending order according to (last name, first name).
  • After records have been sorted within a cluster, they should get match group numbers. To begin, the match group number should be the row number of the record within the cluster. The oracle will manipulate this field during hand matching so that matching records get the same match group number.
  • Order the dataframe columns based on what fields you believe to be substantively important/relevant for making good matches. For example, first and last name are important for the oracle to make a good match; record hashid is not. The match group number should be the leftmost column, regardless of the ordering of the other columns. You can also add a blank notes column for the oracle. If you do, the columns should be ordered: notes, match group, ...
  • Leave a few blank rows in between different blocks (this is to save the oracle the trouble of having to check cluster numbers to see where one cluster ends and the next begins).
  • Save all of the blocks you create somewhere that won’t get overwritten. Even if you decide to change direction in terms of blocking, having the blocks can still be useful for model building in the future.
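The last point can be sketched roughly as follows, using only base R. The helper name `save_blocks`, the file naming scheme, and the example data frame are assumptions for illustration, not part of the original script; the temporary directory is used only so the sketch runs anywhere.

```r
# Hypothetical helper: write a data frame to a timestamped CSV and refuse to
# overwrite a file that already exists.
save_blocks <- function(df, dir, prefix = "blocks") {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  stamp <- format(Sys.time(), "%Y%m%d-%H%M%S")
  path <- file.path(dir, paste0(prefix, "_", stamp, ".csv"))
  if (file.exists(path)) {
    stop("Refusing to overwrite existing file: ", path)
  }
  write.csv(df, path, row.names = FALSE)
  path
}

# Made-up stand-in for a prepared block. In practice the directory would be
# a durable project folder rather than tempdir().
example_blocks <- data.frame(
  notes = "",
  matchgroup = 1:3,
  stringsAsFactors = FALSE
)
out_dir <- file.path(tempdir(), "oracle-blocks")
out_path <- save_blocks(example_blocks, out_dir)
```

Putting the timestamp in the file name means each run produces a new file, so earlier blocks survive a change of blocking strategy.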
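library(pacman)
p_load(RecordLinkage, digest)

# === functions

# Build one block for the oracle: all records with the given birth year.
# Relies on `var_order` being defined in the calling environment (see main).
generate_block <- function(year) {
    block <- RLdata10000[RLdata10000$by == year, ]
    # sort so that likely matches sit next to each other
    block <- block[order(block$lname_c1, block$fname_c1), ]
    # match group starts as the row number; the oracle edits it by hand
    block$matchgroup <- seq_len(nrow(block))
    # cluster id: sha1 of the hashids of the records in this block only
    block$clusterid <- digest(block$hashid, "sha1")
    block$notes <- ""
    block <- block[, var_order]
    return(block)
}

# === main

data("RLdata10000")
# record identifier: sha1 hash of each record's fields
RLdata10000$hashid <- apply(RLdata10000, 1, digest, "sha1")

var_order <- c("notes", "matchgroup", "lname_c1", "fname_c1", "lname_c2",
               "fname_c2", "by", "bm", "bd", "hashid", "clusterid")

block_1 <- generate_block(1964)
block_2 <- generate_block(1965)

# blank rows to visually separate blocks in the spreadsheet
blank_df <- as.data.frame(matrix(data="", nrow=5, ncol=length(var_order)),
                          stringsAsFactors=FALSE)
names(blank_df) <- var_order

appended_records <- rbind(block_1, blank_df, block_2)

# done.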
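Once the oracle has finished, records that share a match group number within a cluster are matches. A minimal sketch of turning one reviewed cluster back into labelled record pairs is below. The helper name `label_pairs` and the example data are made up for illustration; only the `matchgroup` and `hashid` column names come from the script above.

```r
# Hypothetical helper: given one hand-matched cluster, return every record
# pair in it with a logical `match` label.
label_pairs <- function(cluster) {
  stopifnot(nrow(cluster) >= 2)
  idx <- combn(nrow(cluster), 2)  # 2 x n_pairs matrix of row indices
  data.frame(
    hashid_1 = cluster$hashid[idx[1, ]],
    hashid_2 = cluster$hashid[idx[2, ]],
    match = cluster$matchgroup[idx[1, ]] == cluster$matchgroup[idx[2, ]],
    stringsAsFactors = FALSE
  )
}

# Made-up cluster as it might come back from the oracle: rows 1 and 2 were
# judged to be the same person, so they now share match group 1.
reviewed <- data.frame(
  matchgroup = c(1, 1, 3, 4),
  hashid = c("r1", "r2", "r3", "r4"),
  stringsAsFactors = FALSE
)
labelled <- label_pairs(reviewed)
print(labelled)
```

With several clusters in one sheet, the same helper could be applied per cluster after dropping the blank separator rows and splitting on `clusterid`.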