thegargiulian/1_introduction.md

## 1_introduction.md

      
### Creating blocks for a human oracle for record linkage

Sometimes we ask a human oracle to label the match status of record pairs, generating training data for rule-learning record linkage models. `2_oracle-preparation.R` contains example code for this task, based on my experience working with human oracles; it uses synthetic record data from the RecordLinkage package in R.

#### Block size

I aim for blocks of 150 to 250 records. Anything smaller is of limited utility (you don't want a model learning really intricate rules, because you will get too many candidate rules), and anything much larger is difficult for the oracle to work with.

#### Preparing the data
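
One rough way to check candidate block sizes before committing to a blocking key is to tabulate the records per key and see which counts fall in the 150 to 250 range. The following is a minimal sketch on made-up data: the `records` data frame, its contents, and the `block_sizes` helper are hypothetical names used only for illustration.

```r
# Hypothetical toy data: 600 records with a birth-year blocking key.
records <- data.frame(
  id = seq_len(600),
  by = rep(c(1964, 1965, 1966), times = c(200, 100, 300))
)

# Count records per candidate block and flag those inside the target range.
block_sizes <- function(key, min_size = 150, max_size = 250) {
  sizes <- as.data.frame(table(key), stringsAsFactors = FALSE)
  names(sizes) <- c("block", "n")
  sizes$in_range <- sizes$n >= min_size & sizes$n <= max_size
  sizes
}

sizes <- block_sizes(records$by)
print(sizes)
```

In this toy data only the 1964 block (200 records) falls inside the range; the 1965 block is too small and the 1966 block is too large.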

After you've decided on some blocks, you'll want to prepare the data for the oracle. They'll likely be looking at it in a spreadsheet, so you'll want to make the fields relevant to them easy to access, while still keeping the information you'll need to read the data back in once it's been through hand matching. Here are some practices I follow when preparing the data for the oracle:

- Each block of records (cluster) should get a cluster number. I generate this cluster number by hashing the record identifiers (I use sha1 hashes) of all the records included in the block. In R, I use the digest package for hashing.
- Records should be sorted so that likely matches appear near each other. For example, if I have person-level data, I might sort the records in ascending order by (last name, first name).
- After records have been sorted within a cluster, they should get match group numbers. To begin, the match group number should be the row number of the record within the cluster. The oracle will change this field during hand matching so that matching records get the same match group number.
- Order the dataframe columns according to which fields you believe are substantively important for making good matches. For example, first and last name are important for the oracle to make a good match; the record hashid is not. The match group number should be the leftmost column, regardless of the ordering of the other columns. You can also add a blank notes column for the oracle. If you do, the columns should be ordered: notes, match group, ...
- Leave a few blank rows between different blocks (this saves the oracle the trouble of checking cluster numbers to see where one cluster ends and the next begins).
- Save all of the blocks you create somewhere that won't get overwritten. Even if you decide to change direction in terms of blocking, having the blocks can still be useful for model building in the future.
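
Once the oracle has finished, the match group numbers described above can be turned back into pair-level match labels. The following is a minimal sketch, assuming a block with at least two records; the `labeled` data frame, its values, and the `pairs_from_matchgroups` helper are hypothetical and not part of the example script.

```r
# Hypothetical block as returned by the oracle: the first two records were
# judged to be the same person, so they share match group number 1.
labeled <- data.frame(
  matchgroup = c(1, 1, 3, 4),
  lname_c1   = c("MEIER", "MEYER", "SCHMIDT", "WEBER"),
  hashid     = c("a1", "b2", "c3", "d4"),
  stringsAsFactors = FALSE
)

# Enumerate every unordered pair of records within the block and label a
# pair as a match when both records share a match group number.
pairs_from_matchgroups <- function(block) {
  idx <- combn(nrow(block), 2)  # 2 x choose(n, 2) matrix of row indices
  data.frame(
    hashid_1 = block$hashid[idx[1, ]],
    hashid_2 = block$hashid[idx[2, ]],
    is_match = block$matchgroup[idx[1, ]] == block$matchgroup[idx[2, ]],
    stringsAsFactors = FALSE
  )
}

pairs <- pairs_from_matchgroups(labeled)
print(pairs)
```

The number of pairs grows quadratically with block size: a block of 200 records yields 200 * 199 / 2 = 19,900 labeled pairs.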


## 2_oracle-preparation.R
library(pacman)

p_load(RecordLinkage, digest)

# === functions


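# Build one block for the oracle from all records with the given birth year.
# Relies on the global `var_order` defined in the main section below.
generate_block <- function(year) {
  block <- RLdata10000[RLdata10000$by == year, ]
  # sort so that likely matches appear near each other
  block <- block[order(block$lname_c1, block$fname_c1), ]
  # initial match group is the row number within the cluster
  block$matchgroup <- seq_len(nrow(block))
  # cluster id: sha1 hash of the hashids of the records in this block only
  block$clusterid <- digest(block$hashid, "sha1")
  block$notes <- ""
  block <- block[, var_order]
  return(block)
}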


# === main
data("RLdata10000")
# record identifier: sha1 hash of each row's contents
RLdata10000$hashid <- apply(RLdata10000, 1, digest, "sha1")

var_order <- c("notes", "matchgroup", "lname_c1", "fname_c1", "lname_c2",
               "fname_c2", "by", "bm", "bd", "hashid", "clusterid")

block_1 <- generate_block(1964)
block_2 <- generate_block(1965)

# a few blank rows to visually separate blocks in the spreadsheet
blank_df <- as.data.frame(matrix(data="", nrow=5, ncol=length(var_order)))
names(blank_df) <- var_order

appended_records <- rbind(block_1, blank_df, block_2)

# done.