@joelbecker
Last active November 23, 2017 20:56
Data fusion pull request description.

Data Fusion for recordlinkage

This pull request introduces a new section of the recordlinkage API for merging matched records into a new data frame. This PR implements the following:

  • A suite of conflict resolution functions that resolve conflicts based on observed values, metadata, and user specifications (see: recordlinkage/algorithms/conflict_resolution.py).
  • The FuseCore abstract class which implements features common to all data fusion use cases, including a general-purpose framework for handling data conflicts and implementations of several common conflict resolution strategies (see: recordlinkage/fusion.py).
  • The FuseLinks class, which implements data fusion for two data frames (see: recordlinkage/fusion.py).
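To make the idea of a conflict resolution function concrete, here is a minimal sketch of a "choose the longest value" strategy with a tie-break. The names `longest_value` and `first_value` and their signatures are hypothetical, chosen for illustration; they are not the functions defined in recordlinkage/algorithms/conflict_resolution.py.

```python
from typing import Callable, Sequence


def first_value(values: Sequence[str]) -> str:
    """Hypothetical deterministic tie-break: return the first candidate."""
    return values[0]


def longest_value(
    values: Sequence[str],
    tie_break: Callable[[Sequence[str]], str] = first_value,
) -> str:
    """Return the longest candidate value.

    When several candidates share the maximum length, the decision is
    delegated to tie_break. This is an illustrative sketch only.
    """
    if not values:
        raise ValueError("no values to resolve")
    max_len = max(len(v) for v in values)
    longest = [v for v in values if len(v) == max_len]
    if len(longest) == 1:
        return longest[0]
    return tie_break(longest)
```

A resolver of this shape takes the conflicting values for one field and returns a single value, which is why a tie-break can itself be another resolver.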

Example Usage

Here is an example demonstrating how to fuse two data frames (using load_febrl4, as in linking two datasets). To run the script, download the full code here.

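import recordlinkage as rl
from recordlinkage.algorithms import conflict_resolution as cr

# pairs, dfA, dfB, and matches come from the indexing, comparison, and
#   classification steps that precede fusion (see the full script).

# Fusion step
fuse = rl.FuseLinks()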

# Prefer values in dataframe a
fuse.trust_your_friends('given_name', 'given_name', trusted='a', name='given_name')

# Choose values from the row that was updated most recently
fuse.keep_up_to_date('surname', 'surname', 'dates_updated', 'dates_updated', name='surname')

# Take the average of salary values
fuse.meet_in_the_middle('salary', 'salary', metric='mean', name='salary')

# Choose randomly between street numbers
fuse.roll_the_dice('street_number', 'street_number', name='street_number')

# Keep all social security id values for future processing.
fuse.pass_it_on('soc_sec_id', 'soc_sec_id', name='soc_sec_id')

# Handle data conflicts between multiple columns in each data frame
fuse.meet_in_the_middle(['min', 'max'], ['min', 'max'], metric='stdev', name='spread')

# Create custom conflict handling strategies with the resolve method
fuse.resolve(
    cr.choose_longest,
    ['address_1', 'address_2'],
    ['address_1', 'address_2'],
    tie_break=cr.choose_random,
    name='longest_address'
)

# Execute the scheduled conflict resolution jobs for the given
#   candidate links, data, and classifications.
fused = fuse.fuse(pairs, dfA, dfB, matches)
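The pattern in the example above (schedule resolution jobs first, then execute them over the matched pairs) can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions, not the FuseLinks implementation: records are plain dicts rather than data frames, and `prefer_a`, `average`, and `fuse_pairs` are hypothetical names.

```python
def prefer_a(value_a, value_b):
    """Always trust the value from source A."""
    return value_a


def average(value_a, value_b):
    """Resolve a numeric conflict by taking the mean of both values."""
    return (value_a + value_b) / 2


def fuse_pairs(matches, records_a, records_b, jobs):
    """Build one fused row per matched pair.

    jobs is a list of (output_name, column_in_a, column_in_b, resolver)
    tuples; each resolver receives the two conflicting values.
    """
    fused = []
    for id_a, id_b in matches:
        row_a, row_b = records_a[id_a], records_b[id_b]
        fused.append({
            name: resolver(row_a[col_a], row_b[col_b])
            for name, col_a, col_b, resolver in jobs
        })
    return fused


# Toy data: one matched pair with conflicting values in both fields.
records_a = {"a1": {"given_name": "Jon", "salary": 50000}}
records_b = {"b1": {"given_name": "John", "salary": 54000}}
jobs = [
    ("given_name", "given_name", "given_name", prefer_a),
    ("salary", "salary", "salary", average),
]
fused = fuse_pairs([("a1", "b1")], records_a, records_b, jobs)
```

Separating scheduling from execution means the same list of jobs can be applied to any set of matched pairs.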

Notes

Data Fusion for Deduplication

As discussed on the relevant issue page, data fusion for deduplication poses a different set of challenges than data fusion for data frame linking. The FuseCore class was designed to be general enough to support both FuseLinks and FuseDuplicates, but at this time only FuseLinks has been implemented. FuseDuplicates will require (a) clustering algorithms to determine which groups of records should be fused into a new row, and (b) an implementation of FuseDuplicates._make_resolution_series that processes data from grouped records into the format required by FuseCore.
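As one illustration of requirement (a), the simplest grouping rule treats matched pairs as edges in a graph and fuses each connected component into one row. The sketch below uses a union-find structure to do this. It is an assumption about one possible approach, not the clustering algorithm planned for FuseDuplicates, and `group_duplicates` is a hypothetical name.

```python
def group_duplicates(pairs):
    """Group record ids into connected components of the match graph.

    pairs is an iterable of (id, id) tuples classified as duplicates.
    Returns a list of sets, ordered by the smallest id in each set.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the
        # root while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(groups.values(), key=min)
```

Connected components are transitive: if r1 matches r2 and r2 matches r3, all three land in one group even when r1 and r3 were never classified as a match. A stricter clustering rule may be preferable when false positive links are common.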

Refining Link Mapping

We also discussed implementing a feature which would "refine" the set of candidate links to enforce 1-to-n, m-to-1, or 1-to-1 link mappings. I chose not to implement this feature as a part of FuseLinks. This was partly to manage the scope of this pull request, and partly because I think that mapping refinement is an aspect of the classification step, not the data fusion step.
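For reference, a 1-to-1 refinement could look roughly like the greedy sketch below. It assumes each candidate link carries a similarity score, which is an assumption made for illustration, and `refine_one_to_one` is a hypothetical name rather than part of this pull request or of recordlinkage.

```python
def refine_one_to_one(scored_links):
    """Greedily keep the highest-scoring links so that each record in
    either data frame appears in at most one link.

    scored_links is an iterable of (id_a, id_b, score) tuples.
    """
    used_a, used_b = set(), set()
    kept = []
    # Visit links from the highest score to the lowest.
    for id_a, id_b, _score in sorted(
        scored_links, key=lambda link: link[2], reverse=True
    ):
        if id_a not in used_a and id_b not in used_b:
            kept.append((id_a, id_b))
            used_a.add(id_a)
            used_b.add(id_b)
    return kept
```

The greedy pass is simple but does not guarantee the assignment with the highest total score. Dropping the `used_b` check, or the `used_a` check, relaxes the constraint on one side and yields a one-to-many mapping in the corresponding direction.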

Code Review & Questions

This is a fairly large pull request, and I can't explain all the specifics of the data fusion classes in this description. Please ask as many questions as you want!

Acknowledgements

Jillian Anderson was instrumental in designing the data fusion API implemented in this pull request.