Data Fusion for
This pull request introduces a new section of the recordlinkage API for merging matched records into a new data frame. This PR implements the following:
- A suite of conflict resolution functions for handling conflicts based on observed values, metadata, and user specifications (see:
- The FuseCore abstract class, which implements features common to all data fusion use cases, including a general-purpose framework for handling data conflicts and implementations of several common conflict resolution strategies (see:
- The FuseLinks class, which implements data fusion for two data frames (see:
    # Assumes `import recordlinkage as rl`; `cr` is the conflict resolution
    # module added by this PR (its import path is not shown in this description).

    # Fusion step
    fuse = rl.FuseLinks()

    # Prefer values in dataframe a
    fuse.trust_your_friends('given_name', 'given_name', trusted='a', name='given_name')

    # Choose values from the row that was updated most recently
    fuse.keep_up_to_date('surname', 'surname', 'dates_updated', 'dates_updated', name='surname')

    # Take the average of salary values
    fuse.meet_in_the_middle('salary', 'salary', metric='mean', name='salary')

    # Choose randomly between street numbers
    fuse.roll_the_dice('street_number', 'street_number', name='street_number')

    # Keep all social security id values for future processing.
    fuse.pass_it_on('soc_sec_id', 'soc_sec_id', name='soc_sec_id')

    # Handle data conflicts between multiple columns in each data frame
    fuse.meet_in_the_middle(['min', 'max'], ['min', 'max'], metric='stdev', name='spread')

    # Create custom conflict handling strategies with the resolve method
    fuse.resolve(
        cr.choose_longest,
        ['address_1', 'address_2'],
        ['address_1', 'address_2'],
        tie_break=cr.choose_random,
        name='longest_address'
    )

    # Execute the scheduled conflict resolution jobs for the given
    # candidate links, data, and classifications.
    fused = fuse.fuse(pairs, dfA, dfB, matches)
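To make the idea of a conflict resolution function with a tie-break more concrete, here is a minimal pure-Python sketch. It is not the code in this PR: `pick_longest`, `pick_first`, `pick_mean`, and `drop_missing` are hypothetical stand-ins that only mimic the general behaviour suggested by names such as `choose_longest` and `meet_in_the_middle`, and `None` is assumed to represent a missing value.

```python
def drop_missing(values):
    """Remove None entries, which stand in for missing data in this sketch."""
    return [v for v in values if v is not None]


def pick_first(values):
    """Deterministic tie-break (hypothetical): keep the first candidate."""
    return values[0]


def pick_longest(values, tie_break=pick_first):
    """Keep the longest string; defer to tie_break if several share the maximum length."""
    candidates = drop_missing(values)
    if not candidates:
        return None
    max_len = max(len(v) for v in candidates)
    longest = [v for v in candidates if len(v) == max_len]
    return longest[0] if len(longest) == 1 else tie_break(longest)


def pick_mean(values):
    """Resolve a numeric conflict by averaging the non-missing values."""
    candidates = drop_missing(values)
    return sum(candidates) / len(candidates) if candidates else None


# One matched pair of records, one from each data frame (made-up data).
record_a = {"address": "12 Main St", "salary": 50000}
record_b = {"address": "12 Main Street, Unit 4", "salary": 54000}

fused_row = {
    "address": pick_longest([record_a["address"], record_b["address"]]),
    "salary": pick_mean([record_a["salary"], record_b["salary"]]),
}
print(fused_row)
```

Each resolver takes the conflicting values for one output field and returns a single value, which is roughly the contract a scheduled resolution job has to satisfy regardless of how the real classes organise the data.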
Data Fusion for Deduplication
As discussed in the relevant issue page, data fusion for deduplication has a different set of challenges than data fusion for dataframe linking. The FuseCore class was designed to be general enough to implement both FuseLinks and FuseDuplicates, but at this time only FuseLinks has been implemented. FuseDuplicates will require (a) clustering algorithms to determine groups of records to be fused into a new row, and (b) implementing FuseDuplicates._make_resolution_series to process data from grouped records into the format required by
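As an illustration of requirement (a), one simple way to group duplicate records is to treat every matched pair as an edge and take connected components. The sketch below uses a small union-find structure in pure Python. This is only one possible clustering rule (it assumes matches are transitive, which may be too permissive in practice), and `cluster_matches` is a hypothetical helper, not part of this PR.

```python
def cluster_matches(pairs):
    """Group record ids so that records linked by a chain of matches share a cluster."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(members) for members in clusters.values())


# Matched pairs within a single data frame (made-up record ids).
matched_pairs = [(0, 3), (3, 7), (1, 2)]
groups = cluster_matches(matched_pairs)
print(groups)
```

Each resulting group would then be the set of rows that a deduplication fuser collapses into one output row.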
Refining Link Mapping
We also discussed implementing a feature which would "refine" the set of candidate links to enforce 1-to-n, m-to-1, or 1-to-1 link mappings. I chose not to implement this feature as a part of FuseLinks. This was partly to manage the scope of this pull request, and partly because I think that mapping refinement is an aspect of the classification step, not the data fusion step.
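For reference, a 1-to-1 refinement could look roughly like the greedy sketch below. It assumes each candidate link carries a similarity score, which this description does not specify, and `refine_one_to_one` is a hypothetical name. Greedy selection is simple but not guaranteed to maximise the total score; an assignment algorithm would be needed for that.

```python
def refine_one_to_one(scored_links):
    """Accept links in descending score order, skipping any that reuse a record."""
    used_a, used_b = set(), set()
    kept = []
    for a, b, _score in sorted(scored_links, key=lambda link: link[2], reverse=True):
        if a in used_a or b in used_b:
            continue  # one side is already linked to a better-scoring partner
        used_a.add(a)
        used_b.add(b)
        kept.append((a, b))
    return kept


# (record in A, record in B, similarity score); made-up values.
candidate_links = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b1", 0.7),
    ("a2", "b2", 0.6),
]
one_to_one = refine_one_to_one(candidate_links)
print(one_to_one)
```

Relaxing only one of the two `used_*` checks would give the 1-to-n or m-to-1 variants.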
Code Review & Questions
This is a fairly large pull request, and I can't explain all the specifics of the data fusion classes in this description. Please ask as many questions as you want!
Jillian Anderson was instrumental in designing the data fusion API implemented in this pull request.