This pull request introduces a new section of the recordlinkage API for merging matched records into a new data frame. This PR implements the following:
- A suite of conflict resolution functions for handling conflicts based on observed values, metadata, and user specifications (see: recordlinkage/algorithms/conflict_resolution.py).
- The FuseCore abstract class, which implements features common to all data fusion use cases, including a general-purpose framework for handling data conflicts and implementations of several common conflict resolution strategies (see: recordlinkage/fusion.py).
- The FuseLinks class, which implements data fusion for two data frames (see: recordlinkage/fusion.py).

Here is an example demonstrating how to fuse two data frames (using load_febrl4 as in linking two datasets). To run the script, download the full code here.
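# Imports assumed from the module paths listed above. pairs, dfA, dfB, and
# matches come from the earlier indexing, comparison, and classification steps.
import recordlinkage as rl
from recordlinkage.algorithms import conflict_resolution as cr

# Fusion step
fuse = rl.FuseLinks()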
# Prefer values in dataframe a
fuse.trust_your_friends('given_name', 'given_name', trusted='a', name='given_name')
# Choose values from the row that was updated most recently
fuse.keep_up_to_date('surname', 'surname', 'dates_updated', 'dates_updated', name='surname')
# Take the average of salary values
fuse.meet_in_the_middle('salary', 'salary', metric='mean', name='salary')
# Choose randomly between street numbers
fuse.roll_the_dice('street_number', 'street_number', name='street_number')
# Keep all social security id values for future processing.
fuse.pass_it_on('soc_sec_id', 'soc_sec_id', name='soc_sec_id')
# Handle data conflicts between multiple columns in each data frame
fuse.meet_in_the_middle(['min', 'max'], ['min', 'max'], metric='stdev', name='spread')
# Create custom conflict handling strategies with the resolve method
fuse.resolve(
    cr.choose_longest,
    ['address_1', 'address_2'],
    ['address_1', 'address_2'],
    tie_break=cr.choose_random,
    name='longest_address'
)
# Execute the scheduled conflict resolution jobs for the given
# candidate links, data, and classifications.
fused = fuse.fuse(pairs, dfA, dfB, matches)
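
To make the strategy names above more concrete, here is a minimal pure-Python sketch of the idea behind a few of them. The helper names and signatures are hypothetical and chosen for illustration only. They are not the functions in recordlinkage/algorithms/conflict_resolution.py, which operate on the data structures prepared by FuseCore.

```python
import statistics


def choose_trusted(value_a, value_b, trusted="a"):
    """Prefer the value from the trusted source; fall back if it is missing."""
    preferred, other = (value_a, value_b) if trusted == "a" else (value_b, value_a)
    return preferred if preferred is not None else other


def choose_most_recent(value_a, value_b, updated_a, updated_b):
    """Keep the value whose row was updated last (ties go to source a)."""
    return value_a if updated_a >= updated_b else value_b


def average(values):
    """Resolve a numeric conflict by taking the mean."""
    return statistics.mean(values)


def choose_longest(values, tie_break=None):
    """Keep the longest string; delegate ties to tie_break if one is given."""
    longest = max(len(v) for v in values)
    candidates = [v for v in values if len(v) == longest]
    if len(candidates) == 1 or tie_break is None:
        return candidates[0]
    return tie_break(candidates)
```

Each helper takes the conflicting values (plus any metadata it needs) and returns a single resolved value, which is roughly the contract a conflict resolution function has to satisfy before its results can be collected into a fused column.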
As discussed in the relevant issue page, data fusion for deduplication has a different set of challenges than data fusion for dataframe linking. The FuseCore class was designed to be general enough to implement both FuseLinks and FuseDuplicates, but at this time only FuseLinks has been implemented. FuseDuplicates will require (a) clustering algorithms to determine groups of records to be fused into a new row, and (b) implementing FuseDuplicates._make_resolution_series to process data from grouped records into the format required by FuseCore.
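
As a rough illustration of requirement (a), one simple way to group duplicate records is to treat matched pairs as edges and take connected components. The sketch below does this with a small union-find; the function name cluster_matches is hypothetical, and this PR does not prescribe any particular clustering algorithm. Note that transitive grouping like this can over-merge when a single false match bridges two otherwise separate groups.

```python
def cluster_matches(pairs):
    """Group record ids into clusters using connected components.

    pairs is an iterable of (id_1, id_2) tuples classified as matches.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root,
        # compressing the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(members) for members in clusters.values())


groups = cluster_matches([(1, 2), (2, 3), (4, 5)])
```

Each resulting group would then be the set of rows that a FuseDuplicates implementation needs to collapse into a single fused row.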
We also discussed implementing a feature which would "refine" the set of candidate links to enforce 1-to-n, m-to-1, or 1-to-1 link mappings. I chose not to implement this feature as a part of FuseLinks. This was partly to manage the scope of this pull request, and partly because I think that mapping refinement is an aspect of the classification step, not the data fusion step.
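
For readers unfamiliar with mapping refinement, the following is a minimal sketch of a greedy 1-to-1 refinement. It is not part of this pull request; the function name refine_one_to_one and the assumption that every candidate link carries a numeric match score are both hypothetical. A greedy pass is easy to follow but is not guaranteed to maximize the total score, for which an assignment algorithm would be needed.

```python
def refine_one_to_one(scored_links):
    """Keep the highest-scoring links so each record appears at most once.

    scored_links is an iterable of (id_a, id_b, score) tuples.
    Returns the kept (id_a, id_b) pairs in descending score order.
    """
    used_a, used_b = set(), set()
    kept = []
    # Visit the strongest links first; skip any link that reuses a record.
    for id_a, id_b, _score in sorted(scored_links, key=lambda link: link[2], reverse=True):
        if id_a in used_a or id_b in used_b:
            continue
        kept.append((id_a, id_b))
        used_a.add(id_a)
        used_b.add(id_b)
    return kept


links = [("a1", "b1", 0.9), ("a1", "b2", 0.8), ("a2", "b1", 0.7), ("a2", "b2", 0.6)]
one_to_one = refine_one_to_one(links)
```

Dropping only one of the two "used" checks would give the 1-to-n or m-to-1 variants, which is one reason this step fits naturally alongside classification rather than fusion.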
This is a fairly large pull request, and I can't explain all the specifics of the data fusion classes in this description. Please ask as many questions as you want!
Jillian Anderson was instrumental in designing the data fusion API implemented in this pull request.