Joel Becker joelbecker

joelbecker /
Last active Dec 20, 2017
An implementation of XKCD 1930: Calendar Facts in Python 3.
import random
class Statement:
    def __init__(self, *args, suffix=''):
        for item in args:
            assert (
                isinstance(item, (str, Branch))
joelbecker / ref-fusion.rst
Created Nov 23, 2017
Draft API reference for the recordlinkage data fusion suite.

Data Fusion

The recordlinkage.FuseLinks class fuses two linked data sets into a single data set. It provides a flexible framework for handling conflicting data.

.. automodule:: recordlinkage.fusion

.. autoclass:: FuseLinks
joelbecker / ref-conflict-resolution.rst
Created Nov 23, 2017
Draft API reference for the recordlinkage conflict resolution function suite.

Conflict Resolution

The recordlinkage.algorithms.conflict_resolution module contains a large number of conflict resolution functions. Use these functions with recordlinkage.FuseLinks.resolve when you need a conflict handling strategy that the recordlinkage.FuseLinks interface does not currently implement.
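
To make the idea of a conflict resolution function concrete, here is a minimal sketch in plain Python. The function names and signatures below are hypothetical illustrations, not the actual contents of recordlinkage.algorithms.conflict_resolution. Each function is assumed to receive a tuple of candidate values for one field (with None marking a missing value) and to return a single resolved value.

```python
import random
from collections import Counter


def choose_first(values):
    """Return the first non-missing value, or None if every value is missing."""
    for value in values:
        if value is not None:
            return value
    return None


def vote(values):
    """Return the most common non-missing value (ties go to the first seen)."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def agree_or_none(values):
    """Return a value only when all non-missing values agree, else None."""
    present = {v for v in values if v is not None}
    return present.pop() if len(present) == 1 else None


def choose_random(values, rng=random):
    """Return a randomly chosen non-missing value, or None if all are missing."""
    present = [v for v in values if v is not None]
    return rng.choice(present) if present else None
```

All four share one calling convention, so a caller can swap strategies without changing the surrounding code. Whether the real module follows this convention is an assumption.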

These conflict resolution functions are based on:

joelbecker /
Last active Nov 23, 2017
Data fusion pull request description.

Data Fusion for recordlinkage

This pull request introduces a new section of the recordlinkage API for merging matched records into a new data frame. This PR implements the following:

  • A suite of conflict resolution functions for handling conflicts based on observed values, metadata, and user specifications (see: recordlinkage/algorithms/
  • The FuseCore abstract class which implements features common to all data fusion use cases, including a general-purpose framework for handling data conflicts and implementations of several common conflict resolution strategies (see: recordlinkage/
  • The FuseLinks class, which implements data fusion for two data frames (see: recordlinkage/
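
The overall idea of merging matched records into a new table can be sketched in plain Python as follows. This is an illustration only and not the FuseCore or FuseLinks implementation: the function fuse_links, the resolver helpers, and the dictionary-based record layout are all assumptions made for this example.

```python
def fuse_links(records_a, records_b, links, resolvers):
    """Build one fused row per link (hypothetical helper, not the library API).

    records_a, records_b: dicts mapping record id -> dict of field values.
    links: iterable of (id_a, id_b) pairs that were matched.
    resolvers: dict mapping field name -> function(value_a, value_b) -> value.
    """
    fused = []
    for id_a, id_b in links:
        rec_a, rec_b = records_a[id_a], records_b[id_b]
        row = {
            field: resolve(rec_a.get(field), rec_b.get(field))
            for field, resolve in resolvers.items()
        }
        fused.append(row)
    return fused


def prefer_a(a, b):
    """Keep the value from data set A unless it is missing."""
    return a if a is not None else b


def longest(a, b):
    """Keep the longer of the two non-missing strings."""
    candidates = [v for v in (a, b) if v is not None]
    return max(candidates, key=len) if candidates else None


records_a = {"a1": {"name": "J. Smith", "city": "Leeds"}}
records_b = {"b7": {"name": "John Smith", "city": None}}
links = [("a1", "b7")]

fused = fuse_links(records_a, records_b, links,
                   {"name": longest, "city": prefer_a})
print(fused)
```

Keeping the per-field resolvers separate from the loop over links mirrors the split the list above describes between conflict resolution functions and the fusion classes.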

Example Usage

joelbecker /
Last active Nov 26, 2017
A recordlinkage data fusion example.
from datetime import datetime
from random import randrange
import recordlinkage as rl
import recordlinkage.algorithms.conflict_resolution as cr
from recordlinkage.datasets import load_febrl4
dfA, dfB = load_febrl4()
# Adapt dataset for example
joelbecker /
Last active Aug 3, 2017
Some example code for upcoming data fusion tools in recordlinkage.
# Initialize
fuse = rl.FuseLinks(unique_a=False, unique_b=False)
# Queue inclusion of non-conflicting columns
fuse.keep(['dfa_col_1', 'dfa_col_2', 'dfa_col_3'], ['dfb_col_1', 'dfb_col_2', 'dfb_col_3'])
# Queue conflict resolution jobs
fuse.no_gossiping('col1', 'col2', name='no_gossip')
fuse.roll_the_dice('col1', 'col2', name='random')
fuse.trust_your_friends('col1', 'col2', trusted='b', name='trust_b')
joelbecker /
Created Jul 19, 2017
A prototype data fusion workflow for the Python recordlinkage toolkit.
import pandas as pd
import recordlinkage as rl
# Pseudo Source Code
class FuseCore(object):
joelbecker /
Created Jul 17, 2017
This is a supplementary script for NetLab's tutorial on candidate link indexing with the recordlinkage data integration library.
Adapted from
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
N = 10