Joel Becker joelbecker

@joelbecker
joelbecker / rl_indexing_grid_graphics.py
Created Jul 17, 2017
This is a supplementary script for NetLab's tutorial on candidate link indexing with the recordlinkage data integration library.
"""
Adapted from https://stackoverflow.com/questions/19586828/drawing-grid-pattern-in-matplotlib
"""
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
N = 10
@joelbecker
joelbecker / rl_fuse_pseudocode.py
Created Jul 19, 2017
A prototype data fusion workflow for the Python recordlinkage toolkit.
import pandas as pd
import recordlinkage as rl
####################
# Pseudo Source Code
####################
class FuseCore(object):
@joelbecker
joelbecker / data_fusion_example.py
Last active Aug 3, 2017
Some example code for upcoming data fusion tools in recordlinkage.
import recordlinkage as rl

# Initialize (rl.FuseLinks belongs to the upcoming data fusion tools)
fuse = rl.FuseLinks(unique_a=False, unique_b=False)
# Queue inclusion of non-conflicting columns
fuse.keep(['dfa_col_1', 'dfa_col_2', 'dfa_col_3'], ['dfb_col_1', 'dfb_col_2', 'dfb_col_3'])
# Queue conflict resolution jobs
fuse.no_gossiping('col1', 'col2', name='no_gossip')
fuse.roll_the_dice('col1', 'col2', name='random')
fuse.trust_your_friends('col1', 'col2', trusted='b', name='trust_b')
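
The three strategies queued above can be illustrated with a minimal standalone sketch. The functions below are hypothetical stand-ins written in plain Python, not the recordlinkage implementation: they take one value from each data frame and return the resolved value. The behavior assumed here is that no_gossiping keeps only values on which both sources agree, roll_the_dice picks a value at random, and trust_your_friends prefers the trusted source. Falling back to the other source when the trusted value is missing is an additional assumption.

```python
import random


# Hypothetical helpers for illustration only; the names mirror the
# FuseLinks methods above but this is not the library's code.
def no_gossiping(value_a, value_b):
    """Return the shared value when both sources agree, otherwise None."""
    return value_a if value_a == value_b else None


def roll_the_dice(value_a, value_b, rng=random):
    """Pick one of the two candidate values at random."""
    return rng.choice([value_a, value_b])


def trust_your_friends(value_a, value_b, trusted="b"):
    """Prefer the value from the trusted source ('a' or 'b').

    Assumption: if the trusted value is None, use the other source.
    """
    if trusted not in ("a", "b"):
        raise ValueError("trusted must be 'a' or 'b'")
    preferred, fallback = (value_a, value_b) if trusted == "a" else (value_b, value_a)
    return preferred if preferred is not None else fallback
```

Passing a seeded `random.Random` instance as `rng` makes the random strategy reproducible in tests.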
@joelbecker
joelbecker / ref-conflict-resolution.rst
Created Nov 23, 2017
Draft API reference for the recordlinkage conflict resolution function suite.

Conflict Resolution

The recordlinkage.algorithms.conflict_resolution module contains a suite of conflict resolution functions. Pass these functions to recordlinkage.FuseLinks.resolve when you need a conflict handling strategy that the recordlinkage.FuseLinks interface does not yet implement.

These conflict resolution functions are based on:

@joelbecker
joelbecker / ref-fusion.rst
Created Nov 23, 2017
Draft API reference for the recordlinkage data fusion suite.

Data Fusion

The recordlinkage.FuseLinks class turns two linked data sets into a single data set. It provides a flexible framework for handling conflicting data.

.. automodule:: recordlinkage.fusion

.. autoclass:: FuseLinks
@joelbecker
joelbecker / 2017-11-22-fusion-pr.md
Last active Nov 23, 2017
Data fusion pull request description.

Data Fusion for recordlinkage

This pull request introduces a new section of the recordlinkage API for merging matched records into a new data frame. This PR implements the following:

  • A suite of conflict resolution functions for handling conflicts based on observed values, metadata, and user specifications (see: recordlinkage/algorithms/conflict_resolution.py).
  • The FuseCore abstract class which implements features common to all data fusion use cases, including a general-purpose framework for handling data conflicts and implementations of several common conflict resolution strategies (see: recordlinkage/fusion.py).
  • The FuseLinks class, which implements data fusion for two data frames (see: recordlinkage/fusion.py).
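
As a rough illustration of the idea behind the components listed above, the sketch below fuses matched records from two small tables. It is an assumption-laden toy in plain Python, not the code in recordlinkage/fusion.py: `fuse_links`, `prefer_b`, the table layout, and the sample records are all hypothetical. Non-conflicting columns are copied as they are, and each conflicting column is passed through a resolver function.

```python
def fuse_links(table_a, table_b, links, keep_a, keep_b, resolvers):
    """Fuse linked records from two tables into a list of new rows.

    table_a, table_b: dicts mapping record id -> dict of column values.
    links: iterable of (id_a, id_b) pairs judged to be matches.
    keep_a, keep_b: columns copied unchanged from each side
        (assumed not to share names).
    resolvers: dict mapping output column -> (col_a, col_b, function).
    """
    fused = []
    for id_a, id_b in links:
        rec_a, rec_b = table_a[id_a], table_b[id_b]
        # Copy the non-conflicting columns from each side.
        row = {col: rec_a[col] for col in keep_a}
        row.update({col: rec_b[col] for col in keep_b})
        # Resolve each conflicting column with its own function.
        for name, (col_a, col_b, resolve) in resolvers.items():
            row[name] = resolve(rec_a[col_a], rec_b[col_b])
        fused.append(row)
    return fused


def prefer_b(value_a, value_b):
    """Hypothetical resolver: take table B's value unless it is missing."""
    return value_b if value_b is not None else value_a


# Made-up sample records for illustration.
table_a = {"a1": {"name": "Ann", "city": "Oslo"}}
table_b = {"b7": {"phone": "555-0100", "city": "Bergen"}}

rows = fuse_links(
    table_a,
    table_b,
    links=[("a1", "b7")],
    keep_a=["name"],
    keep_b=["phone"],
    resolvers={"city": ("city", "city", prefer_b)},
)
print(rows)  # [{'name': 'Ann', 'phone': '555-0100', 'city': 'Bergen'}]
```

Keeping the resolver as a plain function of two values means any strategy can be swapped in per column without changing the fusion loop.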

Example Usage

@joelbecker
joelbecker / 2017-11-22-fusion-pullreq-usage.py
Last active Nov 26, 2017
A recordlinkage data fusion example.
from datetime import datetime
from random import randrange
import recordlinkage as rl
import recordlinkage.algorithms.conflict_resolution as cr
from recordlinkage.datasets import load_febrl4
dfA, dfB = load_febrl4()
# Adapt dataset for example
@joelbecker
joelbecker / xkcd-calendar-facts.py
Last active Dec 20, 2017
An implementation of XKCD 1930: Calendar Facts in Python 3.
import random
class Statement:
    def __init__(self, *args, suffix=''):
        for item in args:
            assert (
                isinstance(item, (str, Branch))
            )