Joel Becker joelbecker

@joelbecker
joelbecker / xkcd-calendar-facts.py
Last active Dec 20, 2017
An implementation of XKCD 1930: Calendar Facts in Python 3.
View xkcd-calendar-facts.py
import random
class Statement:
    def __init__(self, *args, suffix=''):
        for item in args:
            assert (
                isinstance(item, (str, Branch))
            )
joelbecker / ref-fusion.rst
Created Nov 23, 2017
Draft API reference for the recordlinkage data fusion suite.
View ref-fusion.rst

Data Fusion

The recordlinkage.FuseLinks class turns two linked data sets into a single data set. It provides a flexible framework for handling conflicting data.

.. automodule:: recordlinkage.fusion

.. autoclass:: FuseLinks
joelbecker / ref-conflict-resolution.rst
Created Nov 23, 2017
Draft API reference for the recordlinkage conflict resolution function suite.
View ref-conflict-resolution.rst

Conflict Resolution

The recordlinkage.algorithms.conflict_resolution module contains a large number of conflict resolution functions. Use these functions with recordlinkage.FuseLinks.resolve when you need a conflict handling strategy that the recordlinkage.FuseLinks interface does not currently implement.
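
As a rough illustration of the idea, a conflict resolution function can be thought of as a function that takes the candidate values for one field and returns a single value. The sketch below is plain Python and does not use recordlinkage; the names `choose_trusted` and `choose_random` and their signatures are hypothetical and are not taken from the module described above.

```python
import random
from typing import Optional, Sequence

Value = Optional[str]


def choose_trusted(values: Sequence[Value], sources: Sequence[str], trusted: str) -> Value:
    """Prefer a non-missing value from the trusted source (hypothetical helper).

    Falls back to the first non-missing value from any source, or None.
    """
    for value, source in zip(values, sources):
        if source == trusted and value is not None:
            return value
    for value in values:
        if value is not None:
            return value
    return None


def choose_random(values: Sequence[Value], rng: Optional[random.Random] = None) -> Value:
    """Pick one non-missing value at random (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    candidates = [v for v in values if v is not None]
    return rng.choice(candidates) if candidates else None


# Two sources disagree on a surname; source "b" is treated as trusted.
resolved = choose_trusted(["Smith", "Smyth"], ["a", "b"], trusted="b")
```

Keeping each strategy as a small function of the candidate values makes it straightforward to swap strategies per column.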

These conflict resolution functions are based on:

joelbecker / 2017-11-22-fusion-pr.md
Last active Nov 23, 2017
Data fusion pull request description.
View 2017-11-22-fusion-pr.md

Data Fusion for recordlinkage

This pull request introduces a new section of the recordlinkage API for merging matched records into a new data frame. This PR implements the following:

  • A suite of conflict resolution functions for handling conflicts based on observed values, metadata, and user specifications (see: recordlinkage/algorithms/conflict_resolution.py).
  • The FuseCore abstract class, which implements features common to all data fusion use cases, including a general-purpose framework for handling data conflicts and implementations of several common conflict resolution strategies (see: recordlinkage/fusion.py).
  • The FuseLinks class, which implements data fusion for two data frames (see: recordlinkage/fusion.py).
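
The split between an abstract core and a two-source subclass could look roughly like the following. This is a minimal sketch in plain Python over lists of dicts, not the code in recordlinkage/fusion.py; the class names `FuseCoreSketch` and `FuseLinksSketch`, the `_candidates` hook, and the `first_non_missing` resolver are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]
Resolver = Callable[[Sequence[Optional[str]]], Optional[str]]
Link = Tuple[int, int]


class FuseCoreSketch(ABC):
    """Hypothetical base class: queues resolution jobs, then applies them per link."""

    def __init__(self) -> None:
        # Each job is (output column name, input columns, resolver function).
        self.jobs: List[Tuple[str, Sequence[str], Resolver]] = []

    def resolve(self, resolver: Resolver, columns: Sequence[str], name: str) -> None:
        """Queue a conflict resolution job; nothing is computed yet."""
        self.jobs.append((name, columns, resolver))

    @abstractmethod
    def _candidates(self, columns: Sequence[str], link: Link) -> List[Optional[str]]:
        """Return the conflicting values for one link (source-specific)."""

    def fuse(self, links: Sequence[Link]) -> List[Record]:
        """Run every queued job on every link and return the fused rows."""
        return [
            {name: resolver(self._candidates(columns, link))
             for name, columns, resolver in self.jobs}
            for link in links
        ]


class FuseLinksSketch(FuseCoreSketch):
    """Hypothetical two-source case: one column from A, one from B per job."""

    def __init__(self, records_a: List[Record], records_b: List[Record]) -> None:
        super().__init__()
        self.records_a = records_a
        self.records_b = records_b

    def _candidates(self, columns: Sequence[str], link: Link) -> List[Optional[str]]:
        col_a, col_b = columns
        idx_a, idx_b = link
        return [self.records_a[idx_a][col_a], self.records_b[idx_b][col_b]]


def first_non_missing(values: Sequence[Optional[str]]) -> Optional[str]:
    """Hypothetical resolver: take the first value that is not None."""
    return next((v for v in values if v is not None), None)


fuser = FuseLinksSketch(
    [{"name": None}, {"name": "Ann"}],
    [{"full_name": "Bob"}, {"full_name": "Anne"}],
)
fuser.resolve(first_non_missing, ("name", "full_name"), name="name")
fused = fuser.fuse([(0, 0), (1, 1)])
```

In this sketch the base class owns the job queue and the fusion loop, so a subclass only has to say how candidate values are looked up for a link.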

Example Usage

joelbecker / 2017-11-22-fusion-pullreq-usage.py
Last active Nov 26, 2017
A recordlinkage data fusion example.
View 2017-11-22-fusion-pullreq-usage.py
from datetime import datetime
from random import randrange
import recordlinkage as rl
import recordlinkage.algorithms.conflict_resolution as cr
from recordlinkage.datasets import load_febrl4
dfA, dfB = load_febrl4()
# Adapt dataset for example
joelbecker / data_fusion_example.py
Last active Aug 3, 2017
Some example code for upcoming data fusion tools in recordlinkage.
View data_fusion_example.py
# Initialize
fuse = rl.FuseLinks(unique_a=False, unique_b=False)
# Queue inclusion of non-conflicting columns
fuse.keep(['dfa_col_1', 'dfa_col_2', 'dfa_col_3'], ['dfb_col_1', 'dfb_col_2', 'dfb_col_3'])
# Queue conflict resolution jobs
fuse.no_gossiping('col1', 'col2', name='no_gossip')
fuse.roll_the_dice('col1', 'col2', name='random')
fuse.trust_your_friends('col1', 'col2', trusted='b', name='trust_b')
joelbecker / rl_fuse_pseudocode.py
Created Jul 19, 2017
A prototype data fusion workflow for the Python recordlinkage toolkit.
View rl_fuse_pseudocode.py
import pandas as pd
import recordlinkage as rl
####################
# Pseudo Source Code
####################
class FuseCore(object):
joelbecker / rl_indexing_grid_graphics.py
Created Jul 17, 2017
This is a supplementary script for NetLab's tutorial on candidate link indexing with the recordlinkage data integration library.
View rl_indexing_grid_graphics.py
"""
Adapted from https://stackoverflow.com/questions/19586828/drawing-grid-pattern-in-matplotlib
"""
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
N = 10