Joel Becker joelbecker

@joelbecker
joelbecker / chemical_city_lyrics.md
Last active April 18, 2021 17:54
Chemical City listening party lyrics sheet

Chemical City Lyrics

The Gate

Over land, I traveled time and space and quicksand
And for sixty days I knew no other road

And through the maelstrom, which turned in time with the kick drum
I was swallowed and twisted and spit out on a coast
And in this place there stands a gate that leads to the heart of the city

@joelbecker
joelbecker / xkcd-calendar-facts.py
Last active December 20, 2017 19:33
An implementation of XKCD 1930: Calendar Facts in Python 3.
import random


class Statement:
    def __init__(self, *args, suffix=''):
        for item in args:
            assert (
                isinstance(item, (str, Branch))
            )
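The preview above is cut off before `Branch` is defined, so the following is only a minimal sketch of how a `Statement`/`Branch` pair could generate a random sentence, not the gist's actual implementation. It assumes a `Branch` picks one of its options and a `Statement` joins all of its parts in order; the `render` method and the example phrases are hypothetical and chosen for illustration.

```python
import random


class Branch:
    """Hypothetical choice node: renders exactly one of its options."""

    def __init__(self, *options):
        self.options = options

    def render(self, rng):
        choice = rng.choice(self.options)
        # Options may be plain strings or nested nodes.
        return choice if isinstance(choice, str) else choice.render(rng)


class Statement:
    """Hypothetical sequence node: renders every part in order."""

    def __init__(self, *parts, suffix=""):
        for item in parts:
            assert isinstance(item, (str, Branch, Statement))
        self.parts = parts
        self.suffix = suffix

    def render(self, rng):
        words = [p if isinstance(p, str) else p.render(rng) for p in self.parts]
        return " ".join(words) + self.suffix


# Illustrative phrases only; the comic's full grammar is much larger.
fact = Statement(
    "Did you know that",
    Branch("the fall equinox", "the winter solstice", "daylight saving time"),
    Branch("happens later every year", "drifts out of sync with the sun"),
    suffix="?",
)

# A seeded generator keeps the output reproducible between runs.
sentence = fact.render(random.Random(0))
print(sentence)
```

Passing the random generator in explicitly, rather than calling the module-level functions, makes the grammar easy to test with a fixed seed.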
@joelbecker
joelbecker / ref-fusion.rst
Created November 23, 2017 20:24
Draft API reference for the recordlinkage data fusion suite.

Data Fusion

The recordlinkage.FuseLinks class turns two linked data sets into a single data set. It provides a flexible framework for handling conflicting data.

recordlinkage.fusion

FuseLinks

@joelbecker
joelbecker / ref-conflict-resolution.rst
Created November 23, 2017 20:22
Draft API reference for the recordlinkage conflict resolution function suite.

Conflict Resolution

The recordlinkage.algorithms.conflict_resolution module contains a large number of conflict resolution functions. Use these functions with recordlinkage.FuseLinks.resolve when you need a conflict handling strategy that is not already implemented in the recordlinkage.FuseLinks interface.
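To make the idea concrete, here is a minimal sketch of what a conflict resolution function might look like: it receives the candidate values for one field of a matched record (and optionally some metadata) and returns a single value. The names `vote` and `most_recent` and their signatures are assumptions made for illustration; they are not taken from the module's actual API.

```python
from collections import Counter
from datetime import datetime


def vote(values):
    """Return the most common non-missing value (ties go to the first seen)."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][0] if present else None


def most_recent(values, timestamps):
    """Return the non-missing value whose accompanying timestamp is latest."""
    dated = [(t, v) for v, t in zip(values, timestamps) if v is not None]
    return max(dated, key=lambda pair: pair[0])[1] if dated else None


# Resolution based on observed values only.
resolved_city = vote(("Sydney", "Sydney", "Sidney"))

# Resolution based on metadata (here, a last-updated timestamp per value).
resolved_phone = most_recent(
    ("555-0100", "555-0199"),
    (datetime(2016, 5, 1), datetime(2017, 11, 22)),
)
```

Both functions return `None` when every candidate is missing, so a caller can tell "no information" apart from a resolved value.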

These conflict resolution functions are based on:

@joelbecker
joelbecker / 2017-11-22-fusion-pr.md
Last active November 23, 2017 20:56
Data fusion pull request description.

Data Fusion for recordlinkage

This pull request introduces a new section of the recordlinkage API for merging matched records into a new data frame. This PR implements the following:

  • A suite of conflict resolution functions for handling conflicts based on observed values, metadata, and user specifications (see: recordlinkage/algorithms/conflict_resolution.py).
  • The FuseCore abstract class, which implements features common to all data fusion use cases, including a general-purpose framework for handling data conflicts and implementations of several common conflict resolution strategies (see: recordlinkage/fusion.py).
  • The FuseLinks class, which implements data fusion for two data frames (see: recordlinkage/fusion.py).
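The overall workflow described in this list (take two data sets plus a set of links, then resolve each output column with a chosen strategy) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the PR's implementation: records are plain dicts rather than data frames, and the `fuse` and `first_present` helpers are hypothetical names.

```python
# Two small "data sets", keyed by record id.
records_a = {
    "a1": {"name": "ann", "city": "Perth"},
    "a2": {"name": "bob", "city": None},
}
records_b = {
    "b7": {"name": "ann", "city": "Perth"},
    "b9": {"name": "rob", "city": "Hobart"},
}

# Matched record pairs, e.g. the output of a classification step.
links = [("a1", "b7"), ("a2", "b9")]


def first_present(a_value, b_value):
    """Take A's value, falling back to B's when A's is missing."""
    return a_value if a_value is not None else b_value


def fuse(records_a, records_b, links, resolvers):
    """Build one fused record per link, resolving each column independently."""
    fused = []
    for id_a, id_b in links:
        rec_a, rec_b = records_a[id_a], records_b[id_b]
        fused.append(
            {col: resolve(rec_a[col], rec_b[col]) for col, resolve in resolvers.items()}
        )
    return fused


fused = fuse(
    records_a,
    records_b,
    links,
    {
        "name": lambda a, b: a,  # always trust source A for the name
        "city": first_present,   # fill gaps in A from B
    },
)
```

Keeping one resolver per output column is what lets each field use a different conflict handling strategy.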

Example Usage

@joelbecker
joelbecker / 2017-11-22-fusion-pullreq-usage.py
Last active November 26, 2017 20:56
A recordlinkage data fusion example.
from datetime import datetime
from random import randrange
import recordlinkage as rl
import recordlinkage.algorithms.conflict_resolution as cr
from recordlinkage.datasets import load_febrl4
dfA, dfB = load_febrl4()
# Adapt dataset for example
@joelbecker
joelbecker / data_fusion_example.py
Last active August 3, 2017 14:38
Some example code for upcoming data fusion tools in recordlinkage.
import recordlinkage as rl

# Initialize
fuse = rl.FuseLinks(unique_a=False, unique_b=False)
# Queue inclusion of non-conflicting columns
fuse.keep(['dfa_col_1', 'dfa_col_2', 'dfa_col_3'], ['dfb_col_1', 'dfb_col_2', 'dfb_col_3'])
# Queue conflict resolution jobs
fuse.no_gossiping('col1', 'col2', name='no_gossip')
fuse.roll_the_dice('col1', 'col2', name='random')
fuse.trust_your_friends('col1', 'col2', trusted='b', name='trust_b')
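The three method names queued above match strategy names commonly used in the data fusion literature. Assuming they carry the usual meanings there (an assumption, since the preview does not show their implementations), the per-value behavior of each could be sketched as follows. These standalone functions and their signatures are illustrative only and are not the `FuseLinks` methods themselves.

```python
import random


def no_gossiping(a, b):
    """Report a value only when both sources agree; otherwise leave it missing."""
    return a if a == b else None


def roll_the_dice(a, b, rng=random):
    """Pick one of the two candidate values at random."""
    return rng.choice((a, b))


def trust_your_friends(a, b, trusted="a"):
    """Always take the value from the trusted source."""
    return a if trusted == "a" else b


pair = ("Smith", "Smyth")
agreed = no_gossiping(*pair)                          # sources disagree, so missing
picked = roll_the_dice(*pair, rng=random.Random(42))  # one of the two values
from_b = trust_your_friends(*pair, trusted="b")       # value from source B
```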
@joelbecker
joelbecker / rl_fuse_pseudocode.py
Created July 19, 2017 20:55
A prototype data fusion workflow for the Python recordlinkage toolkit.
import pandas as pd
import recordlinkage as rl
####################
# Pseudo Source Code
####################
class FuseCore(object):
@joelbecker
joelbecker / rl_indexing_grid_graphics.py
Created July 17, 2017 19:43
This is a supplementary script for NetLab's tutorial on candidate link indexing with the recordlinkage data integration library.
"""
Adapted from https://stackoverflow.com/questions/19586828/drawing-grid-pattern-in-matplotlib
"""
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
N = 10