@fgregg
fgregg / score_duplicates_logic.py
Last active August 29, 2015 13:57
Working out the logic we need for score duplicates.
import multiprocessing
import time
Q = multiprocessing.Queue()
R = multiprocessing.JoinableQueue()
def work(jobs, results):
    while True:
        task = jobs.get()
        if task is None:
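The preview above cuts off at the sentinel check, so the rest of the worker is not shown. A minimal sketch of how such a loop is commonly finished, assuming that None is a shutdown sentinel and that each result goes onto the results queue. The squaring step is a hypothetical placeholder, not the scoring logic from the gist.

```python
import multiprocessing


def work(jobs, results):
    """Consume tasks from `jobs` until a None sentinel arrives."""
    while True:
        task = jobs.get()
        if task is None:
            # Assumption: None means "no more work", so the worker exits.
            break
        # Hypothetical placeholder computation standing in for real scoring.
        results.put(task * task)


if __name__ == "__main__":
    jobs = multiprocessing.Queue()
    results = multiprocessing.JoinableQueue()

    worker = multiprocessing.Process(target=work, args=(jobs, results))
    worker.start()

    n_tasks = 3
    for task in range(n_tasks):
        jobs.put(task)
    jobs.put(None)  # one sentinel per worker

    # Drain the results before joining, so the worker is never left
    # blocked on a full pipe.
    for _ in range(n_tasks):
        print(results.get())
        results.task_done()

    worker.join()
```

With several workers, one sentinel per worker would be needed, since each worker consumes exactly one None before exiting.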
--------------------------------------------------------------------------------
Command: python2.6 mysql_example.py
Massif arguments: --massif-out-file=out.txt --depth=1
ms_print arguments: out.txt
--------------------------------------------------------------------------------
[Massif heap-usage graph, truncated; y-axis in GB, top of axis at 3.690]
@fgregg
fgregg / entity_map.csv
Created June 9, 2014 18:33
Entity Map proposal
OCD-PersonID,ISBE ID Type,ISBE ID,Name,Address
ocd-person/d7c7ec50-ae0d-11e3-bb35-1231380db829,Committee Officer,1,Robert V Shuff,"Rr 1 Auburn, IL 62615"
ocd-person/d7c7ec50-ae0d-11e3-bb35-1231380db829,Committee Officer,2,Robert V Shuff,"Rr 1 Auburn, IL 62615"
Company name Duplicate cluster
龙铁广告有限责任公司
龙铁纵横北京)轨道交通设备有限公司
龙铃灵石文化麦查柯
龙锦广告有限公司 1
龙锦广告有限公司 1
龙锦枫广告有限公司 2
龙锦枫广告有限公司 2
龙锦综合开发(成都)有限公司
龙锦设计 3
@fgregg
fgregg / data
Last active August 29, 2015 14:04
hcluster.cophenet bug
How a Streaming Gazetteer would work
# Initializing Object
gazette = Gazetteer(model)
gazette.readTraining(saved_training)
gazette.train()
# Loading in initial canonical data
import dedupe
records = dict([(i, {'name': 'Margret',
                     'age': '32'})
                for i in xrange(10**4)])
deduper = dedupe.Dedupe([{'field' : "name", 'type' : 'String'}], ())
deduper.sample(records, 100000)
--------------------------------------------------------------------------------
Command: python mysql_example.py
Massif arguments: --massif-out-file=out.txt --depth=1
ms_print arguments: out.txt
--------------------------------------------------------------------------------
[Massif heap-usage graph, truncated; y-axis in MB, top of axis at 686.3]
@fgregg
fgregg / fdistribution.R
Created February 9, 2015 22:14
Sampling distribution of F statistic
pums <- read.csv("small_pums.csv")
# We want to test the hypothesis that the standard deviation of earnings
# is the same for men and women in Illinois
#
# For this hypothesis, we will use the F statistic.
male.earnings <- pums$WAGP[pums$SEX=="male"]
n.male.earnings <- length(na.omit(male.earnings))
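The R preview stops before the statistic itself is computed. For comparing two spreads, the F statistic is usually taken as the ratio of the two sample variances, with (n1 - 1, n2 - 1) degrees of freedom. A minimal Python sketch of that calculation follows. The function name and the toy numbers are illustrative assumptions and are not taken from small_pums.csv.

```python
import statistics


def f_statistic(sample_a, sample_b):
    """Return the ratio of sample variances and its degrees of freedom.

    Uses the unbiased sample variance (n - 1 in the denominator).
    """
    ratio = statistics.variance(sample_a) / statistics.variance(sample_b)
    dof = (len(sample_a) - 1, len(sample_b) - 1)
    return ratio, dof


# Toy values for illustration only; these are not survey data.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 4.0, 6.0]

ratio, dof = f_statistic(group_a, group_b)
print(ratio, dof)
```

Under the null hypothesis of equal variances, and assuming roughly normal data, this ratio follows an F distribution with those degrees of freedom. A ratio far from 1 is evidence against equal spread.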