Forest Gregg fgregg

## score_duplicates_logic.py
import multiprocessing
import time

Q = multiprocessing.Queue()
R = multiprocessing.JoinableQueue()

def work(jobs, results):
    while True:
        task = jobs.get()
        if task is None:

## massif.txt
--------------------------------------------------------------------------------
Command:            python2.6 mysql_example.py
Massif arguments:   --massif-out-file=out.txt --depth=1
ms_print arguments: out.txt
--------------------------------------------------------------------------------


    GB
3.690^                                                                 #
     |                                                                 #

## entity_map.csv
OCD-PersonID,ISBE ID Type,ISBE ID,Name,Address
ocd-person/d7c7ec50-ae0d-11e3-bb35-1231380db829,Committee Officer,1,Robert V Shuff,"Rr 1 Auburn, IL 62615"
ocd-person/d7c7ec50-ae0d-11e3-bb35-1231380db829,Committee Officer,2,Robert V Shuff,"Rr 1 Auburn, IL 62615"


## Chinese.csv

          
Company name,Duplicate cluster
龙铁广告有限责任公司,
龙铁纵横北京)轨道交通设备有限公司,
龙铃灵石文化麦查柯,
龙锦广告有限公司,1
龙锦广告有限公司,1
龙锦枫广告有限公司,2
龙锦枫广告有限公司,2
龙锦综合开发（成都）有限公司,
龙锦设计,3

## data

      
fgregg / data
Last active August 29, 2015 14:04

hcluster.cophenet bug
        
    
## gist:23e2db0c4b5fbceb92ff
How a Streaming Gazetteer would work

# Initializing Object

gazette = Gazetteer(model)

gazette.readTraining(saved_training)
gazette.train()

# Loading in initial canonical data

## scraping_links.md

      
fgregg / scraping_links.md
Last active August 29, 2015 14:06

Forest Gregg
fgregg@datamade.us
DataMade
http://datamade.us

Almost every website you go to is a view of some data that has been organized into tables. Web pages are fancy views of spreadsheets.

* [Tiers Fusion Table](https://www.google.com/fusiontables/data?docid=11PNEL-A6MFtYLLGvgtHqK7K1Pm4viKiK9IHY0tYf#rows:id=1)
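
To make the "web pages are views of tables" idea concrete, here is a minimal sketch that recovers rows from an HTML table using only Python's standard library. The `TableExtractor` class, the `extract_rows` helper, and the sample markup are all hypothetical illustrations, not part of the original notes; a real page would usually need more defensive handling (nested tables, `colspan`, malformed markup).

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page fragment standing in for a real web page.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Tier</th></tr>
  <tr><td>Alpha</td><td>1</td></tr>
  <tr><td>Beta</td><td>2</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(",".join(row))
```

Each printed line is one spreadsheet-style row, which is the sense in which the page is just a view of tabular data.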


## gist:0796bb9accf57dce92a8
import dedupe

records = dict([(i, {'name': 'Margret',
                  'age': '32'})
                for i in xrange(10**4)])


deduper = dedupe.Dedupe([{'field' : "name", 'type' : 'String'}], ())

deduper.sample(records, 100000)

## massif.txt
--------------------------------------------------------------------------------
Command:            python mysql_example.py
Massif arguments:   --massif-out-file=out.txt --depth=1
ms_print arguments: out.txt
--------------------------------------------------------------------------------


    MB
686.3^                                                                    #
     |                                 @::                                #::

## fdistribution.R
pums <- read.csv("small_pums.csv")

# We want to test the hypothesis that the standard deviation of earnings
# is the same for men and women in Illinois
#
# For this hypothesis, we will use the F statistic.

male.earnings <- pums$WAGP[pums$SEX=="male"]
n.male.earnings <- length(na.omit(male.earnings))
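
The R preview stops before the statistic is computed. As a rough illustration of the test described in the comments, the sketch below computes the F statistic (the ratio of the two sample variances) and its degrees of freedom in Python. The `f_statistic` helper and the earnings figures are made-up assumptions for illustration, not values from `small_pums.csv`, and this is not the author's R code.

```python
from statistics import variance


def f_statistic(sample_a, sample_b):
    """Return (F, df_a, df_b), where F is var(sample_a) / var(sample_b).

    Missing values (None) are dropped first, mirroring na.omit in R.
    """
    a = [x for x in sample_a if x is not None]
    b = [x for x in sample_b if x is not None]
    f = variance(a) / variance(b)  # ratio of unbiased sample variances
    return f, len(a) - 1, len(b) - 1


# Hypothetical earnings, with one missing value in each group.
male_earnings = [52000, 48000, None, 61000, 39000]
female_earnings = [45000, 47000, 43000, None, 50000]

f, df_male, df_female = f_statistic(male_earnings, female_earnings)
print(f, df_male, df_female)
```

Under the null hypothesis of equal variances (and approximately normal data), this ratio follows an F distribution with `df_male` and `df_female` degrees of freedom, so a value far from 1 is evidence against the hypothesis.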