Forest Gregg
fgregg@datamade.us
DataMade
http://datamade.us
Almost every website you visit is a view of some data that has been organized into tables. Web pages are fancy views of spreadsheets.
* [Tiers Fusion Table](https://www.google.com/fusiontables/data?docid=11PNEL-A6MFtYLLGvgtHqK7K1Pm4viKiK9IHY0tYf#rows:id=1)

import multiprocessing
import time

Q = multiprocessing.Queue()
R = multiprocessing.JoinableQueue()

def work(jobs, results):
    while True:
        task = jobs.get()
        if task is None:

--------------------------------------------------------------------------------
Command: python2.6 mysql_example.py
Massif arguments: --massif-out-file=out.txt --depth=1
ms_print arguments: out.txt
--------------------------------------------------------------------------------
    GB
3.690^ #
     | #

OCD-PersonID,ISBE ID Type,ISBE ID,Name,Address
ocd-person/d7c7ec50-ae0d-11e3-bb35-1231380db829,Committee Officer,1,Robert V Shuff,"Rr 1 Auburn, IL 62615"
ocd-person/d7c7ec50-ae0d-11e3-bb35-1231380db829,Committee Officer,2,Robert V Shuff,"Rr 1 Auburn, IL 62615"

Company name | Duplicate cluster
---|---
龙铁广告有限责任公司 |
龙铁纵横北京)轨道交通设备有限公司 |
龙铃灵石文化麦查柯 |
龙锦广告有限公司 | 1
龙锦广告有限公司 | 1
龙锦枫广告有限公司 | 2
龙锦枫广告有限公司 | 2
龙锦综合开发(成都)有限公司 |
龙锦设计 | 3

How a Streaming Gazetteer would work

# Initializing Object
gazette = Gazetteer(model)
gazette.readTraining(saved_training)
gazette.train()

# Loading in initial canonical data

import dedupe

records = dict([(i, {'name': 'Margret',
                     'age': '32'})
                for i in xrange(10**4)])

deduper = dedupe.Dedupe([{'field': "name", 'type': 'String'}], ())
deduper.sample(records, 100000)

--------------------------------------------------------------------------------
Command: python mysql_example.py
Massif arguments: --massif-out-file=out.txt --depth=1
ms_print arguments: out.txt
--------------------------------------------------------------------------------
    MB
686.3^ #
     | @:: #::

pums <- read.csv("small_pums.csv")

# We want to test the hypothesis that the standard deviation of earnings
# is the same for men and women in Illinois.
#
# For this hypothesis, we will use the F statistic.
male.earnings <- pums$WAGP[pums$SEX=="male"]
n.male.earnings <- length(na.omit(male.earnings))