@cjdd3b
cjdd3b / fingerprint.py
Created February 22, 2015 14:17
Python implementation of the Google Refine fingerprinting algorithms described here: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
# -*- coding: utf-8 -*-
import re, string
from unidecode import unidecode
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
class Fingerprinter(object):
'''
Python implementation of Google Refine fingerprinting algorithm described here:
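The preview above stops before the method body, so here is a minimal sketch of the key-fingerprint steps as the linked OpenRefine wiki page describes them: trim, lowercase, strip punctuation, fold accented characters to ASCII, then sort and de-duplicate the whitespace-separated tokens. The function name `fingerprint` is hypothetical, and the standard library's `unicodedata` stands in for `unidecode` so the block has no third-party dependency; it is not the gist's actual code.

```python
import re
import string
import unicodedata

# Same pattern as in the gist preview: matches any single ASCII punctuation character.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


def fingerprint(value):
    """Return a normalized key so that near-identical strings collide."""
    # Trim surrounding whitespace and lowercase.
    s = value.strip().lower()
    # Drop ASCII punctuation.
    s = PUNCTUATION.sub('', s)
    # Assumption: approximate unidecode by decomposing accents and discarding
    # anything that is not plain ASCII.
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
    # Split on whitespace, remove duplicate tokens, sort, and rejoin.
    return ' '.join(sorted(set(s.split())))
```

Strings that differ only in word order, case, punctuation, or accents end up with the same key, which is what lets them be clustered together.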
cjdd3b / csvjoin.py
Created April 2, 2015 21:54
CSV-flattening code for Harsh's research
import csv, os
# This chunk iterates through all of the csv files in a directory, turns them
# into 2-dimensional arrays (lists of lists), and puts all those arrays into
# a list called "tables"
tables = []
# Loop over all files in the current directory (which is what "." means)
for f in os.listdir('.'):
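The loop body is cut off in the preview. As a rough sketch of what the comments describe (every CSV file in a directory read into a list of lists, all collected in `tables`), something like the following would work. The function name `load_tables`, the `.csv` extension filter, and the sorted file order are assumptions, not taken from the gist.

```python
import csv
import os


def load_tables(directory='.'):
    """Read every .csv file in `directory` into a 2-dimensional list."""
    tables = []
    # Sort the names so the result order is predictable (an assumption).
    for name in sorted(os.listdir(directory)):
        # Skip anything that is not a CSV file.
        if not name.lower().endswith('.csv'):
            continue
        path = os.path.join(directory, name)
        # newline='' is what the csv module recommends for reading files.
        with open(path, newline='') as f:
            tables.append(list(csv.reader(f)))
    return tables
```

Calling `load_tables('.')` returns one list of rows per file, with each row a list of strings.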
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 1419,
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 1,
cjdd3b / cluster.py
Last active July 27, 2023 08:16
Example of perceptual hashing for near-duplicate image detection
'''
cluster.py
Uses the Hamming distance between perceptual hashes to surface near-duplicate
images.
To install and run:
1. pip install imagehash
2. Put some .dat files in a folder someplace (script assumes ./data/imgs/*.dat)
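The docstring is truncated here, but the core idea it names, comparing perceptual hashes by Hamming distance, can be sketched without any image library. The example below assumes each image's hash is already available as an equal-length hexadecimal string (for instance, the string form of a hash produced by imagehash). The function names and the threshold of 5 bits are hypothetical choices for illustration.

```python
from itertools import combinations


def hamming_distance(hash_a, hash_b):
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError('hashes must have the same length')
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count('1')


def near_duplicates(hashes, max_distance=5):
    """Return name pairs whose hashes differ by at most `max_distance` bits.

    `hashes` maps a file name to its hex hash string.
    """
    pairs = []
    for (name_a, h_a), (name_b, h_b) in combinations(sorted(hashes.items()), 2):
        if hamming_distance(h_a, h_b) <= max_distance:
            pairs.append((name_a, name_b))
    return pairs
```

A small distance means the two images look alike even if their bytes differ; the right threshold depends on the hash size and the data, so treat 5 as a starting point only.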
cjdd3b / scraping_solution.py
Created April 13, 2016 15:50
Solution to scraping assignment
import csv, mechanize
from bs4 import BeautifulSoup
# Get the output file ready
# datafile = open('output.csv', 'w')
# writer = csv.writer(datafile)
br = mechanize.Browser()
br.open('http://enr.sos.mo.gov/EnrNet/CountyResults.aspx')
cjdd3b / virtualenv.txt
Last active April 26, 2016 17:00
Virtual environment configuration instrux
sudo pip install virtualenvwrapper
export WORKON_HOME=~/Envs
mkdir -p $WORKON_HOME
source /usr/local/bin/virtualenvwrapper.sh
echo 'export WORKON_HOME=$HOME/Envs; source /usr/local/bin/virtualenvwrapper.sh' >> ~/.bash_profile
mkvirtualenv dataj
pip install jupyter
pip install agate
pip install WHATEVER_ELSE
cjdd3b / s3count.md
Last active June 18, 2020 18:31
How to count files in an S3 bucket

Counting files in S3 buckets and folders is harder than it should be. But here's a way to get it done using s3cmd:

  1. Install s3cmd
     • On Mac, brew install s3cmd
     • On Windows, go here
  2. From the command line, run s3cmd --configure
  3. Add your credentials when prompted.
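The preview ends before the counting step itself. One plausible way to finish, sketched here in Python, is to run a recursive `s3cmd ls` and count the lines that describe objects. This assumes s3cmd is installed and configured as in the steps above, and that directory entries in its output are marked with `DIR`. Both function names are hypothetical, and this is not necessarily the approach the original note goes on to describe.

```python
import subprocess


def count_listing_lines(listing):
    """Count object lines in `s3cmd ls` output, skipping blanks and DIR rows."""
    count = 0
    for line in listing.splitlines():
        parts = line.split()
        # Assumption: folder rows start with the token "DIR"; object rows
        # start with a date.
        if not parts or parts[0] == 'DIR':
            continue
        count += 1
    return count


def count_s3_objects(s3_uri):
    """Run `s3cmd ls --recursive` on `s3_uri` and count the objects listed."""
    result = subprocess.run(
        ['s3cmd', 'ls', '--recursive', s3_uri],
        check=True, capture_output=True, text=True,
    )
    return count_listing_lines(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to check the counting logic against saved output without touching a real bucket.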

cjdd3b / data-journalism-software.md
Last active August 31, 2016 11:52
Software installation guide for Mizzou's Advanced Data Journalism course, Fall 2016.

Advanced Data Journalism (J4432) software requirements

Below is a list of the key software you'll need for class, along with some resources offering tips about how to get it installed.

Text editor

A good programming text editor will help you organize your code, catch typos, and generally make your life a lot easier. We recommend Sublime Text 2, which you can easily download and install from its website.

Terminal client

cjdd3b / strib-suicides.txt
Created December 11, 2017 22:27
Data from the first chart on this interactive about Minnesota suicides: http://www.startribune.com/suicide-rate-in-minnesota-has-been-rising/440778623/
Year Suicides
1981-01-01 442
1982-01-01 470
1983-01-01 444
1984-01-01 443
1985-01-01 459
1986-01-01 541
1987-01-01 546
1988-01-01 488
1989-01-01 515