Sohier Dane SohierDane

"""
Python equivalent of the Kuzushiji competition metric (https://www.kaggle.com/c/kuzushiji-recognition/).
Kaggle's backend uses a C# implementation of the same metric. This version is
provided for convenience only; in the event of any discrepancy, the C# implementation
is authoritative.
Tested on Python 3.6 with numpy 1.16.4 and pandas 0.24.2.
"""
SohierDane / prepare_data.py
Created October 17, 2017 23:36
Downloads, reformats, and cleans the US Census Bureau's Business & Industry reports.
"""
Download all files from
https://www.census.gov/econ/currentdata/datasets/index
and repackage into a data.csv, metadata.csv, and notes.txt
"""
import argparse
import os
import pandas as pd
import re
SohierDane / prepare_data.py
Last active October 11, 2017 20:03
NBER macrohistory database preparation
"""
Download all files from
http://www.nber.org/databases/macrohistory/contents/
and repackage into a data csv and a documentation csv
"""
import argparse
import os
import pandas as pd
import requests
"""
Pull movie metadata from the https://www.themoviedb.org API.
Requires an API key stored in a .config file.
The code is currently restricted to the movie category. To run it with
other categories, update the constants
(CATEGORY_SPECIFIC_CALLS, JSON_COLUMNS, KEYS_TO_DROP)
and delete the movie-specific section of the export_data() function.
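The preview above mentions an API key kept in a .config file but does not show how it is read. As a rough sketch only: the INI layout, the `[tmdb]` section name, the `api_key` option name, and both helper functions below are assumptions made for illustration, not the gist's actual code. The example builds a request URL for TMDB's v3 movie-details endpoint without making any network call.

```python
import configparser
from urllib.parse import urlencode

# Hypothetical contents of the .config file; the real file's format is not shown in the gist.
SAMPLE_CONFIG = """
[tmdb]
api_key = YOUR_API_KEY_HERE
"""

# Root of the TMDB v3 REST API.
API_ROOT = "https://api.themoviedb.org/3"


def load_api_key(config_text: str) -> str:
    """Read the API key from INI-style text (section and option names are assumed)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser["tmdb"]["api_key"]


def build_movie_url(movie_id: int, api_key: str) -> str:
    """Build the movie-details URL for one movie ID, passing the key as a query parameter."""
    query = urlencode({"api_key": api_key})
    return f"{API_ROOT}/movie/{movie_id}?{query}"


api_key = load_api_key(SAMPLE_CONFIG)
url = build_movie_url(550, api_key)
print(url)
```

Keeping the key in a separate file, as the docstring describes, means the script itself can be shared without exposing credentials.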
SohierDane / parse_mortality_data.py
Created August 3, 2017 18:47
CDC Mortality Dataset Preparation 2005-2015
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
For each year, parse the PDF manual, then use that information to
unpack the fixed-width data file.
Source data files can be found here:
https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Mortality_Multiple
Passes basic tests for 2005-2015. Untested on earlier years.
"""
Calculates key performance indicators for the 1-click appliers.
Manual validation of 5 Craigslist (cl) "missing reply link" failures (is the link actually there?):
'https://columbus.craigslist.org/eng/5939246199.html', correct
'https://seattle.craigslist.org/tac/edu/5946859577.html', correct
'https://seattle.craigslist.org/see/egr/5955565817.html', correct
'https://minneapolis.craigslist.org/hnp/npo/5961895846.html', correct
#! /usr/bin/env python
from ats.one_click_applier import ATSApplierInput
from ats.taleo.applier_type_test import TaleoApplier
from collections import Counter
from postings.selenium_visitor import PostingSeleniumVisitor
from util.orator.job_scout import JobScout
from util.proxies import setup_phantom
from util.database import setup_cursor, execute_query
from util.aws import INTERNAL_BUCKET, S3Downloader