Datapolitan (datapolitan)

@datapolitan
datapolitan / idea.md
Last active March 7, 2023 03:17 — forked from chriswhong/idea.md
Idea for git-powered distributed dataset management

The Problem:

If you follow the open data scene, you'll often hear about how the "feedback loop" for making corrections, leaving comments, or asking questions about datasets is either fuzzy, disjointed, or nonexistent. If I know for a fact that something in a government dataset is wrong, how do I get that record fixed? Do I call 311? Will the operator even know what I am talking about if I say I want to make a correction to a single record in a public dataset? There's DAT. There's storing your data as a CSV in GitHub. These approaches work, but are very much developer-centric. (Pull requests and diffs are hard to wrap your head around if you spend your day analyzing data in Excel.)

@datapolitan
datapolitan / load_dmp_2_cartodb.sh
Last active March 7, 2023 03:21 — forked from emacgillavry/load_dmp_2_cartodb.sh
Use Postgres dump files to populate a CartoDB instance
#!/bin/bash
#------------------------------------------------------------
# We assume the dump files have been generated using pg_dump:
#
# pg_dump -a --column-inserts -x -O -t table_name database_name -f /tmp/dmp_file_name
#
#------------------------------------------------------------
# Provide details of your CartoDB account:
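
The gist preview cuts off right where the account details would go. As a hedged sketch only, and in Python rather than the gist's shell script, the same idea can be expressed as: replay the dump's INSERT statements through CartoDB's SQL API. The account name, API key, endpoint URL, and dump path below are all placeholder assumptions.

import requests  # third-party HTTP library

CARTODB_ACCOUNT = "your_account"  # placeholder: CartoDB account name
CARTODB_API_KEY = "your_api_key"  # placeholder: CartoDB API key
# Assumed legacy CartoDB SQL API endpoint
SQL_API_URL = "https://{}.cartodb.com/api/v2/sql".format(CARTODB_ACCOUNT)

# --column-inserts writes one self-contained INSERT per row, so a simple
# line-by-line pass over the dump file is enough here
with open("/tmp/dmp_file_name") as dump:  # path from the pg_dump example above
    for line in dump:
        statement = line.strip()
        if statement.startswith("INSERT"):
            response = requests.post(
                SQL_API_URL,
                data={"q": statement, "api_key": CARTODB_API_KEY},
            )
            response.raise_for_status()  # stop on the first failed statement
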
@datapolitan
datapolitan / rf_iris.py
Last active March 7, 2023 03:23 — forked from glamp/rf_iris.py
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Load the iris data into a DataFrame with one column per feature
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Randomly flag roughly 75% of the rows for training, the rest for testing
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75

# Label each row with its species name (pd.Categorical.from_codes replaces
# pd.Factor(), which has been deprecated)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()
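
The preview ends at df.head(). A plausible continuation in the spirit of the forked gist would train the forest on the flagged rows and check its predictions; the split, the n_jobs value, and the crosstab below are assumptions, not recovered from the truncated gist.

# Split on the is_train flag created above (roughly 75% train, 25% test)
train, test = df[df['is_train']], df[~df['is_train']]
features = iris.feature_names

# Fit the forest on the numeric feature columns, using the integer codes
# of the categorical species column as the target labels
clf = RandomForestClassifier(n_jobs=2)
clf.fit(train[features], train['species'].cat.codes)

# Map predicted codes back to species names and cross-tabulate them
# against the true labels of the held-out rows
preds = iris.target_names[clf.predict(test[features])]
print(pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['predicted']))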