
Simon Willison simonw

@simonw
simonw / fetch_metadata_for_doc_ids.py
Created Apr 3, 2019
Fetch metadata from Google Drive API for a list of doc_ids (because their batch API is extremely difficult to figure out)
View fetch_metadata_for_doc_ids.py
def fetch_metadata_for_doc_ids(doc_ids, oauth_token):
    boundary = 'batch_boundary'
    headers = {
        'Authorization': 'Bearer {}'.format(oauth_token),
        'Content-Type': 'multipart/mixed; boundary=%s' % boundary,
    }
    body = ''
    for doc_id in doc_ids:
        req = 'GET https://www.googleapis.com/drive/v3/files/{}?fields=*'.format(doc_id)
        body += '--%s\n' % boundary
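
The preview above is cut off before the body is finished. As a rough sketch only (not the rest of the gist), this is one way a multipart/mixed batch body could be assembled from a list of doc IDs. The helper name build_batch_body and the exact per-part layout (a Content-Type: application/http header, a blank line, then the request line) are assumptions made for illustration.

```python
def build_batch_body(doc_ids, boundary="batch_boundary"):
    """Build a multipart/mixed body with one GET sub-request per doc_id.

    Hypothetical helper: the per-part layout is an assumption,
    not the remainder of the original function.
    """
    parts = []
    for doc_id in doc_ids:
        request_line = (
            "GET https://www.googleapis.com/drive/v3/files/{}?fields=*".format(doc_id)
        )
        # Each part starts with the boundary delimiter, then the part
        # headers, a blank line, and the embedded HTTP request line.
        parts.append(
            "--{}\n"
            "Content-Type: application/http\n"
            "\n"
            "{}\n"
            "\n".format(boundary, request_line)
        )
    # The closing delimiter is the boundary followed by a trailing "--".
    return "".join(parts) + "--{}--\n".format(boundary)
```

The function only builds a string, so the body can be inspected or unit tested without an OAuth token or any network access.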
View readable_diff.py
import csv
from dictdiffer import diff


def load_trees(filepath):
    fp = csv.reader(open(filepath))
    headings = next(fp)
    rows = [dict(zip(headings, line)) for line in fp]
    return {r["TreeID"]: r for r in rows}
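
The preview stops before the comparison itself. As a minimal sketch, two dictionaries shaped like the return value of load_trees (keyed by TreeID) could be compared with the standard library alone. This deliberately does not use dictdiffer, and the function names changed_cells and added_and_removed are hypothetical.

```python
def changed_cells(old, new):
    """Yield (tree_id, column, old_value, new_value) for each changed cell.

    Only rows present in both snapshots are compared here.
    """
    for tree_id in sorted(old.keys() & new.keys()):
        before, after = old[tree_id], new[tree_id]
        for column in before:
            if before[column] != after.get(column):
                yield tree_id, column, before[column], after.get(column)


def added_and_removed(old, new):
    """Return (added_ids, removed_ids) as sorted lists of TreeID values."""
    return sorted(new.keys() - old.keys()), sorted(old.keys() - new.keys())
```

A caller would pass two results of load_trees, one per snapshot of the CSV, and print whatever the generator yields.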
simonw / README.md
Last active Mar 10, 2019
How I created dams.now.sh
View README.md

How I created dams.now.sh

Try it out at https://dams.now.sh/ - see this Twitter thread for background.

I started by grabbing the URLs to every downloadable Excel spreadsheet.

I navigated to the "Downloads (Public)" link starting from https://nid-test.sec.usace.army.mil/ - then I ran this JavaScript in my browser's console to extract all of the URLs as a JSON blob.

console.log(JSON.stringify(
    Array.from(
View build.sh
csvs-to-sqlite https://candidates.democracyclub.org.uk/media/candidates-all.csv \
    --table=candidates \
    -c election \
    -f name \
    -f party_name \
    -f post_label \
    democracyclub.db
datasette publish heroku democracyclub.db \
    --name="democracyclub-datasette" \
simonw / sessions.json
Created Jan 21, 2019
SRCCON sessions from 2018 (just in case they get overwritten for 2019) - from https://schedule.srccon.org/sessions.json
View sessions.json
[
{
"day": "Thursday",
"description": "Get your badges and get some food (plus plenty of coffee), as you gear up for the first day of SRCCON!",
"everyone": "y",
"facilitators": "",
"facilitators_twitter": "",
"id": "thursday-breakfast",
"length": "",
"notepad": "",
simonw / toss-up-one-liner.md
Last active Jan 21, 2019
toss-up.now.sh one-liner
View toss-up-one-liner.md

Bash one-liner I used to create toss-up.now.sh

git clone https://github.com/dwillis/toss-up \
    && csvs-to-sqlite toss-up/data/*.csv toss-up.db \
    && datasette publish now toss-up.db \
        --source_url=https://github.com/dwillis/toss-up \
        --install=datasette-vega \
        --install=datasette-cluster-map \
        --alias=toss-up.now.sh
simonw / demo_bm25_bug.py
Last active Jan 6, 2019
Demonstrating a bug in Peewee's bm25 function - see https://github.com/coleifer/peewee/issues/1826
View demo_bm25_bug.py
import math
import struct
import sqlite3
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE VIRTUAL TABLE docs USING fts4(c0, c1);
INSERT INTO docs (c0, c1) VALUES ("this is about a dog", "more about that dog dog");
INSERT INTO docs (c0, c1) VALUES ("this is about a cat", "stuff on that cat cat");
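
The preview ends before any ranking code appears. Ranking functions for FTS4 are usually built on the blob returned by matchinfo(), which SQLite documents as a sequence of 32-bit unsigned integers in machine byte order. A minimal sketch of decoding such a blob follows. The helper name parse_matchinfo is hypothetical and this is not the code from the gist or from Peewee.

```python
import struct


def parse_matchinfo(blob):
    """Decode a matchinfo() blob into a list of unsigned 32-bit integers."""
    count = len(blob) // 4
    # "=" selects native byte order with standard sizes, so "I" is 4 bytes.
    return list(struct.unpack("=%dI" % count, blob[: count * 4]))
```

Working from a plain list of integers makes it easier to check by hand which offsets a scoring function reads for each phrase and column.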
simonw / gargoyle-selective-exclude.md
Created Nov 21, 2018
How gargoyle selective exclude rules work
View gargoyle-selective-exclude.md
simonw / Dockerfile
Last active Mar 30, 2019
The Dockerfile used by the new Datasette Publish to generate images that are smaller than 100MB
View Dockerfile
FROM python:3.6-slim-stretch as csvbuilder
# This one uses csvs-to-sqlite to compile the DB, and then uses datasette
# inspect to generate inspect-data.json. Compiling pandas takes way too long
# under alpine so we use slim-stretch for this one instead.
RUN apt-get update && apt-get install -y python3-dev gcc
COPY *.csv csvs/
RUN pip install csvs-to-sqlite datasette
RUN csvs-to-sqlite csvs/names.csv data.db -f "name" -c "legislature" -c "country"
simonw / import_github.md
Created Oct 31, 2018
How to import a GitHub repository as a subdirectory of a new repository while maintaining commits and datestamps
View import_github.md

How to import a GitHub repository as a subdirectory of a new repository while maintaining commits and datestamps

There is probably a better way to do this, but this worked for me.

I had a repository called docsearch that I had been building a prototype in.

I wanted to move the contents of that repository into an existing repository called search_experiments - but I wanted the contents to live in a docsearch/ subdirectory rather than living in the root of the repo.

I solved this using the combination of git format-patch and git apply.
