Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

@simonw
Last active November 11, 2023 17:42
Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save simonw/eb5ad8e55d75bbc3003dd9e5d6eb438b to your computer and use it in GitHub Desktop.
Save simonw/eb5ad8e55d75bbc3003dd9e5d6eb438b to your computer and use it in GitHub Desktop.
How to turn a list of JSON objects into a Datasette

How to turn a list of JSON objects into a Datasette

This repository has a dataset of 184.879 crimes committed in Buenos Aires: https://github.com/ramadis/delitos-caba

Download the raw data like this:

wget 'https://github.com/ramadis/delitos-caba/releases/download/3.0/delitos.json'

Now use Pandas to load that into a dataframe:

>>> import pandas as pd
>>> df = pd.read_json(open('delitos.json'))
>>> df
                   barrio  cantidad_vehiculos  cantidad_victimas     comuna  \
0                FLORESTA                   2                  0  Comuna 10   
1                 PALERMO                   1                  0  Comuna 14   
2            VILLA CRESPO                   3                  0  Comuna 15   
3       PARQUE AVELLANEDA                   1                  0   Comuna 9   
4        VILLA GRAL MITRE                   1                  0  Comuna 11   
5                FLORESTA                   2                  0  Comuna 10   
6        PARQUE CHACABUCO                   1                  0   Comuna 7   
7                RECOLETA                   4                  0   Comuna 2   
8                  FLORES                   2                  0   Comuna 7   
9           NUEVA POMPEYA                   2                  0   Comuna 4   

Pandas can save that file to a SQLite database like so:

>>> import sqlite3
>>> conn = sqlite3.connect('delitos.db')
>>> df.to_sql('delitos', conn)

Now you can deploy that SQLite file as a web-based API using datasette publish. You'll need to install Zeit Now and Datasette first.

Include source information with command-line options:

datasette publish now delitos.db --source_url="https://github.com/ramadis/delitos-caba" --source="ramadis/delitos-caba"
> Deploying /private/var/folders/jj/fngnv0810tn2lt_kd3911pdc0000gp/T/tmpcagpkk5f/datasette under simonw
> Ready! https://datasette-tkgtsfyjyj.now.sh (copied to clipboard) [6s]
> Synced 2 files (399B) [0ms] 
> ...

A few seconds later: https://datasette-tkgtsfyjyj.now.sh

Let's give it a nice URL:

now alias https://datasette-tkgtsfyjyj.now.sh delitos-caba.now.sh

Now you can visit it at https://delitos-caba.now.sh/

Here's an example SQL query showing the most common barrios: https://delitos-caba.now.sh/delitos-6395324?sql=select+%22barrio%22%2C+count%28%2A%29+as+%22count%22+from+delitos++group+by+%22barrio%22+order+by+%22count%22+desc+limit+100

@psychemedia
Copy link

FWIW, it may also be worth limiting the amount of data ingested at any one time (pandas holds dataframes in memory), eg https://www.dataquest.io/blog/python-json-tutorial/

If the json file is just lines of JSON, pandas has aa chunksize param to allow you to load in chunks of lines: http://pandas.pydata.org/pandas-docs/stable/io.html#line-delimited-json

The pandas library can also load large CSV files in chunkwise: http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk

@Slowly
Copy link

Slowly commented Feb 17, 2020

Yes, you can use pandas read_json - IF it is heeeasssyy and sssiiiiiimple. But if you've got a very deeply nested json datasource that isn't a file and has lists mixed in with dictionaries - it won't work. For example, a graphql query result from a Tableau metadata API call.

@simonw
Copy link
Author

simonw commented Sep 19, 2022

Since writing this my sqlite-utils CLI tool has grown the ability to do this as a one-liner:

curl -L 'https://github.com/ramadis/delitos-caba/releases/download/3.0/delitos.json' | \
 sqlite-utils insert delitos.db delitos - --pk id

You can see the schema like this:

sqlite-utils schema delitos.db
CREATE TABLE [delitos] (
   [id] INTEGER PRIMARY KEY,
   [comuna] TEXT,
   [barrio] TEXT,
   [latitud] FLOAT,
   [longitud] FLOAT,
   [fecha] TEXT,
   [hora] TEXT,
   [uso_arma] TEXT,
   [uso_moto] TEXT,
   [lugar] TEXT,
   [origen_dato] TEXT,
   [tipo_delito] TEXT,
   [cantidad_vehiculos] INTEGER,
   [cantidad_victimas] INTEGER
);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment