Instantly share code, notes, and snippets.

Embed
What would you like to do?
How to turn a list of JSON objects into a Datasette

How to turn a list of JSON objects into a Datasette

This repository has a dataset of 184.879 crimes committed in Buenos Aires: https://github.com/ramadis/delitos-caba

Download the raw data like this:

wget 'https://raw.githubusercontent.com/ramadis/delitos-caba/master/delitos.json'

Now use Pandas to load that into a dataframe:

>>> import pandas as pd
>>> df = pd.read_json(open('delitos.json'))
>>> df
                   barrio  cantidad_vehiculos  cantidad_victimas     comuna  \
0                FLORESTA                   2                  0  Comuna 10   
1                 PALERMO                   1                  0  Comuna 14   
2            VILLA CRESPO                   3                  0  Comuna 15   
3       PARQUE AVELLANEDA                   1                  0   Comuna 9   
4        VILLA GRAL MITRE                   1                  0  Comuna 11   
5                FLORESTA                   2                  0  Comuna 10   
6        PARQUE CHACABUCO                   1                  0   Comuna 7   
7                RECOLETA                   4                  0   Comuna 2   
8                  FLORES                   2                  0   Comuna 7   
9           NUEVA POMPEYA                   2                  0   Comuna 4   

Pandas can save that file to a SQLite database like so:

>>> import sqlite3
>>> conn = sqlite3.connect('delitos.db')
>>> df.to_sql('delitos', conn)

Now you can deploy that SQLite file as a web-based API using datasette publish. You'll need to install Zeit Now and Datasette first.

Include source information with command-line options:

datasette publish now delitos.db --source_url="https://github.com/ramadis/delitos-caba" --source="ramadis/delitos-caba"
> Deploying /private/var/folders/jj/fngnv0810tn2lt_kd3911pdc0000gp/T/tmpcagpkk5f/datasette under simonw
> Ready! https://datasette-tkgtsfyjyj.now.sh (copied to clipboard) [6s]
> Synced 2 files (399B) [0ms] 
> ...

A few seconds later: https://datasette-tkgtsfyjyj.now.sh

Let's give it a nice URL:

now alias https://datasette-tkgtsfyjyj.now.sh delitos-caba.now.sh

Now you can visit it at https://delitos-caba.now.sh/

Here's an example SQL query showing the most common barrios: https://delitos-caba.now.sh/delitos-6395324?sql=select+%22barrio%22%2C+count%28%2A%29+as+%22count%22+from+delitos++group+by+%22barrio%22+order+by+%22count%22+desc+limit+100

@psychemedia

This comment has been minimized.

psychemedia commented Jan 22, 2018

You can also pd.read_json("https://raw.githubusercontent.com/ramadis/delitos-caba/master/delitos.json") so should be easy to do a simple python cli tool to take url, load json into pandas dataframe, then create the sqlite db as output.

Alternatively, use py to download the file rather than wget.

@psychemedia

This comment has been minimized.

psychemedia commented Jan 22, 2018

FWIW, it may also be worth limiting the amount of data ingested at any one time (pandas holds dataframes in memory), eg https://www.dataquest.io/blog/python-json-tutorial/

If the json file is just lines of JSON, pandas has aa chunksize param to allow you to load in chunks of lines: http://pandas.pydata.org/pandas-docs/stable/io.html#line-delimited-json

The pandas library can also load large CSV files in chunkwise: http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment