@simonw
Last active March 5, 2020 18:23
NICAR 2020 Datasette Tipsheet: https://bit.ly/nicar-datasette

Datasette: An ecosystem of tools for exploring data and collaborating on data projects

NICAR 2020 workshop, 5th March 2020

Some introductory information

Four of my projects:

An example from elsewhere:

Exercises

  • Use the FARA Datasette instance to find documents that mention Paul Manafort. You may have to run searches across all four of the tables.
  • Use the Russian Internet Research Agency Facebook Ads Datasette to find adverts that were targeted at people_who_match:behaviors:African American (US). You can start by running a text search, then click on targeting links to get more accurate listings.

Getting started with Datasette Cloud

Datasette Cloud is a brand new hosted Datasette platform I have been building over the past two months. You are the first group to try it!

Sign in to the site at https://www.datasettecloud.com/login/ (I strongly encourage using the "Sign in with Google" button so you don't need to create an account with a new password) and fill out the "Request a team" form to create a team instance.

New teams usually require an approval step, but during the workshop they will be approved instantly.

Exercise: Import New Orleans Occupational Business Licenses

Navigate to your team page (https://your-team.datasettecloud.com/) and click "Import data" in the navigation bar.

You can upload a CSV file using the form here, or you can paste in the URL to a CSV on the internet. I suggest starting off with this one, which lists all of the Occupational Business Licenses issued in New Orleans.

https://data.nola.gov/api/views/hjcd-grvu/rows.csv?accessType=DOWNLOAD

It may take a few seconds for the table to appear in your list of tables, so hit refresh until you see it.

If you visit the new page, it will show you a map and allow you to browse the data.

Click "Configure search for table" and select "BusinessName", "OwnerName" and "BusinessType".

Now you can search for "taxi" to see all of the taxi licenses.
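
Behind the scenes, Datasette exposes full-text search as a _search query-string parameter, so the same search corresponds to a URL along these lines - the database and table names here are just placeholders for whatever your import created:

https://your-team.datasettecloud.com/your-database/your-table?_search=taxi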

Getting started with Glitch

Datasette Cloud is designed for privately analyzing data and collaborating with a small group of teammates.

If you want to publish data to the internet, the easiest way to get started is Glitch, which gives you a production web environment that you can launch and manage entirely through your browser.

Exercise: Build and publish the NICAR schedule

Let's build a website showing the NICAR schedule for this year's conference.

  1. Download nicar-2020-schedule.csv from https://github.com/ireapps/nicar-2020-schedule - direct link to CSV.
  2. Visit https://glitch.com/~datasette-csvs and click the "Remix Your Own" button. This will give you a copy of the Datasette demo project.
  3. Upload the CSV file to Glitch by dragging it into the left-hand pane. If it uploads to the assets directory, you have run into a Glitch bug: rename the file to nicar-2020-schedule.csv.txt, upload it to Glitch again, and then rename it back to nicar-2020-schedule.csv in the Glitch interface.

This should be enough for the "Show -> In a New Window" menu option to show you a Datasette instance with the NICAR schedule.

Now, edit metadata.json in the Glitch editor and replace it with the following contents:

{
  "title": "NICAR 2020",
  "source": "ireapps/nicar-2020-schedule",
  "source_url": "https://github.com/ireapps/nicar-2020-schedule",
  "license": "MIT",
  "license_url": "https://github.com/ireapps/nicar-2020-schedule/blob/master/LICENSE"
}

This will customize the title and metadata displayed on the page.
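
metadata.json can also carry per-database and per-table settings. Here is a rough sketch of adding a table description - it assumes the project's database is named data (install.sh builds .data/data.db) and that your table ended up named nicar-2020-schedule; adjust those keys if your names differ:

{
  "title": "NICAR 2020",
  "source": "ireapps/nicar-2020-schedule",
  "source_url": "https://github.com/ireapps/nicar-2020-schedule",
  "license": "MIT",
  "license_url": "https://github.com/ireapps/nicar-2020-schedule/blob/master/LICENSE",
  "databases": {
    "data": {
      "tables": {
        "nicar-2020-schedule": {
          "description": "Session schedule for NICAR 2020"
        }
      }
    }
  }
}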

Finally, let's configure full-text search. Edit the install.sh script to look like this:

# Install requirements, then rebuild .data/data.db from every CSV file in the project
pip3 install -U -r requirements.txt --user && \
  mkdir -p .data && \
  rm -f .data/data.db && \
  for f in *.csv
    do
        sqlite-utils insert .data/data.db "${f%.*}" "$f" --csv
    done
# Enable full-text search on the columns we want to be searchable
sqlite-utils enable-fts .data/data.db nicar-2020-schedule session_title session_description speakers --fts4
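
The for loop imports every CSV in the project into .data/data.db, using ${f%.*} (the filename minus its extension) as the table name, which is why the final line enables full-text search on a table called nicar-2020-schedule. If your Glitch project offers a terminal (it was under the Tools menu at the time of writing), you can verify the import with a quick check like this:

# Assumes you are in the project root; lists each table with its row count
$ sqlite-utils tables .data/data.db --counts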

Your NICAR schedule site should now be searchable! You can click the "Create Account" button to create a Glitch account and store your site there - if you don't, the site will be automatically deleted in five days' time.

Getting started on your own laptop

On the workshop laptops

$ cd Desktop/hands_on_classes/datasette_an_ecosystem_of_tools_for_exploring_data_and_collaborating_on_data_projects_2160
$ pipenv shell
$ datasette --version
datasette, version 0.35
$ csvs-to-sqlite --version
csvs-to-sqlite, version 1.0

On your own laptop

First we need to install some software. Assuming you have a working Python 3 environment, I recommend using pipenv:

$ pip3 install pipenv
$ mkdir nicar-2020-datasette
$ cd nicar-2020-datasette
$ pipenv shell

pipenv shell creates and activates a new Python virtual environment for that directory. We can install software in that environment like so:

$ pip install datasette csvs-to-sqlite sqlite-transform sqlite-utils

To confirm that the correct software is installed, run these commands:

$ datasette --version
datasette, version 0.37.1
$ csvs-to-sqlite --version
csvs-to-sqlite, version 1.0

Exercise: Build a FARA database

The workshop laptops should have FARA_All_ForeignPrincipals.csv, FARA_All_RegistrantDocs.csv, FARA_All_Registrants.csv and FARA_All_ShortForms.csv.

Alternatively you can download these files as ZIPs from https://efile.fara.gov/ords/f?p=API:BULKDATA - then unzip them.

Turn those into a SQLite database like this:

$ csvs-to-sqlite *.csv fara.db

If one of the files fails to import due to errors in the CSV, you can try this instead:

$ rm fara.db
$ csvs-to-sqlite *.csv fara.db --skip-errors

Now you can browse it using Datasette like this:

$ datasette fara.db
Serve! files=('fara.db',) (immutables=()) on port 8001
INFO:     Started server process [23025]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)

Exercise: Build and clean a SQLite database of 2019 NOLA Police Calls For Service

Download the CSV from https://data.nola.gov/Public-Safety-and-Preparedness/Calls-for-Service-2019/qf6q-pp4b - hit "Export -> CSV"

Use csvs-to-sqlite to convert it to a SQLite database on your laptop:

$ csvs-to-sqlite Calls_for_Service_2019.csv nola-calls.db
Loaded 1 dataframes
Created nola-calls.db from 1 CSV file

Browse it locally with Datasette:

$ datasette nola-calls.db 
Serve! files=('nola-calls.db',) (immutables=()) on port 8001
INFO:     Started server process [23025]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)

Let's try faceting by InitialTypeText to see which are the most common call reasons:

http://127.0.0.1:8001/nola-calls/Calls_for_Service_2019?_facet=InitialTypeText

This will probably return an error: "These facets timed out: InitialTypeText"

We can fix this by running Datasette with some extra options:

$ datasette nola-calls.db \
    --config sql_time_limit_ms:10000 \
    --config facet_time_limit_ms:10000
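
Those two settings raise Datasette's SQL and facet time limits to 10 seconds, which gives the facet counts time to complete across roughly 460,000 rows. For reference, a facet is essentially a group-by count over each value in the column - here is a rough SQL equivalent you can run with the sqlite3 command-line tool:

$ sqlite3 nola-calls.db 'select InitialTypeText, count(*) as n
    from Calls_for_Service_2019
    group by InitialTypeText order by n desc limit 10'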

Exercise: Cleaning the data with sqlite-transform

There are a few issues with this table.

The dates are formatted like 04/30/2019 12:21:44 AM, which means sorting those columns gives alphabetical order rather than chronological order.

We can use sqlite-transform's parsedatetime command to convert them into a sortable ISO format:

$ sqlite-transform parsedatetime nola-calls.db Calls_for_Service_2019 \
    TimeCreate TimeDispatch TimeArrival TimeClosed
  4%|██▏                              | 66191/1840132 [00:05<02:18, 12821.41it/s]

Also, the latitude and longitude are bundled together in a Location column as (30.00592151, -90.12148943). Let's fix that.

First we will add empty latitude and longitude columns:

$ sqlite-utils add-column nola-calls.db Calls_for_Service_2019 latitude
$ sqlite-utils add-column nola-calls.db Calls_for_Service_2019 longitude

Next, copy the current value of the Location column into both of these new columns:

$ sqlite3 nola-calls.db 'update Calls_for_Service_2019 set latitude = Location, longitude = Location'

Now we can transform the latitude value using a Python one-liner:

$ sqlite-transform lambda nola-calls.db Calls_for_Service_2019 latitude \
    --code='value.split(",")[0].replace("(", "").strip()'
100%|███████████████████████████| 460033/460033 [00:03<00:00, 124646.78it/s]

And the longitude:

$ sqlite-transform lambda nola-calls.db Calls_for_Service_2019 longitude \
    --code='value.split(",")[1].replace(")", "").strip()'
100%|███████████████████████████| 460033/460033 [00:03<00:00, 123232.23it/s]
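
To spot-check both transforms, query a few rows directly with the sqlite3 command-line tool - TimeCreate should now be in sortable ISO format, and latitude and longitude should be plain numbers:

$ sqlite3 nola-calls.db 'select TimeCreate, latitude, longitude from Calls_for_Service_2019 limit 3'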

Now, if you install the datasette-cluster-map plugin (pip install datasette-cluster-map), you should see a map showing the location of all of the calls matching your current query.
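
Plugins load when Datasette starts, so install datasette-cluster-map into the same virtual environment and then restart the server with the same options as before. The plugin picks up columns named latitude and longitude, which is why we gave the new columns exactly those names.

$ pip install datasette-cluster-map
$ datasette nola-calls.db \
    --config sql_time_limit_ms:10000 \
    --config facet_time_limit_ms:10000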
