pwalsh/ckan_data_quality.md

CKAN Data Quality

Background

We are working on a set of "data quality" tools that check the quality of open data publications.
These tools are still very much a work in progress. They live here:

- Dashboard: https://github.com/okfn/data-quality-dashboard/tree/feature/refactor
  - Example: http://uk-25k.datadashboards.io
- CLI: https://github.com/okfn/data-quality-cli/tree/feature/refactor
- Dataset (UK 25k spend data): https://github.com/okfn/data-quality-uk-25k-spend/tree/feature/refactor

You do not need to learn all these tools. They just provide context for what we are doing, which is:

- Check sets of data that are published, and make a list of the data sources
- Assess the quality of the data published
- Show the results of our quality assessment in a dashboard app

Task

The dataset repository above contains a script (id_data.py) that builds two files: publishers.csv and sources.csv. It does this in a very particular way because of the type of data we are collecting there.
We want a different, generic script that does something similar: it builds publishers.csv and sources.csv from any CKAN instance.
The key differences from the above script are:

- Get all resources from a CKAN instance, not just those of a particular type (unlike the 25k spend data above).
- Write a generic script, class, or similar that takes only the URL of the CKAN instance and builds publishers.csv from the instance's Organizations and sources.csv from the instance's resources.
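As a rough illustration of the second point, the sketch below pages through a CKAN instance's Action API using only the Python standard library. It relies on the `organization_list` and `package_search` actions. The function names, the `page_size` default, and the injectable `fetch` hook (handy for testing without a network) are assumptions made for this example and are not taken from id_data.py.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def call_action(base_url, action, params=None, fetch=None):
    """Call one CKAN Action API endpoint and return its 'result' payload."""
    url = f"{base_url.rstrip('/')}/api/3/action/{action}"
    if params:
        url += "?" + urlencode(params)
    if fetch is None:
        # Default transport: a plain HTTP GET with the standard library.
        with urlopen(url) as response:
            payload = json.load(response)
    else:
        # Hypothetical hook: any callable that maps a URL to the decoded JSON.
        payload = fetch(url)
    if not payload.get("success"):
        raise RuntimeError(f"CKAN action {action!r} failed: {payload.get('error')}")
    return payload["result"]


def iter_organizations(base_url, page_size=25, fetch=None):
    """Yield every organization as a dict, one page at a time."""
    offset = 0
    while True:
        # An instance may cap the page size when all_fields is set,
        # so keep requesting pages until an empty one comes back.
        page = call_action(
            base_url,
            "organization_list",
            {"all_fields": "true", "limit": page_size, "offset": offset},
            fetch,
        )
        if not page:
            break
        yield from page
        offset += len(page)


def iter_packages(base_url, page_size=100, fetch=None):
    """Yield every dataset (package); each one carries its 'resources' list."""
    start = 0
    while True:
        result = call_action(
            base_url, "package_search", {"rows": page_size, "start": start}, fetch
        )
        packages = result["results"]
        if not packages:
            break
        yield from packages
        start += len(packages)
        if start >= result["count"]:
            break
```

Under these assumptions, `iter_organizations("https://data.qld.gov.au/")` would supply the input for publishers.csv, and the `resources` list inside each dataset from `iter_packages(...)` would supply the input for sources.csv.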

Spec


Given a CKAN instance, use its API to:

- Build a list of publishers following this schema (https://github.com/okfn/data-quality-uk-25k-spend/blob/feature/refactor/data/publishers.csv)
- Build a list of sources following this schema (https://github.com/okfn/data-quality-uk-25k-spend/blob/feature/refactor/data/sources.csv)
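One way to approach these two steps is to map each CKAN organization to a publishers.csv row and each resource to a sources.csv row. The sketch below is illustrative only: the column names are assumptions and should be checked against the linked schema files, and the helper names are hypothetical. Columns with no obvious CKAN equivalent are written as empty strings.

```python
import csv

# Assumed column names, for illustration only; verify against the linked CSV files.
PUBLISHER_FIELDS = ["id", "title", "homepage", "email"]
SOURCE_FIELDS = ["id", "publisher_id", "title", "data", "format", "last_modified"]


def publisher_row(org):
    """Map one CKAN organization dict to a publishers.csv row."""
    return {
        "id": org.get("name", ""),
        "title": org.get("title", ""),
        # A stock CKAN organization has no dedicated homepage or email field,
        # so these stay blank here and would be noted as gaps.
        "homepage": "",
        "email": "",
    }


def source_rows(package):
    """Yield one sources.csv row per resource in a CKAN dataset dict."""
    # 'organization' can be missing or None for datasets without an owner.
    org = package.get("organization") or {}
    for resource in package.get("resources", []):
        yield {
            "id": resource.get("id", ""),
            "publisher_id": org.get("name", ""),
            # Fall back to the dataset title when the resource is unnamed.
            "title": resource.get("name") or package.get("title", ""),
            "data": resource.get("url", ""),
            "format": (resource.get("format") or "").lower(),
            "last_modified": resource.get("last_modified")
            or resource.get("created")
            or "",
        }


def write_csv(rows, fieldnames, stream):
    """Write an iterable of row dicts to an open text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

Keeping the mapping functions separate from the writer means the column choices can change without touching the code that talks to the API.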


If some fields from the schemas above do not make sense in this generic implementation, that is fine; just note them.

Expected output

The output should be a repository on GitHub with:

- The script/module that implements the "CSV database"
- An example database built from the Queensland government CKAN instance: https://data.qld.gov.au/
- The standard Open Knowledge license for such apps (https://github.com/okfn/data-quality-dashboard/blob/master/LICENSE)
- A README that describes how to use the script/module


We should then be able to take this code and build the publishers.csv and sources.csv tables for any other CKAN instance.
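A small command-line wrapper could make that reuse concrete. This sketch only parses the arguments and derives the two output paths; the `--output-dir` flag, its `data` default, and the function names are assumptions, and the actual fetching and writing are left out.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the CKAN URL and an optional output directory."""
    parser = argparse.ArgumentParser(
        description="Build publishers.csv and sources.csv from a CKAN instance."
    )
    parser.add_argument("ckan_url", help="Base URL of the CKAN instance")
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("data"),
        help="Directory that receives the two CSV files (assumed default: data)",
    )
    return parser.parse_args(argv)


def output_paths(output_dir):
    """Return the (publishers.csv, sources.csv) paths inside output_dir."""
    output_dir = Path(output_dir)
    return output_dir / "publishers.csv", output_dir / "sources.csv"
```

For example, `parse_args(["https://data.qld.gov.au/", "--output-dir", "qld"])` would select the Queensland instance and place both files under `qld/`.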