@pwalsh
Created March 27, 2016 09:48
CKAN Data Quality

Background

We are working on a set of "data quality" tools, to check the quality of open data publications.

These tools are currently very much WIP, here:

You do not need to learn all these tools. They just provide context for what we are doing, which is:

  • Check sets of data that are published, and make a list of the data sources
  • Assess the quality of the data published
  • Show the results of our quality assessment in a dashboard app

Task

The dataset repository above contains a script (id_data.py) to build out two files: publishers.csv and sources.csv. It does this in a very particular way due to the type of data we are collecting there.

We want to create a different, generic script that does something similar: it builds publishers.csv and sources.csv from any CKAN instance.

So, the key differences from the above script are:

  • Get all resources from a CKAN instance, not just a particular type (so, unlike the 25k spend data above)
  • Write a generic script/class/whatever that takes just the URL of the CKAN instance, and builds publishers.csv from the instance's Organizations and sources.csv from the instance's resources.
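As a rough illustration of the shape such a generic script could take, here is a minimal sketch built on the CKAN Action API (`organization_list` and `package_search`). It targets Python 3 only, for brevity. The column names (`id`, `title`, `publisher_id`, `data`, `format`) are placeholders chosen for this example, not the actual schema, and the function names are hypothetical.

```python
import csv
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder column names: assumptions for this sketch, not the real schema.
PUBLISHER_FIELDS = ["id", "title"]
SOURCE_FIELDS = ["id", "publisher_id", "title", "data", "format"]


def call_action(base_url, action, **params):
    """Call a CKAN Action API endpoint and return its 'result' payload."""
    url = "{}/api/3/action/{}".format(base_url.rstrip("/"), action)
    if params:
        url += "?" + urlencode(params)
    with urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    if not payload.get("success"):
        raise RuntimeError("CKAN action failed: {}".format(action))
    return payload["result"]


def iter_packages(base_url, call=call_action, page_size=100):
    """Yield every dataset on the instance, paging through package_search."""
    start = 0
    while True:
        result = call(base_url, "package_search", rows=page_size, start=start)
        packages = result["results"]
        if not packages:
            break
        for package in packages:
            yield package
        start += len(packages)


def publisher_rows(organizations):
    """Map CKAN organization dicts to publishers.csv rows."""
    for org in organizations:
        yield {"id": org.get("name", ""), "title": org.get("title", "")}


def source_rows(packages):
    """Map every resource of every dataset to a sources.csv row."""
    for package in packages:
        org = package.get("organization") or {}  # datasets may have no organization
        for resource in package.get("resources", []):
            yield {
                "id": resource.get("id", ""),
                "publisher_id": org.get("name", ""),
                "title": resource.get("name") or package.get("title", ""),
                "data": resource.get("url", ""),
                "format": (resource.get("format") or "").lower(),
            }


def write_csv(path, fieldnames, rows):
    """Write an iterable of dict rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def build(base_url, call=call_action):
    """Build publishers.csv and sources.csv for one CKAN instance."""
    # Note: some CKAN versions cap how many organizations are returned when
    # all_fields is set, so a real script may need to page this call as well.
    organizations = call(base_url, "organization_list", all_fields="true")
    write_csv("publishers.csv", PUBLISHER_FIELDS, publisher_rows(organizations))
    write_csv("sources.csv", SOURCE_FIELDS,
              source_rows(iter_packages(base_url, call)))
```

Calling `build("http://data.gov.au")` would then write both files to the current directory. Keeping the row-mapping functions separate from the HTTP call makes it possible to exercise them with canned API responses, without network access.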

Spec

Some fields from the schema above may not make sense in this generic implementation. That is fine; just note them.

Expected output

The output should be:

We should then be able to take this code, and build the publishers.csv and sources.csv tables for any other CKAN instance.


pwalsh commented Mar 27, 2016

Answering questions

Q: Should the script be resilient, and do verbose error handling as per this example?

A: It is not a requirement, no. You have to balance resilience, code quality, and time constraints accordingly.

Q: A SOLR query example from the example repo - is this important, etc?

A: This is part of the very specific use case in that repo (filtering out data that does not match certain criteria), and should not at all be relevant here. Here, we just want to get all resources, so there is no need for special queries of this type.

Q: Should development be in Python 3?

A: All new code at Open Knowledge is compatible across Python 2 and 3. You can read about our coding standards, and see an example implementation of those standards. So, while we do have a clear preference for 2/3 code, I also do understand that this can be a little tricky in a short project like this, if you have not had this experience before. So, you will not be "penalised" for targeting a single Python runtime, 2 or 3, as long as this is stated clearly in the README.

Q: Is the mentioned "example" database just the two CSV files?

A: Yes, that is correct.


pwalsh commented Mar 27, 2016

Q: What if the test instance suggested does not have extras (or, other...) data which is required for the schema?

A: You can't handle data that does not exist. However, since the script needs to be generic, it would be a good idea to try it against 4-5 CKAN instances. Some examples, in addition to the QLD instance, could be: http://data.gov.au ; http://data.gov.uk ; http://www.data.gov ; http://opendata.aragon.es ; http://daten.berlin.de
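One possible way to act on both points, sketched in Python 3 under some assumptions: `get_extra` returns a default when a dataset has no matching entry in its `extras` list (a list of key/value dicts in CKAN dataset metadata), and `run_against` tries a build function on several instances and records failures instead of stopping at the first one. Both function names are hypothetical, and `build` stands for whatever callable produces the two CSV files for a single instance URL.

```python
# Instances suggested above; add the QLD instance URL as needed.
INSTANCES = [
    "http://data.gov.au",
    "http://data.gov.uk",
    "http://www.data.gov",
    "http://opendata.aragon.es",
    "http://daten.berlin.de",
]


def get_extra(package, key, default=""):
    """Return the value of a dataset 'extras' entry, or default if absent."""
    for extra in package.get("extras") or []:
        if extra.get("key") == key:
            return extra.get("value", default)
    return default


def run_against(instances, build):
    """Call build(url) for each instance and record the outcome per URL."""
    outcomes = {}
    for url in instances:
        try:
            build(url)
            outcomes[url] = "ok"
        except Exception as error:  # keep going so one bad instance does not hide the rest
            outcomes[url] = "failed: {}".format(error)
    return outcomes
```

Fields that come back empty on every instance tried are good candidates for the "just note them" list in the README.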
