First problem: Have a common format for storing datasets
At Transparência Hackday Portugal, as with any other open data interest group, we work with many datasets. An issue that has been slowing us down for a long time is that we never had a centralized solution for storing datasets: some are in Google Docs, others in Git repositories, others live on web servers.
Before that, another issue was the data format: we found ourselves lost among CSV and JSON files, SQL database dumps, spreadsheets and plaintext files. Converting these was something we'd do on an ad hoc basis, and the challenge of finding (or devising) a common format usually ran into differing personal preferences and the difficulty of mass-converting heterogeneous data collections.
Solution: Tabular data packages
We stumbled almost accidentally into the Data Package standards page. It was a revelation to see how elegant a solution this was to our format problems: using the Tabular Data Package spec, we could go ahead and convert our datasets into CSV, along with their metadata -- which is fairly easy to generate and maintain using the existing tools for the job. From there, we can also develop scripts to re-fetch and update the datasets, as well as post-processing tools to generate other formats from the data package.
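To make this concrete: a Tabular Data Package is essentially one or more CSV files plus a datapackage.json descriptor that names the dataset and describes each column. A minimal example, with a hypothetical dataset name and fields, might look like this:

```json
{
  "name": "municipal-budgets",
  "title": "Municipal budgets (illustrative example)",
  "resources": [
    {
      "name": "budgets",
      "path": "data/budgets.csv",
      "schema": {
        "fields": [
          {"name": "municipality", "type": "string"},
          {"name": "year", "type": "integer"},
          {"name": "amount", "type": "number"}
        ]
      }
    }
  ]
}
```

The descriptor travels alongside the CSV in the same directory (or Git repository), so the data stays a plain file that any spreadsheet can open, while the metadata remains machine-readable.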
There is already much information available on Data Packages:
- the Frictionless Data vision, which clearly lays out the problem and the proposed workflow to deal with heterogeneous sets of data
- the Data Package info page
- the Tabular Data Package info page, which is the format we use
- the comprehensive specifications for Data Packages and Tabular Data Packages
- many tools to manage and publish data packages at data.okfn.org/tools
So our common data format problem is now solved. We then faced another issue: how to publish and distribute these datasets in an equally frictionless way.
Second problem: Simple system to publish data packages
Something we'd also been missing was a central point from which to distribute our datasets. Having a site to aggregate all of our data packages would be a necessary step for some requirements we had:
- It would make hosting data workshops easier, by providing a quick way to access bulk data instead of fumbling around with USB sticks, Google documents and Dropbox links.
- It'd make our efforts more visible, by aggregating all the work that is currently scattered across different places and presenting it in a simple manner.
- More importantly, it gives us an easier way to present our work in gathering and converting data, and a better argument to present to public entities for publishing their data: instead of saying "Give us your data so we can convert it and make it open", we can simply say "Give us your data so it can be available at OurGreatOpenDataPortal.pt". Having a separate "brand" makes things easier to explain -- and open data matters are involved enough to be able to hold people's attention.
There are existing solutions, such as DataTank or, more prominently, CKAN. So why wouldn't CKAN be an option?
CKAN is a brilliant framework for hosting, managing and dealing with groups of heterogeneous datasets. However, installing CKAN is an involved process, and its power comes at the cost of maintaining a full web application: it requires a carefully configured server, regular updates, and keeping resource usage within reasonable limits. And since we're a small team, we don't need most of its advanced features (like permissions).
Finally, at Transparência Hackday we have to manage many web applications already, and being too familiar with the experience made us look for a simpler application design.
Solution: Data Central, a static site generator for data package collections
We set out to design a simple application that could meet our purposes. The main design principles are:
Enable access to bulk data sets. Easy, straightforward access to the actual files is the main driver behind the current implementation. This differs from an API-driven approach which, while powerful, would require significant additional complexity.
Generated static HTML site -- Publishing datasets doesn't need a real-time server-based application to query the data and show it. We would only need to update the site daily, at most, and we could then skip the server-side logic.
Generate locally and upload -- The site generation ought to happen locally. We decided to have one of our non-remote servers take care of the hard work of generating the site, and then upload it with rsync to a hosted service.
Low hardware footprint -- Local generation means that our system spec requirements are low. Not needing specialized hardware means that we can use an old computer for this task. It's actually what we do -- the site generation is being done on an old 2007 Sony Vaio laptop with a broken screen.
Separate the datasets from the site -- By hosting each data package on a separate Git repository, the local generator could fetch it and re-generate the site without having to host and manage a separate copy of the data package and run the risk of both versions going out of sync. We found this happens often when building a database-driven web application. By separating the data packages and the web frontend, packagers and editors can work independently on the data, while the site generator updates the live version periodically.
Operated via the command line -- For the sake of simplicity and at the cost of user-friendliness, we settled for a CLI-centered management workflow. We realised that managing this kind of site should be a mostly automated process, and an efficient way to do this would be to restrict the application to a set of scripts that can be managed through Makefiles and run by cron jobs.
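As a sketch of what that workflow can look like (the target names and the `fetch_repos.py` helper here are illustrative, not Datacentral's actual ones, though the post does mention a real `generate.py`):

```makefile
# Illustrative Makefile for an automated static-site build.
update:        ## re-fetch every tracked data package repository
	python fetch_repos.py

build: update  ## regenerate the static HTML into _output/
	python generate.py

deploy: build  ## push the generated site to the public web server
	rsync -az --delete _output/ user@example.org:/var/www/data/
```

A single cron entry such as `0 4 * * * cd /path/to/datacentral && make deploy` is then enough to keep the published site fresh without any manual intervention.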
There are some significant downsides to this direction, though.
There is no API since it's all just HTML. This might be the most evident shortcoming of a static approach.
This also means there are no search capabilities. One could always consider using a third-party search engine since the site is plain HTML that can be scraped by Google, DuckDuckGo and other web crawlers.
Since the application management is done locally through the command line, there isn't any web interface to make edits or changes inside the browser.
How it works
The workflow goes like this:
- Data packages are published and updated on individual repositories by package maintainers.
- The Datacentral application is configured with the list of repositories it should track.
- The first run of the application clones all repositories and generates the HTML pages for each data package.
- Individual HTML pages (About, Contact) are generated from local Markdown files.
- The generated output can then be pushed through FTP or rsync to a remote, public web server.
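The clone-then-update step of this workflow can be sketched in a few lines of Python. The repository URLs and the configuration shape below are hypothetical, and Datacentral's real implementation may differ; the point is only the "clone on first run, pull afterwards" logic:

```python
import os

# Hypothetical list of tracked data package repositories; Datacentral's
# actual configuration format may differ.
REPOS = [
    "https://github.com/example/budget-data-package.git",
    "https://github.com/example/elections-data-package.git",
]

def repo_name(url):
    """Derive a local directory name from a repository URL."""
    return url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")

def sync_command(url, workdir="repos"):
    """Return the git command to run: clone on the first run,
    fast-forward pull on every run after that."""
    target = os.path.join(workdir, repo_name(url))
    if os.path.isdir(target):
        return ["git", "-C", target, "pull", "--ff-only"]
    return ["git", "clone", url, target]

# Usage on each scheduled run (e.g. from cron):
#   for url in REPOS:
#       subprocess.run(sync_command(url), check=True)
```

Because the tracked repositories are plain Git clones, the generator never keeps a second, separately managed copy of the data that could drift out of sync.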
In practice, there is a generate.py script that inspects each data package and uses Jinja to fill in a set of HTML template files. It saves the generated HTML in an _output directory, which can then be inspected using a local webserver or pushed to a live VPS. All actions, from installation to generation and upload, can be carried out by means of a Makefile.
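A minimal sketch of that generation step is shown below. To stay self-contained it uses the standard library's string.Template rather than Jinja (which the real generate.py uses), and the page template and sample package are illustrative:

```python
import pathlib
from string import Template

# Illustrative page template; the real generate.py uses Jinja templates.
PAGE = Template(
    "<html><body>\n"
    "<h1>$title</h1>\n"
    "<ul>$files</ul>\n"
    "</body></html>\n"
)

def render_package(descriptor):
    """Render one parsed datapackage.json descriptor to an HTML page."""
    files = "".join(
        f"<li><a href=\"{r['path']}\">{r['path']}</a></li>"
        for r in descriptor.get("resources", [])
    )
    return PAGE.substitute(
        title=descriptor.get("title", descriptor["name"]),
        files=files,
    )

def generate(packages, output="_output"):
    """Write one HTML page per data package into the output directory."""
    out = pathlib.Path(output)
    out.mkdir(exist_ok=True)
    for pkg in packages:
        (out / f"{pkg['name']}.html").write_text(render_package(pkg))

generate([{"name": "demo", "title": "Demo package",
           "resources": [{"path": "data/demo.csv"}]}])
```

The resulting _output directory is plain static HTML, which is exactly what makes the rsync-to-a-cheap-host deployment model possible.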
If you're interested in reading more about Data Central and even trying it out (it's simple!), check out the project site. We'd heartily welcome all possible feedback, so please let us know about any bugs, suggestions or feature requests at the Datacentral issue tracker.