rlafuente/datacentral_writeup.md

## datacentral_writeup.md

      
First problem: Have a common format for storing datasets

At Transparência Hackday Portugal, as with any other open data interest group,
we work with many datasets. An issue that has been slowing us down for a long
time is that we never had a centralized solution for storing datasets: some are
in Google Docs, others in Git repositories, others live on web servers.
Underlying that was another issue, the data format: we found ourselves lost
among CSV and JSON files, SQL database dumps, spreadsheets and plaintext files.
Converting these was something we'd do on an ad hoc basis, and the challenge of
finding (or devising) a common format usually ran up against differing personal
preferences and the difficulty of mass-converting heterogeneous data
collections.

Solution: Tabular data packages

We stumbled almost accidentally into the Data Package standards page. It was a
revelation to see how elegant a solution this was to our format problems: using
the Tabular Data Package spec, we could go ahead and convert our datasets into
CSV, along with their metadata -- which is fairly easy to generate and maintain
using the existing tools for the job. From there, we can also develop scripts
to re-fetch and update the datasets, as well as post-processing tools to
generate other formats from the data package.

There is already much information available on Data Packages:

- the Frictionless Data vision, which clearly lays out the problem and the
  proposed workflow to deal with heterogeneous sets of data
- the Data Package info page
- the Tabular Data Package info page, which is the format we use
- the comprehensive specifications for Data Packages and Tabular Data Packages
- many tools to manage and publish data packages at data.okfn.org/tools
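
To make the format concrete, here is a minimal sketch of what a Tabular Data Package looks like on disk: a CSV file plus a datapackage.json descriptor that names the resource and the type of each column. The package name, file path and figures are invented for illustration, and only three field types are handled; the specification defines more.

```python
import csv
import json
import tempfile
from pathlib import Path

# Only a few of the field types from the spec are covered in this sketch.
CASTS = {"integer": int, "number": float, "string": str}


def write_package(folder: Path) -> None:
    """Write a tiny, made-up package: one CSV file plus its descriptor."""
    (folder / "data").mkdir(parents=True, exist_ok=True)
    with open(folder / "data" / "budget.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["year", "amount"])
        writer.writerows([[2012, 1500.5], [2013, 1720.0]])

    descriptor = {
        "name": "example-budget",  # hypothetical package name
        "title": "Example budget figures",
        "resources": [
            {
                "path": "data/budget.csv",
                "schema": {
                    "fields": [
                        {"name": "year", "type": "integer"},
                        {"name": "amount", "type": "number"},
                    ]
                },
            }
        ],
    }
    (folder / "datapackage.json").write_text(
        json.dumps(descriptor, indent=2), encoding="utf-8"
    )


def read_package(folder: Path) -> list:
    """Read the first resource, casting each column as the schema declares."""
    descriptor = json.loads((folder / "datapackage.json").read_text(encoding="utf-8"))
    resource = descriptor["resources"][0]
    types = {field["name"]: field["type"] for field in resource["schema"]["fields"]}
    with open(folder / resource["path"], newline="", encoding="utf-8") as f:
        return [
            {key: CASTS.get(types[key], str)(value) for key, value in row.items()}
            for row in csv.DictReader(f)
        ]


def demo() -> list:
    """Round-trip the example package through a temporary folder."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        write_package(folder)
        return read_package(folder)
```

Because the metadata lives in a plain JSON file next to the data, any script can discover the columns and their types without guessing, which is what makes the later re-fetching and post-processing steps straightforward.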

So our common data format problem is now solved. We then faced another issue:
how to publish and distribute these datasets in an equally frictionless way.

Second problem: Simple system to publish data packages

Something that we've also been missing was a central point from which to
distribute the datasets we have. Having a site to aggregate all of our data
packages would be a necessary step for some requirements we had:

- It would make hosting data workshops easier, by providing a quick way to
  access bulk data instead of fumbling around with USB sticks, Google documents
  and Dropbox links.
- It'd make our efforts more visible, by aggregating work that is currently
  scattered all over the place and presenting it in a simple manner.
- More importantly, it gives us an easier way to present our work in gathering
  and converting data, and a better argument to present to public entities for
  publishing their data: instead of saying "Give us your data so we can convert
  it and make it open", we can simply say "Give us your data so it can be
  available at OurGreatOpenDataPortal.pt". Having a separate "brand" makes
  things easier to explain -- and open data matters are already involved enough
  that holding people's attention is hard.

There are existing solutions, such as DataTank or, more prominently, CKAN. So
why wouldn't CKAN be an option?

CKAN is a brilliant framework for hosting, managing and dealing with groups of
heterogeneous datasets. However, installing CKAN is an involved process, and its
power comes at the cost of maintaining a full web application: it requires a
carefully configured server, regular updates, and monitoring to keep server
resource usage at a reasonable level. And since we're a small team, we don't
need most of its advanced features (like permissions).
Finally, at Transparência Hackday we already have many web applications to
manage, and being all too familiar with that experience made us look for a
simpler application design.

Solution: Data Central, a static site generator for data package collections

We set out to design a simple application that could meet our purposes. The main
design principles are:


Enable access to bulk data sets -- Easy, straightforward access to
the actual files is the main driver behind the current implementation. This
differs from an API-driven approach which, while powerful, would require
significant additional complexity.


Generated static HTML site -- Publishing datasets doesn't need a real-time
server-based application to query the data and show it. We would only need to
update the site daily, at most, and we could then skip the server-side logic.


Generate locally and upload -- The site generation ought to happen locally.
We decided to have one of our local machines take care of the hard work of
generating the site, and then upload it with rsync to a hosted service.


Low hardware footprint -- Local generation means that our system spec
requirements are low. Not needing specialized hardware means that we can use
an old computer for this task. It's actually what we do -- the site
generation is being done on an old 2007 Sony Vaio laptop with a broken screen.


Separate the datasets from the site -- By hosting each data package on a
separate Git repository, the local generator could fetch it and re-generate
the site without having to host and manage a separate copy of the data
package and run the risk of both versions going out of sync. We found this
happens often when building a database-driven web application. By separating
the data packages and the web frontend, packagers and editors can work
independently on the data, while the site generator updates the live version
periodically.


Operated via the command line -- For the sake of simplicity and at the cost
of user-friendliness, we settled for a CLI-centered management workflow. We
realised that managing this kind of site should be a mostly automated process,
and an efficient way to do this would be to restrict the application to a set
of scripts that can be managed through Makefiles and run by cron jobs.
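
As an illustration of the "separate the datasets from the site" and command-line principles, the fetch step could be sketched as below. This is not the actual Datacentral code: the repository URL, the function names and the folder layout are assumptions made for the example. The only external tool involved is git itself.

```python
import subprocess
from pathlib import Path

# Hypothetical list of tracked data package repositories.
REPOS = [
    "https://example.org/datasets/example-budget.git",
]


def repo_dir(url: str, base: Path) -> Path:
    """Local checkout folder, named after the last segment of the URL."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[:-4]
    return base / name


def git_command(url: str, base: Path) -> list:
    """Clone on the first run, fast-forward pull on every later run."""
    dest = repo_dir(url, base)
    if (dest / ".git").is_dir():
        return ["git", "-C", str(dest), "pull", "--ff-only"]
    return ["git", "clone", url, str(dest)]


def update_all(base: Path) -> None:
    """Bring every tracked repository up to date under the base folder."""
    base.mkdir(parents=True, exist_ok=True)
    for url in REPOS:
        # check=True stops the run if git reports an error.
        subprocess.run(git_command(url, base), check=True)
```

A cron job or a Makefile target would only need to call something like `update_all(Path("repos"))` before regenerating the site, so the repositories remain the single source of truth for the data.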


There are some significant downsides to this direction, though.


There is no API since it's all just HTML. This might be the most evident
shortcoming of a static approach.


This also means there are no search capabilities. One could always consider
using a third-party search engine since the site is plain HTML that can be
scraped by Google, DuckDuckGo and other web crawlers.


There is no support for dynamic content, such as a site blog. Listing external
feeds could be done through widgets in JavaScript.


Since the application management is done locally through the command line,
there isn't any web interface to make edits or changes inside the
browser.


How it works

The workflow goes like this:

1. Data packages are published and updated on individual repositories by
   package maintainers.
2. The Datacentral application is configured to become aware of which
   repositories it should track.
3. The first run of the application clones all repositories and generates the
   HTML pages for each data package.
4. Individual HTML pages (About, Contact) are generated from local Markdown
   files.
5. The generated output can then be pushed through FTP or rsync to a remote,
   public web server.

In practice, there is a generate.py script that inspects each data package
and uses Jinja to fill in a set of HTML template files. It saves the generated
HTML in an _output directory, which can then be inspected using a local
web server or pushed to a live VPS. All actions, from installation to
generation and upload, can be carried out by means of a Makefile.
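
The generation step can be sketched roughly as follows. This is a simplified stand-in, not the real generate.py: to stay dependency-free it uses `string.Template` from the Python standard library where the actual script uses Jinja, and the template markup, function names and one-page-per-package layout are assumptions for illustration.

```python
import html
import json
from pathlib import Path
from string import Template

# Stand-in for a Jinja template file; the markup is made up for illustration.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><ul>$resources</ul></body></html>"
)


def render_package(package_dir: Path) -> str:
    """Build one HTML page from a package's datapackage.json."""
    descriptor = json.loads(
        (package_dir / "datapackage.json").read_text(encoding="utf-8")
    )
    items = "".join(
        "<li>{}</li>".format(html.escape(resource["path"]))
        for resource in descriptor.get("resources", [])
    )
    # Fall back to the package name when no title is given.
    title = html.escape(descriptor.get("title", descriptor["name"]))
    return PAGE.substitute(title=title, resources=items)


def generate(repos_dir: Path, output_dir: Path) -> list:
    """Write one page per data package and return the file names written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    packages = sorted(
        p for p in repos_dir.iterdir() if (p / "datapackage.json").is_file()
    )
    for package_dir in packages:
        target = output_dir / (package_dir.name + ".html")
        target.write_text(render_package(package_dir), encoding="utf-8")
        written.append(target.name)
    return written
```

Calling `generate(Path("repos"), Path("_output"))` would leave a folder of plain HTML files that can be previewed locally, for instance with `python -m http.server`, before being uploaded.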

If you're interested in reading more about Data Central and even trying it out
(it's simple!), check out the project site. We'd heartily welcome all possible
feedback, so please let us know about any bugs, suggestions or feature requests
at the Datacentral issue tracker.