GSoC 2021 - Data Retriever, NumFOCUS


Google Summer of Code 2021 Final Work Report

Abstract

What is the Data Retriever?

Finding data is one thing. Getting it ready for analysis is another. Acquiring, cleaning, standardizing, and importing publicly available data is time-consuming because many datasets lack machine-readable metadata and do not conform to established data structures and formats. The Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages. The automation of this process reduces the time for a user to get most large datasets up and running by hours, and in some cases days.
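
As a rough illustration of that pipeline, here is a minimal sketch using the retriever's documented Python interface; the dataset name ("iris") and the output paths are only examples:

```python
# Minimal sketch of the Data Retriever workflow from Python.
# Function names follow the retriever's documented Python interface;
# the dataset name and paths are illustrative examples only.
import retriever as rt

# List the dataset scripts known to the retriever
print(rt.dataset_names())

# Download the raw files for a dataset without any processing
rt.download("iris", path="./raw_data")

# Download, clean, and install the same dataset as flat CSV files
rt.install_csv("iris")

# ...or load it straight into a SQLite database
rt.install_sqlite("iris", file="iris.sqlite")
```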

A number of data providers require an account with an associated login or API key to access data programmatically. The Data Retriever currently supports the Kaggle API, allowing users to securely install datasets hosted by Kaggle.

The goal of this project is to find sources of public data that require a login or API key for access and to integrate them into the Data Retriever. The following APIs match the goal of the project and have been added to the Data Retriever:

  • Socrata API

    The Socrata data platform enables governments to use data as a strategic asset in the design, management, and delivery of programs. Data flows easily between staff and departments, leading to more efficient programs and better decision making. Since many governments host their data on the Socrata platform and the hosted data is open data, it made sense to use the Socrata API to fetch those datasets and make them available to users of the Data Retriever. Currently the Data Retriever supports only tabular Socrata datasets (excluding datasets of the map: tabular type). The total number of Socrata datasets supported is 85,244 out of 213,965. The approach for integrating the Socrata API was to create a script for a Socrata dataset only when a user requests it. The listing methods return a list of datasets matching the user's search query; the user selects one dataset, the important information about that dataset is displayed, and the user then uses the Socrata identifier to download it. For more information refer to the documentation.

  • Rdatasets

    Rdatasets is a public repository on GitHub that hosts datasets from various R packages (currently 1,737 datasets). Ordinarily, using a dataset that ships with an R package would require the user to install R and that package on their system; Rdatasets makes the data available directly. The approach for integrating Rdatasets into the Data Retriever was the same as for the Socrata API: the user can list the packages and datasets present in Rdatasets and select the dataset to be installed. The retriever then creates the script for that dataset and installs it. For more information refer to the documentation.

  • tidycensus

    tidycensus is an R package that allows users to interface with the US Census Bureau's decennial Census and five-year American Community Survey APIs and return tidyverse-ready data frames, optionally with simple feature geometry included. This dataset was requested for the Data Retriever in issue #1539. The task at hand was to get the data from the R package and then install it using the retriever. Accessing the data requires an API key. The Python module rpy2 was used to interface between Python and R; using it, the data is pulled from the package and stored as raw data files, which are then installed using the retriever (see the sketch after this list).
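
The rpy2-based approach can be sketched as below. This is not the retriever's exact implementation: it assumes R, the tidycensus package, and a Census API key are available, and the chosen variable, geography, year, and output file name are purely illustrative.

```python
# Sketch of pulling data from the tidycensus R package via rpy2.
# Assumes R, the tidycensus package, and a Census API key are installed;
# the variable, geography, year, and output file are illustrative only.
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr

tidycensus = importr("tidycensus")

# Register the Census API key for this R session
tidycensus.census_api_key("YOUR_CENSUS_API_KEY")

# Fetch median household income (ACS variable B19013_001) for all states
r_df = tidycensus.get_acs(geography="state",
                          variables="B19013_001",
                          year=2019)

# Convert the R data frame to a pandas DataFrame and store it as raw data
with localconverter(ro.default_converter + pandas2ri.converter):
    df = ro.conversion.rpy2py(r_df)
df.to_csv("tidycensus_acs_state_income.csv", index=False)
```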

Data Retriever repository: https://www.github.com/weecology/retriever

Work Progress: My Blogs

Tasks Completed

Below is the summary of my contributions during the program:

Pull Requests

  • PR #1594 :   Removed the nuclear power plants dataset from the retriever because it was constantly changing (unstable).

  • PR #1598 :   Fixed issues #1570 and #1572. The retriever ls -v command, used for verbose listing of datasets, was malfunctioning due to missing license fields in dataset scripts. Added the necessary checks for the license fields. Authored 3 commits in the PR.

  • PR #1600 :   Added support for the Socrata API in the Data Retriever. Socrata datasets are not saved as scripts in the Data Retriever by default; instead, the user selects the dataset they want and the script for that particular dataset is created on the fly. The following new commands are supported by the retriever (a sketch of the dataset search behind these commands follows the pull request list):

    • retriever ls -s command to display an interactive prompt for the user to select one dataset from autocompleted suggestions based on their input.

    • retriever download socrata-<socrata id> command to download the raw data of the socrata dataset identified by the socrata id.

    • retriever install <engine> socrata-<socrata id> command to install the socrata dataset identified by the socrata id into the given engine.

  • PR #1601 :   The CI was breaking on one dependency during setup. Fixed the error by updating docker-compose.yml and the Dockerfile, which helped other failing PRs get merged.

  • PR #1603 :   I observed that the GitHub workflows for the CI were not running tests on all of the Python versions specified, but were instead running tests on a single version three times. Updated the old CI workflow and verified that the tests now run on the different Python versions.

  • PR #1605 :   Updated the function clean_column_name, which cleans column names in a dataset script.

  • PR #1606 :   Added the tidycensus dataset, which is contained in the R package tidycensus. Used the rpy2 module as an interface between R and Python to download the raw data and then install it into the engine.

  • PR #1613 :   Added support for Rdatasets in the Data Retriever. Total number of datasets currently available: 1,737. The following new commands are supported by the retriever:

    • retriever ls rdataset command to display the package name, dataset name and script name of the Rdatasets present in the package(s) requested by the user.

    • retriever download rdataset-<package>-<dataset name> command to download the raw data of the Rdataset <dataset name> present in the package <package>.

    • retriever install <engine> rdataset-<package>-<dataset name> command to install the Rdataset <dataset name> present in the package <package> into the given engine.

  • PR #125 :   Helped correct an issue in the coronavirus-belgium dataset.
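
The dataset search behind the Socrata listing command added in PR #1600 can be illustrated with a short sketch against Socrata's public Discovery API. The endpoint and response fields below come from Socrata's public documentation; the query string, result limit, and helper function are illustrative and not the retriever's actual implementation.

```python
# Illustrative sketch of searching the Socrata catalog, roughly the kind of
# listing that backs `retriever ls -s`. Uses Socrata's public Discovery API;
# the query and limit are examples, not the retriever's exact code.
import requests

DISCOVERY_URL = "https://api.us.socrata.com/api/catalog/v1"

def search_socrata(query, limit=5):
    """Return (id, name, domain) tuples for datasets matching `query`."""
    params = {"q": query, "only": "dataset", "limit": limit}
    response = requests.get(DISCOVERY_URL, params=params)
    response.raise_for_status()
    results = response.json().get("results", [])
    return [(r["resource"]["id"],
             r["resource"]["name"],
             r.get("metadata", {}).get("domain", "")) for r in results]

for dataset_id, name, domain in search_socrata("air quality"):
    # dataset_id is the Socrata identifier used in
    # `retriever install <engine> socrata-<socrata id>`
    print(f"{dataset_id}  {name}  ({domain})")
```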

Issues

  • #1593 :   Created an issue containing the research done on API-based dataset platforms for the GSoC program.

  • #1608 :   Opened an issue for the integration of Rdatasets into the Data Retriever.

 

Future Work

The project's goal was to support all datasets accessible through a login or API key. Only a couple of these APIs have been integrated into the Data Retriever so far. Some tasks that require more work and attention:

  • Continue searching for more of these APIs.

  • The CKAN API could also be integrated. More research is needed on how it works.

  • Add support for all the tabular Socrata datasets.

  • Add support for Raster and Vector type Socrata datasets.

I plan to continue contributing to the Data Retriever after GSoC'21 and to become an active contributor to the repository.


For me, the last three months have been an incredible learning experience, and I am grateful for everything I've learned. I learnt CI/CD using Docker and GitHub Actions, interfacing between R and Python, and using REST APIs. The entire experience has really aided my overall development as a developer, and I can confidently state that this has been the most fruitful summer of my life!

Finally, I'd like to express my gratitude to my mentors Henry Senyondo and Ethan White for allowing me to work with them. Henry Senyondo deserves special recognition for his unwavering support and leadership. Without his mentoring, I might have strayed from the project.
