- Name: Aakash Chaudhary
- Organisation: NumFOCUS
- Sub Organisation: Data Retriever
- Project: Data Retriever: Support for Login/API
- Proposal: Proposal Link
- Mentors: Henry Senyondo, Ethan White
What is Data Retriever?
Finding data is one thing. Getting it ready for analysis is another. Acquiring, cleaning, standardizing and importing publicly available data is time-consuming because many datasets lack machine-readable metadata and do not conform to established data structures and formats. The Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages. Automating this process reduces the time a user needs to get most large datasets up and running by hours, and in some cases days.
A number of data providers require the use of an account with an associated Login or API key to access data programmatically. The Data Retriever currently has support for the Kaggle API allowing users to securely use the Data Retriever to install datasets hosted by Kaggle.
The goal of this project is to find sources of public data that require a Login/API key for access and to integrate them into Data Retriever. The following APIs match the goal of the project and have been added to Data Retriever:
- The Socrata data platform enables governments to use data as a strategic asset in the design, management, and delivery of programs. Data flows easily between staff and departments, leading to more efficient programs and better decision making. Since many governments host their open data on the Socrata platform, it made sense to use the Socrata API to fetch those datasets and make them available to Data Retriever users. Currently Data Retriever supports only `tabular` Socrata datasets (except `map: tabular` type datasets), which covers 85,244 of the 213,965 Socrata datasets. The approach for integrating the Socrata API was to create the script for a Socrata dataset only when a user requests that dataset. The listing methods return the datasets that match the user's search, the user selects one of them, and we display the important information about that dataset. The user then uses the Socrata identifier to download the dataset. For more information, refer to the documentation.
- Rdatasets is a public repository on GitHub that hosts datasets from various R packages. Using a dataset that ships inside an R package would normally require the user to install R and that package on their system; this repository instead hosts the datasets directly (currently 1737 of them). The approach for integrating Rdatasets into Data Retriever was the same as for the Socrata API. The user lists the packages and datasets present in Rdatasets and selects a dataset; retriever then creates the script for that dataset and installs it. For more information, refer to the documentation.
- tidycensus is an R package that allows users to interface with the US Census Bureau's decennial Census and five-year American Community Survey APIs and return tidyverse-ready data frames, optionally with simple feature geometry included. This dataset was requested in Data Retriever issue #1539. The task was to get the data from the R package and then install it using retriever. Accessing the data requires an API key. A Python module named `rpy2` was used to interface between Python and R. Using this module, we were able to get the data from the package, store it as raw data files, and then install the dataset using retriever.
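The rpy2 step described for tidycensus can be sketched roughly as follows. This is a minimal illustration, not the code merged into retriever: the helper names `census_api_key_from_env` and `save_acs_table`, the default variable `B01003_001` (total population), and the choice to let R write the CSV are all assumptions made for the example. It also assumes that R, the tidycensus R package, and rpy2 are installed, and that the key is stored in the `CENSUS_API_KEY` environment variable.

```python
import os


def census_api_key_from_env(var_name="CENSUS_API_KEY"):
    """Read the Census API key from the environment instead of hard-coding it.

    Hypothetical helper for this sketch; not part of retriever.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} to a valid Census API key first")
    return key


def save_acs_table(csv_path, geography="state", variables="B01003_001", year=2019):
    """Fetch one ACS table through tidycensus and write it to csv_path.

    Hypothetical helper for this sketch; not part of retriever.
    """
    # Imported lazily so this module still loads where R or rpy2 is missing.
    import rpy2.robjects as ro
    from rpy2.robjects.packages import importr

    # Load the R package; this fails if tidycensus is not installed in R.
    tidycensus = importr("tidycensus")
    tidycensus.census_api_key(census_api_key_from_env())

    # Call the R function get_acs() from Python; the result is an R data frame.
    table = tidycensus.get_acs(geography=geography, variables=variables, year=year)

    # Let R write the raw data file, which avoids an R-to-pandas conversion.
    ro.r["write.csv"](table, file=csv_path, **{"row.names": False})
    return csv_path
```

Calling `save_acs_table("acs_population.csv")` would then leave a raw CSV file on disk that a dataset script can pick up and install like any other tabular source.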
Data Retriever repository: https://www.github.com/weecology/retriever
Work Progress: My Blogs
Below is a summary of my contributions during the program:
- PR #1594: Removed the nuclear power plants dataset from retriever because it was constantly changing (unstable).
- PR #1598: Fixes issues #1570 and #1572. The `retriever ls -v` command, used for verbose listing of datasets, was malfunctioning because of missing license fields in dataset scripts. Added the necessary checks for the license fields. Authored 3 commits in the PR.
- PR #1600: Added support for the Socrata API in Data Retriever. Socrata datasets are not saved as scripts in Data Retriever by default; instead, users select the dataset they want, and we then create the script for that particular dataset. The following new commands are supported by `retriever`:
  - `retriever ls -s` displays an interactive prompt in which the user selects one dataset from autocompleted suggestions based on their input.
  - `retriever download socrata-<socrata id>` downloads the raw data of the Socrata dataset identified by the `socrata id`.
  - `retriever install <engine> socrata-<socrata id>` installs the Socrata dataset identified by the `socrata id` into the given engine.
- PR #1601: The CI was breaking on one dependency during setup. Fixed the error by updating `docker-compose.yml` and `Dockerfile`, which helped other failing PRs get merged.
- PR #1603: I observed that the GitHub workflows for the CI were not running tests on all the Python versions listed; they ran the tests on a single version three times. Updated the old CI workflow and verified that the tests ran on the different Python versions.
- PR #1605: Updated the function `clean_column_name`, which cleans column names in a dataset script.
- PR #1606: Added the `tidycensus` dataset, which is contained in the R package tidycensus. Used the `rpy2` module as an interface between `R` and `Python` to download the raw data and then install it into the engine.
- PR #1613: Added support for Rdatasets in Data Retriever. Total number of datasets currently available: 1737. The following new commands are supported by `retriever`:
  - `retriever ls rdataset` displays the package name, dataset name and script name of the Rdatasets present in the package(s) requested by the user.
  - `retriever download rdataset-<package>-<dataset name>` downloads the raw data of the rdataset `dataset name` present in the package `package`.
  - `retriever install <engine> rdataset-<package>-<dataset name>` installs the rdataset `dataset name` present in the package `package`.
- PR #125: Helped correct an issue in the `coronavirus-belgium` dataset.
- #1593: Created an issue containing the research done on API-based dataset platforms for the GSoC program.
- #1608: Added an issue for integrating Rdatasets into Data Retriever.
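The `socrata-<socrata id>` and `rdataset-<package>-<dataset name>` names used by the commands above carry enough information to locate the raw data on demand. The sketch below illustrates that idea only; it is not retriever's implementation. The function names, the returned dictionary layout, and the example domain `data.example.gov` are hypothetical, and the URL patterns (the Socrata CSV export endpoint and the Rdatasets GitHub Pages mirror) are assumptions about where the files live.

```python
import re
from urllib.parse import quote

# Socrata identifiers are "four-by-four" codes such as "abcd-1234".
SOCRATA_ID = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")

# Assumed location of the CSV files published by the Rdatasets repository.
RDATASETS_CSV_ROOT = "https://vincentarelbundock.github.io/Rdatasets/csv"


def parse_dataset_name(name):
    """Split an on-demand dataset name into its source and identifiers."""
    if name.startswith("socrata-"):
        socrata_id = name[len("socrata-"):]
        if not SOCRATA_ID.fullmatch(socrata_id):
            raise ValueError(f"not a valid Socrata identifier: {socrata_id!r}")
        return {"source": "socrata", "id": socrata_id}
    if name.startswith("rdataset-"):
        # Split at most twice so a dataset name may itself contain hyphens.
        parts = name.split("-", 2)
        if len(parts) != 3 or not parts[1] or not parts[2]:
            raise ValueError("expected rdataset-<package>-<dataset name>")
        return {"source": "rdataset", "package": parts[1], "dataset": parts[2]}
    raise ValueError(f"unsupported dataset name: {name!r}")


def csv_url(parsed, socrata_domain=None):
    """Build the raw CSV URL for a parsed dataset name."""
    if parsed["source"] == "socrata":
        # A Socrata id is only unique together with the portal that hosts it.
        if not socrata_domain:
            raise ValueError("a Socrata domain is required")
        return (
            f"https://{socrata_domain}/api/views/{parsed['id']}"
            "/rows.csv?accessType=DOWNLOAD"
        )
    package = quote(parsed["package"], safe="")
    dataset = quote(parsed["dataset"], safe="")
    return f"{RDATASETS_CSV_ROOT}/{package}/{dataset}.csv"


if __name__ == "__main__":
    print(csv_url(parse_dataset_name("rdataset-datasets-iris")))
    print(csv_url(parse_dataset_name("socrata-abcd-1234"), "data.example.gov"))
```

Keeping the parsing separate from the URL construction means that unsupported or malformed names are rejected before any network request is attempted.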
The project's goal was to support all datasets accessible through a Login/API key. Only a couple of these APIs have been integrated into Data Retriever so far. Some tasks that require more work and attention:
- Continue searching for more of these APIs.
- The CKAN API could also be integrated. More research into how it works should be conducted first.
- Add support for all the tabular Socrata datasets.
- Add support for raster and vector type Socrata datasets.
I plan to continue contributing to Data Retriever after GSoC'21 and to become an active contributor to the repository.
For me, the last three months have been an incredible learning experience, and I am grateful for everything I've learned. I learned CI/CD using Docker and GitHub Actions, interfacing between R and Python, and using REST APIs. The entire experience has really aided my growth as a developer, and I can confidently state that this has been the most fruitful summer of my life!
Finally, I'd like to express my gratitude to my mentors Henry Senyondo and Ethan White for allowing me to work with them. Henry Senyondo deserves special recognition for his unwavering support and leadership. Without his mentoring, I might have strayed from the project.