Safe Drinking Water Project - Recommendations for storage, processing, and cleaning of data

Code for Boston - Safe Drinking Water Project

Daniel Reeves - process notes and recommendations

Summary of recommendations

  • Separate out processed and raw data in the data/ folder.
  • Include README.md files within each folder containing data files; also utilize the Issues tab in GitHub more often.
  • Utilize branches in the GitHub workflow, and keep the master branch's data folder "clean," so to speak.

1. Data folder organization

For most data projects, you separate out raw and processed data something like this:

data/
    processed/
        private_wells/
        rainfall_water_system_matching/
    raw/
        private_wells/
        rainfall/
        sdwis/

There are a few issues with this approach for this particular project (covered in Appendix A), but I nonetheless recommend it for the Safe Drinking Water project. The main advantage is that it draws a clear line between data exactly as it came from its source and data we have transformed in some way.

It does not necessarily need to be the case that all data folders have both raw and processed versions. Some data will appear in different forms across both, some data will just be in one, some data will just be in the other. All that matters is that "raw" data is unprocessed and can be very straightforwardly derived from its source. If any assumptions need to be made to make the data usable, or if any data transformations are done (e.g. dropping some rows, or merging to other data), then it is no longer "raw."
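
As a minimal sketch of how that delineation plays out in code (the file names, columns, and cleaning step below are hypothetical, not a description of our actual data):

import pandas as pd

# Read a raw extract exactly as it came from the source...
raw = pd.read_csv("data/raw/private_wells/wells.csv")

# ...then apply an assumption (here, dropping rows with missing coordinates).
# The moment a step like this happens, the output is no longer "raw."
processed = raw.dropna(subset=["latitude", "longitude"])

processed.to_csv("data/processed/private_wells/wells.csv", index=False)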

I've discussed this with one other member of the team, and while I don't want to speak for them in this medium, my understanding is that there is additional support for a structure along these lines.

2. Documenting the data

I've thought quite a bit about this, and I think the solution is to directly include a README.md file in each data folder (both raw and processed).

data/
    processed/
        private_wells/
            README.md
            wells.csv
        sdwis/
            README.md
            file1.csv
            file2.csv
    raw/
        private_wells/
            README.md
            wells.csv
        sdwis/
            README.md
            file1.csv
            file2.csv

The following things should be included in each README file:

  • Summary: Brief, 1-2 sentence description of your data near the top of the document.
  • Collected date: The date(s) you collected the data. (In ISO 8601 date format, e.g. 2019-09-15.)
  • Last updated: The date the data was last updated/processed (in ISO 8601 date format, e.g. 2019-09-15), if it was processed. Otherwise "N/A".
  • Source: Where you found this data. This should contain enough information for someone to be able to reasonably replicate the process if they want to. Note here which file you used to process the data, if you did process it.
  • Description: Details on what this data contains, e.g. point to the columns you believe will be useful, and/or describe what each row represents.
  • Use: Brief explanation of why you believe this data will be useful. It doesn't have to be long; a couple of sentences will often suffice, just enough so that people understand how they can make use of it.
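
For example, a README for a raw data folder might look something like the following (the details here are placeholders, not a description of any of our actual data sets):

Private wells (raw)

Summary: Locations and basic attributes of private wells, as downloaded from the state data portal.
Collected date: 2019-09-01
Last updated: N/A
Source: <link to the portal/query used, with enough detail to re-download the file>
Description: One row per well; the most useful columns are the well ID and the latitude/longitude coordinates.
Use: Lets us link wells to nearby water systems and rainfall data.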

Note that I previously suggested including two more points, namely "Work Performed" and "To-Do." However, another team member made a convincing case that these should not be included:

  • Work Performed should be self-evident in any code that is done to process the data.
  • Anything that is "To-Do" should be directed to the Issues tab of the GitHub repo. This centralizes the list of things that need to be done, instead of requiring people to peek at each source of data for work that needs to be performed. This is contingent on people actually utilizing the Issues tab, however.

In Appendix B, I discuss other solutions to organizing data and how I came to this particular solution.

3. Where to put data

The previous two sections detail how the master branch's data/ should be structured (Section 1), and how each folder within data/*/ should be structured (Section 2). But there is one ingredient missing: how do we decide which data should go into the repository at all?

This is an issue for two reasons:

  • GitHub limits the size of single files (100 MB), and additionally there is a limit to storage on a single repository (1 GB).
  • It is possible to clutter the master branch with too much deprecated, low-value, or useless data. Even without data storage issues on both GitHub and local machines, this can make the data folder unwieldy.

So-called "low-value" data can take a few forms:

  • It might be a poor feature for all models or outside the scope of project goals. (Note that data that would be useful, except for it being unclean, is covered by the raw/processed delineation.)
  • It can be existing data, simply filtered by rows or columns, or with a couple of calculations applied. It can be useful to keep data like this around to speed up a single analysis, but it might not be something that someone else actually needs on their machine.
  • It can be the results of an analysis that do not necessarily need to be incorporated into other models.

Note that "low-value" from a data storage perspective does not mean that the data (or the analysis that produced it) is useless. Low-value means it does not necessarily need to be stored and shared through GitHub. I think the most obvious example of this is simple_time_based_model.zip. My understanding is that this is a .csv file exported from a DataFrame in someone's analysis. The analysis itself may be very useful, but storing the output as a .zip file, in my opinion, is not. Anyone who is interested in the results of this analysis should read the corresponding Jupyter notebook, and somewhere in the notebook could be a link to (for example) a Zip file on the Slack for those who have difficulty running the code.

Sometimes deciding what is and isn't low-value from a storage perspective is tricky. For example, there is a great analysis where Mia and John associated water systems with their nearest rainfall data. I believe that this is high-value and should be in the data/ folder for a few reasons:

  • It takes a long time to run this Notebook.
  • It is something that others may want to readily incorporate into their own analyses.
  • Someone looking to answer the question of "has anyone linked these two data sources?" may intuitively look through the data/ folder for an answer to this question, instead of through the analyses folder.

In a sense, their work is data processing as much as it is an analysis.

I believe these criteria are mainly relevant for the master branch. However, by utilizing additional branches, these issues become less pronounced. If someone is working on a "New Hampshire analysis" branch, for example, it's not a big deal if there is a folder such as data/processed/whatever that has data very specific to this analysis that may not be useful elsewhere. Or if there is a branch called "feature engineering exploration" that is specifically devoted to messing around with features that may or may not ultimately be of use, we can dump more things into that branch without impeding other people's workflows.

I don't anticipate many issues with data storage. However, if GitHub's 1 GB limit becomes binding, then the README file would be kept in the repository by itself, and it would contain a link to download a .zip file containing the data.
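
If we do end up going that route, the download step could be close to a one-liner like the sketch below (the URL and destination folder are hypothetical; in practice they would come from the folder's README):

import io
import urllib.request
import zipfile

# Hypothetical URL pointing at the archived data for this folder.
url = "https://example.com/sdwis_raw.zip"

# Download the archive and extract it next to the README.
with urllib.request.urlopen(url) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extractall("data/raw/sdwis/")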

I jointly discuss alternatives to the solutions proposed in sections 2 and 3 in Appendix B.

4. Additional issues

This document concerns the storage, processing, and cleaning of data. However, it does not specifically address where to store the notebooks/scripts that do the data cleaning (as opposed to notebooks/scripts that are purely analytical or modeling focused). I do believe these two kinds of work should be separated. E.g., in the code/ folder, we could create subfolders such as data_processing/, data_exploration/, and modeling/, or something akin to this.
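
If we went that route, the code/ folder might look something like this (the folder names are just suggestions):

code/
    data_processing/
    data_exploration/
    modeling/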

I currently have no recommendations on this other than that all README files associated with processed data should clearly link to whatever notebook or script was used to generate the processed data from the raw data source. I hope that once a data contribution process is clearly laid out we can begin to discuss this next step! :)


Appendix A - Issues with the raw/processed data folder organization solution

The reason you typically separate raw and processed data into two folders is that you're usually designing a code base that can be run start to finish, and the first two steps of the analysis are to (1) gather the raw data, and (2) process/clean the raw data. Steps 3 and onward are then the actual analysis. (E.g. in this project I did in 2017, raw/ and data/ are separate folders in the root directory, and step 1 is cleaning; there is no collection step in this case because the data was emailed to me.)

A good data analysis project can start with nothing (or raw data), and with a click of a button, produce everything. This is what it means for a project to be replicable, and it ensures that every line of the code that got you to the solution is actually run. For example, projects without this workflow might run into issues where critical data cannot be replicated over time as the project evolves, or critical data is not being updated at run-time and you end up accidentally working with old data.
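
As a sketch of what "a click of a button" could mean here, the whole pipeline might be driven by one top-level script. The module and function names below are assumptions, not existing code; the point is only that every step from raw data to results is invoked in one place:

# run_all.py -- hypothetical driver that regenerates everything from raw data
from data_processing import clean_sdwis, clean_private_wells
from modeling import fit_models

if __name__ == "__main__":
    clean_sdwis()           # data/raw/sdwis/ -> data/processed/sdwis/
    clean_private_wells()   # data/raw/private_wells/ -> data/processed/private_wells/
    fit_models()            # data/processed/ -> model outputs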

Some issues may arise with the proposed solution for this particular project:

  1. This folder design is usually for projects with code that is run start-to-finish, but this project workflow likely isn't going to be like this. (Note that we should still value replicability.)
  2. Creating a processed/ subdirectory would require refactoring a lot of existing code.
  3. The line between raw and processed can be unclear. For example, there are certainly cases where SDWIS is "unclean," as outlined by the GAO report on the data integrity issues, but it's usable in its current state despite room for improvement.
  4. There may be storage limitations both on the cloud and on people's personal computers.
  5. There may be some conflict or confusion with the other proposed solution of keeping a README file in each folder of data because it then becomes unclear whether a README.md file should be in the processed or raw directory.

My responses to these issues are:

  1. I believe that our current workflow can fit within this folder structure relatively easily.
  2. Refactoring code is as simple as replacing all instances of ../../../data with ../../../data/processed in the code base (see the sketch after this list).
  3. For our purposes, it may be better to think of the processed/ folder as the current working version of every data set, instead of seeing it as a "final" version. In fact, I suggest that each README.md has a "last updated" line in the file for this exact reason.
  4. The storage limitation for raw data on the cloud would be that we might need to separate out SDWIS_processed.zip and SDWIS_raw.zip, which creates an extra step. This isn't a big deal. For personal computers: While we should encourage people who have ample space on their computers to download raw data, it can be more "optional" for people not working directly with a certain set of data. So for example, if we onboard someone who wants to work with Census data for feature engineering, they don't need to download raw/ SDWIS data.
  5. I don't think it's a big deal if there are two copies of README between raw/ and processed/.
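
As for point 2, the refactor could be close to a one-off find-and-replace script like this (the code/ location and the exact path string are assumptions about the repo layout):

from pathlib import Path

# Rewrite hard-coded relative paths in scripts and notebooks so they
# point at data/processed/ instead of data/.
files = list(Path("code").rglob("*.py")) + list(Path("code").rglob("*.ipynb"))
for path in files:
    text = path.read_text()
    if "../../../data/" in text:
        path.write_text(text.replace("../../../data/", "../../../data/processed/"))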

Appendix B - Other solutions for data documentation and storage

I proposed utilizing branches more regularly and including READMEs as my solutions for documentation and storage. Here are the alternatives that were considered:

  • Utilize the Google Drive for data storage and documentation. I initially suggested using the Google Drive because it is easier to add things to it, but another team member suggested that using GitHub has the benefit of centralizing the work, while branches outside of the master branch allow people to be a bit "messier" with their contributions. An additional benefit of branches is that they provide a hierarchy of what exactly some data is useful for, e.g. a folder such as data/processed/mydata may be under a branch called "foobar analysis," in which case we know that the additional data is for that specific task (whereas on the Google Drive, it would just get lost in a sea of data). With all that said, I do believe there could be some value in utilizing the Google Drive for very exploratory things that are still intended to be collaborative, but the gap between collaborative exploratory data cleaning/analysis that shouldn't go on GitHub, and something that should be on a GitHub branch, is probably very small.

  • Keep READMEs in the docs/data/ folder. My issue with this is that it's possible .zip files will be passed around, and it's easier to keep the README.md alongside some .txt files. It's also more intuitive to peek around the data folder at your data and read the README alongside it, instead of peeking around the docs and the data folder at the same time. With all that said, it makes sense to document some very important things in the documentation separately from the READMEs.
