- Separate out processed and raw data in the `data/` folder.
- Include README.md files within each folder containing data files; also utilize the Issues tab in GitHub more often.
- Utilize branches in the GitHub workflow, and keep the master branch's data folder "clean," so to speak.
For most data projects, you would separate out raw and processed data something like this:
```
data/
    processed/
        private_wells/
        rainfall_water_system_matching/
    raw/
        private_wells/
        rainfall/
        sdwis/
```
There are a few issues with this approach for this particular project (covered in Appendix A), but I nonetheless recommend it for the Safe Drinking Water project. There are various advantages to this structure.
It does not necessarily need to be the case that all data folders have both raw and processed versions. Some data will appear in different forms across both, some data will just be in one, some data will just be in the other. All that matters is that "raw" data is unprocessed and can be very straightforwardly derived from its source. If any assumptions need to be made to make the data usable, or if any data transformations are done (e.g. dropping some rows, or merging to other data), then it is no longer "raw."
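To make that boundary concrete, here is a minimal sketch (the column names are invented for illustration, not taken from the actual data) of a transformation whose output belongs in `processed/`, never back in `raw/`:

```python
import pandas as pd

# Hypothetical wells data; "raw" means byte-for-byte as collected.
raw = pd.DataFrame({
    "WELL_ID": [1, 2, 3],
    "latitude": [43.2, None, 43.9],
    "longitude": [-71.5, -71.6, None],
})

# Dropping rows and renaming columns are transformations, so the result
# is no longer "raw" and belongs under data/processed/.
processed = (
    raw.dropna(subset=["latitude", "longitude"])
       .rename(columns={"WELL_ID": "well_id"})
)

# In the project this would be written out with something like:
# processed.to_csv("data/processed/private_wells/wells.csv", index=False)
```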
I've discussed this with one other member of the team, and I don't want to speak for them in this medium, but my understanding is that they also support a structure akin to this.
I've thought quite a bit about this, and I think the solution is to directly include a `README.md` file in each data folder (both raw and processed):
```
data/
    processed/
        private_wells/
            README.md
            wells.csv
        sdwis/
            README.md
            file1.csv
            file2.csv
    raw/
        private_wells/
            README.md
            wells.csv
        sdwis/
            README.md
            file1.csv
            file2.csv
```
The following things should be included in each README file:
- Summary: Brief, 1-2 sentence description of your data near the top of the document.
- Collected date: The date(s) you collected the data, in ISO 8601 format (e.g. 2019-09-15).
- Last updated: The date the data was last updated/processed (also in ISO 8601 format), if it was processed. Otherwise "N/A".
- Source: Where you found this data. This should contain enough information for someone to be able to reasonably replicate the process if they want to. Note here which file you used to process the data, if you did process it.
- Description: Details on what this data contains, e.g. point to the columns you believe will be useful, and/or describe what each row represents.
- Use: Brief explanation of why you believe this data will be useful. It doesn't have to be long; a couple of sentences will often suffice, just enough that people understand how they can make use of it.
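To make the checklist easy to follow, a small helper could stamp a stub README.md into any data folder that lacks one. This is a hypothetical sketch, not existing project code; the section names come straight from the list above.

```python
from pathlib import Path

# Stub template mirroring the checklist above.
TEMPLATE = """\
# {name}

## Summary
(Brief, 1-2 sentence description of the data.)

## Collected date
YYYY-MM-DD

## Last updated
N/A

## Source
(Where the data was found, with enough detail to replicate the process.)

## Description
(What the data contains: useful columns, what each row represents.)

## Use
(Why this data is expected to be useful.)
"""

def scaffold_readmes(data_root):
    """Create a stub README.md in every data subfolder that lacks one."""
    created = []
    for folder in sorted(Path(data_root).glob("*/*")):  # e.g. data/raw/sdwis
        if not folder.is_dir():
            continue
        readme = folder / "README.md"
        if not readme.exists():
            readme.write_text(TEMPLATE.format(name=folder.name))
            created.append(readme)
    return created
```

Running `scaffold_readmes("data")` once would leave existing READMEs untouched and only fill in the missing ones.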
Note that I previously suggested two more points to be included, i.e. "Work Performed" and "To-Do." However, another team member made a convincing case that these should not be included:
- Work Performed should be self-evident from whatever code was written to process the data.
- Anything that is "To-Do" should be directed to the Issues tab of the GitHub repo. This centralizes the list of things that need to be done, instead of requiring people to peek at each source of data for work that needs to be performed. This is contingent on people actually utilizing the Issues tab, however.
In Appendix B, I discuss other solutions to organizing data and how I came to this particular solution.
The previous two sections detail how the master branch's `data/` folder should be structured (Section 1), and how each folder within `data/*/` should be structured (Section 2). But there is one ingredient missing: how do we know which data should go in there at all?
This is an issue for two reasons:
- GitHub limits the size of single files (100 MB), and additionally there is a limit to storage on a single repository (1 GB).
- It is possible to clutter the master branch with too much deprecated, low-value, or useless data. Even without data storage issues on both GitHub and local machines, this can make the data folder unwieldy.
So-called "low-value" data can take a few forms:
- It might be a poor feature for all models or outside the scope of project goals. (Note that data that would be useful, except for it being unclean, is covered by the raw/processed delineation.)
- It can be existing data, but simply filtered by rows or columns, or with a couple of calculations done. It can be of use to keep data like this for a single analysis to help speed things up for you, but it might not be something that someone else actually needs on their machine.
- It can be the results of an analysis that do not necessarily need to be incorporated into other models.
Note that "low-value" from a data storage perspective does not mean that the data (or the analysis that produced it) is useless. Low-value means it does not necessarily need to be stored and shared through GitHub. I think the most obvious example of this is `simple_time_based_model.zip`. My understanding is that this is a zipped .csv file exported from a DataFrame in someone's analysis. The analysis itself may be very useful, but storing its output as a .zip file, in my opinion, is not. Anyone who is interested in the results of this analysis should read the corresponding Jupyter notebook, and somewhere in the Jupyter notebook could be a link to (for example) a Zip file on the Slack for those who are having difficulty running the code.
Sometimes deciding what is and isn't low-value from a storage perspective is tricky. For example, there is a great analysis where Mia and John associated water systems with their nearest rainfall data. I believe that this is high-value and should be in the `data/` folder for a few reasons:
- It takes a long time to run this Notebook.
- It is something that others may want to readily incorporate into their own analyses.
- Someone looking to answer the question "has anyone linked these two data sources?" may intuitively look through the `data/` folder for an answer, instead of through the analyses folder.
In a sense, their work is data processing as much as it is an analysis.
I believe that these criteria are relevant for the master branch. However, by utilizing additional branches, these issues become less pronounced. If someone is working on a "New Hampshire analysis" branch, for example, it's not a big deal if there is a folder such as `data/processed/whatever` that has data very specific to this analysis and may not be useful elsewhere. Or if there is a branch called "feature engineering exploration" that is specifically devoted to messing around with features that may or may not ultimately be of use, we can dump more things into that branch without impeding other people's workflows.
I don't believe there should be many issues with data storage. However, if GitHub's 1 GB limit becomes binding, then README files should also contain a link to download a .zip file containing the data; only the README file itself would be kept in the GitHub repo.
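If we ever do hit that limit, the README-plus-download pattern could be automated with a small helper. Everything here is a hypothetical sketch; the actual URL would come from the data folder's README:

```python
import io
import zipfile
from urllib.request import urlopen

def fetch_data(url, dest):
    """Download the .zip linked from a data folder's README and extract it
    into that folder, so the repo itself only stores the README."""
    with urlopen(url) as resp:  # works for https:// and file:// URLs
        payload = resp.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)  # creates dest if it doesn't exist
```

Usage would look like `fetch_data(url_from_readme, "data/raw/sdwis")`, run once after cloning the repo.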
I jointly discuss alternatives to the solutions proposed in sections 2 and 3 in Appendix B.
This document concerns the storage, processing, and cleaning of data. However, it does not specifically address where to store notebooks/scripts that specifically address data cleaning (as opposed to notebooks/scripts that are purely analytical or modeling focused). I do believe that for this process there should be a separation of these two tasks; e.g. in the `code/` folder, we could create subfolders such as `data_processing/`, `data_exploration/`, and `modeling/`, or something akin to this.
I currently have no recommendations on this other than that all README files associated with processed data clearly link to whatever analysis was used to generate the processed data from the raw source. I hope that once a data contribution process is clearly laid out, we can begin to discuss this next step! :)
The reason you typically separate out raw and processed data into two folders is that you're usually designing a code base that can be run start to finish, and the first two steps of the analysis are to (1) gather the raw data, and (2) process/clean the raw data. Steps 3 and onward are then the actual analysis. (E.g. in this project I did in 2017, `raw/` and `data/` are separate folders in the root directory, and step 1 is cleaning; there is no collection step in that case because the data was emailed to me.)
A good data analysis project can start with nothing (or raw data), and with a click of a button, produce everything. This is what it means for a project to be replicable, and it ensures that every line of the code that got you to the solution is actually run. For example, projects without this workflow might run into issues where critical data cannot be replicated over time as the project evolves, or critical data is not being updated at run-time and you end up accidentally working with old data.
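As a sketch of what that "click of a button" could look like for this project (the step functions are placeholders, not existing project code):

```python
# Hypothetical start-to-finish entry point. Each step function stands in
# for the project's real collection, cleaning, and analysis code.
def collect_raw_data():
    print("1. downloading raw data into data/raw/ ...")

def process_raw_data():
    print("2. cleaning data/raw/ into data/processed/ ...")

def run_analyses():
    print("3. running analyses on data/processed/ ...")

def main():
    # Running everything from scratch is what makes the project replicable:
    # stale intermediate files can never silently leak into the results.
    collect_raw_data()
    process_raw_data()
    run_analyses()

if __name__ == "__main__":
    main()
```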
Some issues may arise with the proposed solution for this particular project:
- This folder design is usually for projects with code that is run start-to-finish, but this project workflow likely isn't going to be like this. (Note that we should still value replicability.)
- Creating a `processed/` subdirectory would require refactoring a lot of existing code.
- The line between raw and processed can be unclear. For example, there are certainly cases where SDWIS is "unclean," as outlined by the GAO report on its data integrity issues, but it's usable in its current state despite room for improvement.
- There may be storage limitations both on the cloud and on people's personal computers.
- There may be some conflict or confusion with the other proposed solution of keeping a README file in each folder of data because it then becomes unclear whether a README.md file should be in the processed or raw directory.
My responses to these issues are:
- I believe that our current workflow can fit within this folder structure relatively easily.
- Refactoring code is as simple as replacing all instances of `../../../data` with `../../../data/processed` in the code base.
- For our purposes, it may be better to think of the `processed/` folder as the current working version of every data set, instead of seeing it as a "final" version. In fact, I suggest that each `README.md` have a "last updated" line for this exact reason.
- The storage limitation for raw data on the cloud would be that we might need to separate out `SDWIS_processed.zip` and `SDWIS_raw.zip`, which creates an extra step. This isn't a big deal. For personal computers: while we should encourage people who have ample space on their computers to download raw data, it can be more "optional" for people not working directly with a certain set of data. So, for example, if we onboard someone who wants to work with Census data for feature engineering, they don't need to download the raw SDWIS data.
- I don't think it's a big deal if there are two copies of the README between `raw/` and `processed/`.
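The path refactor mentioned above could be done with a one-off script along these lines (the glob patterns and the old/new path strings are assumptions about how the repo's code is laid out):

```python
from pathlib import Path

def repoint_data_paths(code_root, old="../../../data", new="../../../data/processed"):
    """One-off refactor: rewrite hard-coded data paths in notebooks/scripts.
    This is a naive string replacement -- review the diff before committing."""
    changed = []
    for pattern in ("**/*.py", "**/*.ipynb"):
        for path in sorted(Path(code_root).glob(pattern)):
            text = path.read_text()
            if old in text:
                path.write_text(text.replace(old, new))
                changed.append(path)
    return changed
```

Returning the list of changed files makes it easy to eyeball (or `git diff`) exactly what was touched.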
I proposed utilizing branches more regularly and including READMEs as solutions for documentation and storage. Here are the alternatives I considered:
- Utilize the Google Drive for data storage and documentation. I initially suggested using the Google Drive because it is easier to add things to it, but another team member suggested that using GitHub has the benefit of centralizing the work, while the branches outside of the master branch allow people to be a bit "messier" with their contributions. An additional benefit of branches is that they provide a hierarchy of what exactly some data is useful for; e.g. a folder such as `data/processed/mydata` may be under a branch called "foobar analysis," in which case we know that the additional data is for that specific task (whereas on the Google Drive, it would just get cluttered in a sea of data). With all that said, I do believe there could be some value in utilizing the Google Drive for very exploratory things that are still intended to be collaborative, but the gap between collaborative exploratory data cleaning/analysis that shouldn't go on GitHub, and something that should be on a GitHub branch, is probably very small.
- Keep READMEs in the `docs/data/` folder. My issue with this is that it's possible .zip files will be passed around, and it's easier to keep the README.md alongside the .txt files. It's also more intuitive to peek around the data folder and read the README alongside the data, instead of peeking around the docs and the data folder at the same time. With all that said, it makes sense to document some very important things in the documentation separately from the READMEs.