
UPDATE: I have baked the ideas in this file inside a Python CLI tool called pyds-cli. Please find it here: https://github.com/ericmjl/pyds-cli

How to organize your Python data science project

Having done a number of data projects over the years, and having seen a number of them up on GitHub, I've come to see that there's a wide range in terms of how "readable" a project is. I'd like to share some practices that I have come to adopt in my projects, which I hope will bring some organization to your projects.

Disclaimer: I'm hoping nobody takes this to be "the definitive guide" to organizing a data project; rather, I hope you, the reader, find useful tips that you can adapt to your own projects.

Disclaimer 2: What I’m writing below is primarily geared towards Python language users. Some ideas may be transferable to other languages; others may not be so. Please feel free to remix whatever you see here!

Disclaimer 3: I found the Cookiecutter Data Science page after finishing this blog post. Many ideas overlap here, though some directories are irrelevant in my work -- which is totally fine, as their Cookiecutter DS Project structure is intended to be flexible! Consistency is the thing that matters the most.

README.md

Let’s start with the most front-facing file in your repository: the README file. It should contain information that helps your forgetful future self, newcomers, and collaborators figure out why this project exists, how things are organized, what conventions are used in the project, and where to go to find more information.

Note here that the why portion is the most important. It gives the necessary context to the reader of your README file. Think of it as documentation you leave behind so that you don’t have to sit down and explain the high-level overview of the project over and over.

Directory Structure

Here is the tl;dr overview: everything gets its own place, and everything related to the project lives under a single parent directory.

$ pwd
/path/to/project/directory/

$ ls
|- notebooks/
   |- 01-first-logical-notebook.ipynb
   |- 02-second-logical-notebook.ipynb
   |- prototype-notebook.ipynb
   |- archive/
      |- no-longer-useful.ipynb
   |- figures/
|- projectname/
   |- projectname/
      |- __init__.py
      |- config.py
      |- custom_funcs.py
      |- test_config.py
      |- test_custom_funcs.py
   |- setup.py
|- README.md
|- data/
   |- raw/
   |- processed/
   |- cleaned/
   |- README.md
|- scripts/
   |- script1.py
   |- script2.py
   |- archive/
      |- no-longer-useful.py
|- environment.yml

Let's go through each section in order.

notebooks/

/path/to/project/directory/
|- notebooks/
   |- 01-first-logical-notebook.ipynb
   |- 02-second-logical-notebook.ipynb
   |- prototype-notebook.ipynb
   |- archive/
      |- no-longer-useful.ipynb
   |- figures/

Quite self-explanatory. We put our notebooks in this directory. As the project develops, a narrative begins to emerge, and we can start structuring our notebooks into "logical chunks" ({number}-{something-logical}-notebook.ipynb). They should also be ordered, which explains the numbering in the file names. We may use some notebooks purely for prototyping (prototype-notebook.ipynb). We may also find that some analyses are no longer useful (archive/no-longer-useful.ipynb). Finally, we have a figures/ directory, which can optionally be organized further, in which we place figures relevant to the project.

projectname/

/path/to/project/directory/
|- projectname/
   |- projectname/
      |- __init__.py
      |- config.py
      |- custom_funcs.py
      |- test_config.py
      |- test_custom_funcs.py
   |- setup.py

If this looks intimidating, unnecessarily complicated, or something along those lines, humour me for a moment. Lessons learned from multiple months of working with other people led me to this somewhat complicated, but hopefully ultimately useful, directory structure.

Under this folder called projectname/, we put a lightweight Python package, also called projectname, that holds everything refactored out of notebooks to keep them clean. It has an __init__.py underneath it so that we can import functions and variables into our notebooks and scripts:

from projectname import something

config.py

In projectname/projectname/config.py, we place special paths and variables that are used across the project. An example might be:

# config.py

from pathlib import Path  # pathlib is seriously awesome!

data_dir = Path('/path/to/some/logical/parent/dir')
data_path = data_dir / 'my_file.csv'  # use feather files if possible!!!

customer_db_url = 'sql:///customer/db/url'
purchases_db_url = 'sql:///purchases/db/url'

Then, in our notebooks, we can easily import these variables and not worry about custom strings littering our code.

# notebook.ipynb
from projectname.config import data_path
import pandas as pd

df = pd.read_csv(data_path)  # clean!

By using these config.py files, we get clean code in exchange for an investment of time naming variables logically.

custom_funcs.py

In projectname/projectname/custom_funcs.py, we can put custom code that gets used across more than one notebook. One example would be downstream data preprocessing that is only necessary for a subset of notebooks.

# custom_funcs.py

def custom_preprocessor(df):  # give the function a more informative name!!!
    """
    Processes the dataframe such that {insert intent here}. (Write better docstrings than this!!!!)

    Intended to be used under this particular circumstance, with {that other function} called before it, and potentially {yet another function} called after it, but optional.

    :param pd.DataFrame df: A pandas dataframe. Should contain the following columns:
        - col1
        - col2
    :returns: A modified dataframe.
    """
    return (df.groupby('col1').count()['col2'])

Now, in our notebooks, we can do:

# notebook.ipynb

import pandas as pd
from projectname.config import data_path
from projectname.custom_funcs import custom_preprocessor

df = pd.read_csv(data_path)
processed = custom_preprocessor(df)

test_{stuff}.py

Finally, you may have noticed that there is a test_config.py and test_custom_funcs.py file. Those two modules, which I'll call "test modules", house tests for their respective Python modules (the config.py and custom_funcs.py files).

Yes, I'm a big believer that data scientists should be writing tests for their code. Now, these tests don't have to be software-engineer-esque, production-ready tests. The bare minimum is just a single example that shows exactly what you're trying to accomplish with the function. If you accidentally break the function, the test will catch it for you. That's all a test is, and the single example is all that the "bare minimum test" has to cover.
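
Here is a minimal sketch of what such a bare-minimum test could look like for the custom_preprocessor example above, runnable with pytest (the example dataframe is made up purely for illustration):

# test_custom_funcs.py

import pandas as pd

from projectname.custom_funcs import custom_preprocessor


def test_custom_preprocessor():
    # A single hand-written example that pins down the intended behaviour.
    df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1, 2, 3]})
    result = custom_preprocessor(df)
    assert result.loc['a'] == 2  # two rows in group 'a'
    assert result.loc['b'] == 1  # one row in group 'b'

Running pytest from within the package directory will pick up any test_*.py modules automatically.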

setup.py

The final part of this is to create a setup.py file for the custom Python package (called projectname). Here is a simple boilerplate for how it has to look:

from setuptools import setup, find_packages

setup(
    name="projectname",
    version="0.1",
    packages=find_packages(),  # picks up the inner projectname/ package
)

Because this is a package that is intended to stay local and not be uploaded to PyPI, we only need to specify its name and version. Everything else, including its description, long description, author name, email address and more, is optional; you can include them, but they aren't mandatory. To make the package importable from your notebooks and scripts, install it into your project's environment in editable mode (pip install -e ., run from the outer projectname/ directory that contains setup.py); this comes up again in the comments below.

data/

/path/to/project/directory/
|- data/
   |- raw/
   |- processed/
   |- cleaned/
   |- README.md

Under data/, we keep separate directories for the raw/ data, intermediate processed/ data, and final cleaned/ data. (These names, by the way, are completely arbitrary; you can name them some other way if you like, as long as they convey the same ideas.)

You'll note that there is also a README.md associated with this directory. This is intentional: it should contain the following details:

  1. Where the data come from,
  2. What scripts under the scripts/ directory transformed which files under raw/ into which files under processed/ and cleaned/, and
  3. Why each file under cleaned/ exists, with optional references to particular notebooks. (Optional, especially when things are still in flux.)

Here, I'm suggesting placing the data under the same project directory, but only under certain conditions: firstly, when you're the only person working on the project, so there's only one authoritative source of data; and secondly, when your data can fit on disk.

If you're working with other people, you will want to make sure that all of you agree on what the "authoritative" data source is. If it is a URL (e.g. to an s3 bucket, or to a database), then that URL should be stored and documented in the custom Python package, with a concise variable name attached to it. If it is a path on an HPC cluster and it fits on disk, there should be a script that downloads it so that you have a local version.
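
As a sketch of what such a download script might look like (the remote URL here is hypothetical; swap in whatever your team agrees is the authoritative source):

# scripts/download_data.py

import urllib.request

from projectname.config import data_dir

REMOTE_URL = "https://example.com/path/to/my_file.csv"  # hypothetical authoritative source


def main():
    # Make sure the local data directory exists, then fetch a local copy.
    data_dir.mkdir(parents=True, exist_ok=True)
    local_path = data_dir / "my_file.csv"
    urllib.request.urlretrieve(REMOTE_URL, str(local_path))
    print(f"Downloaded {REMOTE_URL} -> {local_path}")


if __name__ == "__main__":
    main()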

scripts/

/path/to/project/directory/
|- scripts/
   |- script1.py
   |- script2.py
   |- archive/
	  |- no-longer-useful.py

Like the notebooks/ section, I think this is quite self-explanatory. Scripts are logical units of computation that aren't part of the notebook narrative, but are nonetheless important for, say, getting the data into shape, or stitching together figures generated by individual notebooks.
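
For example, a figure-stitching script might look something like this (a rough sketch; the figure filenames are hypothetical, and you would adapt the layout to your project):

# scripts/combine_figures.py

import matplotlib.pyplot as plt

# Hypothetical per-notebook outputs living under notebooks/figures/.
figure_paths = [
    "notebooks/figures/01-overview.png",
    "notebooks/figures/02-results.png",
]

# Lay the individual figures out side by side and save one combined image.
fig, axes = plt.subplots(1, len(figure_paths), figsize=(6 * len(figure_paths), 5))
for ax, path in zip(axes, figure_paths):
    ax.imshow(plt.imread(path))
    ax.axis("off")

fig.savefig("notebooks/figures/combined.png", dpi=150, bbox_inches="tight")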

But wait, it's complicated, no?

I proposed this project structure to colleagues, and was met with some degree of ambivalence.

Why not just put everything in notebooks?

After all, aren't notebooks supposed to be comprehensive, reproducible units? Yes, but that doesn't mean every last detail has to be embedded inside them. Notebooks are great for a data project's narrative, but if they get cluttered up with chunks of code that are copied and pasted from cell to cell, then not only do we have an unreadable notebook, we also legitimately have a coding-practices problem on our hands. This is where the practice of refactoring code comes in really handy.

Why not just custom.py under notebooks/?

Now, one may ask, "If we can import a custom.py from the same directory as the other notebooks, then why bother with the setup.py overhead?" My responses are as follows.

If the project truly is small in scale, and you're working on it alone, then yes, don't bother with the setup.py. It's too much overhead to worry about.

However, if the project grows big, and multiple people are working on the same project code base (e.g. a "data engineer" + a "data scientist"), then creating the setup.py has a few advantages.

Firstly, by creating a custom Python package for project-wide variables, functions, and classes, they become available not only to notebooks, but also to, say, custom data engineering or report-generation scripts that may need to be run from time to time. This is especially relevant if the package is installed into the project's data science environment (say, using conda environments), and I would consider this the biggest advantage of creating a custom Python package for the project.

Secondly, we gain a single reference point for custom code. Mentally, if anything, a single reference point for code makes things easier to manage. We can also perform proper code review on the functions without having to dig through the unreadable JSON blobs that Jupyter notebooks are under the hood. (Thankfully, we also have nbdime to help us with this!)

Conclusions

I have to admit that I went back and forth many, many times over the course of a few months before I finally settled on this project structure. It's taken repeated experimentation on new projects and modification of existing ones to reach this point. My hope is that this organizational structure provides some inspiration for your projects.

Perhaps you disagree with me and think this structure isn't the best. I'd love to hear your rationale for a different structure; there may well be inspiration that I could borrow!

@eli-s-goldberg commented Jun 12, 2018

Nice work, the structure is nice and generic. Concerning preprocessing, and just as an added note, I tend to use the transformer (fit, transform, fit_transform) style when I code preprocessors. This way they stay generic, conform to a style I'm comfortable working with, and can be pipelined.

from sklearn.base import TransformerMixin

class NewTransform(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[1] for _ in X]
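
For instance, a transformer written this way can be dropped straight into a scikit-learn Pipeline. A quick sketch (LogisticRegression is just a placeholder estimator):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("new_transform", NewTransform()),
    ("clf", LogisticRegression()),
])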

Also, cookie-cutter is great, but often overkill - especially if you don't plan to host your module.

@nblauch commented Jan 8, 2020

Hi Eric. This is nice and helpful for my refactoring. I think you are missing the lines: import sys; sys.path.append('..') in your notebook example. Alternatively, it would be helpful to mention that you need to run setup.py to install packagename (every time you make a change to it). Otherwise your notebooks won't see packagename (or its most recent version).

@aeid99 commented Aug 7, 2020

I really appreciate the post!
Where do you save the model pickle? Or summary reports on the findings?

@ericmjl (author) commented Aug 7, 2020

@aeid99 model pickles and summary reports are what I might consider "generated artifacts". They can go anywhere you want, though probably best separated from the "source" that generated them. I'm still waiting for a "version controlled artifact store". Maybe an Artifactory is what we need!

@rjweis commented Sep 2, 2020

I learned a lot from this post, thanks for sharing it!

@mencia commented Sep 7, 2020

Hi Eric, thanks for the post. What part of the project would you recommend having under version control: perhaps the whole thing or certain directories only?

@ericmjl (author) commented Sep 7, 2020

@mencia thanks for pinging in! I’d recommend treating the repo like software and committing only the pieces that are hand-curated. Clear all notebooks of output before committing, and work hard to engineer notebooks so that they run quickly. These are things that will save you headaches in the long run!

@mencia commented Sep 24, 2020

What about the results folder?

@ericmjl (author) commented Sep 25, 2020

Results usually are not the hand-curated pieces, but the result of computation. They shouldn't be version-controlled, but can be cached/dumped. This one is definitely tricky; if the computation that produces a result is expensive, the result should maybe be stored in a place that is easily accessible to stakeholders. A lot of the decision-making will follow from where and how you have to deliver the results, I think.

@mencia commented Sep 25, 2020

Thanks for the answer @ericmjl, but I meant to ask where in your project directory would you put a results folder?

@ericmjl (author) commented Sep 25, 2020

I think that too depends on the requirements of the project. If you’re keeping hand-curated logs, a top-level, version-controlled directory is a great idea. If you’re just dumping things to be shared with a team, I’d recommend a user-agnostic location. Cloud, shared dir — all good choices; it depends on your team’s preferences.

@isachard commented:

Thanks @ericmjl, I have been looking for something lightweight to structure my DS projects.

@dimiphoton commented:

Hello, thank you for this page. Is there a simple Python modelling-and-analysis repo that is well structured (for example, just a biased coin toss)? How should a model and its parameter inference be written in Python?

@ericmjl (author) commented Sep 22, 2021

@dimiphoton I don't have an example off the top of my head, but I do know that having model source code in your custom repo lets you use it across notebooks. For parameter estimation, I would check out the package PyMC3, for which many talks, tutorials, and blog posts are available online to reference how to do it.

@dimiphoton commented:

Hello, to be more precise, I would like to know how a data scientist should write a model that may be made more complex later. Should I write a class that computes an "observable" attribute when instantiated?

@udaylunawat commented:

I have been coming back to this time and again. Extremely helpful!

@fabianjkrueger commented:

Hi! Great content! Thank you for posting! I just implemented a structure like this and combined it with some elements from Cookie Cutter.

Only thing I did not really understand is the purpose of the environment.yml file. What is it used for? What does it do?

@dcarver1 commented:

@fabianjkrueger It seems like the environment.yml is a file for storing the specification of a conda environment.

@olsgaard commented Mar 7, 2024

Hi Eric,

Thank you for posting this.

  • How often should I run setup.py? Once? Every time I make a change to the projectname/projectname folder?

I've avoided going the setup.py route for my projects, as shared functions are constantly changing and being added. However, module imports put a big restriction on folder structure.

@ericmjl (author) commented Mar 7, 2024

@olsgaard thank you for your question! If you do pip install -e . in your environment, any time you make a change, the edits will be reflected in your library! 😄

FYI, this guide is getting a bit dated; we should be using pyproject.toml instead of setup.py if we want to adhere to modern Python conventions. I have been updating my guide here: https://ericmjl.github.io/data-science-bootstrap-notes/get-bootstrapped-on-your-data-science-projects/
