@malcolmgreaves
Created August 17, 2019 22:58
Software Engineering Hygiene for Data Science

All production-ready data science code should read and feel no different from any other high-quality codebase. Reliable builds, testable code, simple and distilled interfaces, and clearly documented design decisions make any code far easier to use and maintain.

At a minimum, any machine learning or otherwise data-interacting code should have the following in its git repository:

  • A descriptive, Markdown-formatted README.md that explains what the code does and its purpose.
  • Python docstrings for the most important classes, modules, and public functions.
  • A simple, repeatable process for environment creation (e.g. conda create and pip install -r).
  • Meaningful automated tests on core functionality that can be reliably executed (e.g. with pytest).
  • The master branch in a clean, working state at all times. Feature branches are where development should occur, including exploring breaking changes.
  • The commit history on master should be clean, clear, and descriptive. Intermediate commits should never be merged into the main code branch.
  • Continuous integration (CI) set up for the repository. The GitLab CI job should fail if any code fails to build, any dependency fails to download, or any test fails to pass.
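
As a rough illustration of the docstring and testing items above, here is a minimal sketch of one documented public function with a few tests next to it. The function min_max_scale and its behavior are hypothetical, chosen only for the example. The tests use plain assert statements in functions named test_*, which is the convention pytest collects by default.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0.

    Args:
        values: A non-empty sequence of numbers.

    Returns:
        A new list of floats in [0.0, 1.0], in the same order as the input.
        If every value is identical, all outputs are 0.0.

    Raises:
        ValueError: If values is empty.
    """
    if len(values) == 0:
        raise ValueError("values must be non-empty")
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid dividing by zero when the input has no spread.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# pytest collects functions named test_* and reports any failed assert.
def test_scales_to_unit_interval() -> None:
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_constant_input_maps_to_zero() -> None:
    assert min_max_scale([3.0, 3.0]) == [0.0, 0.0]


def test_empty_input_is_rejected() -> None:
    try:
        min_max_scale([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

In a real repository the tests would more likely live in their own module under a tests directory, and the CI job would fail whenever any of them fails.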

Additionally, production-quality machine learning code should strive for:

  • Automated code formatting using black and git hooks.
  • Active use of coverage to inspect test coverage and keep the coverage percentage as high as possible.
  • 100% code coverage via automated unit tests.
  • Documentation on every function, class, and module.
  • Automated documentation building using sphinx or another community-supported documentation standard.
  • Prose-style documentation further describing the project's function and how different components interact with one another. Clear, direct technical writing that documents the project's intent, the data situation, the business impact and importance, and the design of the code architecture is incredibly useful for onboarding new scientists and engineers.
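
For the automated documentation item above, a Sphinx project is configured through a conf.py file. The following is a minimal sketch, not a recommended setup. It assumes a hypothetical package named example_project, a docs directory sitting one level below the repository root, and Google- or NumPy-style docstrings.

```python
# docs/conf.py: a minimal Sphinx configuration sketch.
import os
import sys

# Assumption: conf.py lives in docs/, so the package sits one directory up.
sys.path.insert(0, os.path.abspath(".."))

# Hypothetical project name, used only for this example.
project = "example_project"

extensions = [
    "sphinx.ext.autodoc",   # build API pages from docstrings
    "sphinx.ext.napoleon",  # parse Google- and NumPy-style docstrings
]

# Include every member by default so undocumented ones are easy to spot.
autodoc_default_options = {
    "members": True,
    "undoc-members": True,
}

html_theme = "alabaster"
```

With a file like this in place, the documentation build can run as one more CI step, so docstrings that fail to import or parse show up early.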