All production-ready data science code should read and feel no different from any other high-quality codebase. The principles of reliable builds, testable code, straightforward and distilled interfaces, and clearly documented design decisions vastly improve the ease of use and maintainability of any code.
At a minimum, any machine learning or otherwise data-interacting code should have the following in its `git` repository:
- A descriptive, Markdown-formatted `README.md` that explains what the code does and its purpose.
- Python docstrings for the most important classes, modules, and public functions.
- A simple, repeatable process for environment creation (e.g. `conda create` and `pip install -r`).
- Meaningful automated tests on core functionality that can be reliably executed (e.g. with `pytest`).
- The `master` branch in a clean, working state at all times. Feature branches are where development should occur, including exploring breaking changes.
- The commit history on `master` should be clean, clear, and descriptive. Intermediate commits should never be merged into the main code branch.
- Continuous integration (CI) set up. The GitLab CI job should fail if any code fails to build, any dependency fails to download, or any test fails.
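
As a rough illustration of the docstring and testing items above, the sketch below pairs a documented public function with tests written in the `pytest` style (plain `test_*` functions using bare `assert`). The function `min_max_scale` and its behaviour are hypothetical, chosen only to keep the example small; they are not part of any particular project.

```python
"""Hypothetical module: a small feature-scaling helper and its tests."""

from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scale values linearly into the range [0, 1].

    Args:
        values: Non-empty sequence of numbers.

    Returns:
        A list the same length as ``values``. Constant input maps to zeros.

    Raises:
        ValueError: If ``values`` is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Avoid division by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# pytest collects functions named test_*; each one checks a single behaviour.
def test_min_max_scale_maps_endpoints():
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_min_max_scale_constant_input():
    assert min_max_scale([3.0, 3.0]) == [0.0, 0.0]


def test_min_max_scale_rejects_empty():
    try:
        min_max_scale([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")
```

In a real repository the tests would normally live in a separate file (for example under a `tests/` directory, an assumed layout) so that the CI job can run them with a single `pytest` invocation.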
Additionally, production-quality machine learning code should strive for:
- Automated code formatting using `black` and `git` hooks.
- Active use of `coverage` to inspect test code coverage and keep the coverage percentage as high as possible.
- 100% code coverage via automated unit tests.
- Documentation on every function, class, and module.
- Automated documentation building using `sphinx` or another community-supported documentation standard.
- Prose-style documentation further describing the project's function and how different components interact with one another. Clear, direct technical writing that documents the project's intent, the data situation, the business impact and importance, and the design of the code architecture is incredibly useful for onboarding new scientists and engineers.
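
The "documentation on every function, class, and module" goal can be made checkable rather than aspirational. The following is a minimal sketch, using only the standard-library `ast` module, of a check that lists anything lacking a docstring. The helper name `find_undocumented` and the `EXAMPLE` source are assumptions made for illustration; established linters offer more complete versions of the same idea.

```python
import ast
from typing import List


def find_undocumented(source: str, module_name: str = "<module>") -> List[str]:
    """Return names of the module, classes, and functions lacking a docstring."""
    tree = ast.parse(source)
    missing = []
    # The module itself counts as a documentable unit.
    if ast.get_docstring(tree) is None:
        missing.append(module_name)
    # Walk every node so nested functions and methods are checked too.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing


# Hypothetical source file contents used to exercise the check.
EXAMPLE = '''
"""Example module."""

def documented():
    """Has a docstring."""

def undocumented():
    return 1

class Model:
    def fit(self):
        """Fit the model."""
'''

if __name__ == "__main__":
    for name in find_undocumented(EXAMPLE):
        print(f"missing docstring: {name}")
```

A check like this could be run from a `git` hook or a CI job so that undocumented code is flagged before it reaches `master`.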