malcolmgreaves/sw_eng_hygeine.md

## Software Engineering Hygiene for Data Science

All production-ready data science code should read and feel no different from any other high-quality codebase.
The principles of reliable builds, testable code, straightforward and distilled interfaces, and clearly documented
design decisions vastly improve the ease of use and maintenance of any code.
At a minimum, any machine learning or otherwise data-interacting code should have the following in its
git repository:

- A descriptive, Markdown-formatted `README.md` that explains what the code does and its purpose.
- Python docstrings for the most important classes, modules, and public functions.
- A simple, repeatable process for environment creation (e.g. `conda create` and `pip install -r`).
- Automated tests that cover core functionality in a meaningful way and can be executed reliably (e.g. with `pytest`).
- The `master` branch in a clean, working state at all times. Development, including exploring breaking changes, should occur on feature branches.
- A clean, clear, and descriptive commit history on `master`. Intermediate commits should never be merged into the main code branch.
- Continuous integration (CI) set up. The GitLab CI job should fail if any code fails to build, any dependency fails to download, or any test fails.
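
As an illustration of the docstring and testing points above, here is a minimal sketch of what a documented, tested function might look like. The function `split_indices` and its parameters are hypothetical and exist only for this example. The tests are plain functions named `test_*` that use bare `assert` statements, which is the form `pytest` collects by default.

```python
import random
from typing import List, Tuple


def split_indices(
    n_rows: int, test_fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split row indices into disjoint train and test sets.

    Args:
        n_rows: Number of rows in the dataset.
        test_fraction: Share of rows to hold out, strictly between 0 and 1.
        seed: Seed for the shuffle, so that the split is reproducible.

    Returns:
        A `(train, test)` pair of index lists that together cover
        `range(n_rows)` exactly once.

    Raises:
        ValueError: If `test_fraction` is not strictly between 0 and 1.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def test_split_is_disjoint_and_complete():
    train, test = split_indices(10, 0.2)
    assert len(test) == 2
    assert sorted(train + test) == list(range(10))


def test_split_is_reproducible():
    assert split_indices(100, 0.3, seed=7) == split_indices(100, 0.3, seed=7)
```

Because the tests depend only on the function's documented contract, a CI job can run them on every push without any extra setup beyond the environment itself.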

Additionally, production-quality machine learning code should strive for:

- Automated code formatting using `black` and git hooks.
- Active use of `coverage` to inspect test coverage and keep the coverage percentage as high as possible.
- 100% code coverage via automated unit tests.
- Documentation on every function, class, and module.
- Automated documentation building using `sphinx` or another community-supported documentation standard.
- Prose-style documentation further describing the project's function and how its components interact with one another. Clear, direct technical writing that documents the project's intent, the data situation, the business impact and importance, and the design of the code architecture is incredibly useful for onboarding new scientists and engineers.
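
One way to keep the "documentation on every function, class, and module" goal honest is to check it automatically. The sketch below is an assumption about how such a check could be written using only the standard library; the helper name `missing_docstrings` is hypothetical and not part of any tool mentioned above. It reports the public functions and classes defined in a module that have no docstring.

```python
import inspect
import types
from typing import List


def missing_docstrings(module: types.ModuleType) -> List[str]:
    """Return sorted names of public objects in `module` lacking a docstring.

    The module itself is reported as "<module>" when it has no docstring.
    Only functions and classes defined in `module` are checked; names that
    were imported from elsewhere are skipped.
    """
    missing = []
    if not (module.__doc__ or "").strip():
        missing.append("<module>")
    for name, obj in vars(module).items():
        if name.startswith("_"):
            continue  # private names are outside the scope of this check
        if not (inspect.isfunction(obj) or inspect.isclass(obj)):
            continue
        if getattr(obj, "__module__", None) != module.__name__:
            continue  # defined in another module, merely imported here
        if not (obj.__doc__ or "").strip():
            missing.append(name)
    return sorted(missing)
```

A test could then assert that `missing_docstrings(some_module)` returns an empty list for each module in the project, so that the CI job fails when an undocumented public function is added.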