codez0mb1e/ds_project_template.md

## ds_project_template.md

      
Template for Data Science Project

Main Principles


Reproducibility:

code files: under version control, code review
data: data pipeline or snapshots
environment: venv/conda/docker
models: training pipeline or pickled models, saved hyper-parameters and metrics
experiment: tracking, report
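
As a minimal sketch of the "pickled models, saved hyper-parameters and metrics" item above, the snippet below stores a model next to two JSON files describing how it was produced. The function names `save_run` and `load_run` and the file names `hyper_params.json` and `metrics.json` are assumptions made for illustration; they are not prescribed by this template.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_run(model, hyper_params: dict, metrics: dict, out_dir: Path) -> None:
    """Persist a model together with the settings and scores that produced it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    # Plain JSON keeps hyper-parameters and metrics diffable and easy to review.
    (out_dir / "hyper_params.json").write_text(
        json.dumps(hyper_params, indent=2, sort_keys=True)
    )
    (out_dir / "metrics.json").write_text(
        json.dumps(metrics, indent=2, sort_keys=True)
    )


def load_run(out_dir: Path):
    """Return (model, hyper_params, metrics) as written by save_run."""
    with open(out_dir / "model.pkl", "rb") as f:
        model = pickle.load(f)
    hyper_params = json.loads((out_dir / "hyper_params.json").read_text())
    metrics = json.loads((out_dir / "metrics.json").read_text())
    return model, hyper_params, metrics


# Usage with a stand-in "model" (a plain dict of coefficients, purely illustrative).
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "experiments" / "v1"
    save_run({"coef": [0.5, -1.2]}, {"learning_rate": 0.1}, {"rmse": 0.42}, run_dir)
    model, params, metrics = load_run(run_dir)
    print(params, metrics)
```

Pickle files should only be loaded from trusted sources, and they may not load across library versions, which is one reason to pin the environment as well.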


Maintainability:

code: modularity, code review, documentation, logging
data: data quality checks, format docs, metadata, data versioning
environment: venv/conda, requirements.txt
models: hyper-parameters as configuration, model versioning
experiment: code/data/env/models comparison using its artifacts, changelog
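
The "data quality checks" item above could start as small as the following sketch. It assumes the data is available as a list of row dictionaries; the function name `check_quality`, the `max_null_fraction` threshold, and the column names are hypothetical choices for illustration.

```python
from typing import Any


def check_quality(
    rows: list[dict[str, Any]],
    required: list[str],
    max_null_fraction: float = 0.1,
) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    for column in required:
        # A value counts as missing when the key is absent or the value is None.
        missing = sum(1 for row in rows if row.get(column) is None)
        fraction = missing / len(rows)
        if fraction > max_null_fraction:
            problems.append(
                f"{column}: {fraction:.0%} missing (limit {max_null_fraction:.0%})"
            )
    return problems


rows = [
    {"age": 31, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 45, "income": 61000},
    {"age": 28, "income": 39000},
]
# "age" is missing in 1 of 4 rows, which exceeds the 10% default limit.
print(check_quality(rows, required=["age", "income"]))
```

Returning a list of problems rather than raising on the first one lets a pipeline stage log every issue before deciding whether to stop.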


Security and Privacy:

Data must never be stored or moved outside the DMZ.


Repository Structure

Directories:
|-- src/
|   |-- core/                                       <- Core functions and utils
|       |-- abstracts.(py|R)
|       |-- configuration.(py|R)
|       |-- experiment.(py|R)
|       |-- logging.(py|R)
|       |-- ...
|       |-- utils.(py|R)
|   |-- training/
|       |-- model.(py|R)                            <- Model definition
|       |-- preprocessing.(py|R)                    <- Preprocessing functions  
|       |-- ...
|       |-- utils.(py|R)
|   |-- __init__.py
|   |-- 1_load_data.(py|R)                          <- Data loading pipeline
|   |-- 2_preprocessing.(py|R)                      <- Data preprocessing pipeline
|   |-- 2.1_hypothesis_1.ipynb                      <- Hypothesis testing and data exploration notebook
|   |-- 2.2_hypothesis_2.ipynb
|   |-- 3_feature_engineering.(py|R)                <- Feature engineering pipeline
|   |-- 4_model_training.(py|R)                     <- Model training pipeline, e.g. hyper-parameter optimization
|   |-- 5_model_evaluation.(py|R)                   <- Model evaluation pipeline
|   |-- ...
|   |-- config.yml
|   |-- config-(dev|release).yml
|   |-- secrets.yml
|   |-- secrets-(dev|release).yml
|-- data/                                           <- Data directory (not under version control, in S3)
|   |-- {data_version}/                             <- Raw data
|-- experiments/                                    <- Experiment artifacts, outputs and temp files
|   |-- {experiment_version}/
|       |-- cache/                                  <- Cache for different experiment stages
|       |-- output/                                 <- Validation dataset, test dataset, hyper-opt artifacts, plots
|       |-- models/ or model.pkl                    <- Final model (or model ensemble)
|       |-- report.md                               <- Manual report
|       |-- changelog                               <- Automated report
|-- logs/
|   |-- {experiment_name}_{stage_name}_{timestamp}.log
|-- tests/
|   |-- unit/
|   |-- integration/
|   |-- e2e/
|-- docs/
|-- labs/                                           <- Jupyter notebooks and other experiments
|-- requirements.txt
|-- requirements-dev.txt
|-- Dockerfile
|-- Dockerfile.release
|-- .dockerignore
|-- .gitignore
|-- .github/workflows/
|   |-- build.yml
|   |-- release.yml
|-- run.(sh|ps1)
|-- README.md
|-- LICENSE
|-- CHANGELOG
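
The `experiments/` and `logs/` parts of the layout above can be created programmatically. The following is a rough sketch under stated assumptions: the helper names `init_experiment` and `log_file` are hypothetical, and the timestamp format is a guess, since the template only specifies a `{timestamp}` placeholder.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def init_experiment(root: Path, experiment_version: str) -> dict[str, Path]:
    """Create experiments/{experiment_version}/ with its cache, output and models dirs."""
    exp_dir = root / "experiments" / experiment_version
    paths = {name: exp_dir / name for name in ("cache", "output", "models")}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    # report.md is written by hand later, so only its location is returned here.
    paths["report"] = exp_dir / "report.md"
    return paths


def log_file(root: Path, experiment_name: str, stage_name: str, when: datetime) -> Path:
    """Build logs/{experiment_name}_{stage_name}_{timestamp}.log."""
    # Assumed timestamp format: compact, sortable, and free of ':' characters.
    stamp = when.strftime("%Y%m%dT%H%M%S")
    return root / "logs" / f"{experiment_name}_{stage_name}_{stamp}.log"


with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp)
    paths = init_experiment(project_root, "v1")
    print(sorted(p.name for p in (project_root / "experiments" / "v1").iterdir()))
    print(log_file(project_root, "churn", "training", datetime(2024, 1, 2, 3, 4, 5)).name)
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the helper, keeps the file name deterministic in tests.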