@codez0mb1e
Last active September 11, 2024 09:52
Data Science Project: Template

Template for a Data Science Project

Main Principles

  • Reproducibility:
    • code files: under version control, code review
    • data: data pipeline or snapshots
    • environment: venv/conda/docker
    • models: training pipeline or pickled models, saved hyper-parameters and metrics
    • experiment: tracking, report
  • Maintainability:
    • code: modularity, code review, documentation, logging
    • data: data quality checks, format docs, metadata, data versioning
    • environment: venv/conda, requirements.txt
    • models: hyper-parameters as configuration, model versioning
    • experiment: code/data/env/models comparison using its artifacts, changelog
  • Security and Privacy:
    • No data outside the DMZ.
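
The "pickled models, saved hyper-parameters and metrics" principle above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the function names `save_artifacts` and `load_artifacts` and the file names `params.json` and `metrics.json` are assumptions made for this example, not part of the template.

```python
import json
import pickle
from pathlib import Path
from typing import Any


def save_artifacts(model: Any, params: dict, metrics: dict, out_dir: Path) -> Path:
    """Store a model next to the hyper-parameters and metrics that produced it."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Binary snapshot of the trained model (hypothetical file name from the template: model.pkl).
    with open(out_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)

    # sort_keys keeps the files stable, so two experiments diff cleanly.
    (out_dir / "params.json").write_text(json.dumps(params, indent=2, sort_keys=True))
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return out_dir


def load_artifacts(out_dir: Path) -> tuple:
    """Read back the model, hyper-parameters and metrics saved by save_artifacts."""
    with open(out_dir / "model.pkl", "rb") as f:
        model = pickle.load(f)
    params = json.loads((out_dir / "params.json").read_text())
    metrics = json.loads((out_dir / "metrics.json").read_text())
    return model, params, metrics
```

Keeping the three files in one directory per experiment makes it possible to compare experiments by their artifacts alone. Note that `pickle.load` should only be used on files from a trusted source.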

Repository Structure

Directories:

|-- src/
|   |-- core/                                       <- Core functions and utils
|       |-- abstracts.(py|R)
|       |-- configuration.(py|R)
|       |-- experiment.(py|R)
|       |-- logging.(py|R)
|       |-- ...
|       |-- utils.(py|R)
|   |-- training/
|       |-- model.(py|R)                            <- Model definition
|       |-- preprocessing.(py|R)                    <- Preprocessing functions  
|       |-- ...
|       |-- utils.(py|R)
|   |-- __init__.py
|   |-- 1_load_data.(py|R)                          <- Data loading pipeline
|   |-- 2_preprocessing.(py|R)                      <- Data preprocessing pipeline
|   |-- 2.1_hypothesis_1.ipynb                      <- Hypothesis testing and data exploration notebook
|   |-- 2.2_hypothesis_2.ipynb
|   |-- 3_feature_engineering.(py|R)                <- Feature engineering pipeline
|   |-- 4_model_training.(py|R)                     <- Model training pipeline, e.g. hyper-params optimization
|   |-- 5_model_evaluation.(py|R)                   <- Model evaluation pipeline
|   |-- ...
|   |-- config.yml
|   |-- config-(dev|release).yml
|   |-- secrets.yml
|   |-- secrets-(dev|release).yml
|-- data/                                           <- Data directory (not under version control, in S3)
|   |-- {data_version}/                             <- Raw data
|-- experiments/                                    <- Experiments artifacts, outputs and temp files
|   |-- {experiment_version}/
|       |-- cache/                                  <- Cache for different experiment stages
|       |-- output/                                 <- Validation dataset, test dataset, hyper-opt artifacts, plots
|       |-- models/ or model.pkl                    <- Final model (or models ensemble)
|       |-- report.md                               <- Manual report
|       |-- changelog                               <- Automated report
|-- logs/
|   |-- {experiment_name}_{stage_name}_{timestamp}.log
|-- tests/
|   |-- unit/
|   |-- integration/
|   |-- e2e/
|-- docs/
|-- labs/                                           <- Jupyter notebooks and other experiments
|-- requirements.txt
|-- requirements-dev.txt
|-- Dockerfile
|-- Dockerfile.release
|-- .dockerignore
|-- .gitignore
|-- .github/workflows/
|   |-- build.yml
|   |-- release.yml
|-- run.(sh|ps1)
|-- README.md
|-- LICENSE
|-- CHANGELOG
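
As a rough illustration of the layout above, the sketch below creates the `experiments/{experiment_version}/` sub-directories and builds a log path of the form `{experiment_name}_{stage_name}_{timestamp}.log`. The helper names `init_experiment_dirs` and `log_file_path` are hypothetical, and the timestamp format is an assumption, since the template only names the placeholder.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def init_experiment_dirs(root: Path, experiment_version: str) -> Path:
    """Create experiments/{experiment_version}/ with its sub-directories, plus logs/."""
    exp_dir = root / "experiments" / experiment_version
    for sub in ("cache", "output", "models"):
        (exp_dir / sub).mkdir(parents=True, exist_ok=True)
    (root / "logs").mkdir(parents=True, exist_ok=True)
    return exp_dir


def log_file_path(
    root: Path,
    experiment_name: str,
    stage_name: str,
    now: Optional[datetime] = None,
) -> Path:
    """Build logs/{experiment_name}_{stage_name}_{timestamp}.log."""
    # Assumed timestamp format (YYYYMMDD-HHMMSS); sortable and safe in file names.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return root / "logs" / f"{experiment_name}_{stage_name}_{stamp}.log"
```

A pipeline stage could call `init_experiment_dirs(Path("."), "v1")` once at start-up and pass the result of `log_file_path(...)` to `logging.basicConfig(filename=...)`.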