misho-kr/CI-CD for Machine Learning.md

## CI-CD for Machine Learning.md

      
CI/CD for Machine Learning

Streamline your machine learning development processes, enhancing efficiency, reliability, and reproducibility in your projects. Develop a comprehensive understanding of CI/CD workflows and YAML syntax, utilizing GitHub Actions (GA) for automation, training models in a pipeline, versioning datasets with DVC, and performing hyperparameter tuning.

Introduction

Essential principles of Continuous Integration/Continuous Delivery (CI/CD) and YAML. Software development life cycle and key terms like build, test, and deploy. Continuous Integration, Continuous Delivery, and Continuous Deployment. Significance of CI/CD in machine learning and experimentation.

SDLC Overview

Systematic approach covering software development from start to finish
Workflow - build, test, deploy


Continuous Integration (CI): The practice of frequently building, testing, and merging code changes into a shared repository
Continuous Delivery (CD): Ensures that code changes can be deployed to production at any time, but the release itself requires manual approval
Continuous Deployment (CD): Automatically deploys code changes to production without manual intervention
CI/CD in Machine Learning enables

Data versioning
Building models and model versioning
Automating experiments
Testing
Deployment


Introduction to YAML

A data formatting language similar to JSON and XML
Indentation is meaningful; tabs are not allowed for indentation, only spaces
Mappings, sequences, and scalars are building blocks of YAML
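
As a minimal sketch of the three building blocks, the snippet below combines a mapping, a sequence, and a few scalars. All keys and values (`project`, `stages`, `tags`, and so on) are made up for illustration and do not come from a real configuration.

```yaml
# Mapping: key-value pairs, nested by indentation (spaces only)
project:
  name: weather-model      # scalar (string)
  max_depth: 2             # scalar (integer)
  shuffle: true            # scalar (boolean)
  # Sequence: an ordered list, one "- " entry per item
  stages:
    - preprocess
    - train
  # Flow style: a sequence written inline, JSON-like
  tags: [ci, ml]
```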


Introduction to GitHub Actions

CI/CD platform to automate pipelines
Pipeline: a sequence of steps that represents the flow of work and data


GHA Components

Event: a specific activity in a repository that triggers a workflow run, such as a push or a pull request
Workflow: automated process that will run one or more jobs

Triggered automatically by event
Housed in .github/workflows


Job: set of steps

Each job is independent
Parallel execution is possible
Executed on compute machines called runners


Steps: individual units of work

Executed in order; each step can depend on the steps before it
Run on the same machine, so data can be shared


Action: a reusable, GHA-specific application that performs a common task

Examples: check out the repository, comment on a PR
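
To see how the components fit together, here is a minimal sketch of a workflow with three jobs. The workflow and job names are hypothetical. The `needs` key is not covered in these notes; it is the standard GitHub Actions way to make one job wait for others, and it is included here only to contrast with the two jobs that run in parallel.

```yaml
name: Components demo            # workflow

on: push                         # event that triggers the workflow

jobs:
  lint:                          # job 1, runs on its own runner
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3        # action: check out the repo
      - run: echo "linting"              # step: plain shell command

  test:                          # job 2, can run in parallel with lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: echo "testing"

  report:                        # job 3, waits for both jobs above
    needs: [lint, test]
    runs-on: ubuntu-latest
    steps:
      - run: echo "lint and test finished"
```

Because each job gets a fresh runner, `lint` and `test` each check out the repository themselves; files are shared only between steps of the same job.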


GitHub Actions

Components of GHA - events, actions, jobs, steps, runners, and context. Workflows that activate upon events like push and pull requests, and tailor runner machines. CI pipelines and intricacies of the GHA log.

Intermediate YAML

Multiline strings: Block scalar format

Literal style ( | ) preserves line breaks and indentation
Folded style ( > ) folds single line breaks into spaces


Chomping indicators control the behavior of newlines at the end of the string

clip is the default mode, single newline at the end
strip ( - )  removes all newlines at the end
keep ( + ) retains all newlines at the end
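
The styles and chomping indicators above can be sketched side by side. The key names are arbitrary, and the comment before each key states the string a YAML parser produces for it.

```yaml
# Literal, clip (default): "first line\nsecond line\n"
literal: |
  first line
  second line

# Folded, clip (default): "first line second line\n"
folded: >
  first line
  second line

# Literal with strip (-): "text"
strip: |-
  text

# Literal with keep (+): "text\n\n" (the blank line below is retained)
keep: |+
  text

last: done
```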


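Dynamic value injection - expressions let the tool that reads the file substitute values at run time (plain YAML has no expression syntax; GitHub Actions uses the form ${{ <expression> }})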

Usage - environment variables and references to other parts of YAML


Multi-document YAML
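
A short sketch of a multi-document file: each `---` starts a new document, so one file can carry several independent configurations. The keys and values here are illustrative only.

```yaml
# Document 1
---
stage: preprocess
input: raw_data.csv
# Document 2 starts at the next "---"
---
stage: train
input: processed_data.csv
...
# "..." optionally marks the end of a document
```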


Setting a basic CI pipeline

Anatomy of GitHub Actions workflow


name: CI

on:
  push:
    branches: [ "main" ]

jobs:
  build:
    runs-on: ubuntu-latest
    # steps belong to the job, so they are indented under it
    steps:
      - name: Run a multi-line script
        run: |
          echo Hello, world!
          echo Add other actions to build,
          echo test, and deploy your project.

Running repository code

Create a feature branch
Add repository code
Configure workflow event
Create PR and trigger workflow


name: CI

on:
  pull_request:
    branches: [ "main" ]

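jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the script is available on the runner
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          # Quoted so YAML does not read the version as a float
          python-version: "3.9"
      - name: Run Python script
        run: |
          echo hello_world.py
          python hello_world.py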

Environment Variables and Secrets

Contexts - access information about predefined variables and data
Contexts used in this course

github - information about the workflow run
env - variables set in the workflow
secrets - names and values that are available to workflow
job - info about the current job
runner - info about the machine
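
As a rough illustration, the step below prints a few values from the `github`, `runner`, and `job` contexts. The workflow and job names are hypothetical; the context properties used (`github.event_name`, `github.repository`, `runner.os`, `job.status`) are standard ones.

```yaml
name: Contexts demo

on: push

jobs:
  inspect:
    runs-on: ubuntu-latest
    steps:
      - name: Show context values
        run: |
          echo "Event: ${{ github.event_name }}"
          echo "Repository: ${{ github.repository }}"
          echo "Runner OS: ${{ runner.os }}"
          echo "Job status so far: ${{ job.status }}"
```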


Variables store non-sensitive information in plain text, e.g. compiler flags, usernames, file paths

Global/local scope is controlled by the level where defined
Accessed from the env context as ${{ env.ENV_VAR }}
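
The scoping rule can be sketched with one variable per level. The variable names (`GREETING`, `AUDIENCE`, `PUNCTUATION`) and the job name are invented for this example.

```yaml
name: Env scope demo

on: push

env:
  GREETING: Hello              # workflow level: visible in every job

jobs:
  greet:
    runs-on: ubuntu-latest
    env:
      AUDIENCE: team           # job level: visible in every step of this job
    steps:
      - name: Use all three levels
        env:
          PUNCTUATION: "!"     # step level: visible in this step only
        # The env context and plain shell variables both work inside run
        run: echo "${{ env.GREETING }}, $AUDIENCE$PUNCTUATION"
```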


Secrets store sensitive information in encrypted form, e.g. passwords, API keys

Setting secrets
GITHUB_TOKEN secret - built-in secret provided by GitHub Actions

Used to perform workflow actions
Automatically available in every GitHub Actions workflow
Permissions can be tuned to the right degree


name: Greeting on variable day

# Trigger and job name are assumed; the original fragment omits them
on:
  pull_request:
    branches: [ "main" ]

# Global env
env:
  Greeting: Hello

# Grant permissions to write comments in PR
permissions:
  pull-requests: write

jobs:
  greeting_job:
    runs-on: ubuntu-latest
    steps:
      # Use GITHUB_TOKEN to authorize
      - name: Comment PR
        uses: thollander/actions-comment-pull-request@v2
        with:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          message: |
            Hello world ! :wave:

Continuous Integration in Machine Learning

Integration of machine learning model training into a GitHub Actions pipeline using the Continuous Machine Learning (CML) GitHub Action. Generate a comprehensive markdown report including model metrics and plots. Data versioning in machine learning by adopting Data Version Control (DVC) to track data changes. Setting up DVC remotes and dataset transfers. DVC pipelines and the DVC YAML file to orchestrate reproducible model training.

Dataset: Weather Prediction in Australia

Data preprocessing

Convert categorical features to numerical
Replace missing values of features
Scale features


Random Forest Classifier

max_depth = 2 , n_estimators = 50


Standard metrics on test data

Performance plots
Confusion matrix plot


GitHub Actions Workflow

Continuous Machine Learning (CML)

CI/CD tool for Machine Learning
GitHub Actions Integration


# Enable setup-cml action to be used later
- uses: iterative/setup-cml@v1

- name: Train model
  run: |
    # Your ML workflow goes here
    pip install -r requirements.txt
    python3 train.py
- name: Write CML report
  run: |
    # Add results and plots to markdown
    cat results.txt >> report.md
    echo "![training graph](./graph.png)" >> report.md
    # Create comment from markdown report
    cml comment create report.md
  env:
    REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Versioning datasets with Data Version Control

Ensures a historical record of data changes
DVC: Data Version Control tool

Manages data and experiments
Similar to Git


Data stored separately

SSH, HTTP/HTTPS, Local File System
AWS, GCP, and Azure object storage


> dvc init
Initialized DVC repository.
You can now commit the changes to git.
> dvc add data.csv
> cat data.csv
> cat data.csv.dvc
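
`dvc add` writes a small YAML metadata file next to the data and leaves the data itself out of Git. Below is a sketch of what `data.csv.dvc` typically looks like. The hash and size are placeholders rather than real output, and the exact set of fields varies between DVC versions.

```yaml
outs:
- md5: 0123456789abcdef0123456789abcdef   # placeholder content hash
  size: 1024                              # placeholder size in bytes
  path: data.csv
```

Git tracks only this file; the hash is what links it to the cached copy of the data.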

Interacting with DVC remotes

Location for Data Storage
Similar to Git remotes, but for cached data


> dvc remote add myAWSremote s3://mybucket
> dvc remote modify myAWSremote connect_timeout 300
> dvc remote add -d mylocalremote /tmp/dvc
> dvc push -r myAWSremote data.csv
>
> dvc add /path/to/data/datafile
> git add /path/to/data/datafile.dvc
> git commit -m "Dataset updates"
> git push origin main
> dvc push

DVC Pipelines - sequence of stages defining ML workflow and dependencies

Versioning data alone is not enough; the steps that produce and consume the data need to be tracked too
Run only what's needed
Steps in Directed Acyclic Graph (DAG)
Defined in dvc.yaml file
Similar to the GitHub Actions workflow

Focused on ML tasks instead of CI/CD
Can be abstracted as a step in GHA
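
One way to run the DVC pipeline as a step in GHA is sketched below. It assumes the repository has a `requirements.txt` and a configured default DVC remote; the workflow and job names are hypothetical, and credentials for the remote (for example cloud storage keys stored as secrets) are left out.

```yaml
name: dvc-pipeline

on:
  pull_request:
    branches: [ "main" ]

jobs:
  reproduce:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - uses: iterative/setup-dvc@v1    # installs the dvc CLI
      - name: Run DVC pipeline
        run: |
          pip install -r requirements.txt
          dvc pull     # needs credentials for the configured remote
          dvc repro    # reruns only stages whose dependencies changed
```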


> dvc stage add -n preprocess -d raw_data.csv -d preprocess.py -o processed_data.csv python preprocess.py
> dvc stage add -n train -d train.py -d processed_data.csv -o plots.png -o metrics.txt python train.py
> dvc dag
> dvc repro
> git add dvc.lock && git commit -m "first pipeline repro"

stages:
  preprocess:
    cmd: python preprocess.py
    deps:
    - preprocess.py
    - raw_data.csv
    outs:
    - processed_data.csv
  train:
    cmd: python train.py
    deps:
    - processed_data.csv
    - train.py
    outs:
    - metrics.txt
    - plots.png

Comparing training runs and Hyperparameter (HP) tuning

Analysis of model performance and the fine-tuning of hyperparameters. Compare metrics and visualizations across different branches to assess changes in model performance. Hyperparameter tuning using scikit-learn's GridSearchCV. Automation of pull requests using the optimal model configuration.

Configure DVC YAML file to track metrics across experiments
Querying and comparing DVC metrics

Change a hyperparameter and rerun dvc repro
Setting up DVC GitHub Action


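stages:
  preprocess:
    # ... unchanged, omitted in these notes
  train:
    # ... cmd and deps unchanged, omitted in these notes
    outs:
    - confusion_matrix.png
    metrics:
    - metrics.json:
        cache: false    # keep metrics.json in Git instead of the DVC cache
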
> dvc metrics show
> dvc metrics diff
> dvc plots show predictions.csv

Hyperparameter tuning route

Branch name hp_tune/<some-string>
Make changes to search configuration
Manually open a PR

Force-runs the tuning stage of the DVC pipeline: dvc repro -f hp_tune
Uses cml pr create to create a new training PR with best parameters
Force push a commit to training PR to kick off model training job
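
The tuning route could look roughly like the workflow below. This is a sketch under several assumptions: the `hp_tune` stage exists in `dvc.yaml`, the tuning script writes its best parameters to a file called `best_params.json` (a hypothetical name), and the training branch is named `train/<commit sha>`. The workflow and job names are also invented.

```yaml
name: hp-tuning

on:
  pull_request:
    branches: [ "main" ]

permissions:
  contents: write
  pull-requests: write

jobs:
  hp_tune:
    # Run only for branches named hp_tune/<some-string>
    if: startsWith(github.head_ref, 'hp_tune/')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-dvc@v1
      - uses: iterative/setup-cml@v1
      - name: Tune and open training PR
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install -r requirements.txt
          dvc pull
          # Force-run the tuning stage even if nothing changed
          dvc repro -f hp_tune
          # Open a PR that carries the best parameters (file name assumed)
          cml pr create \
            --message "Best hyperparameters" \
            --branch "train/${{ github.sha }}" \
            best_params.json
```

Pull requests opened with `GITHUB_TOKEN` do not trigger further workflow runs, which is why a push to the training PR is needed to start the training job.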


Separate feature branches for training and hyperparameter tuning

Hyperparameter tuning job kickoff
Creating a training PR from hyperparameter run