@hamelsmu
Last active July 30, 2020 13:35
The Git + Data tool I always wanted

Install MLOps

brew install mlops

Clone repo as usual

If a metadata file, data.yaml, is detected in the root of the repo, the mlops plugin automatically asks the user whether they would like to download the data.
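
The detection step could be sketched roughly as follows. This is a minimal sketch, not an actual mlops implementation: the function names `has_datastore` and `should_download` are hypothetical, and since the schema of data.yaml is not specified here, the sketch only checks that the file exists.

```python
from pathlib import Path
from typing import Callable

# File name taken from the text above; its contents are not specified there.
METADATA_FILE = "data.yaml"


def has_datastore(repo_root: str) -> bool:
    """Return True if the metadata file sits in the root of the repo."""
    return (Path(repo_root) / METADATA_FILE).is_file()


def should_download(repo_root: str, ask: Callable[[str], str] = input) -> bool:
    """Ask the user whether to download data, but only if a datastore is declared.

    `ask` is injectable so the prompt can be replaced in tests or scripts.
    """
    if not has_datastore(repo_root):
        return False
    answer = ask("This repository contains a datastore. Download now? y/n: ")
    return answer.strip().lower() in ("y", "yes")
```

Declining the prompt does no harm under this design, since the data can still be fetched later with `git data pull`.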

> git clone https://github.com/david/ml
cloning....
This repository contains a datastore. Would you like to download it now?
You can download at any time with the command `git data pull`. y/n?: y

Downloading data/sales.csv [####################] 100%

If you switch branches, you get the data associated with that branch

> git checkout dev
locking access to data while incremental changes are made from master -> dev

Downloading [####################] 100%
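
One way the "incremental changes" on checkout might be computed is by comparing per-branch manifests. The sketch below assumes a hypothetical manifest that maps each data file path to a content hash; neither the manifest format nor the `plan_sync` name comes from the gist.

```python
from typing import Dict, List, Tuple

# Hypothetical manifest: data file path -> content hash for one branch.
Manifest = Dict[str, str]


def plan_sync(current: Manifest, target: Manifest) -> Tuple[List[str], List[str]]:
    """Return (paths to download, paths to delete) when moving current -> target."""
    # A file must be fetched if it is new on the target branch or its hash differs.
    to_download = sorted(
        path for path, digest in target.items() if current.get(path) != digest
    )
    # A file must be removed locally if the target branch no longer tracks it.
    to_delete = sorted(path for path in current if path not in target)
    return to_download, to_delete
```

Files whose hashes match on both branches are left alone, so only the difference between master and dev is transferred.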

Make a change to the data and/or code and push

Data is automatically ignored by git and pushed to the datastore for you, as though you were storing the data in GitHub, using the same commands: git add and git push.

> git add pre_process_data.py data/sales.csv; git commit -m'update data pre-processing to add fields'; git push

change to data file(s) detected:
 - data/sales.csv
Uploading to data store azure://david/sales-data

Uploading [####################] 100%


Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Delta compression using up to 8 threads
Compressing objects: 100% (1/1), done.
Writing objects: 100% (1/1), 583 bytes | 583.00 KiB/s, done.
Total 1 (delta 3), reused 0 (delta 0)
remote: Resolving deltas: 100% (3/3), completed with 1 local objects.
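
The "change to data file(s) detected" step in the session above could work along these lines. This is a sketch under stated assumptions: data files live under a `data/` directory, a hypothetical manifest records the last uploaded SHA-256 hash of each file, and the helper names are invented for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks so large datasets fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_data_files(
    repo_root: str,
    staged: Iterable[str],
    manifest: Dict[str, str],
    data_dir: str = "data",  # assumption: data lives under this directory
) -> List[str]:
    """Return staged data files whose content differs from the recorded hash."""
    changed = []
    for rel in staged:
        if Path(rel).parts[:1] != (data_dir,):
            continue  # not a data file: leave it to git as usual
        if manifest.get(rel) != file_digest(Path(repo_root) / rel):
            changed.append(rel)
    return changed
```

With the files from the session, `changed_data_files(root, ["pre_process_data.py", "data/sales.csv"], {})` would report only `data/sales.csv`, which is the file that goes to the datastore while the script goes to git.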
hamelsmu commented Jul 11, 2020

Thoughts on how to build this:

  • We can use git hooks

  • How do we apply/store only incremental changes to data? Perhaps we can use ORAS

  • Could build on top of DVC or something else
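
As a rough illustration of the git hooks idea, the sketch below installs a `post-checkout` hook that pulls data after a branch switch. It assumes a standard repository layout where `.git` is a directory, and it calls `git data pull`, which is the hypothetical command from the gist rather than an existing git subcommand. The function name `install_post_checkout_hook` is also invented here.

```python
import stat
from pathlib import Path

# Git passes post-checkout three arguments: the previous HEAD, the new HEAD,
# and a flag that is 1 for a branch checkout and 0 for a file checkout.
HOOK_BODY = """#!/bin/sh
if [ "$3" = "1" ]; then
    git data pull  # hypothetical command from the gist
fi
"""


def install_post_checkout_hook(repo_root: str) -> Path:
    """Write the hook into .git/hooks and mark it executable."""
    hooks_dir = Path(repo_root) / ".git" / "hooks"
    hooks_dir.mkdir(parents=True, exist_ok=True)
    hook = hooks_dir / "post-checkout"
    hook.write_text(HOOK_BODY)
    # Git skips hook files that are not executable.
    hook.chmod(hook.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return hook
```

Hooks are not copied by `git clone`, so a tool built this way would still need an install step (such as the brew-installed plugin) to put them in place on each machine.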
