Last active July 30, 2020 13:35
The Git + Data tool I always wanted

Install MLOps

brew install mlops

Clone the repo as usual

If a metadata file data.yaml is detected in the root of the repo, the mlops plugin automatically asks whether you would like to download the data.

> git clone
This repository contains a datastore. Would you like to download it now?  
You can download at any time with the command `git data pull`. y/n?: y

Downloading data/sales.csv [####################] 100%
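
The detection step above could be sketched roughly as follows. This is only an illustration of the idea, not an existing implementation: the function names `has_datastore` and `should_download` are hypothetical, and the only detail taken from the text is that the metadata file is called data.yaml and lives in the repo root.

```python
from pathlib import Path

METADATA_FILE = "data.yaml"  # file name taken from the description above


def has_datastore(repo_root):
    """Return True if the repo root contains the metadata file."""
    return (Path(repo_root) / METADATA_FILE).is_file()


def should_download(repo_root, ask=input):
    """Prompt the user only when metadata is present.

    `ask` is injectable so the prompt can be replaced in tests.
    Anything other than "y" is treated as "no".
    """
    if not has_datastore(repo_root):
        return False
    answer = ask("This repository contains a datastore. Download now? y/n: ")
    return answer.strip().lower() == "y"
```

Defaulting to "no" keeps a plain `git clone` cheap; the data can still be fetched later with the proposed `git data pull`.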

If you switch branches, you get the data associated with that branch.

> git checkout dev
locking access to data while incremental changes are made from master -> dev

Downloading [####################] 100%
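
One way the branch switch could trigger a data sync is a git post-checkout hook. Git passes that hook three arguments: the previous HEAD, the new HEAD, and a flag that is 1 for a branch checkout and 0 for a file checkout. The sketch below only decides whether a sync is needed; the function name `is_branch_checkout` is hypothetical, and the actual download step is left out because the tool does not exist yet.

```python
def is_branch_checkout(hook_args):
    """Decide whether a post-checkout invocation should trigger a data sync.

    `hook_args` are the three arguments git passes to post-checkout:
    previous HEAD, new HEAD, and a flag ("1" = branch checkout,
    "0" = file checkout).
    """
    if len(hook_args) != 3:
        return False
    prev_head, new_head, flag = hook_args
    # Skip file checkouts and switches between branches at the same commit.
    return flag == "1" and prev_head != new_head
```

A script saved as `.git/hooks/post-checkout` and made executable could call this with `sys.argv[1:]` and then run the data pull when it returns True.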

Make a change to the data and/or code and push

Data is automatically ignored by git and pushed to the datastore for you, as though you were storing the data in GitHub, using the same commands: git add and git push.

> git add data/sales.csv; git commit -m'update data pre-processing to add fields'; git push

change to data file(s) detected:
 - data/sales.csv
Uploading to data store azure://david/sales-data

Uploading [####################] 100%

Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
Delta compression using up to 8 threads
Compressing objects: 100% (1/1), done.
Writing objects: 100% (1/1), 583 bytes | 583.00 KiB/s, done.
Total 1 (delta 3), reused 0 (delta 0)
remote: Resolving deltas: 100% (3/3), completed with 1 local objects.
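
The routing described in this step, where data files go to the datastore and everything else goes to git, could be sketched as below. The `DATA_PATTERNS` list and the `split_changes` function are assumptions made for illustration; in the proposed tool the patterns would presumably come from data.yaml.

```python
from fnmatch import fnmatch

# Hypothetical patterns; assumed here, not specified in the proposal.
DATA_PATTERNS = ["data/*"]


def split_changes(paths, patterns=DATA_PATTERNS):
    """Split changed paths into (data_files, code_files).

    Data files would be uploaded to the datastore; code files
    would be committed to git as usual.
    """
    data_files, code_files = [], []
    for path in paths:
        if any(fnmatch(path, pattern) for pattern in patterns):
            data_files.append(path)
        else:
            code_files.append(path)
    return data_files, code_files
```

For example, `split_changes(["data/sales.csv", "train.py"])` puts data/sales.csv in the first list and train.py in the second.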
hamelsmu commented Jul 11, 2020

Thoughts on how

  • We can use git hooks

  • How do we apply/store only incremental changes to the data? Perhaps we can use ORAS

  • Could build on top of DVC or something else
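
As a starting point for the incremental-changes question, here is a minimal sketch that compares content hashes per file, so only files whose contents changed would be transferred. It does not use ORAS or DVC, and it works at file granularity rather than on byte-level deltas. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its digest."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old, new):
    """Return (to_upload, to_delete) when moving from `old` to `new`."""
    to_upload = sorted(name for name, digest in new.items() if old.get(name) != digest)
    to_delete = sorted(name for name in old if name not in new)
    return to_upload, to_delete
```

A manifest like this could be committed to git alongside the code, so that each branch records which version of each data file it expects.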

