In addition to storing and versioning data under the data/ directory, we store models in the models/ directory and version them with DVC. When using PySpark, models are stored on the cluster nodes at /tmp/models/, so you will need to copy them there and reference those cluster-local paths in your PySpark code.
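Since the models are plain pickle files, loading one on an executor is just a local read from the cluster path. The sketch below illustrates this; the pig_classifier path follows the layout above, while `load_model` and the stand-in dictionary are hypothetical names for illustration only.

```python
import os
import pickle
import tempfile

# Cluster-local path convention described above (not the repo-relative DVC path).
SPARK_MODEL_PATH = "/tmp/models/pig_classifier/logistic_regression.pkl"

def load_model(path):
    """Deserialize a pickled model from a local filesystem path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Local demonstration with a stand-in object. In a real PySpark run you would
# first copy the DVC-tracked .pkl to SPARK_MODEL_PATH on every node, then call
# load_model(SPARK_MODEL_PATH) inside your executor code.
demo_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(demo_path, "wb") as f:
    pickle.dump({"coef": [0.1, 0.2]}, f)
model = load_model(demo_path)
```

The point of the indirection is that executors never see your git checkout, so any path baked into driver code must be one that exists on the workers themselves.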
To add a file to DVC, run the series of commands below. Skipping any of them can delete files your co-workers recently added :( . It is very important that you run `git pull` and then `dvc pull`, one after the other, before you add files and `dvc push`. The same steps work both for adding new files to DVC and for re-adding existing ones to update their version.
# Get the latest .dvc files and configuration for the project
git pull origin <branch>
# Pull the latest datasets from DVC before adding our new ones
dvc pull
# Version a new file under DVC - "dvc add" is analogous to "git add"
dvc add models/pig_classifier/logistic_regression.pkl
# "dvc add" created models/pig_classifier/logistic_regression.pkl.dvc - stage and commit it with git
git add -f models/pig_classifier/logistic_regression.pkl.dvc
git commit -m "DVC versioned the latest logistic regression model for classifying happy and sad pigs"
# Commit the file to DVC
dvc commit
# Push the files to git
git push origin <branch>
# Push the files to DVC
dvc push
In this specific example I think we don't need `dvc pull` or `dvc commit` (`dvc add` already stores the file in DVC's local cache). Otherwise it should work fine!