In addition to storing and versioning data under the data/ directory, we store models in the models/ directory and version them using DVC. On PySpark, we store models on the nodes of the cluster at /tmp/models/, so you will need to copy them there and use Spark-specific paths when using PySpark.
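As a rough illustration of the copy step, the sketch below stages a local models/ directory at /tmp/models/ on a single machine. The function name stage_models and its arguments are hypothetical and not part of this project. How the files reach every node of the cluster (scp, a shared mount, a bootstrap script) depends on your setup and is not covered here.

```shell
# Hypothetical helper, not part of the project: copy DVC-tracked models
# to the location PySpark reads from on this node.
# Usage: stage_models [source_dir] [dest_dir]
stage_models() {
  src_dir="${1:-models}"        # defaults to the repository's models/ directory
  dest_dir="${2:-/tmp/models}"  # defaults to the path mentioned above

  # Create the destination if it does not exist yet.
  mkdir -p "$dest_dir" || return 1

  # Copy the directory contents, including hidden files, into the destination.
  cp -R "$src_dir"/. "$dest_dir"/
}
```

Running stage_models with no arguments from the repository root copies models/ to /tmp/models/ on the current machine only; it would need to be repeated on, or distributed to, each node that loads the models.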
To add a file to DVC, you need to run the following series of commands. Failing to run all of them can result in the deletion of files recently added by your co-workers :( It is very important that you run git pull and dvc pull, one after the other, before you add files and run dvc push. You can add new files to DVC this way, or re-add existing files to update their version.
# Get the latest .dvc files and configuration for the project
git pull origin <branch>