In addition to storing and versioning data under the data/ directory, we store models in the models/ directory and version them with DVC. When using PySpark, models are stored on the cluster nodes at /tmp/models/, so you will need to copy them there and reference those cluster-local paths in your PySpark code.
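Since the models are plain pickle files, loading one on an executor is just a local read from the cluster path. The sketch below illustrates this; the pig_classifier path follows the layout above, while `load_model` and the stand-in dictionary are hypothetical names for illustration only.

```python
import os
import pickle
import tempfile

# Cluster-local path convention described above (not the repo-relative DVC path).
SPARK_MODEL_PATH = "/tmp/models/pig_classifier/logistic_regression.pkl"

def load_model(path):
    """Deserialize a pickled model from a local filesystem path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Local demonstration with a stand-in object. In a real PySpark run you would
# first copy the DVC-tracked .pkl to SPARK_MODEL_PATH on every node, then call
# load_model(SPARK_MODEL_PATH) inside your executor code.
demo_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(demo_path, "wb") as f:
    pickle.dump({"coef": [0.1, 0.2]}, f)
model = load_model(demo_path)
```

The point of the indirection is that executors never see your git checkout, so any path baked into driver code must be one that exists on the workers themselves.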
To add a file to DVC, run the series of commands below. Skipping any of them can delete files your co-workers recently added :( . It is very important that you run `git pull` and then `dvc pull`, one after the other, before you add files and `dvc push`. The same steps work both for adding new files to DVC and for re-adding existing ones to update their version.
# Get the latest .dvc files and configuration for the project
git pull origin <branch>
# Pull the latest datasets from DVC before adding our new ones
dvc pull
# Version a new file under DVC - "dvc add" is analogous to "git add"
dvc add models/pig_classifier/logistic_regression.pkl
# "dvc add" created models/pig_classifier/logistic_regression.pkl.dvc - stage and commit it with git
git add -f models/pig_classifier/logistic_regression.pkl.dvc
git commit -m "DVC versioned the latest logistic regression model for classifying happy and sad pigs"
# Commit the file to DVC
dvc commit
# Push the files to git
git push origin <branch>
# Push the files to DVC
dvc push
In this specific example I think we don't need `dvc pull` or `dvc commit` (`dvc add` already stores the file in DVC's local cache). Otherwise it should work fine!