@rjurney
Last active March 17, 2021 01:32
Our README process for versioning files under DVC - is this right?

Models and their Files

In addition to storing and versioning data under the data/ directory, we store models in the models/ directory and version them using DVC. On PySpark we store models on the nodes of the cluster at /tmp/models/, so you will need to copy them there and use Spark-specific paths when using PySpark.

To add a file to DVC, you need to run the series of commands below. Failing to run them all can result in the deletion of files recently added by your co-workers :(. It is very important that you run git pull and then dvc pull, one after the other, before you add files and dvc push. You can add new files to DVC this way, or re-add existing files to update their version.

# Get the latest .dvc files and configuration for the project
git pull origin <branch>

# Pull the latest datasets from DVC before adding our new ones
dvc pull

# Version a new file under DVC - "dvc add" is analogous to "git add"
dvc add models/pig_classifier/logistic_regression.pkl

# "dvc add" creates the file models/pig_classifier/logistic_regression.pkl.dvc - add and commit it to our branch
git add -f models/pig_classifier/logistic_regression.pkl.dvc
git commit -m "DVC versioned the latest logistic regression model for classifying happy and sad pigs"

# Commit the file to DVC
dvc commit

# Push the files to git
git push origin <branch>

# Push the files to DVC
dvc push
@shcheklein

In this specific example I think we don't need dvc pull and dvc commit. Otherwise it should work fine!
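
For illustration, the shortened sequence could be sketched as a small shell function. This is only a sketch under assumptions: version_file is a hypothetical helper name that is not part of the original README, and the commit message is a placeholder. It keeps the commands from the gist in the same order and simply leaves out dvc pull and dvc commit.

```shell
# Sketch of the shortened flow: no "dvc pull" and no "dvc commit".
# version_file is a hypothetical helper; pass it the file path and the branch name.
version_file() {
    file="$1"
    branch="$2"

    # Get the latest .dvc files and configuration for the project
    git pull origin "$branch" || return 1

    # Store the file in the local DVC cache and write "$file.dvc"
    dvc add "$file" || return 1

    # Track the small .dvc pointer file in Git
    git add -f "$file.dvc" || return 1
    git commit -m "Version $file with DVC" || return 1

    # Publish the pointer file, then upload the cached data to the DVC remote
    git push origin "$branch" || return 1
    dvc push
}

# Example call (commented out so that sourcing this file has no side effects):
# version_file models/pig_classifier/logistic_regression.pkl my-branch
```

Each step stops the function if it fails, so a failed git pull never leads to a push.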

@rjurney
Author

rjurney commented Mar 16, 2021

@shcheklein I’m confused. The docs say that dvc add is like git add. Does adding the file push it to our GCS bucket?

@shcheklein

Hey @rjurney, sorry for the delay.

When our docs say "dvc something is like git something", unfortunately it's never precise, and I see that it can be misleading. Our intention was to give some idea about the command, and git add felt like the best way to explain it.

dvc add is indeed similar to git add. It adds a file into Git, but if we go further we can say that dvc add is git add file + git commit file. So, after dvc add you don't need to do dvc commit. In this case I would even say that the similarity of dvc commit's name to git commit is very misleading.

dvc commit is not meant to be used often. It forcefully updates all .dvc and dvc.lock files. That's why, if you don't have the data pulled with dvc pull initially, it will update all previous .dvc files as if they were pointing to empty data (or should it actually even error out?).

Does adding the file push it to our GCS bucket?

No, it adds it into .dvc/cache (similar to git add saving some info into .git locally). dvc push is needed to send data from cache to the remote storage.
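
One way to see the difference between the local cache and the remote is to compare them before and after pushing. The following is a rough sketch, assuming a DVC remote is already configured for the project; add_check_push is a hypothetical helper name, and the exact text that dvc status prints may differ between DVC versions.

```shell
# Sketch: "dvc add" only fills the local cache; "dvc push" uploads it.
# add_check_push is a hypothetical helper; pass it the file path.
add_check_push() {
    file="$1"

    # Store the file in the local cache (.dvc/cache) and write "$file.dvc"
    dvc add "$file" || return 1

    # Compare the local cache with the remote storage; the newly added
    # file should be reported here because it has not been uploaded yet
    dvc status --cloud

    # Send the cached data to the remote storage
    dvc push || return 1

    # Compare again; the file should no longer be reported as missing
    dvc status --cloud
}

# Example call (commented out so that sourcing this file has no side effects):
# add_check_push models/pig_classifier/logistic_regression.pkl
```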

Please let me know if this clarifies the flow a bit. Happy to discuss other details here.
