@rjurney
Created March 12, 2021 19:00
What the FUCK am I doing wrong with DVC?
$ dvc add data/demo/Demo\ triples\ and\ properties\ data\ -\ 20200202.xlsx
100% Add|████████████████████|1/1 [00:00, 1.41file/s]
$ git add -f data/demo/Demo\ triples\ and\ properties\ data\ -\ 20200202.xlsx.dvc
$ git commit -m "Now tracking Demo 1/2 dataset in DVC"
[43-demo-data-json ddb47e9] Now tracking Demo 1/2 dataset in DVC
1 file changed, 4 insertions(+)
create mode 100644 data/demo/Demo triples and properties data - 20200202.xlsx.dvc
$ dvc pull
M DeepDiscovery/oc/name_classifier_baseline/name_classifier_sklearn.pkl
D data/fines/ft.com-bank-fines.csv
D data/names/spanish-names-surnames/
D data/names/Names_2010Census.csv
D data/demo/Demo triples and properties data - 20200202.xlsx
4 files deleted, 1 file modified and 1 file fetched
$ dvc push
# MY FUCKING FILES WERE ALL DELETED FROM THE LAST PUSH
@shcheklein

@rjurney could you please clarify:

  1. When you run dvc pull, does it happen after a clean git clone? Not in the same place where you ran git commit, right?
  2. "MY FUCKING FILES WERE ALL DELETED FROM THE LAST PUSH": deleted where? In the remote storage? In the project?

@rjurney
Author

rjurney commented Mar 12, 2021

@shcheklein Hmmmm...

  1. I have read the docs and find dvc confusing. My workflow is thus:
  • I want to add a new dataset. I run:
dvc add data/demo/my_file.txt
git add -f data/demo/my_file.txt.dvc
git commit -m "Added my_file.txt to dvc"
git push origin my_branch
dvc push
  • Here it often complains to the effect that I need to pull, is my understanding.
dvc pull
  • BAM! All the files I previously pushed, even before this last bit of work - are deleted as above.

@shcheklein

@rjurney so, you don't do anything at all between dvc push and dvc pull, right? and it deletes the my_file.txt?

Could you run git show and git status in this branch and share the output, just to make sure that everything indeed got into Git?

@jorgeorpinel

jorgeorpinel commented Mar 12, 2021

UPDATE: Oops I see @shcheklein is already helping here. Ignore me 🙂

Hi there! (Found this via Twitter.)

First, note that dvc push doesn't usually need a git push, just a git commit. For example, even if you don't have a Git remote in the project, you can still store the data in a DVC remote.

Secondly, what kind of DVC remote storage are you using? Mostly curious as there's no info on your remote config. If possible share .dvc/config (except any secrets please!)

Here it often complains to the effect that I need to pull

Can you share the message DVC gives you to this effect please? Otherwise why are you pulling right after pushing? What are you trying to achieve at this point?

In any case, dvc pull right after dvc push shouldn't change anything in the workspace. In your case, data/demo/my_file.txt should be where you left it. I just tried it locally in a simple repo and nothing changed. Are there any steps in between, or could you share the full steps from the beginning of your workflow so we can try to reproduce this issue? (If any command prints an error message, please retry it with the -v flag and share the full output.)

Finally, please share the output of dvc version — my only hypothesis is that the DVC checkout process (included in dvc pull) may have some problem linking from cache to workspace in your system/config.

Sorry for the long answer but there's not enough info/context to give you a specific diagnostic just yet. 🙂
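The details requested above can be gathered in one go. This is only a rough sketch: the helper name collect_dvc_report and the default report file name are made up for illustration, and it deliberately does not copy .dvc/config, so that secrets can be removed by hand before sharing.

```shell
# Sketch: collect the diagnostics asked for in this thread into one file.
# collect_dvc_report and dvc-report.txt are hypothetical names.
collect_dvc_report() {
    report="${1:-dvc-report.txt}"
    {
        echo "== dvc version =="
        dvc version
        echo "== git status =="
        git status --short
        echo "== last commit =="
        git show --stat HEAD
        echo "== dvc status (workspace vs. cache) =="
        dvc status
        echo "== dvc status -c (cache vs. remote storage) =="
        dvc status -c
    } > "$report" 2>&1
    # Print the report path so the caller knows what to attach.
    echo "$report"
}
```

Run collect_dvc_report from the project root and attach the resulting file. If a command such as dvc pull prints an error, rerun it separately with -v and add that output too.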

@rjurney
Author

rjurney commented Mar 19, 2021

@shcheklein I'm actually pulling before I push or I can't push. The order might vary... sometimes I pull before I add files to git and version them so I have the latest stuff before making changes. The key factor seems to be that my coworker has pushed and my files get deleted. What is the right order for this stuff and how are we supposed to collaborate correctly?
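For reference, one ordering that is often used when several people share a DVC project is sketched below. It is not confirmed anywhere in this thread as the official answer. It assumes the deletions come from .dvc pointer files in the workspace being out of step with Git, as suggested elsewhere in this thread. The helper name share_dataset, the remote name origin, and the branch argument are illustrative assumptions.

```shell
# Sketch of one possible sync order for a shared DVC project.
# share_dataset is a hypothetical helper, not part of DVC or Git.
share_dataset() {
    data_file="$1"   # e.g. data/demo/my_file.txt
    branch="$2"      # e.g. my_branch

    # 1. Update Git first, so the .dvc pointer files include whatever
    #    collaborators have already committed.
    git pull origin "$branch" || return 1

    # 2. Fetch the data those pointer files refer to.
    dvc pull || return 1

    # 3. Track the new file and commit its pointer.
    #    (-f mirrors the commands used earlier in this thread.)
    dvc add "$data_file" || return 1
    git add -f "$data_file.dvc" || return 1
    git commit -m "Track $data_file with DVC" || return 1

    # 4. Upload the data, then publish the pointer that refers to it.
    dvc push || return 1
    git push origin "$branch"
}
```

Usage would look like share_dataset data/demo/my_file.txt my_branch. The point of pulling from Git before dvc pull is that dvc pull checks out whatever the pointer files currently in the workspace describe.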

@shcheklein

@rjurney unless there is some fundamental bug, dvc push should never delete files. It works in an "append-only" mode, so only new data (the delta) gets added to the remote storage.

My guess is that some .dvc files or dvc.lock got out of sync. E.g. your coworker pushed some new data, but then also ran dvc commit, which globally updated all the .dvc files. It doesn't mean that the data disappeared (you can always go to the previous commit, run dvc pull, and you should be fine). It only means that the current commit doesn't "point" to the right data.
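A narrower variation on "go to the previous commit and run dvc pull" is to restore just one pointer file from a commit where it was still correct. This is a sketch under that assumption. The helper name restore_from_commit is hypothetical, and the revision has to be chosen by hand, for example by reading git log -- '*.dvc' dvc.lock.

```shell
# Sketch: restore one .dvc pointer file from an earlier commit and fetch its data.
# restore_from_commit is a hypothetical helper, not part of DVC or Git.
restore_from_commit() {
    rev="$1"       # a commit where the data was still correct, e.g. HEAD~1
    pointer="$2"   # e.g. data/demo/my_file.txt.dvc

    # Confirm the pointer file exists in that commit before changing anything.
    git cat-file -e "$rev:$pointer" || return 1

    # Bring back only that pointer file; the rest of the branch is untouched.
    git checkout "$rev" -- "$pointer" || return 1

    # Download the data the restored pointer refers to and check it out.
    dvc pull "$pointer"
}

# Example call (the revision is an assumption; pick it from the Git history):
#   restore_from_commit HEAD~1 data/demo/my_file.txt.dvc
```

If the data comes back this way, the restored pointer file still needs to be committed to Git so that the current commit points at the right data again.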

Here it often complains to the effect that I need to pull, is my understanding.

If you could show me the message that it prints, that would be very helpful.

@rjurney
Author

rjurney commented Mar 25, 2021

@shcheklein so that means the problem was created by him not running dvc pull before he ran a commit? Shouldn't DVC prevent this?
