shcheklein/repro.md

## repro.md

      
    Raw
  

              repro.md
            
          
    repro

Rerun commands recorded in the pipeline stages in the same order. Commands to
rerun are determined by recursively analyzing which stages and changes in their
dependencies to find only those that have to be rerun.
Synopsis

    usage: dvc repro [-h] [-q] [-v] [-f] [-s]
                     [-c CWD] [-m]
                     [--no-commit]
                     [targets [targets ...]]

    positional arguments:
        target                DVC file to reproduce.


Description

DVC file (target) can have any name followed by the .dvc suffix. If file
name is omitted, Dvcfile will be used by default.
dvc repro provides an interface to rerun the commands in the computational
graph (a.k.a. pipeline) defined by the stage files in the current workspace. By
default, this command recursively searches, starting from the Dvcfile, the
pipeline stages to find any which have changed. It then reruns the corresponding
commands. The pipeline is mostly defined using the dvc run command, while data
input nodes are defined by the dvc add command.
There are several ways to restrict the stages to rerun, by listing stage file(s)
as targets, or using the --single-item, --pipeline, or --cwd options.
dvc repro does not run dvc fetch, dvc pull or dvc checkout to get source
data files, intermediate or final results. It though saves (unless --no-commit
option is specified) all the data files, intermediate or final results into the
DVC local cache and updates stage files with the new checksum information.
Options


-f, --force - rerun the pipeline, reproducing its results, even if no
changes were found. By default this reruns the entire pipeline. To rerun a
single stage, specify the stage name on the command-line along with the
--single-item option.


-s, --single-item - reproduce only a single stage by turning off the
recursive search for changed dependencies. Multiple stages are rerun if
multiple stage names are listed on the command-line.


-c, --cwd - directory within your project to reproduce from. If no target
names are given, it attempts to use Dvcfile in the specified directory, if
it exists, for stages to rerun. Instead of using --cwd one can alternately
specify a target in a subdirectory as path/to/target.dvc. This could be
useful for subdirectories containing a semi-independent pipeline, that can
either be rerun as part of the pipeline in the parent directory, or as an
independent unit.


--no-commit - does not save outputs to cache. Useful when running different
experiments and you don't want to fill up your cache with temporary files. Use
dvc commit when you are ready to save your results to cache.


-m, --metrics - show metrics after reproduction. The pipeline must have at
least one metrics file defined either with the dvc metrics command, or by
the -M or -m options on the dvc run command.


--dry - only print the commands that would be executed without actually
executing the commands.


-i, --interactive - ask for confirmation before reproducing each stage.
The stage is rerun if the user types "y".


-p, --pipeline - reproduce the whole pipeline that the specified stage
file belongs to. Use dvc pipeline show target.dvc to show the entire
pipeline the named stage belongs to.


--ignore-build-cache - reproduce all descendants of a changed stage, or the
stages following the changed stage, even if their direct dependencies did not
change. Like with the same option on dvc run, this is a way to force certain
stages to run if they would not otherwise be rerun. This can be useful for
pipelines containing stages that produce nondeterministic (semi-random)
outputs, such as the example below containing a stage that sorts its output
into a random order. For nondeterministic stages the outputs can vary on each
execution, meaning the cache cannot be trusted for such stages.


-q, --quiet - do not write anything to standard output. The command run by
the stage is free to make output irregardless of this flag. Exit with 0 if all
stages are up to date or if all stages are successfully rerun, otherwise exit
with 1.


-v, --verbose - displays detailed tracing information from executing the
dvc repro command.


-h, --help - prints the usage/help message, and exit.


Examples

For simplicity, let's build a pipeline defined below (if you want get your hands
on something more real, check this
mini-tutorial). It takes this text.txt
file:
dvc
1231
is
3
the
best

And runs a few simple transformations to filter and count numbers:
    $ dvc run -f filter.dvc -d text.txt -o numbers.txt \
               "cat text.txt | egrep '[0-9]+' > numbers.txt"

    $ dvc run -f Dvcfile -d numbers.txt -d process.py -M count.txt \
               "python process.py numbers.txt > count.txt"

Where process.py is a script which for simplicity just prints the number of
lines:
import sys
num_lines = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        num_lines += 1
print(num_lines)
The result of executing these dvc run commands should look like this:
    $ tree
    .
    ├── Dvcfile        <---- second stage with a default DVC name
    ├── count.txt      <---- result: "2"
    ├── filter.dvc     <---- first stage
    ├── numbers.txt    <---- intermediate result of the first stage
    ├── process.py     <---- code that runs some transformation
    └── text.txt       <---- text file to process

Ok, now, let's run the dvc repro command (remember, by default it reproduces
outputs defined in Dvcfile, count.txt in this case):
    $ dvc repro

    Stage 'filter.dvc' didn't change.
    Stage 'Dvcfile' didn't change.
    Pipeline is up to date. Nothing to reproduce.

It makes sense, since we haven't changed neither of the dependencies this
pipeline has: text.txt or process.py. Now, let's imagine we want to print a
description and we add this line to the process.py:
...
print('Number of lines:')
print(num_lines)
If we now run dvc repro, that's what we should see:
    $ dvc repro

    Stage 'filter.dvc' didn't change.
    Stage 'Dvcfile' changed.
    Reproducing 'Dvcfile'
    Running command:
	    python process.py numbers.txt > count.txt

    Saving information to 'Dvcfile'.

You can check now that Dvcfile and count.txt have been updated with the new
information, new md5 checksums and a new result respectively.