This script reproduces an issue with the clearml remote agent and argparse.
If run locally:
python tools/train.py 2
The output is as expected:
=== STARTED WORKER 1
Namespace(gpus=2) = Namespace(gpus=2)
=== FINISHED WORKER 1
=== STARTED WORKER 0
Namespace(gpus=2) = Namespace(gpus=2)
=== FINISHED WORKER 0
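The original tools/train.py is not shown here; the following is a hypothetical minimal sketch of a script that would produce the local output above. The CLI (a positional gpus argument and a --remote flag) is inferred from the commands and the printed Namespace; the stdlib multiprocessing module stands in for torch's spawning, and the clearml remote-execution step is only noted in a comment.

```python
import argparse
import multiprocessing as mp  # stdlib stand-in; the real script presumably uses torch


def parse_cli(argv=None):
    # Hypothetical reconstruction of the script's CLI, inferred from
    # `python tools/train.py 2 [--remote]` and `Namespace(gpus=2)`.
    parser = argparse.ArgumentParser()
    parser.add_argument("gpus", type=int)
    parser.add_argument("--remote", action="store_true")
    return parser.parse_args(argv)


def worker(rank, args):
    print(f"=== STARTED WORKER {rank}")
    print(f"{args} = {args}")
    print(f"=== FINISHED WORKER {rank}")


def main():
    args = parse_cli()
    # In the real repro, --remote would hand the task off to a clearml-agent
    # before spawning workers; that step is omitted in this sketch.
    procs = [
        mp.Process(target=worker, args=(rank, argparse.Namespace(gpus=args.gpus)))
        for rank in range(args.gpus)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


if __name__ == "__main__":
    main()
```

Run locally as `python train.py 2`, this prints a START/Namespace/FINISHED triple per worker, matching the expected output above (worker ordering may vary between runs).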
When run on a remote clearml-agent:
python tools/train.py 2 --remote
The output is an error from torch.distributed.run:
usage: train.py [-h] [--nnodes NNODES] [--nproc_per_node NPROC_PER_NODE]
[--rdzv_backend RDZV_BACKEND] [--rdzv_endpoint RDZV_ENDPOINT]
[--rdzv_id RDZV_ID] [--rdzv_conf RDZV_CONF] [--standalone]
[--max_restarts MAX_RESTARTS]
[--monitor_interval MONITOR_INTERVAL]
[--start_method {spawn,fork,forkserver}] [--role ROLE] [-m]
[--no_python] [--run_path] [--log_dir LOG_DIR] [-r REDIRECTS]
[-t TEE] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
[--master_port MASTER_PORT]
training_script ...
train.py: error: the following arguments are required: training_script, training_script_args