Reference for transformers 4.30.0.dev0
The options below are additional configuration parameters that can be used when training a model with Hugging Face Transformers. They control various aspects of the training process, such as the optimizer to use, data loading settings, memory management, model evaluation, checkpointing, and integration with the Hugging Face Model Hub.
Here is a summary of the high-level functionalities provided by some of the options:
- Distributed training: Options like sharded_ddp and fsdp enable distributed training across multiple devices or machines.
- Optimization: Options like optim, adafactor, and optim_args allow you to choose and configure different optimization algorithms for training.
- Data loading: Options like dataloader_num_workers and dataloader_pin_memory control the parallelism and memory allocation for data loading.
- Evaluation and metrics: Options like do_eval, eval_steps, report_to, and include_inputs_for_metrics control the evaluation process during training and where results and metrics are reported.
- Checkpointing and resuming training: Options like resume_from_checkpoint and load_best_model_at_end enable resuming training from a saved checkpoint or loading the best model found during training.
- Model Hub integration: Options like push_to_hub and hub_model_id allow you to push the trained model to the Hugging Face Model Hub for sharing and deployment.
- Memory management: Options like gradient_checkpointing and skip_memory_metrics control memory usage during training and whether memory profiler reports are added to the metrics.
- Model generation: Options like predict_with_generate, generation_max_length, and generation_num_beams allow you to generate text with the trained model and compute generative metrics such as ROUGE and BLEU.
These options provide flexibility and customization for training models using the Hugging Face Transformers library.
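As a quick illustration before the full table, the same options can also be set programmatically instead of on the command line. The sketch below is a minimal example, not part of the reference itself; the values shown are purely illustrative, and it assumes transformers 4.30 is installed.

```python
# Minimal sketch: setting a few of the options listed below via
# Seq2SeqTrainingArguments instead of command-line flags.
# Values are illustrative only.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./outputs",          # --output_dir
    do_train=True,                   # --do_train
    do_eval=True,                    # --do_eval
    evaluation_strategy="steps",     # --evaluation_strategy
    eval_steps=500,                  # --eval_steps
    per_device_train_batch_size=8,   # --per_device_train_batch_size
    gradient_accumulation_steps=1,   # --gradient_accumulation_steps
    learning_rate=5e-5,              # --learning_rate
    num_train_epochs=3.0,            # --num_train_epochs
    predict_with_generate=True,      # --predict_with_generate
    push_to_hub=False,               # --push_to_hub
)
```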
Argument | Description | Default |
---|---|---|
--model_name_or_path | Path to pretrained model or model identifier from huggingface.co/models (default: None) | None |
--config_name | Pretrained config name or path if not the same as model_name (default: None) | None |
--tokenizer_name | Pretrained tokenizer name or path if not the same as model_name (default: None) | None |
--cache_dir | Where to store the pretrained models downloaded from huggingface.co (default: None) | None |
--use_fast_tokenizer | Whether to use one of the fast tokenizers (backed by the tokenizers library) or not. (default: True) | True |
--no_use_fast_tokenizer | Disables --use_fast_tokenizer: do not use a fast tokenizer (backed by the tokenizers library). (default: False) | False |
--model_revision | The specific model version to use (can be a branch name, tag name, or commit id). (default: main) | main |
--use_auth_token | Will use the token generated when running huggingface-cli login (necessary to use this script with private models). (default: False) | False |
--resize_position_embeddings | Whether to automatically resize the position embeddings if max_source_length exceeds the model's position embeddings. (default: None) | None |
--lang | Language id for summarization. (default: None) | None |
--dataset_name | The name of the dataset to use (via the datasets library). (default: None) | None |
--dataset_config_name | The configuration name of the dataset to use (via the datasets library). (default: None) | None |
--text_column | The name of the column in the datasets containing the full texts (for summarization). (default: None) | None |
--summary_column | The name of the column in the datasets containing the summaries (for summarization). (default: None) | None |
--train_file | The input training data file (a jsonlines or csv file). (default: None) | None |
--validation_file | An optional input evaluation data file to evaluate the metrics (rouge) on (a jsonlines or csv file). (default: None) | None |
--test_file | An optional input test data file to evaluate the metrics (rouge) on (a jsonlines or csv file). (default: None) | None |
--overwrite_cache | Overwrite the cached training and evaluation sets (default: False) | False |
--preprocessing_num_workers | The number of processes to use for the preprocessing. (default: None) | None |
--max_source_length | The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. (default: 1024) | 1024 |
--max_target_length | The maximum total sequence length for target text after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. (default: 128) | 128 |
--val_max_target_length | The maximum total sequence length for validation target text after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. Will default to max_target_length. This argument is also used to override the max_length param of model.generate, which is used during evaluate and predict. (default: None) | None |
--pad_to_max_length | Whether to pad all samples to model maximum sentence length. If False, will pad the samples dynamically when batching to the maximum length in the batch. More efficient on GPU but very bad for TPU. (default: False) | False |
--max_train_samples | For debugging purposes or quicker training, truncate the number of training examples to this value if set. (default: None) | None |
--max_eval_samples | For debugging purposes or quicker training, truncate the number of evaluation examples to this value if set. (default: None) | None |
--max_predict_samples | For debugging purposes or quicker training, truncate the number of prediction examples to this value if set. (default: None) | None |
--num_beams | Number of beams to use for evaluation. This argument will be passed to model.generate, which is used during evaluate and predict. (default: None) | None |
--ignore_pad_token_for_loss | Whether to ignore the tokens corresponding to padded labels in the loss computation or not. (default: True) | True |
--no_ignore_pad_token_for_loss | Disables --ignore_pad_token_for_loss: include the tokens corresponding to padded labels in the loss computation. (default: False) | False |
--source_prefix | A prefix to add before every source text (useful for T5 models). (default: ) | |
--forced_bos_token | The token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART, where the first generated token needs to be the target language token. (default: None) | None |
--output_dir | The output directory where the model predictions and checkpoints will be written. (default: None) | None |
--overwrite_output_dir | Overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory. (default: False) | False |
--do_train | Whether to run training. (default: False) | False |
--do_eval | Whether to run eval on the dev set. (default: False) | False |
--do_predict | Whether to run predictions on the test set. (default: False) | False |
--evaluation_strategy | The evaluation strategy to use. (default: no) | no |
--prediction_loss_only | When performing evaluation and predictions, only returns the loss. (default: False) | False |
--per_device_train_batch_size | Batch size per GPU/TPU core/CPU for training. (default: 8) | 8 |
--per_device_eval_batch_size | Batch size per GPU/TPU core/CPU for evaluation. (default: 8) | 8 |
--per_gpu_train_batch_size | Deprecated, the use of --per_device_train_batch_size is preferred. Batch size per GPU/TPU core/CPU for training. (default: None) | None |
--per_gpu_eval_batch_size | Deprecated, the use of --per_device_eval_batch_size is preferred. Batch size per GPU/TPU core/CPU for evaluation. (default: None) | None |
--gradient_accumulation_steps | Number of update steps to accumulate before performing a backward/update pass. (default: 1) | 1 |
--eval_accumulation_steps | Number of prediction steps to accumulate before moving the tensors to the CPU. (default: None) | None |
--eval_delay | Number of epochs or steps to wait for before the first evaluation can be performed, depending on the evaluation_strategy. (default: 0) | 0 |
--learning_rate | The initial learning rate for AdamW. (default: 5e-05) | 5e-05 |
--weight_decay | Weight decay for AdamW if we apply some. (default: 0.0) | 0.0 |
--adam_beta1 | Beta1 for AdamW optimizer (default: 0.9) | 0.9 |
--adam_beta2 | Beta2 for AdamW optimizer (default: 0.999) | 0.999 |
--adam_epsilon | Epsilon for AdamW optimizer. (default: 1e-08) | 1e-08 |
--max_grad_norm | Max gradient norm. (default: 1.0) | 1.0 |
--num_train_epochs | Total number of training epochs to perform. (default: 3.0) | 3.0 |
--max_steps | If > 0: set total number of training steps to perform. Override num_train_epochs. (default: -1) | -1 |
--lr_scheduler_type | The scheduler type to use. (default: linear) | linear |
--warmup_ratio | Linear warmup over warmup_ratio fraction of total steps. (default: 0.0) | 0.0 |
--warmup_steps | Linear warmup over warmup_steps. (default: 0) | 0 |
--log_level | Logger log level to use on the main node. Possible choices are the log levels as strings: 'debug', 'info', 'warning', 'error', and 'critical', plus a 'passive' level which doesn't set anything and lets the application set the level. Defaults to 'passive'. (default: passive) | passive |
--log_level_replica | Logger log level to use on replica nodes. Same choices and defaults as log_level. (default: warning) | warning |
--log_on_each_node | When doing a multinode distributed training, whether to log once per node or just once on the main node. (default: True) | True |
--no_log_on_each_node | Disables --log_on_each_node: in multinode distributed training, log only once on the main node. (default: False) | False |
--logging_dir | Tensorboard log dir. (default: None) | None |
--logging_strategy | The logging strategy to use. (default: steps) | steps |
--logging_first_step | Log the first global_step (default: False) | False |
--logging_steps | Log every X update steps. Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps. (default: 500) | 500 |
--logging_nan_inf_filter | Filter nan and inf losses for logging. (default: True) | True |
--no_logging_nan_inf_filter | Disables --logging_nan_inf_filter: do not filter nan and inf losses for logging. (default: False) | False |
--save_strategy | The checkpoint save strategy to use. (default: steps) | steps |
--save_steps | Save a checkpoint every X update steps. Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps. (default: 500) | 500 |
--save_total_limit | Limit the total number of checkpoints. Deletes the older checkpoints in the output_dir. Default is unlimited checkpoints. (default: None) | None |
--save_safetensors | Use safetensors saving and loading for state dicts instead of default torch.load and torch.save. (default: False) | False |
--save_on_each_node | When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on the main one (default: False) | False |
--no_cuda | Do not use CUDA even when it is available (default: False) | False |
--use_mps_device | Whether to use Apple Silicon chip-based mps device. (default: False) | False |
--seed | Random seed that will be set at the beginning of training. (default: 42) | 42 |
--data_seed | Random seed to be used with data samplers. (default: None) | None |
--jit_mode_eval | Whether or not to use PyTorch jit trace for inference (default: False) | False |
--use_ipex | Use Intel extension for PyTorch when it is available, installation: 'https://github.com/intel/intel-extension-for-pytorch' (default: False) | False |
--bf16 | Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA architecture or using CPU (no_cuda). This is an experimental API and it may change. (default: False) | False |
--fp16 | Whether to use fp16 (mixed) precision instead of 32-bit (default: False) | False |
--fp16_opt_level | For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details at https://nvidia.github.io/apex/amp.html (default: O1) | O1 |
--half_precision_backend | The backend to be used for half precision. (default: auto) | auto |
--bf16_full_eval | Whether to use full bfloat16 evaluation instead of 32-bit. This is an experimental API and it may change. (default: False) | False |
--fp16_full_eval | Whether to use full float16 evaluation instead of 32-bit (default: False) | False |
--tf32 | Whether to enable tf32 mode, available in Ampere and newer GPU architectures. This is an experimental API and it may change. (default: None) | None |
--local_rank | For distributed training: local_rank (default: -1) | -1 |
--ddp_backend | The backend to be used for distributed training (default: None) | None |
--tpu_num_cores | TPU: Number of TPU cores (automatically passed by the launcher script) (default: None) | None |
--tpu_metrics_debug | Deprecated, the use of --debug tpu_metrics_debug is preferred. TPU: Whether to print debug metrics (default: False) | False |
--debug | Whether or not to enable debug mode. Current options: underflow_overflow (Detect underflow and overflow in activations and weights), tpu_metrics_debug (print debug metrics on TPU). (default: ) | |
--dataloader_drop_last | Drop the last incomplete batch if it is not divisible by the batch size. (default: False) | False |
--eval_steps | Run an evaluation every X steps. Should be an integer or a float in the range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps. (default: None) | None |
--dataloader_num_workers | Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process. (default: 0) | 0 |
--past_index | If >=0, uses the corresponding part of the output as the past state for the next step. (default: -1) | -1 |
--run_name | An optional descriptor for the run. Notably used for wandb logging. (default: None) | None |
--disable_tqdm | Whether or not to disable the tqdm progress bars. (default: None) | None |
--remove_unused_columns | Remove columns not required by the model when using an nlp.Dataset. (default: True) | True |
--no_remove_unused_columns | Disables --remove_unused_columns: keep columns not required by the model when using an nlp.Dataset. (default: False) | False |
--label_names | The list of keys in your dictionary of inputs that correspond to the labels. (default: None) | None |
--load_best_model_at_end | Whether or not to load the best model found during training at the end of training. (default: False) | False |
--metric_for_best_model | The metric to use to compare two different models. (default: None) | None |
--greater_is_better | Whether the metric_for_best_model should be maximized or not. (default: None) | None |
--ignore_data_skip | When resuming training, whether or not to skip the first epochs and batches to get to the same training data. (default: False) | False |
--sharded_ddp | Whether or not to use sharded DDP training (in distributed training only). The base option should be simple, zero_dp_2, or zero_dp_3, and you can add CPU-offload to zero_dp_2 or zero_dp_3 like this: zero_dp_2 offload or zero_dp_3 offload. You can add auto-wrap to zero_dp_2 or zero_dp_3 with the same syntax: zero_dp_2 auto_wrap or zero_dp_3 auto_wrap. (default: ) | |
--fsdp | Whether or not to use PyTorch Fully Sharded Data Parallel (FSDP) training (in distributed training only). The base option should be full_shard, shard_grad_op, or no_shard, and you can add CPU-offload to full_shard or shard_grad_op like this: full_shard offload or shard_grad_op offload. You can add auto-wrap to full_shard or shard_grad_op with the same syntax: full_shard auto_wrap or shard_grad_op auto_wrap. (default: ) | |
--fsdp_min_num_params | This parameter is deprecated. FSDP's minimum number of parameters for Default Auto Wrapping. (useful only when fsdp field is passed). (default: 0) | 0 |
--fsdp_config | Config to be used with FSDP (PyTorch Fully Sharded Data Parallel). The value is either an fsdp json config file (e.g., fsdp_config.json) or an already loaded json file as dict. (default: None) | None |
--fsdp_transformer_layer_cls_to_wrap | This parameter is deprecated. Transformer layer class name (case-sensitive) to wrap, e.g., BertLayer, GPTJBlock, T5Block, etc. (useful only when fsdp flag is passed). (default: None) | None |
--deepspeed | Enable deepspeed and pass the path to the deepspeed json config file (e.g., ds_config.json) or an already loaded json file as a dict (default: None) | None |
--label_smoothing_factor | The label smoothing epsilon to apply (zero means no label smoothing). (default: 0.0) | 0.0 |
--optim | The optimizer to use. (default: adamw_hf) | adamw_hf |
--optim_args | Optional arguments to supply to the optimizer. (default: None) | None |
--adafactor | Whether or not to replace AdamW by Adafactor. (default: False) | False |
--group_by_length | Whether or not to group samples of roughly the same length together when batching. (default: False) | False |
--length_column_name | Column name with precomputed lengths to use when grouping by length. (default: length) | length |
--report_to | The list of integrations to report the results and logs to. (default: None) | None |
--ddp_find_unused_parameters | When using distributed training, the value of the flag find_unused_parameters passed to DistributedDataParallel. (default: None) | None |
--ddp_bucket_cap_mb | When using distributed training, the value of the flag bucket_cap_mb passed to DistributedDataParallel. (default: None) | None |
--dataloader_pin_memory | Whether or not to pin memory for DataLoader. (default: True) | True |
--no_dataloader_pin_memory | Disables --dataloader_pin_memory: do not pin memory for DataLoader. (default: False) | False |
--skip_memory_metrics | Whether or not to skip adding memory profiler reports to metrics. (default: True) | True |
--no_skip_memory_metrics | Disables --skip_memory_metrics: add memory profiler reports to metrics. (default: False) | False |
--use_legacy_prediction_loop | Whether or not to use the legacy prediction_loop in the Trainer. (default: False) | False |
--push_to_hub | Whether or not to upload the trained model to the model hub after training. (default: False) | False |
--resume_from_checkpoint | The path to a folder with a valid checkpoint for your model. (default: None) | None |
--hub_model_id | The name of the repository to keep in sync with the local output_dir. (default: None) | None |
--hub_strategy | The hub strategy to use when --push_to_hub is activated. (default: every_save) | every_save |
--hub_token | The token to use to push to the Model Hub. (default: None) | None |
--hub_private_repo | Whether the model repository is private or not. (default: False) | False |
--gradient_checkpointing | If True, use gradient checkpointing to save memory at the expense of slower backward pass. (default: False) | False |
--include_inputs_for_metrics | Whether or not the inputs will be passed to the compute_metrics function. (default: False) | False |
--fp16_backend | Deprecated. Use half_precision_backend instead. (default: auto) | auto |
--push_to_hub_model_id | The name of the repository to which to push the Trainer. (default: None) | None |
--push_to_hub_organization | The name of the organization to which to push the Trainer. (default: None) | None |
--push_to_hub_token | The token to use to push to the Model Hub. (default: None) | None |
--mp_parameters | Used by the SageMaker launcher to send mp-specific args. Ignored in Trainer (default: ) | |
--auto_find_batch_size | Whether to automatically decrease the batch size by half and rerun the training loop each time a CUDA Out-of-Memory error is encountered. (default: False) | False |
--full_determinism | Whether to call enable_full_determinism instead of set_seed for reproducibility in distributed training. Important: this will negatively impact performance, so only use it for debugging. (default: False) | False |
--torchdynamo | This argument is deprecated. Use --torch_compile_backend instead. (default: None) | None |
--ray_scope | The scope to use when doing hyperparameter search with Ray. By default, "last" will be used. Ray will then use the last checkpoint of all trials, compare those, and select the best one. However, other options are also available. See the Ray documentation (https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for more options. (default: last) | last |
--ddp_timeout | Overrides the default timeout for distributed training (value should be given in seconds). (default: 1800) | 1800 |
--torch_compile | If set to True, the model will be wrapped in torch.compile. (default: False) | False |
--torch_compile_backend | Which backend to use with torch.compile. Passing one will trigger model compilation. (default: None) | None |
--torch_compile_mode | Which mode to use with torch.compile. Passing one will trigger model compilation. (default: None) | None |
--xpu_backend | The backend to be used for distributed training on Intel XPU. (default: None) | None |
--sortish_sampler | Whether to use SortishSampler or not. (default: False) | False |
--predict_with_generate | Whether to use generate to calculate generative metrics (ROUGE, BLEU). (default: False) | False |
--generation_max_length | The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration. (default: None) | None |
--generation_num_beams | The num_beams to use on each evaluation loop when predict_with_generate=True. Will default to the num_beams value of the model configuration. (default: None) | None |
--generation_config | Model id, file path, or URL pointing to a GenerationConfig JSON file to use during prediction. (default: None) | None |
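Command-line flags like the ones in the table above are typically consumed by a training script through HfArgumentParser. The sketch below is a minimal, assumed example: the ModelArguments dataclass and its fields are illustrative stand-ins, not the exact dataclasses of any particular script. It shows how such flags map onto dataclasses alongside Seq2SeqTrainingArguments, and how a True-by-default boolean field also yields a matching --no_* flag.

```python
# Minimal sketch of parsing flags like those in the table above.
# ModelArguments is a simplified stand-in for illustration only.
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, Seq2SeqTrainingArguments


@dataclass
class ModelArguments:
    # Maps to --model_name_or_path.
    model_name_or_path: Optional[str] = field(default=None)
    # Maps to --use_fast_tokenizer; because it defaults to True, the parser
    # also exposes a --no_use_fast_tokenizer flag to turn it off.
    use_fast_tokenizer: bool = field(default=True)


parser = HfArgumentParser((ModelArguments, Seq2SeqTrainingArguments))
# Example invocation:
#   python train.py --model_name_or_path t5-small --output_dir ./out --do_train
model_args, training_args = parser.parse_args_into_dataclasses()
print(model_args.model_name_or_path, training_args.output_dir)
```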