Reference for transformers 4.30.0.dev0
The options below are additional configuration parameters that can be used when training a model with Hugging Face Transformers. They control various aspects of the training process, such as the optimizer to use, data loading settings, memory management, model evaluation, checkpointing, and integration with the Hugging Face Model Hub.
Here is a summary of the high-level functionalities provided by some of the options:
- Distributed training: Options like sharded_ddp and fsdp enable distributed training across multiple devices or machines.
- Optimization: Options like optim, adafactor, and optim_args allow you to choose and configure different optimization algorithms for training.
- Data loading: Options like dataloader_num_workers and dataloader_pin_memory control the parallelism and memory allocation for data loading.
- Evaluation and metrics: Options like do_eval, eval_steps, report_to, and include_inputs_for_metrics control the evaluation process during training and where results and metrics are reported.
- Checkpointing and resuming training: Options like resume_from_checkpoint and load_best_model_at_end enable resuming training from a saved checkpoint or loading the best model found during training.
- Model Hub integration: Options like push_to_hub and hub_model_id allow you to push the trained model to the Hugging Face Model Hub for sharing and deployment.
- Memory management: Options like gradient_checkpointing and skip_memory_metrics control memory usage during training and whether memory profiler reports are added to the metrics.
- Model generation: Options like predict_with_generate, generation_max_length, and generation_num_beams allow you to generate text with the trained model and compute generative metrics such as ROUGE and BLEU.
These options provide flexibility and customization for training models using the Hugging Face Transformers library.
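As a quick illustration before the full table, the same options can also be set programmatically instead of on the command line. The sketch below is a minimal example, not part of the reference itself; the values shown are purely illustrative, and it assumes transformers 4.30 is installed.

```python
# Minimal sketch: setting a few of the options listed below via
# Seq2SeqTrainingArguments instead of command-line flags.
# Values are illustrative only.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./outputs",          # --output_dir
    do_train=True,                   # --do_train
    do_eval=True,                    # --do_eval
    evaluation_strategy="steps",     # --evaluation_strategy
    eval_steps=500,                  # --eval_steps
    per_device_train_batch_size=8,   # --per_device_train_batch_size
    gradient_accumulation_steps=1,   # --gradient_accumulation_steps
    learning_rate=5e-5,              # --learning_rate
    num_train_epochs=3.0,            # --num_train_epochs
    predict_with_generate=True,      # --predict_with_generate
    push_to_hub=False,               # --push_to_hub
)
```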
Argument | Description | Default |
---|---|---|
--model_name_or_path | Path to pretrained model or model identifier from huggingface.co/models (default: None) | None |
--config_name | Pretrained config name or path if not the same as model_name (default: None) | None |
--tokenizer_name | Pretrained tokenizer name or path if not the same as model_name (default: None) | None |
--cache_dir | Where to store the pretrained models downloaded from huggingface.co (default: None) | None |
--use_fast_tokenizer | Whether to use one of the fast tokenizers (backed by the tokenizers library) or not. (default: True) | True |
--no_use_fast_tokenizer | Disables --use_fast_tokenizer: do not use a fast tokenizer (backed by the tokenizers library). (default: False) | False |
--model_revision | The specific model version to use (can be a branch name, tag name, or commit id). (default: main) | main |
--use_auth_token | Will use the token generated when running huggingface-cli login (necessary to use this script with private models). (default: False) | False |
--resize_position_embeddings | Whether to automatically resize the position embeddings if max_source_length exceeds the model's position embeddings. (default: None) | None |
--lang | Language id for summarization. (default: None) | None |
--dataset_name | The name of the dataset to use (via the datasets library). (default: None) | None |
--dataset_config_name | The configuration name of the dataset to use (via the datasets library). (default: None) | None |
--text_column | The name of the column in the datasets containing the full texts (for summarization). (default: None) | None |
--summary_column | The name of the column in the datasets containing the summaries (for summarization). (default: None) | None |
--train_file | The input training data file (a jsonlines or csv file). (default: None) | None |
--validation_file | An optional input evaluation data file to evaluate the metrics (rouge) on (a jsonlines or csv file). (default: None) | None |
--test_file | An optional input test data file to evaluate the metrics (rouge) on (a jsonlines or csv file). (default: None) | None |
--overwrite_cache | Overwrite the cached training and evaluation sets (default: False) | False |
--preprocessing_num_workers | The number of processes to use for the preprocessing. (default: None) | None |
--max_source_length | The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. (default: 1024) | 1024 |
--max_target_length | The maximum total sequence length for target text after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. (default: 128) | 128 |
--val_max_target_length | The maximum total sequence length for validation target text after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. Will default to max_target_length. This argument is also used to override the max_length param of model.generate, which is used during evaluate and predict. (default: None) | None |
--pad_to_max_length | Whether to pad all samples to model maximum sentence length. If False, will pad the samples dynamically when batching to the maximum length in the batch. More efficient on GPU but very bad for TPU. (default: False) | False |
--max_train_samples | For debugging purposes or quicker training, truncate the number of training examples to this value if set. (default: None) | None |
--max_eval_samples | For debugging purposes or quicker training, truncate the number of evaluation examples to this value if set. (default: None) | None |
--max_predict_samples | For debugging purposes or quicker training, truncate the number of prediction examples to this value if set. (default: None) | None |
--num_beams | Number of beams to use for evaluation. This argument will be passed to model.generate, which is used during evaluate and predict. (default: None) | None |
--ignore_pad_token_for_loss | Whether to ignore the tokens corresponding to padded labels in the loss computation or not. (default: True) | True |
--no_ignore_pad_token_for_loss | Disables --ignore_pad_token_for_loss: include the tokens corresponding to padded labels in the loss computation. (default: False) | False |
--source_prefix | A prefix to add before every source text (useful for T5 models). (default: ) | |
--forced_bos_token | The token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART, where the first generated token needs to be the target language token. (default: None) | None |
--output_dir | The output directory where the model predictions and checkpoints will be written. (default: None) | None |
--overwrite_output_dir | Overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory. (default: False) | False |
--do_train | Whether to run training. (default: False) | False |
--do_eval | Whether to run eval on the dev set. (default: False) | False |
--do_predict | Whether to run predictions on the test set. (default: False) | False |
--evaluation_strategy | The evaluation strategy to use. (default: no) | no |
--prediction_loss_only | When performing evaluation and predictions, only returns the loss. (default: False) | False |
--per_device_train_batch_size | Batch size per GPU/TPU core/CPU for training. (default: 8) | 8 |
--per_device_eval_batch_size | Batch size per GPU/TPU core/CPU for evaluation. (default: 8) | 8 |
--per_gpu_train_batch_size | Deprecated, the use of --per_device_train_batch_size is preferred. Batch size per GPU/TPU core/CPU for training. (default: None) | None |
--per_gpu_eval_batch_size | Deprecated, the use of --per_device_eval_batch_size is preferred. Batch size per GPU/TPU core/CPU for evaluation. (default: None) | None |
--gradient_accumulation_steps | Number of update steps to accumulate before performing a backward/update pass. (default: 1) | 1 |
--eval_accumulation_steps | Number of prediction steps to accumulate before moving the tensors to the CPU. (default: None) | None |
--eval_delay | Number of epochs or steps to wait for before the first evaluation can be performed, depending on the evaluation_strategy. (default: 0) | 0 |
--learning_rate | The initial learning rate for AdamW. (default: 5e-05) | 5e-05 |
--weight_decay | Weight decay for AdamW if we apply some. (default: 0.0) | 0.0 |
--adam_beta1 | Beta1 for AdamW optimizer (default: 0.9) | 0.9 |
--adam_beta2 | Beta2 for AdamW optimizer (default: 0.999) | 0.999 |
--adam_epsilon | Epsilon for AdamW optimizer. (default: 1e-08) | 1e-08 |
--max_grad_norm | Max gradient norm. (default: 1.0) | 1.0 |
--num_train_epochs | Total number of training epochs to perform. (default: 3.0) | 3.0 |
--max_steps | If > 0: set total number of training steps to perform. Override num_train_epochs. (default: -1) | -1 |
--lr_scheduler_type | The scheduler type to use. (default: linear) | linear |
--warmup_ratio | Linear warmup over warmup_ratio fraction of total steps. (default: 0.0) | 0.0 |
--warmup_steps | Linear warmup over warmup_steps. (default: 0) | 0 |
--log_level | Logger log level to use on the main node. Possible choices are the log levels as strings: 'debug', 'info', 'warning', 'error', and 'critical', plus a 'passive' level which doesn't set anything and lets the application set the level. Defaults to 'passive'. (default: passive) | passive |
--log_level_replica | Logger log level to use on replica nodes. Same choices and defaults as log_level. (default: warning) | warning |
--log_on_each_node | When doing a multinode distributed training, whether to log once per node or just once on the main node. (default: True) | True |
--no_log_on_each_node | Disables --log_on_each_node: in multinode distributed training, log only once on the main node. (default: False) | False |
--logging_dir | Tensorboard log dir. (default: None) | None |
--logging_strategy | The logging strategy to use. (default: steps) | steps |
--logging_first_step | Log the first global_step (default: False) | False |
--logging_steps | Log every X update steps. Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps. (default: 500) | 500 |
--logging_nan_inf_filter | Filter nan and inf losses for logging. (default: True) | True |
--no_logging_nan_inf_filter | Disables --logging_nan_inf_filter: do not filter nan and inf losses for logging. (default: False) | False |
--save_strategy | The checkpoint save strategy to use. (default: steps) | steps |
--save_steps | Save a checkpoint every X update steps. Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps. (default: 500) | 500 |
--save_total_limit | Limit the total number of checkpoints. Deletes the older checkpoints in the output_dir. Default is unlimited checkpoints. (default: None) | None |
--save_safetensors | Use safetensors saving and loading for state dicts instead of default torch.load and torch.save. (default: False) | False |
--save_on_each_node | When doing multi-node distributed training, whether to save models and checkpoints on each node, or only on the main one (default: False) | False |
--no_cuda | Do not use CUDA even when it is available (default: False) | False |
--use_mps_device | Whether to use Apple Silicon chip-based mps device. (default: False) | False |
--seed | Random seed that will be set at the beginning of training. (default: 42) | 42 |
--data_seed | Random seed to be used with data samplers. (default: None) | None |
--jit_mode_eval | Whether or not to use PyTorch jit trace for inference (default: False) | False |
--use_ipex | Use Intel extension for PyTorch when it is available, installation: 'https://github.com/intel/intel-extension-for-pytorch' (default: False) | False |
--bf16 | Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA architecture or using CPU (no_cuda). This is an experimental API and it may change. (default: False) | False |
--fp16 | Whether to use fp16 (mixed) precision instead of 32-bit (default: False) | False |
--fp16_opt_level | For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. See details at https://nvidia.github.io/apex/amp.html (default: O1) | O1 |
--half_precision_backend | The backend to be used for half precision. (default: auto) | auto |
--bf16_full_eval | Whether to use full bfloat16 evaluation instead of 32-bit. This is an experimental API and it may change. (default: False) | False |
--fp16_full_eval | Whether to use full float16 evaluation instead of 32-bit (default: False) | False |
--tf32 | Whether to enable tf32 mode, available in Ampere and newer GPU architectures. This is an experimental API and it may change. (default: None) | None |
--local_rank | For distributed training: local_rank (default: -1) | -1 |
--ddp_backend | The backend to be used for distributed training (default: None) | None |
--tpu_num_cores | TPU: Number of TPU cores (automatically passed by the launcher script) (default: None) | None |
--tpu_metrics_debug | Deprecated, the use of --debug tpu_metrics_debug is preferred. TPU: Whether to print debug metrics (default: False) | False |
--debug | Whether or not to enable debug mode. Current options: underflow_overflow (Detect underflow and overflow in activations and weights), tpu_metrics_debug (print debug metrics on TPU). (default: ) | |
--dataloader_drop_last | Drop the last incomplete batch if it is not divisible by the batch size. (default: False) | False |
--eval_steps | Run an evaluation every X steps. Should be an integer or a float in the range [0,1). If smaller than 1, will be interpreted as a ratio of total training steps. (default: None) | None |
--dataloader_num_workers | Number of subprocesses to use for data loading (PyTorch only). 0 means that the data will be loaded in the main process. (default: 0) | 0 |
--past_index | If >=0, uses the corresponding part of the output as the past state for the next step. (default: -1) | -1 |
--run_name | An optional descriptor for the run. Notably used for wandb logging. (default: None) | None |
--disable_tqdm | Whether or not to disable the tqdm progress bars. (default: None) | None |
--remove_unused_columns | Remove columns not required by the model when using an nlp.Dataset. (default: True) | True |
--no_remove_unused_columns | Disables --remove_unused_columns: keep columns not required by the model when using an nlp.Dataset. (default: False) | False |
--label_names | The list of keys in your dictionary of inputs that correspond to the labels. (default: None) | None |
--load_best_model_at_end | Whether or not to load the best model found during training at the end of training. (default: False) | False |
--metric_for_best_model | The metric to use to compare two different models. (default: None) | None |
--greater_is_better | Whether the metric_for_best_model should be maximized or not. (default: None) | None |
--ignore_data_skip | When resuming training, whether or not to skip the first epochs and batches to get to the same training data. (default: False) | False |
--sharded_ddp | Whether or not to use sharded DDP training (in distributed training only). The base option should be simple, zero_dp_2, or zero_dp_3, and you can add CPU-offload to zero_dp_2 or zero_dp_3 like this: zero_dp_2 offload or zero_dp_3 offload. You can add auto-wrap to zero_dp_2 or zero_dp_3 with the same syntax: zero_dp_2 auto_wrap or zero_dp_3 auto_wrap. (default: ) | |
--fsdp | Whether or not to use PyTorch Fully Sharded Data Parallel (FSDP) training (in distributed training only). The base option should be full_shard, shard_grad_op, or no_shard, and you can add CPU-offload to full_shard or shard_grad_op like this: full_shard offload or shard_grad_op offload. You can add auto-wrap to full_shard or shard_grad_op with the same syntax: full_shard auto_wrap or shard_grad_op auto_wrap. (default: ) | |
--fsdp_min_num_params | This parameter is deprecated. FSDP's minimum number of parameters for Default Auto Wrapping. (useful only when fsdp field is passed). (default: 0) | 0 |
--fsdp_config | Config to be used with FSDP (PyTorch Fully Sharded Data Parallel). The value is either an fsdp json config file (e.g., fsdp_config.json) or an already loaded json file as dict. (default: None) | None |
--fsdp_transformer_layer_cls_to_wrap | This parameter is deprecated. Transformer layer class name (case-sensitive) to wrap, e.g., BertLayer, GPTJBlock, T5Block, etc. (useful only when fsdp flag is passed). (default: None) | None |
--deepspeed | Enable deepspeed and pass the path to the deepspeed json config file (e.g., ds_config.json) or an already loaded json file as a dict (default: None) | None |
--label_smoothing_factor | The label smoothing epsilon to apply (zero means no label smoothing). (default: 0.0) | 0.0 |
--optim | The optimizer to use. (default: adamw_hf) | adamw_hf |
--optim_args | Optional arguments to supply to the optimizer. (default: None) | None |
--adafactor | Whether or not to replace AdamW by Adafactor. (default: False) | False |
--group_by_length | Whether or not to group samples of roughly the same length together when batching. (default: False) | False |
--length_column_name | Column name with precomputed lengths to use when grouping by length. (default: length) | length |
--report_to | The list of integrations to report the results and logs to. (default: None) | None |
--ddp_find_unused_parameters | When using distributed training, the value of the flag find_unused_parameters passed to DistributedDataParallel. (default: None) | None |
--ddp_bucket_cap_mb | When using distributed training, the value of the flag bucket_cap_mb passed to DistributedDataParallel. (default: None) | None |
--dataloader_pin_memory | Whether or not to pin memory for DataLoader. (default: True) | True |
--no_dataloader_pin_memory | Disables --dataloader_pin_memory: do not pin memory for DataLoader. (default: False) | False |
--skip_memory_metrics | Whether or not to skip adding memory profiler reports to metrics. (default: True) | True |
--no_skip_memory_metrics | Disables --skip_memory_metrics: add memory profiler reports to metrics. (default: False) | False |
--use_legacy_prediction_loop | Whether or not to use the legacy prediction_loop in the Trainer. (default: False) | False |
--push_to_hub | Whether or not to upload the trained model to the model hub after training. (default: False) | False |
--resume_from_checkpoint | The path to a folder with a valid checkpoint for your model. (default: None) | None |
--hub_model_id | The name of the repository to keep in sync with the local output_dir. (default: None) | None |
--hub_strategy | The hub strategy to use when --push_to_hub is activated. (default: every_save) | every_save |
--hub_token | The token to use to push to the Model Hub. (default: None) | None |
--hub_private_repo | Whether the model repository is private or not. (default: False) | False |
--gradient_checkpointing | If True, use gradient checkpointing to save memory at the expense of slower backward pass. (default: False) | False |
--include_inputs_for_metrics | Whether or not the inputs will be passed to the compute_metrics function. (default: False) | False |
--fp16_backend | Deprecated. Use half_precision_backend instead. (default: auto) | auto |
--push_to_hub_model_id | The name of the repository to which to push the Trainer. (default: None) | None |
--push_to_hub_organization | The name of the organization to which to push the Trainer. (default: None) | None |
--push_to_hub_token | The token to use to push to the Model Hub. (default: None) | None |
--mp_parameters | Used by the SageMaker launcher to send mp-specific args. Ignored in Trainer (default: ) | |
--auto_find_batch_size | Whether to automatically decrease the batch size by half and rerun the training loop each time a CUDA Out-of-Memory error is encountered. (default: False) | False |
--full_determinism | Whether to call enable_full_determinism instead of set_seed for reproducibility in distributed training. Important: this will negatively impact performance, so only use it for debugging. (default: False) | False |
--torchdynamo | This argument is deprecated. Use --torch_compile_backend instead. (default: None) | None |
--ray_scope | The scope to use when doing hyperparameter search with Ray. By default, "last" will be used. Ray will then use the last checkpoint of all trials, compare those, and select the best one. However, other options are also available. See the Ray documentation (https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial) for more options. (default: last) | last |
--ddp_timeout | Overrides the default timeout for distributed training (value should be given in seconds). (default: 1800) | 1800 |
--torch_compile | If set to True, the model will be wrapped in torch.compile. (default: False) | False |
--torch_compile_backend | Which backend to use with torch.compile. Passing one will trigger model compilation. (default: None) | None |
--torch_compile_mode | Which mode to use with torch.compile. Passing one will trigger model compilation. (default: None) | None |
--xpu_backend | The backend to be used for distributed training on Intel XPU. (default: None) | None |
--sortish_sampler | Whether to use SortishSampler or not. (default: False) | False |
--predict_with_generate | Whether to use generate to calculate generative metrics (ROUGE, BLEU). (default: False) | False |
--generation_max_length | The max_length to use on each evaluation loop when predict_with_generate=True. Will default to the max_length value of the model configuration. (default: None) | None |
--generation_num_beams | The num_beams to use on each evaluation loop when predict_with_generate=True. Will default to the num_beams value of the model configuration. (default: None) | None |
--generation_config | Model id, file path, or URL pointing to a GenerationConfig JSON file to use during prediction. (default: None) | None |
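Command-line flags like the ones in the table above are typically consumed by a training script through HfArgumentParser. The sketch below is a minimal, assumed example: the ModelArguments dataclass and its fields are illustrative stand-ins, not the exact dataclasses of any particular script. It shows how such flags map onto dataclasses alongside Seq2SeqTrainingArguments, and how a True-by-default boolean field also yields a matching --no_* flag.

```python
# Minimal sketch of parsing flags like those in the table above.
# ModelArguments is a simplified stand-in for illustration only.
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, Seq2SeqTrainingArguments


@dataclass
class ModelArguments:
    # Maps to --model_name_or_path.
    model_name_or_path: Optional[str] = field(default=None)
    # Maps to --use_fast_tokenizer; because it defaults to True, the parser
    # also exposes a --no_use_fast_tokenizer flag to turn it off.
    use_fast_tokenizer: bool = field(default=True)


parser = HfArgumentParser((ModelArguments, Seq2SeqTrainingArguments))
# Example invocation:
#   python train.py --model_name_or_path t5-small --output_dir ./out --do_train
model_args, training_args = parser.parse_args_into_dataclasses()
print(model_args.model_name_or_path, training_args.output_dir)
```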