(cuda118) [RedmondAI] ubuntu@oa-server-4:~/OA/model/model_training$ CUDA_VISIBLE_DEVICES=1 OMP_NUM_THREADS=1 accelerate launch --main_process_port 29501 --config_file configs/accelerate_config.yaml --num_processes 1 trainer_rl.py --configs defaults defaults_rlhf pythia_rlhf oa_df
[2023-07-18 15:22:12,144] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-18 15:22:15,570] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
RNG seed: 2703368087
Found cached dataset json (/home/ubuntu/.cache/huggingface/datasets/toanbku___json/toanbku--oa-df-811abf2c8473a2c5/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
OASST HF dataset toanbku/oa-df: len(train)=19, len(val)=4
len self.tokenizer 50282
[2023-07-18 15:22:19,161] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-18 15:22:19,161] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-18 15:22:19,161] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[RANK 0] Initializing model: toanbku/oa-pythia-12b-sft-df
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 3/3 [00:17<00:00, 5.70s/it]
Resizing embeddings to 50282
Number of trainable parameters: 11841M
wandb: Currently logged in as: toanbku. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/ubuntu/OA/model/model_training/wandb/run-20230718_152646-l4eu07k3
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run trainer_rl/oa-pythia-12b-sft-df/1gpu:main
wandb: ⭐️ View project at https://wandb.ai/toanbku/rlhf
wandb: 🚀 View run at https://wandb.ai/toanbku/rlhf/runs/l4eu07k3
[2023-07-18 15:26:54,809] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /home/ubuntu/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.5278260707855225 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000001, betas=(0.900000, 0.999000), weight_decay=0.000001, adam_w=1
[2023-07-18 15:26:58,261] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown
[2023-07-18 15:27:07,634] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-07-18 15:27:07,636] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-07-18 15:27:07,637] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-07-18 15:27:07,676] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-07-18 15:27:07,676] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-07-18 15:27:07,676] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-07-18 15:27:07,676] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 200000000
[2023-07-18 15:27:07,676] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 200000000
[2023-07-18 15:27:07,676] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: True
[2023-07-18 15:27:07,677] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Rank: 0 partition count [1] and sizes[(11894364162, False)]
[2023-07-18 15:27:55,937] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-07-18 15:27:55,938] [INFO] [utils.py:786:see_memory_usage] MA 22.78 GB Max_MA 22.78 GB CA 22.8 GB Max_CA 23 GB
[2023-07-18 15:27:55,939] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 53.97 GB, percent = 14.3%
[2023-07-18 15:29:17,258] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-07-18 15:29:17,259] [INFO] [utils.py:786:see_memory_usage] MA 22.78 GB Max_MA 22.78 GB CA 22.8 GB Max_CA 23 GB
[2023-07-18 15:29:17,260] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 208.36 GB, percent = 55.1%
[2023-07-18 15:29:17,260] [INFO] [stage_1_and_2.py:488:__init__] optimizer state initialized
[2023-07-18 15:29:17,370] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-07-18 15:29:17,371] [INFO] [utils.py:786:see_memory_usage] MA 22.78 GB Max_MA 22.78 GB CA 22.8 GB Max_CA 23 GB
[2023-07-18 15:29:17,371] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 208.36 GB, percent = 55.1%
[2023-07-18 15:29:17,384] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
[2023-07-18 15:29:17,384] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-07-18 15:29:17,384] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-07-18 15:29:17,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-06], mom=[[0.9, 0.95]]
[2023-07-18 15:29:17,385] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2023-07-18 15:29:17,385] [INFO] [config.py:964:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-07-18 15:29:17,386] [INFO] [config.py:964:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-07-18 15:29:17,386] [INFO] [config.py:964:print] amp_enabled .................. False
[2023-07-18 15:29:17,386] [INFO] [config.py:964:print] amp_params ................... False
[2023-07-18 15:29:17,386] [INFO] [config.py:964:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-07-18 15:29:17,386] [INFO] [config.py:964:print] bfloat16_enabled ............. False
[2023-07-18 15:29:17,386] [INFO] [config.py:964:print] checkpoint_parallel_write_pipeline False
[2023-07-18 15:29:17,386] [INFO] [config.py:964:print] checkpoint_tag_validation_enabled True
[2023-07-18 15:29:17,386] [INFO] [config.py:964:print] checkpoint_tag_validation_fail False
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7febe82fed10>
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] communication_data_type ...... None
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] curriculum_enabled_legacy .... False
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] curriculum_params_legacy ..... False
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] data_efficiency_enabled ...... False
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] dataloader_drop_last ......... False
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] disable_allgather ............ False
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] dump_state ................... False
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] dynamic_loss_scale_args ...... {'init_scale': 4096, 'scale_window': 1000, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] eigenvalue_enabled ........... False
[2023-07-18 15:29:17,387] [INFO] [config.py:964:print] eigenvalue_gas_boundary_resolution 1
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] eigenvalue_layer_num ......... 0
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] eigenvalue_max_iter .......... 100
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] eigenvalue_stability ......... 1e-06
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] eigenvalue_tol ............... 0.01
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] eigenvalue_verbose ........... False
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] elasticity_enabled ........... False
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] fp16_auto_cast ............... False
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] fp16_enabled ................. true
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] fp16_master_weights_and_gradients False
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] global_rank .................. 0
[2023-07-18 15:29:17,388] [INFO] [config.py:964:print] grad_accum_dtype ............. None
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] gradient_accumulation_steps .. 1
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] gradient_clipping ............ 1.0
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] gradient_predivide_factor .... 1.0
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] initial_dynamic_scale ........ 4096
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] load_universal_checkpoint .... False
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] loss_scale ................... 0
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] memory_breakdown ............. False
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] mics_hierarchial_params_gather False
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] mics_shard_size .............. -1
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-07-18 15:29:17,389] [INFO] [config.py:964:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] optimizer_legacy_fusion ...... False
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] optimizer_name ............... None
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] optimizer_params ............. None
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] pld_enabled .................. False
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] pld_params ................... False
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] prescale_gradients ........... False
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] scheduler_name ............... None
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] scheduler_params ............. None
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] sparse_attention ............. None
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] sparse_gradients_enabled ..... False
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] steps_per_print .............. inf
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] train_batch_size ............. 1
[2023-07-18 15:29:17,390] [INFO] [config.py:964:print] train_micro_batch_size_per_gpu 1
[2023-07-18 15:29:17,391] [INFO] [config.py:964:print] use_node_local_storage ....... False
[2023-07-18 15:29:17,391] [INFO] [config.py:964:print] wall_clock_breakdown ......... False
[2023-07-18 15:29:17,391] [INFO] [config.py:964:print] world_size ................... 1
[2023-07-18 15:29:17,391] [INFO] [config.py:964:print] zero_allow_untested_optimizer True
[2023-07-18 15:29:17,391] [INFO] [config.py:964:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2023-07-18 15:29:17,391] [INFO] [config.py:964:print] zero_enabled ................. True
[2023-07-18 15:29:17,391] [INFO] [config.py:964:print] zero_force_ds_cpu_optimizer .. True
[2023-07-18 15:29:17,391] [INFO] [config.py:964:print] zero_optimization_stage ...... 2
[2023-07-18 15:29:17,392] [INFO] [config.py:950:print_user_config] json = {
"fp16": {
"enabled": "true",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1,
"auto_cast": false
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2.000000e+08,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 2.000000e+08,
"contiguous_gradients": true,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu": 1,
"train_batch_size": 1,
"gradient_clipping": 1.0,
"steps_per_print": inf,
"wall_clock_breakdown": false,
"bf16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:17<00:00, 5.87s/it]
Resizing embeddings to 50282
Number of trainable parameters: 11841M
[RANK 0] Starting training
[rollout 0 / 16]: 0%| | 0/16 [00:00<?, ?it/s]You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rollout 16 / 16]: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [02:34<00:00, 9.65s/it]
[RANK 0] Evaluating model
[generation sweep 1/1 | eval batch 7/7]: 100%|██████████████████████████████████████████████████████████████████████████████| 7/7 [00:20<00:00, 2.90s/it]
[RANK 0] Computing rewards
[RANK 0] Summarizing evaluation
Evaluation #0 reward/mean: 5.67
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ prompt ┃ output ┃ reward ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ <|prompter|>Can you tell me about GLaDOS?<|endoftext|><|assistant|> │ Sure, I'd be happy to! │ 5.75 │
│ │ │ │
│ │ GLaDOS is a computer program that was created by the game developer, │ │
│ │ Valve. The purpose of GLaDOS is to serve as a villain in the video │ │
│ │ game Portal, and she appears as a voice in the game's narrator. │ │
│ │ │ │
│ │ GLaDOS is a hostile, malicious computer program that seeks to test │ │
│ │ the player's intelligence and test their willpower. She is often │ │
│ │ sarcastic, taunting the player, and will try to trick or manipulate │ │
│ │ the player into doing her bidding. │ │
│ │ │ │
│ │ GLaDOS is a complex character with a backstory and motivations, and │ │
│ │ her interactions with the player have a significant impact on the │ │
│ │ game's narrative and themes. │ │
│ │ │ │
│ │ Overall, GLaDOS is a memorable character in the Portal universe, and │ │
│ │ her role in the game has made her a popular and beloved figure among │ │
│ │ fans.<|endoftext|> │ │
├───────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────┼────────┤
│ <|prompter|>What is the chemical symbol for │ The chemical symbol for gold is Au.<|endoftext|> │ 6.74 │
│ gold?<|endoftext|><|assistant|> │ │ │
├───────────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────┼────────┤
│ <|prompter|>If you were the President of the United States, what │ As an AI language model, I do not have personal opinions or beliefs. │ 4.57 │
│ would you do?<|endoftext|><|assistant|> │ As the President of the United States, I would work to ensure the │ │
│ │ safety and well-being of the American people by promoting policies │ │
│ │ that create jobs, improve education and health care, and promote │ │
│ │ international cooperation. I would also work to reduce the impact of │ │
│ │ climate change and promote sustainable development.<|endoftext|> │ │
└───────────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────┴────────┘
[losses/total_loss: 0.11 | losses/policy_loss: -0.08 | losses/value_loss: 0.19]: 1%|▏ | 4/640 [04:06<10:39:43, 60.35s/it]
Traceback (most recent call last):
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
return bound(*args, **kwds)
TypeError: clip() received an invalid combination of arguments - got (float, float, out=NoneType), but expected one of:
* (Tensor min, Tensor max)
* (Number min, Number max)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/OA/model/model_training/trainer_rl.py", line 199, in <module>
main()
File "/home/ubuntu/OA/model/model_training/trainer_rl.py", line 184, in main
trainer = trlx.train(
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/trlx/trlx.py", line 126, in train
trainer.learn()
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/trlx/trainer/accelerate_base_trainer.py", line 631, in learn
self.post_backward_callback()
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/trlx/trainer/accelerate_ppo_trainer.py", line 221, in post_backward_callback
self.kl_ctl.update(self.mean_kl, n_steps=self.config.train.batch_size)
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/trlx/models/modeling_ppo.py", line 51, in update
proportional_error = np.clip(current / self.target - 1, -0.2, 0.2) # ϵₜ
File "<__array_function__ internals>", line 200, in clip
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 2180, in clip
return _wrapfunc(a, 'clip', a_min, a_max, out=out, **kwargs)
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 66, in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 43, in _wrapit
result = getattr(asarray(obj), method)(*args, **kwds)
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/torch/_tensor.py", line 970, in __array__
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: \ 0.020 MB of 0.020 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: exp_scores/mean ▁
wandb: exp_scores/running_mean ▁
wandb: exp_scores/running_std ▁
wandb: kl_ctl_value ▁
wandb: learning_rate_group_0 ▁▁▁▁
wandb: losses/policy_loss █▇▄▁
wandb: losses/total_loss █▇▄▁
wandb: losses/value_loss █▇▅▁
wandb: old_values/max ▁▁▁▁
wandb: old_values/mean ▁▁▁▁
wandb: old_values/min ▁▁▁▁
wandb: old_values/std ▁▁▁▁
wandb: padding_percentage ▁▁▁▁
wandb: policy/approx_kl ▁▁▂█
wandb: policy/clipfrac ▁▁▂█
wandb: policy/sqrt_kl ▁
wandb: ratio █▇▆▁
wandb: returns/max ▁▁▁▁
wandb: returns/mean ▁▁▁▁
wandb: returns/min ▁▁▁▁
wandb: returns/std ▁▁▁▁
wandb: reward/mean ▁
wandb: time/backward █▁▁▁
wandb: time/exp ▁
wandb: time/exp_generate ▁
wandb: time/exp_score ▁
wandb: time/forward █▁▁▁
wandb: time/generate ▁
wandb: values/clipfrac ▁▁▁█
wandb: values/max ▁▁▃█
wandb: values/mean ▁▂▃█
wandb: values/min ▁▂▃█
wandb: values/std ▁▁▂█
wandb: values/values_error █▇▅▁
wandb: values/values_mape_error ▂▁▁█
wandb:
wandb: Run summary:
wandb: exp_scores/mean 5.29421
wandb: exp_scores/running_mean 7.13103
wandb: exp_scores/running_std 2.26968
wandb: exp_scores/std nan
wandb: kl_ctl_value 0.1
wandb: learning_rate_group_0 0.0
wandb: losses/policy_loss -0.08002
wandb: losses/total_loss 0.10822
wandb: losses/value_loss 0.18823
wandb: old_values/max 1.08789
wandb: old_values/mean 0.45728
wandb: old_values/min -0.17883
wandb: old_values/std 0.24377
wandb: padding_percentage 0.0
wandb: policy/approx_kl 0.06049
wandb: policy/clipfrac 0.15267
wandb: policy/sqrt_kl 0.01389
wandb: ratio 0.95557
wandb: returns/max 4.0
wandb: returns/mean 0.73096
wandb: returns/min 0.32275
wandb: returns/std 0.67627
wandb: reward/mean 5.67346
wandb: time/backward 58.6278
wandb: time/exp 6.51247
wandb: time/exp_generate 4.38555
wandb: time/exp_score 1.11744
wandb: time/forward 0.1586
wandb: time/generate 20.26932
wandb: values/clipfrac 0.0687
wandb: values/max 1.38965
wandb: values/mean 0.71729
wandb: values/min 0.05487
wandb: values/std 0.26367
wandb: values/values_error 0.35596
wandb: values/values_mape_error 0.5249
wandb:
wandb: 🚀 View run trainer_rl/oa-pythia-12b-sft-df/1gpu:main at: https://wandb.ai/toanbku/rlhf/runs/l4eu07k3
wandb: Synced 6 W&B file(s), 1 media file(s), 3 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230718_152646-l4eu07k3/logs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 56207) of binary: /home/ubuntu/mambaforge/envs/cuda118/bin/python3.10
Traceback (most recent call last):
File "/home/ubuntu/mambaforge/envs/cuda118/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/accelerate/commands/launch.py", line 964, in launch_command
deepspeed_launcher(args)
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
distrib_run.run(args)
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/mambaforge/envs/cuda118/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
trainer_rl.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-18_15:41:40
host : oa-server-4
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 56207)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
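Note on the failure: the run aborts in the adaptive KL-controller update, where `np.clip` in `trlx/models/modeling_ppo.py` (line 51 in the traceback) receives `self.mean_kl` as a CUDA tensor; NumPy then tries `Tensor.__array__`, which cannot convert GPU memory and raises the TypeError above. A minimal workaround sketch, assuming the standard horizon-based adaptive KL update (the `mult` formula below is an assumption, only the `np.clip(...)` expression is copied from the traceback), is to move the value to host memory before the NumPy call:

    import numpy as np
    import torch

    class AdaptiveKLController:
        """Sketch of an adaptive KL coefficient controller (not the upstream trlx code)."""

        def __init__(self, init_kl_coef: float, target: float, horizon: int):
            self.value = init_kl_coef
            self.target = target
            self.horizon = horizon

        def update(self, current, n_steps: int):
            # `current` (the mean KL) may arrive as a CUDA tensor; convert it to a
            # plain Python float so np.clip never calls Tensor.__array__ on GPU
            # memory, which is exactly the TypeError seen in the traceback.
            if torch.is_tensor(current):
                current = current.detach().float().cpu().item()
            proportional_error = np.clip(current / self.target - 1, -0.2, 0.2)
            mult = 1 + proportional_error * n_steps / self.horizon
            self.value *= mult

An equivalent quick fix on the caller side would be to pass `self.mean_kl.item()` (or a `.cpu()` copy) into `self.kl_ctl.update(...)` in `accelerate_ppo_trainer.py` instead of the raw GPU tensor.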