@toanbku
Created July 21, 2023 03:54
(base) [RedmondAI] ubuntu@oa-server-8:~/OA/model/model_training$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 OMP_NUM_THREADS=1 accelerate launch --main_process_port 29501 --config_file configs/accelerate_config.yaml --num_processes 6 trainer_rl.py --configs defaults defaults_rlhf pythia_rlhf oasst_df_x1000
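The launch line above starts one worker process per GPU listed in CUDA_VISIBLE_DEVICES (six in total), rendezvousing on port 29501. A minimal sketch of what each launched worker sees, assuming the usual torch.distributed environment variables that `accelerate launch` exports:

import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))     # 0..5, one per visible GPU
world_size = int(os.environ.get("WORLD_SIZE", 1))     # 6 for this run
master_port = os.environ.get("MASTER_PORT", "29501")  # from --main_process_port
print(f"worker {local_rank}/{world_size}, rendezvous port {master_port}")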
[2023-07-21 03:19:15,948] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-21 03:19:19,845] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-21 03:19:19,845] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-21 03:19:19,864] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-21 03:19:19,865] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-21 03:19:19,873] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-21 03:19:19,979] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
RNG seed: 2703368087
RNG seed: 2703368087
RNG seed: 2703368087
RNG seed: 2703368087
RNG seed: 2703368087
Found cached dataset json (/home/ubuntu/.cache/huggingface/datasets/toanbku___json/toanbku--oa-df-x1000-330ee3ae28b11a32/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Found cached dataset json (/home/ubuntu/.cache/huggingface/datasets/toanbku___json/toanbku--oa-df-x1000-330ee3ae28b11a32/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Found cached dataset json (/home/ubuntu/.cache/huggingface/datasets/toanbku___json/toanbku--oa-df-x1000-330ee3ae28b11a32/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Found cached dataset json (/home/ubuntu/.cache/huggingface/datasets/toanbku___json/toanbku--oa-df-x1000-330ee3ae28b11a32/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Found cached dataset json (/home/ubuntu/.cache/huggingface/datasets/toanbku___json/toanbku--oa-df-x1000-330ee3ae28b11a32/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
RNG seed: 2703368087
Found cached dataset json (/home/ubuntu/.cache/huggingface/datasets/toanbku___json/toanbku--oa-df-x1000-330ee3ae28b11a32/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
OASST HF dataset toanbku/oa-df-x1000: len(train)=1970, len(val)=330
OASST HF dataset toanbku/oa-df-x1000: len(train)=1970, len(val)=330
OASST HF dataset toanbku/oa-df-x1000: len(train)=1970, len(val)=330
OASST HF dataset toanbku/oa-df-x1000: len(train)=1970, len(val)=330
OASST HF dataset toanbku/oa-df-x1000: len(train)=1970, len(val)=330
len self.tokenizer 50282
[2023-07-21 03:19:25,472] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-21 03:19:25,472] [INFO] [comm.py:616:init_distributed] cdb=None
len self.tokenizer 50282
[2023-07-21 03:19:25,477] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-21 03:19:25,477] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-21 03:19:25,477] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
len self.tokenizer 50282
[2023-07-21 03:19:25,539] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-21 03:19:25,539] [INFO] [comm.py:616:init_distributed] cdb=None
len self.tokenizer 50282
[2023-07-21 03:19:25,551] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-21 03:19:25,552] [INFO] [comm.py:616:init_distributed] cdb=None
len self.tokenizer 50282
[2023-07-21 03:19:25,567] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-21 03:19:25,567] [INFO] [comm.py:616:init_distributed] cdb=None
OASST HF dataset toanbku/oa-df-x1000: len(train)=1970, len(val)=330
len self.tokenizer 50282
[2023-07-21 03:19:26,154] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-21 03:19:26,154] [INFO] [comm.py:616:init_distributed] cdb=None
[RANK 0] Initializing model: toanbku/oa-pythia-12b-sft-df
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:20<00:00, 6.87s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:20<00:00, 6.94s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:24<00:00, 8.06s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:28<00:00, 9.44s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:26<00:00, 8.74s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:27<00:00, 9.26s/it]
Resizing embeddings to 50282
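The "Resizing embeddings to 50282" lines pair with "len self.tokenizer 50282" above: each rank grows the model's embedding matrix to match the extended tokenizer vocabulary. A sketch of the standard Hugging Face call this corresponds to (assumed from the log, not copied from the repo):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("toanbku/oa-pythia-12b-sft-df")
model = AutoModelForCausalLM.from_pretrained(
    "toanbku/oa-pythia-12b-sft-df", torch_dtype=torch.float16
)
model.resize_token_embeddings(len(tokenizer))  # len(tokenizer) == 50282 per the log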
Number of trainable parameters: 11841M
Number of trainable parameters: 11841M
Downloading (…)model.bin.index.json: 100%|███████████████████████████████████| 47.3k/47.3k [00:00<00:00, 128MB/s]
[2023-07-21 03:24:00,596] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-07-21 03:24:00,626] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /home/ubuntu/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /home/ubuntu/.cache/torch_extensions/py310_cu118/cpu_adam...
Using /home/ubuntu/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Number of trainable parameters: 11841M
Number of trainable parameters: 11841M
[2023-07-21 03:24:10,252] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Number of trainable parameters: 11841M
[2023-07-21 03:24:11,981] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Number of trainable parameters: 11841M
[1/3] /home/ubuntu/mambaforge/bin/nvcc -ccbin /home/ubuntu/mambaforge/bin/x86_64-conda-linux-gnu-cc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntu/mambaforge/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/ubuntu/mambaforge/include -isystem /home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/include -isystem /home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/include/THC -isystem /home/ubuntu/mambaforge/include -isystem /home/ubuntu/mambaforge/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /home/ubuntu/mambaforge/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2023-07-21 03:24:12,718] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /home/ubuntu/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
wandb: Currently logged in as: toanbku. Use `wandb login --relogin` to force relogin
Using /home/ubuntu/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /home/ubuntu/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/ubuntu/OA/model/model_training/wandb/run-20230721_032413-7g1yavb6
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run trainer_rl/oa-pythia-12b-sft-df/6gpus:main
wandb: ⭐️ View project at https://wandb.ai/toanbku/rlhf
wandb: 🚀 View run at https://wandb.ai/toanbku/rlhf/runs/7g1yavb6
[2023-07-21 03:24:21,849] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /home/ubuntu/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
[2/3] /home/ubuntu/mambaforge/bin/x86_64-conda-linux-gnu-c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntu/mambaforge/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/home/ubuntu/mambaforge/include -isystem /home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/include -isystem /home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/include/TH -isystem /home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/include/THC -isystem /home/ubuntu/mambaforge/include -isystem /home/ubuntu/mambaforge/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/home/ubuntu/mambaforge/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/ubuntu/mambaforge/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] /home/ubuntu/mambaforge/bin/x86_64-conda-linux-gnu-c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/home/ubuntu/mambaforge/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 33.74172616004944 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 22.379701375961304 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 24.125978469848633 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 33.770209074020386 seconds
Time to load cpu_adam op: 12.540481567382812 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 21.731924295425415 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000001, betas=(0.900000, 0.999000), weight_decay=0.000001, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000001, betas=(0.900000, 0.999000), weight_decay=0.000001, adam_w=1
[2023-07-21 03:24:35,325] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000001, betas=(0.900000, 0.999000), weight_decay=0.000001, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000001, betas=(0.900000, 0.999000), weight_decay=0.000001, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000001, betas=(0.900000, 0.999000), weight_decay=0.000001, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000001, betas=(0.900000, 0.999000), weight_decay=0.000001, adam_w=1
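The repeated "Adam Optimizer #0" banners come from DeepSpeed's just-compiled CPU Adam extension, one per rank. How the printed config maps onto the optimizer object, as an illustrative sketch (the real construction happens inside the training stack; `net` here is a stand-in model):

import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

net = torch.nn.Linear(8, 8)   # stand-in for the 12B policy model
optimizer = DeepSpeedCPUAdam(
    net.parameters(),
    lr=1e-6,                  # printed as alpha=0.000001
    betas=(0.9, 0.999),
    weight_decay=1e-6,
    adamw_mode=True,          # printed as adam_w=1
)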
[2023-07-21 03:24:51,078] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-07-21 03:24:51,081] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-07-21 03:24:51,081] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-07-21 03:24:51,121] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-07-21 03:24:51,121] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-07-21 03:24:51,122] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-07-21 03:24:51,122] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 200000000
[2023-07-21 03:24:51,122] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 200000000
[2023-07-21 03:24:51,122] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: True
[2023-07-21 03:24:51,122] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Rank: 5 partition count [6] and sizes[(1982394028, False)]
Rank: 3 partition count [6] and sizes[(1982394028, False)]
Rank: 1 partition count [6] and sizes[(1982394028, False)]
Rank: 0 partition count [6] and sizes[(1982394028, False)]
Rank: 4 partition count [6] and sizes[(1982394028, False)]
Rank: 2 partition count [6] and sizes[(1982394028, False)]
[2023-07-21 03:25:29,511] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-07-21 03:25:29,514] [INFO] [utils.py:786:see_memory_usage] MA 22.78 GB Max_MA 22.78 GB CA 22.8 GB Max_CA 23 GB
[2023-07-21 03:25:29,514] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 83.13 GB, percent = 11.0%
[2023-07-21 03:25:46,648] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-07-21 03:25:46,650] [INFO] [utils.py:786:see_memory_usage] MA 22.78 GB Max_MA 22.78 GB CA 22.8 GB Max_CA 23 GB
[2023-07-21 03:25:46,650] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 208.3 GB, percent = 27.6%
[2023-07-21 03:25:46,650] [INFO] [stage_1_and_2.py:493:__init__] optimizer state initialized
[2023-07-21 03:25:46,767] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-07-21 03:25:46,769] [INFO] [utils.py:786:see_memory_usage] MA 22.78 GB Max_MA 22.78 GB CA 22.8 GB Max_CA 23 GB
[2023-07-21 03:25:46,769] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 208.34 GB, percent = 27.6%
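The jump in CPU virtual memory across optimizer-state initialization (83.13 GB to 208.3 GB) is the ZeRO-2 CPU offload at work: each rank allocates fp32 Adam states for its parameter partition in host RAM. A back-of-envelope check, assuming the usual 12 bytes of fp32 state per parameter:

params = 11_841_000_000          # "Number of trainable parameters: 11841M"
bytes_per_param = 4 + 4 + 4      # fp32 master weights + Adam momentum + variance
print(params * bytes_per_param / 2**30)  # ~132 GiB, near the ~125 GB jump observed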
[2023-07-21 03:25:46,772] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
[2023-07-21 03:25:46,773] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-07-21 03:25:46,773] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-07-21 03:25:46,773] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-06], mom=[[0.9, 0.95]]
[2023-07-21 03:25:46,774] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2023-07-21 03:25:46,774] [INFO] [config.py:964:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-07-21 03:25:46,774] [INFO] [config.py:964:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-07-21 03:25:46,775] [INFO] [config.py:964:print] amp_enabled .................. False
[2023-07-21 03:25:46,775] [INFO] [config.py:964:print] amp_params ................... False
[2023-07-21 03:25:46,775] [INFO] [config.py:964:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-07-21 03:25:46,775] [INFO] [config.py:964:print] bfloat16_enabled ............. False
[2023-07-21 03:25:46,775] [INFO] [config.py:964:print] checkpoint_parallel_write_pipeline False
[2023-07-21 03:25:46,775] [INFO] [config.py:964:print] checkpoint_tag_validation_enabled True
[2023-07-21 03:25:46,775] [INFO] [config.py:964:print] checkpoint_tag_validation_fail False
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f9d7868f010>
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] communication_data_type ...... None
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] curriculum_enabled_legacy .... False
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] curriculum_params_legacy ..... False
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] data_efficiency_enabled ...... False
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] dataloader_drop_last ......... False
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] disable_allgather ............ False
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] dump_state ................... False
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] dynamic_loss_scale_args ...... {'init_scale': 4096, 'scale_window': 1000, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] eigenvalue_enabled ........... False
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] eigenvalue_gas_boundary_resolution 1
[2023-07-21 03:25:46,776] [INFO] [config.py:964:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] eigenvalue_layer_num ......... 0
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] eigenvalue_max_iter .......... 100
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] eigenvalue_stability ......... 1e-06
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] eigenvalue_tol ............... 0.01
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] eigenvalue_verbose ........... False
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] elasticity_enabled ........... False
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] fp16_auto_cast ............... False
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] fp16_enabled ................. true
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] fp16_master_weights_and_gradients False
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] global_rank .................. 0
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] grad_accum_dtype ............. None
[2023-07-21 03:25:46,777] [INFO] [config.py:964:print] gradient_accumulation_steps .. 1
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] gradient_clipping ............ 1.0
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] gradient_predivide_factor .... 1.0
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] initial_dynamic_scale ........ 4096
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] load_universal_checkpoint .... False
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] loss_scale ................... 0
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] memory_breakdown ............. False
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] mics_hierarchial_params_gather False
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] mics_shard_size .............. -1
[2023-07-21 03:25:46,778] [INFO] [config.py:964:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] optimizer_legacy_fusion ...... False
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] optimizer_name ............... None
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] optimizer_params ............. None
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] pld_enabled .................. False
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] pld_params ................... False
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] prescale_gradients ........... False
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] scheduler_name ............... None
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] scheduler_params ............. None
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] sparse_attention ............. None
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] sparse_gradients_enabled ..... False
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] steps_per_print .............. inf
[2023-07-21 03:25:46,779] [INFO] [config.py:964:print] train_batch_size ............. 6
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] train_micro_batch_size_per_gpu 1
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] use_node_local_storage ....... False
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] wall_clock_breakdown ......... False
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] world_size ................... 6
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] zero_allow_untested_optimizer True
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] zero_enabled ................. True
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] zero_force_ds_cpu_optimizer .. True
[2023-07-21 03:25:46,780] [INFO] [config.py:964:print] zero_optimization_stage ...... 2
[2023-07-21 03:25:46,780] [INFO] [config.py:950:print_user_config] json = {
"fp16": {
"enabled": "true",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1,
"auto_cast": false
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2.000000e+08,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 2.000000e+08,
"contiguous_gradients": true,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"gradient_accumulation_steps": 1,
"train_micro_batch_size_per_gpu": 1,
"train_batch_size": 6,
"gradient_clipping": 1.0,
"steps_per_print": inf,
"wall_clock_breakdown": false,
"bf16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
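The batch-size fields in the JSON above must satisfy DeepSpeed's invariant train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size, which is where the printed value of 6 comes from:

micro_batch_per_gpu = 1
grad_accum_steps = 1
world_size = 6
assert micro_batch_per_gpu * grad_accum_steps * world_size == 6  # "train_batch_size": 6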
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:18<00:00, 6.02s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:22<00:00, 7.37s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:21<00:00, 7.24s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:21<00:00, 7.16s/it]
Resizing embeddings to 50282
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:22<00:00, 7.46s/it]
Resizing embeddings to 50282
Number of trainable parameters: 11841M
Loading checkpoint shards:  33%|█████████████████                                  | 1/3 [00:09<00:19, 9.60s/it]
Number of trainable parameters: 11841M
Number of trainable parameters: 11841M
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 3/3 [00:22<00:00, 7.66s/it]
Resizing embeddings to 50282
Number of trainable parameters: 11841M
Number of trainable parameters: 11841M
[RANK 0] Starting training
[rollout 0 / 16]:   0%|                                                                  | 0/16 [00:00<?, ?it/s]
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Number of trainable parameters: 11841M
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
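The tokenizer warning repeated above is advisory only: with a fast (Rust-backed) tokenizer, a single __call__ tokenizes and pads a whole batch in one pass, instead of encoding texts one by one and padding afterwards. A sketch of the two patterns, using this run's SFT checkpoint as the assumed tokenizer source:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("toanbku/oa-pythia-12b-sft-df")
tok.pad_token = tok.pad_token or tok.eos_token  # make sure padding is defined

texts = ["first prompt", "a somewhat longer second prompt"]

# Pattern the warning discourages: per-text encode, then a separate pad step.
slow = tok.pad({"input_ids": [tok.encode(t) for t in texts]}, return_tensors="pt")

# Preferred: one __call__ that tokenizes and pads the batch together.
fast = tok(texts, padding=True, return_tensors="pt")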
[rollout 16 / 16]: 100%|█████████████████████████████████████████████████████████| 16/16 [05:07<00:00, 19.22s/it]
[RANK 0] Evaluating model
[generation sweep 1/1 | eval batch 11/11]: 100%|██████████████████████████████████████████████| 11/11 [00:59<00:00, 5.40s/it]
[RANK 0] Computing rewards
Traceback (most recent call last):
File "/home/ubuntu/OA/model/model_training/trainer_rl.py", line 199, in <module>
main()
File "/home/ubuntu/OA/model/model_training/trainer_rl.py", line 184, in main
trainer = trlx.train(
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/trlx/trlx.py", line 126, in train
trainer.learn()
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/trlx/trainer/accelerate_base_trainer.py", line 539, in learn
results = self.evaluate()
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/trlx/trainer/accelerate_base_trainer.py", line 430, in evaluate
rewards = self.reward_fn(
TypeError: create_reward_fn.<locals>.reward_fn() got an unexpected keyword argument 'tokenizer'
Traceback (most recent call last):
File "/home/ubuntu/OA/model/model_training/trainer_rl.py", line 199, in <module>
main()
File "/home/ubuntu/OA/model/model_training/trainer_rl.py", line 184, in main
trainer = trlx.train(
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/trlx/trlx.py", line 126, in train
trainer.learn()
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/trlx/trainer/accelerate_base_trainer.py", line 539, in learn
results = self.evaluate()
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/trlx/trainer/accelerate_base_trainer.py", line 430, in evaluate
rewards = self.reward_fn(
TypeError: create_reward_fn.<locals>.reward_fn() got an unexpected keyword argument 'tokenizer'
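The TypeError is the actual failure here: trlx's evaluate() invokes self.reward_fn with a tokenizer= keyword argument that the reward_fn closure built by create_reward_fn in trainer_rl.py does not accept. A likely shape of the fix, as an illustrative patch rather than the repository's actual code: have the closure swallow extra keywords so it stays compatible across trlx versions.

def create_reward_fn(rank_model, rank_tokenizer):  # names assumed for illustration
    def reward_fn(samples, prompts=None, outputs=None, **kwargs):
        # kwargs absorbs tokenizer=... and whatever else this trlx version
        # forwards; score `samples` with the rank model as before.
        ...
    return reward_fn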
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: | 0.023 MB of 0.023 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: exp_scores/mean ▁
wandb: exp_scores/running_mean ▁
wandb: exp_scores/running_std ▁
wandb: exp_scores/std ▁
wandb: kl_ctl_value ▁
wandb: policy/sqrt_kl ▁
wandb: time/exp ▁
wandb: time/exp_generate ▁
wandb: time/exp_score ▁
wandb:
wandb: Run summary:
wandb: exp_scores/mean 4.85738
wandb: exp_scores/running_mean 3.56623
wandb: exp_scores/running_std 4.1059
wandb: exp_scores/std 5.84514
wandb: kl_ctl_value 0.1
wandb: policy/sqrt_kl 0.01649
wandb: time/exp 25.99541
wandb: time/exp_generate 23.5927
wandb: time/exp_score 0.55094
wandb:
wandb: 🚀 View run trainer_rl/oa-pythia-12b-sft-df/6gpus:main at: https://wandb.ai/toanbku/rlhf/runs/7g1yavb6
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230721_032413-7g1yavb6/logs
Exception in thread NetStatThr:
Traceback (most recent call last):
File "/home/ubuntu/mambaforge/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/ubuntu/mambaforge/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 255, in check_network_status
self._loop_check_status(
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 211, in _loop_check_status
local_handle = request()
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/interface/interface.py", line 795, in deliver_network_status
return self._deliver_network_status(status)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 601, in _deliver_network_status
return self._deliver_record(record)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 560, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10083 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10084 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10085 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10086 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 10087 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 10083 via 15, forcefully exiting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 10084 via 15, forcefully exiting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 10085 via 15, forcefully exiting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 10086 via 15, forcefully exiting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 10087 via 15, forcefully exiting via 9
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10082) of binary: /home/ubuntu/mambaforge/bin/python3.10
Traceback (most recent call last):
File "/home/ubuntu/mambaforge/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/accelerate/commands/launch.py", line 964, in launch_command
deepspeed_launcher(args)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
distrib_run.run(args)
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/mambaforge/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
trainer_rl.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-21_03:37:13
host : oa-server-8
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10082)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html