DeepSpeed-Chat.ipynb
chunhualiao commented Apr 23, 2023

I was able to train DeepSpeed Chat on a single A100 GPU in Google Colab (I am a paying Colab Pro subscriber with access to premium GPUs). There were a few minor issues to work around, mostly CUDA out-of-memory errors. The patch below addresses them by enabling gradient checkpointing in the single-GPU training scripts for all three steps, which lowers activation memory at the cost of some extra compute.
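
If you hit similar out-of-memory errors, it can help to watch GPU memory while a step is running. A minimal check (assuming nvidia-smi is available on the Colab GPU runtime, which it normally is):

# Print used/total GPU memory every 5 seconds; Ctrl-C to stop.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5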

To apply the following patch, save it to a text file such as my_changes.patch.

Then run:
cd DeepSpeedExamples
patch -p0 <my_changes.patch
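
A dry run first is a good sanity check; --dry-run is a standard GNU patch option that reports whether the hunks apply without modifying any files:

# Verify the hunks apply cleanly before touching the scripts.
patch -p0 --dry-run <my_changes.patch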

diff --git applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
index 8d2865c..3cb36cd 100644
--- applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
+++ applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
@@ -16,5 +16,5 @@ fi
 mkdir -p $OUTPUT
 
 deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-1.3b \
-   --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage $ZERO_STAGE \
+   --gradient_accumulation_steps 8 --gradient_checkpointing --lora_dim 128 --zero_stage $ZERO_STAGE \
    --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log
diff --git applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh
index 435de2c..35ea226 100644
--- applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh
+++ applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_gpu/run_350m.sh
@@ -14,5 +14,5 @@ fi
 mkdir -p $OUTPUT
 
 deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-350m \
-   --num_padding_at_beginning 1 --weight_decay 0.1 --disable_dropout --gradient_accumulation_steps 4 --zero_stage $ZERO_STAGE \
+   --num_padding_at_beginning 1 --weight_decay 0.1 --disable_dropout --gradient_checkpointing --gradient_accumulation_steps 4 --zero_stage $ZERO_STAGE \
    --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log
diff --git applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh
index b33e3ad..d061e6a 100644
--- applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh
+++ applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_gpu/run_1.3b.sh
@@ -22,6 +22,6 @@ mkdir -p $OUTPUT
 deepspeed main.py \
    --actor_model_name_or_path $ACTOR_MODEL_PATH --critic_model_name_or_path $CRITIC_MODEL_PATH \
    --actor_zero_stage $ACTOR_ZERO_STAGE --critic_zero_stage $CRITIC_ZERO_STAGE \
-   --num_padding_at_beginning 1 --gradient_accumulation_steps 2 \
+   --num_padding_at_beginning 1 --gradient_accumulation_steps 2 --actor_gradient_checkpointing --critic_gradient_checkpointing \
    --deepspeed --actor_lora_dim 128 --enable_hybrid_engine --actor_gradient_checkpointing --disable_actor_dropout \
    --output_dir $OUTPUT &> $OUTPUT/training.log
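
After patching, each step is launched via the same single-GPU scripts the diff touches. A rough sketch of the sequence, assuming each script's default argument handling (output directory, ZeRO stage, and the step-3 actor/critic checkpoint paths may need to be passed explicitly depending on the script):

# Step 1: supervised fine-tuning (facebook/opt-1.3b)
cd applications/DeepSpeed-Chat/training/step1_supervised_finetuning
bash training_scripts/single_gpu/run_1.3b.sh

# Step 2: reward model fine-tuning (facebook/opt-350m)
cd ../step2_reward_model_finetuning
bash training_scripts/single_gpu/run_350m.sh

# Step 3: RLHF fine-tuning (expects the step-1 actor and step-2 critic
# checkpoints via $ACTOR_MODEL_PATH / $CRITIC_MODEL_PATH in the script)
cd ../step3_rlhf_finetuning
bash training_scripts/single_gpu/run_1.3b.sh

Each script redirects its output to $OUTPUT/training.log, so that is the first place to look if a step dies with a CUDA out-of-memory error.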
