edp1096/alpaca_train_run.md

## alpaca_train_run.md

      
Training Stanford Alpaca


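Run on runpod.io
Uses LLaMA-7B
The LLaMA model is downloaded to the home directory, outside /workspace, so allocate about 15 GB of container disk
Allocate at least 30 GB of volume disk: the files saved to output after fine-tuning take up about 25 GB
The Hugging Face model does not need to be downloaded separately; it is fetched on the first run through --model_name_or_path
With A100 80G x 4, the estimated time shown at the 1% mark of the first run was 5:37:57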
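
The disk sizes above can be checked before starting a run. What follows is a minimal sketch, not part of the original notes: `has_free_gb` is a hypothetical helper name, and it assumes a `df` that accepts the POSIX `-P` and `-k` options.

```shell
# Hypothetical helper: succeed if the filesystem holding PATH has at least
# MIN_GB gigabytes available.
# Usage: has_free_gb PATH MIN_GB
has_free_gb() {
    path=$1
    min_gb=$2
    # df -Pk prints one line per filesystem in 1 KiB blocks;
    # column 4 of the second line is the available space.
    free_gb=$(df -Pk "$path" | awk 'NR==2 {print int($4 / 1024 / 1024)}')
    [ "$free_gb" -ge "$min_gb" ]
}
```

For example, `has_free_gb /workspace 30 || echo "volume too small"` checks the volume, and `has_free_gb "$HOME" 15 || echo "container disk too small"` checks the container disk.
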

Install vi

apt update
apt install vim
Go to Workspace

cd /workspace
Git LFS

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt install git-lfs
git lfs install
Clone stanford alpaca

git clone https://github.com/tatsu-lab/stanford_alpaca.git
cd stanford_alpaca
Install transformers from git & requirements.txt

pip install git+https://github.com/huggingface/transformers.git
pip install -r requirements.txt
Train alpaca

torchrun --nproc_per_node=4 --master_port=8080 train.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ../output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
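
The batch-related flags in the command above combine into one global batch size per optimizer step. The arithmetic can be sketched as follows; `effective_batch` and `approx_total_steps` are hypothetical helper names, and the step count is only an approximation because the Trainer's exact rounding may differ.

```shell
# Global batch size = number of GPUs x per-device batch x accumulation steps.
# Usage: effective_batch GPUS PER_DEVICE_BATCH GRAD_ACCUM_STEPS
effective_batch() {
    echo $(( $1 * $2 * $3 ))
}

# Rough number of optimizer steps for a whole run (integer division).
# Usage: approx_total_steps NUM_EXAMPLES NUM_EPOCHS GLOBAL_BATCH
approx_total_steps() {
    echo $(( $1 * $2 / $3 ))
}

# Values taken from the torchrun command: 4 GPUs, batch 4, accumulation 8.
effective_batch 4 4 8   # prints 128
```

Pass the number of records in alpaca_data.json as `NUM_EXAMPLES`. If the estimate comes out below the `--save_steps` value, the run would presumably finish without writing an intermediate checkpoint.
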
Troubleshooting

Change tokenizer_class from LLaMATokenizer to LlamaTokenizer in tokenizer_config.json

vi /root/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348/tokenizer_config.json
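
The same edit can be made without opening an editor. This is a minimal sketch using `sed`; `fix_tokenizer_class` is a hypothetical helper name, and it assumes the old class name appears in the file only where it should be replaced.

```shell
# Hypothetical helper: rewrite the tokenizer class name in a
# tokenizer_config.json file, keeping a .bak copy of the original.
# Usage: fix_tokenizer_class PATH_TO_TOKENIZER_CONFIG
fix_tokenizer_class() {
    config=$1
    sed -i.bak 's/LLaMATokenizer/LlamaTokenizer/g' "$config"
}
```

Call it with the tokenizer_config.json path from the `vi` command above, then confirm the change with `grep tokenizer_class` on the same file.
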
Exception: Could not find the transformer layer class to wrap in the model


tatsu-lab/stanford_alpaca#58 (comment)

Change the value of fsdp_transformer_layer_cls_to_wrap to LlamaDecoderLayer (already applied in the training command above)


Put wandb in offline mode when running under nohup

wandb offline
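
A background launch can combine both steps. The sketch below is an assumption-laden example rather than the original workflow: `run_detached` is a hypothetical helper name, and it sets the `WANDB_MODE=offline` environment variable for the launched process instead of calling `wandb offline` first.

```shell
# Hypothetical helper: run a command under nohup in the background with
# wandb in offline mode, sending stdout and stderr to a log file.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
    log=$1
    shift
    # The variable applies to this command only, not to the calling shell.
    WANDB_MODE=offline nohup "$@" > "$log" 2>&1 &
}
```

For example, `run_detached train.log torchrun --nproc_per_node=4 --master_port=8080 train.py` followed by the remaining training flags starts the run, and `tail -f train.log` follows its progress.
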