@sshleifer
Last active August 5, 2021 20:11
  1. remove the optimizer state and save the result to $HOME, for example:
MODEL_DIR=/large_experiments/xlmg/models/moe/52B/xlmg.52b.fp16.bm_none.tps2048.transformer_lm_gpt2_bigger.dl24.demb1024.dffn4096.moe_w0.01.all.share.adam.b2_0.98.eps1e-08.cl0.0.lr0.0003.sqrt_world_size.wu715.dr0.0.atdr0.0.wd0.01.ms2.uf1.mu572204.s1.ngpu128

python scripts/remove_opt_state.py \
    $MODEL_DIR/checkpoint_1_105000/checkpoint_1_105000 \
    checkpoint_1_105000_eval \
    --nproc 4 --resume-failed

Note: you can use a larger process count (--nproc 32) on learnfair, but on devfair the code sometimes hangs when you do.

  2. move the result to somewhere in /large_experiments (I am fuzzy on the chown command)

  3. update model_configs.py
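
The move into /large_experiments could be sketched roughly as below. This is only a sketch under assumptions: the helper name `move_checkpoint`, the destination directory, and the group name are all hypothetical, and the glob assumes the stripped checkpoint is one or more files sharing the output prefix. It uses `chgrp` and `chmod` rather than `chown`, since changing a file's owner typically needs root while changing its group to one you belong to does not.

```shell
# Hypothetical helper, not part of the original notes.
# Usage: move_checkpoint SRC_PREFIX DEST_DIR [GROUP]
move_checkpoint() {
    src_prefix="$1"   # e.g. "$HOME/checkpoint_1_105000_eval"
    dest_dir="$2"     # a directory of your choosing under /large_experiments
    group="${3:-}"    # optional: group that should be able to read the files

    mkdir -p "$dest_dir"
    # Move every file whose name starts with the prefix (assumed layout).
    mv "$src_prefix"* "$dest_dir"/

    # Hand the files to the shared group, if one was given.
    if [ -n "$group" ]; then
        chgrp -R "$group" "$dest_dir"
    fi
    # Group members may read files and traverse directories.
    chmod -R g+rX "$dest_dir"
}
```

It would be called with the prefix written by the first step, a destination directory, and optionally a group name; check the resulting permissions with `ls -l` before pointing model_configs.py at the new location.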
