@rahulvigneswaran
Last active February 7, 2022 12:52
This gist is for newbies on server 93 (DGX) at IIT Hyderabad.

CPU

  • Don't train on CPUs. This drastically slows down everyone else's experiments. Except for certain edge cases, always run your experiments on GPUs.

GPU

  • Before you start a run on a GPU that someone else is already using, first run your experiment on an empty GPU to check how much GPU memory it occupies. If that exceeds the memory still available on the shared GPU (use gpustat --watch to check the usage), PLEASE DON'T RUN ON IT. This will crash the other person's runs.
  • But I need a GPU ASAP to run, what should I do?
    • 1st way: Contact the person who is using that GPU via Google Chat/Slack. They will almost always free up some space for you unless they are up against a deadline. I generally use gpustat --watch to find the name/roll no. and then contact them through their IITH mail ID or Slack.
    • 2nd way (The Best Way): Use half the batch size and do gradient accumulation over two steps, so the effective batch size stays the same while the memory footprint drops (https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3).
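The gradient accumulation idea from the 2nd way can be illustrated with a small sketch. This is not taken from the linked gist: it is a toy example in pure Python, using a hypothetical one-parameter model y = w * x with a mean-squared-error loss, and it assumes the batch splits evenly into micro-batches. It only shows why two half-size batches, each scaled by 1/2, give the same gradient as one full batch.

```python
# Minimal sketch: gradient accumulation on a toy model y = w * x
# trained with mean-squared error. Pure Python, no GPU needed.
# All names here are illustrative, not from any library.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, accum_steps):
    """Split the batch into `accum_steps` equal micro-batches and accumulate.

    Assumes len(xs) is divisible by accum_steps.
    """
    micro = len(xs) // accum_steps
    grad = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro, (i + 1) * micro
        # Scale each micro-batch gradient by 1/accum_steps so that the
        # accumulated sum equals the mean gradient over the full batch.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

g_full = mse_grad(w, xs, ys)                           # one batch of 4
g_accum = accumulated_grad(w, xs, ys, accum_steps=2)   # two batches of 2
print(g_full, g_accum)
```

In a PyTorch training loop the same idea usually means dividing the loss by the number of accumulation steps, calling loss.backward() on every micro-batch, and calling optimizer.step() and optimizer.zero_grad() only once every accumulation cycle. Note that layers such as batch normalization still see the smaller micro-batch, so results may not be exactly identical to a full-batch run.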

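The memory check described in the GPU section can also be scripted. The sketch below is one possible approach, not part of the original workflow: it assumes the per-GPU usage is available as "index, used, total" rows in MiB, and the sample numbers, the helper names and the 1024 MiB headroom are all made up for illustration.

```python
# Minimal sketch: decide which GPUs have room for a job.
# The input format (index, used MiB, total MiB per line) is an assumption;
# the sample values below are invented for illustration.

def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (MiB) into {index: free_mib}."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        free[index] = total - used
    return free

def gpus_that_fit(free_by_gpu, needed_mib, headroom_mib=1024):
    """Return GPU indices whose free memory covers the job plus some headroom."""
    return sorted(
        index
        for index, free in free_by_gpu.items()
        if free >= needed_mib + headroom_mib
    )

sample = """0, 30100, 32510
1, 12000, 32510
2, 3, 32510"""

free_by_gpu = parse_gpu_memory(sample)
# needed_mib is what you measured by first running on an empty GPU.
candidates = gpus_that_fit(free_by_gpu, needed_mib=9000)
print(candidates)
```

On the server, rows in this format can be produced with nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits, assuming nvidia-smi is on your PATH. The headroom matters because the other person's memory usage can still grow during their run.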
Disk Space

  • Once in a while, use the command ncdu to find your unused large files and delete them. There is nothing worse than starting a run before going to sleep and waking up to find that it crashed because the disk is full.
    • When using ncdu, you can press d to delete the selected dir/file from within ncdu.
    • If ncdu shows that the conda dir is occupying the most space, use conda clean --all to remove cached packages and tarballs.
  • If you are using the account of a PhD student or someone else, first make a directory in your name and do everything from inside it, so that it's easy for them to identify your files.
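To avoid the overnight crash mentioned above, a run can check the free disk space before it starts. This is a minimal sketch using only the Python standard library. The helper names and the 50 GiB default threshold are illustrative assumptions, so pick a value that matches the size of your checkpoints and logs.

```python
# Minimal sketch: check free disk space before launching a long run.
# The 50 GiB default is an arbitrary illustrative threshold.
import shutil

def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30

def enough_space(path=".", required_gib=50.0):
    """True if the filesystem containing `path` has at least `required_gib` free."""
    return free_gib(path) >= required_gib

print(f"Free space here: {free_gib('.'):.1f} GiB")
if not enough_space(".", required_gib=50.0):
    print("Low on disk space: clean up with ncdu before starting the run.")
```

Calling enough_space at the top of a training script, with the path set to your checkpoint directory, turns a crash hours into the run into an immediate warning.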

For general training related pointers, check https://gist.github.com/rahulvigneswaran/8b5e6ecd2cae9698e360dbf6d6fc7ed3
