CSE 447/517 Winter 2022 Computing Infrastructure

Welcome to CSE 447/517 Natural Language Processing! This document gives an overview of the computing resources available to you for this course and some development recommendations. If you have more questions, feel free to ask them on Ed!

Resources

We have a variety of resources for this course, including GPUs. Of course, you can always use your own laptop/machine, but this is not recommended since (a) many of the things we do will be computationally expensive; (b) some libraries may not have good non-Linux support; and (c) you probably do not want large datasets, models, etc. occupying all of your disk space. See the next section for tips that will make remote development smooth.

CSE Machines

We have CSE non-GPU machines, attu.cs.washington.edu, to which everyone should have access. PhD students may also have access to the *cycle machines. See https://www.cs.washington.edu/lab/linux for more information. These machines can be accessed from anywhere. There is a disk space quota on attu; see https://www.cs.washington.edu/lab/file-access. These machines should suffice for the early assignments that do not require a GPU.

We have also reserved four CSE GPU machines, nlpg0{0,1,2,3}.cs.washington.edu. nlpg0{0,1} each have two GTX 1080 Ti GPUs (11 GB), and nlpg0{2,3} each have two TITAN Xp GPUs (12 GB). They share the same file system and disk quota as attu. These machines require a UW IP address to log in; remote access options include logging in via the aforementioned non-GPU machines or using the UW VPN. See https://www.cs.washington.edu/lab/remote-access for more information.

Please be mindful when using these GPU machines, as they are shared by all students in the course. Do not occupy, say, half of all GPUs at once. Please do export CUDA_VISIBLE_DEVICES=i, where i is 0 or 1, so that your job sees and uses only one of the two GPUs. See the sections below for tips on inspecting GPU usage and finding the owner of running GPU processes.
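The same restriction can also be applied from inside a Python script; as a sketch, the key point is that the variable must be set before any CUDA framework (e.g., PyTorch) is imported:

```python
import os

# Restrict this process to a single GPU (here GPU 0). CUDA frameworks such as
# PyTorch read this variable when they initialize CUDA, so it must be set
# before the framework is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Any framework imported after this point sees exactly one GPU, which is
# exposed to the program as device index 0.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints "0"
```

Setting it in your shell with export (as above) achieves the same effect for everything launched from that shell session.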

Each of the nlpg0* hosts also has a /local1 directory with around 600 GB of space that is shared among all students; across the four hosts, this totals around 2.4 TB. With ~152 students in the class, each student can use ~15 GB there on average. Please be careful not to exceed this limit! See the next section for how to check the size of a directory. If you choose to use this space, please create a directory named with your NetID or your team name -- since this space is shared and public, your team can put everything in one centralized directory. Note that, unlike the home directory, this space is not shared across machines, so you (or your team) may have to choose a host early on and stick with it.
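As a sketch, you can create your scratch directory and check how full the underlying file system is from Python (the path below is illustrative and uses /tmp so it runs anywhere; on nlpg0* you would use /local1/<your-netid>):

```python
import os
import shutil

# Illustrative scratch path -- on an nlpg0* host, use /local1/<your-netid>.
scratch = os.path.join("/tmp", "local1-demo", "your_netid")
os.makedirs(scratch, exist_ok=True)

# shutil.disk_usage reports total/used/free bytes for the file system
# containing the given path -- handy for checking how full /local1 is.
usage = shutil.disk_usage(scratch)
print(usage.free > 0)  # prints True if the file system has free space
```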

We have also worked with CSE support to give non-CSE students access to these machines; each non-CSE student in the course has been granted a CSE account. Please ask on Ed if you encounter issues while accessing it.

Cloud Computing Credits

Each student can claim a $50 Google Cloud Platform coupon. See https://edstem.org/us/courses/16743/discussion/986290 for instructions. You can use these credits for computing resources. See https://cloud.google.com/ai-platform/deep-learning-vm/docs/cloud-marketplace#creating_an_instance_with_one_or_more_gpus for how to launch a GPU instance. Other docs on the sidebar may also be useful.

Remember to stop machines that you are not using!

Other Resources

Google Colab provides free access to GPUs. Both Google Cloud Platform and Microsoft Azure also provide free education credits for students, in addition to the credits from the course staff mentioned above.

Development Recommendations

You are free to use these resources however you want, but here are some tips and recommendations that could be useful:

  • nvidia-smi is the go-to command for inspecting current GPU usage. You can use ps -up `nvidia-smi -q -x | grep pid | sed -e 's/<pid>//g' -e 's/<\/pid>//g' -e 's/^[[:space:]]*//'` (from here) to look up the owner of each GPU process, and contact that user if they are occupying excessive resources for an extended amount of time.

  • Some IDEs (e.g., VSCode) support remote development, which lets you write code directly on the remote machines without having to copy it around (with scp, git, etc.). Note that, as mentioned above, the CSE GPU machines require a UW IP address to log in, so you may need the UW VPN when connecting your IDE to those machines. Alternatively, since attu (but not the *cycle machines) shares the same file system as the GPU machines, you can connect your IDE to attu, write code there, and run it on the GPU machines.

  • You should develop and test your code on a local machine or the CSE machines, and use the cloud computing platforms only for training, since you have a limited amount of credits.

  • You can choose to use virtual environments, such as Miniconda, to install your own Python binaries and packages in a separate environment. This may also be convenient for you to run the same code on different platforms (e.g., developing on the CSE machines and training on the cloud platforms), as well as for others to run your code.

  • However, virtual environments do take up a lot of disk space. If that becomes an issue, you may use the system-wide Python binary (python/python3). Many packages are already pre-installed (check with pip3 list). You can install additional packages in your home directory with pip3 install --user $PACKAGE_NAME.

  • Use du -sh ${DIR} to check the size of a particular directory.

  • Adding your local SSH public key (e.g., ~/.ssh/id_rsa.pub) to ~/.ssh/authorized_keys on the remote machines will eliminate the need to type your password every time and will make your life a lot easier.

  • Please see this example code for GPU training with PyTorch. It is primarily based on the PyTorch CIFAR-10 example code. The important lines for GPU training are lines 76-78, 92, and 105. In general, the .to() method can be called on models or tensors to move them to a GPU, and the .cpu() method brings them back to the CPU.
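In case the linked example is not accessible, here is a minimal self-contained sketch of the same device-handling pattern (the model and tensor shapes are illustrative, not taken from the linked code):

```python
import torch
import torch.nn as nn

# Pick the GPU if one is visible, otherwise fall back to the CPU, so the
# same script runs on attu (no GPU) and on the nlpg0* machines (GPU).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2).to(device)   # move the model's parameters to the device
x = torch.randn(4, 8).to(device)     # move the input batch to the same device
y = model(x)                         # forward pass runs on the GPU when available

result = y.cpu()                     # bring the output back to the CPU
print(result.shape)                  # prints torch.Size([4, 2])
```

Mixing devices (e.g., a GPU model with a CPU tensor) raises a runtime error, so keep the model and its inputs on the same device.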
