Setting up a GPU server with scheduling and containers
Our group recently acquired a new server to do some deep learning: a SuperMicro 4029GP-TRT2, stuffed with 8x NVIDIA RTX 2080 Ti. It may be a bit overpowered for now, but with upcoming networks like BigGAN, fully 3D networks, and new students joining our group, this machine will see a lot of use in the future.
One challenge is how to manage these GPUs. There are many approaches, but given that most PhD candidates aren't sysadmins, they tend to range from a 'free-for-all', where one person ends up hogging all GPUs for weeks due to a bug in their code, to Excel sheets that no one understands and no one adheres to, because changing GPU ids in code is hard. This leads to a lot of frustration, low productivity and under-utilisation of these expensive servers. A sketch of the GPU-id problem follows below.

Another issue is conflicting software versions. TensorFlow and Keras, for example, tend to make breaking API changes every now and then. As t
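To illustrate why hard-coded GPU ids are a problem, here is a minimal sketch of the usual workaround: restricting a process to one physical GPU through the CUDA_VISIBLE_DEVICES environment variable, so the training code itself can always address "GPU 0". The GPU id 3 and the TensorFlow 2.x call are illustrative assumptions, not part of the setup described in this post.

```python
import os

# Hide all but one physical GPU from this process (id 3 is a hypothetical choice).
# Must be set before any CUDA-using framework is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

import tensorflow as tf  # imported after setting the variable

# Inside the process the selected card now shows up as the only (and first) GPU,
# so the training script never needs to hard-code a device id.
print(tf.config.list_physical_devices("GPU"))
```

This works, but it relies on everyone setting the variable correctly and agreeing on who gets which card, which is exactly the coordination problem a scheduler solves.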