In this article, I will share some of my experience installing the NVIDIA driver and CUDA on Linux. I mainly use Ubuntu as the example; notes for CentOS/Fedora are provided where I can.
import torch
import psutil
import numpy as np

def log_profile(summaryWriter, step, scope='profile', cpu=True, mem=True,
                gpu=torch.cuda.is_available(), disk=['read_time', 'write_time'],
                network=False):
    if cpu:
        # Per-core CPU utilization, summarized as min/avg/max across cores
        cpu_usage = np.array(psutil.cpu_percent(percpu=True))
        summaryWriter.add_scalars(f'{scope}/cpu/percent', {
            'min': cpu_usage.min(),
            'avg': cpu_usage.mean(),
            'max': cpu_usage.max(),
        }, step)
- Write your training script so that it can be killed, and then automatically resumes from the beginning of the current epoch when restarted. (See train-example.py for an example training loop incorporating these recommendations.)
- Save checkpoints at every epoch for the following (see utils.py for the save_training_state helper function):
    - model(s)
    - optimizer(s)
    - any hyperparameter schedules; I usually write the epoch number to a JSON file and compute the hyperparameter schedules as a function of the epoch number.
- At the beginning of training, check for any saved training checkpoints and load all relevant info (models, optimizers, hyperparameter schedules). (See utils.py for the load_training_state helper function.)
- Consider using smaller epochs by limiting the number of batches pulled from your (shuffled) dataloader during each epoch. This will cause your trai
# xvdg1 and xvdh1 are 2 attached volumes; md0 will be the (virtual) RAID volume
mdadm -C /dev/md0 -l raid0 -c 64 -n 2 /dev/xvdg1 /dev/xvdh1
mdadm -E /dev/xvd[g-h]1   # examine the member volumes
mdadm --detail /dev/md0   # confirm the array assembled correctly
mkfs.ext4 /dev/md0
df -h
mount /dev/md0 /mnt/
Install NVIDIA drivers:
- Find the NVIDIA driver download link for your system at http://www.nvidia.com/Download/index.aspx?lang=en-us
- Download the driver, e.g.:
  wget -P ~/Downloads/ http://us.download.nvidia.com/tesla/390.46/NVIDIA-Linux-x86_64-390.46.run
- Remove any existing xorg.conf:
  sudo rm /etc/X11/xorg.conf # It's OK if this doesn't exist
- The NVIDIA driver will clash with the nouveau driver, so deactivate it:
  $ sudo vim /etc/modprobe.d/blacklist-nouveau.conf
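The contents of that blacklist file are not shown in these notes; the conventional contents (a common convention, not taken from this document) are:

```
blacklist nouveau
options nouveau modeset=0
```

After saving the file, regenerate the initramfs so the blacklist takes effect (`sudo update-initramfs -u` on Ubuntu, `sudo dracut --force` on CentOS/Fedora), then reboot.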
# Based on https://github.com/pytorch/pytorch/pull/3740
import torch
import math

class AdamW(torch.optim.Optimizer):
    """Implements AdamW algorithm.

    It has been proposed in `Fixing Weight Decay Regularization in Adam`_.

    Arguments:
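The class body is cut off above; for intuition, the decoupled weight-decay update that AdamW performs can be sketched for a single scalar parameter in plain Python (an illustrative sketch of the algorithm, not the gist's code):

```python
import math

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # Update biased first- and second-moment estimates (as in Adam)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied directly to the parameter,
    # NOT folded into the gradient as in vanilla Adam + L2 penalty
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v
```

The key difference from Adam with L2 regularization is the last line: the decay term `weight_decay * p` bypasses the adaptive denominator entirely.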
These notes mostly follow a tutorial on installing the NVIDIA driver and CUDA while avoiding collisions with the X server.
Backup links:
- https://davidsanwald.github.io/2016/11/13/building-tensorflow-with-gpu-support.html
- https://gist.github.com/wangruohui/df039f0dc434d6486f5d4d098aa52d07 (forgets to discuss the --override --no-opengl-libs flags when installing CUDA)
Open questions:
# From https://gist.github.com/colllin/c02319fe3202470cc4d0a0b73cdbd1a6
#!/usr/bin/env python
###############################################################################
# $Id$
#
# Project:  GDAL2Tiles, Google Summer of Code 2007 & 2008
#           Global Map Tiles Classes
# Purpose:  Convert a raster into TMS tiles, create KML SuperOverlay EPSG:4326,
#           generate simple HTML viewers based on Google Maps and OpenLayers
""" | |
Dynamic Routing Between Capsules | |
https://arxiv.org/abs/1710.09829 | |
""" | |
import torch | |
import torch.nn as nn | |
import torch.optim as optim | |
import torch.nn.functional as F | |
import torchvision.transforms as transforms |
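The gist implements the paper referenced in the docstring; its central nonlinearity, the "squash" function (Eq. 1 in the paper), can be sketched in plain Python for a single vector (an illustrative sketch, not the gist's torch implementation):

```python
import math

def squash(s, eps=1e-8):
    # Squashing nonlinearity: shrinks short vectors toward zero and long
    # vectors toward (but never past) unit length, preserving direction,
    # so a capsule's output norm can be read as a probability.
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]
```

For example, a vector of length 5 is squashed to length 25/26 ≈ 0.96, while a vector of length 0.1 is squashed to roughly 0.01: long vectors saturate near 1, short vectors are suppressed.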