- A short note on how to start multi-node training with PyTorch on the Slurm scheduler.
- Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated, or when you need more than 4 GPUs for a single job.
- Requirement: you must use PyTorch DistributedDataParallel (DDP) for this.
- Warning: you might need to refactor your own code.
- Warning: you might be secretly condemned by your colleagues for using too many GPUs.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from argparse import ArgumentParser
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from transformers import BertForMaskedLM
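The preview above stops at the imports, so the gist's own setup code is not shown here. As a rough sketch of one common approach (not necessarily the author's), the rank information DDP needs can be read from the environment variables Slurm sets for each task. The helper name `slurm_dist_env` is hypothetical, and the sketch assumes the job is launched with `srun` using one task per GPU.

```python
import os


def slurm_dist_env(environ=None):
    """Derive DDP rank information from Slurm's per-task environment variables.

    Hypothetical helper: assumes one Slurm task per GPU, launched via srun.
    """
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),         # global index of this task
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total number of tasks in the job
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # index of this task on its node
    }


if __name__ == "__main__":
    # Outside a Slurm job this falls back to a single-process configuration.
    print(slurm_dist_env())
```

The resulting `rank` and `world_size` could then be passed to `dist.init_process_group(backend="nccl", rank=..., world_size=...)`, with `MASTER_ADDR` and `MASTER_PORT` exported in the job script, and `local_rank` used to pick the CUDA device.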
This is a companion piece to my instructions on building TensorFlow from source. The aim is to install the following software on Ubuntu 20.04:
- NVIDIA graphics card driver (v450.57)
- CUDA (v11.0.2)
- cuDNN (v8.0.2.39)
import torch
import torch.nn as nn


class conv_block_nested(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super(conv_block_nested, self).__init__()
        self.activation = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1, bias=True)
        self.bn1 = nn.BatchNorm2d(mid_ch)
#!/bin/sh
PREFIX="$HOME/.local/"
install_tree() {
    # The project page of the Linux "tree" command is located at http://mama.indstate.edu/users/ice/tree
    TMP_TREE_DIR="/tmp/$USER/tree"; mkdir -p "$TMP_TREE_DIR"
    wget -nc -O "$TMP_TREE_DIR/tree.tgz" "http://mama.indstate.edu/users/ice/tree/src/tree-1.7.0.tgz"
    tar -xvzf "$TMP_TREE_DIR/tree.tgz" -C "$TMP_TREE_DIR" --strip-components 1
# Copyright 2018 Uber Technologies, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
e.g.: tar -czvf name-of-archive.tar.gz /path/to/directory-or-file
- -c: Create an archive.
- -z: Compress the archive with gzip.
- -v: Verbose output; lists every file as it is archived.
- -f: Specify the filename of the archive.
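For comparison, roughly the same operation can be done from Python with the standard-library `tarfile` module. This is a minimal sketch, not a drop-in replacement for the command: the function name `make_tar_gz` is made up for illustration, and unlike the `tar` command it stores entries under the source's base name rather than its full path.

```python
import os
import tarfile


def make_tar_gz(archive_path, source_path, verbose=False):
    """Create a gzip-compressed tar archive of source_path (file or directory)."""
    # Mode "w:gz" covers -c (create) and -z (gzip); archive_path plays the role of -f.
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries relative to the source's base name (a choice made for this sketch).
        arcname = os.path.basename(os.path.normpath(source_path))
        tar.add(source_path, arcname=arcname)
        names = tar.getnames()
    if verbose:  # rough analogue of -v
        for name in names:
            print(name)
    return names
```

Calling `make_tar_gz("name-of-archive.tar.gz", "/path/to/directory-or-file", verbose=True)` mirrors the example command above.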
Notes from arXiv:1611.07004v1 [cs.CV] 21 Nov 2016
- Euclidean distance between predicted and ground truth pixels is not a good method of judging similarity because it yields blurry images.
- GANs learn a loss function rather than using an existing one.
- GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss.
- Conditional GANs (cGANs) learn a mapping from observed image `x` and random noise vector `z` to `y`: `y = f(x, z)`
- The generator `G` is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator, `D`, which is trained to do as well as possible at detecting the generator's "fakes".
- The discriminator `D` learns to classify between real and synthesized pairs. The generator learns to fool the discriminator.
- Unlike an unconditional GAN, both th
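The adversarial game in these notes corresponds to the conditional GAN objective E[log D(x, y)] + E[log(1 - D(x, G(x, z)))], which `D` tries to maximize and `G` tries to minimize. The sketch below only evaluates that quantity from lists of discriminator probabilities; the function name `cgan_objective` and the sample numbers are illustrative assumptions, and no networks or training are involved.

```python
import math


def cgan_objective(d_real, d_fake):
    """Sample estimate of E[log D(x, y)] + E[log(1 - D(x, G(x, z)))].

    d_real: discriminator probabilities for real (x, y) pairs
    d_fake: discriminator probabilities for synthesized (x, G(x, z)) pairs
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


if __name__ == "__main__":
    # A discriminator that cannot tell real from fake outputs 0.5 everywhere.
    print(cgan_objective([0.5, 0.5], [0.5, 0.5]))
    # A discriminator that separates the two well scores higher.
    print(cgan_objective([0.9, 0.8], [0.1, 0.2]))
```

A higher value means the discriminator is winning; a generator that fools it pushes the fake-pair probabilities up and the objective down.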
#!/usr/bin/env bash
set -e
# install cuda-7.5
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo apt-get update
sudo apt-get install -y linux-image-extra-`uname -r` linux-headers-`uname -r` linux-image-`uname -r`
sudo apt-get install -y cuda-7-5
echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:\$LD_LIBRARY_PATH" | tee -a ~/.profile | tee -a ~/.bashrc