fxxk Nvidia

Install the driver

The distro matters! I'm using Ubuntu 20.04 since 22.04 isn't supported by CUDA 11.6.

Inspired by nvidia-smi指令报错 ("nvidia-smi command reports an error")

sudo apt install nvidia-driver-470 nvidia-settings

Version 470 comes with CUDA 11.4 by default. Leave it alone and use a separate CUDA configured with conda or docker. Don't bother installing CUDA with apt, the system package manager; all you need is the driver, thanks to CUDA's forward compatibility (CUDA 11 and Later Defaults to Minor Version Compatibility).
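
As a quick sanity check (my own addition, not part of the original steps): the driver is all the host needs, and the CUDA 11.6 runtime can live entirely inside a conda env, e.g. via the same pytorch-cuda package the Dockerfile below uses. The env name torch116 is made up:

# the "CUDA Version" printed by nvidia-smi is only the maximum the 470 driver
# supports (11.4), not what has to be installed on the host
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# a conda env that brings its own CUDA 11.6 runtime, independent of anything apt installed
conda create -n torch116 -c pytorch -c nvidia pytorch pytorch-cuda=11.6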

Install Docker

Install Docker first. See Install Docker Engine on Ubuntu.

# assuming the Docker apt repository has already been added (see the guide above)
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
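
Not in the original gist, but a cheap way to confirm the engine itself works before touching the GPU part:

sudo docker run --rm hello-world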

You also have to set up the NVIDIA Container Toolkit:

sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
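
After the restart, the nvidia runtime should be registered with the daemon; a quick way to check (my addition, not from the original):

# expect to see "nvidia" listed alongside runc
sudo docker info | grep -i runtime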

Get an image from Docker Hub

# check out the tag naming convention on Docker Hub
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-devel-ubuntu20.04 nvidia-smi
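
--gpus all hands every GPU to the container; to pin a single card instead, the flag also takes a device list (same image as above, just a sketch):

sudo docker run --rm --gpus '"device=0"' nvidia/cuda:11.6.2-devel-ubuntu20.04 nvidia-smi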

Notes

You should use --ipc="host" or increase the shared memory size (--shm-size) when playing with PyTorch; see the note about running the PyTorch docker image with the --ipc=host option. Nothing goes wrong when inferring with the AUTOMATIC1111 WebUI, but training will be problematic without it.
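
For a plain docker run outside of compose, either flag does the job; the image name docker-webui and train.py below are placeholders:

# give the container the host IPC namespace (what the compose file below does)
sudo docker run --rm --gpus all --ipc=host docker-webui python train.py
# or just enlarge /dev/shm instead
sudo docker run --rm --gpus all --shm-size=8g docker-webui python train.py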

It's not recommended and not best practice to use docker as a VM, but I'm just lazy and it just works. Your image size will increase uncontrollably, since every change made in the container gets saved instead of just the last state. I know it's stupid.

An ugly workaround is to use volumes to persist most of the working files and the conda environment.

sudo docker run --name dummy -d docker-webui
sudo docker cp dummy:/root /home/ubuntu/
sudo docker cp dummy:/opt/conda /home/ubuntu/
sudo docker stop dummy && sudo docker rm dummy
# (earlier part of the compose file omitted)
    volumes:
      - /home/ubuntu/workplace:/workplace
      # copy these out of the container first after build,
      # so that I don't lose all of the state when the container is recreated
      - /home/ubuntu/conda:/opt/conda
      - /home/ubuntu/root:/root

Stupid, but it works as expected, and I can't find a better solution short of using a real VM.

Don’t treat docker containers like a VM, you’ll be shooting yourself in the foot on down the road.

See also How to flatten a Docker image?
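
If the layer bloat ever needs to be squashed for real, exporting the container filesystem and re-importing it gives a single-layer image (a sketch; the :flat tag is made up, and metadata such as ENV, EXPOSE and CMD is lost in the process and has to be re-declared):

sudo docker export dummy | sudo docker import - docker-webui:flat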

For reference, the whole docker-compose.yml:

# docker compose 1.28+ (needed for the gpu device reservation below)
version: '3'
services:
  webui:
    build:
      context: .
      dockerfile: webui/Dockerfile
    privileged: true
    command: /bin/bash -c 'echo "Port 2222" >> /etc/ssh/sshd_config && service ssh start && tail -f /dev/null'
    # image: nvidia/cuda:11.6.2-devel-ubuntu20.04
    # command: bash
    # host:container
    network_mode: host
    # DistributedDataParallel is based on NCCL,
    # which needs ports exposed in a certain range,
    # and I can't expose all of them, hence network_mode: host
    # https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
    # https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
    # ports:
    #   - "2222:22"
    #   - "36000:36000"
    volumes:
      - /home/ubuntu/workplace:/workplace
      # copy these out of the container first after build,
      # so that I don't lose all of the state when the container is recreated
      # sudo docker run --name dummy -d docker-webui
      # sudo docker cp dummy:/root /home/ubuntu/
      # sudo docker cp dummy:/opt/conda /home/ubuntu/
      # sudo docker stop dummy && sudo docker rm dummy
      # - /home/ubuntu/conda:/opt/conda
      # - /home/ubuntu/root:/root
    # stdin_open: true
    # tty: true
    # You have to either increase the size of shm or set ipc=host,
    # or you will have trouble when training
    # https://github.com/pytorch/pytorch#docker-image
    ipc: "host"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
And webui/Dockerfile, referenced by the build section above:

# https://stackoverflow.com/questions/72179923/docker-run-without-entry-point
FROM nvidia/cuda:11.6.2-devel-ubuntu20.04

ENV TZ=Asia/Shanghai
ENV DEBIAN_FRONTEND=noninteractive

# basic tools, an SSH server with root login enabled, and a root password
RUN echo "export LANG=C.UTF-8" >> /etc/profile \
    && apt-get update \
    && apt-get install -y vim openssh-server iputils-ping wget curl \
    && sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/g' /etc/ssh/sshd_config \
    && echo "root:114514" | chpasswd
RUN apt install -y git tmux build-essential \
    && apt clean

ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# init conda for bash and install mamba
RUN conda init bash \
    && conda install mamba -y -n base -c conda-forge \
    && mamba create -y -n diffusers python=3.10.6
RUN mamba install -n diffusers -y pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia

# optional: prebuilt xformers for this Python/CUDA/PyTorch combo
# wget https://anaconda.org/xformers/xformers/0.0.15.dev343%2Bgit.1b1fd8a/download/linux-64/xformers-0.0.15.dev343%2Bgit.1b1fd8a-py310_cu11.6_pyt1.13.tar.bz2
# mamba install xformers-0.0.15.dev343+git.1b1fd8a-py310_cu11.6_pyt1.13.tar.bz2

EXPOSE 22
# a dummy command to keep the container from exiting
CMD ["/bin/bash", "-c", "echo 'Port 2222' >> /etc/ssh/sshd_config && service ssh start && tail -f /dev/null"]