fxxk Nvidia

Install the driver

The distro matters! I'm using Ubuntu 20.04 since 22.04 isn't supported by CUDA 11.6.

Inspired by nvidia-smi指令报错 ("nvidia-smi command reports an error")

sudo apt install nvidia-driver-470 nvidia-settings

Version 470 comes with CUDA 11.4 by default. Leave it alone and use a separate CUDA configured with conda or docker. Don't bother installing CUDA with apt, the system package manager; all you need is the driver, thanks to CUDA's forward compatibility (CUDA 11 and Later Defaults to Minor Version Compatibility).
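
As a quick sanity check (my own addition, not part of the original steps): the driver is all the host needs, and the CUDA 11.6 runtime can live entirely inside a conda env, e.g. via the same pytorch-cuda package the Dockerfile below uses. The env name torch116 is made up:

# the "CUDA Version" printed by nvidia-smi is only the maximum the 470 driver
# supports (11.4), not what has to be installed on the host
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# a conda env that brings its own CUDA 11.6 runtime, independent of anything apt installed
conda create -n torch116 -c pytorch -c nvidia pytorch pytorch-cuda=11.6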

Install Docker

Install Docker first. See Install Docker Engine on Ubuntu.

# assuming the Docker apt repository has already been added (see the guide above)
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
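
Not in the original gist, but a cheap way to confirm the engine itself works before touching the GPU part:

sudo docker run --rm hello-world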

You also have to set up the NVIDIA Container Toolkit:

sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
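
After the restart, the nvidia runtime should be registered with the daemon; a quick way to check (my addition, not from the original):

# expect to see "nvidia" listed alongside runc
sudo docker info | grep -i runtime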

Get an image from Docker Hub

# check out the tag naming convention on Docker Hub
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-devel-ubuntu20.04 nvidia-smi
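
--gpus all hands every GPU to the container; to pin a single card instead, the flag also takes a device list (same image as above, just a sketch):

sudo docker run --rm --gpus '"device=0"' nvidia/cuda:11.6.2-devel-ubuntu20.04 nvidia-smi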

Notes

You should use --ipc="host" or increase the shared memory size (--shm-size) when playing with PyTorch; see the note about running the PyTorch docker image with the --ipc=host option. Nothing goes wrong when inferring with the AUTOMATIC1111 WebUI, but training will be problematic without it.
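
For a plain docker run outside of compose, either flag does the job; the image name docker-webui and train.py below are placeholders:

# give the container the host IPC namespace (what the compose file below does)
sudo docker run --rm --gpus all --ipc=host docker-webui python train.py
# or just enlarge /dev/shm instead
sudo docker run --rm --gpus all --shm-size=8g docker-webui python train.py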

It's not recommended and not best practice to use docker as a VM, but I'm just lazy and it just works. Your image size will increase uncontrollably, since every change made in the container gets saved instead of just the last state. I know it's stupid.

An ugly workaround is to use volumes to persist most of the working files and the conda environment.

sudo docker run --name dummy -d docker-webui
sudo docker cp dummy:/root /home/ubuntu/
sudo docker cp dummy:/opt/conda /home/ubuntu/
sudo docker stop dummy && sudo docker rm dummy
# (earlier part of the compose file omitted)
    volumes:
      - /home/ubuntu/workplace:/workplace
      # copy these out of the container first after build,
      # so that I don't lose all of the state when the container is recreated
      - /home/ubuntu/conda:/opt/conda
      - /home/ubuntu/root:/root

Stupid, but it works as expected, and I can't find a better solution short of using a real VM.

Don’t treat docker containers like a VM, you’ll be shooting yourself in the foot on down the road.

See also How to flatten a Docker image?
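
If the layer bloat ever needs to be squashed for real, exporting the container filesystem and re-importing it gives a single-layer image (a sketch; the :flat tag is made up, and metadata such as ENV, EXPOSE and CMD is lost in the process and has to be re-declared):

sudo docker export dummy | sudo docker import - docker-webui:flat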

For reference, the whole docker-compose.yml:

# docker compose 1.28+ (needed for the gpu device reservation below)
version: '3'
services:
  webui:
    build:
      context: .
      dockerfile: webui/Dockerfile
    privileged: true
    command: /bin/bash -c 'echo "Port 2222" >> /etc/ssh/sshd_config && service ssh start && tail -f /dev/null'
    # image: nvidia/cuda:11.6.2-devel-ubuntu20.04
    # command: bash
    # host:container
    network_mode: host
    # DistributedDataParallel is based on NCCL,
    # which needs ports exposed in a certain range,
    # and I can't expose all of them, hence network_mode: host
    # https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
    # https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
    # ports:
    #   - "2222:22"
    #   - "36000:36000"
    volumes:
      - /home/ubuntu/workplace:/workplace
      # copy these out of the container first after build,
      # so that I don't lose all of the state when the container is recreated
      # sudo docker run --name dummy -d docker-webui
      # sudo docker cp dummy:/root /home/ubuntu/
      # sudo docker cp dummy:/opt/conda /home/ubuntu/
      # sudo docker stop dummy && sudo docker rm dummy
      # - /home/ubuntu/conda:/opt/conda
      # - /home/ubuntu/root:/root
    # stdin_open: true
    # tty: true
    # You have to either increase the size of shm or set ipc=host,
    # or you will have trouble when training
    # https://github.com/pytorch/pytorch#docker-image
    ipc: "host"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
And webui/Dockerfile, referenced by the build section above:

# https://stackoverflow.com/questions/72179923/docker-run-without-entry-point
FROM nvidia/cuda:11.6.2-devel-ubuntu20.04

ENV TZ=Asia/Shanghai
ENV DEBIAN_FRONTEND=noninteractive

# basic tools, an SSH server with root login enabled, and a root password
RUN echo "export LANG=C.UTF-8" >> /etc/profile \
    && apt-get update \
    && apt-get install -y vim openssh-server iputils-ping wget curl \
    && sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/g' /etc/ssh/sshd_config \
    && echo "root:114514" | chpasswd
RUN apt install -y git tmux build-essential \
    && apt clean

ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# init conda for bash and install mamba
RUN conda init bash \
    && conda install mamba -y -n base -c conda-forge \
    && mamba create -y -n diffusers python=3.10.6
RUN mamba install -n diffusers -y pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia

# optional: prebuilt xformers for this Python/CUDA/PyTorch combo
# wget https://anaconda.org/xformers/xformers/0.0.15.dev343%2Bgit.1b1fd8a/download/linux-64/xformers-0.0.15.dev343%2Bgit.1b1fd8a-py310_cu11.6_pyt1.13.tar.bz2
# mamba install xformers-0.0.15.dev343+git.1b1fd8a-py310_cu11.6_pyt1.13.tar.bz2

EXPOSE 22
# a dummy command to keep the container from exiting
CMD ["/bin/bash", "-c", "echo 'Port 2222' >> /etc/ssh/sshd_config && service ssh start && tail -f /dev/null"]