
llama.cpp with Nvidia cuBLAS support in Docker

Dockerfile for running llama.cpp with Nvidia GPU support.

Install Docker and the NVIDIA Container Toolkit. Instructions for Arch Linux are available here.
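
To check that Docker can reach the GPU before building, you can optionally run a CUDA base image and confirm the card is visible; this one-liner assumes the nvidia/cuda 12.1.1 base tag (any CUDA base tag you have available works):

docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi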

Build

docker build -t llama-cpp-cuda:0.0.1 .
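
The CUDA base image tag is exposed as a build argument (ARG CUDA_VERSION in the Dockerfile), so a different release can be selected at build time; for example, assuming you want the 11.8.0 devel tag:

docker build --build-arg CUDA_VERSION=11.8.0-devel-ubuntu22.04 -t llama-cpp-cuda:0.0.1 .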

Configuration

Create a model directory:

mkdir -p ~/models

It will be used for storing LLMs and configuration files.

Download a model supporting the new (as of Jun 2023) k-quant methods in llama.cpp, for example the q4_K_S quantization of Wizard-Vicuna-13B-Uncensored referenced in prompt.sh, and place it in the models directory.
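
A minimal download sketch with wget, assuming the file is hosted in TheBloke's GGML repository on Hugging Face (verify the repository and exact file name before downloading):

wget -P ~/models \
  https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GGML/resolve/main/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_S.bin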

Usage

Edit prompt.sh to set the model path. Also set the number of CPU threads and the number of GPU layers to use, depending on your hardware.

Link or copy it to your $PATH:

ln prompt.sh ~/.local/bin
chmod +x ~/.local/bin/prompt.sh

Run it with a story prompt:

prompt.sh story "a mysterious forest"

Once upon a time, there was a vast and ancient forest that stretched for miles in every direction. It was said to be enchanted, with strange and wondrous creatures living within its depths. The trees were tall and gnarled...

Run it with an instruct prompt:

prompt.sh instruct "build a bicycle"

To build a bicycle, you will need the following components:

  1. Frame: The main body of the bike that supports the wheels and seat.
  2. Wheels: The large wheel in front and the smaller one in back that roll along the ground.
  3. Pedals: The circular rotating devices that allow...

GPU offloading

This container was tested on the following hardware:

  • AMD Ryzen 9 3900XT 12-Core
  • 1x Nvidia GTX 1080 Ti 11GB

Performance is approximately doubled with GPU offloading compared to running on the CPU alone.

The output from llama.cpp should look like this:

main: build = 710 (b24c304)
main: seed  = 1687136441
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti
llama.cpp: loading model from /models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_S.bin
...
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2135.98 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloaded 42/43 layers to GPU
llama_model_load_internal: total VRAM used: 8212 MB
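
If the model does not fit into the available VRAM (watch the "offloaded ... layers" and "total VRAM used" lines above), lower the number of offloaded layers in prompt.sh; the value below is only an illustrative starting point to tune per GPU:

# prompt.sh: offload fewer layers (passed to llama.cpp as -ngl) on cards with less VRAM
GPU_LAYERS=30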

Dockerfile

ARG CUDA_VERSION=12.1.1-devel-ubuntu22.04
FROM nvidia/cuda:${CUDA_VERSION}

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive \
    SHELL=/bin/bash

# Set locale
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen

# Install apt packages
RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y --no-install-recommends \
    git cmake build-essential pkg-config libopenblas-dev \
    && apt-get clean all

RUN mkdir /src
RUN git clone --recurse-submodules https://github.com/ggerganov/llama.cpp.git /src/llama.cpp
RUN cd /src/llama.cpp \
    && mkdir build \
    && cd build \
    && cmake .. -DLLAMA_CUBLAS=ON \
    && cmake --build . --config Release

# Entrypoint
ENTRYPOINT ["/src/llama.cpp/build/bin/main"]

prompt.sh

#!/bin/bash

# Set the model path
MODEL="/models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_S.bin"
# Size of the prompt context (default: 512)
PROMPT_CTX_SIZE=2048
# Number of threads to use during computation (default: 12)
CPU_THREADS=24
# Number of layers to store in VRAM
GPU_LAYERS=42

display_help() {
    cat <<EOL
Help for llama.cpp:
$(docker run --rm -it llama-cpp-cuda:0.0.1 --help)

Help for prompt.sh:
Usage: prompt.sh [type] [prompt]
  type    Type of prompt. Options are: 'story', 'instruct', 'file'.
  prompt  The prompt you want to process.

Examples:
  $ prompt.sh story "a mysterious forest"
  $ prompt.sh instruct "build a bicycle"
  $ prompt.sh file "file.txt"
EOL
}

form_prompt() {
    local type=$1
    local user_prompt=$2
    case $type in
    "instruct")
        printf '\b### Instruction:\nHow do I %s?\n### Response:\n' "$user_prompt"
        ;;
    "story")
        printf '\b### Instruction:\nTell me a compelling and imaginative story about %s. Include vivid descriptions and engaging dialogue.\n### Response:\n' "$user_prompt"
        ;;
    "file")
        file_content=$(cat "$user_prompt")
        printf '\b### Instruction:\nI have a file on the local hard drive called %s. 1. Infer what file format and syntax the file has and output it. 2. Infer from the file name the purpose of the file and output it. 3. Give a two sentence long description of its contents, which follows:\n\n%s.\n### Response:\n' "$user_prompt" "$file_content"
        ;;
    *)
        printf "Invalid type.\n" >&2
        exit 1
        ;;
    esac
}

[[ $# -eq 0 || $1 == "-h" || $1 == "--help" ]] && { display_help; exit 0; }
[[ $# -ne 2 ]] && { echo "Error: Invalid number of arguments."; display_help; exit 1; }

# form_prompt runs in a command substitution subshell, so propagate its failure explicitly
INSTRUCT_PROMPT=$(form_prompt "$1" "$2") || exit 1

set -x
docker run --rm -it \
    --gpus all \
    -v ~/models:/models \
    --name llama \
    llama-cpp-cuda:0.0.1 \
    -m "$MODEL" \
    -t "$CPU_THREADS" \
    -ngl "$GPU_LAYERS" \
    -c "$PROMPT_CTX_SIZE" \
    -p "${INSTRUCT_PROMPT}" \
    --color --temp 0.7 --repeat_penalty 1.1 -n -1