Skip to content

Instantly share code, notes, and snippets.

@simonw
Created October 14, 2025 08:34
Show Gist options
  • Select an option

  • Save simonw/0942d96f616b9e328568ab27d911c8ed to your computer and use it in GitHub Desktop.

Select an option

Save simonw/0942d96f616b9e328568ab27d911c8ed to your computer and use it in GitHub Desktop.

llama.cpp Setup Notes with CUDA Support

Quick Start Guide (TL;DR)

To build llama.cpp with CUDA support on this system:

# Install dependencies
sudo apt-get update
sudo apt-get install -y cmake libcurl4-openssl-dev

# Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

# Run a model
cd /home/dev/llama-cpp-experiment
export PATH=$PATH:/home/dev/llama-cpp-experiment/llama.cpp/build/bin
llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv

System Information

  • Platform: Linux (Ubuntu 24.04)
  • Architecture: aarch64 (ARM64)
  • GPU: NVIDIA GB10 (compute capability 12.1)
  • Driver Version: 580.82.09
  • CUDA Version: 13.0.88
  • CUDA Installation: /usr/local/cuda-13.0

Detailed Build Process

Step 1: Environment Assessment

Initial Discovery

  1. Checked for GPU availability:

    nvidia-smi
    • Found NVIDIA GB10 GPU
    • CUDA Version: 13.0
    • Driver: 580.82.09
    • No processes running on GPU
  2. Checked for CUDA compiler:

    nvcc --version
    • Issue: nvcc not found in PATH
    • Root Cause: CUDA toolkit installed but not in system PATH
    • Solution: Added /usr/local/cuda-13.0/bin to PATH
  3. Verified CUDA installation:

    ls -la /usr/local/cuda*
    • Confirmed CUDA 13.0 installed at /usr/local/cuda-13.0
    • Found nvcc binary at /usr/local/cuda-13.0/bin/nvcc
  4. Checked build tools:

    which cmake make gcc g++
    • Found: make, gcc, g++
    • Missing: cmake
    • Action Required: Install cmake

Step 2: Installing Dependencies

CMake Installation

sudo apt-get install -y cmake
  • Installed cmake 3.28.3
  • Additional packages: cmake-data, libarchive13t64, libcurl4t64, libjsoncpp25, librhash0

Initial Build Attempt - Failure #1

cd llama.cpp
make GGML_CUDA=1 -j$(nproc)
  • Error: "Build system changed: The Makefile build has been replaced by CMake"
  • Root Cause: llama.cpp now requires CMake instead of make
  • Solution: Switch to CMake build system

Step 3: CMake Build Process

Build Attempt #2 - Failure #2

mkdir build && cd build
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
cmake .. -DGGML_CUDA=ON
  • Error: "Could NOT find CURL (missing: CURL_LIBRARY CURL_INCLUDE_DIR)"
  • Root Cause: CURL development headers required for HuggingFace integration
  • Observations:
    • CUDA toolkit found successfully
    • Using CUDA architectures: native
    • ARM feature FMA enabled
    • All other checks passed

CURL Installation

sudo apt-get install -y libcurl4-openssl-dev
  • Installed libcurl4-openssl-dev (8.5.0-2ubuntu10.6)
  • This package provides development files for curl with SSL support

Build Attempt #3 - Success

rm -rf build && mkdir build && cd build
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

CMake Configuration Output (Key Points):

  • CMAKE_SYSTEM_PROCESSOR: aarch64
  • GGML_SYSTEM_ARCH: ARM
  • Found CUDAToolkit: 13.0.88
  • CUDA host compiler: GNU 13.3.0
  • Including CPU backend
  • Including CUDA backend
  • Found CURL: 8.5.0
  • OpenMP enabled (version 4.5)

Build Process:

  • Total build targets: ~100
  • Built ggml-base (CPU)
  • Built ggml-cuda (GPU)
  • Built ggml (combined)
  • Built llama library
  • Built common utilities
  • Built all example programs including llama-cli

Build Time: Approximately 2-3 minutes with all CPU cores

Step 4: Binary Location

Binaries installed to: /home/dev/llama-cpp-experiment/llama.cpp/build/bin/

Key binaries:

  • llama-cli - Main CLI for inference (2.3M)
  • llama-server - HTTP server for model serving (4.0M)
  • llama-bench - Benchmarking tool
  • llama-quantize - Model quantization tool
  • libllama.so - Shared library (2.4M)
  • libggml-base.so - Base GGML library
  • libggml-cuda.so - CUDA backend library
  • libggml-cpu.so - CPU backend library

First Run - Model Download and Inference

Command

cd /home/dev/llama-cpp-experiment
export PATH=$PATH:/home/dev/llama-cpp-experiment/llama.cpp/build/bin
llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv

What Happened

  1. GPU Detection:

    • Found 1 CUDA device: NVIDIA GB10
    • Compute capability: 12.1
    • VMM: yes (Virtual Memory Management enabled)
  2. Model Download:

    • Connected to huggingface.co
    • Downloaded from: https://huggingface.co/unsloth/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf
    • File size: 2.38 GB (2381 MB)
    • Download speed: ~60 MB/s
    • Download time: ~39 seconds
    • Cached to: /home/dev/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q4_K_M.gguf
  3. Model Loading:

    • Architecture: qwen3
    • Model type: 4B (4.02 B parameters)
    • Quantization: Q4_K_M (Medium)
    • File size: 2.32 GiB (4.95 BPW)
    • Context length: 40960 (model trained)
    • Context used: 4096
    • Vocabulary size: 151936 tokens
    • Layers: 36
  4. GPU Offloading:

    • All 37 layers offloaded to GPU (36 repeating layers + 1 output layer)
    • CUDA0 model buffer: 2375.91 MiB
    • CPU_Mapped buffer: 304.28 MiB
    • KV cache: 576.00 MiB (on GPU)
    • Compute buffer: 301.75 MiB (on GPU)
    • Total GPU memory used: ~3.3 GB
  5. Performance Metrics:

    • Prompt eval time: 25.22 ms / 7 tokens = 277.59 tokens/second
    • Generation time: 2079.89 ms / 127 tokens = 61.06 tokens/second
    • Load time: 1590.36 ms
    • Total time: 2158.01 ms for 134 tokens
  6. Memory Breakdown:

    • GPU total: 122548 MiB (GB10)
    • GPU free: 108254 MiB
    • GPU used: 3253 MiB (model 2375 + context 576 + compute 301)
    • Unaccounted: 11040 MiB
    • Host memory: 317 MiB
  7. Output Quality:

    • Generated coherent philosophical response about the meaning of life
    • Properly formatted with structured points
    • No errors or hallucinations detected

Troubleshooting Guide

Problem: "nvcc: command not found"

Solution: Add CUDA bin directory to PATH:

export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0

Problem: "Build system changed: Makefile build replaced by CMake"

Solution: Use CMake instead of make:

cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)

Problem: "Could NOT find CURL"

Solution: Install CURL development package:

sudo apt-get install -y libcurl4-openssl-dev

Problem: Model download fails

Solution:

  • Check internet connectivity
  • Verify HuggingFace is accessible
  • Ensure sufficient disk space in /home/dev/.cache/llama.cpp/

Problem: GPU not detected

Solution:

  1. Verify GPU is visible: nvidia-smi
  2. Check CUDA installation: ls /usr/local/cuda*
  3. Ensure CUDA paths are set in environment variables

Problem: Out of memory on GPU

Solution: Reduce context size or use smaller quantization:

llama-cli -hf model-name -c 2048  # Reduce context to 2048

Performance Optimization Notes

Current Configuration

  • All model layers on GPU
  • Flash Attention enabled (auto-detected)
  • OpenMP enabled for CPU operations
  • ARM FMA (Fused Multiply-Add) enabled
  • Thread count: 20 threads

Possible Improvements

  1. Increase batch size for throughput-focused workloads:

    llama-cli -hf model-name -b 4096
  2. Adjust context size based on needs:

    llama-cli -hf model-name -c 8192  # Increase context
  3. GPU offload control (if needed to balance memory):

    llama-cli -hf model-name -ngl 30  # Offload only 30 layers
  4. Parallel processing for batch inference:

    llama-cli -hf model-name -np 4  # 4 parallel sequences

HuggingFace Integration

The -hf flag enables direct model download from HuggingFace:

# General syntax
llama-cli -hf <username>/<repo-name> [options]

# Examples
llama-cli -hf unsloth/Qwen3-4B-GGUF -p "Your prompt" -n 128
llama-cli -hf TheBloke/Llama-2-7B-GGUF -p "Your prompt" -n 256

Model Selection

  • llama.cpp automatically selects an appropriate quantization (typically Q4_K_M)
  • Models are cached in ~/.cache/llama.cpp/
  • First run downloads the model, subsequent runs use the cache

Supported Formats

  • GGUF format (required)
  • Various quantizations: Q4_K_M, Q5_K_M, Q8_0, etc.

Environment Setup for Future Sessions

Add to ~/.bashrc or ~/.zshrc:

# CUDA paths
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH

# llama.cpp binary
export PATH=$PATH:/home/dev/llama-cpp-experiment/llama.cpp/build/bin

Verification Checklist

  • GPU detected and accessible
  • CUDA toolkit installed and in PATH
  • CMake installed (3.28.3+)
  • CURL development libraries installed
  • llama.cpp compiled with CUDA support
  • All layers offloaded to GPU
  • Model download and caching working
  • Inference successful with good performance

Useful Commands

# Check GPU status
nvidia-smi

# Monitor GPU usage in real-time
watch -n 1 nvidia-smi

# List all llama.cpp binaries
ls -lh /home/dev/llama-cpp-experiment/llama.cpp/build/bin/

# Clear model cache
rm -rf ~/.cache/llama.cpp/

# Benchmark a model
llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q4_K_M.gguf

# Run interactive chat
llama-cli -hf unsloth/Qwen3-4B-GGUF

# Start HTTP server
llama-server -hf unsloth/Qwen3-4B-GGUF --port 8080

Additional Resources

Success Metrics

Final working configuration achieved:

  • ✓ Build completed without errors
  • ✓ GPU fully utilized (37/37 layers on GPU)
  • ✓ High inference speed (61 tokens/sec generation)
  • ✓ Fast prompt processing (277 tokens/sec)
  • ✓ Stable memory usage (~3.3 GB GPU)
  • ✓ HuggingFace integration functional
  • ✓ Model caching working correctly

Date

Setup completed: October 14, 2025

@simonw
Copy link
Copy Markdown
Author

simonw commented Oct 14, 2025

Ran this inside a container with a dev user:

docker run -it --gpus=all \
  -v /usr/local/cuda:/usr/local/cuda:ro \
  nvcr.io/nvidia/cuda:13.0.1-devel-ubuntu24.04 bash

Then:

apt-get update && apt-get install -y sudo
# pick the first free UID >=1000
U=$(for i in $(seq 1000 65000); do if ! getent passwd $i >/dev/null; then echo $i; break; fi; done)
echo "Chosen UID: $U"
# same for a GID
G=$(for i in $(seq 1000 65000); do if ! getent group $i >/dev/null; then echo $i; break; fi; done)
echo "Chosen GID: $G"
# create user+group
groupadd -g "$G" devgrp
useradd -m -u "$U" -g "$G" -s /bin/bash dev
# enable password-less sudo:
printf 'dev ALL=(ALL) NOPASSWD:ALL\n' > /etc/sudoers.d/90-dev-nopasswd chmod 0440 /etc/sudoers.d/90-dev-nopasswd
# Switch to dev user
su - dev

I also added Claude Code:

sudo apt-get install -y npm
# You have to interactively set timezone here, fix that
sudo npm install -g @anthropic-ai/claude-code

@Nico59000
Copy link
Copy Markdown

no perception of "user-land"-like problems on the model local actions ?

@libredove
Copy link
Copy Markdown

no perception of "user-land"-like problems on the model local actions ?

If you want to install e.g. cmake, node, or curl so it is fully for your user only and unprivileged in your homedir, i can recommend pkgsrc 😉🤗

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment