To build llama.cpp with CUDA support on this system:
# Install dependencies
sudo apt-get update
sudo apt-get install -y cmake libcurl4-openssl-dev
# Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)
# Run a model
cd /home/dev/llama-cpp-experiment
export PATH=$PATH:/home/dev/llama-cpp-experiment/llama.cpp/build/bin
llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv
- Platform: Linux (Ubuntu 24.04)
- Architecture: aarch64 (ARM64)
- GPU: NVIDIA GB10 (compute capability 12.1)
- Driver Version: 580.82.09
- CUDA Version: 13.0.88
- CUDA Installation: /usr/local/cuda-13.0
Checked for GPU availability:
nvidia-smi
- Found NVIDIA GB10 GPU
- CUDA Version: 13.0
- Driver: 580.82.09
- No processes running on GPU
Checked for CUDA compiler:
nvcc --version
- Issue: nvcc not found in PATH
- Root Cause: CUDA toolkit installed but not in system PATH
- Solution: Added /usr/local/cuda-13.0/bin to PATH
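The PATH fix above can be made repeatable. This is a minimal sketch, assuming a POSIX shell; the function name `ensure_cuda_on_path` is hypothetical, and the default directory is simply the install location found on this system.

```shell
# Hypothetical helper: put a CUDA bin directory on PATH only when nvcc
# does not already resolve. The default path is the one from this system.
ensure_cuda_on_path() {
    cuda_home="${1:-/usr/local/cuda-13.0}"
    # Nothing to do if nvcc is already reachable.
    if command -v nvcc >/dev/null 2>&1; then
        return 0
    fi
    # Otherwise prepend the toolkit's bin directory, if it holds nvcc.
    if [ -x "$cuda_home/bin/nvcc" ]; then
        export PATH="$cuda_home/bin:$PATH"
        export CUDA_HOME="$cuda_home"
        return 0
    fi
    echo "nvcc not found under $cuda_home/bin" >&2
    return 1
}
```

Calling `ensure_cuda_on_path` before running cmake leaves the environment untouched when nvcc is already found.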
Verified CUDA installation:
ls -la /usr/local/cuda*
- Confirmed CUDA 13.0 installed at /usr/local/cuda-13.0
- Found nvcc binary at /usr/local/cuda-13.0/bin/nvcc
Checked build tools:
which cmake make gcc g++
- Found: make, gcc, g++
- Missing: cmake
- Action Required: Install cmake
sudo apt-get install -y cmake
- Installed cmake 3.28.3
- Additional packages: cmake-data, libarchive13t64, libcurl4t64, libjsoncpp25, librhash0
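The build-tool check above can be wrapped so that it prints only what is missing. This is a small sketch; `missing_tools` is a hypothetical name, not part of llama.cpp or any package.

```shell
# Hypothetical helper: print each named tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools cmake make gcc g++` prints nothing when all four tools are available.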
cd llama.cpp
make GGML_CUDA=1 -j$(nproc)
- Error: "Build system changed: The Makefile build has been replaced by CMake"
- Root Cause: llama.cpp now requires CMake instead of make
- Solution: Switch to CMake build system
mkdir build && cd build
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
cmake .. -DGGML_CUDA=ON
- Error: "Could NOT find CURL (missing: CURL_LIBRARY CURL_INCLUDE_DIR)"
- Root Cause: CURL development headers required for HuggingFace integration
- Observations:
- CUDA toolkit found successfully
- Using CUDA architectures: native
- ARM feature FMA enabled
- All other checks passed
sudo apt-get install -y libcurl4-openssl-dev
- Installed libcurl4-openssl-dev (8.5.0-2ubuntu10.6)
- This package provides development files for curl with SSL support
rm -rf build && mkdir build && cd build
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)
CMake Configuration Output (Key Points):
- CMAKE_SYSTEM_PROCESSOR: aarch64
- GGML_SYSTEM_ARCH: ARM
- Found CUDAToolkit: 13.0.88
- CUDA host compiler: GNU 13.3.0
- Including CPU backend
- Including CUDA backend
- Found CURL: 8.5.0
- OpenMP enabled (version 4.5)
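One way to double-check a configuration like this after the fact is to read values back from CMakeCache.txt, which CMake writes into the build directory with entries of the form NAME:TYPE=VALUE. This is a sketch only; the helper name `cmake_cache_value` is hypothetical.

```shell
# Hypothetical helper: print the value of one entry from a CMakeCache.txt.
# $1 = path to CMakeCache.txt, $2 = variable name (for example GGML_CUDA)
cmake_cache_value() {
    # Cache entries look like NAME:TYPE=VALUE; strip everything up to "=".
    sed -n "s/^$2:[A-Z]*=//p" "$1"
}
```

From the llama.cpp checkout, `cmake_cache_value build/CMakeCache.txt GGML_CUDA` should print `ON` if the CUDA option was picked up.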
Build Process:
- Total build targets: ~100
- Built ggml-base (CPU)
- Built ggml-cuda (GPU)
- Built ggml (combined)
- Built llama library
- Built common utilities
- Built all example programs including llama-cli
Build Time: Approximately 2-3 minutes with all CPU cores
Binaries written to: /home/dev/llama-cpp-experiment/llama.cpp/build/bin/
Key binaries:
- llama-cli: Main CLI for inference (2.3M)
- llama-server: HTTP server for model serving (4.0M)
- llama-bench: Benchmarking tool
- llama-quantize: Model quantization tool
- libllama.so: Shared library (2.4M)
- libggml-base.so: Base GGML library
- libggml-cuda.so: CUDA backend library
- libggml-cpu.so: CPU backend library
cd /home/dev/llama-cpp-experiment
export PATH=$PATH:/home/dev/llama-cpp-experiment/llama.cpp/build/bin
llama-cli -hf unsloth/Qwen3-4B-GGUF -p "I believe the meaning of life is" -n 128 -no-cnv
GPU Detection:
- Found 1 CUDA device: NVIDIA GB10
- Compute capability: 12.1
- VMM: yes (Virtual Memory Management enabled)
Model Download:
- Connected to huggingface.co
- Downloaded from: https://huggingface.co/unsloth/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf
- File size: 2.38 GB (2381 MB)
- Download speed: ~60 MB/s
- Download time: ~39 seconds
- Cached to: /home/dev/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q4_K_M.gguf
Model Loading:
- Architecture: qwen3
- Model type: 4B (4.02 B parameters)
- Quantization: Q4_K_M (Medium)
- File size: 2.32 GiB (4.95 BPW)
- Context length: 40960 (model trained)
- Context used: 4096
- Vocabulary size: 151936 tokens
- Layers: 36
GPU Offloading:
- All 37 layers offloaded to GPU (36 repeating layers + 1 output layer)
- CUDA0 model buffer: 2375.91 MiB
- CPU_Mapped buffer: 304.28 MiB
- KV cache: 576.00 MiB (on GPU)
- Compute buffer: 301.75 MiB (on GPU)
- Total GPU memory used: ~3.3 GB
Performance Metrics:
- Prompt eval time: 25.22 ms / 7 tokens = 277.59 tokens/second
- Generation time: 2079.89 ms / 127 tokens = 61.06 tokens/second
- Load time: 1590.36 ms
- Total time: 2158.01 ms for 134 tokens
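The throughput figures above are token counts divided by elapsed time. This is a quick sketch of that arithmetic using awk; the function name `tokens_per_second` is hypothetical.

```shell
# Hypothetical helper: tokens per second from a token count and a time in ms.
tokens_per_second() {
    awk -v tokens="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", tokens / (ms / 1000) }'
}

tokens_per_second 127 2079.89   # generation: prints 61.06
tokens_per_second 7 25.22       # prompt eval: prints 277.56
```

Recomputing the prompt figure from the rounded 25.22 ms gives 277.56 rather than the logged 277.59, presumably because llama.cpp divides by the unrounded timing.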
Memory Breakdown:
- GPU total: 122548 MiB (GB10)
- GPU free: 108254 MiB
- GPU used: 3253 MiB (model 2375 + context 576 + compute 301)
- Unaccounted: 11040 MiB
- Host memory: 317 MiB
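The breakdown can be cross-checked with simple arithmetic. This sketch uses hypothetical helper names and only the numbers reported above.

```shell
# Hypothetical helpers for re-deriving the memory figures (all values in MiB).
gpu_used_mib() {
    # model buffer + KV cache + compute buffer
    awk -v m="$1" -v k="$2" -v c="$3" 'BEGIN { printf "%.2f\n", m + k + c }'
}
gpu_unaccounted_mib() {
    # total - free - used
    echo $(( $1 - $2 - $3 ))
}

gpu_used_mib 2375.91 576.00 301.75      # prints 3253.66
gpu_unaccounted_mib 122548 108254 3253  # prints 11041
```

The second result is 1 MiB away from the logged 11040 MiB, which is most likely a rounding difference in the individual figures.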
Output Quality:
- Generated coherent philosophical response about the meaning of life
- Properly formatted with structured points
- No errors or hallucinations detected
Solution: Add CUDA bin directory to PATH:
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
Solution: Use CMake instead of make:
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j$(nproc)
Solution: Install CURL development package:
sudo apt-get install -y libcurl4-openssl-dev
Solution:
- Check internet connectivity
- Verify HuggingFace is accessible
- Ensure sufficient disk space in /home/dev/.cache/llama.cpp/
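The disk-space check can be scripted with df. This is a minimal sketch, assuming POSIX `df -P` output; both function names are hypothetical.

```shell
# Hypothetical helpers: report free space on the filesystem holding a path.
free_mib() {
    # df -P prints a header line and then one line per filesystem;
    # column 4 is the available space in 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}
has_space_for() {
    # $1 = path, $2 = required space in MiB
    [ "$(free_mib "$1")" -ge "$2" ]
}
```

For the model used here, `has_space_for "$HOME" 2400` is a rough check before the first download. The 2400 MiB threshold is an assumption based on the 2.38 GB file size reported above, and `$HOME` is used because the cache directory may not exist yet.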
Solution:
- Verify GPU is visible: nvidia-smi
- Check CUDA installation: ls /usr/local/cuda*
- Ensure CUDA paths are set in environment variables
Solution: Reduce context size or use smaller quantization:
llama-cli -hf model-name -c 2048 # Reduce context to 2048
- All model layers on GPU
- Flash Attention enabled (auto-detected)
- OpenMP enabled for CPU operations
- ARM FMA (Fused Multiply-Add) enabled
- Thread count: 20 threads
Increase batch size for throughput-focused workloads:
llama-cli -hf model-name -b 4096
Adjust context size based on needs:
llama-cli -hf model-name -c 8192 # Increase context
GPU offload control (if needed to balance memory):
llama-cli -hf model-name -ngl 30 # Offload only 30 layers
Parallel processing for batch inference:
llama-cli -hf model-name -np 4 # 4 parallel sequences
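When changing the context size, the KV cache is the part of GPU memory that moves with it. This is a rough estimate for this model, assuming the cache grows linearly with context length and the cache type stays the same, scaled from the 576 MiB measured at a 4096-token context. The function name `kv_cache_mib` is hypothetical.

```shell
# Hypothetical estimate: scale the measured KV cache (576 MiB at 4096 tokens)
# linearly to another context length. Assumes the same model and cache type.
kv_cache_mib() {
    echo $(( 576 * $1 / 4096 ))
}

kv_cache_mib 2048    # prints 288
kv_cache_mib 8192    # prints 1152
kv_cache_mib 40960   # prints 5760 (the model's full trained context)
```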
The -hf flag enables direct model download from HuggingFace:
# General syntax
llama-cli -hf <username>/<repo-name> [options]
# Examples
llama-cli -hf unsloth/Qwen3-4B-GGUF -p "Your prompt" -n 128
llama-cli -hf TheBloke/Llama-2-7B-GGUF -p "Your prompt" -n 256
- llama.cpp automatically selects an appropriate quantization (typically Q4_K_M)
- Models are cached in ~/.cache/llama.cpp/
- First run downloads the model, subsequent runs use the cache
- GGUF format (required)
- Various quantizations: Q4_K_M, Q5_K_M, Q8_0, etc.
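To find a cached model for tools that take a file path, such as llama-bench, the cache file name can be reconstructed. The pattern below is an assumption inferred from the single cached file seen in this run (the repo name with "/" replaced by "_", then "_" and the GGUF file name); it is not taken from llama.cpp documentation. The function name `cache_path` is hypothetical.

```shell
# Hypothetical helper: guess the cache path for a HuggingFace repo and file.
# The naming scheme is inferred from one observed file and may not hold
# for other llama.cpp versions.
cache_path() {
    repo="$1"
    file="$2"
    printf '%s/.cache/llama.cpp/%s_%s\n' \
        "$HOME" "$(printf '%s' "$repo" | tr '/' '_')" "$file"
}
```

With `HOME=/home/dev`, `cache_path unsloth/Qwen3-4B-GGUF Qwen3-4B-Q4_K_M.gguf` reproduces the cached path reported during the model download.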
Add to ~/.bashrc or ~/.zshrc:
# CUDA paths
export PATH=/usr/local/cuda-13.0/bin:$PATH
export CUDA_HOME=/usr/local/cuda-13.0
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
# llama.cpp binary
export PATH=$PATH:/home/dev/llama-cpp-experiment/llama.cpp/build/bin
- GPU detected and accessible
- CUDA toolkit installed and in PATH
- CMake installed (3.28.3+)
- CURL development libraries installed
- llama.cpp compiled with CUDA support
- All layers offloaded to GPU
- Model download and caching working
- Inference successful with good performance
# Check GPU status
nvidia-smi
# Monitor GPU usage in real-time
watch -n 1 nvidia-smi
# List all llama.cpp binaries
ls -lh /home/dev/llama-cpp-experiment/llama.cpp/build/bin/
# Clear model cache
rm -rf ~/.cache/llama.cpp/
# Benchmark a model
llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3-4B-GGUF_Qwen3-4B-Q4_K_M.gguf
# Run interactive chat
llama-cli -hf unsloth/Qwen3-4B-GGUF
# Start HTTP server
llama-server -hf unsloth/Qwen3-4B-GGUF --port 8080
- llama.cpp repository: https://github.com/ggerganov/llama.cpp
- Build documentation: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
- Model repository: https://huggingface.co/unsloth/Qwen3-4B-GGUF
- GGUF format info: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
Final working configuration achieved:
- ✓ Build completed without errors
- ✓ GPU fully utilized (37/37 layers on GPU)
- ✓ High inference speed (61 tokens/sec generation)
- ✓ Fast prompt processing (277 tokens/sec)
- ✓ Stable memory usage (~3.3 GB GPU)
- ✓ HuggingFace integration functional
- ✓ Model caching working correctly
Setup completed: October 14, 2025
Ran this inside a container with a dev user:
Then:
I also added Claude Code:
sudo apt-get install -y npm # You have to interactively set timezone here, fix that
sudo npm install -g @anthropic-ai/claude-code
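A common way to avoid the interactive timezone prompt is to run apt-get with `DEBIAN_FRONTEND=noninteractive`, which tells debconf to accept defaults instead of asking. This is a sketch, not something verified in this container; the wrapper name `apt_install_quiet` and the `DRY_RUN` switch are hypothetical.

```shell
# Hypothetical wrapper: install packages without debconf prompts.
# Set DRY_RUN=1 to print the command instead of running it.
apt_install_quiet() {
    run="${DRY_RUN:+echo}"
    # "sudo env VAR=value cmd" passes the variable through to apt-get.
    $run sudo env DEBIAN_FRONTEND=noninteractive apt-get install -y "$@"
}
```

`DRY_RUN=1 apt_install_quiet npm` shows the command that would run. Without `DRY_RUN` the install proceeds, and the timezone should be left at the package default.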