@FlorSanders
Created April 11, 2024 15:17

Setup Guide for llama.cpp on Nvidia Jetson Nano 2GB

This is a full account of the steps I ran to get llama.cpp running on the Nvidia Jetson Nano 2GB. It consolidates multiple fixes and tutorials, whose contributions are referenced at the bottom of this README.

Procedure

At a high level, the procedure to install llama.cpp on a Jetson Nano consists of four steps.

  1. Compile the gcc 8.5 compiler from source.

  2. Compile llama.cpp from source using the gcc 8.5 compiler.

  3. Download a model.

  4. Perform inference.

As steps 1 and 2 take a long time, I have uploaded the resulting binaries for download in the repository. Simply download and unzip them, then follow steps 3 and 4 to perform inference, as sketched below.
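
A minimal sketch of that shortcut (the archive name and URL are placeholders of my own, not taken from this gist; substitute whatever is actually attached to the repository):

# hypothetical names - replace with the actual archive from the repository
wget <url-to-prebuilt-llama.cpp-binaries>.zip -O llama-jetson.zip
unzip llama-jetson.zip -d llama.cpp
cd llama.cpp
# continue with steps 3 and 4 below (download a model, then run ./main or ./server)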

GCC Compilation

  1. Compile the GCC 8.5 compiler from source on the Jetson Nano.
    NOTE: The make -j6 command takes a long time. I recommend running it overnight in a tmux session. Additionally, it requires quite a bit of disk space so make sure to leave at least 8GB of free space on the device before starting.
wget https://bigsearcher.com/mirrors/gcc/releases/gcc-8.5.0/gcc-8.5.0.tar.gz
sudo tar -zvxf gcc-8.5.0.tar.gz --directory=/usr/local/
cd /usr/local/
./contrib/download_prerequisites
mkdir build
cd build
sudo ../configure -enable-checking=release -enable-languages=c,c++
make -j6
make install
  2. Once the make install command has run successfully, you can clean up disk space by removing the build directory.
cd /usr/local/
rm -rf build
  3. Set the newly installed GCC and G++ as the compilers via environment variables.
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
  4. Double-check that the install was successful (both commands should report 8.5.0). A snippet for persisting these settings follows below.
gcc --version
g++ --version
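
Optionally, a small sketch (my addition, not part of the original steps) to persist these settings so later builds don't silently fall back to the stock gcc 7.5, an issue that comes up again in the comments below:

# append the compiler selection to ~/.bashrc so every new shell picks up gcc 8.5
echo 'export CC=/usr/local/bin/gcc'  >> ~/.bashrc
echo 'export CXX=/usr/local/bin/g++' >> ~/.bashrc
source ~/.bashrc
gcc --version   # both should now report 8.5.0
g++ --version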

llama.cpp Compilation

  1. Start by cloning the repository and rolling back to a known working commit.
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
git checkout a33e6a0
  2. Edit the Makefile and apply the following changes
    (save the diff below to file.patch and apply it with git apply file.patch)
diff --git a/Makefile b/Makefile
index 068f6ed0..a4ed3c95 100644
--- a/Makefile
+++ b/Makefile
@@ -106,11 +106,11 @@ MK_NVCCFLAGS = -std=c++11
 ifdef LLAMA_FAST
 MK_CFLAGS     += -Ofast
 HOST_CXXFLAGS += -Ofast
-MK_NVCCFLAGS  += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 else
 MK_CFLAGS     += -O3
 MK_CXXFLAGS   += -O3
-MK_NVCCFLAGS  += -O3
+MK_NVCCFLAGS += -maxrregcount=80
 endif

 ifndef LLAMA_NO_CCACHE
@@ -299,7 +299,6 @@ ifneq ($(filter aarch64%,$(UNAME_M)),)
     # Raspberry Pi 3, 4, Zero 2 (64-bit)
     # Nvidia Jetson
     MK_CFLAGS   += -mcpu=native
-    MK_CXXFLAGS += -mcpu=native
     JETSON_RELEASE_INFO = $(shell jetson_release)
     ifdef JETSON_RELEASE_INFO
         ifneq ($(filter TX2%,$(JETSON_RELEASE_INFO)),)
  • NOTE: If you would rather make the changes manually, do the following:

    • Change MK_NVCCFLAGS += -O3 to MK_NVCCFLAGS += -maxrregcount=80 on line 109 and line 113.

    • Remove MK_CXXFLAGS += -mcpu=native on line 302.

  3. Build the llama.cpp source code. A quick sanity check follows the build command.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6
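
As a quick sanity check (my addition, not from the original guide), the resulting binary should list the GPU offload option if the CUDA build succeeded:

# should print the -ngl / --n-gpu-layers option when CUDA support was compiled in
./main --help | grep -i "n-gpu-layers"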

Download a model

  1. Download a model to the device
wget https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf

Perform inference

  1. Test the main inference script
./main -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33  -c 2048 -b 512 -n 128 --keep 48
  2. Run the live server
./server -m ./TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf -ngl 33  -c 2048 -b 512 -n 128
  3. Test the web server functionality using curl; a variation that extracts only the generated text follows below.
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
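
If you want just the generated text rather than the full JSON, a small variation (assuming jq is installed and that the server's /completion response carries the result in a content field, as it does in the llama.cpp server builds I am aware of):

# same request, piped through jq to print only the generated text
curl -s --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' \
    | jq -r '.content'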

You can now run a large language model on this tiny and cheap edge device. Have fun!

References

@keszegrobert

With a freshly installed Nano, I also had to set up the CUDA paths:
$ export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
source: https://forums.developer.nvidia.com/t/cuda-nvcc-not-found/118068
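
To make these exports permanent, a sketch (assuming the default JetPack install location under /usr/local/cuda):

echo 'export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
source ~/.bashrc
nvcc --version   # should report release 10.2 on JetPack 4.x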

@VVilliams123

I am on a Jetson Nano using CUDA 10.2. I exported the CUDA paths correctly, but I am getting this error:
CUDA error: no kernel image is available for execution on the device
current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:9906
cudaGetLastError()
GGML_ASSERT: ggml-cuda.cu:255: !"CUDA error"
No symbol table is loaded. Use the "file" command.
[New LWP 12804]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000007f87dd7d5c in __waitpid (pid=<optimized out>, stat_loc=0x0, options=<optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30 ../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory.
No symbol "frame" in current context.
Aborted (core dumped)
What do I need to do to fix this?

@kreier

kreier commented Jan 7, 2025

Line 3 of the code block under GCC Compilation ("1. Compile the GCC 8.5 compiler from source") should read:

cd /usr/local/gcc-8.5.0/

@nzemke

nzemke commented Jan 27, 2025

With the above instructions I was still seeing errors from ggml-quants.c when running the make command. The first couple of error lines are:

 ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
 ggml-quants.c:497:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
  #define ggml_vld1q_s16_x2 vld1q_s16_x2
                            ^
 ggml-quants.c:5712:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
          const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
                                          ^~~~~~~~~~~~~~~~~
 ggml-quants.c:497:27: error: invalid initializer
  #define ggml_vld1q_s16_x2 vld1q_s16_x2
                            ^
 ggml-quants.c:5712:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
          const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);

To resolve the long list of errors I added the following code to ggml-quants.c:

#ifdef __ARM_NEON
#include <arm_neon.h>
#else
// Define fallback implementations for NEON intrinsics if required
#endif

#undef MIN
#undef MAX
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

#ifdef __ARM_NEON

// Provide fallback implementations for missing NEON functions
static inline int16x8x2_t vld1q_s16_x2(const int16_t *ptr) {
    int16x8x2_t result;
    result.val[0] = vld1q_s16(ptr);
    result.val[1] = vld1q_s16(ptr + 8);
    return result;
}

static inline uint8x16x2_t vld1q_u8_x2(const uint8_t *ptr) {
    uint8x16x2_t result;
    result.val[0] = vld1q_u8(ptr);
    result.val[1] = vld1q_u8(ptr + 16);
    return result;
}

static inline uint8x16x4_t vld1q_u8_x4(const uint8_t *ptr) {
    uint8x16x4_t result;
    result.val[0] = vld1q_u8(ptr);
    result.val[1] = vld1q_u8(ptr + 16);
    result.val[2] = vld1q_u8(ptr + 32);
    result.val[3] = vld1q_u8(ptr + 48);
    return result;
}

static inline int8x16x2_t vld1q_s8_x2(const int8_t *ptr) {
    int8x16x2_t result;
    result.val[0] = vld1q_s8(ptr);
    result.val[1] = vld1q_s8(ptr + 16);
    return result;
}

static inline int8x16x4_t vld1q_s8_x4(const int8_t *ptr) {
    int8x16x4_t result;
    result.val[0] = vld1q_s8(ptr);
    result.val[1] = vld1q_s8(ptr + 16);
    result.val[2] = vld1q_s8(ptr + 32);
    result.val[3] = vld1q_s8(ptr + 48);
    return result;
}

#define ggml_vld1q_s16_x2 vld1q_s16_x2
#define ggml_vld1q_u8_x2  vld1q_u8_x2
#define ggml_vld1q_u8_x4  vld1q_u8_x4
#define ggml_vld1q_s8_x2  vld1q_s8_x2
#define ggml_vld1q_s8_x4  vld1q_s8_x4

#else
#error "This implementation requires ARM NEON support. Please provide fallbacks for your architecture."
#endif

// The rest of the code implementation remains unchanged.

// Quantization and dequantization functions are here.

@SomKen

SomKen commented Jan 27, 2025

Do you need to make these changes on 8GB Jetsons?

@nzemke

nzemke commented Jan 28, 2025

Do you need to make these changes on 8GB Jetsons?

You shouldn't have to roll llama.cpp back to the previous version to get it to run. I'm working with the Jetson Nano 2GB, which can't go any higher than CUDA version 10.2. The only reason to roll back the version of llama.cpp is if you are working with an older version of CUDA.

That being said, I have not checked the current master to see if the dev still has the troublesome section commented as being untested. If you are experiencing the same errors I listed (there were a total of 5 repeating chunks of errors, one for each "implicit declaration of function"), then I would say yes, give it a try.

Hope that helps!

@kreier

kreier commented Jan 28, 2025

This gist is from April 11, 2024 and refers to the old Jetson Nano from 2019 with only 2GB RAM. There was also a 4GB version, but all of the original Jetson Nanos are only supported up to JetPack 4.6.6, which includes CUDA 10.2 as the latest version. This gist then recommends checking out the llama.cpp commit a33e6a0 from February 26, 2024. A lot has changed since then, and I think the Jetson Nano is now natively supported by llama.cpp. The current version of the Makefile has entries for the Jetson in line 476. It could well be that this only refers to running on the CPU (as with the mentioned Raspberry Pis) and not to using the GPU with CUDA. This aligns with the error message by VVilliams123 on October 4, 2024.

Currently I can run ollama (based on llama.cpp) on my Jetson Nano 4GB from 2019 with CPU inference out of the box; I just followed the recommended installation. As pointed out in ollama issue 4140, it should not be possible to run ollama with CUDA support on the GPU of the Jetson Nano. The reason is that the latest JetPack from Nvidia only includes CUDA 10.2. The latest version of gcc that supports CUDA 10.2 is gcc-8, but to compile ollama with CUDA support you need at least gcc-11. @dtischler pointed out on May 4, 2024 that the upgrade to gcc-11 can be done relatively easily, but CUDA and the GPU are not usable.

The situation is different for the Jetson Orin Nano with 8GB of RAM that you might be referring to. The Jetson Orin Nano has been supported since JetPack 5.1.1 (L4T r35.x) and is also supported by JetPack 6 (L4T r36.x). Inference acceleration using CUDA and the GPU should be possible, and will probably work out of the box.

@Romyull-Islam

I am using a Jetson Nano 4GB. I could install gcc 9 and g++ 9 for CPU inference.
I am facing a similar error while building for GPU. I tried with both gcc 8 and 9. The main issue appears to be the CUDA version, 10.2.

@konanast

konanast commented Feb 8, 2025

I also have the 4GB version, and Ollama is only using the CPU. It would be great if there were a way to make it utilize the GPU as well.

@kreier

kreier commented Feb 9, 2025

Not sure if it would actually improve the inference speed, since token generation is generally limited by memory speed. With the 4GB unified memory for CPU and GPU, the theoretical 25.6 GB/s sets the maximum speed. I actually only get 6.8 GB/s in a real-world measurement with sysbench memory --memory-block-size=1M run; not sure why that is.

The Jetson Orin Nano with LPDDR5 has more memory that is also about 3x faster, with a theoretical 68.3 GB/s. And it has ongoing software support from Nvidia ...

@zurvan23

As the standard build of llama.cpp is rather slow on the Jetson Nano 4GB from 2019, running only on the CPU, I was hoping to get it to run on the GPU with these instructions, but alas, @kreier is correct. It runs after following these instructions, but again only with CPU inference. As soon as I try to offload to the GPU with --ngl xx, I get the same error as @VVilliams123.

Sad but true!

@Romyull-Islam

Romyull-Islam commented Feb 15, 2025 via email

@anuragdogra2192

anuragdogra2192 commented Mar 24, 2025

I made it work on Ubuntu 18.04 on a Jetson Nano 4GB with CUDA 10.2 and gcc 8.5, using version 81bc921, by defining the below before including <cuda_runtime.h> in ggml-cuda.cu:

#if CUDA_VERSION < 11000
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#define CUBLAS_COMPUTE_16F CUDA_R_16F
#define CUBLAS_COMPUTE_32F CUDA_R_32F
#endif

Make sure to configure with cmake -DLLAMA_CUBLAS=ON (see the build sketch below).
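
For reference, a possible end-to-end build sequence for this approach, assuming gcc 8.5 is installed under /usr/local as in the guide above (a sketch, not verified on every setup):

export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
cd llama.cpp            # the tree checked out at 81bc921
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON
make -j 2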

@kreier

kreier commented Mar 24, 2025

Do you see your GPU being used in jtop or after running ollama with ollama ps? How does the inference speed compare to the pure use of the CPU? I got some 2.65 token/s with deepseek-r1:1.5b in ollama 0.5.4 with a surprising 66% GPU utilization, but in jtop the GPU always idles. So I am not sure if the percentage from ollama is correct.

After upgrading to ollama 0.6.2 it goes even up to 100% GPU in ollama ps (using 1.9 GByte RAM), and the token generation is slightly faster with 3.66 token/s. But jtop indicates 0% GPU usage, while the CPU is at 100%. Only sometimes is some GPU activity showing, but that's probably related to screen activity (headless is always 0%).

Switching ollama to pure CPU with /set parameter num_gpu 0 does not change the 3.66 token/s speed. But ollama ps now reports 100% CPU and needs only 1.1 GByte RAM. As usual it takes a few seconds to unload the model from the GPU and reload it into RAM for the CPU (even in this unified architecture, I guess the memory controller handles the different usage cases for the CPU). The increased RAM usage in ollama for the same model when using a GPU (compared to the CPU) matches my experience with larger GPUs and models (like the P106-100 or 3060 Ti). The unchanged token generation speed matches my experience with the linear correlation of token/s to RAM speed. Since the RAM speed is the same for GPU and CPU because of the unified memory architecture of the Jetson Nano, we would expect the same token generation speed. So the GPU cannot increase the inference speed on the Jetson. That's different on a regular PC with dedicated VRAM for the GPU and much faster GDDR compared to the slower DIMMs with DDR RAM for the CPU.

PS: Another test run resulted in the same speed. With the same question ("How many r's are in the word strawberry?") and model (deepseek-r1:1.5b), ollama now reports 12%/88% CPU/GPU. And jtop is not even blinking during prompt evaluation (2.18 seconds).

My explanation attempt is that the attention operations (matrix multiplication and addition) are done rather fast by both GPU and CPU with the available cache and vector instructions, and the bottleneck in general is the slow RAM. The current prompt has to be processed through the entire weight matrix (1.5b parameters in my case) for the next token. Even expanding the quantized parameters from int4 to int8 or fp16 or whatever format is used for the matrix multiplication needs comparatively little time; the processor is then waiting for the next chunk of LLM model data to arrive before it can continue calculating the next predicted token. Therefore in LM Studio or ollama I see only a utilization of 60% of the GPU when doing inference, same for the power draw. Training an LLM is probably a different picture.
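
A back-of-the-envelope sketch of this argument, using the numbers from above (illustrative only; it treats token generation as purely memory-bound):

# every generated token streams roughly the whole quantized model through memory once,
# so token generation is capped at about bandwidth / model size
awk 'BEGIN {
    bandwidth_gb_s = 6.8   # measured with sysbench on the Jetson Nano
    model_gb       = 1.1   # approximate footprint of deepseek-r1:1.5b on the CPU
    printf "upper bound: %.1f token/s (measured: ~3.7)\n", bandwidth_gb_s / model_gb
}'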

@anuragdogra2192

anuragdogra2192 commented Mar 25, 2025

I have tried the model "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf" from Hugging Face with llama.cpp.
It worked like a charm. I have added my performance-optimized parameters in the blog post below; check it out:

https://medium.com/@anuragdogra2192/llama-cpp-on-nvidia-jetson-nano-a-complete-guide-fb178530bc35

@kreier

kreier commented Mar 27, 2025

Thanks @anuragdogra2192 for the detailed explanation on medium.com. The screenshots clearly show in jtop that the GPU is used (although 100% draws only 403 mW, or 1.1W), and you include the speed for prompt evaluation with 3.08 token/s and evaluation (token generation) with 1.75 token/s.

The codebase you use there (the llama.cpp with git checkout 81bc921) is from December 2023, so one might assume that some improvements in software happened in the last 1.5 years. I started with ollama 0.6.2 since I think the backend is llama.cpp. I loaded the same model you used with

ollama run hf.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:Q4_K_M --verbose

And after inquiring "Can you suggest some places to visit in Dresden?" I got 10 places with the following analysis:

total duration:       1m27.059177746s
load duration:        35.433224ms
prompt eval count:    36 token(s)
prompt eval duration: 4.208096886s
prompt eval rate:     8.55 tokens/s
eval count:           449 token(s)
eval duration:        1m22.814029296s
eval rate:            5.42 tokens/s

I followed up with your question "I want to know some Cafes in Dresden city in Germany" and got again 10 places (instead of 3 for the same model?) with Alte Brücke, Kronenberg, Bauer, Kuhlewasser, Wenckebach, Am Kamp, Mauer, Bode, Schmitz and Slowenischer Hof. The analysis:

total duration:       2m57.918577451s
load duration:        39.293293ms
prompt eval count:    517 token(s)
prompt eval duration: 1m1.272009807s
prompt eval rate:     8.44 tokens/s
eval count:           569 token(s)
eval duration:        1m56.59851322s
eval rate:            4.88 tokens/s

jtop again shows no activity for the GPU, even 0mW power consumption, while the CPU is at 3.5W (compared to 2.2W for your case).

image

And ollama somehow still indicates to be using the GPU:

mk@jetson:~$ ollama ps
NAME                                                   ID              SIZE      PROCESSOR         UNTIL
hf.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:Q4_K_M    86746d71dea5    1.3 GB    6%/94% CPU/GPU    4 minutes from now
mk@jetson:~$ ollama list
NAME                                                   ID              SIZE      MODIFIED
hf.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:Q4_K_M    86746d71dea5    668 MB    2 hours ago
deepseek-r1:1.5b                                       a42b25d8c10a    1.1 GB    2 months ago

And since the model is 41% smaller than deepseek-r1:1.5b, with 668 MB instead of 1.04 GiB (and only 22 layers instead of 28), it is also 40% faster in token generation (average 5.15 compared to 3.66). Which in turn aligns with the memory bandwidth being the bottleneck.

It seems the use of the GPU for inference on the Jetson Nano currently does not make sense. The current way of doing inference (even with MLA Multi-head Latent Attention, Mixture of Experts and Multi-Token Prediction) has the memory bandwidth as its bottleneck. And CPUs get more useful instructions too, like NEON and FMA. The newer software makes the LLM almost 3x faster with the CPU than the older code with the GPU (5.15 token/s vs. 1.75). Therefore it currently seems like an academic exercise to use the GPU. That might be different for training a model.

I'll check the speed with llama.cpp later and post the update here.

@kreier

kreier commented Mar 27, 2025

I successfully compiled one of the latest versions (b4970) of llama.cpp on the Jetson Nano with gcc 9.4 for CPU inference, using cmake 3.31.6 (installed with snap; with apt you only get 3.10.2, but you need at least 3.14). All of the following tests are done with the model TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:Q4_K_M.

Now I can compare the average token speed of 5.15 in ollama with the speed in llama.cpp. First I used the CLI with a question about cafés in Dresden, using the command ./build/bin/llama-cli -m models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "I want to know some Cafes in Dresden city in Germany". The result is 5.02 token/s:

llama_perf_sampler_print:    sampling time =      57,77 ms /   411 runs   (    0,14 ms per token,  7114,79 tokens per second)
llama_perf_context_print:        load time =     793,19 ms
llama_perf_context_print: prompt eval time =    4082,84 ms /    30 tokens (  136,09 ms per token,     7,35 tokens per second)
llama_perf_context_print:        eval time =   75642,80 ms /   380 runs   (  199,06 ms per token,     5,02 tokens per second)
llama_perf_context_print:       total time =  181344,19 ms /   410 tokens

Adding the parameters --n-gpu-layers 5 --ctx-size 512 --threads 4 --temp 0.7 --top-k 40 --top-p 0.9 --batch-size 16 does not change the result significantly. The gpu-layers parameter is ignored anyway since it's running on the CPU. And it made me wonder why you chose a value of 5 layers. TinyLlama-1.1B-Chat has 22 layers, and they all fit into the unified RAM. Yet somehow the GPU was still utilized 100%? Can you try different values?

For consistency I ran the benchmark on this model with ./build/bin/llama-bench -m ../.cache/llama.cpp/TheBloke_TinyLlama-1.1B-Chat-v1.0-GGUF_tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf. The result is the same for PP and TG (within the margin of error), the Jetson Nano produces tokens at the speed of about 5 token/s:

| model                  |       size | params | backend | threads |  test |         t/s |
| ---------------------- | ---------: | -----: | ------- | ------: | ----: | ----------: |
| llama 1B Q4_K - Medium | 636.18 MiB | 1.10 B | CPU     |       4 | pp512 | 6.71 ± 0.00 |
| llama 1B Q4_K - Medium | 636.18 MiB | 1.10 B | CPU     |       4 | tg128 | 4.98 ± 0.01 |

build: c7b43ab6 (4970)

This indicates the limits of this edge computing device. I measured a realistic memory bandwidth of 6 GB/s for the Jetson Nano. On an i7-13700T with dual-channel DDR4 and 57 GB/s I get the following result:

| model                  |       size | params | backend | threads |  test |           t/s |
| ---------------------- | ---------: | -----: | ------- | ------: | ----: | ------------: |
| llama 1B Q4_K - Medium | 636.18 MiB | 1.10 B | CPU     |      12 | pp512 | 156.84 ± 8.99 |
| llama 1B Q4_K - Medium | 636.18 MiB | 1.10 B | CPU     |      12 | tg128 |  47.38 ± 0.88 |

build: d5c6309d (4975)

And finally on a 3070 Ti with 575 GB/s I get the result:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
| model                  |       size | params | backend | ngl |   test |               t/s |
| ---------------------- | ---------: | -----: | ------- | --: | -----: | ----------------: |
| llama 1B Q4_K - Medium | 636.18 MiB | 1.10 B | CUDA    |  99 |  pp512 | 12830.34 ± 186.18 |
| llama 1B Q4_K - Medium | 636.18 MiB | 1.10 B | CUDA    |  99 |  tg128 |    325.35 ± 11.50 |

build: f125b8dc (4977)

Which indicates: 10x memory bandwidth, 10x token generation; 96x memory bandwidth, 65x token generation. The CUDA core comparison is 128 to 6144, but with the GPU the Jetson is currently even slower 😲.

You see where the raw compute power is really needed: in the initial prompt processing. Here we see a jump from 6.71 on the Jetson to 12830 on the RTX 3070 Ti, a factor of 1912x. Comparing @anuragdogra2192's GPU version to my CPU version, it is only 2x slower in pp (3.08 vs. 6.71) but almost 3x slower in tg (1.75 vs. 4.98), so the GPU might have an impact here.

@anuragdogra2192

anuragdogra2192 commented Mar 27, 2025

Thanks @kreier for your updates.
I have used the old version of llama.cpp because it works with CUDA 10.2, which is compatible with the Jetson Nano architecture.
The recent versions of llama.cpp require a minimum of CUDA 11+ (maybe CUDA 10.2 lacks optimizations that the latest CUDA versions have for LLM inference). And I wanted to build a GPU-enabled one to try out on my Nano kit.
I chose --n-gpu-layers 5 to keep GPU memory usage low and avoid crashing, but I will try with --n-gpu-layers 22 as suggested by @kreier and update my metrics here soon. The GPU makes a difference for sure.

@kreier

kreier commented Mar 28, 2025

I followed the instructions from @anuragdogra2192 on medium.com and successfully compiled a GPU-accelerated version of llama.cpp. Some predictions regarding the prompt processing speed pp512 came true, and a few new questions arose. But first the results.

This old version b1618 (81bc921 from December 7, 2023) does not have a llama-cli yet, so we call the main program with a task regarding the solar system: mk@jetson:~/llama.cpp3$ ./build/bin/main -m ../.cache/llama.cpp/TheBloke_TinyLlama-1.1B-Chat-v1.0-GGUF_tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Solar System" --n-gpu-layers 5 --ctx-size 512 --threads 4 --temp 0.7 --top-k 40 --top-p 0.9 --batch-size 16. After the answer the speed summary is

llama_print_timings:        load time =    1502,98 ms
llama_print_timings:      sample time =     739,99 ms /   869 runs   (    0,85 ms per token,  1174,34 tokens per second)
llama_print_timings: prompt eval time =    1148,26 ms /     4 tokens (  287,06 ms per token,     3,48 tokens per second)
llama_print_timings:        eval time =  387189,88 ms /   868 runs   (  446,07 ms per token,     2,24 tokens per second)
llama_print_timings:       total time =  389600,15 ms
Log end

Slightly better than 1.75 token/s. But it could also be a result of the context window not being filled yet. And it's still slower than pure CPU use with newer llama.cpp builds. Now the CPU is only partly used at 650 mW, but the GPU is at 100% and 3.2W:

mk2

Let's move to the integrated benchmark tool, starting with the same 5 layers: ./build/bin/llama-bench -m ../.cache/llama.cpp/TheBloke_TinyLlama-1.1B-Chat-v1.0-GGUF_tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --n-gpu-layers 5

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA Tegra X1, compute capability 5.3
| model                          |       size | params | backend | ngl | test   |          t/s |
| ------------------------------ | ---------: | -----: | ------- | --: | ------ | -----------: |
| llama ?B mostly Q4_K - Medium  | 636.18 MiB | 1.10 B | CUDA    |   5 | pp 512 | 20.99 ± 0.16 |
| llama ?B mostly Q4_K - Medium  | 636.18 MiB | 1.10 B | CUDA    |   5 | tg 128 |  2.77 ± 0.01 |

build: 81bc9214 (1618)

And now a limit to 10 layers:

| model                          |       size | params | backend | ngl | test   |          t/s |
| ------------------------------ | ---------: | -----: | ------- | --: | ------ | -----------: |
| llama ?B mostly Q4_K - Medium  | 636.18 MiB | 1.10 B | CUDA    |  10 | pp 512 | 24.29 ± 0.42 |
| llama ?B mostly Q4_K - Medium  | 636.18 MiB | 1.10 B | CUDA    |  10 | tg 128 |  2.93 ± 0.01 |

Now with 22 layers:

| model                          |       size | params | backend | ngl | test   |          t/s |
| ------------------------------ | ---------: | -----: | ------- | --: | ------ | -----------: |
| llama ?B mostly Q4_K - Medium  | 636.18 MiB | 1.10 B | CUDA    |  22 | pp 512 | 42.24 ± 0.63 |
| llama ?B mostly Q4_K - Medium  | 636.18 MiB | 1.10 B | CUDA    |  22 | tg 128 |  3.09 ± 0.01 |

And now with the maximum working number of layers, 24. With 25 or no limit it crashes:

| model                          |       size | params | backend | ngl | test   |          t/s |
| ------------------------------ | ---------: | -----: | ------- | --: | ------ | -----------: |
| llama ?B mostly Q4_K - Medium  | 636.18 MiB | 1.10 B | CUDA    |  24 | pp 512 | 54.18 ± 0.17 |
| llama ?B mostly Q4_K - Medium  | 636.18 MiB | 1.10 B | CUDA    |  24 | tg 128 |  3.55 ± 0.01 |

Now both the CPU at 1.9W and the GPU at 2.4W are at 100% :

mk3

Questions

The build used by @anuragdogra2192 is from December 2023, the recommended build from the author of this gist @FlorSanders is from February 2024; both are rather old.

  * [81bc921](https://github.com/ggml-org/llama.cpp/tree/81bc9214a389362010f7a57f4cbc30e5f83a2d28) from December 7, 2023 - [b1618](https://github.com/ggml-org/llama.cpp/tree/b1618)

  * [a33e6a0](https://github.com/ggml-org/llama.cpp/commit/a33e6a0d2a66104ea9a906bdbf8a94d050189d91) from February 26, 2024 - [b2275](https://github.com/ggml-org/llama.cpp/tree/b2275)

Both use the nvcc 10.2.300 that ships with the Ubuntu 18.04 LTS image provided by Nvidia, and both need gcc 8.5.0 built from scratch (which takes about 3 hours). The version from Anurag needed 5 extra lines in the file ggml-cuda.cu in the llama.cpp folder. Then he first runs cmake .. -DLLAMA_CUBLAS=ON, followed by make -j 2 in the build folder.

Flor compiled with make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6 after changing some lines in the Makefile. And he could run llama.cpp with 33 layers and --n-gpu-layers 33 while I get crashes for values larger than 24. The currently recommended method consists of two steps with CMake:

cmake -B build
cmake --build build --config Release

Would it be possible to tweak the current build (something in the b4984 range) to let it compile with the older nvcc 10.2 and gcc 8.5? Trying without any changes I got errors like nvcc fatal : Unsupported gpu architecture 'compute_80'. Flor had already explicitly required sm_62, which is higher than the 5.3 the Jetson has in hardware. I could not find a specific date or build of llama.cpp that indicates dropped support for nvcc 10.2. And the CC 5.3 of the Jetson is still supported by nvcc 12.8; it is just not provided for the Jetson by Nvidia. And the CPU version of the current llama.cpp can be compiled with gcc 8.5. One possible starting point is sketched below.
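
An untested sketch of where I would start: pin the CUDA architecture to the Tegra X1's compute capability 5.3 so nvcc 10.2 is never asked to build compute_80 kernels (flag names follow the current CMake build and may need adjusting for a given tree):

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=53
cmake --build build --config Release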

Observations

With increased use of the GPU the prompt processing speed pp512 is indeed increasing! The pure CPU speed for the current llama.cpp build was 6.71; with 5 GPU layers it was more than 3x faster at 20.99, then 24.29 with 10 layers, 42.24 with 22 layers and finally 54.18 with 24 layers (before crashing at 25 layers). That's almost 10x faster when using the GPU!

The token generation tg128 is still significantly slower than newer builds with the CPU: compare 3.55 for the GPU to 4.98 for the CPU, 29% slower. I think this gap would be closed with more recent versions of llama.cpp.

And the use of the GPU is constantly fluctuating between 0 and 100% and every value in between. I haven't observed this behaviour with discrete graphics cards and llama.cpp; they usually go to a very high percentage for pp, then a rather constant 20% - 40% usage (each value depending on the model and graphics card type, or on distribution across several GPUs in the system, but constant while processing a given task in one setup). The fluctuation of the Jetson GPU could be an effect of the unified memory that has to be shared with the CPU, and I could not yet fully utilize the GPU.

Actually, only llama-bench crashes after 10 seconds. main runs continuously with all 22 layers offloaded:

llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 1,10 B
llm_load_print_meta: model size       = 636,18 MiB (4,85 BPW)
llm_load_print_meta: general.name     = tinyllama_tinyllama-1.1b-chat-v1.0
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0,07 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   35,23 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: VRAM used: 601,02 MiB
..................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 11,00 MiB
llama_new_context_with_model: kv self size  =   11,00 MiB
llama_build_graph: non-view tensors processed: 466/466
llama_new_context_with_model: compute buffer total size = 69,57 MiB
llama_new_context_with_model: VRAM scratch buffer: 66,50 MiB
llama_new_context_with_model: total VRAM used: 678,52 MiB (model: 601,02 MiB, context: 77,50 MiB)

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |

mk5

@kreier

kreier commented Mar 28, 2025

Update on this initial gist (a33e6a0 from February 26, 2024): I finally got it compiled! Somehow cc was still linked to cc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0 for aarch64-linux-gnu. I updated /usr/bin/cc to gcc 8.5.0, and the single make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6 runs through rather fast compared to the other CMake variants. main and llama-bench are not in a /build/bin subfolder, but can be called directly in the llama.cpp folder. main works out of the box purely on the CPU with 4.24 t/s prompt evaluation and 2.24 t/s evaluation (token generation) for the -p "Solar system" prompt.

But as soon as --n-gpu-layers 1 is involved it crashes. And llama-bench crashes out of the box, even when no GPU layers are indicated. The initial output is positive:

Log start
main: build = 2275 (a33e6a0d)
main: built with gcc (GCC) 8.5.0 for aarch64-unknown-linux-gnu
main: seed  = 1743182462
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA Tegra X1, compute capability 5.3, VMM: no
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from ..
...
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/23 layers to GPU
llm_load_tensors:        CPU buffer size =   636,18 MiB
llm_load_tensors:      CUDA0 buffer size =    23,64 MiB
....................................................................................
llama_new_context_with_model: n_ctx      = 512

but then later

CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:9906
  cudaGetLastError()
GGML_ASSERT: ggml-cuda.cu:255: !"CUDA error"
[New LWP 30420]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000007f8e76ed5c in __waitpid (pid=<optimized out>, stat_loc=0x0, options=<optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30      ../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory.
#0  0x0000007f8e76ed5c in __waitpid (pid=<optimized out>, stat_loc=0x0, options=<optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30      in ../sysdeps/unix/sysv/linux/waitpid.c
#1  0x0000000000416370 in ggml_print_backtrace ()
#2  0x00000000004e6d50 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) [clone .constprop.453] ()
#3  0x0000000000500138 in ggml_cuda_op_flatten(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, float const*, float const*, float*, CUstream_st*)) ()
#4  0x00000000004fe2e8 in ggml_cuda_compute_forward ()
#5  0x00000000004fe928 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#6  0x000000000050b214 in ggml_backend_sched_graph_compute ()
#7  0x000000000046a920 in llama_decode_internal(llama_context&, llama_batch) ()
#8  0x000000000046b730 in llama_decode ()
#9  0x00000000004b414c in llama_init_from_gpt_params(gpt_params&) ()
#10 0x000000000040f94c in main ()
[Inferior 1 (process 30419) detached]
Aborted (core dumped)

Same result for llama-bench after a promising

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA Tegra X1, compute capability 5.3, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
CUDA error: no kernel image is available for execution on the device

But at least for a short period of time the GPU is actually utilized:

image

For a short period of time the main program shows up in the list of active commands, and the GPU shared RAM jumps to 845 MB/3.9GB (see below). Part of the terminal output before crashing reads:

llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:        CPU buffer size =    35,16 MiB
llm_load_tensors:      CUDA0 buffer size =   601,02 MiB

mk5

@seccijr

seccijr commented Apr 3, 2025

(Quoting @kreier's comment from March 28, 2025 above, in full.)

After compiling gcc 8.5, I had to export the location of the CC and CXX compilers before running CMake:
user@user-desktop$ export CC=/usr/local/bin/gcc
user@user-desktop$ export CXX=/usr/local/bin/g++
This solves the problem of the llama.cpp make step raising these errors:
user@user-desktop:~/Downloads/llama.cpp/build$ make -j 2
[ 1%] Generating build details from Git
[ 2%] Building C object CMakeFiles/ggml.dir/ggml.c.o
-- Found Git: /usr/bin/git (found version "2.17.1")
[ 3%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 3%] Built target build_info
[ 4%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 5%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[ 6%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q2_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:403:27: error: implicit declaration of function ‘vld1q_s16_x2’; did you mean ‘vld1q_s16’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:403:27: error: invalid initializer
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3679:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:3680:41: warning: missing braces around initializer [-Wmissing-braces]
const ggml_int16x8x2_t mins16 = {vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(mins))), vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(mins)))};
^
{ }
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: implicit declaration of function ‘vld1q_u8_x2’; did you mean ‘vld1q_u32’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3716:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q2bits = ggml_vld1q_u8_x2(q2); q2 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:406:27: error: implicit declaration of function ‘vld1q_s8_x2’; did you mean ‘vld1q_s32’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s8_x2 vld1q_s8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:406:27: error: invalid initializer
#define ggml_vld1q_s8_x2 vld1q_s8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3718:40: note: in expansion of macro ‘ggml_vld1q_s8_x2’
ggml_int8x16x2_t q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3723:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(2, 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3725:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(4, 4);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:3708:17: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c:3727:13: note: in expansion of macro ‘SHIFT_MULTIPLY_ACCUM_WITH_SCALE’
SHIFT_MULTIPLY_ACCUM_WITH_SCALE(6, 6);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q3_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4353:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4371:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q3bits = ggml_vld1q_u8_x2(q3); q3 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: implicit declaration of function ‘vld1q_s8_x4’; did you mean ‘vld1q_s64’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4372:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_1 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:4373:48: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes_2 = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q4_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5273:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q4bits = ggml_vld1q_u8_x2(q4); q4 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:5291:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5300:21: error: incompatible types when assigning to type ‘int8x16x2_t {aka struct int8x16x2_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x2(q8); q8 += 32;
^
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q5_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5918:36: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh);
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5926:46: note: in expansion of macro ‘ggml_vld1q_u8_x2’
const ggml_uint8x16x2_t q5bits = ggml_vld1q_u8_x2(q5); q5 += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:5927:46: note: in expansion of macro ‘ggml_vld1q_s8_x4’
const ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c: In function ‘ggml_vec_dot_q6_K_q8_K’:
/home/user/Downloads/llama.cpp/ggml-quants.c:403:27: error: invalid initializer
#define ggml_vld1q_s16_x2 vld1q_s16_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6627:41: note: in expansion of macro ‘ggml_vld1q_s16_x2’
const ggml_int16x8x2_t q8sums = ggml_vld1q_s16_x2(y[i].bsums);
^~~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:6629:43: warning: missing braces around initializer [-Wmissing-braces]
const ggml_int16x8x2_t q6scales = {vmovl_s8(vget_low_s8(scales)), vmovl_s8(vget_high_s8(scales))};
^
{ }
/home/user/Downloads/llama.cpp/ggml-quants.c:404:27: error: invalid initializer
#define ggml_vld1q_u8_x2 vld1q_u8_x2
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6641:40: note: in expansion of macro ‘ggml_vld1q_u8_x2’
ggml_uint8x16x2_t qhbits = ggml_vld1q_u8_x2(qh); qh += 32;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:405:27: error: implicit declaration of function ‘vld1q_u8_x4’; did you mean ‘vld1q_u64’? [-Werror=implicit-function-declaration]
#define ggml_vld1q_u8_x4 vld1q_u8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:405:27: error: invalid initializer
#define ggml_vld1q_u8_x4 vld1q_u8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6642:40: note: in expansion of macro ‘ggml_vld1q_u8_x4’
ggml_uint8x16x4_t q6bits = ggml_vld1q_u8_x4(q6); q6 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:407:27: error: invalid initializer
#define ggml_vld1q_s8_x4 vld1q_s8_x4
^
/home/user/Downloads/llama.cpp/ggml-quants.c:6643:40: note: in expansion of macro ‘ggml_vld1q_s8_x4’
ggml_int8x16x4_t q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^~~~~~~~~~~~~~~~
/home/user/Downloads/llama.cpp/ggml-quants.c:6686:21: error: incompatible types when assigning to type ‘int8x16x4_t {aka struct int8x16x4_t}’ from type ‘int’
q8bytes = ggml_vld1q_s8_x4(q8); q8 += 64;
^
cc1: some warnings being treated as errors
CMakeFiles/ggml.dir/build.make:120: recipe for target 'CMakeFiles/ggml.dir/ggml-quants.c.o' failed
make[2]: *** [CMakeFiles/ggml.dir/ggml-quants.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:823: recipe for target 'CMakeFiles/ggml.dir/all' failed
make[1]: *** [CMakeFiles/ggml.dir/all] Error 2
Makefile:145: recipe for target 'all' failed
make: *** [all] Error 2

@kreier

kreier commented Apr 4, 2025

Hi @seccijr, it looks like your compilation problems are related to the version of gcc you are using. I assume it is 8.4 from March 4, 2020. While this version is rather fast to install from the ppa:ubuntu-toolchain-r/test apt repository, it seems that with this older version the ARM NEON intrinsic vld1q_s8_x4 is treated as a built-in function that cannot be replaced by a macro. There is a fix from ktkachov on 2020-10-13 in one of the 199 bug fixes leading to 8.5 that probably solved it. You have to compile gcc 8.5 yourself and try again.

Version 9.4 is not supported by nvcc 10.2 and shows the error #error -- unsupported GNU version! gcc versions later than 8 are not supported!. The reason is found in line 136 of /usr/local/cuda/targets/aarch64-linux/include/crt/host_config.h:

#if defined (__GNUC__)
#if __GNUC__ > 8
#error -- unsupported GNU version! gcc versions later than 8 are not supported!
#endif /* __GNUC__ > 8 */ 

Taking 3 hours to compile is worth the time; more updates are coming. Soon you will get an even faster llama.cpp with CUDA support, some 20% faster than pure CPU, with a build from today.

@seccijr

seccijr commented Apr 4, 2025

Completely agree. I compiled gcc 8.5 and everything went like a charm until the steps after the CMake configuration. It happened to me that I had to specify the CC and CXX variables because CMake was using the old gcc 7.5 version. After setting those variables before the llama.cpp compilation, all smooth.

user@user-desktop$ export CC=/usr/local/bin/gcc
user@user-desktop$ export CXX=/usr/local/bin/g++
user@user-desktop$ cd llama.cpp
user@user-desktop$ mkdir build && cd build
user@user-desktop$ cmake .. -DLLAMA_CUBLAS=ON
user@user-desktop$ make -j 2

I was able to replicate your steps and I am currently running some models. I plan to use TinyAgent and TinyAgent-ToolRAG. Now I am struggling to convert the ToolRAG model to GGUF: a Python/NumPy/Cython/gguf nightmare on the Jetson Nano.

Thank you for your dedication. It is admirable.

@kreier

kreier commented Apr 5, 2025

It is actually possible to compile a recent llama.cpp version (April 2025) with CUDA support using gcc 8.5. After cloning the repository you have to edit 6 files, create 2 new ones (for bfloat16) and add a few flags to the first call of cmake. Overall this takes less than 5 minutes. Then you can have your Jetson Nano compile a new GPU-accelerated version of llama.cpp with cmake --build build --config Release, but this takes 85 minutes. The compiled version runs Gemma3 and is on average 20% faster than purely on the CPU or using ollama. I created a gist with the steps and instructions.

@kreier

kreier commented Apr 6, 2025

I was trying to replicate this gist here from @FlorSanders to compare the performance with some benchmarks. Similar to other solutions you have to compile gcc 8.5, and that takes time (~3 hours). After that it is short and fast: just one file to edit (change 3 lines in the Makefile), and then instead of two cmake invocations a single make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_62 -j 6. Wait just 7 minutes and you're done!

main and llama-bench are not in a /build/bin/ subfolder, and llama-bench only reports token generation tg128. Being over a year old, the performance with TinyLlama-1.1B-Chat Q4_K_M is only about 2.65 t/s. That performance is expected for a CUDA-compiled llama.cpp that only uses the CPU. It changes when I start offloading even just one layer to the GPU with --n-gpu-layers 1: the GPU is used, and the GPU RAM is filled.

And then it immediately crashes, for any number of GPU layers other than zero. The error message is the same for both main and llama-bench:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA Tegra X1, compute capability 5.3, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_op_flatten at ggml-cuda.cu:9906
  cudaGetLastError()
GGML_ASSERT: ggml-cuda.cu:255: !"CUDA error"
[New LWP 17972]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000007f8cb90d5c in __waitpid (pid=<optimized out>, stat_loc=0x0, options=<optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30      ../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory.
#0  0x0000007f8cb90d5c in __waitpid (pid=<optimized out>, stat_loc=0x0, options=<optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:30
30      in ../sysdeps/unix/sysv/linux/waitpid.c
#1  0x00000000004117fc in ggml_print_backtrace ()
#2  0x00000000004d9c00 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) [clone .constprop.453] ()
#3  0x00000000004f2fe8 in ggml_cuda_op_flatten(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, float const*, float const*, float*, CUstream_st*)) ()
#4  0x00000000004f1198 in ggml_cuda_compute_forward ()
#5  0x00000000004f17d8 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#6  0x00000000004fd974 in ggml_backend_sched_graph_compute ()
#7  0x000000000045e540 in llama_decode_internal(llama_context&, llama_batch) ()
#8  0x000000000045f350 in llama_decode ()
#9  0x0000000000410190 in main ()
[Inferior 1 (process 17971) detached]
Aborted (core dumped)

Can anyone confirm this behaviour? It was already reported by @VVilliams123 in October 2024. And confirmed by @zurvan23 in February 2025. In this case this gist would only describe another way to create a CPU build. And that could be done without any changes with the current llama.cpp source code and gcc 8.5 in 24 minutes. And be much faster. (b5058: pp=7.47 t/s and tg=4.15 t/s while using only 1.1 GB RAM total). Or using ollama, no need for gcc 8.5 - even less time needed to install. And it should probably work with a 2GB Jetson Nano. ollama run --verbose gemma3:1b consumes only 1.6 to 1.8 GB RAM (checked with jtop over ssh in a headless system). Just checked with another "Explain quantum entanglement" and pp=8.01 t/s and tg=4.66 t/s. While supposedly running 100% on GPU and using 1.9 GB VRAM according to ollama ps. Well, jtop disagrees. And re-check with my b5050 CUDA build, llama.cpp has 1.5 GB GPU shared RAM, total 2.3 GB (not good for the 2GB model). Now 3 video recommendations, and pp=17.33 t/s and tg=5.35 t/s. Only +15% to ollama this time. But +29% to the CPU llama.cpp.

@kreier

kreier commented Apr 20, 2025

This gist here actually works! I can't replicate the compilation (as mentioned above), but the provided binaries DO use the GPU and accept given values for --n-gpu-layers. With an increased number of layers it gets faster. Since it's based on an older version (b2275) of llama.cpp, it is slower than a current CPU version or ollama. I did some benchmarking:

benchmark image

More recent builds are faster than pure CPU compilations or ollama. And they support newer models like Gemma3. I exported my gist with some updates to a repository to include more images and benchmarks. And I created a second repository with compiled versions of build 5050 (April 2025) and an installer. It's tested with the latest Ubuntu 18.04.6 LTS image provided by Nvidia with JetPack 4.6.1 (L4T 32.7.1). It can be installed with:

curl -fsSL https://kreier.github.io/llama.cpp-jetson.nano/install.sh | bash && source ~/.bashrc

The installation should take less than a minute. You can try your first LLM with llama-cli -hf ggml-org/gemma-3-1b-it-GGUF --n-gpu-layers 99. For unknown reasons the first start is stuck for 6:30 minutes at main: load the model and apply lora adapter, if any. Any subsequent start takes only 12 seconds.
