Last active
January 10, 2024 16:57
Tentative fix for NVIDIA 470.199.02 driver for Linux 6.6-rc1
From a1879549b0bf049de790c0775c25971c82da8638 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Joan=20Bruguera=20Mic=C3=B3?= <joanbrugueram@gmail.com>
Date: Sat, 15 Jul 2023 22:26:18 +0000
Subject: [PATCH] Tentative fix for NVIDIA 470.199.02 driver for Linux 6.6-rc1

You will also need to apply this patch for Linux 6.5 support:
https://gist.github.com/joanbm/dfe8dc59af1c83e2530a1376b77be8ba
---
nvidia-drm/nvidia-drm-drv.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/nvidia-drm/nvidia-drm-drv.c b/nvidia-drm/nvidia-drm-drv.c
index b93642a..1b310f3 100644
--- a/nvidia-drm/nvidia-drm-drv.c
+++ b/nvidia-drm/nvidia-drm-drv.c
@@ -808,8 +808,12 @@ static struct drm_driver nv_drm_driver = {
.ioctls = nv_drm_ioctls,
.num_ioctls = ARRAY_SIZE(nv_drm_ioctls),
+// Rel. commit "drm/prime: Unexport helpers for fd/handle conversion" (Thomas Zimmermann, 20 Jun 2023)
+// Those functions are no longer exported, but leaving them to NULL is equivalent
+#if LINUX_VERSION_CODE < KERNEL_VERSION(6, 6, 0)
.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
+#endif
.gem_prime_import = nv_drm_gem_prime_import,
.gem_prime_import_sg_table = nv_drm_gem_prime_import_sg_table,
--
2.41.0
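
For readers unfamiliar with the version guard used in the patch above, here is a minimal userspace sketch of how the comparison behaves. It is an illustration only, not part of the patch: `DEMO_KERNEL_VERSION` is a local copy of the encoding used by the kernel's `KERNEL_VERSION` macro (an assumption based on `<linux/version.h>` in recent kernels), and `demo_sets_prime_helpers` is a hypothetical helper standing in for the `#if` condition.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>

/* Local stand-in for KERNEL_VERSION so this compiles outside the kernel tree.
 * Assumption: same encoding as <linux/version.h> in recent kernels
 * (major in bits 16+, minor in bits 8..15, patch level clamped to 255). */
#define DEMO_KERNEL_VERSION(a, b, c) \
    (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))

/* Compile-time check of the encoding: 6.6.0 packs to 0x060600. */
static_assert(DEMO_KERNEL_VERSION(6, 6, 0) == 0x060600,
              "6.6.0 should encode as 0x060600");

/* Hypothetical helper mirroring the patch's guard:
 * returns true when the two .prime_* assignments would be compiled in,
 * i.e. when the kernel being built against is older than 6.6.0. */
bool demo_sets_prime_helpers(int version_code)
{
    return version_code < DEMO_KERNEL_VERSION(6, 6, 0);
}
```

With this encoding, a 6.5.x or 6.1.x version code keeps the two assignments, while 6.6.0 and later leave the fields unset, which the patch comment describes as equivalent to the old behavior.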
The good news and the bad news.
The good news is that, yes, one can indeed compile the stock 470.223.02 driver
on a system running a 6.7.0 kernel.
The bad news is that it won't work.
Nope.
It may pretend to work, and you can do some very trivial operations within CUDA, but
anything of any real value will simply hang in a terrible way:
[63403.837023] BUG: kernel NULL pointer dereference, address: 0000000000000088
[63403.838341] #PF: supervisor read access in kernel mode
[63403.839505] #PF: error_code(0x0000) - not-present page
[63403.840651] PGD 0 P4D 0
[63403.841774] Oops: 0000 [#1] PREEMPT SMP PTI
[63403.842894] CPU: 39 PID: 10250 Comm: mbrot Tainted: P O 6.7.0-genunix #1
See that NULL pointer deref stuff? That only happens with Linux kernels after 6.1.x, and I have
tested the hell out of it.
t# cat /proc/version
Linux version 6.7.0-genunix (root@titan) (gcc (GENUNIX Mon Jan 8 03:58:15 UTC 2024) 13.2.0, GNU ld (GNU Binutils) 2.41) #1 SMP PREEMPT_DYNAMIC Mon Jan 8 05:41:19 GMT 2024
t#
So yes ... you can compile the drivers with 6.7.0. Big whoop.
No, they will not work.
C'est la vie.
I shall now build 6.1.70 and redo a trivial NVIDIA CUDA code test, and it will work.
I can confirm that the drivers work with 6.1.71, and NVIDIA CUDA cranks out solid, clean
IEEE 754 floating point math. No warnings. No NULL deref.
@blastwave If this also happens with recent but non-cutting-edge kernels like 6.2.x to 6.6.x, it may be worth reporting it to NVIDIA with reproduction steps. Not sure how responsive they are with problems with those "old" drivers, but as far as I can tell, those kernel versions are officially supported.
> @blastwave If this also happens with recent but non-cutting-edge kernels like 6.2.x to 6.6.x

Get the 6.7.0 kernel. The issue here is that the NVIDIA devs are doing nasty wrapper calls in
their code.
> it may be worth reporting it to NVIDIA with reproduction steps

How? The NVIDIA folks do not really have a Bugzilla.
> Not sure how responsive they are with problems with those "old" drivers

The real issue is that NVIDIA wants to drop support for all the Kepler hardware that has
the ability to perform FP64 floating point operations at full speed. It is about money.
Of course.
However, nothing will get around the nasty code tricks that NVIDIA devs perform inside
their secret proprietary driver code. That is why we get a NULL pointer deref from their code.
There were some changes in the way things are done in the Linux kernel after 6.1.x, where
people really need to stick to the well-known _syscallX (for X = 0 ... 6) type calls and
nothing else. No digging around in __this_funky_non_API type calls and no weird
wrapper calls. However, the NVIDIA folks seem to just want to do whatever they do, and
we get garbage drivers.
I can give a try with 545.29.06 and see what happens.
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
@canolucas Unless something changes at the last minute (very unlikely) no patches should be required for the latest NVIDIA 470xx to work on Linux 6.7.