@joanbm
Last active January 10, 2024 16:57
Tentative fix for NVIDIA 470.199.02 driver for Linux 6.6-rc1
From a1879549b0bf049de790c0775c25971c82da8638 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Joan=20Bruguera=20Mic=C3=B3?= <joanbrugueram@gmail.com>
Date: Sat, 15 Jul 2023 22:26:18 +0000
Subject: [PATCH] Tentative fix for NVIDIA 470.199.02 driver for Linux 6.6-rc1
You will also need to apply this patch for Linux 6.5 support:
https://gist.github.com/joanbm/dfe8dc59af1c83e2530a1376b77be8ba
---
nvidia-drm/nvidia-drm-drv.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/nvidia-drm/nvidia-drm-drv.c b/nvidia-drm/nvidia-drm-drv.c
index b93642a..1b310f3 100644
--- a/nvidia-drm/nvidia-drm-drv.c
+++ b/nvidia-drm/nvidia-drm-drv.c
@@ -808,8 +808,12 @@ static struct drm_driver nv_drm_driver = {
     .ioctls                 = nv_drm_ioctls,
     .num_ioctls             = ARRAY_SIZE(nv_drm_ioctls),
 
+// Rel. commit "drm/prime: Unexport helpers for fd/handle conversion" (Thomas Zimmermann, 20 Jun 2023)
+// Those functions are no longer exported, but leaving them to NULL is equivalent
+#if LINUX_VERSION_CODE < KERNEL_VERSION(6, 6, 0)
     .prime_handle_to_fd     = drm_gem_prime_handle_to_fd,
     .prime_fd_to_handle     = drm_gem_prime_fd_to_handle,
+#endif
     .gem_prime_import       = nv_drm_gem_prime_import,
     .gem_prime_import_sg_table = nv_drm_gem_prime_import_sg_table,
 
--
2.41.0
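
The guard in the patch relies on two things: how `KERNEL_VERSION` encodes a version number, and the fact that members left out of a designated initializer are zero-initialized (so the callbacks end up NULL). The following is a minimal userspace sketch of both, not driver code. `struct demo_driver`, `make_driver` and the two `demo_*` callbacks are hypothetical stand-ins, and the macro is a local copy of the kernel's definition so that it compiles outside a kernel tree.

```c
#include <stddef.h>

/* Userspace copy of the kernel's version-encoding macro, reproduced here
 * so the arithmetic can be tried outside a kernel tree. */
#define KERNEL_VERSION(a, b, c) \
    (((a) << 16) + ((b) << 8) + ((c) > 255 ? 255 : (c)))

/* Hypothetical stand-in for struct drm_driver: two optional callbacks. */
struct demo_driver {
    int (*prime_handle_to_fd)(void);
    int (*prime_fd_to_handle)(void);
    int num_ioctls;
};

static int demo_handle_to_fd(void) { return 0; }
static int demo_fd_to_handle(void) { return 0; }

/* Fill the struct the way the patch does for a given version code.
 * The real driver decides with #if at compile time; a runtime branch is
 * used here only so that both outcomes can be checked in one program. */
struct demo_driver make_driver(unsigned int version_code)
{
    /* Members not named in the initializer are zeroed, i.e. NULL. */
    struct demo_driver d = { .num_ioctls = 3 };

    if (version_code < KERNEL_VERSION(6, 6, 0)) {
        d.prime_handle_to_fd = demo_handle_to_fd;
        d.prime_fd_to_handle = demo_fd_to_handle;
    }
    return d;
}
```

Release candidates such as 6.6.0-rc2 already report version code 6.6.0, so the guarded assignments are skipped there as well.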
@blastwave

Giving this a try now:

t# 
t# uname -a 
Linux titan 6.6.0-rc2-genunix #1 SMP Fri Sep 22 07:45:23 GMT 2023 x86_64 GNU/Linux
t# 
t# cat /proc/version  
Linux version 6.6.0-rc2-genunix (root@titan) (gcc (GENUNIX Thu Aug 31 14:20:03 UTC 2023) 13.2.0, GNU ld (GNU Binutils) 2.40) #1 SMP Fri Sep 22 07:45:23 GMT 2023
t# 

t# 
t# diff -u kernel/nvidia-drm/nvidia-drm-drv.c.orig kernel/nvidia-drm/nvidia-drm-drv.c
--- kernel/nvidia-drm/nvidia-drm-drv.c.orig     2023-05-11 11:45:05.000000000 +0000
+++ kernel/nvidia-drm/nvidia-drm-drv.c  2023-09-22 08:36:11.719353469 +0000
@@ -807,8 +807,15 @@
     .ioctls                 = nv_drm_ioctls,
     .num_ioctls             = ARRAY_SIZE(nv_drm_ioctls),
 
+/* Rel. commit "drm/prime:
+ * Unexport helpers for fd/handle conversion" (Thomas Zimmermann, 20 Jun 2023)
+ * Those functions are no longer exported, but leaving them to NULL is equivalent
+ */
+#if LINUX_VERSION_CODE < KERNEL_VERSION(6, 6, 0)
     .prime_handle_to_fd     = drm_gem_prime_handle_to_fd,
     .prime_fd_to_handle     = drm_gem_prime_fd_to_handle,
+#endif
+
     .gem_prime_import       = nv_drm_gem_prime_import,
     .gem_prime_import_sg_table = nv_drm_gem_prime_import_sg_table,
 
t# 
t# 

I like comments to stay valid all the way back to C89 or so. That explains my strange need
to turn that comment into a /* */ block comment instead of the C++ style // double slash.

Well it all goes to hell ... fast :  


     CC [M]  /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/nvidia/nv_uvm_interface.o
     CC [M]  /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/nvidia/nvlink_linux.o
     CC [M]  /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/nvidia/nvlink_caps.o
   In file included from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-linux.h:21,
                    from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/nvidia/nv.c:13:
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES_REMOTE':
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:45: error: passing argument 1 of 'get_user_pages_remote' from incompatible pointer type [-Werror=incompatible-pointer-types]
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                                             ^~~
         |                                             |
         |                                             struct task_struct *
   In file included from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-pgprot.h:17,
                    from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-linux.h:20:
   ./include/linux/mm.h:2419:46: note: expected 'struct mm_struct *' but argument is of type 'struct task_struct *'
    2419 | long get_user_pages_remote(struct mm_struct *mm,
         |                            ~~~~~~~~~~~~~~~~~~^~
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:50: warning: passing argument 2 of 'get_user_pages_remote' makes integer from pointer without a cast [-Wint-conversion]
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                                                  ^~
         |                                                  |
         |                                                  struct mm_struct *
   ./include/linux/mm.h:2420:42: note: expected 'long unsigned int' but argument is of type 'struct mm_struct *'
    2420 |                            unsigned long start, unsigned long nr_pages,
         |                            ~~~~~~~~~~~~~~^~~~~
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:71: warning: passing argument 5 of 'get_user_pages_remote' makes pointer from integer without a cast [-Wint-conversion]
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                                                                       ^~~~~
         |                                                                       |
         |                                                                       unsigned int
   ./include/linux/mm.h:2421:66: note: expected 'struct page **' but argument is of type 'unsigned int'
    2421 |                            unsigned int gup_flags, struct page **pages,
         |                                                    ~~~~~~~~~~~~~~^~~~~
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:165:45: error: passing argument 6 of 'get_user_pages_remote' from incompatible pointer type [-Werror=incompatible-pointer-types]
     165 |                                             pages, vmas);
         |                                             ^~~~~
         |                                             |
         |                                             struct page **
   ./include/linux/mm.h:2422:33: note: expected 'int *' but argument is of type 'struct page **'
    2422 |                            int *locked);
         |                            ~~~~~^~~~~~
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:23: error: too many arguments to function 'get_user_pages_remote'
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                       ^~~~~~~~~~~~~~~~~~~~~
   ./include/linux/mm.h:2419:6: note: declared here
    2419 | long get_user_pages_remote(struct mm_struct *mm,
         |      ^~~~~~~~~~~~~~~~~~~~~
   In file included from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-linux.h:21,
                    from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/nvidia/nv-cray.c:14:
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES_REMOTE':
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:45: error: passing argument 1 of 'get_user_pages_remote' from incompatible pointer type [-Werror=incompatible-pointer-types]
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                                             ^~~
         |                                             |
         |                                             struct task_struct *
   In file included from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-pgprot.h:17,
                    from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-linux.h:20:
   ./include/linux/mm.h:2419:46: note: expected 'struct mm_struct *' but argument is of type 'struct task_struct *'
    2419 | long get_user_pages_remote(struct mm_struct *mm,
         |                            ~~~~~~~~~~~~~~~~~~^~
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:50: warning: passing argument 2 of 'get_user_pages_remote' makes integer from pointer without a cast [-Wint-conversion]
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                                                  ^~
         |                                                  |
         |                                                  struct mm_struct *
   ./include/linux/mm.h:2420:42: note: expected 'long unsigned int' but argument is of type 'struct mm_struct *'
    2420 |                            unsigned long start, unsigned long nr_pages,
         |                            ~~~~~~~~~~~~~~^~~~~
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:71: warning: passing argument 5 of 'get_user_pages_remote' makes pointer from integer without a cast [-Wint-conversion]
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                                                                       ^~~~~
         |                                                                       |
         |                                                                       unsigned int
   ./include/linux/mm.h:2421:66: note: expected 'struct page **' but argument is of type 'unsigned int'
    2421 |                            unsigned int gup_flags, struct page **pages,
         |                                                    ~~~~~~~~~~~~~~^~~~~
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:165:45: error: passing argument 6 of 'get_user_pages_remote' from incompatible pointer type [-Werror=incompatible-pointer-types]
     165 |                                             pages, vmas);
         |                                             ^~~~~
         |                                             |
         |                                             struct page **
   ./include/linux/mm.h:2422:33: note: expected 'int *' but argument is of type 'struct page **'
    2422 |                            int *locked);
         |                            ~~~~~^~~~~~
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:23: error: too many arguments to function 'get_user_pages_remote'
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                       ^~~~~~~~~~~~~~~~~~~~~
   ./include/linux/mm.h:2419:6: note: declared here
    2419 | long get_user_pages_remote(struct mm_struct *mm,
         |      ^~~~~~~~~~~~~~~~~~~~~
   In file included from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-linux.h:21,
                    from /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/nvidia/nv-pat.c:14:
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h: In function 'NV_GET_USER_PAGES_REMOTE':
   /usr/local/build/nvidia/linux-6.6.0-rc2-genunix/NVIDIA-Linux-x86_64-470.199.02/kernel/common/inc/nv-mm.h:164:45: error: passing argument 1 of 'get_user_pages_remote' from incompatible pointer type [-Werror=incompatible-pointer-types]
     164 |                return get_user_pages_remote(tsk, mm, start, nr_pages, flags,
         |                                             ^~~


etc etc 


I think we have bigger problems here.

 
--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
Greybeard and Suspenders Installed

@joanbm
Author

joanbm commented Sep 22, 2023

Hi @blastwave. For Linux 6.6, in addition to this patch, you also need to apply the patch for Linux 6.5 (https://gist.github.com/joanbm/dfe8dc59af1c83e2530a1376b77be8ba) at the same time. Unfortunately the NVIDIA 470xx drivers only officially support up to Linux 6.4, which is why both are necessary. Sorry for not making this clear.
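
To illustrate why a second patch is needed: the compiler notes in the build log above show that `get_user_pages_remote` in this kernel takes six parameters, while the driver still calls it with seven, starting with a task pointer and ending with `vmas`. The sketch below is not the linked patch. It is a userspace mock of the argument mapping only, with opaque stand-in types, a placeholder function body, and a hypothetical adapter name (`demo_get_user_pages_remote`). Passing NULL for `locked` is an assumption made for the sketch.

```c
#include <stddef.h>

/* Opaque stand-ins for kernel types; only pointers are passed around. */
struct task_struct;
struct mm_struct;
struct page;
struct vm_area_struct;

/* Mock with the six-parameter prototype quoted in the compiler notes
 * (include/linux/mm.h in 6.6.0-rc2). The body is a placeholder. */
static long get_user_pages_remote(struct mm_struct *mm,
                                  unsigned long start, unsigned long nr_pages,
                                  unsigned int gup_flags, struct page **pages,
                                  int *locked)
{
    (void)mm; (void)start; (void)gup_flags; (void)pages; (void)locked;
    return (long)nr_pages;  /* pretend every requested page was pinned */
}

/* Hypothetical adapter: keeps the driver's older seven-argument call shape
 * and forwards only what the newer prototype accepts. */
long demo_get_user_pages_remote(struct task_struct *tsk,
                                struct mm_struct *mm,
                                unsigned long start, unsigned long nr_pages,
                                unsigned int flags, struct page **pages,
                                struct vm_area_struct **vmas)
{
    (void)tsk;   /* the newer prototype has no task parameter */
    (void)vmas;  /* nor a vmas parameter; the array is left untouched */
    return get_user_pages_remote(mm, start, nr_pages, flags, pages, NULL);
}
```

A caller that actually depends on `vmas` being filled in would need to look the mappings up separately, which this sketch does not attempt.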

@blastwave

Thank you for the clear directions Sir.

I will give this a try with Linux 6.5.7 later today and see how things go.
 
--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken

@Augusto7743

Just reporting my results.
I am using Ubuntu 20.04.6 with Linux kernel 6.6.5 and the NVIDIA binary driver
NVIDIA-Linux-x86_64-470.223.02.
The driver was originally installed under kernel 6.2.16.
Afterwards the kernel was updated, and the NVIDIA driver was updated through DKMS as well.
The driver works, but I have not tested whether it can be installed on a fresh OS.

@canolucas

@joanbm
Kernel 6.7 is getting released next weekend. Do you know if any patching will be necessary for the upcoming release to work?

@joanbm
Author

joanbm commented Jan 3, 2024

@canolucas Unless something changes at the last minute (very unlikely) no patches should be required for the latest NVIDIA 470xx to work on Linux 6.7.

@blastwave

The good news and the bad news. 

The good news is that, yes, indeed one may compile the stock 470.223.02 driver
on a system running the 6.7.0 kernel.

The bad news is that it won't work.

Nope.

It may pretend to work and you can do some very trivial operations within CUDA, but
anything of any real value will simply hang in a terrible way:

[63403.837023] BUG: kernel NULL pointer dereference, address: 0000000000000088
[63403.838341] #PF: supervisor read access in kernel mode
[63403.839505] #PF: error_code(0x0000) - not-present page
[63403.840651] PGD 0 P4D 0 
[63403.841774] Oops: 0000 [#1] PREEMPT SMP PTI
[63403.842894] CPU: 39 PID: 10250 Comm: mbrot Tainted: P           O       6.7.0-genunix #1

See that NULL pointer deref stuff?  That only happens with Linux kernels after 6.1.x and I have
tested the hell out of it.
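
As a reading aid for the oops above: a small non-zero fault address such as 0x88 is what one would typically see when code reads a struct member through a NULL pointer, because the address accessed is the base (zero) plus the member's offset. The log does not say which structure is involved, so `struct demo_object` and its fields below are purely hypothetical. The sketch only demonstrates the address arithmetic and never dereferences an invalid pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 0x88 bytes of other members, then the field read. */
struct demo_object {
    unsigned char other_members[0x88];
    uint64_t field_being_read;
};

/* Address the CPU would try to access for base->field_being_read,
 * computed arithmetically so that nothing is actually dereferenced. */
uintptr_t faulting_address(uintptr_t base)
{
    return base + offsetof(struct demo_object, field_being_read);
}
```

With a NULL base, `faulting_address(0)` evaluates to 0x88, matching the shape of the address in the first line of the oops.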

t# cat /proc/version 
Linux version 6.7.0-genunix (root@titan) (gcc (GENUNIX Mon Jan  8 03:58:15 UTC 2024) 13.2.0, GNU ld (GNU Binutils) 2.41) #1 SMP PREEMPT_DYNAMIC Mon Jan  8 05:41:19 GMT 2024
t#
 
So yes ... you can compile the drivers with 6.7.0.  Big whoop.

No they will not work.

C'est la vie. 

I shall now build 6.1.70 and re-do a trivial NVidia CUDA code test and it will work.

@blastwave

I can confirm that the drivers work with 6.1.71 and NVidia CUDA cranks out solid IEEE754
clean floating point math. No warnings. No NULL deref.

@joanbm
Author

joanbm commented Jan 9, 2024

@blastwave If this also happens with recent but non-cutting-edge kernels like 6.2.x to 6.6.x, it may be worth reporting it to NVIDIA with reproduction steps. Not sure how responsive they are with problems with those "old" drivers but as far as I can tell, those kernel versions are officially supported.

@blastwave

> @blastwave If this also happens with recent but non-cutting-edge kernels like 6.2.x to 6.6.x

get the 6.7.0 kernel. The issue here is that the NVidia devs are doing nasty wrapper calls in
their code.

> it may be worth reporting it to NVIDIA with reproduction steps

How? The NVidia folks do not really have a bugzilla.

> Not sure how responsive they are with problems with those "old" drivers

The real issue is that NVidia wants to drop support on all the Kepler hardware that has
the ability to perform FP64 floating point operations at full speed. It is about money. 
Of course.

However nothing will get around the nasty code tricks that NVidia devs perform inside
their secret proprietary driver code. That is why we get a NULL pointer deref from code.
There were some changes in the way things are done in the Linux kernel after 6.1.x where
people really need to stick to the well known _syscallX ( for X = 0 ... 6 ) type calls and
nothing else. No digging around into __this_funky_non_API type call and no weird
wrapper calls. However the NVidia folks seem to just want to do whatever they do and
we get garbage drivers.

I can give a try with 545.29.06 and see what happens.

Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken
