
@joanbm
Created July 9, 2023 21:09
Tentative fix for NVIDIA 470.182.03 driver for Linux 6.5-rc1
From 0ca9614e5b074d3dd01e95f47b3555f48e74f622 Mon Sep 17 00:00:00 2001
From: Joan Bruguera Micó <joanbrugueram@gmail.com>
Date: Wed, 17 May 2023 21:54:08 +0000
Subject: [PATCH] Tentative fix for NVIDIA 470.182.03 driver for Linux 6.5-rc1
---
common/inc/nv-mm.h | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/common/inc/nv-mm.h b/common/inc/nv-mm.h
index 54f6f60..25333e8 100644
--- a/common/inc/nv-mm.h
+++ b/common/inc/nv-mm.h
@@ -23,6 +23,7 @@
#ifndef __NV_MM_H__
#define __NV_MM_H__
+#include <linux/version.h>
#include "conftest.h"
#if !defined(NV_VM_FAULT_T_IS_PRESENT)
@@ -47,7 +48,27 @@ typedef int vm_fault_t;
*
*/
-#if defined(NV_GET_USER_PAGES_HAS_TASK_STRUCT)
+// Rel. commit. "mm/gup: remove unused vmas parameter from get_user_pages()" (Lorenzo Stoakes, 14 May 2023)
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 5, 0)
+#include <linux/mm.h>
+
+static inline long NV_GET_USER_PAGES(unsigned long start,
+ unsigned long nr_pages,
+ int write,
+ int force,
+ struct page **pages,
+ struct vm_area_struct **vmas)
+{
+ unsigned int flags = 0;
+
+ if (write)
+ flags |= FOLL_WRITE;
+ if (force)
+ flags |= FOLL_FORCE;
+
+ return get_user_pages(start, nr_pages, flags, pages);
+}
+#elif defined(NV_GET_USER_PAGES_HAS_TASK_STRUCT)
#if defined(NV_GET_USER_PAGES_HAS_WRITE_AND_FORCE_ARGS)
#define NV_GET_USER_PAGES(start, nr_pages, write, force, pages, vmas) \
get_user_pages(current, current->mm, start, nr_pages, write, force, pages, vmas)
@@ -130,7 +151,27 @@ typedef int vm_fault_t;
*
*/
-#if defined(NV_GET_USER_PAGES_REMOTE_PRESENT)
+// Rel. commit. "mm/gup: remove unused vmas parameter from get_user_pages_remote()" (Lorenzo Stoakes, 14 May 2023)
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 5, 0)
+static inline long NV_GET_USER_PAGES_REMOTE(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ int write,
+ int force,
+ struct page **pages,
+ struct vm_area_struct **vmas)
+{
+ unsigned int flags = 0;
+
+ if (write)
+ flags |= FOLL_WRITE;
+ if (force)
+ flags |= FOLL_FORCE;
+
+ return get_user_pages_remote(mm, start, nr_pages, flags, pages, NULL);
+}
+#elif defined(NV_GET_USER_PAGES_REMOTE_PRESENT)
#if defined(NV_GET_USER_PAGES_REMOTE_HAS_WRITE_AND_FORCE_ARGS)
#define NV_GET_USER_PAGES_REMOTE get_user_pages_remote
#else
--
2.41.0
@oscarbg

oscarbg commented Jul 10, 2023

Can you also provide equivalent patches for the latest stable NVIDIA driver branches, like 535.xx?
Thanks anyway for your patches.

@joanbm

joanbm commented Jul 10, 2023

Can you also provide equivalent patches for the latest stable NVIDIA driver branches, like 535.xx? Thanks anyway for your patches.

@oscarbg Unfortunately I don't have much free time to look into the other versions, nor any hardware to test them. That said, the adaptations needed for other versions are often quite similar, since they come from common code after all.

@babiulep

Thank you Joan for all your work!

@oscarbg: This patch seems to work for 6.5+535: https://gist.github.com/Fjodor42/cfd29b3ffd1d1957894469f2def8f4f6

@blastwave

Sadly this patch allows for a compile, but then CUDA code fails to run.
In fact the CUDA executable simply hangs and the PID has to be killed,
leaving a "defunct" zombie behind.

The only viable option here is to run Linux 6.4.14, where driver 470.199.02
seems to work out of the box with CUDA 11.8.

@joanbm

joanbm commented Sep 22, 2023

@blastwave Interesting. I am able to run some CUDA code like this example with Linux 6.5 and this patch - though my CUDA version is older and so my card is likely older as well:

[zealcharm@solpc ~]$ uname -a
Linux solpc 6.5.4-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 21 Sep 2023 11:06:39 +0000 x86_64 GNU/Linux
[zealcharm@solpc ~]$ nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 660 (UUID: GPU-8fb786b1-7350-0e01-4d7d-f0dbbafadc17)
[zealcharm@solpc ~]$ /opt/cuda-10.2/bin/nvcc test.cu && ./a.out
Hello World!

Any chance you can try to extract a stack trace of the hung process with gdb? Looking at dmesg may also be useful; that "zombie process" behaviour may point to something having gone wrong in the kernel.

@AntonioTrindade

I just confirmed this patch works for the nvidia-legacy-390xx-driver, version 390.157-2, with Linux 6.5.3 (Debian package linux-6.5.0-1).

@Abdou-St-009

How do I install/patch the driver? Please give me the guide instructions, thanks.

@joanbm

joanbm commented Oct 15, 2023

@Abdou-St-009 Unless you are testing kernel release candidates or using some niche Linux distribution, you should look for instructions on how to install the NVIDIA drivers for your specific distribution like Ubuntu or Fedora. Most of them have some easy way to get them installed and updated.

If you really want to use those patches manually, the general steps to install the driver with the patches are:

$ # Download driver and patches
$ wget "https://us.download.nvidia.com/XFree86/Linux-x86_64/470.199.02/NVIDIA-Linux-x86_64-470.199.02.run"
$ wget "https://gist.githubusercontent.com/joanbm/dfe8dc59af1c83e2530a1376b77be8ba/raw/37ff2b5ccf99f295ff958c9a44ca4ed4f42503b4/nvidia-470xx-fix-linux-6.5.patch"
$ wget "https://gist.githubusercontent.com/joanbm/2ec3c512a1ac21f5f5c6b3c1a4dbef35/raw/2a51c270a9ff4cf8f05966b8313c49e8ce3833d4/nvidia-470xx-fix-linux-6.6.patch"
$ # Extract and patch drivers
$ sh NVIDIA-Linux-x86_64-470.199.02.run --extract-only
$ cd NVIDIA-Linux-x86_64-470.199.02/kernel/
$ patch -Np1 -i ../../nvidia-470xx-fix-linux-6.5.patch 
$ patch -Np1 -i ../../nvidia-470xx-fix-linux-6.6.patch 
$ cd ..
$ # Run the installer
$ sudo ./nvidia-installer 

@Abdou-St-009

@joanbm thank you for the quick reply. I have Kali with kernel 6.5 and an NVIDIA 720M with no driver right now (390xx). @AntonioTrindade said the 470xx patch works with 390xx, can you confirm?
And what do you propose?
The guide you gave, I already tried it and had some errors.
Have a nice day.

@blastwave

I used your patch ... but added in my own "flair", because you and Lorenzo Stoakes are awesome:

t# 
t# diff -u ./kernel/common/inc/nv-mm.h.orig ./kernel/common/inc/nv-mm.h 
--- ./kernel/common/inc/nv-mm.h.orig    2023-05-11 12:07:29.000000000 +0000
+++ ./kernel/common/inc/nv-mm.h 2023-10-18 07:53:46.935521412 +0000
@@ -23,6 +23,7 @@
 #ifndef __NV_MM_H__
 #define __NV_MM_H__
 
+#include <linux/version.h>
 #include "conftest.h"
 
 #if !defined(NV_VM_FAULT_T_IS_PRESENT)
@@ -47,7 +48,30 @@
  *
  */
 
-#if defined(NV_GET_USER_PAGES_HAS_TASK_STRUCT)
+/* Rel. commit. "mm/gup: remove unused vmas parameter from get_user_pages()"
+ *  (Lorenzo Stoakes, 14 May 2023)
+ *
+ *     way to go Lorenzo ! 
+ *
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 5, 0)
+#include <linux/mm.h>
+
+static inline long NV_GET_USER_PAGES(unsigned long start,
+                                     unsigned long nr_pages,
+                                     int write,
+                                     int force,
+                                     struct page **pages,
+                                     struct vm_area_struct **vmas)
+{
+    unsigned int flags = 0;
+
+    if (write) flags |= FOLL_WRITE;
+    if (force) flags |= FOLL_FORCE;
+
+    return get_user_pages(start, nr_pages, flags, pages);
+}
+#elif defined(NV_GET_USER_PAGES_HAS_TASK_STRUCT)
     #if defined(NV_GET_USER_PAGES_HAS_WRITE_AND_FORCE_ARGS)
         #define NV_GET_USER_PAGES(start, nr_pages, write, force, pages, vmas) \
             get_user_pages(current, current->mm, start, nr_pages, write, force, pages, vmas)
@@ -130,7 +154,29 @@
  *
  */
 
-#if defined(NV_GET_USER_PAGES_REMOTE_PRESENT)
+/* Rel. commit. "mm/gup: remove unused vmas parameter from get_user_pages_remote()"
+ *  (Lorenzo Stoakes, 14 May 2023)
+ *
+ *    This guy Lorenzo is awesome ! 
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 5, 0)
+static inline long NV_GET_USER_PAGES_REMOTE(struct task_struct *tsk,
+                                            struct mm_struct *mm,
+                                            unsigned long start,
+                                            unsigned long nr_pages,
+                                            int write,
+                                            int force,
+                                            struct page **pages,
+                                            struct vm_area_struct **vmas)
+{
+    unsigned int flags = 0;
+
+    if (write) flags |= FOLL_WRITE;
+    if (force) flags |= FOLL_FORCE;
+
+    return get_user_pages_remote(mm, start, nr_pages, flags, pages, NULL);
+}
+#elif defined(NV_GET_USER_PAGES_REMOTE_PRESENT)
     #if defined(NV_GET_USER_PAGES_REMOTE_HAS_WRITE_AND_FORCE_ARGS)
         #define NV_GET_USER_PAGES_REMOTE    get_user_pages_remote
     #else
t# 


That seems to work, but I cannot yet confirm the function of CUDA code for computation.

I am running some tests on a Quadro GP100 and will also see what happens with a 
new Ada Lovelace type unit. If the math looks good then we have a good patch here.

--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CUDA/CISC
UNIX and Linux spoken

@blastwave

@blastwave Interesting. I am able to run some CUDA code like ...


I can report that this patch seems to work well on a Linux 6.5.7 system with
dual NVIDIA Quadro GPU units. I performed some numerical computation
load tests where a very fast IBM POWER9 server requires 1939.24 secs
to create the exact same result set. The NVIDIA nvprof profiler seems to
be entirely broken and its PID simply hangs, but that is unrelated to the
driver, and the NVCC CUDA compiler functions.

titan$ $NVCC --version 
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

titan$ nvidia-smi 
Sat Oct 21 06:15:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K6000        Off  | 00000000:03:00.0 Off |                  Off |
| 26%   42C    P0    50W / 225W |      0MiB / 12198MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro GP100        Off  | 00000000:81:00.0 Off |                  Off |
| 35%   48C    P0    29W / 235W |      0MiB / 16278MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

titan$ uname -a 
Linux titan 6.5.7-genunix #1 SMP PREEMPT_DYNAMIC Wed Oct 18 07:04:10 GMT 2023 x86_64 GNU/Linux

titan$ ls -l /lib/modules/6.5.7-genunix/kernel/drivers/video/
total 98572
-rw-r--r-- 1 root root  3864181 Oct 18 08:23 nvidia-drm.ko
-rw-r--r-- 1 root root 52715773 Oct 18 08:23 nvidia.ko
-rw-r--r-- 1 root root  2320349 Oct 18 08:23 nvidia-modeset.ko
-rw-r--r-- 1 root root   326397 Oct 18 08:23 nvidia-peermem.ko
-rw-r--r-- 1 root root 41690589 Oct 18 08:23 nvidia-uvm.ko

-------------------------------------------------------------
        system name = Linux
          node name = titan
            release = 6.5.7-genunix
            version = #1 SMP PREEMPT_DYNAMIC Wed Oct 18 07:04:10 GMT 2023
            machine = x86_64
          page size = 4096
       avail memory = 540904030208
                    = 528226592 kB
                    = 515846 MB
-------------------------------------------------------------
INFO : number of host CPUs:     40
INFO : number of CUDA devices:  2
     :    0: Quadro GP100
     :         17069309952 totalGlobalMem
     :             1442500 clockRate
     :                4096 memoryBusWidth
     :                  56 multiProcessorCount
     :                2048 maxThreadsPerMultiProcessor
     :                3584 cores
     :    1: Quadro K6000
     :         12791250944 totalGlobalMem
     :             901500 clockRate
     :                 384 memoryBusWidth
     :                  15 multiProcessorCount
     :                2048 maxThreadsPerMultiProcessor
     :                2880 cores
WARN : FORCE SELECT the device 0
     :    0: Quadro K6000
     :         12791250944 totalGlobalMem

INFO : Quadro K6000 device is selected

INFO : firing off 1048576 cuda core code chunks
     : mem size of double data array is 8388608 bytes
     : mem size of uint32_t height array is 4194304 bytes

-----------------------------------------------------------
     : 1048576 coord loaded 48169298 nsecs
     : cudaMalloc device_r 209083871 nsecs
     : cudaMalloc device_j 142637 nsecs
     : cudaMalloc device_mval 103888 nsecs
INFO : Copy of real data from host to device done.
     : cudaMemcpy() 1033456nsecs
INFO : Copy of imaginary data from host to device done.
     : cudaMemcpy() 1097713 nsecs
INFO : CUDA kernel launch with 1024 blocks of 1024 threads
INFO : done.
.
.
.

etc etc and the entire output dataset of 1048576 structs made up of a
FP64 complex coordinate and then a uint32 value is all done in seconds.

    Quadro K6000 : 4.54 secs
    Quadro GP100 : 4.47 secs

I can only guess the actual time required with "time -p" but certainly the
old Quadro K6000 and the newer GP100 are working fine.

My flavour of the patch has some flair in it for fun :) (it is the same diff I posted in my previous comment above).

--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken

@blastwave

bad news : 

[ 3602.841202] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 3602.903353] nvidia-uvm: Loaded the UVM driver, major device number 239.
[ 3604.266847] BUG: kernel NULL pointer dereference, address: 0000000000000088
[ 3604.267829] #PF: supervisor read access in kernel mode
[ 3604.268795] #PF: error_code(0x0000) - not-present page
[ 3604.269743] PGD 0 P4D 0 
[ 3604.270660] Oops: 0000 [#1] PREEMPT SMP PTI
[ 3604.271566] CPU: 39 PID: 2240 Comm: mandel_hack Tainted: P           O       6.5.7-genunix #1
[ 3604.272469] Hardware name: LENOVO 30B8S0VQ00/1031, BIOS S02KT73A 05/24/2022
[ 3604.273351] RIP: 0010:map_user_pages+0x133/0x2e0 [nvidia_uvm]
[ 3604.274307] Code: b8 00 00 00 e8 9e 09 e9 d7 4c 39 fd 75 53 45 31 e4 eb 38 0f 1f 44 00 00 8b 40 34 3d 00 00 10 00 0f 8f 6f 01 00 00 4b 8b 04 e6 <48> 8b b8 88 00 00 00 e8 31 ea ff ff 84 c0 0f 85 54 01 00 00 49 83
[ 3604.276083] RSP: 0018:ffffa69c4d017c80 EFLAGS: 00010283
[ 3604.276961] RAX: 0000000000000000 RBX: ffff8d6797169a28 RCX: ffff8d6740551000
[ 3604.277847] RDX: ffffd23a45425a88 RSI: 0000000000000000 RDI: 0000000000000000
[ 3604.278682] RBP: 0000000000000080 R08: 0000000000000000 R09: 0000000000000000
[ 3604.279513] R10: ffff8d66c882d280 R11: 0000000000000000 R12: 0000000000000000
[ 3604.280306] R13: ffff8d6797169a30 R14: ffff8d6740556800 R15: 0000000000000080
[ 3604.281092] FS:  00007fb0d8875000(0000) GS:ffff8da5bfcc0000(0000) knlGS:0000000000000000
[ 3604.281900] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3604.282698] CR2: 0000000000000088 CR3: 000000433012a003 CR4: 00000000003706e0
[ 3604.283517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3604.284335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3604.285163] Call Trace:
[ 3604.285983]  <TASK>
[ 3604.286810]  ? __die+0x1f/0x70
[ 3604.287625]  ? page_fault_oops+0x181/0x4b0
[ 3604.288444]  ? exc_page_fault+0x6c/0x150
[ 3604.289287]  ? asm_exc_page_fault+0x22/0x30
[ 3604.290110]  ? map_user_pages+0x133/0x2e0 [nvidia_uvm]
[ 3604.290935]  uvm_api_tools_init_event_tracker+0x1aa/0x210 [nvidia_uvm]
[ 3604.291773]  uvm_tools_unlocked_ioctl+0x273/0x2a0 [nvidia_uvm]
[ 3604.292613]  uvm_tools_unlocked_ioctl_entry.part.0+0xdc/0x110 [nvidia_uvm]
[ 3604.293452]  ? kmem_cache_free+0x24a/0x370
[ 3604.294244]  __x64_sys_ioctl+0x9b/0xe0
[ 3604.295039]  do_syscall_64+0x5c/0xc0
[ 3604.295817]  ? do_syscall_64+0x68/0xc0
[ 3604.296576]  ? __count_memcg_events+0x41/0xa0
[ 3604.297349]  ? handle_mm_fault+0xb1/0x360
[ 3604.298135]  ? do_user_addr_fault+0x170/0x5a0
[ 3604.298901]  ? exit_to_user_mode_prepare+0x37/0x1a0
[ 3604.299672]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3604.300458] RIP: 0033:0x7fb0d896b237
[ 3604.301210] Code: 00 00 00 48 8b 05 59 cc 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 29 cc 0d 00 f7 d8 64 89 01 48
[ 3604.302701] RSP: 002b:00007ffc1e947b78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3604.303469] RAX: ffffffffffffffda RBX: 000055789c03f1c8 RCX: 00007fb0d896b237
[ 3604.304240] RDX: 00007ffc1e947b80 RSI: 0000000000000038 RDI: 000000000000001d
[ 3604.305014] RBP: 000055789c040000 R08: 0000000000000000 R09: 0000000000000000
[ 3604.305797] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000001d
[ 3604.306578] R13: 00007ffc1e947b80 R14: 00007ffc1e947c60 R15: 000055789c03f1e8
[ 3604.307362]  </TASK>
[ 3604.308138] Modules linked in: nvidia_uvm(PO) nfsd auth_rpcgss ib_iser rdma_cm configfs iw_cm ib_cm ib_core nls_ascii nls_cp437 vfat fat iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rtsx_usb_sdmmc mmc_core rtsx_usb_ms memstick rtsx_usb nvidia_drm(PO) nvidia_modeset(PO) intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common sb_edac x86_pkg_temp_thermal nvidia(PO) intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd ixgbe video ehci_pci iTCO_wdt xhci_pci rapl sg mei_wdt mdio_devres xhci_hcd drm_kms_helper e1000e igb libphy sd_mod intel_pmc_bxt ehci_hcd mei_me intel_cstate iTCO_vendor_support drm watchdog usbcore wmi_bmof intel_wmi_thunderbolt mxm_wmi intel_uncore ptp mei efi_pstore pcspkr ahci i2c_algo_bit i2c_i801 mdio pps_core usb_common libahci dca lpc_ich i2c_smbus intel_pch_thermal wmi button nvme nvme_core t10_pi crc32c_intel evdev crc64_rocksoft crc64 crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common
[ 3604.314072] CR2: 0000000000000088
[ 3604.314925] ---[ end trace 0000000000000000 ]---
[ 3607.614591] RIP: 0010:map_user_pages+0x133/0x2e0 [nvidia_uvm]
[ 3607.615537] Code: b8 00 00 00 e8 9e 09 e9 d7 4c 39 fd 75 53 45 31 e4 eb 38 0f 1f 44 00 00 8b 40 34 3d 00 00 10 00 0f 8f 6f 01 00 00 4b 8b 04 e6 <48> 8b b8 88 00 00 00 e8 31 ea ff ff 84 c0 0f 85 54 01 00 00 49 83
[ 3607.617196] RSP: 0018:ffffa69c4d017c80 EFLAGS: 00010283
[ 3607.618015] RAX: 0000000000000000 RBX: ffff8d6797169a28 RCX: ffff8d6740551000
[ 3607.618828] RDX: ffffd23a45425a88 RSI: 0000000000000000 RDI: 0000000000000000
[ 3607.619645] RBP: 0000000000000080 R08: 0000000000000000 R09: 0000000000000000
[ 3607.620457] R10: ffff8d66c882d280 R11: 0000000000000000 R12: 0000000000000000
[ 3607.621262] R13: ffff8d6797169a30 R14: ffff8d6740556800 R15: 0000000000000080
[ 3607.622071] FS:  00007fb0d8875000(0000) GS:ffff8da5bfcc0000(0000) knlGS:0000000000000000
[ 3607.622882] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3607.623694] CR2: 0000000000000088 CR3: 000000433012a003 CR4: 00000000003706e0
[ 3607.624505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3607.625294] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3607.626060] note: mandel_hack[2240] exited with irqs disabled

At great risk of saying "oops" ... well something really looks wrong here.

@blastwave

I simply built 6.1.59 and used the stock, off-the-shelf NVIDIA drivers, and it all works ...

This includes the NVIDIA profiler tool.

titan$ cat /proc/version 
Linux version 6.1.59-genunix (root@titan) (gcc (GENUNIX Thu Aug 31 14:20:03 UTC 2023) 13.2.0, GNU ld (GNU Binutils) 2.40) #1 SMP PREEMPT_DYNAMIC Sat Oct 21 07:28:06 GMT 2023
titan$ 

titan$ 
titan$ nvidia-smi
Sat Oct 21 08:12:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K6000        Off  | 00000000:03:00.0 Off |                  Off |
| 26%   41C    P0    46W / 225W |      0MiB / 12198MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro GP100        Off  | 00000000:81:00.0 Off |                  Off |
| 31%   45C    P0    30W / 235W |      0MiB / 16278MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
titan$ 

Sadly the profiler still has a fit if I go with an older Quadro K6000 : 

[ 1114.353313] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 1114.405537] nvidia-uvm: Loaded the UVM driver, major device number 239.
[ 2072.317465] mandel_hack[11219]: segfault at 30 ip 00007f844e9bc760 sp 00007ffe32414268 error 4 in libpthread-2.31.so[7f844e9b8000+10000] likely on CPU 39 (core 12, socket 1)
[ 2072.319507] Code: ff ff 48 8d 0d 31 c0 00 00 ba a7 01 00 00 48 8d 35 af be 00 00 48 8d 3d de bd 00 00 e8 69 ba ff ff 66 0f 1f 84 00 00 00 00 00 <8b> 47 10 89 c2 81 e2 7f 01 00 00 83 e0 7c 0f 85 7c 00 00 00 53 48
t# 

Guess that is why NVidia calls the stuff deprecated. 


So I think there is far more going on here. 

@joanbm

joanbm commented Oct 22, 2023

@joanbm thank you for the quick reply. I have Kali with kernel 6.5 and an NVIDIA 720M with no driver right now (390xx). @AntonioTrindade said the 470xx patch works with 390xx, can you confirm? And what do you propose? The guide you gave, I already tried it and had some errors. Have a nice day.

@Abdou-St-009 Honestly, you're probably going to have a much easier time if you either switch to the Nouveau open source driver, or switch to a more stable distribution (like Ubuntu 22.04 or Debian Bullseye) which provides official packages for the 390xx driver branch. Using the somewhat-abandoned 390xx driver with a rolling release distribution is most likely going to be a PITA.

If you want to try to do it anyway, there's an AUR package for Arch for nvidia-390xx (https://aur.archlinux.org/packages/nvidia-390xx-dkms) which has various patches to make the driver work up to Linux 6.5, see https://aur.archlinux.org/cgit/aur.git/tree/?h=nvidia-390xx-utils. But as I said, it's probably going to be painful to get and keep such a setup working.

@joanbm

joanbm commented Oct 22, 2023

@blastwave Yeah, looks like somehow you ran into a NULL pointer dereference somewhere in the UVM code... not sure what this is about, since the driver doesn't need any changes to the UVM code to build on Linux 6.5, but it's still possible that something has changed in Linux that indirectly breaks the UVM part of the driver nonetheless.

I ran some simple CUDA sample using UVM on my card and it worked fine, so it's probably something related to your specific setup & application. So if you want to try to look into it, the first step would be to try to find a minimal reproducer (simple small application, single card if possible), though there's no guarantee at all that we can figure out what the problem is since most of the drivers are closed source.

Though TBH, if you want to do scientific computing / advanced CUDA use, it's probably better to stay on the kernel versions officially supported by the NVIDIA driver, as (hopefully) NVIDIA has tested and validated their driver with those, while these patches are provided on a "best effort" basis and have no real testing beyond checking that very basic desktop / gaming / CUDA use cases work.

@Abdou-St-009

@joanbm thank you for the quick reply. I have Kali with kernel 6.5 and an NVIDIA 720M with no driver right now (390xx). @AntonioTrindade said the 470xx patch works with 390xx, can you confirm? And what do you propose? The guide you gave, I already tried it and had some errors. Have a nice day.

@Abdou-St-009 Honestly, you're probably going to have a much easier time if you either switch to the Nouveau open source driver, or switch to a more stable distribution (like Ubuntu 22.04 or Debian Bullseye) which provides official packages for the 390xx driver branch. Using the somewhat-abandoned 390xx driver with a rolling release distribution is most likely going to be a PITA.

If you want to try to do it anyway, there's an AUR package for Arch for nvidia-390xx (https://aur.archlinux.org/packages/nvidia-390xx-dkms) which has various patches to make the driver work up to Linux 6.5, see https://aur.archlinux.org/cgit/aur.git/tree/?h=nvidia-390xx-utils. But as I said, it's probably going to be painful to get and keep such a setup working.

I am grateful and will take your advice into consideration.

@joanbm

joanbm commented Oct 31, 2023

@blastwave FWIW there's a new driver version in the 470xx branch that builds up to Linux 6.6 without third-party patches, you may want to try it: https://www.nvidia.com/download/driverResults.aspx/215840/en-us/

@AntonioTrindade

@joanbm thank you for the quick reply. I have Kali with kernel 6.5 and an NVIDIA 720M with no driver right now (390xx). @AntonioTrindade said the 470xx patch works with 390xx, can you confirm? And what do you propose? The guide you gave, I already tried it and had some errors. Have a nice day.

Honestly I do not remember if I used the 470xx patch as is or not. I patched the files manually and created a patch file for myself and have had no problems since I reported the compilation problem, even after several kernel upgrades.

@Abdou-St-009

Abdou-St-009 commented Nov 24, 2023

@joanbm thank you for the quick reply. I have Kali with kernel 6.5 and an NVIDIA 720M with no driver right now (390xx). @AntonioTrindade said the 470xx patch works with 390xx, can you confirm? And what do you propose? The guide you gave, I already tried it and had some errors. Have a nice day.

Honestly I do not remember if I used the 470xx patch as is or not. I patched the files manually and created a patch file for myself and have had no problems since I reported the compilation problem, even after several kernel upgrades.

Hey @AntonioTrindade
Can you please help me find the 390.x driver for my Debian? I have an old NVIDIA GT 720M GPU, my kernel is 6.5, there is no official driver for this kernel on the NVIDIA site yet, and I got very tired looking for it.
Have a good day

@AntonioTrindade

@Abdou-St-009, if you are using Debian Sid, just install nvidia-legacy-390xx-driver.
If you are using Debian stable, point your /etc/apt/sources.list at Sid, run "apt update; apt install nvidia-legacy-390xx-driver", then change /etc/apt/sources.list back to your Debian version and rerun "apt update".
WARNING: after pointing /etc/apt/sources.list at Sid, DO NOT RUN "apt upgrade"!
