@hitorilabs
Last active March 26, 2024 05:21

WARNING: I don't really care about desktop, so this might fuck it up - make sure you can still ssh into it on reboots

If your installation is really messed up, or you've mangled it by mashing commands, you should wipe everything CUDA-related and start over - the docs provide a neat set of steps that will get rid of most, if not all, traces

(https://gist.github.com/hitorilabs/3fed1a6e5dd500edb5ad7568562c064d)

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#removing-cuda-toolkit-and-driver

As an overview, here's a laundry list of things you probably want to check:


I started with installing cuda-toolkit, which seemed to be totally fine - but I noticed that nvidia-smi was reporting a different version from nvcc. My nvidia drivers were installed by default from the Ubuntu installation.
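
To make that mismatch concrete, here is a minimal sketch that pulls the version strings out of both tools. It assumes the usual output formats (a "CUDA Version: X.Y" field in the nvidia-smi header and a "release X.Y" field in nvcc --version); the function names are my own and not part of any NVIDIA tooling. Keep in mind that nvidia-smi reports the highest CUDA version the driver supports, while nvcc reports the installed toolkit, so a difference is a hint to look closer rather than proof that something is broken.

```python
import re
import shutil
import subprocess


def smi_cuda_version(text):
    """Return the 'CUDA Version' field from nvidia-smi output, or None."""
    m = re.search(r"CUDA Version:\s*(\d+\.\d+)", text)
    return m.group(1) if m else None


def nvcc_release(text):
    """Return the toolkit release from `nvcc --version` output, or None."""
    m = re.search(r"release\s+(\d+\.\d+)", text)
    return m.group(1) if m else None


def run(cmd):
    """Run a command and return its stdout, or '' if it is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return ""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def report():
    """Print both versions and flag a difference."""
    driver = smi_cuda_version(run(["nvidia-smi"]))
    toolkit = nvcc_release(run(["nvcc", "--version"]))
    print(f"driver supports CUDA: {driver}, toolkit (nvcc): {toolkit}")
    if driver and toolkit and driver != toolkit:
        print("versions differ - check which packages installed each half")


if __name__ == "__main__":
    report()
```

If either tool is missing, the corresponding value prints as None instead of raising.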

Then I tried to install the rest of the cuda-drivers and it gave me this error:

Building initial module for 6.5.0-10-generic
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-dkms-545.0.crash'
Error! Bad return status for module build on kernel: 6.5.0-10-generic (x86_64)
Consult /var/lib/dkms/nvidia/545.23.08/build/make.log for more information.
dpkg: error processing package nvidia-dkms-545 (--configure):
 installed nvidia-dkms-545 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of cuda-drivers-545:
 cuda-drivers-545 depends on nvidia-dkms-545 (>= 545.23.08); however:
  Package nvidia-dkms-545 is not configured yet.

At the top of /var/crash/nvidia-dkms-545.0.crash, I saw this:

DKMSBuildLog:
 DKMS make.log for nvidia-545.23.08 for kernel 6.5.0-10-generic (x86_64)
 Thu Nov 16 08:18:11 AM EST 2023
 make[1]: Entering directory '/usr/src/linux-headers-6.5.0-10-generic'
 make --no-print-directory -C /usr/src/linux-headers-6.5.0-10-generic \
 -f /usr/src/linux-headers-6.5.0-10-generic/Makefile modules
 warning: the compiler differs from the one used to build the kernel
   The kernel was built by: x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
   You are using:           Ubuntu clang version 17.0.4
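
The warning at the end of that log is the part that matters. As a rough sketch, the two compiler lines can be pulled out of a make.log and compared; the helper names below are hypothetical, and the matching assumes the warning is worded exactly as in the log above.

```python
import re


def compiler_mismatch(log_text):
    """Return (kernel_compiler, current_compiler) if both warning lines are present, else None."""
    built = re.search(r"The kernel was built by:\s*(.+)", log_text)
    using = re.search(r"You are using:\s*(.+)", log_text)
    if not (built and using):
        return None
    return built.group(1).strip(), using.group(1).strip()


def family(compiler):
    """Classify a compiler description as 'gcc', 'clang', or 'unknown'."""
    lowered = compiler.lower()
    if "clang" in lowered:
        return "clang"
    if "gcc" in lowered:
        return "gcc"
    return "unknown"


# The relevant lines from the crash report above.
sample = """\
 warning: the compiler differs from the one used to build the kernel
   The kernel was built by: x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
   You are using:           Ubuntu clang version 17.0.4
"""

kernel_cc, current_cc = compiler_mismatch(sample)
print(family(kernel_cc), "vs", family(current_cc))  # prints: gcc vs clang
```

To run it against a real log, read the file named in the dpkg error (the make.log path) and pass its contents to compiler_mismatch.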

I didn't realize that the build process doesn't always respect CC when it's set as an environment variable. At the system level, my cc symlink pointed to clang, so I fixed it by switching it to gcc with sudo update-alternatives --config cc
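
Before rerunning the install, it can help to confirm what cc actually resolves to once all the symlinks are followed. This is a small sketch using only the Python standard library; the function names are my own, and the clang check is just a guess based on the binary name.

```python
import os
import shutil


def resolve_cc(name="cc"):
    """Follow symlinks for a compiler on PATH; return the final path, or None if not found."""
    path = shutil.which(name)
    return os.path.realpath(path) if path else None


def looks_like_clang(path):
    """Guess from the resolved binary name whether the compiler is clang."""
    return "clang" in os.path.basename(path)


target = resolve_cc()
if target is None:
    print("no cc on PATH")
elif looks_like_clang(target):
    print(f"cc -> {target} (clang); a DKMS build against a gcc-built kernel may fail")
else:
    print(f"cc -> {target}")
```

If this still reports clang after running update-alternatives, something earlier on PATH is probably shadowing the system cc.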


On reflection, I was surprised that cuda-toolkit was totally fine being compiled using clang. This just meant that the other half of my CUDA installation was compiled using gcc - but in nearly all situations they interoperated just fine.
