@AlexRuiz7
Last active February 7, 2024 22:36
My Odyssey installing `nvidia-driver-535` in Ubuntu 22.04

-- Context --

  • Laptop: MSI GL66 Pulse with Nvidia RTX 3060
  • System: 6.5.0-15-generic 22.04.1-Ubuntu x86_64 GNU/Linux
  • Previous drivers: xserver-xorg-video-nouveau
  • Cause: External monitor not detected !link

Ubuntu didn't recognize my external monitor, although it was correctly connected to the laptop with an HDMI cable.

I had installed Nvidia drivers before on this same operating system and laptop, and it never ended well, so I always had to revert. Every time, I had installed the driver provided by Ubuntu. Installing a driver provided directly by the vendor was a different approach, so I went for it.

The installer warned me that the drivers provided by Ubuntu could be more stable, as they were tested by Ubuntu maintainers, and recommended installing them that way. I aborted the installation and installed those instead. The installation was apparently successful, so I rebooted the system to double-check, but it didn't boot back up. That didn't surprise me, as I had hit this problem in the past installing this same driver this same way.

-- Reverting it back --

The boot process got stuck on a black screen with a blinking cursor. The first thing I did was check the journal to look for whatever was preventing the boot from continuing.

sudo journalctl --reverse --since=today --grep=error

I quickly noticed a yellow block saying that gdm had a few fatal errors. Nice, ain't it?

/usr/libexec/gdm-x-session[2117]: (EE) Server terminated with error (1). Closing log file.
/usr/libexec/gdm-x-session[2117]: Fatal server error:
/usr/libexec/gdm-x-session[2117]:         (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
/usr/libexec/gdm-x-session[2057]: (EE) Server terminated with error (1). Closing log file.
/usr/libexec/gdm-x-session[2057]: Fatal server error:
/usr/libexec/gdm-x-session[2057]:         (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
/usr/libexec/gdm-x-session[1997]: (EE) Server terminated with error (1). Closing log file.
/usr/libexec/gdm-x-session[1997]: Fatal server error:
/usr/libexec/gdm-x-session[1997]:         (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
/usr/libexec/gdm-x-session[1937]: (EE) Server terminated with error (1). Closing log file.
/usr/libexec/gdm-x-session[1937]: Fatal server error:
/usr/libexec/gdm-x-session[1937]:         (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
/usr/libexec/gdm-x-session[1877]: (EE) Server terminated with error (1). Closing log file.
/usr/libexec/gdm-x-session[1877]: Fatal server error:
/usr/libexec/gdm-x-session[1877]:         (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
/usr/libexec/gdm-x-session[1699]: (EE) Server terminated with error (1). Closing log file.
/usr/libexec/gdm-x-session[1699]: Fatal server error:
/usr/libexec/gdm-x-session[1699]:         (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
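The excerpt above repeats the same three lines under six different PIDs, which presumably means the X session was started and crashed several times in a row. A minimal sketch for counting those attempts from journal text; the function name `count_gdm_sessions` is made up for illustration, and the `journalctl` flags in the usage comment (`-b -1` for the previous boot, `-p err` for error priority) are my own choice, not what I ran at the time.

```shell
# Count the distinct gdm-x-session PIDs found in journal text read from stdin.
# count_gdm_sessions is a hypothetical helper name, not an existing tool.
count_gdm_sessions() {
    grep -o 'gdm-x-session\[[0-9]*]' | sort -u | wc -l | tr -d ' '
}

# Usage on a real system (previous boot, errors only):
#   sudo journalctl -b -1 -p err --no-pager | count_gdm_sessions
```

A count well above one would suggest a restart loop rather than a single failure.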

Googling gdm fatal server error took me to an Arch Linux forum that included instructions to enable DRM kernel mode setting (whatever that is). I ran cat /sys/module/nvidia_drm/parameters/modeset and verified that it was enabled (Y). Afterwards, it asked for an initramfs update and a reboot for the changes to take effect. !link

sudo update-initramfs -u
reboot
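The modeset check can be wrapped so that it also covers the case where the nvidia_drm module is not loaded at all and the file does not exist. This is only a sketch: `drm_modeset_status` is a hypothetical helper name, and the optional path argument exists purely so the function can be tried against a test file.

```shell
# Report the nvidia_drm modeset state as a word instead of Y/N.
# drm_modeset_status is a hypothetical helper; the path argument is optional.
drm_modeset_status() {
    param_file=${1:-/sys/module/nvidia_drm/parameters/modeset}
    if [ ! -r "$param_file" ]; then
        echo "nvidia_drm not loaded"
        return 1
    fi
    case "$(cat "$param_file")" in
        Y) echo "enabled" ;;
        N) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}

# Usage:
#   drm_modeset_status
```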

The problem persisted, so I continued searching and found a thread in the Nvidia developer forums with a title that summed up my problem quite well. !link

There was not only a clear guide about how to revert the installed drivers, but also about how to install them.

  • Change your boot mode in grub to not boot into the GUI but console mode only
  • Purge all existing NVIDIA drivers
  • Run sudo update-initramfs -u
  • Reboot
  • Install the new drivers
  • Run sudo update-initramfs -u
  • Reboot
  • Run nvidia-smi to check if the GPU is recognized correctly
  • Only then revert the changes in grub to get your GUI back.

Once you have a shell/terminal/console open, start by making sure no Nvidia modules are loaded

sudo lsmod | grep nvidia

If there are still modules loaded, unload them with sudo modprobe -r followed by the name of the module. You might need to do it in a certain order. Then uninstall the Nvidia driver, for example with

sudo apt-get purge "nvidia*"
sudo apt-get autoremove
sudo update-initramfs -u
sudo reboot
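About the "certain order" for unloading modules: as far as I understand, nvidia_drm depends on nvidia_modeset, and both nvidia_modeset and nvidia_uvm depend on nvidia, so dependents have to go first. A dry-run sketch that only prints the commands for the modules actually present in the lsmod output; `nvidia_unload_plan` is a hypothetical helper name and the module list is my assumption about what a typical install loads.

```shell
# Print (not run) the modprobe commands to unload NVIDIA modules, dependents first.
# Reads `lsmod` output on stdin. nvidia_unload_plan is a hypothetical helper name.
nvidia_unload_plan() {
    loaded=$(awk '{ print $1 }')
    # Assumed dependency order: nvidia_drm -> nvidia_modeset -> nvidia, nvidia_uvm -> nvidia
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        if printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "sudo modprobe -r $mod"
        fi
    done
}

# Usage:
#   lsmod | nvidia_unload_plan
```

Printing the commands instead of running them leaves room to review the plan before anything is unloaded.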

Reboot and get back into a shell. Now install the NVIDIA driver you want to install. Make sure to follow the instructions of the installer exactly! If you have secure boot enabled you MUST follow the correct authentication process, otherwise the kernel module will not be loaded.

After the installation you can re-enable the boot to GUI.

-- Freezing the Kernel image --

!link

My laptop booted up again and took me to the login screen. Soon after, I noticed that the operating system ran quite slowly and that Bluetooth and Wi-Fi were gone. At this point, not only could I not connect an external monitor, but I had also lost essential features that had been working before. I had run into this problem before, also caused by installing the Nvidia drivers, so I knew exactly what to do: reboot with a previous kernel version.

For some reason, the latest kernel image (6.5.0-17 at the time of writing) is automatically set as the default at boot, so I had to select a previous kernel version in GRUB. 6.5.0-14 worked fine, so I kept that one. To prevent this from happening again, I looked for instructions on how to disable "automatic kernel updates", which returned useful results.

!link

  1. Open Terminal:

    Open a terminal on your Ubuntu system. You can use the keyboard shortcut Ctrl+Alt+T to open the terminal.

  2. Hold the Current Kernel Package:

    Run the following command to hold the current kernel package to prevent it from being automatically updated:

    sudo apt-mark hold linux-image-generic linux-headers-generic
  3. Check Held Packages:

    You can verify that the packages are held by running:

    dpkg --get-selections | grep hold

    You should see output similar to:

    linux-headers-generic hold
    linux-image-generic hold
    
  4. Reverting the Hold:

    If you want to revert this and allow automatic updates for the kernel again, you can un-hold the packages using:

    sudo apt-mark unhold linux-image-generic linux-headers-generic
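One caveat with the steps above: a 6.5 kernel on Ubuntu 22.04 probably comes from the HWE stack, in which case new kernels would be pulled in by metapackages with a -hwe-22.04 suffix rather than by linux-image-generic. That is an assumption about this machine, not something I verified. A sketch that lists whichever generic kernel metapackages are actually installed, so the hold can target those; `kernel_metapackages` is a hypothetical helper name.

```shell
# List installed generic kernel metapackages from `dpkg --get-selections` output on stdin.
# kernel_metapackages is a hypothetical helper name. Versioned packages such as
# linux-image-6.5.0-14-generic are deliberately not matched.
kernel_metapackages() {
    awk '$2 == "install" || $2 == "hold" { print $1 }' |
        grep -E '^linux-(image-|headers-)?generic(-hwe-[0-9.]+)?(:[a-z0-9]+)?$'
}

# Usage (xargs -r skips the command when the list is empty):
#   dpkg --get-selections | kernel_metapackages | xargs -r sudo apt-mark hold
```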

-- Resuming the drivers revert --

The next step in the guide was to install the drivers with the GUI disabled, but it didn't explain how to do so, arguing that there were loads of tutorials on the net. So I looked one up. !link

sudo systemctl disable gdm
reboot

The next rock in the path was the installer failing miserably, which took me to another thread on the Nvidia developer forums. !link

sudo sh NVIDIA-Linux-x86_64-535.154.05.run
...
cc: error: unrecognized command-line option ‘-ftrivial-auto-var-init=zero’

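I followed the thread, applied the instructions to make GCC 12 the default compiler and tried again. As far as I can tell, the flag in the error message is only understood by GCC 12 and later, while gcc was still pointing to GCC 11.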

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 11
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12
sudo update-alternatives --config gcc
sudo sh NVIDIA-Linux-x86_64-535.154.05.run
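The mismatch behind that error can be spotted before running the installer by comparing the compiler that built the running kernel with the default gcc. A minimal sketch, assuming /proc/version names the compiler in the form gcc-NN (it prints nothing if the kernel reports it differently); `kernel_gcc_major` is a hypothetical helper name.

```shell
# Print the major version of the GCC that built the running kernel.
# kernel_gcc_major is a hypothetical helper; the path argument is optional.
# Assumes the version string contains something like "x86_64-linux-gnu-gcc-12".
kernel_gcc_major() {
    version_file=${1:-/proc/version}
    grep -o 'gcc-[0-9][0-9]*' "$version_file" | head -n 1 | cut -d- -f2
}

# Usage: compare with the major version of the default compiler.
#   kernel_gcc_major
#   gcc -dumpversion
```

If the two numbers differ, switching the default with update-alternatives as above is one way to line them up.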

This time the installation completed successfully. I kept all the default values offered by the installer and declined the automatic configuration of the window manager (I could always do it later if the installation worked).

After rebooting, I checked that the system worked as expected, including Wi-Fi and Bluetooth. My GPU was also detected and the Nvidia drivers were in use.

reboot
nvidia-smi

It was time for clean-up, which meant enabling GDM again. This turned into another puzzle, as surprisingly systemctl wouldn't enable it.

sudo systemctl enable gdm
sudo systemctl -f enable gdm
sudo systemctl -f enable gdm.service

None of these worked.

-- Bringing GDM back --

!link

In this case, I found a thread on AskUbuntu about this matter, although it was for KDM. Interestingly enough, the fix was not in the Answers section but in a comment on the question, which referenced an edit by the original author with a fix that worked for him, though not completely in my case. To sum up, the service had to be reconfigured, but only after it had been started.

sudo systemctl start gdm3
sudo dpkg-reconfigure gdm3
sudo systemctl status gdm
nvidia-smi
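My understanding, which I have not confirmed, is that on Ubuntu the display manager is selected through a display-manager.service symlink that the package's configuration step recreates, which would explain why dpkg-reconfigure helped where systemctl enable did not. A sketch for checking where that symlink points; `active_display_manager` is a hypothetical helper name and the default path is an assumption about the usual systemd layout.

```shell
# Print the unit file name that display-manager.service points to, or "none".
# active_display_manager is a hypothetical helper; the path argument is optional
# and the default location is an assumption about the standard systemd layout.
active_display_manager() {
    link=${1:-/etc/systemd/system/display-manager.service}
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Usage:
#   active_display_manager
```

If this prints none after disabling GDM, that would fit the symptom of the login screen not coming back on its own.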

-- Final touches --

!link

Nvidia's drivers were in use and GDM was back on. As a last touch, I wanted to make sure that the kernel 6.5.0-17 could never come back.

Once again, Google and developer forums had the solution. I edited GRUB's config file to add these 2 lines, applied the changes and rebooted.

GRUB_SAVEDEFAULT=true
GRUB_DEFAULT=saved
sudo update-grub
reboot
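Since GRUB's defaults file is a shell fragment, a later GRUB_DEFAULT=0 line would silently override GRUB_DEFAULT=saved. A small sketch that sources the file in a subshell and checks the values that end up in effect; `grub_saved_default_ok` is a hypothetical helper name and /etc/default/grub is assumed to be the file that was edited.

```shell
# Succeed only if the GRUB defaults file ends up with GRUB_DEFAULT=saved and
# GRUB_SAVEDEFAULT=true. Runs in a subshell so no variables leak out.
# grub_saved_default_ok is a hypothetical helper; the path argument is optional.
grub_saved_default_ok() (
    GRUB_DEFAULT=
    GRUB_SAVEDEFAULT=
    . "${1:-/etc/default/grub}"
    [ "${GRUB_DEFAULT:-}" = "saved" ] && [ "${GRUB_SAVEDEFAULT:-}" = "true" ]
)

# Usage:
#   grub_saved_default_ok && echo "GRUB will remember the last booted entry"
```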

Finally, I listed the installed kernel images and removed the ones I didn't want.

dpkg --list | grep linux-image | grep ii
sudo apt-get remove linux-image-6.5.0-17-generic
sudo rm -rf /lib/modules/6.5.0-17-generic
reboot
nvidia-smi
Wed Feb  7 23:07:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   40C    P8              11W /  80W |     59MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2595      G   /usr/lib/xorg/Xorg                           55MiB |
+---------------------------------------------------------------------------------------+
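To confirm that the driver in use is the one from the .run file, the version can be pulled out of the nvidia-smi banner and compared with the installer's file name. A minimal sketch; `smi_driver_version` is a hypothetical helper name, and it relies on the "Driver Version:" label shown in the output above.

```shell
# Extract the driver version from nvidia-smi banner text read from stdin.
# smi_driver_version is a hypothetical helper name.
smi_driver_version() {
    sed -n 's/.*Driver Version: \([0-9.][0-9.]*\).*/\1/p' | head -n 1
}

# Usage:
#   nvidia-smi | smi_driver_version
# nvidia-smi can also print it directly:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```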

And this is how I managed to install the nvidia-535 driver in Ubuntu 22.04 after 2 hours.
