SMOS Linux: what to do if not all GPUs are found

Intro

I recently helped a few friends install an ETH mining rig that they'd built on a wobbly Taipei rooftop in the middle of earthquake season. With a motherboard specially designed for mining, SMOS Linux, 3 GPUs and a fourth on the way, they seemed well set for success.

Except, only one GPU would be recognised by the mining program.

All GPUs lit up and spun up their fans. Each of them would also happily connect to our monitor. ls /dev/nvidia* would show three GPUs present. We were dumbstruck because the GPU that would work was one of two identical models in the rig: MSI GTX 1060s. We couldn't figure out why one of them was any different, but Nvidia's System Management Interface (nvidia-smi) indeed returned ERR! instead of the name of two of the cards.

When we left only one of the non-working cards in the system, we'd be told No AMD OPENCL or NVIDIA CUDA GPUs found, exit. Then we found out that one MSI was manufactured in 2017 and the other in early 2018. So something must be minutely different about the two cards. Perhaps one was carrying an updated firmware?

Drivers, of course

I knew of no way to safely flash GPU firmware under Linux, so I decided a different driver and updating CUDA would have to do. But just doing an apt install nvidia-390 cuda-9-1 was not enough. Dmesg, nvcc and modprobe nvidia would all highlight version mismatches between the software and the kernel module, of which version 387.22 somehow seemed baked into the custom SMOS kernel and was not being replaced by the package manager.
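If you want to see the mismatch for yourself, a few read-only checks make it obvious (output varies per system, and nvcc is only there if a toolkit is installed; these are just diagnostics, not part of the fix):

$ cat /proc/driver/nvidia/version   # version of the kernel module actually loaded
$ nvcc --version                    # version of the CUDA toolkit, if present
$ dmesg | grep -i nvidia            # look for "API mismatch" complaints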

The Nvidia CUDA Toolkit documentation has a Linux Installation Guide, which I followed carefully. But after many hours of downloading packages over a slow connection and installing them on a system running from a slow pendrive, I'd always end up with a borked installation, and nothing would replace the kernel module.

Solution: the hard way then.

Here are the steps I took to finally get everything working. Long story short, it involved installing a different Linux kernel instead of the SMOS custom kernel, as there is no matching kernel module for the desired Nvidia drivers.

Not every step may be required but I'm really not tempted to retry every possible configuration to find the shortest way.

It's important that you do this over an SSH connection, just in case you somehow lose your display in the process!

1. Switch to multi-user mode

Switch to runlevel 3 (i.e. multi-user.target) and make it the default runlevel for the time being. This keeps the GUI from loading, so the X server isn't holding on to the Nvidia kernel module while you swap drivers around.

$ sudo systemctl isolate multi-user.target
$ sudo systemctl enable multi-user.target
$ sudo systemctl set-default multi-user.target

Check that you're on the right runlevel by doing $ runlevel. The second number should be 3.
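You can also ask systemd directly, which doubles as a check that the new default stuck:

$ runlevel                  # the second field should be 3
$ systemctl get-default     # should print multi-user.target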

2. Kill any remaining nvidia processes.

$ sudo lsof /dev/nvidia*
$ sudo kill <PID> # (where <PID> is the PID of any process that still uses the GPU)
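If several processes are holding the devices, this one-liner saves some typing (lsof -t prints bare PIDs; treat it as an optional shortcut, not part of the original steps):

$ sudo kill $(sudo lsof -t /dev/nvidia*)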

3. Get rid of everything nvidia.

Unload the nvidia modules (the order is important) and remove any packages that still remotely smell of Nvidia or CUDA.

$ sudo rmmod nvidia_drm
$ sudo rmmod nvidia_modeset
$ sudo rmmod nvidia_uvm
$ sudo rmmod nvidia
$ sudo apt --purge remove cuda* nvidia*
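To verify that the purge got everything, ask dpkg what's left; anything that still shows up can be removed the same way:

$ dpkg -l | grep -Ei 'nvidia|cuda'   # should print nothing once the purge is complete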

Also make sure the nouveau driver won't be loaded, by creating a blacklist file:

$ sudo nano /etc/modprobe.d/blacklist-nouveau.conf

Copy in the following text:

blacklist nouveau
options nouveau modeset=0

And update initramfs:

$ sudo update-initramfs -u
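If you want to be certain nouveau isn't loaded right now either (it usually isn't when the Nvidia driver was in use, but the check is cheap):

$ lsmod | grep nouveau   # no output means the module isn't loaded
$ sudo rmmod nouveau     # only needed if the previous command printed something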

4. Update your system

Run $ sudo nano /etc/apt/sources.list and uncomment all the lines that begin with deb-src by removing the pound symbol at the beginning.

Then:

$ sudo apt update
$ sudo apt upgrade -y
$ sudo apt dist-upgrade -y
$ sudo apt autoremove -y

5. Download and install a different kernel.

This step is crucial, I believe.

The commands below will install the Linux kernel 4.4.14, which is the current one at the time of writing. You can visit the Ubuntu Kernel Mainline page to find a newer one or one you like better. You should go with anything that has the word Xenial in it, and make sure to grab the linux-headers, the generic linux-headers, and the generic linux image.

$ mkdir kernel # do this in a separate directory
$ cd kernel
$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.14-xenial/linux-headers-4.4.14-040414_4.4.14-040414.201606241434_all.deb
$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.14-xenial/linux-headers-4.4.14-040414-generic_4.4.14-040414.201606241434_amd64.deb
$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.14-xenial/linux-image-4.4.14-040414-generic_4.4.14-040414.201606241434_amd64.deb
$ sudo dpkg -i *.deb

Don't reboot yet.
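First confirm that dpkg registered all three packages (the version string below matches the 4.4.14 example; adjust it if you picked another kernel):

$ dpkg -l | grep 4.4.14-040414   # should list both header packages and the image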

6. Configure GRUB to always boot up the new kernel.

First, figure out which kernels are installed in the GRUB boot menu by typing:

$ grep "menuentry" /boot/grub/grub.cfg

This will return a bunch of lines, including one that says something like Advanced options for Ubuntu, followed by blocks enclosed by curly braces { }. Find the first block that involves your kernel version, and count its position from the first (starting from 1!). In my case, the number was 7, so I did:

$ sudo grub-set-default '1>7'

(You will have to change the first number if you have a multi-boot system)
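Counting braces by hand is easy to get wrong, so here is an optional one-liner that prints the menu titles in order (the quoting assumes a stock Ubuntu-style grub.cfg):

$ awk -F\' '/menuentry |submenu /{print $2}' /boot/grub/grub.cfg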

Then reboot your machine:

$ sudo reboot

And check if you have the correct kernel by running uname -r. The returned number should match that of the kernel you just installed.

7. Get the requirements to install CUDA

Download the kernel sources and any other packages required to later install CUDA.

$ sudo apt install linux-image-extra-virtual
$ sudo apt install linux-source
$ sudo apt source linux-image-$(uname -r)
$ sudo apt install linux-headers-$(uname -r)
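Before running the Nvidia installer, it doesn't hurt to confirm the headers ended up where the driver build will look for them:

$ ls /usr/src/linux-headers-$(uname -r)/   # should exist and contain a Makefile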

8. Download and install CUDA

You are now ready to download and install CUDA and the bundled 387.26 driver from Nvidia. I used a local runfile so that I could easily retry without having to download the 1.6 GB file again.

At the time of writing, the CUDA version was 9.1, but you can find newer versions on the Nvidia CUDA Toolkit download page, and adjust your commands accordingly.

Mind that there is a mistake on their webpage: the downloaded file does not carry the .run extension, even though Nvidia's instructions assume it does. I also created a temporary folder for the installer because the default (/tmp) was not large enough on my pendrive.

$ mkdir /home/miner/cudatmp
$ wget https://developer.nvidia.com/compute/cuda/9.1/Prod/local_installers/cuda_9.1.85_387.26_linux
$ sudo sh cuda_9.1.85_387.26_linux --override --tmpdir=/home/miner/cudatmp

Now follow the instructions on screen. I answered everything with yes and chose the defaults whenever given the choice. The entire process can take over 10 minutes, so have patience.
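If you have to rerun the installer after a failed attempt, the 9.x runfile also accepts non-interactive flags. I'd still check sh cuda_9.1.85_387.26_linux --help first, since the exact flag set below is from the generic 9.x installer documentation rather than from this rig:

$ sudo sh cuda_9.1.85_387.26_linux --silent --driver --toolkit --override --tmpdir=/home/miner/cudatmp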

9. Finish the CUDA setup

$ export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64\
                         ${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
$ /usr/bin/nvidia-persistenced --verbose
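Those exports only last for the current shell. To make them survive a reboot, you can append them to your shell profile (shown here for the mining user's ~/.bashrc; adjust the path if your setup differs):

$ echo 'export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc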

If all went well, the nvidia-smi command should now show the correct GPU names. If it doesn't, try rebooting first with sudo reboot.
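For a more compact check, nvidia-smi can print just the fields that matter here; a card that is still unhappy will show an error in place of its name:

$ nvidia-smi --query-gpu=index,name,driver_version --format=csv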

10. Start mining

Now make runlevel 5 (graphical.target) the default runlevel again, so that SMOS can start all the necessary configuration scripts to mine.

$ sudo systemctl isolate graphical.target
$ sudo systemctl enable graphical.target
$ sudo systemctl set-default graphical.target
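As a final check before leaving the SSH session, confirm the default target really switched back, so the rig comes up mining on its own after the next reboot:

$ systemctl get-default   # should print graphical.target
$ sudo reboot             # optional: reboot once to confirm everything starts unattended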