@tleyden
Created November 4, 2014 17:00
CoreOS with Nvidia CUDA GPU drivers

Launch CoreOS on AWS GPU instance

  • Go to the "Launch CoreOS on AWS" page and find the HVM AMI you want to use, e.g. ami-d878c3b0

  • Go to AWS control panel under EC2 instances

  • Launch a new instance

  • Under "Community AMIs", search for ami-d878c3b0

  • Select a GPU instance type, e.g. g2.2xlarge

  • Increase root EBS store from 8 GB -> 20 GB to give yourself some breathing room
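
The console steps above could also be scripted with the AWS CLI. The sketch below only prints the command instead of running it. The instance type g2.2xlarge, the root device name /dev/xvda and the key pair name my-key are assumptions that do not come from these notes, so check them against your account and the AMI before running anything.

```shell
#!/bin/sh
# Print (but do not run) an AWS CLI command equivalent to the console steps.
AMI_ID="ami-d878c3b0"        # HVM AMI from the steps above
INSTANCE_TYPE="g2.2xlarge"   # assumption: a GPU instance type
ROOT_DEVICE="/dev/xvda"      # assumption: root device name of this AMI
ROOT_SIZE_GB=20              # root EBS volume grown from 8 GB to 20 GB
KEY_NAME="my-key"            # hypothetical key pair name

launch_command() {
  # JSON block device mapping that resizes the root volume.
  mapping="[{\"DeviceName\":\"$ROOT_DEVICE\",\"Ebs\":{\"VolumeSize\":$ROOT_SIZE_GB}}]"
  printf '%s\n' "aws ec2 run-instances --image-id $AMI_ID --instance-type $INSTANCE_TYPE --key-name $KEY_NAME --block-device-mappings '$mapping'"
}

launch_command
```

Printing the command first makes it easy to review the AMI ID and volume size before any instance is actually created.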

Run docker container in privileged mode

$ docker run --privileged=true -i -t ubuntu:12.04 /bin/bash

After the above command, you should be inside a shell in your docker container. The rest of the steps will assume that you are running them from inside your docker container.

Install build tools + other required packages

$ apt-get update
$ apt-get install build-essential wget git

Upgrade GCC to 4.7

Install

$ apt-get install software-properties-common python-software-properties 
$ add-apt-repository ppa:ubuntu-toolchain-r/test
$ apt-get update
$ apt-get install gcc-4.7 g++-4.7

Add GCC 4.7

$ update-alternatives --remove gcc /usr/bin/gcc-4.6
$ update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.7 60 --slave /usr/bin/g++ g++ /usr/bin/g++-4.7
$ update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.6 40 --slave /usr/bin/g++ g++ /usr/bin/g++-4.6

Make sure GCC 4.7 is the default alternative

$ update-alternatives --config gcc
...
* 0 /usr/bin/gcc-4.7 60 auto mode
...

It should have gcc-4.7 as the default choice with the asterisk next to it.
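
If you would rather check this non-interactively, something like the following sketch should do. It relies on gcc -dumpversion; is_gcc_47 is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Succeed if a gcc version string (e.g. "4.7" or "4.7.3") is in the 4.7 series.
is_gcc_47() {
  case "$1" in
    4.7|4.7.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query the compiler if one is installed.
if command -v gcc >/dev/null 2>&1; then
  version=$(gcc -dumpversion)
  if is_gcc_47 "$version"; then
    echo "gcc $version: OK"
  else
    echo "gcc $version: expected 4.7.x, re-check update-alternatives"
  fi
fi
```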

Get kernel source

$ mkdir -p /usr/src/kernels
$ cd /usr/src/kernels
$ git clone https://github.com/coreos/linux.git

Find the tag that matches the kernel version of your CoreOS release (uname -r on the host reports it). For example, I believe the latest stable release currently corresponds to the v3.8 tag.

$ cd linux
$ git checkout v3.8
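
Rather than guessing the tag, it may help to derive it from the running kernel. A container shares the host's kernel, so uname -r inside the container reports the CoreOS kernel release. The sketch below turns a release string into an upstream-style tag name. It assumes the coreos/linux fork keeps upstream's vX.Y and vX.Y.Z tag names, which I have not verified; release_to_tag is a hypothetical helper.

```shell
#!/bin/sh
# Map a kernel release string to an upstream-style tag name,
# e.g. "3.16.2+" -> "v3.16.2" and "3.8.0" -> "v3.8".
# Assumption: the fork uses upstream's tag naming.
release_to_tag() {
  printf '%s\n' "$1" |
    sed -e 's/^\([0-9][0-9.]*\).*/\1/' \
        -e 's/^\([0-9]*\.[0-9]*\)\.0$/\1/' \
        -e 's/^/v/'
}

# Tag name suggested by the kernel that is currently running.
release_to_tag "$(uname -r)"
```

Before checking out, you can confirm that the suggested tag exists with git tag -l followed by the tag name.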

Compile kernel source -- Take 1

If you don't do this step, you'll get an error about not having a /usr/src/kernels/linux/include/linux/version.h file when you run NVIDIA-Linux-x86_64-340.29.run.

$ wget https://raw.githubusercontent.com/coreos/coreos-overlay/f0de215426eede4d479c732ea1f575d99b978f3a/sys-kernel/coreos-kernel/files/x86_64_defconfig-3.16.2
$ mv x86_64_defconfig-3.16.2 .config
$ make

Error

root@373c6d823a5b:/usr/src/kernels/linux# make
  HOSTCC  scripts/kconfig/conf.o
  ...
scripts/kconfig/conf --silentoldconfig Kconfig
.config:2620:warning: symbol value 'm' invalid for USB_OHCI_HCD_PCI
.config:2622:warning: symbol value 'm' invalid for USB_OHCI_HCD_PLATFORM
.config:2984:warning: symbol value 'm' invalid for XEN_TMEM
warning: (BLK_DEV_RBD && CEPH_FS) selects CEPH_LIB which has unmet direct dependencies (NET && INET && EXPERIMENTAL)
*
* Restart config...
*
*
* General setup
*
Prompt for development and/or incomplete code/drivers (EXPERIMENTAL) [N/y/?] (NEW) n

I gave up on trying to use the kernel config from coreos-overlay, and tried a different direction.
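
I never diagnosed this failure. One thing worth ruling out is a version mismatch, since the config file is named for 3.16.2 while the tree was checked out at v3.8. The sketch below compares the two. tree_version and config_version are hypothetical helpers, and the guess that a mismatch caused the prompts above is unconfirmed.

```shell
#!/bin/sh
# Read VERSION/PATCHLEVEL/SUBLEVEL from the top-level kernel Makefile.
tree_version() {
  awk -F' *= *' '
    $1 == "VERSION"    && v == "" { v = $2 }
    $1 == "PATCHLEVEL" && p == "" { p = $2 }
    $1 == "SUBLEVEL"   && s == "" { s = $2 }
    END { print v "." p "." s }
  ' "$1/Makefile"
}

# Take the version suffix from a config file name such as x86_64_defconfig-3.16.2.
config_version() {
  name=$(basename "$1")
  printf '%s\n' "${name##*-}"
}

KERNEL_DIR="${KERNEL_DIR:-/usr/src/kernels/linux}"
CONFIG_NAME="x86_64_defconfig-3.16.2"

if [ -f "$KERNEL_DIR/Makefile" ]; then
  tree=$(tree_version "$KERNEL_DIR")
  cfg=$(config_version "$CONFIG_NAME")
  if [ "$tree" = "$cfg" ]; then
    echo "config and tree are both $tree"
  else
    echo "config is for $cfg but the tree is $tree"
  fi
fi
```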

Compile kernel source -- Take 2

If you don't do this step, you'll get an error about not having a /usr/src/kernels/linux/include/linux/version.h file when you run NVIDIA-Linux-x86_64-340.29.run.

$ apt-get install libncurses5-dev
$ make menuconfig

When the screen pops up, just choose "Exit" and hit "Yes" when it asks you to save.

$ make

This works, and the kernel compiles. However, running NVIDIA-Linux-x86_64-340.29.run against this build still fails with the error shown under "Install nvidia driver" below.
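
To confirm that the build actually produced the version.h header the installer complains about, a check along these lines might help. To my understanding, kernels since 3.7 generate the header under include/generated/uapi/linux/ rather than include/linux/, so the sketch looks in both places. find_version_header is a hypothetical helper.

```shell
#!/bin/sh
# Print the path of version.h inside a kernel tree, checking the old
# location and the generated location used by newer kernels.
find_version_header() {
  for candidate in \
    "$1/include/linux/version.h" \
    "$1/include/generated/uapi/linux/version.h"; do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

KERNEL_DIR="${KERNEL_DIR:-/usr/src/kernels/linux}"
if header=$(find_version_header "$KERNEL_DIR"); then
  echo "found $header"
else
  echo "no version.h under $KERNEL_DIR yet; has the kernel been built?"
fi
```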

Install nvidia driver

Download

$ mkdir -p /opt/nvidia
$ cd /opt/nvidia
$ wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run

Unpack

$ chmod +x cuda_6.5.14_linux_64.run
$ mkdir nvidia_installers
$ ./cuda_6.5.14_linux_64.run -extract=`pwd`/nvidia_installers

Install

$ cd nvidia_installers
$ ./NVIDIA-Linux-x86_64-340.29.run --kernel-source-path=/usr/src/kernels/linux/

Error

Using the "Take 2" method of compiling the kernel source above, I got the following error when installing the driver:

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a
         version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA kernel module
         from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.

         Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
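
The error lists "wrong or improperly configured kernel sources" as a likely cause. Since the container runs on the host's kernel, one hedged way to narrow this down is to compare the running release with the release string of the tree the module was built against, then pull the sections the installer points to out of its log. built_release and releases_match are hypothetical helpers. The include/config/kernel.release file is written by the kernel build.

```shell
#!/bin/sh
# Release string of a built kernel tree.
built_release() {
  cat "$1/include/config/kernel.release"
}

# Succeed only if the running and built release strings are identical.
releases_match() {
  [ "$1" = "$2" ]
}

KERNEL_DIR="${KERNEL_DIR:-/usr/src/kernels/linux}"
LOG="/var/log/nvidia-installer.log"

if [ -f "$KERNEL_DIR/include/config/kernel.release" ]; then
  running=$(uname -r)
  built=$(built_release "$KERNEL_DIR")
  if releases_match "$running" "$built"; then
    echo "running kernel and built tree are both $running"
  else
    echo "running kernel is $running but the tree was built as $built"
  fi
fi

# Show the log sections named in the error message, if the log exists.
if [ -f "$LOG" ]; then
  grep -A 5 -e 'Kernel module load error' -e 'Kernel messages' "$LOG" || true
fi
```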

Verify

TODO
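
I never got this far, so the following is only a sketch of what verification might look like once the driver installs cleanly: check that the nvidia module shows up in /proc/modules, then run nvidia-smi, which ships with the driver. module_loaded is a hypothetical helper.

```shell
#!/bin/sh
# Succeed if a module name appears in a /proc/modules-style file.
# The second argument defaults to /proc/modules.
module_loaded() {
  grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded nvidia; then
  echo "nvidia kernel module is loaded"
else
  echo "nvidia kernel module is not loaded"
fi

# Query the GPU through the driver if the tool is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi || echo "nvidia-smi failed"
fi
```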

@moacirsouza

I don't think you intend to go back to this any time soon, right? :)

@tleyden

tleyden commented Jun 4, 2019

Probably not, but I would look at http://tleyden.github.io/blog/2014/11/04/coreos-with-nvidia-cuda-gpu-drivers/, which looks like a newer version of this.

@moacirsouza

Thanks tleyden, and I'm sorry it took me so long to answer. I'll check it out!