CUDA Compilers

In general, check the crt/host_config.h file to find out which host compiler versions are supported. Sometimes it is possible to hack the requirements there to get some newer versions working, too :)
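A minimal Python sketch of that check: it lists the `#error` lines that mention an unsupported compiler. The helper name and the sample excerpt are made up for illustration; real headers differ between releases, so treat the sample text as an assumption, not as a copy of any shipped file.

```python
import re

def unsupported_compiler_messages(header_text):
    """Return the text of every '#error' line that mentions 'unsupported'."""
    messages = []
    for line in header_text.splitlines():
        match = re.match(r"\s*#\s*error\s+(.*unsupported.*)", line)
        if match:
            messages.append(match.group(1).strip())
    return messages

# Hypothetical, shortened excerpt in the style of crt/host_config.h.
sample = """\
#if __GNUC__ > 11
#error -- unsupported GNU version! gcc versions later than 11 are not supported!
#endif
"""

for message in unsupported_compiler_messages(sample):
    print(message)
```

To run it against an installed toolkit, read the header from disk instead of `sample`, e.g. from `$CUDA_ROOT/include/crt/host_config.h` (path assumed from the locations mentioned in this document).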

Thrust version can be found in $CUDA_ROOT/include/thrust/version.h.
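The header stores the version as one packed integer in the `THRUST_VERSION` macro (major * 100000 + minor * 100 + subminor). A small Python sketch for decoding it; the function names are hypothetical and the sample line is an illustrative value, not read from a real installation.

```python
import re

def decode_thrust_version(value):
    """Split the packed THRUST_VERSION integer into (major, minor, subminor)."""
    return value // 100000, (value // 100) % 1000, value % 100

def thrust_version_from_header(text):
    """Find '#define THRUST_VERSION <int>' in the header text and decode it."""
    match = re.search(r"^\s*#\s*define\s+THRUST_VERSION\s+(\d+)", text, re.MULTILINE)
    if match is None:
        raise ValueError("THRUST_VERSION not found")
    return decode_thrust_version(int(match.group(1)))

sample = "#define THRUST_VERSION 100907\n"  # illustrative value
print(thrust_version_from_header(sample))  # prints (1, 9, 7)
```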

Download Archives: https://developer.nvidia.com/cuda-toolkit-archive

Release notes for CUDA Toolkit (CTK):

Version notes Nvidia HPC SDK:

Compatibility Guarantees

Quote:

  • CUDA 10.0: First introduced in CUDA 10, the CUDA Forward Compatible Upgrade is designed to allow users to get access to new CUDA features and run applications built with new CUDA releases on systems with older installations of the NVIDIA datacenter GPU driver.
  • CUDA 11.1: First introduced in CUDA 11.1, CUDA Enhanced Compatibility provides two benefits:
    • By leveraging semantic versioning across components in the CUDA Toolkit, an application can be built for one CUDA minor release (such as 11.1) and work across all future minor releases within the major family (such as 11.x).
    • CUDA has relaxed the minimum driver version check and thus no longer requires a driver upgrade with minor releases of the CUDA Toolkit.
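As a rough illustration, the first Enhanced Compatibility bullet can be restated as a version comparison. This Python sketch encodes only the quoted sentence; the function name is made up, and real deployments depend on further conditions (driver version, PTX usage) that are not modelled here.

```python
def within_enhanced_compatibility(built_with, installed):
    """Restate the quoted rule for (major, minor) tuples.

    True if the build release is 11.1 or newer, both releases share the
    major version, and the installed minor release is not older than the
    one the application was built for.
    """
    if built_with < (11, 1):
        return False  # the quote dates the feature to CUDA 11.1
    return built_with[0] == installed[0] and installed[1] >= built_with[1]

print(within_enhanced_compatibility((11, 1), (11, 4)))  # same 11.x family
print(within_enhanced_compatibility((11, 1), (12, 0)))  # different major family
```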

nvcc

Latest, official compiler requirements: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

CUDA version SM Arch g++ icpc pgc++ xlC MSVC clang++ Linux driver thrust note
1.0 1.0-1.1 ? ? ?
1.1 1.0-1.1 ? ? ?
2.0 1.0-1.1 ? ? ?
2.1 1.0-1.3 ? ? ?
2.3.1 1.0-1.3 ? ? ?
3.0 1.0-2.0 ? ? ?
3.1 1.0-2.0 ? ? ?
3.2 1.0-2.1 ? 11.1 ?
4.0 1.0-2.1 <=4.4 11.1 ?
4.1 1.0-2.1 <=4.5 11.1 ?
4.2 1.0-2.1 <=4.6 11.1 ?
5.0 1.0-3.? <=4.6 11.1 ? ? 1.5.3
5.5 1.0-3.? <=4.8 12.1 ? ? 1.7.0 C++11 on host side supported; ICC fixed to build 20110811
6.0 1.0-5.0 <=4.8 13.1 ? 331.62 1.7.1
6.5 1.1-5.X <=4.8 14.0 ? ? ? 1.7.2 experimental device side C++11 support; including this version, <thrust/sort.h> screws up __CUDA_ARCH__ (must be undefined on host); deprecation of SM 11-13 (10 removed)
7.0.17 (RC) s. below <=4.9 15.0 >=14.9 13.1.1 ? 346.29 1.8.0 first official PGI support, first xlc string found; powerpc64 w. little endian supported
7.0.27 2.0-5.X <=4.9 15.0 >=14.9 13.1.1 2010-13 346.46 1.8.1 official C++11 support on device side
7.5 <=4.9 15.0 15.4 13.1 2010-13 3.5-3.6 352.41? 1.8.2 clang (host) on linux supported, __CUDACC_VER__ macros added
7.5.18 2.0-5.X <=4.9 15.0 15.4 13.1 2010-13 352.39 1.8.2
8.0.44 2.0-6.X <=5.3 15.0(.4)-16.0 16(.3)+ 13.1(.2) 2012-15 3.8-3.9 367.48 1.8.3-patch2 sm_60 (pascal) support added
8.0.61 2.0-6.X <=5.3 15.0(.4)-17.0 16(.3)+ 13.1(.2) 2012-15 3.8-3.9 375.26 1.8.3-patch2 nvcc 8 is incompatible with std::tuple in gcc 5.4+
9.0.69 (RC) 3.0-7.0 <=5.5 (<=6) 15.0(.4)-17.0 17 13.1(.2) 2012-17 3.8-3.9 ???.?? 1.9.0-patch4 device-side C++14; __CUDACC_VER__ deprecated for __CUDACC_VER_MAJOR/MINOR/BUILD__
9.0.103 (RC) 3.0-7.0 <=5.5 (<=6) 15.0(.4)-17.0 17 13.1(.2) 2012-17 3.8-3.9 384.59 1.9.0-patch4 same as above, __CUDACC_VER__ defined as #error rendering it fully broken
9.0.176 3.0-7.0 <=5.5 (<=6) (15.0-)17.0 17.1 13.1(.5) 2012-17 (3.8-)3.9 384.81 1.9.0-patch5 same as above
9.1.85 3.0-7.2 <=5.5 (<=6) (15.0-)17.0 17.X 13.1(.6) 2012-17 (3.8-)4.0 390.46 1.9.1-patch2 math_functions.hpp moved to crt/
9.1.85.1 cuBLAS 9.1.128: Volta GEMM kernels optimized
9.1.85.2 ptxas: fix address calculations using large immediate operands
9.1.85.3 cuBLAS: fixes to GEMM optimizations for convolutional sequence to sequence (seq2seq) models.
9.0-9.1 nvcc 9.0-9.1 is incompatible with std::tuple in gcc 6+
9.2.88 3.0-7.2 <=7.3.0 (<=7) (15.0-)17.0 17-18.X 13.1(.6),16.1 2012-17 (3.8-)5.0 396.26 1.9.2 CUTLASS 1.0 added; std::tuple fixed (prior GCC 6 issues)
9.2.148 396.37 1.9.2
10.0.130 3.0-7.5 <=7 (15.0-)18.0 17-18.X 13.1, 16.1 2013-17 (3.8-)6.0 410.48 1.9.3 CUDA Forward Compatible Upgrade
10.1.105 3.0-7.5 <=8 (15.0-)19.0 17-19.X 2013-19 (3.8-)7.0 418.39 1.9.4
10.1.168 (3.8-)8.0 418.67 10.1 "Update 1"
10.1.243 418.87 10.1 "Update 2"
10.2.89 3.0-7.5 <=8 (15.0-)19.0 18-19.X 13.1, 16.1 2015-19 (3.3-)8.X 440.33.01 1.9.7 sm_30,35,37,50 deprecated; nvcc: -allow-unsupported-compiler
11.0.1 (RC) NVCC:11.0.167 3.5-8.0 (5-)6-9.* (15.0-)19.1 18-20.1 13.1, 16.1 2015-19 3.2-9.0.0 450.36.06 1.9.9 macOS dropped; libs drop pre-C++11, deprecate pre-C++14 (GCC < 5, Clang < 6, and MSVC < 2017); Arm C/C++ 19.2 support
11.0.2-1 NVCC:11.0.194 (3.3/)6-9.0.0 450.51.05 nvcc: --Wext-lambda-captures-this
11.0.3 NVCC:11.0.221 ? ? ? ? ? ? ? 450.51.06 ? 11.0 "Update 1"; nvcc: --forward-unknown-to-host-compiler, --forward-unknown-to-host-linker flags
11.1.0 NVCC:11.1.74 3.5-8.6 (5-)6-10.0 (15.0-)19.1 18-20.1 13.1, 16.1 2017-19 (3.3/)6-10.X 455.23.05 1.9.10-1 Ubuntu@ppc64le deprecated; CUDA Enhanced Compatibility
11.1.1 NVCC:11.1.? ? ? ?
11.2.0 NVCC:11.2.67 <12 460.27.04 1.10.0
11.2.1 NVCC:11.2.142 460.32.03 ? "Update 1"
11.2.2 NVCC:11.2.152 460.32.03 ? "Update 2"
11.3.0 NVCC:11.3.58 6.0-10.X 465.19.01 ? cu++flt added, Python Driver/RT bindings, alloca()
11.4.0 NVCC:11.4.48 6.0-11.X <13 470.42.01 ? sm30,32 and Ubuntu 16.04 dropped, C++11 stdlib for math
11.4.1 NVCC:11.4.100 6.0-11.X 470.57.02 ? 11.4 "Update 1", fix g++ 10 issues with chrono headers of libstdc++; Ubuntu 16.04 dropped (x86)
11.4.2 NVCC:11.4.120 3.2-12.X 470.57.02 ? ...
11.5.0 NVCC:11.5.50 6.0-11.X 3.2-12.X 495.29.05 ? ...
11.5.1 NVCC:11.5.119
11.6.0 NVCC:11.6.55 6.0-11.X adds VS2022 3.2-13.X 510.39.01 ? adds -arch=native and PTX generation in nvlink (for LTO workflows with PTX)
11.6.1 NVCC:11.6.112 510.47.03 ?
11.6.2 NVCC:11.6.124 510.47.03 ?
11.7.0 NVCC:11.7.64 ? ? ? ? 515.43.04 ?
11.7.1 NVCC:11.7.99 515.65.01 ?
11.8.0 NVCC:11.8.89 6.0-11.2.1 520.61.05 ?
CUDA version SM g++ icpc pgc++ xlC MSVC clang++ Linux driver thrust note

SM: means SM architecture support.

pgc++: now NVHPC products, e.g., nvc/nvfortran/nvc++.

Note: empty cells generally mean "same as above" for readability.

macOS: As of 7.0, clang seems to be the only supported compiler on macOS (but no version check found). CUDA 10.1.243 adds support for Xcode 10.2. CUDA 11.0 dropped macOS support.

Compilers such as pgC, icc, xlC are only supported on x86 linux and little endian.

Dynamic parallelism was added with sm_35 and CUDA 5.0.

Newer CUDA releases have a per-release support matrix for compilers, which also lists supported kernel and glibc versions: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements

clang++ -x cuda

clang++ can compile CUDA C++ to ptx as well. Give it a whirl!

| clang++ | supported CUDA release | supported SMs |
|---------|------------------------|---------------|
| 3.9-5.0 | 7.0-8.0 | 2.0-(5.0)6.0 |
| 6.0 | 7.0-9.0 | (2.0)3.0-7.0 |
| 7.0 | 7.0-9.2 | (2.0)3.0-7.2 |
| 8.0 | 7.0-10.0 | (2.0)3.0-7.5 |
| 9.0 | 7.0-10.1 | (2.0)3.0-7.5 |
| 10.0 | 7.0-10.1 | (2.0)3.0-7.5 |
| 11.0 | 7.0-11.0 | (2.0)3.0-8.0 |
| 12.0 | 7.0-11.0 | (2.0)3.0-8.0 |
| 13.0 | 7.0-11.2 | (2.0)3.0-8.6 |
| 14.0 | 7.0-11.5 | (2.0)3.0-8.6 |
| 15.0 | 7.0-11.5 | (2.0)3.0-8.6 |
| main | 7.0-11.5 | (2.0)3.5-9.0 |

https://llvm.org/docs/CompileCudaWithLLVM.html

Device-Side C++ Standard Support

C++ core language features:

| compiler | supported C++ standard | notes |
|----------|------------------------|-------|
| nvcc -6.0 | c++03 | |
| nvcc 6.5 | c++03, exp. c++11 | undocumented |
| nvcc 7.0-8.0 | c++03,11 | only c++11 switch |
| nvcc 9.0-10.2 | c++03,11,14 | 10.2 adds libcu++ (atomics); open repository: https://github.com/NVIDIA/libcudacxx/releases |
| nvcc 11.0.167+ | c++03,11,14,17 | C++11 host compiler needed for math libs; ships C++11-compatible backport of the C++20 synchronization library; device LTO added; starting with CUDA Toolkit 11.0.1, nvcc and CUDA Toolkit versions are not equivalent anymore |
| clang 5+ | c++03,11,14,17 | |
| clang 6+ | c++03,11,14,17,2a | |
| clang 10+ | c++03,11,14,17,20 | |
| clang 13+ | c++03,11,14,17,20,2b | |
| clang trunk | c++03,11,14,17,20,2b | status |
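The nvcc rows of the table can be turned into a small lookup, for example to pick a `-std=` flag in a build script. This is a Python sketch with a made-up function name; it only mirrors the rows above and treats the experimental C++11 mode of 6.5 as unsupported.

```python
def nvcc_device_standards(version):
    """Map an nvcc (major, minor) version to the device-side standards listed above."""
    if version >= (11, 0):
        return ["c++03", "c++11", "c++14", "c++17"]
    if version >= (9, 0):
        return ["c++03", "c++11", "c++14"]
    if version >= (7, 0):
        return ["c++03", "c++11"]
    return ["c++03"]  # 6.5 adds only experimental, undocumented c++11

print(nvcc_device_standards((10, 2))[-1])  # newest standard listed for nvcc 10.2
```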

CUDA-enabled C++ standard library libcu++, based on LLVM's libc++ (docs):

| release | introduced components | notes |
|---------|-----------------------|-------|
| CUDA 10.2 | <atomic> (SM6.0+), <type_traits> | introduction of libcu++ |
| CUDA 11.0 | atomic<T>::wait/notify, <barrier>, <latch>, <counting_semaphore>(SM7.0+), <chrono>, <ratio>, <functional> w/o function | anticipated with GTC 2020 slides |
| CUDA 11.2 | cuda::std::tuple,pair | notes |
| CUDA next | cuda::std::complex, backports: chrono, type_traits | notes |
| newer | see the release notes and api docs | all open source now |

Incremental libcu++ release goals (GTC 2020):

  • Version 1 (CUDA 10.2): <atomic>(SM6.0+), <type_traits>.
  • Version 2 (CUDA next): atomic<T>::wait/notify, <barrier>, <latch>, <counting_semaphore>(SM7.0+), <chrono>, <ratio>, <functional>minus function.
  • Future priorities: atomic_ref<T>, <complex>, <tuple>, <array>, <utility>, <cmath>, string processing, ...

NVC++

NVC++ is a unified C++ compiler and GPU-accelerated STL for the CUDA platform. It also seems to support OpenACC. NVC++ currently does not support the CUDA C++ language.

| compiler | supported C++ standard | notes |
|----------|------------------------|-------|
| nvc++ 11.0 | ...,c++17 | initial release, ships C++11-compatible backport of the C++20 synchronization library |

All GPU compilers are cheese.


psychocoderHPC commented Dec 18, 2020

@ax3l @bernhardmgruber pointed out that the table is saying clang 3.X is not supported for CUDA 11.X. I think it is wrong because clang 3.3+ is supported.


ghost commented Dec 27, 2020

@Artem-B (and others)

Any idea when clang will support CUDA 11? In my group, we are undecided whether to choose C++20 and CUDA 10.1 or C++17 and CUDA 11 (with nvcc instead of clang) for an upcoming project which we will start in around 3 months.
Ideally we would prefer to use clang with support for CUDA 11 (as we will be working with new Ampere cards), but some of the features of C++20 (e.g. modules) are too good to ignore for a project that will be written from scratch.


Artem-B commented Dec 27, 2020

Clang (top-of-the-tree) is able to compile with CUDA-11.x. It works well enough to compile TensorFlow. What's missing is the support for the full set of the new TensorCore instructions for newer GPUs (that's been the case for a while, already), ability to target sm_86, and support for bf16/tf32 types. Existing code that compiles with CUDA-10.1 is expected to compile with CUDA-11.x.

I'll likely add sm_86 support in January. The rest is a larger undertaking with no specific plans to get it done. In practice the code that would need TensorCore instructions tends to use inline assembly for the new instructions and should compile with clang now. Support for bf16/tf32 will probably not happen any time soon.


ghost commented Dec 27, 2020

@Artem-B
Thanks for the update. We already have a few 3090s (on which we will be using fp16) while we wait for the A100 system to be delivered, so we will probably start working with clang and then see whether it's worth porting the C++20 code to C++17 for use with nvcc.

Also, any idea of how clang is comparing to nvcc in terms of optimal code/performance?
We will be using cuda/cudnn/cublas.


Artem-B commented Dec 31, 2020

Also, any idea of how clang is comparing to nvcc in terms of optimal code/performance?

Performance-wise clang is usually on par. Sometimes a bit better (it tends to have a bit better optimizer), sometimes a bit worse (e.g. NVCC is much more aggressive at unrolling loops). A lot of the differences are eliminated due to the fact that both NVCC and clang in the end use ptxas which optimizes generated PTX and often produces nearly identical SASS for somewhat different PTX from both compilers.

We will be using cuda/cudnn/cublas.

For using NVIDIA's libraries the compiler does not matter -- it's the same library regardless of whether you use clang or nvcc.
For CUDA sources, see above.


ghost commented Jan 2, 2021

Great, Thanks.

Author

ax3l commented Jan 2, 2021

@psychocoderHPC thank you for the update!

I updated the old clang versions, since the changelog does officially drop support for them. That means the library teams at Nvidia do not run CI against it anymore, so the range of usage will get narrower and less useful for those ancient compiler releases with modern CUDA releases (think: cuRand, cuBLAS, cuFFT, cub and thrust). Also, Nvidia libs transition all to C++14, so essentially you don't want to bother with gcc<5 and clang<5 anymore with CUDA11+.

With that disclaimer said, I updated the table accordingly nonetheless :) (putting ancient asserts in brackets)


mabraham commented Jan 3, 2021

Note that https://llvm.org/docs/CompileCudaWithLLVM.html still documents that latest supported CUDA is 10.1. Whose error is that?

Author

ax3l commented Jan 3, 2021

@mabraham The documentation is in this file: https://github.com/llvm/llvm-project/blob/main/llvm/docs/CompileCudaWithLLVM.rst
You could send a PR to http://reviews.llvm.org and assign/request @Artem-B for a review :)


Artem-B commented Jan 5, 2021

latest supported CUDA is 10.1.
This is still true. This is the version for which clang implements all builtins needed by CUDA headers.
More recent CUDA versions will mostly compile and work as long as you don't happen to need the new compiler builtins. I.e. if you include mma.h or cuda_fp16.h when bf16/tf32 types are enabled in CUDA-11.x, things will likely break. Most of the code that compiles with CUDA-11.1 will still compile with newer CUDA versions, so clang issues a warning, but allows compilation to proceed. The results vary.

Author

ax3l commented Apr 10, 2021

You can check out the PyTorch documentation or use a from-source package manager like Spack (package: py-torch). This comment section is not the right place for support for specific CUDA-dependent software though, all we do is document CUDA compiler compatibility.

Author

ax3l commented May 26, 2021

@Artem-B I was wondering, should the C++ standard default for -x cuda maybe be in lockstep with the Clang default for -x c++ to avoid confusion?

$ clang++-9 -dM -E -x c++  /dev/null | grep -F __cplusplus
#define __cplusplus 201402L

$ clang++-9 -dM -E -nocudalib -nocudainc -x cuda /dev/null | grep -F __cplusplus
#define __cplusplus 199711L
#define __cplusplus 199711L
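
A minimal Python sketch for reading dumps like the two above. The lookup table holds the standard `__cplusplus` values; the helper name is made up for illustration.

```python
import re

CPLUSPLUS_TO_STANDARD = {
    199711: "C++98/03",
    201103: "C++11",
    201402: "C++14",
    201703: "C++17",
    202002: "C++20",
}

def standards_from_macro_dump(dump):
    """Return the standard name for each '#define __cplusplus <value>L' line."""
    values = re.findall(r"#define __cplusplus (\d+)L", dump)
    return [CPLUSPLUS_TO_STANDARD.get(int(v), "unknown") for v in values]

host_dump = "#define __cplusplus 201402L\n"
cuda_dump = "#define __cplusplus 199711L\n#define __cplusplus 199711L\n"

print(standards_from_macro_dump(host_dump))
print(standards_from_macro_dump(cuda_dump))
```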


Artem-B commented May 26, 2021

Huh. Interesting. I don't think we do anything special to set the default C++ version during CUDA compilation.
I suspect that whatever sets it only checks if the input language is C++, but does not pay attention to C++ extensions.

Yes, it would make sense to match the default version set by clang for C++. I'll get that fixed.

Author

ax3l commented May 26, 2021

@Artem-B awesome, thanks a lot!

If it's any good, the friendly folks over at AMD/HIP seem to have the same challenge :)
ROCm-Developer-Tools/HIP#2278


Artem-B commented May 26, 2021

CUDA and HIP front-end share the same code under the hood, so it's not surprising.


Artem-B commented May 27, 2021

AMD folks beat me to it: https://reviews.llvm.org/D103221

Author

ax3l commented May 28, 2021

Cool! But that diff only changed the HIP frontend, not yet the CUDA frontend?
The CUDA defaults a few lines above probably need changing, too :)


Artem-B commented Jun 2, 2021

It's c++14 by default now for both CUDA and HIP: llvm/llvm-project@f7e87dd

Author

ax3l commented Jun 4, 2021

Wuhu, thanks a lot!


bernhardmgruber commented Aug 13, 2021

nvcc 11.4.1 now works again with g++ 10. Previous versions of nvcc choked somewhere in the chrono headers of libstdc++ IIRC.
Also g++ 11 is now supported with nvcc 11.4.1. I just successfully compiled an alpaka/LLAMA program with both combinations.

Author

ax3l commented Aug 20, 2021

Thanks for the info @bernhardmgruber. So you say we should mark nvcc 11.1..11.4.0 as broken for g++10, right?


bernhardmgruber commented Aug 20, 2021

I am mainly happy that nvcc 11.4.1 supports g++11 now, which you updated, thx!

So you say we should mark nvcc 11.1..11.4.0 as broken for g++10, right?

Well, some versions of nvcc before 11.4.1 failed to parse the chrono headers. Whether that should be considered broken is beyond my judgement. Some people may not need this part of the stdlib and are fine. So I think your comment, that nvcc 11.4.1 fixes compilation of chrono, is good enough. Thanks!

Author

ax3l commented Sep 4, 2021

Automated compiler crawling script on include/host_config.h by @haampie:
spack/spack#25054 (comment)


Artem-B commented Sep 4, 2021

BTW, top-of-the-tree clang (14?) now defaults to sm_35, with CUDA support bumped up to 11.4.
Default C++ version for CUDA compilation now matches that of C++ compilation and is currently C++14.

Author

ax3l commented Jan 21, 2022

Thanks Artem, updated :)


DStrelak commented Jan 24, 2022


Flamefire commented Oct 6, 2022

CUDA 11.3.1 supports GCC 10.x, i.e. all GCC 10 minor versions; similarly, 11.4.1 supports all GCC 11 versions. So, more completely:

  • 9.2 GCC < 8
  • 10.1 - 10.2 GCC < 9
  • 11.0 GCC < 10
  • 11.1 - 11.3 GCC < 11
  • 11.4 - 11.7 GCC < 12

And for Clang:

  • 10.2 Clang < 9
  • 11.1 Clang < 11
  • 11.2 - 11.3 Clang < 12
  • 11.4 - 11.5 Clang < 13
  • 11.6 - 11.7 Clang < 14

All checked for the SDKs I have installed via the checks in crt/host_config.h. I found this after an error in PyTorch pointed me to a compatibility check based on this table
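
The two lists above can be written down as a lookup. This is a Python sketch with made-up names; the bounds are copied from the lists, and CUDA releases that are not listed return `None` instead of a guess. Passing the header check only means nvcc will not refuse the host compiler, which is not the same as every header compiling (see the g++ 10 chrono reports elsewhere in this thread).

```python
# (first CUDA release, last CUDA release, exclusive upper bound on the
# host compiler major version), copied from the lists above.
GCC_LIMITS = [
    ((9, 2), (9, 2), 8),
    ((10, 1), (10, 2), 9),
    ((11, 0), (11, 0), 10),
    ((11, 1), (11, 3), 11),
    ((11, 4), (11, 7), 12),
]
CLANG_LIMITS = [
    ((10, 2), (10, 2), 9),
    ((11, 1), (11, 1), 11),
    ((11, 2), (11, 3), 12),
    ((11, 4), (11, 5), 13),
    ((11, 6), (11, 7), 14),
]

def host_compiler_accepted(limits, cuda, compiler_major):
    """True/False if the CUDA release is listed, None if it is not covered."""
    for first, last, upper in limits:
        if first <= cuda <= last:
            return compiler_major < upper
    return None

print(host_compiler_accepted(GCC_LIMITS, (11, 2), 10))    # GCC 10 with CUDA 11.2
print(host_compiler_accepted(CLANG_LIMITS, (11, 6), 14))  # Clang 14 with CUDA 11.6
```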

Author

ax3l commented Nov 1, 2022

@Flamefire Thanks, that usually is right - I try to document the host compiler version known at release time and documented by Nvidia for the specific release to have been tested with. Minor releases of host compilers released after the CTK usually (but not always) work well together.

Updated the 11.3 and 11.4 tables according to your tests - thanks a lot!

Glad to see that PyTorch cites us! :)

Author

ax3l commented Nov 1, 2022

CUDA 11.2.2 supports GCC-9, not 10, see:
https://docs.nvidia.com/cuda/archive/11.2.2/cuda-installation-guide-linux/index.html

@DStrelak thanks - I think that is not as strict, checking crt/host_config.h. See @Flamefire's comment for comparison.


DStrelak commented Nov 15, 2022

@DStrelak thanks - I think that is not as strict, checking crt/host_config.h. See @Flamefire's comment for comparison.

I disagree, at least in my particular case it seems to be rather explicit about it :-)

/usr/local/cuda-11.2/bin/nvcc -o mycode.o -c --x cu -D_FORCE_INLINES -Xcompiler -fPIC -ccbin /usr/bin/g++-10 -std=c++14 --expt-extended-lambda -gencode=arch=compute_60,code=compute_60 -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_86,code=compute_86 -I../ -I/usr/include -I/usr/include/hdf5/serial -I/usr/include/opencv4 -Iexternal -Ilibraries mycode.cpp
/usr/include/c++/10/chrono: In substitution of 'template<class _Rep, class _Period> template<class _Period2> using __is_harmonic = std::__bool_constant<(std::ratio<((_Period2::num / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::num, _Period::num)) * (_Period::den / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::den, _Period::den))), ((_Period2::den / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::den, _Period::den)) * (_Period::num / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::num, _Period::num)))>::den == 1)> [with _Period2 = _Period2; _Rep = _Rep; _Period = _Period]':
/usr/include/c++/10/chrono:473:154:   required from here
/usr/include/c++/10/chrono:428:27: internal compiler error: Segmentation fault
  428 |  _S_gcd(intmax_t __m, intmax_t __n) noexcept
      |                           ^~~~~~
0x7f784e68b08f ???
	/build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x7f784e66c082 __libc_start_main
	../csu/libc-start.c:308
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.
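
The repeated `-gencode` arguments in the command line above follow a fixed pattern, so a build script can generate them. A small Python sketch with a hypothetical helper name; by default it reproduces the PTX-only form used above (`code=compute_XX`), and `embed_sass=True` switches to `code=sm_XX`.

```python
def gencode_flags(compute_capabilities, embed_sass=False):
    """Build one -gencode flag per compute capability (e.g. 60 for sm_60)."""
    flags = []
    for cc in compute_capabilities:
        code = ("sm_" if embed_sass else "compute_") + str(cc)
        flags.append("-gencode=arch=compute_{0},code={1}".format(cc, code))
    return flags

# Same targets as in the nvcc invocation above.
print(" ".join(gencode_flags([60, 61, 75, 86])))
```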
