@imaginary-person
Created June 25, 2021 03:29
AVX512 support in ATen (#56992)
Please refer to the comment below
imaginary-person commented Jun 25, 2021

AVX512 support in ATen

Files in aten/src/ATen/cpu/vec

vec/functional.h is shared by AVX2 & AVX512, as are vec/functional_base.h and vec/bfloat16_functional.h,
so the latter two files were moved to aten/src/ATen/cpu/vec.

First, the files in vec/vec256 were modified to remove code pertaining to CPU_CAPABILITY_AVX.
These files are -
vec/vec256/vec256.h
vec/vec256/vec256_base.h
vec/vec256/vec256_bfloat16.h
vec/vec256/vec256_complex_double.h
vec/vec256/vec256_complex_float.h
vec/vec256/vec256_double.h
vec/vec256/vec256_float.h
vec/vec256/vec256_int.h
vec/vec256/vec256_qint.h
vec/vec256/intrinsics.h

Then, their counterpart files in vec/vec512 were created -
vec/vec512/vec512.h
vec/vec512/vec512_base.h
vec/vec512/vec512_bfloat16.h
vec/vec512/vec512_complex_double.h
vec/vec512/vec512_complex_float.h
vec/vec512/vec512_double.h
vec/vec512/vec512_float.h
vec/vec512/vec512_int.h
vec/vec512/vec512_qint.h
vec/vec512/intrinsics.h

For reviewing the vec/vec512 files, I believe a two-column GUI diff tool would help,
as the files in vec/vec512 are based on the files in vec/vec256.
The intrinsics are not exactly the same, as AVX512 has some intrinsics with no counterpart in AVX2, and vice versa, but most of them are similar.
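
To illustrate how closely most AVX2 and AVX512 intrinsics mirror each other, here is a minimal sketch. It is not taken from the vec/vec256 or vec/vec512 files: `add_floats` is a hypothetical helper, and the code falls back to a scalar loop when neither instruction set is enabled at compile time.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX512F__) || defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper (not part of ATen): element-wise out[i] = a[i] + b[i].
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
  assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
  std::size_t i = 0;
#if defined(__AVX512F__)
  // AVX512: 16 floats per 512-bit zmm register.
  for (; i + 16 <= n; i += 16) {
    __m512 va = _mm512_loadu_ps(a + i);
    __m512 vb = _mm512_loadu_ps(b + i);
    _mm512_storeu_ps(out + i, _mm512_add_ps(va, vb));
  }
#elif defined(__AVX2__)
  // AVX2: 8 floats per 256-bit ymm register; same shape, different prefix.
  for (; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
  }
#endif
  // Scalar tail (and the whole loop when no vector ISA is enabled).
  for (; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}
```

Not every operation maps this directly. Comparisons are one example: `_mm256_cmp_ps` returns a vector of all-ones/all-zeros lanes, whereas `_mm512_cmp_ps_mask` returns a `__mmask16` bitmask, so code built around compare-and-blend needs more than a prefix change.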

Files in aten/src/ATen/test/

aten/src/ATen/test/vec_test_all_types.cpp was modified because one test had been hardcoded for AVX2.
aten/src/ATen/test/vec_test_all_types.h was modified to remove CPU_CAPABILITY_AVX, add CPU_CAPABILITY_AVX512,
and change the alignment size for CPU_CAPABILITY_AVX512.
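
The alignment change follows from the register width: a 512-bit vector spans 64 bytes, while a 256-bit vector spans 32. A minimal sketch of that idea, assuming the build defines `CPU_CAPABILITY_AVX512` for AVX512 kernels; `kVecAlignment`, `AlignedBuffer`, and `is_aligned` are hypothetical names, not the ones used in vec_test_all_types.h:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: CPU_CAPABILITY_AVX512 is defined when compiling AVX512 kernels.
#if defined(CPU_CAPABILITY_AVX512)
constexpr std::size_t kVecAlignment = 64;  // 512 bits
#else
constexpr std::size_t kVecAlignment = 32;  // 256 bits
#endif

// Hypothetical test buffer whose start address suits aligned vector loads.
struct alignas(kVecAlignment) AlignedBuffer {
  float data[16];
};

// Returns true if p is a multiple of alignment (which must be a power of two).
bool is_aligned(const void* p, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Aligned load intrinsics such as `_mm512_load_ps` require a 64-byte-aligned address, which is why a buffer aligned only for AVX2 is not sufficient for AVX512 tests.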

Files in aten/src/ATen/cpu/

`aten/src/ATen/cpu/FlushDenormal.cpp` was modified to conditionally include either `vec512/intrinsics.h` or `vec256/intrinsics.h`.
`aten/src/ATen/cpu/vml.h` was modified to remove code that's redundant now due to AVX support having been removed.

CMake files

cmake/Codegen.cmake was modified to build AVX512 kernels, and to use 32 ymm registers with AVX2 when AVX512VL is supported.
caffe2/CMakeLists.txt was modified to ensure that SIGILLs don't happen.
cmake/Modules/FindAVX.cmake was modified to add checks for AVX512 support.
aten/src/ATen/CMakeLists.txt was modified to add vec/vec512/*.h to a GLOB.
setup.py was also modified to add vec/vec512/*.h, as directed by aten/src/ATen/CMakeLists.txt.
A description of the ATEN_AVX512_256 environment variable was added. When it is TRUE, the build process compiles AVX2 kernels
with 32 ymm registers instead of the default 16.
aten.bzl was modified to remove AVX. I haven't added AVX512 support to the Bazel build yet.

Files in aten/src/ATen/

aten/src/ATen/Version.cpp was modified to add the CPU capability version string for AVX512, and remove that of AVX.

Files in aten/src/ATen/native/ & aten/src/ATen/native/cpu

`aten/src/ATen/native/BatchLinearAlgebraKernel.cpp` - AVX dispatch was removed, and AVX512 dispatch was added.
`aten/src/ATen/native/DispatchStub.cpp` & `aten/src/ATen/native/DispatchStub.h` - AVX dispatch was removed, and AVX512 dispatch was
added. Kernels without AVX512 support are currently allowed to exist, as `nansum` & `sum` currently have poor accuracy with AVX512.
Also, dispatch of AVX512 quantized kernels has been disabled on Windows because of flaky tests that I haven't been able to debug.

aten/src/ATen/native/SegmentReduce.cpp - AVX dispatch was removed, and AVX512 dispatch was added.
aten/src/ATen/native/mkl/SpectralOps.cpp - AVX dispatch was removed, and AVX512 dispatch was added.
aten/src/ATen/native/cpu/SumKernel.cpp - AVX512 dispatch was removed for sum kernel.
aten/src/ATen/native/cpu/SoftMaxKernel.cpp - A TORCH_CHECK was changed because it was hardcoded for AVX2.
aten/src/ATen/native/cpu/ReduceOpsKernel.cpp - AVX512 dispatch was removed for nansum.
aten/src/ATen/native/cpu/Reduce.h - Values hardcoded for AVX2 were conditionally hardcoded for AVX512 as well.
aten/src/ATen/native/cpu/README.md was modified.
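
The idea of allowing kernels without an AVX512 variant (as with `sum` and `nansum` above) can be sketched as a capability-ordered lookup that falls back to the next-best kernel. This is only an illustration under assumed names: `CPUCapability`, `KernelTable`, `choose_kernel`, and the stub functions are hypothetical and do not reproduce the actual DispatchStub implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical capability levels, ordered from least to most capable.
enum class CPUCapability { DEFAULT = 0, AVX2 = 1, AVX512 = 2 };

using SumFn = double (*)(const double*, std::size_t);

// One slot per capability; a null slot means "no kernel was built for it".
struct KernelTable {
  SumFn default_fn;
  SumFn avx2_fn;
  SumFn avx512_fn;
};

// Picks the best available kernel that the CPU supports.
SumFn choose_kernel(const KernelTable& t, CPUCapability cap) {
  assert(t.default_fn != nullptr);  // a baseline kernel must always exist
  if (cap >= CPUCapability::AVX512 && t.avx512_fn != nullptr) {
    return t.avx512_fn;
  }
  if (cap >= CPUCapability::AVX2 && t.avx2_fn != nullptr) {
    return t.avx2_fn;  // used on AVX512 machines when the AVX512 slot is empty
  }
  return t.default_fn;
}

// Stand-in kernels: both are plain loops here; only their identity matters.
double sum_default_stub(const double* x, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i) s += x[i];
  return s;
}

double sum_avx2_stub(const double* x, std::size_t n) {
  return sum_default_stub(x, n);
}
```

With a table whose AVX512 slot is left null, an AVX512-capable machine ends up running the AVX2 kernel, which matches the behavior described for `sum` and `nansum`.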

Additional files that had to be modified in the PR -

test/quantization/bc/test_backward_compatibility.py - test_lstm was skipped for machines with AVX512_VNNI support.
torch/testing/_internal/common_utils.py - IS_AVX512_VNNI_SUPPORTED was added to skip the test mentioned in the file above.
test/cpp/api/dispatch.cpp - Tests CPU Capability dispatch

The files listed above should be in a single PR, but the following files can also be used to create 3 separate PRs

aten/src/ATen/native/cpu/avx_mathfun.h & aten/src/ATen/native/cpu/DistributionTemplates.h - avx_mathfun.h was only being used with
AVX2 for normal_fill. I removed the corresponding SSE code. The corresponding AVX512 version exposes flakiness in some tests, so
CPU_CAPABILITY_AVX512 would also use the AVX2 code.

aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp - AVX512 support was added for various kernels.

.github/scripts/generate_ci_workflows.py & .github/workflows/pytorch-linux-bionic-py3.8-gcc9-coverage.yml - These were
modified/created to add a pytorch-linux-bionic-py3.8-gcc9-coverage workflow. However, it would also have to be removed from
CircleCI.
