@fm4dd
Last active April 9, 2024 19:24
GCC compiler optimization for ARM-based systems

2017-03-03 fm4dd

The gcc compiler can optimize code by taking advantage of CPU-specific features. For ARM CPUs in particular, this can have a noticeable impact on application performance. ARM CPUs, even within the same architecture, can be implemented with different versions of the floating-point unit (FPU). Using the FPU to its full potential improves performance, especially under heavier operating systems such as full Linux distributions.

-mcpu, -march: Defining the CPU type and architecture

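Both flags select the target: -mcpu names a specific CPU model, while -march names an architecture (for example 'armv7-a'). On ARM, -mcpu implies the matching -march and -mtune, so setting one or the other is usually sufficient.

GCC supports the Cortex-A family; which cores are available depends on the GCC version. Examples are 'cortex-a5', 'cortex-a7', 'cortex-a8', 'cortex-a9', 'cortex-a15', 'cortex-a53', and 'cortex-a72'. For specific values on popular boards, see the tables below.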
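One way to see what a given -mcpu value implies is to ask GCC itself. The following is a minimal sketch, assuming an ARM-targeting gcc is installed; the helper name filter_target_opts is made up for illustration, while -Q --help=target is a standard GCC option pair that lists the target options in effect.

```shell
#!/bin/sh
# Hypothetical helper: keep only the target options discussed here
# from the output of "gcc -Q --help=target", read on stdin.
filter_target_opts() {
    grep -E '^[[:space:]]*-m(arch|cpu|tune|fpu|float-abi)='
}

# Usage on an ARM toolchain (CC is an assumption; adjust to your compiler):
#   ${CC:-gcc} -mcpu=cortex-a7 -Q --help=target | filter_target_opts
```

Comparing the filtered output for -mcpu=... against the output for -march=... alone should show whether the tuning and FPU defaults differ on your toolchain.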

-mfloat-abi, -mfpu: Defining the floating-point application binary interface type (ABI) and floating point unit type

Nearly all ARM Cortex-A processors available today come with a floating-point unit called VFP (vector floating-point), although it is an optional part on some cores. Most also have a SIMD co-processor called NEON. Confusingly, both are types of floating-point units. VFP is older, IEEE-754 compliant, and provides double precision (64-bit). NEON is a multi-function co-processor that offers single-precision (32-bit) vector functionality on 32-bit CPUs, and gained double precision on 64-bit ARM.

As a result, there are several -mfpu options to choose from, depending on the actual CPU implementation. Some have a shorter alias: 'neon' is an alias for 'neon-vfpv3', and 'vfp' is an alias for 'vfpv2'. -mfpu and -mfloat-abi are important flags, because GCC will only use floating-point and NEON instructions if it is explicitly told to do so. It is also important to set -mfpu in combination with -mfloat-abi.

-mfloat-abi sets the overall strategy for floating-point code compilation. It has only three values: 'soft', 'softfp', or 'hard'. 'soft' generates library calls for floating-point operations. 'softfp' allows the generation of code using hardware floating-point instructions, but still uses the soft-float calling conventions. 'hard' generates floating-point instructions and uses FPU-specific calling conventions, so code built with 'hard' cannot be linked against code built with 'soft' or 'softfp'.
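To check what a particular -mfpu / -mfloat-abi combination actually enables, the compiler's predefined macros can be inspected. This is a sketch under assumptions: the helper name has_macro is hypothetical, and the compiler and flags in the usage comment are only examples. The macros __ARM_NEON, __ARM_FP and __ARM_PCS_VFP (hard-float calling convention) are the ARM C Language Extensions names that GCC defines on ARM targets.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named macro appears in the
# "#define NAME VALUE" list read on stdin (the output of "gcc -dM -E").
has_macro() {
    grep -q -E "^#define $1( |\$)"
}

# Usage on an ARM toolchain (CC and the flags are assumptions):
#   if ${CC:-gcc} -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 \
#        -dM -E - </dev/null | has_macro __ARM_NEON; then
#       echo "NEON code generation is enabled"
#   fi
```

Running the same check for __ARM_PCS_VFP with 'softfp' and with 'hard' is a quick way to confirm which calling convention a build will use.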

Other options that "can" improve performance in certain circumstances

-mtune tunes instruction scheduling and similar heuristics for a specific CPU model, without changing which instructions may be used. For example, on a Raspberry Pi 1, it could be set to -mtune=arm1176jzf-s. When -mcpu is already set, -mtune defaults to the same CPU.

-mneon-for-64bits lets NEON handle scalar 64-bit operations. This flag takes no arguments. It is disabled by default due to the high cost of moving data from core registers to NEON on 32-bit CPUs. Newer GCC releases document this flag as deprecated and without effect.

-funsafe-math-optimizations can improve floating-point performance by allowing the compiler to vectorize floating-point code with NEON. On 32-bit ARM, NEON does not fully implement IEEE-754 (denormal values are treated as zero), so using NEON instructions may lead to a loss of precision.

-munaligned-access, -mno-unaligned-access
Enables (or disables) reading and writing of 16- and 32-bit values from addresses that are not 16- or 32-bit aligned. The CPU type defines the correct default, and changing it may negatively impact performance.

Pre-ARMv6, ARMv6-M, and ARMv8-M Baseline architectures default to -mno-unaligned-access; -munaligned-access is the default for all other architectures. If unaligned access is not enabled, then words in packed data structures are accessed one byte at a time.

Flags that don't need to be specified, because they are correct by default:

-mlittle-endian generates code for little-endian CPUs. This is the default.

CPU information for popular ARM-based boards:

Board | CPU | Architecture | ARM Core | FPU
Raspberry Pi 1 | Broadcom BCM2835 | ARMv6 | ARM11 (ARM1176JZFS) | VFPv2 (VFP only, no NEON)
Raspberry Pi 2 | Broadcom BCM2836 | ARMv7-A | Cortex-A7 MPcore | VFPv4-D32 (VFP and NEON)
BeagleBone Black | TI Sitara AM3358/9 | ARMv7-A | Cortex-A8 | VFPv3-D32 (VFP and NEON)
Altera Cyclone V5 | 5CSEMA4U23C6N A9 | ARMv7-A | Cortex-A9 MPcore | VFPv3-D32 (VFP and NEON)
NanoPi NEO 2 | Allwinner H5 | ARMv8-A | Cortex-A53 | VFPv4 (VFP and NEON)
Raspberry Pi 3 | Broadcom BCM2837 | ARMv8-A | Cortex-A53 | VFPv4 (VFP and NEON)
Amazon EC2 A1 | Graviton | ARMv8-A | Cortex-A72 | VFPv4 (VFP and NEON)
Raspberry Pi 4 | Broadcom BCM2711 | ARMv8-A | Cortex-A72 | VFPv4 (VFP and NEON)

GCC compiler options for popular ARM-based boards:

Board | GCC optimisation flags
Raspberry Pi 1 | -mcpu=arm1176jzf-s -mfloat-abi=hard -mfpu=vfp (alias for vfpv2)
Raspberry Pi 2 | -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4
BeagleBone Black | -mcpu=cortex-a8 -mfloat-abi=hard -mfpu=neon (alias for neon-vfpv3)
Altera Cyclone V5 | -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon (alias for neon-vfpv3)
Raspberry Pi 3 | -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits
Amazon EC2 A1 | -mcpu=cortex-a72 -mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits
Raspberry Pi 4 | -mcpu=cortex-a72 -mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits
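The table above can be turned into a small lookup for build scripts. This is only a sketch: the function name arm32_flags_for_core is made up, the mapping simply mirrors the table rows, and -mneon-for-64bits is left out because newer GCC documentation lists it as deprecated.

```shell
#!/bin/sh
# Hypothetical helper: print the 32-bit GCC flags from the table above
# for a given -mcpu core name; returns non-zero for unknown cores.
arm32_flags_for_core() {
    case "$1" in
        arm1176jzf-s)          fpu="vfp" ;;
        cortex-a7)             fpu="neon-vfpv4" ;;
        cortex-a8|cortex-a9)   fpu="neon" ;;
        cortex-a53|cortex-a72) fpu="neon-fp-armv8" ;;
        *) echo "unknown core: $1" >&2; return 1 ;;
    esac
    echo "-mcpu=$1 -mfloat-abi=hard -mfpu=$fpu"
}

# Example: flags for a Raspberry Pi 2 (Cortex-A7)
arm32_flags_for_core cortex-a7
```

Keeping the mapping in one function means a new board only needs one extra case line.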

GCC compiler tuning options for popular ARM-based boards:

Board | GCC extra tuning flags
Raspberry Pi 1 | -mtune=arm1176jzf-s
Raspberry Pi 2 | -mtune=cortex-a7
BeagleBone Black | -mtune=cortex-a8
Altera Cyclone V5 | -mtune=cortex-a9
Raspberry Pi 3 | -mtune=cortex-a53
Raspberry Pi 4 | -mtune=cortex-a72

Note that gcc's optimisation creates code for a specific CPU/board. If that code is copied to another platform, it may perform differently (e.g. poorly), or fail with an illegal-instruction error if the other CPU lacks the instructions that were used.

How to set compiler options

For most projects, setting the CFLAGS environment variable or adding the flags to the corresponding Makefile variable will do the trick.

export CFLAGS="-mcpu=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits"
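As a rough sketch of how this could be automated, the snippet below picks CFLAGS from the machine name reported by uname -m. The function name cflags_for_machine and the mapping itself are assumptions for illustration (an 'armv7l' system is not necessarily a Cortex-A7, for instance). Note that the AArch64 GCC port does not accept -mfpu or -mfloat-abi, so only -mcpu is set there.

```shell
#!/bin/sh
# Hypothetical helper: choose CFLAGS from the machine name that
# "uname -m" reports. The mapping is an assumption for illustration.
cflags_for_machine() {
    case "$1" in
        armv6l)  echo "-mcpu=arm1176jzf-s -mfloat-abi=hard -mfpu=vfp" ;;
        armv7l)  echo "-mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4" ;;
        aarch64) echo "-mcpu=cortex-a53" ;;  # no -mfpu/-mfloat-abi on AArch64
        *)       echo "" ;;                  # unknown machine: add nothing
    esac
}

CFLAGS="$(cflags_for_machine "$(uname -m)")"
export CFLAGS
echo "CFLAGS=$CFLAGS"

# Typical ways to hand the flags to a build (commented out on purpose):
#   make CFLAGS="$CFLAGS"
#   cmake -DCMAKE_C_FLAGS="$CFLAGS" ..
```

For CMake-based projects, CMAKE_C_FLAGS is the standard variable for extra C compiler flags; it can be passed on the command line as shown, or set in CMakeLists.txt.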

Note to crypto miners

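Starting with the ARMv8 architecture, a crypto acceleration extension has been added. Where a CPU implements it, it accelerates AES and SHA1/SHA2. The extension is optional, so check that the Features line in /proc/cpuinfo lists aes, sha1 and sha2 before enabling it; the Broadcom SoCs in the Raspberry Pi 3 and 4 are widely reported not to include it. To enable it, append +crypto to the CPU name, e.g. -mcpu=cortex-a53+crypto or -mcpu=cortex-a72+crypto.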

Further crypto support (SHA3, SHA512) gets added from ARMv8.4-A onwards.
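Whether a given board actually has the crypto extension can be checked before adding +crypto. The following is a minimal sketch, assuming a Linux system where /proc/cpuinfo has a "Features" line (as ARM kernels provide); the helper name has_cpu_feature is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the "Features" line of a cpuinfo file
# lists the given feature word. The file defaults to /proc/cpuinfo.
has_cpu_feature() {
    feature="$1"
    file="${2:-/proc/cpuinfo}"
    [ -r "$file" ] || return 1
    grep -m1 '^Features' "$file" | tr '\t:' '  ' | grep -q -w -- "$feature"
}

if has_cpu_feature aes && has_cpu_feature sha2; then
    echo "crypto extension reported: +crypto should be usable"
else
    echo "crypto extension not reported: leave +crypto off"
fi
```

The optional second argument makes it possible to check a cpuinfo dump copied from another machine, which helps when cross-compiling.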

@gldneagl

gldneagl commented Mar 4, 2018

How do I determine where to place the GCC compiler options in the make file, aka CMakeLists.txt, in this case?

I am not completely new at this but I am having trouble locating where in a C make file to place the option string.

@usptact

usptact commented Dec 15, 2018

@gldneagl I believe the support of those options comes from the GCC compiler that you have installed. Perhaps reading the release notes is the place to start looking.

@zertyz

zertyz commented May 14, 2019

Very nice article. Thank you. It would be nice to have a section about clang too. I didn't find anything similar to "-mneon-for-64bits" for clang...

@linuxonlinehelp

Problem: https://www.hardkernel.com/shop/odroid-n2-with-4gbyte-ram/
DataSheet Image of Layout: https://dn.odroid.com/S922X/ODROID-N2/Pictures/figure-3.png

Hello all, can anyone tell me the NEON parameters for the Odroid N2? It uses two CPU types (A53 + A7*) as a hexacore.
I need this urgently, because this does not compile.
Old Makefile of csdr:
PARAMS_NEON = -mfloat-abi=hard -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mvectorize-with-neon-quad -funsafe-math-optimizations -Wformat=0 -DNEON_OPTS
My attempt, which fails:
PARAMS_NEON = -mfloat-abi=hard -march=armv8-a -mtune=cortex-a53 -mfpu=neon -mvectorize-with-neon-quad -funsafe-math-optimizations -Wformat=0 -DNEON_OPTS

@fm4dd
Author

fm4dd commented Dec 29, 2019

If I am reading the N2 specs right, the CPU is the Amlogic S922X. The datasheet is somewhat vague on the specific SoC implementation, which leads to a bit of speculation unless the hardware is available for tests.

I'd try compiling with the following settings (untested):

-march=armv8-a
-mtune=cortex-a73.cortex-a53
-mfpu=crypto-neon-fp-armv8 (or try leaving it out for the 'auto' setting)

e.g. experiment like this:

PARAMS_NEON = -mfloat-abi=hard -march=armv8-a -mtune=cortex-a73.cortex-a53 -mfpu=crypto-neon-fp-armv8 -mvectorize-with-neon-quad -funsafe-math-optimizations -Wformat=0 -DNEON_OPTS

@linuxonlinehelp

linuxonlinehelp commented Dec 29, 2019 via email

@waterwin

Dear Frank,

Do you intend to update the information with the RPi4 details?
That would/could help a lot of people updating their source code for the no-longer-so-new RPi4.

Thanks, regards,
Erwin

@fm4dd
Author

fm4dd commented Feb 4, 2020

Hi Erwin,

Agree that it is long overdue, and was just done! I could finally spend a few hours this Tuesday on a RPi4, and confirmed the compiler options. Thanks to Eduardo for lending me his device. The performance is really as good as it looks in the specs.

Cheers,
Frank

@waterwin

waterwin commented Feb 4, 2020

Great, hope many developers will now start to update their makefiles and the rest for the RPi4 as well.

Erwin

@potentialdiffer

Hi, thank you for your great overview!

You may also want to mention general compiler optimization by adding -O, or respectively -O2 and -O3. This was unclear to me, as I did not see any great performance boost from adding only the FPU settings.

Christian

Source: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

@mxa

mxa commented May 9, 2020

gcc: warning: '-mcpu=' is deprecated; use '-mtune=' or '-march=' instead

@tomek-szczesny

tomek-szczesny commented Mar 14, 2021

Hello! I own an Odroid XU4 and an N2+. I'd love to know the optimal GCC flags for those, and I could test the proposed options. If you could point me in the right direction, I'd be very glad to help.
Also, please forgive my ignorance, but are those flags expected to work with g++ as well?
Many thanks!

EDIT:
I started playing with compile options by building Himeno-Benchmark with various flags, and my conclusion is that no flag other than GCC's -O3 has any effect on performance on my Odroid N2+.
Perhaps the reason is that I use aarch64, which has most of the features enabled by default, as those are mandatory in this architecture.

So, this output:
gcc himenobmtxpa.c -O3 -o himenobmtxpa
performed the same as this:
gcc himenobmtxpa.c -O3 -march=armv8-a+crc+crypto+simd+predres -mtune=cortex-a73.cortex-a53 -Wformat=0 -o himenobmtxpa

Without -O3, additional compiler options also don't make any difference, and the benchmark gives about 5 times worse results.

One thing I'm not sure about is whether that means that aarch64 compilation benefits very little from -m* flags, or that my Armbian distro somehow has all of them enabled by default.

EDIT2: It seems that these flags do something after all. The -O3 programs with and without -m options have identical size but different content. Perhaps this benchmark didn't expose the potential of these optimizations.

@fm4dd
Author

fm4dd commented Mar 21, 2021

Hi Tomek,

Hello! I own an Odroid XU4 and an N2+. I'd love to know the optimal GCC flags for those

According to the Odroid XU4 Wiki, the CPU is a Samsung Exynos 5422, an 8-core 32-bit ARMv7 type. Below are the parameters I would try for getting the best results:

march=armv7-a mtune=cortex-a7.cortex-a15 -mfloat-abi=hard -mfpu=neon-vfpv4

For the N2+, the Amlogic S922X revision C improves the core clock speeds. I would try the N2 options mentioned a few comments up.

One thing I'm not sure about is whether that means that aarch64 compilation has very limited benefit
aarch64, which has most of the features enabled by default, as those are mandatory in this architecture.

Yes. This is my observation too. Starting with the armv8 architecture, compiler flag optimization, especially for VFP and Thumb instructions, is becoming less important. CPU-specific features such as crypto-engine support may still be helpful, but only to accelerate specific functions.

Cheers,
Frank

@tomek-szczesny

Hello, Frank!

Sorry it took me so long to look into this.

According to the Odroid XU4 Wiki, the CPU is a Samsung Exynos 5422, an 8-core 32-bit ARMv7 type. Below are the parameters I would try for getting the best results:

march=armv7-a mtune=cortex-a7.cortex-a15 -mfloat-abi=hard -mfpu=neon-vfpv4

I succeeded in compiling a benchmark on the XU4 after correcting a few errors - missing dashes, and the reversed order of the big.LITTLE cores, like so:
-march=armv7-a -mtune=cortex-a15.cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4

I compiled himenobmtxpa with the arguments you provided, and without them. In both cases, I added -O3. I ran the benchmark separately on the A15 and A7 cores using taskset. The A15 core showed a 2% decrease in performance, while the A7 lost 5% - I assume this simply means "no gain whatsoever".
Running the benchmark on "all cores" gives a result close to A15 core alone - so no magic threading occurred either.

Both binaries, again, have identical sizes but differ in content, according to diff.

Hmm, then again, perhaps the benchmark I'm exercising here does not benefit from these optimizations at all. Anyway, there you go, another set of settings you may add to your list. :)

Have a good day!
Tom

@vincent-olivert-riera

gcc: warning: '-mcpu=' is deprecated; use '-mtune=' or '-march=' instead

That's not true for ARM:

$ /usr/bin/x86_64-redhat-linux-gcc -mcpu=x86-64 /tmp/wop.c 
x86_64-redhat-linux-gcc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead
cc1: warning: ‘-mtune=x86-64’ is deprecated; use ‘-mtune=k8’ or ‘-mtune=generic’ instead as appropriate [-Wdeprecated]

$ ./gcc-arm-10.2-2020.11-x86_64-arm-none-linux-gnueabihf/bin/arm-none-linux-gnueabihf-gcc -mcpu=cortex-a72 /tmp/wop.c
$

@marcost2

Hello, Frank!

Sorry it took me so long to look into this.

According to the Odroid XU4 Wiki, the CPU is a Samsung Exynos 5422, an 8-core 32-bit ARMv7 type. Below are the parameters I would try for getting the best results:
march=armv7-a mtune=cortex-a7.cortex-a15 -mfloat-abi=hard -mfpu=neon-vfpv4

I succeeded in compiling a benchmark on the XU4 after correcting a few errors - missing dashes, and the reversed order of the big.LITTLE cores, like so:
-march=armv7-a -mtune=cortex-a15.cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4

I compiled himenobmtxpa with the arguments you provided, and without them. In both cases, I added -O3. I ran the benchmark separately on the A15 and A7 cores using taskset. The A15 core showed a 2% decrease in performance, while the A7 lost 5% - I assume this simply means "no gain whatsoever".
Running the benchmark on "all cores" gives a result close to A15 core alone - so no magic threading occurred either.

Both binaries, again, have identical sizes but differ in content, according to diff.

Hmm, then again, perhaps the benchmark I'm exercising here does not benefit from these optimizations at all. Anyway, there you go, another set of settings you may add to your list. :)

Have a good day!
Tom

Sorry for barging in, but I stumbled upon this post after reading this:
https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu
So maybe you do want to try -mcpu instead of -mtune and -march, as it may be a case of the chip not being a "pure" v7 implementation, so that you aren't using its full potential?
