tniessen/stm32-cmox-crypto-crc-aes-gcm.md

## stm32-cmox-crypto-crc-aes-gcm.md

      
X-CUBE-CRYPTOLIB (CMOX) and the STM32's CRC unit

This write-up is about the STM32 cryptographic firmware library X-CUBE-CRYPTOLIB, also known as the Cortex-M Optimized Crypto Stack (CMOX). It is a cryptographic library developed by STMicroelectronics (ST) for their series of STM32 processors, which are based on the ARM Cortex-M family.
Hardware features vary across different STM32 processors. Because CRC checksums are widely used in embedded systems, most (if not all) STM32 processors feature a hardware CRC unit that is supposed to accelerate CRC computations.
Introduction

Interestingly, on the page "Getting started with the Cryptographic Library", ST claims:

The Cryptographic Library uses the STM32 CRC peripheral for some internal computing.

Indeed, when one attempts to perform AES or AES-GCM operations using CMOX without properly setting up the CRC peripheral first, one will encounter no errors but incorrect results, such as wrong ciphertexts and authentication tags during encryption, and verification failures or incorrect plaintexts during decryption.
Of course, one now wonders:
 
What part of the Rijndael cipher or of the GCM construction benefits from a hardware CRC unit?
 
Note
ST only distributes CMOX as precompiled static libraries, along with C header files that contain necessary declarations. The company does not provide any source code for CMOX, which would have made this investigation much simpler.

Background

The ST processors' 32-bit CRC units essentially divide the input, interpreted as a polynomial over GF(2), by a configurable generator polynomial and return only the remainder. The generator polynomial, which acts as the divisor, has degree 32, i.e., it consists of 33 coefficients, each of which is an element of GF(2). Because the most significant coefficient is always assumed to be 1, only the lower 32 coefficients need to be encoded, thus forming a 32-bit representation.
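To make the division concrete, here is a minimal software sketch of CRC-32/MPEG-2, the variant that comes up later in this write-up: generator polynomial 0x04C11DB7, initial value 0xFFFFFFFF, no bit reflection, and no final XOR. The function names are made up for illustration, and feeding a 32-bit word most significant byte first is an assumption about how a single write to the CRC data register is processed.

```c
#include <stddef.h>
#include <stdint.h>

/* Feed one byte into the CRC state, most significant bit first.
 * Each step multiplies the state by x and reduces modulo the generator. */
static uint32_t crc32_mpeg2_update(uint32_t crc, uint8_t byte)
{
    crc ^= (uint32_t)byte << 24;
    for (int bit = 0; bit < 8; bit++) {
        crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : (crc << 1);
    }
    return crc;
}

/* CRC-32/MPEG-2 over a byte buffer: initial value 0xFFFFFFFF, no final XOR. */
uint32_t crc32_mpeg2_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc = crc32_mpeg2_update(crc, data[i]);
    }
    return crc;
}

/* Assumption: a 32-bit word is consumed starting with its top byte. */
uint32_t crc32_mpeg2_word(uint32_t word)
{
    const uint8_t bytes[4] = {
        (uint8_t)(word >> 24), (uint8_t)(word >> 16),
        (uint8_t)(word >> 8), (uint8_t)word
    };
    return crc32_mpeg2_bytes(bytes, sizeof(bytes));
}
```

For example, `crc32_mpeg2_bytes` over the nine ASCII characters "123456789" yields 0x0376E6E7, the usual check value for this CRC variant.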
In fact, both Rijndael and GCM also employ polynomial arithmetic.
Rijndael requires polynomial multiplication over the Galois field GF(2⁸) with reducing polynomial x⁸ + x⁴ + x³ + x + 1. In other words, each element of GF(2⁸) is a polynomial of degree at most 7 over GF(2) and is represented by eight bits. Beyond that, in Rijndael's MixColumns algorithm, four such elements of GF(2⁸) are interpreted as the coefficients of a polynomial of degree at most 3 over GF(2⁸). Each such polynomial thus has a 32-bit representation.
Alternatively, the same operation can be expressed as matrix multiplication in which each entry is an element of GF(2⁸). This alternative representation allows fast implementations that do not require much polynomial arithmetic at all (see this example).
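As an illustration of the arithmetic described above, the following sketch multiplies two elements of GF(2⁸) by shift-and-add and applies the MixColumns matrix to a single column. It is a plain textbook formulation with made-up function names, not a reconstruction of what CMOX does internally.

```c
#include <stdint.h>

/* Multiply by x in GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf256_xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Shift-and-add multiplication of two elements of GF(2^8). */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    while (b != 0) {
        if (b & 1) {
            product ^= a;
        }
        a = gf256_xtime(a);
        b >>= 1;
    }
    return product;
}

/* MixColumns on one column, written as multiplication by the matrix
 * [2 3 1 1; 1 2 3 1; 1 1 2 3; 3 1 1 2] with entries in GF(2^8). */
void aes_mix_column(uint8_t col[4])
{
    const uint8_t a0 = col[0], a1 = col[1], a2 = col[2], a3 = col[3];
    col[0] = (uint8_t)(gf256_mul(a0, 2) ^ gf256_mul(a1, 3) ^ a2 ^ a3);
    col[1] = (uint8_t)(a0 ^ gf256_mul(a1, 2) ^ gf256_mul(a2, 3) ^ a3);
    col[2] = (uint8_t)(a0 ^ a1 ^ gf256_mul(a2, 2) ^ gf256_mul(a3, 3));
    col[3] = (uint8_t)(gf256_mul(a0, 3) ^ a1 ^ a2 ^ gf256_mul(a3, 2));
}
```

Nothing in this computation is a 32-bit polynomial division, which is why the CRC unit does not obviously help here.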
GCM, on the other hand, performs polynomial multiplication over the Galois field GF(2¹²⁸) defined by x¹²⁸ + x⁷ + x² + x + 1. The coefficients of such polynomials are thus elements of GF(2), and an entire polynomial has a 128-bit representation.
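For comparison, here is a bit-by-bit sketch of multiplication in GCM's field. It follows the generic shift-and-add method and assumes GCM's usual bit order, in which the most significant bit of the first byte is the coefficient of x⁰. The function name is hypothetical, and real implementations typically use tables or carry-less multiplication instead.

```c
#include <stdint.h>
#include <string.h>

/* z = x * y in GF(2^128) modulo x^128 + x^7 + x^2 + x + 1.
 * Bit order: the top bit of byte 0 is the coefficient of x^0. */
void gcm_gf128_mul(const uint8_t x[16], const uint8_t y[16], uint8_t z[16])
{
    uint8_t v[16];
    uint8_t acc[16] = {0};

    memcpy(v, y, sizeof(v));
    for (int i = 0; i < 128; i++) {
        /* If the coefficient of x^i in x is set, add the current multiple. */
        if (x[i / 8] & (0x80 >> (i % 8))) {
            for (int j = 0; j < 16; j++) {
                acc[j] ^= v[j];
            }
        }
        /* Multiply v by x: shift right by one bit in this representation. */
        const int carry = v[15] & 1;
        for (int j = 15; j > 0; j--) {
            v[j] = (uint8_t)((v[j] >> 1) | (v[j - 1] << 7));
        }
        v[0] >>= 1;
        /* Reduce: x^128 = x^7 + x^2 + x + 1, which is 0xe1 in byte 0. */
        if (carry) {
            v[0] ^= 0xe1;
        }
    }
    memcpy(z, acc, sizeof(acc));
}
```

Multiplying the element x (0x40 followed by zero bytes) by x¹²⁷ (zero bytes followed by 0x01) gives 0xe1 followed by zero bytes, which is exactly the reduction x¹²⁸ = x⁷ + x² + x + 1.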
Even though Rijndael, similar to CRC-32, operates on polynomials that have 32-bit representations, there does not appear to be any obvious benefit in attempting to utilize the CRC unit to implement either Rijndael or GCM. Even if the CRC unit were incredibly fast at computing the remainder in 32-bit polynomial divisions, it is not at all clear how that would help with polynomial multiplication in any of the relevant Galois fields used by Rijndael and GCM.
Experiment

What am I missing then? Why does CMOX implement AES-GCM using the CRC unit? Does the hardware peripheral have some undocumented feature? Is there some way of accelerating Rijndael or GCM using a CRC unit that I missed?
Because, as noted above, ST has not made the CMOX source code available to the public, I can only rely on the precompiled library to find an answer.
Therefore, I decided to use the CMOX library to encrypt some data on an ARM Cortex-M4 processor manufactured by ST and to see what the CRC unit actually does. It must be doing something because, when the CRC peripheral's clock is disabled, the AES ciphertexts produced by CMOX are incorrect.
Note
This example simply encrypts one block using AES. The assertion in the last line succeeds when the CRC unit is enabled and configured correctly, but fails when the CRC unit has been disabled.
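/* Needs <assert.h>, <string.h>, and the CMOX declarations ("cmox_crypto.h"). */
unsigned char key[16];
memset(key, 0xac, sizeof(key)); /* 128-bit key: 16 bytes of 0xac */
unsigned char block[16];
memset(block, 0x9e, sizeof(block)); /* one plaintext block: 16 bytes of 0x9e */
unsigned char out[sizeof(block)];
size_t outlen;
/* Single-call AES-128-ECB encryption; ECB takes no IV, hence NULL and 0. */
cmox_cipher_retval_t ret =
    cmox_cipher_encrypt(CMOX_AESFAST_ECB_ENC_ALGO, block, sizeof(block), key,
                        CMOX_CIPHER_128_BIT_KEY, NULL, 0, out, &outlen);
assert(ret == CMOX_CIPHER_SUCCESS && outlen == sizeof(block));
/* Reference ciphertext for this key and plaintext block. */
const unsigned char expected[16] = {0x47, 0xce, 0x31, 0x3c, 0x1d, 0x19, 0x87, 0x20,
                                    0xda, 0xeb, 0x9a, 0x2b, 0xa1, 0xa1, 0xc6, 0x35};
assert(memcmp(out, expected, sizeof(expected)) == 0);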


On an STM32F4 processor, the CRC unit is mapped onto the sysbus at address 0x40023000. By watching for load and store operations at this address, we can determine what instruction sequences are responsible for interacting with the CRC unit.
When running a test program that uses CMOX to encrypt some data using AES (such as the example above), we can quickly identify multiple CRC peripheral accesses during a single AES operation. Each such access consists of writing to the CRC unit, reading the result from the CRC unit, and finally resetting the CRC unit to its initial state. By tracing the program counter (PC) at the exact time of those CRC accesses, we find the single function responsible for these interactions with the CRC peripheral.
Tip
This is also easily done through Renode, without actual hardware or even attaching a debugger:
sysbus SetHookBeforePeripheralRead sysbus.crc "print(sysbus.GetCurrentCPU().PC)"


Findings

The cmox_cipherMode_setKey function is an undocumented, internal function. It does not appear in the CMOX header files, nor is its source code available, which is why this analysis is based on the disassembled library only.
When called, it first appears to mainly copy the symmetric key given by the user to a second data structure. However, the function also uses the hardware CRC unit to compute the CRC-32/MPEG-2 checksums of some 32-bit constants. The results are then combined with certain bytes of the user's symmetric key, thus potentially changing it.
 
The byte array key is modified as follows:

key[c0] += (CRC32-MPEG2(0x910e0ba4) & 0xff) ^ 0x0b
key[c1] += (CRC32-MPEG2(0xf78e2254) & 0xff) ^ 0x52
key[c2] += (CRC32-MPEG2(0x2e8f137d) & 0xff) ^ 0x85

The constant indices c0, c1, and c2 depend on the key size:

For 128-bit keys: c0 = 0x00, c1 = 0x0e, c2 = 0x0f.
For 192-bit keys: c0 = 0x0f, c1 = 0x00, c2 = 0x0e.
For 256-bit keys: c0 = 0x1f, c1 = 0x10, c2 = 0x11.

 
Importantly, if the CRC unit has been set up correctly and is functioning properly, the XOR operations shown above all result in zero, thus not changing the key at all. This is because:
  0x0b ^ (CRC32-MPEG2(0x910e0ba4) & 0xff)
= 0x0b ^ (0x5a05e40b & 0xff)
= 0x0b ^ 0x0b
= 0

  0x52 ^ (CRC32-MPEG2(0xf78e2254) & 0xff)
= 0x52 ^ (0xbc0c5c52 & 0xff)
= 0x52 ^ 0x52
= 0

  0x85 ^ (CRC32-MPEG2(0x2e8f137d) & 0xff)
= 0x85 ^ (0x40b35885 & 0xff)
= 0x85 ^ 0x85
= 0
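The adjustment and its cancellation can be modeled on a host machine. The sketch below is only a model of the behavior described above, not code recovered from the library. The function name is hypothetical, the three CRC results are passed in so that a working and a non-working CRC unit can both be simulated, and the addition is assumed to wrap modulo 256 because the key is a byte array.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the key adjustment: crc[i] stands for whatever the CRC unit
 * returned for the i-th constant. Returns 0 on success and -1 for an
 * unsupported key length. Assumption: byte additions wrap modulo 256. */
int cmox_model_adjust_key(uint8_t *key, size_t key_len, const uint32_t crc[3])
{
    static const uint8_t mask[3] = {0x0b, 0x52, 0x85};
    static const uint8_t index128[3] = {0x00, 0x0e, 0x0f};
    static const uint8_t index192[3] = {0x0f, 0x00, 0x0e};
    static const uint8_t index256[3] = {0x1f, 0x10, 0x11};
    const uint8_t *index;

    switch (key_len) {
    case 16: index = index128; break;
    case 24: index = index192; break;
    case 32: index = index256; break;
    default: return -1;
    }
    for (int i = 0; i < 3; i++) {
        /* Zero when the low byte of the CRC result equals the mask. */
        const uint32_t delta = (crc[i] & 0xffu) ^ mask[i];
        key[index[i]] = (uint8_t)(key[index[i]] + delta);
    }
    return 0;
}
```

With the CRC results 0x5a05e40b, 0xbc0c5c52, and 0x40b35885 from the calculation above, every delta is zero and the key is left untouched. With any other low bytes, up to three key bytes change.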

If, however, the code were to be executed on an ARM Cortex-M processor that did not have a CRC unit at the expected sysbus address, each of the CRC32-MPEG2 computations would yield incorrect results, thus changing up to three bytes of the symmetric key. Similar behavior occurs if the CRC unit's clock is disabled, or when it is misconfigured. The changes to the key then, of course, also affect the ciphertexts produced by the Rijndael cipher.
If the CRC unit's hardware clock is disabled on an STM32L4 processor, it appears to always yield a value of 0x00000000 as the result of the CRC computation. And indeed, when adjusting the three bytes at offsets c0, c1, and c2 of the symmetric key accordingly, CMOX produces the expected results for the AES-GCM algorithm even when the CRC unit is disabled. In other words, the CRC unit does not seem to serve any useful purpose during the AES-GCM encryption and decryption operations.
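Under the observation that a disabled CRC unit returns 0x00000000, the key adjustment reduces to adding the three constants 0x0b, 0x52, and 0x85 at the offsets c0, c1, and c2. The following sketch, with hypothetical function names and assuming wraparound modulo 256, derives both the key that such a device effectively uses and a pre-compensated key that cancels the adjustment.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds (sign > 0) or subtracts (sign < 0) the three constants at the key
 * offsets c0, c1, c2, modulo 256. Returns 0 on success, -1 on bad length. */
static int shift_key_bytes(uint8_t *key, size_t key_len, int sign)
{
    static const uint8_t mask[3] = {0x0b, 0x52, 0x85};
    static const uint8_t index128[3] = {0x00, 0x0e, 0x0f};
    static const uint8_t index192[3] = {0x0f, 0x00, 0x0e};
    static const uint8_t index256[3] = {0x1f, 0x10, 0x11};
    const uint8_t *index;

    switch (key_len) {
    case 16: index = index128; break;
    case 24: index = index192; break;
    case 32: index = index256; break;
    default: return -1;
    }
    for (int i = 0; i < 3; i++) {
        const unsigned delta = sign > 0 ? mask[i] : 256u - mask[i];
        key[index[i]] = (uint8_t)(key[index[i]] + delta);
    }
    return 0;
}

/* Key that a device with a zero-returning CRC unit effectively uses. A peer
 * running a standard AES-GCM implementation would need this key. */
int effective_key_with_zero_crc(uint8_t *key, size_t key_len)
{
    return shift_key_bytes(key, key_len, +1);
}

/* Key to pass to the library so that, after the adjustment with a
 * zero-returning CRC unit, it ends up using the intended key. */
int precompensated_key_for_zero_crc(uint8_t *key, size_t key_len)
{
    return shift_key_bytes(key, key_len, -1);
}
```

The two functions are inverses of each other, so applying one after the other restores the original key.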
Note
These experiments were conducted using version 4.1.0 of X-CUBE-CRYPTOLIB on multiple ARM Cortex-M4 processors from the STM32F4 and STM32L4 series.

Conclusion

As far as I can tell, at least for the AES-GCM algorithm, ST's claim that the CRC unit is being used for "some internal computing" is true, but this use serves no purpose other than to bind the CMOX library to ST processors, assuming non-ST processors do not provide a compatible CRC unit at the same sysbus address. The CRC unit does not appear to improve the performance or security of the implementation itself.
It seems unfortunate that CMOX does not report any errors (other than incorrect results) when the CRC unit is not set up correctly. This artificial limitation also restricts how users can utilize the CRC unit in their own applications, since the peripheral must be enabled and reset to its default configuration for various CMOX operations.
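Because the library stays silent, an application can guard against this failure mode itself with a known-answer test at startup. The sketch below is one possible shape for such a check, reusing the key, block, and ciphertext from the experiment above. The callback type and all function names are hypothetical. On a target, the callback would wrap the CMOX encryption call, and the stand-in here only mimics a device that reports success while producing wrong output.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback: encrypts one block with AES-128-ECB and returns 0
 * on success. On a target, this would wrap the library's encryption call. */
typedef int (*aes128_ecb_encrypt_fn)(const uint8_t key[16],
                                     const uint8_t in[16], uint8_t out[16]);

/* Returns 1 if the implementation reproduces the reference ciphertext for
 * the key (16 x 0xac) and block (16 x 0x9e) used earlier, 0 otherwise. */
int aes128_known_answer_test(aes128_ecb_encrypt_fn encrypt)
{
    static const uint8_t expected[16] = {
        0x47, 0xce, 0x31, 0x3c, 0x1d, 0x19, 0x87, 0x20,
        0xda, 0xeb, 0x9a, 0x2b, 0xa1, 0xa1, 0xc6, 0x35
    };
    uint8_t key[16], block[16], out[16] = {0};

    memset(key, 0xac, sizeof(key));
    memset(block, 0x9e, sizeof(block));
    if (encrypt == NULL || encrypt(key, block, out) != 0) {
        return 0;
    }
    return memcmp(out, expected, sizeof(expected)) == 0;
}

/* Stand-in for a misconfigured device: reports success, output is wrong. */
int silently_wrong_encrypt(const uint8_t key[16], const uint8_t in[16],
                           uint8_t out[16])
{
    (void)key;
    memcpy(out, in, 16); /* not a real cipher, only a test double */
    return 0;
}
```

Calling `aes128_known_answer_test(silently_wrong_encrypt)` returns 0 even though the stand-in reported success, which is the situation the return codes alone cannot reveal.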
I originally began investigating this issue because Cubicrypt had been deployed for a mission, but the CRC unit's hardware clock had unfortunately not been enabled on the STM32L4 processor running Cubicrypt on top of CMOX. This had not been noticed during last-minute tests because CMOX (and thus Cubicrypt) was not reporting any errors, but, as the team would learn later, the satellite would be unable to correctly decrypt any received messages.
On a positive note, the findings described above allowed the team to adapt the ground station software to account for the modified AES-GCM keys that the satellite was using as a result of CMOX's unnecessary dependence on the CRC unit. They also allowed us to verify that all assumed security properties would still hold: aside from the symmetric key itself, the actual AES-GCM algorithm implementation did not, in fact, depend on the CRC unit.