@zingaburga
Last active April 10, 2024 06:21
ARM’s Scalable Vector Extensions: A Critical Look at SVE2 For Integer Workloads

Scalable Vector Extensions (SVE) is ARM’s latest SIMD extension to their instruction set, which was announced back in 2016. A follow-up SVE2 extension was announced in 2019, designed to incorporate all functionality from ARM’s current primary SIMD extension, NEON (aka ASIMD).

Despite being announced 5 years ago, there is currently no generally available CPU which supports any form of SVE (which excludes the Fugaku supercomputer as well as in-preview platforms like Graviton3). This is set to change with ARM having announced ARMv9 which has SVE2 as the base SIMD instruction set, and ARM announcing support for it on their next generation cores across their entire server (Neoverse) and client (Cortex A/X) line up, expected to become generally available early 2022.

Cashing in on the ‘scalable’ moniker that drives cloud computing sales, SVE, particularly SVE2, is expected to be the future of ARM SIMD, and it’s set to gain more interest as it becomes more available. In particular, SVE addresses an issue commonly faced by SIMD developers - maintaining multiple fixed-width vector code paths - by introducing the notion of a variable (or ‘scalable’) length vector. Being SVE’s standout feature, this enables the same SVE code to run on processors supporting different hardware SIMD capabilities without the need to write, compile and maintain multiple fixed-width code paths.

ARM provides plenty of info summarizing SVE/2 and its various benefits, so for brevity’s sake, I won’t repeat the same content, and will instead focus more on the quirks and aspects that aren’t being touted that I’ve come across, mixed in with some of my thoughts and opinions. Note that I only ever touch integer SIMD and I mostly use C intrinsics, referred to as ACLE (ARM C Language Extensions). To avoid making this article too long, I’ll focus primarily on what I’m more familiar with, which means I won’t be touching much on floating point, assembly or compiler auto-vectorization (despite how important these aspects are to SIMD). This also means I’ll be deliberately skipping over a favourite amongst SIMD examples.

SIMD = SAXPY?

Detailed information on SVE outside of core documentation is relatively scarce at the moment, as it’s largely unavailable in hardware and not yet widely coded for. As such, along with not having access to any SVE hardware, I could be wrong on a number of factors, so corrections welcome.

Note: this document was written in late 2021. ARM has since added changes to SVE which are not fully addressed in this article.

So another instruction set I have to care about?

Since SVE2 was announced before ARM had announced any SVE-supporting processors, it would seem like there’s not much point in restricting yourself to just what SVE supports when writing code. And indeed, all but one of the announced cores which support SVE also support SVE2, which means you could use SVE2 as a baseline, greatly simplifying development (not to mention that ARMv9 associates with SVE2 over SVE).

The unfortunate exception here is the Neoverse V1. It’s also the only announced core (ignoring A64FX) which supports a width greater than 128 bits, so likely the only core that initially will benefit greatly from SVE (compared to NEON code). It’s unknown whether V1 or N2 will gain wider adoption in their target server market, so it’s possible that you may want to ignore specifically targeting the V1 to avoid writing separate SVE and SVE2 code paths (taking note that AWS’ Graviton3 processor is expected to be V1). As SVE2 is meant to be the more complete SIMD instruction set, and potentially a NEON replacement, I’ll largely assume an SVE2 base for the rest of this document.

On the fortunate side, there’s a good chance that your code either won’t work without SVE2, or it works with only SVE and SVE2 doesn’t provide much of a benefit, so you may not even have to worry about the distinction.

The point about V1 being the only 256-bit SVE implementation leads to the next point…

SVE2 - scaling from 128-bit to 128-bit

All currently announced SVE2 cores only support 128-bit vectors, which is also the minimum width an SVE vector can be. This means that you probably won’t be seeing any massive gains initially from adopting SVE2 over NEON (unless the new instructions help a lot with your problem, or enable vectorization not possible under NEON). Regardless, the point of SVE is making the adoption of wider vectors easier, so gains may eventuate in a later core, though lack of any initial gains could fail to encourage developers to port over their NEON code at first.

However, I think that we may be stuck with 128-bit for quite some time in the consumer space, whilst the next ‘large’ server core (Neoverse V2?) will probably get 256-bit SVE2 (if not wider). This is due to the popularity of heterogeneous core setups (aka big.LITTLE) among consumer SKUs and SVE requiring that all cores support the same vector width. This means that CPUs will be limited to the narrowest SVE implementation amongst all cores, with the little cores likely being the limiting factor. The widest upcoming consumer core, the Cortex X2, is limited to 128-bit SVE2, possibly due to the presence of the Cortex A510 and/or A710.

Update: the Neoverse V2 reverts to a 128-bit vector width. It's unknown what width future ARM server cores will adopt.

Unfortunately, ARM historically hasn’t updated their little cores as frequently as their larger cores, and with the Cortex A510 supporting a shared SIMD unit between two cores (something its predecessor doesn’t), widening SIMD doesn’t appear to be a priority there. (one possibility that could eventuate is if the little cores declare support for a wider vector width but retain narrow SIMD units - this would enable the use of wider SVE on the big cores)

Servers however currently don’t have a heterogeneous core setup, so being limited by a particular core type isn’t an issue yet. With that in mind, presumably, SVE isn’t about enabling 256-bit SIMD on mobile, but more about enabling the same ISA to be used in a number of different environments.

However, whilst ARM may not enable >128-bit vectors on consumer cores in the near future, it’s unknown what custom ARM cores from Apple and Qualcomm/Nuvia may bring. On the Apple side though, considering their recently announced A15 chip doesn’t support ARMv9/SVE2, which the M2 is expected to be a derivative of, and the introduction of their own AMX instruction set (in lieu of SME), it raises doubts over their interest in supporting SVE in the near future. As for Qualcomm, the first Nuvia showing is expected in 2023.

Of course, it’ll take time for developers to adopt SVE2, so it’s preferable to have a narrow implementation available, and roll out wider variants in the future.

Extensions: When plain SVE2 isn’t enough

ISA extensions

To add a tinge of complexity, SVE and SVE2 already come with optional extensions, most of which only add a few instructions. However, SVE2’s base instruction set seems to be fairly comprehensive, so there’s a good chance you won’t have any use for these relatively application-specific extensions for many general problems.

One curiosity I have is the SHA3 extension, which only adds a single instruction, RAX1 (rotate by 1 and Xor). This is likely a holdover from the NEON SHA3 extension, which added four instructions, but transitioning to SVE, three of them have been moved to the base SVE2 set. One of the three is XAR (Xor and rotate), which now works on all element sizes. With that being present, one wonders why RAX1 couldn’t have just become as generic as XAR, and why it has to be shoved behind an extension (one can only assume it’s much simpler to Xor before rotating than go the other way around).

In terms of actual hardware support, it’s unknown what extensions CPUs will ultimately support, but from the optimization guides for Neoverse N2/V1 and Cortex A510/A710/X2, they all appear to support BFloat16 and INT8 matrix multiply, but lack the FP32/FP64 matrix multiply extension. Excluding the V1, they all also appear to support all listed SVE2 extensions (BitPerm, AES-128, SHA-3 and SM4), though BitPerm appears to be micro-coded (i.e. very slow) on the A510. Although the N2/A710/X2 listing does look like a copy/paste job from ASIMD, I presume the correct rows were copied (whilst the A510 listing seems to be somewhat confused).

SVE2 extension ops from N2 optimization manual

SME: Moving into the next dimension

Before we get any SVE2 hardware to play with, ARM keeps things exciting, having already defined their next major extension: Scalable Matrix Extensions. SME is largely considered to be a superset of SVE2, but not entirely, and is denoted as a distinct feature from SVE/2.

Time to add a dimension

SME adds a ‘streaming SVE mode’ which permits the processor to have an alternative vector size. My guess is this effectively lets the processor have a latency optimised (regular) mode and a throughput optimised (streaming) mode.

From what I can tell, the following SVE2 instructions aren’t available in SME:

  • ADR (compute vector address)
  • COMPACT, HISTCNT/HISTSEG, MATCH/NMATCH
  • FADDA, FEXPA and floating point trigonometric instructions
  • most (all?) load, store and prefetch instructions
  • FFR instructions
  • anything added by an SVE extension (except some from FP64MatMul - see below)

Despite its aim of accelerating matrix multiplication, SME adds a few possibly useful SVE2 instructions, which are:

  • DUP (predicate)
  • REVD (rotate 128-bit elements by 64)
  • SCLAMP/UCLAMP (clamp values, i.e. combined min+max)
  • Quadword (128-bit) TRN/UZP/ZIP (also available in FP64MatMul SVE extension)

AArch64 only

SVE is defined for AArch64 only, so you don’t have to worry about supporting AArch32 + SVE (not to mention that AArch32 is being dropped on most upcoming ARM processors).

Wearing a Mask is Mandatory

Other than offering variable length vectors, the biggest feature SVE introduces is embedded masking, aka lane-wise predication. And to ensure that knowledge of this feature spreads around, ARM makes sure that you can’t code SVE without a mask.

Consider how to add two vectors in C intrinsics for AVX512, NEON and SVE:

_mm512_add_epi32(a, b)       // AVX512
vaddq_u32(a, b)              // NEON
svadd_x(svptrue_b32(), a, b) // SVE
// ( svptrue_bX() sets all predicate lanes to be active )

The thing to take note of is the required svptrue_b32() part in the SVE code - this is a mask (or predicate) where all values are set to true (if any element in the mask were to be false, it would ‘deactivate’ the corresponding lane during the operation - the all-true mask is required when no masking is desired).

If you write a lot of code which doesn’t use predication (as I do), ARM makes you atone for such sins by requiring a svptrue_bX() for every violation (or if you’re a rebel, macro the thing away).

As for assembly, ADD actually has an unpredicated form due to the predicated version being destructive, but this isn’t available to all instructions, so you may want to keep an ‘all-true’ predicate register handy.

For reference, AVX512 makes predication optional for both intrinsics and assembly. I feel it would’ve made sense to do the same for SVE, just like they do for instructions which don’t support predication...

Masks not suitable for snouts

Despite being a major feature, there’s quite a number of instructions which don’t support predication. Anything which takes 3 vector inputs, or involves swizzling data, seems to commonly lack support for predication, in addition to instructions such as PMULL or EORBT. Also, there are some instructions that support a choice of predication or being destructive (possibly due to a limitation of coding space), although this is abstracted away in ACLE.

Many instructions which do support masking only support one type (merging or zeroing). ACLE allows zeroing on many merge-only instructions by merging into a zeroed vector, but because of this, zeroing predication may, in some cases, be less efficient than merging predication (despite what one might assume for a modern OoO processor).

This ACLE emulation of predication type doesn’t exist for a small number of cases. Most (but not all) instructions returning a predicate only support zeroing (e.g. NMATCH) whilst others only support merging (e.g. pairwise reductions).
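
For illustration, the two predication modes look like this in ACLE (a minimal sketch; pg is assumed to be an svbool_t, a and b two svuint32_t values):

svadd_u32_m(pg, a, b) // merging: inactive lanes keep their value from a
svadd_u32_z(pg, a, b) // zeroing: inactive lanes are set to zero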

Should I care if you “don’t care”?

In addition to zeroing and merge predication, ACLE has a “don’t care” _x predication variant on most instructions. As there is no equivalent in assembly, this is purely an ACLE feature.

It’s curious as to what purpose this serves, as it could be achieved by simply ignoring predication altogether. As such, why not drop the predicate argument, since it ultimately doesn’t matter? The few instructions where you may actually want predication, but don’t care about masked out lanes (e.g. loads/stores, COMPACT etc), don’t support “don’t care” predication anyway. On the other hand, instructions like ADDP, which only support merge masking, still have a “don’t care” variant despite the compiler having no choice in masking mode.

Perhaps it’s for instructions which require predication, but where you may already have an appropriate predicate register handy. If there were only unpredicated intrinsics, the compiler may be forced to keep an all-true predicate handy even when it’s not strictly necessary. It may also be to make it easier for programmers to not have to remember what’s supported, when they ultimately don’t care.

Keep that mask straight: no shifting around

Predicate operations seem limited to in-place bitwise logic (AND, ORR etc), NEON inherited swizzles (ZIP, TRN etc), testing and some count/propagation operations (CNTP, BRKB).

Unfortunately, operations like bitwise shift are missing, which may force you to reinterpret the predicate as a vector, operate on that, then move back to a predicate.

Shifting out all but one bit may be simulated via BRKN; for example, mask >> (svcntb()-1) (a logical right shift) can be implemented with svbrkn_z(svptrue_b8(), mask, svptrue_pat_b8(SV_VL1)).
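
If you need a more general lane shift, one sketch (assuming a byte-granularity predicate; the helper name is mine) is to expand the predicate into a 0/1 byte vector, shift the lanes there, then compare back:

// shift the mask up by one lane (i.e. mask << 1 in PMOVMSKB terms)
svbool_t pred_shl1(svbool_t mask) {
	svuint8_t v = svdup_n_u8_z(mask, 1);     // 1 where active, 0 elsewhere
	v = svinsr_n_u8(v, 0);                   // shift elements up one place, insert 0 at the bottom
	return svcmpne_n_u8(svptrue_b8(), v, 0); // convert back to a predicate
}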

Morphing masks into integers/vectors

SVE provides a number of instructions to manipulate predicates, but if you’re doing something special, you may want to apply integer/vector operations to them. Unfortunately, there’s no instruction to move a predicate to a different register set - the only solution is to go through memory.

SVE provides LDR/STR instructions for spilling predicates onto the stack, however you don’t have access to these from ACLE - the manual of which mentions:

No direct support, although the compiler can use these instructions to [refill/spill] registers [from/to] the stack.

Thus you’ll have to coerce the compiler into doing what you want (unions don’t work, neither does typecasting, however you can reinterpret a pointer).

int f(svbool_t mask) {
	uint8_t maskBytes[256/8]; // max predicate width is 2048/8 = 256 bits
	//union { svbool_t p; uint8_t b[256/8]; } maskBytes; -- illegal
	*(svbool_t*)maskBytes = mask; // pointer casting seems to be the only option
	//memcpy(maskBytes, &mask, svcntd()); // very poorly optimised on current compilers
	
	// do something unique with the predicate
	int value = 0;
	for(int i=0; i<svcntd(); i++) {
		value += lookup_table[maskBytes[i]];
	}
	return value;
}

Finally a PMOVMSKB equivalent?

NEON famously lacks an equivalent to x86’s handy PMOVMSKB instruction, often requiring several instructions to emulate. Now that SVE has predication, this becomes much more straightforward. Unfortunately, as there’s no direct predicate to integer instruction, this will need to go through memory (as described above), and you’ll need to at least consider cases where the predicate could be > 64 bits.
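
A sketch of what this could look like (my own helper, assuming a little-endian system and a byte-granularity predicate; vectors wider than 512 bits would need the remaining words too):

uint64_t movemask_u8(svbool_t mask) {
	uint64_t bits[4] = {0};   // 32 bytes covers the largest (2048-bit) predicate
	*(svbool_t*)bits = mask;  // spill the predicate through memory, as shown earlier
	return bits[0];           // one bit per byte lane, for the first 64 lanes
}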

ACLE Tweaks Over NEON

Defined tuple handling functions

I don’t recall the NEON intrinsics documenting how to construct vector tuples (e.g. uint8x16x2_t type) or how to access elements within (all compilers seem to have adopted the tuple.val[] access method). SVE has documented svcreateX, svgetX and svsetX to handle this.
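
For example (a minimal sketch; a, b and c are assumed svuint8_t values):

svuint8x2_t pair = svcreate2_u8(a, b); // construct a two-vector tuple
svuint8_t lo = svget2_u8(pair, 0);     // read the first vector back out
pair = svset2_u8(pair, 1, c);          // replace the second vector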

Uninitialized vectors

Another thing lacking in NEON intrinsics is the ability to specify uninitialized vectors, which is now available in SVE. There doesn’t seem to be a way to explicitly specify uninitialized predicates, though I feel that they’re less useful than uninitialized vectors.
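
For example:

svuint32_t v = svundef_u32(); // explicitly uninitialized vector - contents are unspecified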

Optional type specifier

NEON required vectors to be typed, as well as the intrinsic functions. SVE makes the latter optional via function overloading.

svuint32_t a, b;
svadd_u32_x(svptrue_b32(), a, b); // the `_u32` part can be omitted because we know a and b are u32
svadd_x(svptrue_b32(), a, b);     // ...like this

_n_ forms for constants

ACLE introduces _n_ forms to a number of intrinsics, which could cut down on a bit of boilerplate. Some instructions do support embedding immediate values, but many with _n_ forms don’t (and this doesn’t limit you to immediates), so essentially this saves you specifying svdup_TYPE(...) whenever you use a scalar value that you want broadcasted into a vector.

The functions are overloaded, so you don’t even need to specify the _n_ part, however this does require also dropping the type specifier:

svadd_u32_x(svptrue_b32(), a, svdup_u32(5)); // instead of writing this...
svadd_n_u32_x(svptrue_b32(), a, 5);          // ...you can now use this...
svadd_x(svptrue_b32(), a, 5);                // ...or even this
//svadd_u32_x(svptrue_b32(), a, 5);          // but not this

ACLE Limitations

SVE vectors cannot be embedded in containers (or composite types)

Current GCC (v11) and Clang (v13) don’t like dealing with arrays/unions/structs/classes containing SVE types, and refuse to do any pointer arithmetic (including indexing) on them.

This includes cases where no arithmetic would be necessary in optimised code, due to all vectors fitting within addressable registers, and can be an annoyance if you need to pass around multiple vectors (e.g. in some sort of ‘state’ variable) to functions. It can also be problematic if you’re trying to do some generic templating of a common vector type for different ISAs. You could pass around a pointer instead, but this will likely require specifying explicit loads/store intrinsics, and compilers may be less willing to optimise these out.

There are vector-tuple types (note: unrelated to std::tuple), but they are limited to 2, 3 or 4 vectors of the same element type. Otherwise, there unfortunately doesn’t seem to be any way to create a ‘packaged’ type containing vectors.

// invalid - cannot use a sizeless type in an array
svuint32_t tables[8];
init_tables(tables);

// correct way to do it
svuint32_t table0, table1, table2, table3, table4, table5, table6, table7;
init_tables(&table0, &table1, &table2, &table3, &table4, &table5, &table6, &table7);

// alternatively, save some typing by using tuples
svuint32x4_t table0123, table4567;
init_tables(&table0123, &table4567);

Here’s an example of a templated design where SVE may seem somewhat awkward:

// example processing functions for NEON/SVE
static inline void process_vector(uint8x16_t& data) {
	data = vaddq_u8(data, vdupq_n_u8(1));
}
static inline void process_vector(svuint8_t& data) {
	data = svadd_x(svptrue_b8(), data, 1);
}

// this concept doesn't work particularly well for SVE
template<typename V>
struct state {
	unsigned processed;
	V data;
};
template<typename V>
void process_state(struct state<V>* s) {
	s->processed += sizeof(V);
	process_vector(s->data);
}

// explicitly instantiate NEON version
template void process_state<uint8x16_t>(struct state<uint8x16_t>* s);
// or SSE, if compiling for x86
//template void process_state<__m128i>(struct state<__m128i>* s);

// perhaps you could do this for SVE...
template<>
struct state<svuint8_t> {
	unsigned processed;
	uint8_t data[256]; // size of largest possible vector
};
template<>
void process_state<svuint8_t>(struct state<svuint8_t>* s) {
	s->processed += svcntb();
	// the following requires an explicit load/store, even if this function is inlined and the state could be held in registers; the compiler will need to recognise the optimisation
	svuint8_t tmp = svld1_u8(svptrue_b8(), s->data);
	process_vector(tmp);
	svst1_u8(svptrue_b8(), s->data, tmp);
}

I personally don’t use SIMD abstraction libraries or layers, but I imagine not being able to embed a vector within a class/union may cause issues with some existing designs.

I don’t know enough about compilers to really understand the fundamental limitations here, but it’s possible that this is just something current compilers have yet to tackle. If so, it could mean a future compiler won’t require programmers to deal with this constraint.

No direct indexing of vector elements

I suppose the feature is less useful with variable length vectors, but SVE does guarantee that at least 128 bits exist in a vector, so it’d be nice if at least those could be directly accessed.

// NEON: direct indexing works
int get_third_element(int32x4_t v) {
	return v[2];
}
// SVE: requires explicit intrinsics
int get_third_element(svint32_t v) {
	return svlasta(svptrue_pat_b32(SV_VL2), v);
}

Loading Vector/Predicate Constants

Due to its unknown vector length, ARM doesn’t recommend loading vector/predicate constants, instead providing instructions like INDEX encouraging you to construct the appropriate vector at runtime.

However, since there’s a defined maximum length, it should theoretically be possible to define a 2048-bit vector, or 256-bit predicate and load as much as the hardware supports.

ACLE unfortunately doesn’t provide direct support for this idea (i.e. something similar to the svdup functions).
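
A sketch of the two approaches (the pattern contents here are hypothetical):

// runtime construction via INDEX: element i gets the value 0 + i*1
svuint32_t iota = svindex_u32(0, 1);

// or: define a maximum-width (2048-bit) constant in memory and load however much
// of it the hardware supports - there's no svdup-style helper for this
static const uint8_t pattern[256] = { 0, 1, 2, 3 /* ... */ };
svuint8_t c = svld1_u8(svptrue_b8(), pattern);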

No direct NEON ↔ SVE casting

Although NEON and SVE registers map onto the same space, ACLE provides no way to typecast a NEON vector to an SVE vector, or vice versa. I’m not sure how useful the feature may be, but perhaps it could be useful if you have some code which only needs 128-bit vectors, or it may help with gradually transitioning a NEON codebase. Under ACLE, NEON ↔ SVE value transfer must go through memory.
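
A sketch of that memory round-trip (my own helper, assuming both <arm_neon.h> and <arm_sve.h> are available):

// move a NEON vector into the low 128 bits of an SVE vector via the stack
svuint8_t neon_to_sve(uint8x16_t v) {
	uint8_t buf[16];
	vst1q_u8(buf, v);
	return svld1_u8(svptrue_pat_b8(SV_VL16), buf); // only the first 16 lanes are loaded (the rest are zero)
}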

Interestingly, full vector reduction instructions (such as UADDV) write to a vector register, but the ACLE intrinsics give back a scalar (i.e. they include an implicit vector → scalar move).

32 bits ought to be enough for any instruction

SVE maintains AArch64’s fixed-length 32-bit instructions. Whilst a fixed-length ISA greatly helps with designing the front-end of a CPU, trying to support up to 3x 32 registers along with 8 or 16 predicate registers per instruction, and maintaining compatibility with AArch64, can lead to some unexpected contortions in the ISA which may initially confuse developers.

one size fits all?

Destructive instructions aren’t scary

The assembly for various instructions that produce an output from three inputs (such as BSL) looks like it accepts four registers, but the first two must be identical. This basically means that the instruction only really accepts three registers, where the first input must be overwritten.

BSL z0, z1, z2, z3  # illegal - first two operands must be identical
BSL z1, z1, z2, z3  # okay

The manual mentions this naming convention was chosen to make it easier for readers, though this may initially surprise those writing assembly.

In addition, a number of SVE instructions provide the option of either allowing predication or being non-destructive, but not both.

SVE does provide a ‘workaround’ for destructive instructions - the MOVPRFX instruction, which must be placed directly before a destructive instruction (presumably to enable the decoder to implement macro-op fusion, which should be cheaper than relying on move-elimination). In my opinion, it’s a nice idea, allowing the microarchitecture to determine whether SVE should be treated as a fixed or variable length instruction set, whilst keeping defined instructions to 4 bytes. Requiring an extra 4 bytes just to work around the destructive nature of instructions does feel rather wasteful, but given the options SVE provides, I suspect this not to be too common of a problem to really matter.

In terms of actual implementation of MOVPRFX, the Neoverse V1/N2 and Cortex A510/A710/X2 optimization guides don’t list support for fusing MOVPRFX and give it a non-zero latency, so the optimisation may be unavailable on these initial SVE implementations.

It’s also interesting to note that it appears MOVPRFX can’t be paired with instructions which are ‘defined as destructive’ like SRI or TBX - it only works for instructions that are destructive as a result of limited encoding space, meaning that ‘defined as destructive’ instructions are likely to incur a move penalty if you need to retain the inputs (unless a future processor introduces zero-latency MOVs on vector registers).

One-handed predicate wielding

Despite there being 16 predicate registers, most instructions only designate 3 bits for the predicate, only allowing P0-7 to be used for predication. P8-15 are generally only accessible to instructions which directly work with predicates (such as predicate AND), or those that yield a predicate (such as compares).

Fixed Width Vector Workloads on Variable Width Vectors

SIMD programmers maintaining fixed-width or width-specific code may be concerned about the applicability of an unknown-vector-length design to their problem. Vector based designs, such as SVE, work well for problems like SAXPY, but if your problem involves fixed width units, or even requires different algorithms depending on the width, how feasible is an SVE implementation when you don’t know the vector width?

square peg in round hole

Some problems may be solvable with some rethinking, others may be frustrating to deal with, and others just may not work well at all (for example, the despacer problem may not scale well under SVE, at least without non-trivial redesign). In general, you may need to accept some level of inefficiency (in terms of performance and implementation complexity), but in the worst case, you may be stuck with only using the bottom 128 bits of the vector, or need to code specifically for each width you wish to support.

Processing in terms of 128-bit units

Given that SVE requires vector widths to be a multiple of 128 bits, it may be worth thinking about the problem in terms of 128-bit units, if possible. For example, when processing an image in terms of 8x8 pixel blocks, instead of trying to fit a block in a vector, consider processing multiple blocks, 128 bits at a time, across each ‘128-bit segment’.

This also fits in nicely with a number of SVE2 instructions, such as MATCH/NMATCH, using a ‘128-bit segment’ concept for their operation. Sadly, many instructions (such as gather, broadcast etc) don’t have the notion of a 128-bit element size, which would’ve helped, efficiency-wise, with adopting such a strategy.

Whilst there are instructions such as ZIP1.Q and co, these require SME or the FP64 matrix multiplication extension (which don’t appear to be available on upcoming CPUs, based on Neoverse/Cortex optimisation guides). Similarly, SME adds the potentially useful REVD instruction, but support for that will be non-existent for a while.

It’s interesting to note that it’s possible to broadcast an element within each 128-bit segment, but this comes in the form of a multiply instruction, and doesn’t work with 8-bit elements.

Update: ARM has added Hybrid Vector Length Agnostic instructions as part of the SVE2p1 extension, which includes a bunch of handy 128-bit element instructions.

Data transposition

A problem that crops up in a number of places (such as AoS ↔ SoA): you have N vectors, each containing N elements, and you wish to transpose this data (i.e. take the first element of these N vectors and put them in a single vector, the next element of these N vectors into another, and so on). Consider an example where we have N streams of 32-bit words - if we have a 128-bit vector length, we can fit up to 4 words from each stream in a vector (i.e. N=4). To process these 4 streams concurrently, we read in one vector from each of the 4 streams, then transpose these 4 vectors so that we have the first 32 bits of each data stream in one vector, the next 32 bits in another, and so on.

SIMD transposition

(note: the streams may be longer than shown and exist in non-consecutive memory locations, so LD4 cannot be used here)

If the vector length was 256-bit instead, we’d ideally be processing 8 streams at a time.

This raises an issue: reading N vectors requires knowing N in advance, as you need to allocate the appropriate number of registers. Traditional transposition algorithms (transpose halves, then transpose quarters etc) are width specific, and are designed for power-of-2 vector widths, which doesn’t translate well to variable length vectors.

The other option is to ditch the transposition idea and instead rely on gather, which would allow one implementation to work on all vector sizes. Whilst likely simpler to implement, performance may be a concern, particularly as the vectors get longer.

A compromise could be made:

  • implement 5 power-of-2 variants, using traditional transposition, and a gather variant for other sizes. This still requires you to write 6 implementations however
    • Considering limitations on number of available registers, you may not need an implementation for every power-of-2. Alternatively, you can choose to ignore the larger vector sizes (focusing primarily on 128, 256 and 512-bit) and only optimise for 1024/2048 bit when such a machine eventuates
  • use 64-bit element gathers (unfortunately an ideal 128-bit element gather doesn’t exist) and transpose down to the element size you desire. This is obviously more expensive than traditional transposition, but cheaper than a full gather, whilst enabling you to stick to one implementation
    • If you have separate 64-bit pointers, a 64-bit gather would be your only choice as a 32-bit gather may be unable to reference all sources
    • Since gather can only be done with 32 or 64-bit elements, if your element size is smaller (i.e. 8/16-bit), you’ll also need to take this approach over a pure gather

using gather-transpose

Data Permutation? Just Leave a Mess on the TBL

Trying to shuffle data around in a vector becomes a lot more difficult when you don’t know where the boundaries (length) lie. This will likely cause complications with algorithms which heavily depend on data swizzling/permutation.

length of a string

Despite offering just enough tools to get the job done (i.e. the TBL instruction), SVE2 does little to help on this front, and swizzling seems to be an aspect not well considered. Particularly egregious is that a number of SVE extensions (e.g. FP64MatMul, SME) add new swizzling instructions to cater for the narrow use case the extension is targeting, as clearly the ones included in the base SVE2 instruction set aren’t considered sufficient.

Copying leads to mediocrity at best

NEON had a number of useful swizzling instructions, which I think make it overall a fairly good ISA for swizzling. SVE2 inherits many of these, often with little change, and because of that, swizzling under SVE2 is much more problematic. To provide an example, let’s take a look at the EXT instruction, which fits this description:

NEON forgoes byte level shifting instructions (such as those in x86’s SSE2) in favour of the EXT instruction, which can be used for both left and right shifts, given appropriate inputs. Note that EXT naturally shifts a vector to the right, but left shift can be done by subtracting the desired shift amount from 16 (the byte width of NEON vectors).

SVE2 carries over this instruction with the same functionality, which makes right shifting just as trivial. However, left shifting now becomes more difficult as you don’t know the vector length in advance (so a 16-n approach doesn’t work), and the shift amount must be an immediate value (so a svcntb()-n solution isn’t possible).

To do a left byte shift, your options may be limited to:

  • write a version of this instruction for every possible vector width, and switch() on the vector width - this may be sensible if this is a critical instruction in a hot loop, where you may want to implement 16 copies of the loop and select the version based on vector width
  • write vectors out to memory, then load it back with an appropriate offset
  • reverse (REV) the vector, do a right byte shift (EXT), then reverse again
  • create a mask then use SPLICE
  • for shifting by 1, 2, 4 or 8 bytes, INSR is also an option

It probably would’ve made sense to include a left-shift version of EXT, or perhaps a byte shift which accepts a variable (non-immediate) shift amount.
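
As a sketch of the reverse/shift/reverse option above (the shift amount must still be an immediate, 4 bytes in this example):

// left-shift the whole vector by 4 bytes, independent of the vector length
svuint8_t shift_left_4(svuint8_t v) {
	svuint8_t r = svrev_u8(v);         // reverse element order
	r = svext_u8(r, svdup_n_u8(0), 4); // right byte shift by 4 (EXT), zero-filling
	return svrev_u8(r);                // reverse again => left shift by 4
}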

TBL and TBX with bytes: different paths to the same destination

Due to the limited number of useful swizzling instructions available, developers may be encouraged/forced into using the all-purpose TBL/TBX to get data into the correct place. The SVE manual explicitly mentions these instructions are not vector length agnostic, which can make them a little awkward to use, as well as a potential source of subtle bugs.

SVE’s TBL supports non 8-bit element sizes (in contrast with NEON), and for these, the instruction is fairly straightforward.

However, for 8-bit elements, the distinction between TBL and TBX blurs a lot. In theory, the two instructions do different things, but in canonical SVE code, you can’t rely on that being the case. If the vector width is 2048-bit, both instructions do exactly the same thing, so unless you have separate code paths for different vector lengths (or some other length checking/controlling mechanism), there is no point in ever using TBX with 8-bit elements. Similarly, you can’t rely on TBL‘s zeroing feature (e.g. by setting the bytes you want zeroed to 0xff in the index vector) as it may never be triggered with 2048-bit vectors.

Of course, I fully expect there to be software which doesn’t heed this, particularly since 2048-bit vector machines seem unlikely in the near future, leading to possible unexpected bugs on future machines.

Zeroing (and merging) via predication may have been a better approach, but why come up with some new SVE-esque way of doing things when you can just copy TBL and TBX from NEON and call it a day?

TBL2: two TBLs to bang your head against

(also known as the 2 source vector TBL from SVE2, but ACLE refers to it as ‘TBL2’, so I will as well)

As with (single vector) TBL and TBX, this instruction pretty much works the same way it does in NEON, but as with EXT, it’s also much more awkward to use than the NEON version. To reference elements from the second vector, you need to know the vector length, and then add that to the indices that refer to elements of the second vector. This means that you’ll almost always need a bunch of instructions to prepare the index vector for TBL2.
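
A sketch of that preparation (using 16-bit elements to sidestep the 8-bit issue discussed next; a, b, idx and the predicate pg marking lanes that should pull from b are assumed):

// indices referring to the second vector must be offset by the per-vector element count
svuint16_t sel = svadd_n_u16_m(pg, idx, (uint16_t)svcnth());
svuint16_t result = svtbl2_u16(svcreate2_u16(a, b), sel);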

Following on from the point in the previous section, using TBL2 with 8-bit elements is completely unusable unless you have vector length checks in place. For 2048-bit vectors, TBL2 can reference none of the elements of the second vector.

It probably would’ve been better if the top bit of each element was used to indicate whether it pulls from the first or second source vector. Alternatively, predication could be used to select the source vector (similar to the SEL instruction). Both of these are likely easier than having to do a count, broadcast and or-merge to reliably select the desired elements.

Are arbitrary sized permutes always wanted?

ARM notes that the permutation instructions TBL, TBL2 and TBX aren’t vector length agnostic. This is understandable, particularly if you want a full vector shuffle/permute, but one does wonder whether limiting the scope may have helped with problems that scale in terms of 128-bit units (such as VPSHUFB in x86, which restricts shuffling to within each 16-byte lane). On the other hand, full vector permutes are certainly a boon to those used to dealing with the hassles of 128-bit lanes on x86’s AVX.

However, there’s another concern with allowing full width permutes: ARM notes in a presentation that many of the widen/narrow instructions had their behaviour changed to avoid transferring data across long distances. The TBL family, however, by definition must support transferring data across the entire vector, even if you’re not actually using it for that purpose. This raises a concern about potential performance penalties for using the instruction.

SVE2 pairwise operations

x86’s AVX implementations usually have cross-lane penalties with instructions that cross 128-bit lane boundaries, suggesting that this isn’t just a problem with ARM designs.

However, looking at the Neoverse V1 optimization guide (and assuming accurate figures), there doesn’t appear to be any penalty with a 256-bit (SVE) TBL over a 128-bit (NEON) one. Similarly, other instructions which may need to move data across distances, such as REV, show no penalty.

Aside: it’s interesting to note the way the ALUs are arranged - the Neoverse V1 has 4x 128-bit NEON ALUs which can be arranged into 2x 256-bit for SVE. In such a case, it appears that port V2 is folded into V0 and V3 into V1 - in other words, 128-bit instructions which can be issued to ‘V02’ (i.e. ports V0 and V2) will turn into a 256-bit instruction issued only to V0 (since V2 and V3 are ‘unavailable’ under SVE). TBL in ASIMD is noted as V01, and is also V01 for SVE, suggesting that the Neoverse V1 actually has 2x 256-bit shuffle ports, which simply get restricted to 128-bit when executing NEON, as opposed to the more traditional 4x 128-bit → 2x 256-bit setup. Looking at the Cortex X2 optimisation guide (4x 128-bit SVE2), the ASIMD TBL operates the same, but SVE’s TBL now operates on all four 128-bit ports. This matches the stated Neoverse V1 behaviour, but it is weird for SVE’s TBL to have double the throughput of ASIMD’s TBL despite doing the same operation. Another interpretation is that at least some of the published figures for SVE’s TBL are incorrect.

The A64FX guide also doesn’t show a penalty for a 512-bit TBL over ASIMD, but it is noted that TBL always has a higher latency relative to other SIMD instructions on that platform.

It’s entirely possible that moving data across the entire vector just isn’t a problem at current SIMD widths, but could be problematic on wider vectors, or maybe just isn’t that much of a problem period. It’s also unknown whether TBL2 would have any penalty over TBL for width >=256-bit, but speculating from existing ARM architectures, perhaps not.

Update: ARM has now added instructions like TBLQ as part of their Hybrid Vector Length Agnostic instructions in the SVE2p1 extension

Let memory solve it?

SVE adds support for gather/scatter operations, which helps vectorize a number of problems that may otherwise require a lot of boilerplate shuffling. It also can help alleviate issues with lack of good in-register swizzling, since data can be shuffled to/from memory, however it’s not without its downsides:

  • limited to 32/64-bit element sizes (note that the 8/16-bit variants only support sign/zero extension to 32/64-bit)
  • poor performance due to being bottlenecked by the processor’s LSU (and likely being executed as multiple µops), as it’s effectively a scalar operation in terms of performance. In other words, performance decreases linearly as the vector width increases
  • still requires index generation like TBL

Throwing in the towel

Perhaps trying to swizzle data, when you don’t know the length of the vector, is all just too hard. Perhaps we can just limit the vector width instead and pretend to be using a fixed length vector ISA?

SVE does actually provide some support for operating with constrained vector sizes (such as a multiple of 3, or restricting it to a power of 2). Unfortunately, this is done primarily via predication (and support from some other instructions like INCB) and does not help with swizzling instructions (most of which don’t support predication to begin with).

There does, however, seem to be a mechanism for actually setting a vector length, but I don’t think it’s available at EL0 (user-space), so an interface must be provided by the OS (prctl(PR_SVE_SET_VL, ...) under Linux). Note that the contents of vector registers become unspecified after changing the width, and compilers are likely not expecting such changes to occur during execution, so it’s not something that can be changed on a whim. Also, you can only set a vector size smaller than, or the same as, what is supported on the hardware; you can’t just set an arbitrary sized vector, such as 512 bits, and expect the hardware to magically deal with it, if the hardware only supports 256-bit vectors. Due to the downsides and the design goal of this mechanism, it seems like this isn’t aimed at easing development, but rather to perhaps serve as a compatibility measure (e.g. allow the OS to run an application with a limited vector width if the application wasn’t designed with wider vectors in mind), or maybe for allowing live server migrations.
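
Under Linux, such a request might look like the following sketch (the length is given in bytes, can only be lowered, and is best done once, early in the process):

#include <sys/prctl.h>

// ask the kernel to cap this thread's SVE vector length at 16 bytes (128 bits)
if (prctl(PR_SVE_SET_VL, 16) < 0) {
	// SVE not supported, or the request was rejected
}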

However, since the vector length is expected to remain constant throughout execution, you can do checks on the vector length and either have multiple code paths, and/or outright reject supporting some configurations.

Alternatively, since the register files are shared, you can use NEON instructions if you don’t need more than 128 bits, although SVE support doesn’t seem to guarantee NEON support. ACLE also doesn’t help as it doesn’t provide a mechanism to typecast between the two vector types.

The Next Fad: 1664-bit Registers

Update: ARM has now disallowed non power-of-two widths in an update (see C215), which addresses a number of concerns mentioned below

SVE allows any vector width that’s a multiple of 128-bit up to 2048 bits, with no further restriction. This means that a system with 1664-bit vectors would adhere to SVE specifications. I don’t know whether anyone will ever make a machine with such an unusual width, but it undoubtedly enables such flexibility on the hardware side. Regardless, the programmer still has to ensure their code works for such cases to be fully ‘SVE compliant’ (also keep in mind that width can be adjusted down by privileged software, so a system with 512-bit vectors could be configured to operate at 384-bit width). My guess is that ARM is banking heavily on the ‘scalable’ part working well to minimise the impact this has on the programmer, whilst simultaneously maximising hardware configurability.

Unfortunately, not only is SVE not fully width agnostic, this flexibility comes with a number of caveats, particularly when dealing with non-power-of-2 widths:

  • memory alignment: if you already do this, it’s probably not worth trying too hard with SVE. I guess you could choose to only bother if the vector width is a power of 2, or maybe try some alignment such that at least some fraction of loads/stores will be aligned (assuming desired alignment boundaries are always a power of 2). It’s possible that trying to cater for alignment isn’t so much of a concern with newer processors (as misalign penalty seems to be decreasing), though interestingly, the addition of memcpy/memset instructions in ARMv9.3 catering for alignment seems to suggest otherwise
  • power-of-2 bit hacks: multiplication/division by vector width is no longer a bit shift away, and if you have code which computes multiples/rounding or similar, such code may need to be revised if you previously assumed vectors would always be a power-of-2 in length (there’s an instruction to multiply by vector width though (RDVL); it may have been handy to have an instruction which does a modulo/divide by vector width without needing to invoke a slow division instruction or resort to runtime division hacks)
  • divide-and-conquer strategies: algorithms such as transposition or custom reductions may require more care as you can’t always repeatedly halve the vector width
  • common multiple: one cannot assume that all possible vector sizes will fit neatly into a statically-chosen smallish array (the lowest common multiple of all SVE vector sizes is 11,531,520 bytes). This can complicate some matters, such as static AoSoA setups
  • bottlenecks on a 128-bit vector machine are likely very different to that of a 2048-bit vector machine. As vectors increase in size, cache/memory bandwidth can become more of a bottleneck, but this may be difficult to predict without benchmarking on a wide vector machine

Currently announced/available SVE/2 widths are 128, 256 (Neoverse V1) and 512 bits (A64FX); it’s unknown if a 384-bit SVE implementation will ever eventuate (seems unlikely to me, but ARM felt it necessary to allow it).

Small/fast vs big/slow

big vs small

On the x86 side, the first AVX512 implementation had an interesting quirk: the frequency of the processor may throttle when executing ‘heavy 512-bit code’. As the vector width is programmer-selectable there, this allowed code to be either latency focused (keeping to shorter vectors) or throughput focused (using longer vectors).

SVE doesn’t readily allow the programmer to select the vector width, which means this choice isn’t available. This could limit hardware implementations to select widths that have a limited impact on frequency/latency. The Neoverse V1 optimisation guide doesn’t provide info on performance when the SVE width is restricted to 128-bit, so it’s yet to be seen if a processor would perform differently depending on the vector width set - particularly with instructions that may not scale so well with vector length (e.g. gather/scatter).

Interestingly, ARM introduces a ‘streaming mode’ as part of SME, which enables it to have a longer vector width relative to SVE, and which may be a solution to this.

I Must Upgrade NEON to SVE2

Apart from being variable length and supporting predication, many instructions in SVE2 mirror that of NEON, which makes understanding easy for anyone familiar with NEON, and helps with translating code.

ARM does highlight a key change made from NEON to SVE2 in how widening/narrowing instructions work:

SVE2 widen/narrow

Not highlighted, but another difference would be that all comparisons yield a predicate result instead of a vector; if the old behaviour is desired, it has to be constructed from the predicate.

Some NEON instructions also seem to be missing SVE2 equivalents:

  • ORN (bitwise or-not): replaceable with BSL2N at the cost of being destructive
  • CMTST (bit-test compare): unsure why this was removed; you’ll probably need to take the manual approach (AND followed by a compare against zero) with this one - see the sketch after this list
  • vector ↔ scalar moves, e.g. SMOV/UMOV and INS: for extracting elements from a vector, LASTA/LASTB could work.
  • XTN: UZP with a zero vector seems to do the same thing
  • 3-4 vector TBL has been removed, as well as multi-vector TBX; though inconvenient to have removed, those instructions typically performed poorly, and the implementation of 2-vector TBL in SVE2 feels somewhat iffy (at least with 8-bit elements).
  • multi-vector LD1 and ST1: probably not enough benefit over issuing the load/store instructions individually
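
A sketch of that manual CMTST replacement mentioned in the list above (result lanes are active where a and b share a set bit):

svbool_t cmtst_u32(svbool_t pg, svuint32_t a, svuint32_t b) {
	return svcmpne_n_u32(pg, svand_u32_x(pg, a, b), 0);
}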

ARM’s own migration guide can be found here.

Translation Tools

Some developers may be familiar with libraries like sse2neon, which can help transition code written in one ISA to another. At the moment, I don’t know of any tools which can translate intrinsics into SVE/2.

Being variable width, translating a fixed width vector ISA into SVE/2 will likely be more challenging than what a simple header library can achieve.

However, translating the other way around may eventually be possible with something like SIMDe.

Aside: Comparison with AVX512

I may as well make a comment about x86’s closest, very well received competing SIMD extension. The overall comparison between NEON ↔ SSE largely remains, with the ARM side feeling more orthogonal in general (with fewer odd gaps in the instruction set), but having fewer “Do Tons of Stuff Really Fast” style instructions than x86 does.

The SVE2 ↔ AVX512 comparison closes the gap somewhat, however. AVX512 plugs many of the holes that SSE had, whilst SVE2 adds more complex operations (such as histogramming and bit permutation), and even introduces new ‘gaps’ (such as 32/64-bit element only COMPACT, no general vector byte left-shift, non-universal predication etc).

SVE’s handling of predication feels more useful than AVX512’s implementation - whilst support for it isn’t as extensive, the WHILELT instruction and FFR are a great use of the feature, allowing it to be easily used for trailing element handling and even speculative loading (useful for null terminated strings, for example).
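
As an illustration of the WHILELT pattern (a minimal sketch of a trailing-element-safe loop):

// add 1 to every element of an array; the final partial vector is handled by the predicate
void add_one(uint32_t *data, int64_t n) {
	for (int64_t i = 0; i < n; i += svcntw()) {
		svbool_t pg = svwhilelt_b32(i, n);  // lanes at or beyond n are inactive
		svuint32_t v = svld1_u32(pg, data + i);
		svst1_u32(pg, data + i, svadd_n_u32_x(pg, v, 1));
	}
}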

AVX512 may seem more broken up, due to all its extensions, but x86 has the benefit of there being few implementations and low variability, making this perceived fragmentation mostly illusory in practice (currently, Skylake-X and Ice Lake are the major compatibility targets; though the programmer must be aware of this). It’s unknown how SVE/2 will fare in that regard, but having a solid base instruction set makes it less likely an issue.

As for the elephant in the room, the key difference between the two is that AVX512 allows the programmer to specify the vector width (between 128/256/512 bit) but requires hardware to support all such widths, whereas SVE2 allows the hardware to dictate the width (128-2048 bit) at the expense of the programmer having less control over such.

In terms of hardware support, it’s expected all future ARM designed CPUs will support SVE2 across all A-profile products, albeit many possibly limited to 128-bit vectors, whilst AVX512 is somewhat of an unknown with Intel being uncertain about support for it in their latest client processors. AVX512 does currently have the benefit of actually being available to consumers for a few years though.

Development Resources

Documentation/Information

Being new, there isn’t a lot of info out there about SVE at the moment. ARM provides some documentation, including an instruction manual, as well as an ACLE PDF; the descriptions in the latter are rather spartan, so you’ll often need to refer back to the assembly manual for details.

SVE/2 has recently been added to ARM’s Intrinsics page.

The SVE documentation in Linux is worth checking out for ABI details, if Linux is your target platform.

Compilers

The latest GCC and Clang/LLVM versions have support for SVE2 via intrinsics. I haven’t tested armclang (which I presume wouldn’t have any issues), but have tested GCC 10 and Clang 12 (I’ve found issues in Clang 11 that seem to be fixed in 12, so I suggest the latter). I have not tested auto-vectorization, neither have I looked at assembler support, but recent-ish GNU Binutils’ objdump handles it fine.

I don’t know about support in non C/C++ compilers, but I’d imagine it to be fairly minimal at best, at present.

Testing Code

Without SVE hardware available, your only option is to use an emulator. Even if you have SVE hardware, you’ll probably still want to use an emulator - just because your code works on your SVE CPU doesn’t mean it’ll work on a CPU with a different vector width. I can see this latter point being a not-unlikely source of bugs; the additional complexity of getting something to work with arbitrary sized vectors can mean unnoticed bugs if code isn’t verified against a range of vector sizes.

ARM provides their ARM Instruction Emulator, which unfortunately, isn’t open source and only released for select AArch64 Linux hosts (which shouldn’t be too surprising). This means that you’d likely have to run it on an ARM SBC (of which many aren’t particularly performant), a smartphone which can run a Linux container, or rent an ARM instance from a cloud provider (which is becoming more available these days).

One thing to note is that if you’re doing dynamic dispatch (e.g. testing SVE availability via getauxval in Linux), you’ll need to bypass the check when testing because ARM’s emulator doesn’t intercept that call. If you need to set a vector width dynamically (i.e. via prctl)… I guess you’ll have to hope it works in practice unless you have an actual test machine.
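
A sketch of such a check on Linux/AArch64 (bypass it when running under ARM’s emulator, as noted):

#include <sys/auxv.h>
#include <asm/hwcap.h>

int have_sve2(void) {
#ifdef HWCAP2_SVE2
	return (getauxval(AT_HWCAP2) & HWCAP2_SVE2) != 0;
#else
	return 0; // headers too old to know about SVE2
#endif
}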

Another option may be to use qemu, or other suggestions from ARM, but I haven’t tried these myself. It seems like ARM’s own IE doesn’t support big-endian, so you may need to use qemu to test compatibility if big-endian support is a concern.

OS support

From what I can tell, amongst popular OSes, it seems like only Linux/Android really supports SVE so far.

Conclusion

SIMD programmers have often been annoyed with having to deal with multiple SIMD instruction sets, not to mention multiple ISAs on the same platform (see SSE/AVX/AVX512). SVE aims to address the latter problem (at the expense of introducing yet another ISA) with a different approach, whilst providing greater flexibility to CPU implementers and new capabilities to programmers. Towards that goal, it does feel like SVE largely achieves it with a reasonable balance of concerns, despite all the compromises this entails.

To enable this, SVE has to operate at a higher level of abstraction relative to fixed length SIMD ISAs. In addition to possibly adding some overhead, it inevitably makes this ISA more opinionated, meaning that it can simplify use cases which suit its model whilst making others more challenging to deal with. And SVE does feel like it’s more focused on specific scenarios than NEON was.

Depending on your application, writing code for SVE2 can bring about new challenges. In particular, tailoring fixed-width problems and swizzling data around vectors may become much more difficult when the length is unknown. Whilst SVE2 should provide enough tools to get the job done, its underwhelming support for data swizzle operations feels poorly thought out and leaves much to be desired. On the other hand, SVE2 shares a lot of functionality with NEON, which greatly simplifies adapting existing NEON code to SVE2.

Unfortunately, current implementations of SVE intrinsics can make software wrappers more restricted, particularly for those hoping to use one to unify the multiple SIMD ISAs out there. I can also see subtle bugs potentially cropping up for code not tested against a variety of vector lengths, particularly since SVE allows for great flexibility in that regard, which means your development pipeline may need to incorporate an instruction emulator for testing, regardless of hardware availability/support.

Finally, SVE2’s main selling point over NEON - the ability to have >128-bit vectors on ARM - will likely not eventuate on many processor configurations for quite some time. As such, and in addition to slow software/OS adoption, it could be a while before we get to see the true extent of what SVE is capable of, and how well the ‘scalable’ aspect plays out.

@EwoutH commented Dec 9, 2022

Super interesting read, thanks!

With Neoverse V2 announced, we see Arm take an interesting direction:

Arm Announces Neoverse V2 and E2: The Next Generation of Arm Server CPU Cores
On the latter, Arm is making an interesting change here by reconfiguring the width of their vector engines; whereas V1 implemented SVE(1) using a 2 pipeline 256-bit SIMD, V2 moves to 4 pipes of 128-bit SIMDs. The net result is that the cumulative SIMD width of the V2 is not any wider than V1, but the execution flow has changed to process a larger number of smaller vectors in parallel. This change makes the SIMD pipeline width identical to Arm’s Cortex parts (which are all 128-bit, the minimum size for SVE2), but it does mean that Arm is no longer taking full advantage of the scalable part of SVE by using larger SIMDs.

One comment gave an interesting possible explanation:

I propose an even simpler reason: faster NEON performance, which is what basically all existing hand coded ARM intrinsics use now.

And the following discussion was not bad either.


Meanwhile, without having seen SME in the wild, SME2 is already announced as part of ARMv9.4-A.

SME2

In 2021 Arm announced the Scalable Matrix Extension (SME) to Armv9-A. SME added new capabilities to efficiently process matrices, including matrix tile storage and outer-product operations. In 2022, Arm builds on the capabilities of SME by introducing SME2.

SME provides outer-product instructions to accelerate matrix operations. SME2 significantly extends the capabilities with instructions for multi-vector operations, multi-vector predicates, range prefetches and 2b/4b weight compression.


The new instructions enable SME2 to accelerate more workloads than the original SME, including GEMV, Non-Linear Solvers, Small and Sparse Matrices, and Feature Extraction or tracking.

Notably, ARM keeps communicating SME as a superset of SVE2, and SME2 as a superset of SME.

Edit: Also, it looks like Arm is already working on SME2.1. See:

@zingaburga
Author

Thanks for the comment/update!

Turns out my prediction that SVE2 will be stuck at 128-bit was truer than I had anticipated!
I don't really buy the "faster NEON performance" reason for V2's change, as V1 already ran NEON at 4x128b (though I don't know how it operated if SVE was set to 128b, or if SVE was mixed with NEON).
There's a lot of discussion/speculation here, if you're interested in reading such.

ARM has added some things since I wrote the above article - in particular, SVE2p1 plugs some of the issues I mentioned above. Maybe I'll do an update at some point.
SME isn't something I focus much on unfortunately, as I don't really work on such workloads, so I've stuck with focusing on SVE.

@clausecker

As I understood it, a program cannot configure the vector size. How then are you supposed to test your code? Unless the operating system has support for configuring vector sizes (and doing so beyond what the hardware supports with trap/emulate approaches), it'll be impossible to test what your code does on platforms with larger vector sizes. I foresee that some popular applications will only work with 128 bit SVE due to programming bugs, effectively freezing the vector size to that for consumer applications. I really don't understand why there is no way to tell the CPU what your desired vector size is (causing it to trap into the OS if that is more than what the HW can do) resp. telling the CPU to pick its own ideal vector size.

@zingaburga
Author

I really don't understand why there is no way to tell the CPU what your desired vector size is (causing it to trap into the OS if that is more than what the HW can do) resp. telling the CPU to pick its own ideal vector size.

What would the OS do in such a case? Emulate the SVE code? If so, that doesn't sound particularly enticing, since SVE is meant to be performant.

There is a way to specify a vector length, but you can only go down in size, and it generally cannot be changed during program execution. So for applications specifically programmed to 128-bit SVE, this function allows the OS to force the application to run with 128-bit vectors, even if the CPU's native width is larger. So it's designed to guard against compatibility issues/bugs, such as you describe, but not for testing (or allow the programmer to assume a width, as SVE is meant to be coded to any width).
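
For a concrete example of the 'can only go down' behaviour: on Linux a process can request a length itself via prctl(). A minimal sketch, assuming a kernel with SVE support and the PR_SVE_* constants from <sys/prctl.h> (the length is given in bytes, and the kernel grants the largest supported length not exceeding the request):

```c
/* Sketch only, Linux-specific: clamp the calling process to 128-bit SVE.
 * Requests larger than the hardware maximum are simply rounded down, so
 * this can shrink the vector length but never extend it. */
#include <stdio.h>
#include <sys/prctl.h>

int main(void) {
    if (prctl(PR_SVE_SET_VL, 16) == -1) {   /* 16 bytes = 128-bit vectors */
        perror("PR_SVE_SET_VL");            /* e.g. kernel/CPU without SVE */
        return 1;
    }
    long vl = prctl(PR_SVE_GET_VL) & PR_SVE_VL_LEN_MASK;  /* mask off flag bits */
    printf("SVE vector length is now %ld bytes\n", vl);
    return 0;
}
```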

@clausecker

What would the OS do in such a case? Emulate the SVE code? If so, that doesn't sound particularly enticing, since SVE is meant to be performant.

Yes, correct. This means that programs demanding a larger vector size than what the HW can do can at least run, but they won't be fast. This is mostly important for testing, but will also be important for forwards compatibility. For example, imagine a program that operates on data items that are 32 bytes in size. It would be hard to design the code such that it can also work on 128 bit registers, so in the future, when 128 bit SVE implementations are rare, it might just unconditionally request a 256 bit vector size. Trap/emulate permits such programs to work anyway.

There is a way to specify a vector length, but you can only go down in size, and it generally cannot be changed during program execution. So for applications specifically programmed to 128-bit SVE, this function allows the OS to force the application to run with 128-bit vectors, even if the CPU's native width is larger.

This sounds good in theory but doesn't compose. What if different program parts demand different vector sizes? E.g. you have a graphics library that wants 128 bit but a video codec that wants 256 bit. Requested vector length must be program settable. What needs to be OS settable is the largest supported vector length the CPU reports so you can test your program by faking a differently sized SVE unit than what the HW has.

@zingaburga
Author

This is mostly important for testing

Testing can already be done on an emulator - there's no need to have this mandated for non-development systems.

What if different program parts demand different vector sizes? E.g. you have a graphics library that wants 128 bit but a video codec that wants 256 bit.

I think you're missing the point of SVE - you're not supposed to write code like that. SVE code is supposed to work across all vector sizes. This can indeed raise complications in some scenarios, but SVE strives to make it work.

If you really must code to a specific width (or you're stuck with third party libraries that assume such), you can put an if(vector_width == X) condition around it. This obviously has its limits, but SVE isn't really supposed to work in this fashion.
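
The check itself is cheap, for what it's worth - svcntb() returns the vector length in bytes. A minimal sketch, where kernel_256() and kernel_scalar() are hypothetical width-specific and fallback implementations:

```c
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernels: one written assuming exactly 256-bit registers,
 * and a width-independent scalar fallback. */
void kernel_256(const uint8_t *data, size_t n);
void kernel_scalar(const uint8_t *data, size_t n);

void process(const uint8_t *data, size_t n) {
    if (svcntb() == 32)          /* 32 bytes per vector = 256-bit SVE */
        kernel_256(data, n);
    else
        kernel_scalar(data, n);  /* any other width falls back to scalar */
}
```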

@clausecker

Testing can already be done on an emulator - there's no need to have this mandated for non-development systems.

Having to use emulators to make sure your software works correctly does not sound like the ISA is well designed. Now your project not only has to ship a test suite but also an emulator that works on all platforms the project supports.

I think you're missing the point of SVE - you're not supposed to write code like that. SVE code is supposed to work across all vector sizes. This can indeed raise complications in some scenarios, but SVE strives to make it work.

I do weird combinatorial programming involving lots of permutation operations. Not knowing the vector width makes the code a lot more complicated or possibly impossible to write. Many of these algorithms cannot even benefit from wider vectors, using each vector as a structure on which mostly horizontal operations are performed.

If you really must code to a specific width (or you're stuck with third party libraries that assume such), you can put an if(vector_width == X) condition around it. This obviously has its limits, but SVE isn't really supposed to work in this fashion.

Then the code no longer works on all SVE implementations. Once again, this sucks.

@zingaburga
Author

Now your project not only has to ship a test suite but also an emulator that works on all platforms the project supports.

Requiring all OSes to ship with an emulator, whilst making production code run very slowly sounds worse.

I do weird combinatorial programming involving lots of permutation operations. Not knowing the vector width makes the code a lot more complicated or possibly impossible to write.

As mentioned in the article, permutation is (IMO) a weakness in the SVE design. If that's the case, you might be better suited to sticking with NEON.

Many of these algorithms cannot even benefit from wider vectors, using each vector as a structure on which mostly horizontal operations are performed.

I don't know your problem specifically, but speaking generally, this sort of design typically doesn't translate well to SIMD. You often want to use an SoA memory layout to avoid having a vector represent a structure, and reduce reliance on horizontal operations.
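
(As a generic illustration of the two layouts being contrasted - nothing SVE-specific here:)

```c
#include <stdint.h>

/* Array-of-structures: each vector would end up holding the fields of one
 * object, pushing the code towards horizontal operations. */
struct particle { int32_t x, y, z, w; };
struct particle aos[1024];

/* Structure-of-arrays: each vector holds the same field from many objects,
 * so the same lane-wise operation applies to independent elements. */
struct {
    int32_t x[1024];
    int32_t y[1024];
    int32_t z[1024];
    int32_t w[1024];
} soa;
```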

@clausecker

clausecker commented Dec 13, 2022

Requiring all OSes to ship with an emulator, whilst making production code run very slowly sounds worse.

The only case in which the code traps and is emulated is when you request a higher vector size than what the hardware supports. I believe this would not be a common case, but it should be a supported case (e.g. for testing and for forward compatibility to run code assuming a larger vector size on old platforms that don't have vectors large enough).

As mentioned in the article, permutation is (IMO) a weakness in the SVE design. If that's the case, you might be better suited to sticking with NEON.

Unfortunately they aren't adding any of the useful SVE2 features to NEON, so this too is kind of annoying to do.

I don't know your problem specifically, but speaking generally, this sort of design typically doesn't translate well to SIMD. You often want to use an SoA memory layout to avoid having a vector represent a structure, and reduce reliance on horizontal operations.

Combinatorial algorithms often cannot benefit from this because you only manipulate one combinatorial object at a time. Take for example a chess board in a chess engine. It is very difficult to write a chess engine such that it processes multiple chess positions at the same time due to various factors. Nevertheless, SIMD instructions are tremendously useful for implementing the operations on these. But to do that, you need to know exactly how long your registers are.

In another application, I was working with sliding tile puzzles where you can benefit from SIMD/vector instructions to hold the 16/25/36 byte tile map, permitting you to easily compute indices on them. So we have to know that the vectors are at least that wide (depending on the puzzle type) to be able to make use of them. Cracking a 25 byte structure into two 16 byte vectors is possible, but requires a different code path that has to be tested separately. Once again I would really not like to have to use emulators for that as emulators are largely not portable across operating systems and make testing a lot more annoying.

For this application too you only work on one puzzle at a time. And even if you had multiple puzzles, any sort of SoA layout would not work as you have a lot of fundamentally horizontal operations that need to be computed. For example, one step we do is taking a subset of tiles and computing a bit map of where they are in the puzzle (effectively done with SVE2's MATCH instruction it seems). This could not reasonably be done with a SoA layout.
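
(For reference, a sketch of that membership test using the SVE2 ACLE intrinsic: svmatch_u8 flags, for each byte of its first operand, whether that value occurs anywhere in the corresponding 128-bit segment of the second operand. The board/tile layout assumed here is hypothetical:)

```c
#include <arm_sve.h>

/* Sketch (SVE2 ACLE): "board" holds one tile id per byte, "wanted" holds the
 * subset of tile ids to look for, replicated into each 128-bit segment.
 * The returned predicate has a bit set for every board position whose tile
 * id appears somewhere in the corresponding segment of "wanted". */
svbool_t tiles_present(svuint8_t board, svuint8_t wanted) {
    return svmatch_u8(svptrue_b8(), board, wanted);
}
```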

Many applications that don't boil down to crunching large matrices of numbers suck when you don't have control over vector length. Right now the area of combinatorial SIMD programming is poorly researched (I'm doing my PhD on it), but the possible applications are tremendous. It's very sad to see vector architectures making this stuff needlessly harder than it could be.

@zingaburga
Author

I believe this would not be a common case

Actually, I'd think it'd practically be a non-existent case. No one wants to write SVE code, only to have it run very very slowly.
If the width is unsupported, the code should fall back to scalar code (if not NEON), which would probably run much faster than emulated SVE.
If such a feature did exist, I'd certainly write my code to ensure it never runs in emulated mode.

It is very difficult to write a chess engine such that it processes multiple chess positions at the same time due to various factors.

I don't know anything about chess engines, but processing multiple positions at the same time is, realistically, the only way to take advantage of wider SIMD. It may be very difficult to program, but that's not really the fault of the ISA.

If that can't be done though, it should be possible to just use the lower lanes of longer vectors. Permutations do require some care, as they don't really scale naturally for such tasks, but it should be possible, even if it comes with some efficiency loss.

If, for example, you require 256-bit vectors, you could:

  • write a version for 256-bit or wider vectors (with >256b parts of the vector being ignored), while leaving 128-bit SVE unsupported (so it falls back to scalar code)
  • the above, but also include a separate 128-bit SVE code path

Cracking a 25 byte structure into two 16 byte vectors is possible, but requires a different code path that has to be tested separately.
And even if you had multiple puzzles, any sort of SoA layout would not work as you have a lot of fundamentally horizontal operations that need to be computed

For this example, I'd consider breaking it into two halves as you describe. Longer vectors just simply process more puzzles in parallel.

For example, for 128b vectors, vector A would hold the first 16 bytes/tiles, and vector B would hold the last 9.
For 256b vectors, vector A would hold the first 16 tiles of puzzle 1, followed by the first 16 tiles of puzzle 2, whilst vector B would hold the last 9 tiles of the two puzzles (each followed by 7 blank bytes, i.e. 9 tiles + 7 blank + 9 tiles + 7 blank).
For 384b vectors, each vector would hold 3 puzzle 'halves', and so on.

In other words, try to think in terms of 128b (or smaller) 'units', rather than arbitrary sized vectors.

This may require some rethinking on how you write your code, and it may be difficult, but it's a more SVE-esque way of approaching the problem, and you don't have to write separate code paths for different vector widths.

@clausecker

Actually, I'd think it'd practically be a non-existent case. No one wants to write SVE code, only to have it run very very slowly.
If the width is unsupported, the code should fall back to scalar code (if not NEON), which would probably run much faster than emulated SVE.
If such a feature did exist, I'd certainly write my code to ensure it never runs in emulated mode.

My main imagined use case (apart from being able to test code with different vector lengths) would be forwards compatibility: say in 10 years no SVE implementation using less than 256 bit per vector is on the market so SIMD kernel developers stop writing code paths for implementations with 128 bit per vector. This means that their code simply won't run on those older processors. Being able to trap/emulate would mean that it can run at all, although at a reduced speed. This is vastly preferable over code not running.

I don't know anything about chess engines, but processing multiple positions at the same time is, realistically, the only way to take advantage of wider SIMD. It may be very difficult to program, but that's not really the fault of the ISA.

This is not applicable to many of the algorithms used. They fundamentally cannot benefit from being able to process multiple positions at once because each position being processed depends on the previous position having been processed (think of algorithms similar to depth-first search).

As a part of my previous research, I had actually tried to develop a variant of IDA* (for solving the puzzles) that can process multiple configurations at once, but did not end up with any good results.

For example, for 128b vectors, vector A would hold the first 16 bytes/tiles, and vector B would hold the last 9.
For 256b vectors, vector A would hold the first 16 tiles of puzzle 1, followed by the first 16 tiles of puzzle 2, whilst vector B would hold the last 9 tiles of the two puzzles (each followed by 7 blank bytes, i.e. 9 tiles + 7 blank + 9 tiles + 7 blank).
For 384b vectors, each vector would hold 3 puzzle 'halves', and so on.

With that sort of structure I will not be able to compute any permutations at all without first seriously preprocessing the index vectors involved. Hard pass. However, in any way, this too would require different code paths for different vector sizes. Which I'll once again not be able to test in a reasonable way without actually having a processor with sufficiently beefy SVE and the luck of having an operating system that permits access to the special registers needed to configure the SVE vector size. As is, I won't be able to test this at all as I certainly cannot afford such a processor. Emulation sucks for testing this and runs afoul of the other issues previously outlined.

In other words, try to think in terms of 128b (or smaller) 'units', rather than arbitrary sized vectors.

This may require some rethinking on how you write your code, and it may be difficult, but it's a more SVE-esque way of approaching the problem, and you don't have to write separate code paths for different vector widths.

Sure, I try to do that. But some problems (like those I previously outlined) just don't lend themselves to this sort of approach. I kind of love how the vector-ISA fanboys are all “no you can't use vectors to hold individual data structures! We did not intend the vector arch to be used that way! That is so wrong!”

@zingaburga
Author

This means that their code simply won't run on those older processors

It should, as long as some fallback code is included (which most real world software should definitely have).
If, for whatever reason, they don't include such, then I'd blame the developer, not the ISA. Regardless, such code won't work, but this isn't any different if you try to run SVE code on a CPU without SVE support, or AVX2 kernels on CPUs without AVX2 support etc - if the developer has explicitly decided not to support such a processor, I don't see why OSes should be burdened with trying to do it (and ultimately doing a poor job of such).
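
For what it's worth, the usual runtime dispatch looks something like this - a sketch assuming Linux, where HWCAP2_SVE2 comes from <asm/hwcap.h> and process_sve2()/process_scalar() are hypothetical kernels:

```c
#include <stddef.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>   /* HWCAP2_SVE2 on arm64; other OSes expose this differently */

void process_sve2(const void *data, size_t n);    /* hypothetical SVE2 kernel */
void process_scalar(const void *data, size_t n);  /* hypothetical fallback */

void process(const void *data, size_t n) {
    if (getauxval(AT_HWCAP2) & HWCAP2_SVE2)
        process_sve2(data, n);
    else
        process_scalar(data, n);  /* CPU/kernel without SVE2 */
}
```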

think of algorithms similar to depth-first search

Ideally, you try to refactor into a breadth-first search, or hybrid approach (e.g. depth-first search on several paths).
(this isn't just beneficial to SIMD, but likely helps ILP as well)

With that sort of structure I will not be able to compute any permutations at all without first seriously preprocessing the index vectors involved. Hard pass.

I've found this to often be the case with SVE permutations - you need to spend some effort computing index vectors due to the limited permutations available. Which means it's doable, but you'll need to evaluate whether computing the index vectors is worthwhile.

However, in any way, this too would require different code paths for different vector sizes.

I'm not quite convinced on this, but I'll take your word for it.

Emulation sucks for testing this

Although I'm not completely sure why you dislike such emulation, I'm guessing it may be more of an issue with the tooling, rather than the ISA.
Needing an emulator does complicate the testing routine, as I mention in the article, but I don't think it should be dismissed as unviable.

But some problems (like those I previously outlined) just don't lend themselves to this sort of approach

If it really doesn't, then your problem likely cannot take advantage of wider vectors. I don't see this being a problem with the ISA either - like, I can't see how an ISA could be designed to solve it.
I do think SVE2 could do a better job of catering to such cases, in particular, with better permutation capabilities (which SVE2p1 does help with, by the way), but there's only so much that can be done.

I kind of love how the vector-ISA fanboys are all

As someone who writes SIMD using a fair amount of permutes, for problems that aren't entirely SIMD friendly, I do agree with you there, and was interested to see how well a 'scalable vector' ISA would cater to such workloads (hence the article).

@clausecker

clausecker commented Dec 15, 2022

It should, as long as some fallback code is included (which most real world software should definitely have).

Which once again cannot be tested in situ as you cannot configure the vector size in software...

If, for whatever reason, they don't include such, then I'd blame the developer, not the ISA. Regardless, such code won't work, but this isn't any different if you try to run SVE code on a CPU without SVE support, or AVX2 kernels on CPUs without AVX2 support etc - if the developer has explicitly decided not to support such a processor, I don't see why OSes should be burdened with trying to do it (and ultimately doing a poor job of such).

Existing enterprise architectures like VAX or SPARC have always done stuff like that. This isn't exactly unreasonable.

Ideally, you try to refactor into a breadth-first search, or hybrid approach (e.g. depth-first search on several paths).
(this isn't just beneficial to SIMD, but likely helps ILP as well)

Okay, so you are telling me it is verboten to use SVE2 to optimise parts of a program without restructuring every aspect of it around the vector paradigm? That doesn't sound exactly reasonable and makes me doubt the whole concept.

The reason we don't use breadth-first search is that the search front alone would consume several TB of memory. Not an option. “Hybrid approaches” are tricky to get right; I did not manage to find a good implementation. The key problem is that each node generates a varying number of child nodes. Even with compression support, this is very difficult to realise in the data structures.

I've found this to often be the case with SVE permuations - you need to spend some effort computing index vectors due to the limited permutations available. Which means it's doable, but you'll need to evaluate whether computing the index vectors is worthwhile.

The indexing and permutations are on the critical path of the code in question. Anything that makes these slower significantly affects the speed of the program.

Although I'm not completely sure why you dislike such emulation, I'm guessing it may be more of an issue with the tooling, rather than the ISA. Needing an emulator does complicate the testing routine, as I mention in the article, but I don't think it should be dismissed as unviable.

User space emulators are operating system specific. This means that if your test suite is dependent on emulation, it is no longer portable across operating systems. What good is a test suite if you can't run it completely on a new target you are trying to build your project on? You also introduce significant complexity into the project by depending on an emulator being present.

They are also often incomplete, yielding behaviour that seriously diverges from real hardware, especially with respect to concurrency. Userspace emulators in addition require that an entire userspace environment (dynamic loader, libraries, ...) of the target architecture is available. This is nasty to set up.

I've spent months trying to build software packages with such emulators and eventually had to give up due to all sorts of weird, seemingly impossible errors that cropped up. And that was with QEMU, which is generally regarded as a mature emulator.

If it really doesn't, then your problem likely cannot take advantage of wider vectors. I don't see this being a problem with the ISA either - like, I can't see how an ISA could be designed to solve it.

It sometimes can, but different code may be needed for each vector width (or for different ranges of vector widths) - code that once again cannot be tested, as it is impossible to set the vector size. For example, in the case of the puzzles, I can make do with 128 bit vectors by cracking each puzzle into two vectors and dealing with them with slightly more complex code. But having 256 bit vectors allows for much more straightforward code, so two paths are reasonable. Other problems have similar constraints.

Another issue that might become a constraint in the future is that for vectors larger than 512 bit, you cannot copy masks into general purpose registers to manipulate them as they won't fit anymore. This may become a problem if you want to perform operations that are not available on mask registers. In one of my upcoming projects, I've done this for AVX-512 to move certain complex mask computations off the SIMD ports onto more plentiful scalar ALUs, significantly increasing ILP. Assuming future processors will too implement SVE2 with different ALUs than scalar operations, it might be good for performance to do the same there. So once again, a separate code path is needed for <=512 bit vectors and larger vectors. Only that I can't test the >512 bit stuff at all as I will not be able to afford one of these fancy Fujitsu machines.

@zingaburga
Author

Which once again cannot be tested in situ as you cannot configure the vector size in software...

'Fallback code' referred to scalar code, so no vector size required.

Okay, so you are telling me it is verboten to use SVE2 to optimise parts of a program without restructuring every aspect of it around the vector paradigm?

If you wish to take full advantage of the ISA, yes, you may need to do that.
This isn't too different to some problems where an existing application uses AoS layout, when SoA should be used to maximise SIMD efficiency.

“Hybrid approaches” are tricky to get right

Oh I never said it would be easy...

The indexing and permutations are on the critical path of the code in question. Anything that makes these slower significantly affects the speed of the program.

Indeed - you'll need to evaluate whether that's worthwhile (where it may only be beneficial with wider vectors).

User space emulators are operating system specific.

I thought qemu was cross platform? Regardless, that's an issue with tooling, not the ISA.

They are also often incomplete, yielding behaviour that seriously diverges from real hardware, especially with respect to concurrency

Why would you think an OS built-in emulator would be any different?

Userspace emulators in addition require that an entire userspace environment (dynamic loader, libraries, ...) of the target architecture is available

I haven't tried qemu (which might indeed require it due to being more of a system emulator), but I haven't had any such problems with Intel's SDE or ARM's Instruction Emulator - they just use existing libraries on your system and have relatively minimal setup requirements.

For example, in the case of the puzzles, I can do with 128 bit vectors by cracking each puzzle into two vectors and dealing with them with slightly more complex code. But having 256 bit vectors allows for much more straightforward code, so two paths are reasonable

SVE doesn't really prevent you from taking this approach either. Your objection seems to be with testing these combinations, but the capability exists, even if you don't like the approach.

Another issue that might become a constraint in the future is that for vectors larger than 512 bit, you cannot copy masks into general purpose registers to manipulate them as they won't fit anymore.

As mentioned, SVE doesn't provide any way to directly transfer between masks and GPRs, so how you choose to deal with larger masks is up to you.
If you're only using the bottom 512 bits of the vector, regardless of actual size, it shouldn't really be a problem, since you'll just be ignoring the rest of the mask anyway (assuming you're referring to 8 bit elements).

So once again, a separate code path is needed for <=512 bit vectors and larger vectors. Only that I can't test the >512 bit stuff at all as I will not be able to afford one of these fancy Fujitsu machines.

I don't see how it's that different on the AVX side. If an "AVX1024" comes out, you'll need to write a separate code path to take advantage of it, which you won't be able to test without an emulator or access to whatever CPU that ends up supporting it.

@clausecker

I thought qemu was cross platform? Regardless, that's an issue with tooling, not the ISA.

The system-level emulator is, but the userspace emulator is not (it needs to know the layout of the operating system's data structures and what its system calls are). If I have to start multiple full VMs with QEMU just to run the test suite, I guess I won't be doing much SVE development in the future.

Why would you think an OS built-in emulator would be any different?

That emulator would only trap/emulate the relevant SVE instructions, not the full instruction set. Thus, problems stemming from difficult to emulate behaviour of the processor (such as aspects of concurrency) can be sidestepped. Presence of such an emulator can be mandated by the ABI specification, making it portable to rely on it (as is the case on SPARC).

I haven't tried qemu (which might indeed require it due to being more of a system emulator), but I haven't had any such problems with Intel's SDE or ARM's Instruction Emulator - they just use existing libraries on your system and have relatively minimal setup requirements.

Neither of these two are portable and both seem to be closed source programs (one even only available for payment apparently?) In particular, neither are available for operating systems I commonly develop on such as FreeBSD. Software I write runs on dozens of operating systems. Having an emulator that is closed source, commercial, and only available on two of them be a mandatory prerequisite for running the full test suite does not inspire confidence.

And in any case: I do not think a userspace emulator can even be written to be fully portable, as it has to be adapted manually to the system calls of each operating system it can run on.

If you're only using the bottom 512 bits of the vector, regardless of actual size, it shouldn't really be a problem, since you'll just be ignoring the rest of the mask anyway (assuming you're referring to 8 bit elements).

The use case I was thinking about here was one where I could reasonably write vector-length agnostic code, so a restriction to the bottom 512 bits of the vector would not be sensible.

I don't see how it's that different on the AVX side. If an "AVX1024" comes out, you'll need to write a separate code path to take advantage of it, which you won't be able to test without an emulator or access to whatever CPU that ends up supporting it.

If I write and test AVX code today, it will work and do the exact same thing on all processors that support AVX, even future processors. Such processors won't suddenly start to do different things when executing AVX instructions, possibly breaking my code. I can be certain that having tested the code thoroughly on one machine, it will always be correct.

The same cannot be said about SVE code. Unless I test with all possible vector lengths, there's just no way to be sure that my code will work correctly on future processors. Maybe a future processor with 1664 bit SVE suddenly triggers an unexpected corner case in my code that cannot be triggered with any other vector length and everything crashes and burns.

It is boggling my mind how nobody sees a problem with that. Industry best practice is to test your code both with unit and integration tests and with as much coverage as possible. Programmers, even smart ones, make mistakes and all that. But suddenly we have new instructions that are specified to behave differently on different CPUs (something that was never the case before!) and everybody says “ah no, we don't need to test for that, it'll be just fine. In fact, we won't even add facilities to make it possible test for that. Because you know, it will be fine.” Just imagine some of the usual things that could happen:

  • someone writes SVE code that accumulates data into a register, flushing it out after processing a vector-length dependent amount of data. For short vector lengths this works fine. For longer vector lengths (which have not been/could not be tested for) the register overflows and it fails.
  • someone incorrectly (out of laziness, pragmatism, ...) assumes that the vector length won't exceed 1024 bit anyway (as that's the largest SVE implementation on the market) and writes code that makes such an assumption. In 10 years, CPUs routinely have 2048 bit SVE and the code fails horribly. A ticking time bomb right there.
  • some SVE code stores arrays of temporary vectors on the stack, using a fixed-size array. For small vector lengths as shipped on current CPUs, this works fine. For large vector lengths on future CPUs, we have a surprise zero-day exploit on our hands.
  • as above, but using a VLA for the vectors. Now we no longer have stack corruption, but the function just straight up crashes the program with a stack overflow for larger vector lengths.

I mean SVP64 at least has a user-settable register to clamp the vector length if you need to, but with SVE, there is no ZCR_EL0. It's just completely incomprehensible to me. As if all these vector people drank their own koolaid so hard, they forgot that people could seriously fuck up.

@zingaburga
Author

The system-level emulator is, the userspace emulator is not

Ah I see.

That emulator would only trap/emulate the relevant SVE instructions, not the full instruction set
I do not think a userspace emulator can even be written to be fully portable as it has to be adapted manually to the system calls of each operating system it can run on

What's your definition of 'portable'? Yes, there will be OS specific stuff, but that's no different from any other software out there.
Otherwise, I don't see why an application emulator couldn't do a similar thing - that is, trap on SIGILL and emulate the relevant instructions (I wouldn't be surprised if ARM's IE is doing exactly that).

Neither of these two are portable and both seem to be closed source programs (one even only available for payment apparently?)

They are closed source, but are freely available for use. I link to ARM's IE above, and Intel's SDE is available here.

But again, this sounds like an issue with tooling, not an ISA issue nor OS deficiency. If an open source, 'portable' emulator existed, I'm guessing you wouldn't have any qualms?

In particular, neither are available for operating systems I commonly develop on such as FreeBSD

Does FreeBSD even support SVE at the moment? Don't you think mandating the kernel include an emulator would slow down adoption by OSes?

The use case I was thinking about here was one where I could reasonably write vector-length agnostic code, so a restriction to the bottom 512 bits of the vector would not be sensible.

Wait, weren't you saying that your problem is heavily width dependent and doesn't work well with VLA?
If you can't make your problem VLA, ignoring higher bits of the vector/mask seems like a sensible approach.

It is boggling my mind how nobody sees a problem with that

Well, I did point it out in the article, so I don't think you're the only one...

@clausecker

What's your definition of 'portable'? Yes, there will be OS specific stuff, but that's no different from any other software out there.
Otherwise, I don't see why an application emulator couldn't do a similar thing - that is, trap on SIGILL and emulate the relevant instructions (I wouldn't be surprised if ARM's IE is doing exactly that).

My usual portability standard is: must work on any POSIX-like system with a C compiler. For this use case, I can reasonably also assume that the system runs ARM64 and that the C compiler supports all the relevant SVE intrinsics. If the library I am developing is written in assembly, I may make further assumptions about ABI (such as, that it follows the ARM procedure call standard and so on).

Trapping SIGILL (resp SIGEMT) to emulate instructions cannot be done in a portable manner as none of the usual specifications (ISO 9899, POSIX, ...) specify the layout of the data structures needed to do so.

They are closed source, but are freely available for use. I link to ARM's IE above, and Intel's SDE is available here.

I do not plan to use closed-source software for my development.

But again, this sounds like an issue with tooling, not an ISA issue nor OS deficiency. If an open source, 'portable' emulator existed, I'm guessing you wouldn't have any qualms?

Once again, I do not believe such a thing can be made. And again, integrating that into a test suite is already pretty annoying. Not to mention that the test suite will have to be run 32 times, once for each possible vector size.

Does FreeBSD even support SVE at the moment? Don't you think mandating the kernel include an emulator would slow down adoption by OSes?

I am not sure if FreeBSD supports SVE right now as I do not have any hardware to test on. ARM could provide the required code as a public domain package for integration into operating systems to enable easy integration.

Wait, weren't you saying that your problem is heavily width dependent and doesn't work well with VLA?
If you can't make your problem VLA, ignoring higher bits of the vector/mask seems like a sensible approach.

You are confusing different applications I mentioned. One is the puzzle-solving application which tries to use vectors for fixed-length structures. Another can use vectors of arbitrary length, but could benefit from processing masks in general purpose registers.

Well, I did point it out in the article, so I don't think you're the only one...

I guess we can agree on that. My prediction: so much code will have trouble with longer vector lengths that they'll retrofit a “max supported vector length” note into their ELF spec as a compatibility measure. Long term I believe it'll be likely that application-class ARM chips won't ever extend past 128 bit SVE due to such issues with existing code.

@zingaburga
Author

cannot be done in a portable manner as none of the usual specifications (ISO 9899, POSIX, ...) specify the layout of the data structures needed to do so.

You could just support a wide range of OSes.
I guess you could argue that this still isn't 'portable' in the POSIX sense, but I'd counter that it'd be good enough for all practical purposes. Not to mention that SVE itself isn't supported on all POSIX OSes anyway.

Not to mention that the test suite will have to be run 32 times, once for each possible vector size.

There's 16 possible vector sizes under SVE, but you don't have to test every possible width. Just like you don't have to do 4.2B tests for every function that takes a 32-bit integer, you can be selective with the widths to test.
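
As a sketch of what a selective sweep could look like (Linux-specific, and only useful on hardware or an emulator whose maximum length covers the widths you care about; ./sve_tests is a hypothetical binary containing the actual test cases, and the requested length takes effect at exec time):

```c
/* Sketch: run the test binary under a few chosen vector lengths (in bytes).
 * Lengths the hardware can't provide are rounded down by the kernel. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    const int widths[] = { 16, 32, 64, 256 };  /* 128b, 256b, 512b, 2048b */
    for (unsigned i = 0; i < sizeof widths / sizeof *widths; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Defer the length change until the exec below. */
            if (prctl(PR_SVE_SET_VL, widths[i] | PR_SVE_SET_VL_ONEXEC) == -1)
                _exit(1);                       /* no SVE support at all */
            execl("./sve_tests", "sve_tests", (char *)NULL);
            _exit(1);                           /* exec failed */
        }
        int status;
        waitpid(pid, &status, 0);
        printf("requested %d-byte vectors: exit status %d\n",
               widths[i], WEXITSTATUS(status));
    }
    return 0;
}
```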

But yes, needing to incorporate an emulator into the test routine does add complexity.

You are confusing different applications I mentioned. One is the puzzle-solving application which tries to use vectors for fixed-length structures. Another can use vectors of arbitrary length, but could benefit from processing masks in general purpose registers.

Okay that makes more sense. I talk about mask processing in GPRs here.

@jan-wassenberg

@zingaburga nice, some quick updates:

The _x predication variant is actually important for avoiding extra MOVPRFX instructions being generated.

It is no longer necessary to worry about non-power of two vector lengths [https://developer.arm.com/documentation/102105/ia-00/]

@zingaburga
Author

Interesting, I can't think of how the _x predication could do that, would you be able to explain?
Note that I'm comparing _x predication against no predication.

Also interesting is ARM's reversal on non-power-of-2 widths; looks like this earlier post got actioned. The change does indeed remove a quirk of the ISA.

I see the following in your linked document:

2.45 C215: SVE
Arm is making a retrospective change to the SVE architecture to remove the capability of selecting a non-power-of-two vector length in non-Streaming SVE as well as in Streaming SVE mode.
Specific updates as a result of this change will be communicated in due course.

Does 'selecting' refer to software selection (i.e. setting ZCR_ELx.LEN), or the ability for hardware to select such widths, or both?
It looks like this document hasn't been updated yet, so we'll have to wait and see.

Thanks for the update!

@zingaburga
Author

Added some minor updates to address changes ARM has made since late 2021. I haven't gone through the full article (or reviewed all changes made to SVE) and, at the moment, don't feel like rewriting it or writing a new article to address all the changes.

Happy to add 'update' lines if anyone picks up something I've missed.

@jan-wassenberg

Interesting, I can't think of how the _x predication could do that, would you be able to explain?

If we insist upon zeromasks then the instruction encoding may not have enough bits for one of the regs, hence MOVPRFX is required. We use _x when we don't care and the predicate is all-true, and _z only where we really want zeros.

Does 'selecting' refer to software selection (i.e. setting ZCR_ELx.LEN), or the ability for hardware to select such widths, or both?

I think one could reasonably conclude based on this information that software will not have to deal with non-pow2.

@zingaburga
Author

We use _x when we don't care and the predicate is all-true

I'd assume that in the vast majority of cases, _x would only be used with all-true predicates. Which means that the predicate is basically pointless and they should've just dropped it.

To put it another way, svadd_x(svptrue_b32(), a, b) could've been just svadd(a, b) instead, meaning that the _x variant is rather pointless.
In some weird edge cases, _x might save a PTRUE (not MOVPRFX), but it seems such an odd case for a lot of ACLE to optimise for.
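
To spell out the three ACLE predication variants being compared, a minimal sketch using 32-bit adds:

```c
#include <arm_sve.h>

svint32_t add_merge(svbool_t pg, svint32_t a, svint32_t b) {
    return svadd_s32_m(pg, a, b);   /* inactive lanes keep the value from a */
}

svint32_t add_zero(svbool_t pg, svint32_t a, svint32_t b) {
    return svadd_s32_z(pg, a, b);   /* inactive lanes are zeroed */
}

svint32_t add_dontcare(svint32_t a, svint32_t b) {
    /* _x with an all-true predicate: effectively an unpredicated add, which
     * is the case that arguably could have just been spelled svadd(a, b) */
    return svadd_s32_x(svptrue_b32(), a, b);
}
```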

I think one could reasonably conclude based on this information that software will not have to deal with non-pow2.

Yeah, that seems logical.

@jan-wassenberg

I'd assume that in the vast majority of cases, _x would only be used with all-true predicates. Which means that the predicate is basically pointless and they should've just dropped it.

Agreed, but they appear to have made the design decision to have predicates in as many places as possible, with a few exceptions (svzip etc.). In fairness, this would allow for clock-gating the inactive lanes, but that seems like a very small win especially when software such as ours just sets ptrue anyway.

@zingaburga
Author

zingaburga commented Jan 20, 2023

In fairness, this would allow for clock-gating the inactive lanes

As far as I can tell, not really, as the assembly doesn't support the notion of "don't care predication", so the compiler would have to turn it into merge-predication, or unpredicated. At which point, they may as well have made this behaviour explicit.

The design decision just seems to be really weird to me - unnecessarily complex for questionable gain.

they appear to have made the design decision to have predicates in as many places as possible

I'm not sure if it came through clearly, but I was thinking more in-line with how AVX-512 intrinsics are done, e.g. a _mm_add_X for unpredicated, and _mm_mask_add_X for predicated. Instead of doing _x/_z/_m, the _x variant would just become unpredicated, so you can still have masks in all the places where it makes sense.
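
For comparison, the AVX-512 convention being referred to - a sketch using 32-bit adds, where masking is opt-in rather than mandatory:

```c
#include <immintrin.h>

/* AVX-512 style: the plain form takes no mask at all, and masking is opt-in
 * via the _mask (merging) and _maskz (zeroing) variants. */
__m512i avx512_demo(__mmask16 k, __m512i src, __m512i a, __m512i b) {
    __m512i plain = _mm512_add_epi32(a, b);               /* no mask */
    __m512i merge = _mm512_mask_add_epi32(src, k, a, b);  /* inactive lanes from src */
    __m512i zero  = _mm512_maskz_add_epi32(k, a, b);      /* inactive lanes zeroed */
    return _mm512_add_epi32(plain, _mm512_add_epi32(merge, zero));
}
```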

@jan-wassenberg

Yes, I would also have preferred the AVX-512 style of encoding here. The only downside is that one of the mask regs means "no mask", but we anyway rarely use more than a few masks.

As far as I can tell, not really, as the assembly doesn't support the notion of "don't care predication", so the compiler would have to turn it into merge-predication, or unpredicated. At which point, they may as well have made this behaviour explicit.

To clarify: I was thinking of the HW, not any intrinsics or compiler transform. Predicating each instruction makes it slightly more likely that users will pass through a not-all-true mask, vs. a hypothetical ISA that gives a "no mask" option. This would allow some clock-gating, but I agree that's a really small savings, mainly at the end of loops.

@zingaburga
Author

The only downside is that one of the mask regs means "no mask"

Many instructions are unpredicated or have unpredicated forms. For those that require a predicate, PTRUE basically sets a mask appropriately, so you can get the same effect by dedicating it as a "no masking" predicate, but with more flexibility.

To clarify: I was thinking of the HW, not any intrinsics or compiler transform

I don't really have any objections to the choices made in assembly/machine-code regarding predication, my concern is with the weird decision to have _x in ACLE.

@xlazom00

@zingaburga btw are there any phone SOC with SVE/SVE2 ?
I am testing Snapdragon Gen2 and NO LUCK :(

@zingaburga
Author

I can't provide advice on that front. From memory, Qualcomm disables SVE on their CPUs.
