- No segmentation (except for `fs` and `gs`, used for special purposes like threading)
- Relative to base register
  - used for data on the stack, arrays, structs and class members
  - `[base + index*scale + immediate_offset]`
    - `base` is mandatory and can be any 64-bit register
    - `index` can be any 64-bit register except `rsp`
    - `scale` can be 1, 2, 4, or 8
    - `immediate_offset` (called the displacement in Gas) is a signed 8-bit or 32-bit constant added to the base register
  - Gas syntax is `immediate_offset(base, index, scale)`
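As a rough illustration of the formula above, the C sketch below computes the same address the CPU would form for `[base + index*scale + immediate_offset]`. It models the arithmetic only and is not an assembler API. The function name `effective_address` is made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of [base + index*scale + disp] (Gas: disp(base, index, scale)).
   All arithmetic wraps modulo 2^64, as the address computation does.
   Returns false for a scale the instruction encoding cannot express. */
bool effective_address(uint64_t base, uint64_t index, unsigned scale,
                       int32_t disp, uint64_t *out)
{
    if (scale != 1 && scale != 2 && scale != 4 && scale != 8)
        return false;                      /* only 1, 2, 4 and 8 are encodable */
    /* the displacement is sign-extended to 64 bits before being added */
    *out = base + index * scale + (uint64_t)(int64_t)disp;
    return true;
}
```

For instance, with `rbx = 0x1000` and `rcx = 3`, the operand `[rbx + rcx*4 + 8]` (Gas: `8(%rbx, %rcx, 4)`) corresponds to `effective_address(0x1000, 3, 4, 8, &a)`, which yields `0x1014`.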
- RIP-relative (a.k.a. PC-relative)
  - used for static data
  - contains a 32-bit sign-extended offset relative to the instruction pointer, i.e. to the address of the next instruction
  - explicitly specified with NASM using `mov eax, [rel label]` or the `default rel` / `default abs` directives (NASM uses 32-bit absolute addressing otherwise)
  - explicitly specified with Gas using `mov label(%rip), %eax`
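The sketch below shows in numbers what "relative to the next instruction" means. It derives the 32-bit displacement in C from an instruction's address and length. The helper name `rip_rel_disp` and the example addresses are hypothetical. A real assembler and linker produce this value through relocations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the disp32 needed for a RIP-relative operand.
   RIP already points past the current instruction when the address is
   formed, so the offset is measured from insn_addr + insn_len.
   Returns false when the target is outside the signed 32-bit range. */
bool rip_rel_disp(uint64_t insn_addr, unsigned insn_len, uint64_t target,
                  int32_t *disp)
{
    uint64_t next = insn_addr + insn_len;
    if (target >= next) {
        uint64_t d = target - next;            /* forward reference */
        if (d > (uint64_t)INT32_MAX)
            return false;
        *disp = (int32_t)d;
    } else {
        uint64_t d = next - target;            /* backward reference */
        if (d > (uint64_t)INT32_MAX + 1)
            return false;
        *disp = (int32_t)(-(int64_t)d);
    }
    return true;
}
```

Take a 6-byte `mov eax, [rel x]` (opcode, ModRM byte, 4-byte displacement) at `0x401000` that refers to `x` at `0x404000`. The encoded displacement would be `0x404000 - 0x401006 = 0x2FFA`.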
- 32-bit absolute
  - a 32-bit constant address, sign-extended to 64 bits
  - works only for addresses below 2^31
  - don't use it for simple memory operands: RIP-relative addressing is shorter (in 64-bit mode a 32-bit absolute operand needs an extra SIB byte), faster (no relocations needed) and works everywhere
  - used to access static arrays with an index register, as in `mov ebx, [intarray + rsi*4]`, though this doesn't work for Windows and Linux DLLs or for macOS executables and DLLs, because their addresses are above 2^32 (gcc and clang use it for non-PIE Linux executables; on Windows executables, MASM uses image-base-relative addressing instead)
  - an alternative that works everywhere is to first load the static array's address into `rbx` with a RIP-relative `lea`, then address relative to this base register (`lea rbx, [rel array1]`, then `mov eax, [rbx + rcx*4]`); other static arrays can then be reached through the same register (`mov [rbx + rcx*4 + (array2 - array1)], eax`)
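The base-register alternative can be mimicked in C. The point is that one register serves several arrays: if the arrays sit at a fixed distance from each other, that distance is just a constant displacement. This is only a model. `struct static_data`, `statics` and `copy_element` are hypothetical names, and a C struct stands in for the assembler's data section.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical static data: both arrays live in one object, so the distance
   array2 - array1 is a constant known at build time (here via offsetof). */
struct static_data {
    int32_t array1[8];
    int32_t array2[8];
};
struct static_data statics = { {10, 11, 12, 13, 14, 15, 16, 17}, {0} };

/* Copies array1[i] into array2[i] through a single base pointer. */
void copy_element(size_t i)
{
    /* lea rbx, [rel array1]   (array1 is the first member, at offset 0) */
    unsigned char *rbx = (unsigned char *)&statics;
    const size_t delta = offsetof(struct static_data, array2)
                       - offsetof(struct static_data, array1);
    int32_t eax;
    memcpy(&eax, rbx + i * 4, sizeof eax);          /* mov eax, [rbx + rcx*4] */
    memcpy(rbx + i * 4 + delta, &eax, sizeof eax);  /* mov [rbx + rcx*4 + (array2 - array1)], eax */
}
```

After `copy_element(3)`, `statics.array2[3]` holds 13.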
- 64-bit absolute
  - `mov eax, dword [qword a]`
  - can only be used with `mov` and the registers `al`, `ax`, `eax` or `rax` (as source or destination)
  - can't contain a segment, base or index register
- RIP-relative addressing is easier and faster than the Global Offset Table (GOT) technique of 32-bit position-independent code, since it is position independent by itself (the technique described earlier for accessing static arrays through a base register is position independent too)
| bit 0 - 63 | bit 0 - 31 | bit 0 - 15 | bit 8 - 15 | bit 0 - 7 |
|---|---|---|---|---|
| rax | eax | ax | ah | al |
| rbx | ebx | bx | bh | bl |
| rcx | ecx | cx | ch | cl |
| rdx | edx | dx | dh | dl |
| rsi | esi | si | | sil |
| rdi | edi | di | | dil |
| rbp | ebp | bp | | bpl |
| rsp | esp | sp | | spl |
| r8 | r8d | r8w | | r8b |
| r9 | r9d | r9w | | r9b |
| r10 | r10d | r10w | | r10b |
| r11 | r11d | r11w | | r11b |
| r12 | r12d | r12w | | r12b |
| r13 | r13d | r13w | | r13b |
| r14 | r14d | r14w | | r14b |
| r15 | r15d | r15w | | r15b |
| rflags | eflags | flags | | |
| rip | | | | |
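The table can be read as a set of bit ranges inside one 64-bit register. The C sketch below models `rax` and its sub-registers with shifts and masks. The helper names are hypothetical. For writes, it assumes the x86-64 rule that writing a 32-bit register zero-extends into the full 64-bit register, while 8-bit and 16-bit writes leave the remaining bits unchanged.

```c
#include <stdint.h>

/* Reading: each name is a bit range of the 64-bit register. */
uint32_t get_eax(uint64_t rax) { return (uint32_t)rax; }          /* bits 0-31 */
uint16_t get_ax(uint64_t rax)  { return (uint16_t)rax; }          /* bits 0-15 */
uint8_t  get_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }    /* bits 8-15 */
uint8_t  get_al(uint64_t rax)  { return (uint8_t)rax; }           /* bits 0-7  */

/* Writing: a 32-bit write clears bits 32-63; 16-bit and 8-bit writes merge. */
uint64_t set_eax(uint64_t rax, uint32_t v) { (void)rax; return v; }
uint64_t set_ax(uint64_t rax, uint16_t v) { return (rax & ~(uint64_t)0xFFFF) | v; }
uint64_t set_ah(uint64_t rax, uint8_t v)  { return (rax & ~(uint64_t)0xFF00) | ((uint64_t)v << 8); }
uint64_t set_al(uint64_t rax, uint8_t v)  { return (rax & ~(uint64_t)0xFF) | v; }
```

`set_eax` discards the old upper half. That is why `mov eax, 1` also clears the top 32 bits of `rax`, whereas `mov al, 1` does not touch bits 8 to 63.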
- CF (Carry Flag, bit 0) — Set if an arithmetic operation generates a carry or a borrow out of the most-significant bit of the result; cleared otherwise. This flag indicates an overflow condition for unsigned-integer arithmetic. It is also used in multiple-precision arithmetic.
- PF (Parity Flag, bit 2) — Set if the least-significant byte of the result contains an even number of 1 bits; cleared otherwise.
- AF (Auxiliary carry Flag, bit 4) — Set if an arithmetic operation generates a carry or a borrow out of bit 3 of the result; cleared otherwise. This flag is used in binary-coded decimal (BCD) arithmetic.
- ZF (Zero Flag, bit 6) — Set if the result is zero; cleared otherwise.
- SF (Sign Flag, bit 7) — Set equal to the most-significant bit of the result, which is the sign bit of a signed integer. (0 indicates a positive value and 1 indicates a negative value.)
- OF (Overflow Flag, bit 11) — Set if the integer result is too large a positive number or too small a negative number (excluding the sign-bit) to fit in the destination operand; cleared otherwise. This flag indicates an overflow condition for signed-integer (two’s complement) arithmetic.
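To connect the flag definitions to concrete values, here is a C sketch that recomputes CF, PF, AF, ZF, SF and OF for an 8-bit addition such as `add al, bl`. It is a model written for illustration (the names `add_flags` and `add8_flags` are made up). It does not describe how the processor computes the flags internally.

```c
#include <stdbool.h>
#include <stdint.h>

struct add_flags { bool cf, pf, af, zf, sf, of; };

/* Hypothetical model of the flags set by an 8-bit ADD (e.g. add al, bl). */
struct add_flags add8_flags(uint8_t a, uint8_t b, uint8_t *result)
{
    unsigned wide = (unsigned)a + b;           /* up to 9 significant bits */
    uint8_t r = (uint8_t)wide;
    struct add_flags f;

    f.cf = wide > 0xFF;                                  /* carry out of bit 7 */
    f.af = (((unsigned)a ^ b ^ r) & 0x10u) != 0;         /* carry out of bit 3 */
    f.zf = r == 0;
    f.sf = (r & 0x80u) != 0;                             /* copy of the sign bit */
    /* signed overflow: both operands share a sign that the result lacks */
    f.of = (~((unsigned)a ^ b) & ((unsigned)a ^ r) & 0x80u) != 0;

    unsigned ones = 0;                                   /* PF: even count of 1 bits */
    for (unsigned x = r; x != 0; x &= x - 1)
        ones++;
    f.pf = (ones % 2) == 0;

    *result = r;
    return f;
}
```

Adding 0x7F + 0x01 gives 0x80 with OF and SF set but CF clear, a signed overflow without an unsigned one. Adding 0xFF + 0x01 gives 0x00 with CF and ZF set but OF clear.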
- Wraparound arithmetic — With wraparound arithmetic, a true out-of-range result is truncated (that is, the carry or overflow bit is ignored and only the least significant bits of the result are returned to the destination). Wraparound arithmetic is suitable for applications that control the range of operands to prevent out-of-range results. If the range of operands is not controlled, however, wraparound arithmetic can lead to large errors. For example, adding two large signed numbers can cause positive overflow and produce a negative result.
- Signed saturation arithmetic — With signed saturation arithmetic, out-of-range results are limited to the representable range of signed integers for the integer size being operated on. For example, if positive overflow occurs when operating on signed word integers, the result is saturated to 7FFFH, which is the largest positive integer that can be represented in 16 bits; if negative overflow occurs, the result is saturated to 8000H.
- Unsigned saturation arithmetic — With unsigned saturation arithmetic, out-of-range results are limited to the representable range of unsigned integers for the integer size. So, positive overflow when operating on unsigned byte integers results in FFH being returned and negative overflow results in 00H being returned.
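A small C sketch shows the three modes side by side. The function names are hypothetical, and each one models a single lane of the matching MMX/SSE operation (PADDW, PADDSW, PADDUSB and PSUBUSB).

```c
#include <stdint.h>

/* Wraparound: keep only the low 16 bits of the true sum (PADDW-style). */
uint16_t add_wrap16(uint16_t a, uint16_t b) { return (uint16_t)(a + b); }

/* Signed saturation on words (PADDSW-style): clamp to [-32768, 32767]. */
int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + b;
    if (s > INT16_MAX) return INT16_MAX;   /* 7FFFH */
    if (s < INT16_MIN) return INT16_MIN;   /* 8000H */
    return (int16_t)s;
}

/* Unsigned saturation on bytes (PADDUSB / PSUBUSB-style): clamp to [0, 255]. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + b;
    return s > UINT8_MAX ? UINT8_MAX : (uint8_t)s;   /* FFH on overflow */
}

uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;             /* 00H on underflow */
}
```

With 0x7000 + 0x7000, wraparound yields 0xE000, which is negative when read as a signed word. Signed saturation yields 7FFFH instead.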
- MOV — Move data between general-purpose registers; move data between memory and general-purpose or segment registers; move immediates to general-purpose registers.
- CMOVcc — Conditional move.
- XCHG — Exchange.
- BSWAP — Byte swap.
- XADD — Exchange and add.
- CMPXCHG — Compare and exchange.
- CMPXCHG8B / CMPXCHG16B — Compare and exchange 8/16 bytes.
- PUSH — Push onto stack.
- POP — Pop off of stack.
- PUSHA / PUSHAD — Push general-purpose registers onto stack (not valid in 64-bit mode).
- POPA / POPAD — Pop general-purpose registers from stack (not valid in 64-bit mode).
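- CWD / CDQ / CQO — Convert word to doubleword / doubleword to quadword / quadword to double quadword (sign-extend `ax`, `eax` or `rax` into `dx:ax`, `edx:eax` or `rdx:rax`).
- CBW / CWDE / CDQE — Convert byte to word / word to doubleword / doubleword to quadword in the `rax` register.
- MOVSX / MOVSXD — Move and sign extend.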
- MOVZX — Move and zero extend.
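The sign- and zero-extension instructions above can be modelled in C with explicit bit tests. The sketch below is illustrative only, and its function names are hypothetical. `cqo_rdx` returns the value CQO writes to `rdx`, which is why CQO typically precedes a 64-bit IDIV of `rdx:rax`.

```c
#include <stdint.h>

/* movzx eax, bl: upper bits filled with zeros */
uint32_t movzx_8to32(uint8_t b) { return b; }

/* movsx eax, bl: upper bits filled with copies of bit 7 */
int32_t movsx_8to32(uint8_t b) { return (b & 0x80u) ? (int32_t)b - 256 : (int32_t)b; }

/* cdqe: rax = sign-extended eax (the old upper half of rax is ignored) */
uint64_t cdqe(uint64_t rax)
{
    uint32_t eax = (uint32_t)rax;
    return (eax & 0x80000000u) ? (0xFFFFFFFF00000000ull | eax) : eax;
}

/* cqo: rdx = all ones if rax is negative, else zero, so rdx:rax equals rax */
uint64_t cqo_rdx(uint64_t rax) { return (rax >> 63) ? UINT64_MAX : 0; }
```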
- ADCX — Unsigned integer add with carry.
- ADOX — Unsigned integer add with overflow.
- ADD — Integer add.
- ADC — Add with carry.
- SUB — Subtract.
- SBB — Subtract with borrow.
- IMUL — Signed multiply.
- MUL — Unsigned multiply.
- IDIV — Signed divide.
- DIV — Unsigned divide.
- INC — Increment.
- DEC — Decrement.
- NEG — Negate.
- CMP — Compare.
- AND — Perform bitwise logical AND.
- OR — Perform bitwise logical OR.
- XOR — Perform bitwise logical exclusive OR.
- NOT — Perform bitwise logical NOT.
- SAL / SAR / SHL / SHR — Shift arithmetic/logical left/right.
- SHLD — Shift left double.
- SHRD — Shift right double.
- RCL / RCR / ROL / ROR — Rotate left/right and rotate left/right through carry.
- BT — Bit test.
- BTS — Bit test and set.
- BTR — Bit test and reset.
- BTC — Bit test and complement.
- BSF — Bit scan forward.
- BSR — Bit scan reverse.
- SETcc — Set byte on condition.
- TEST — Logical compare.
- CRC32 — Provides hardware acceleration to calculate cyclic redundancy checks for fast and efficient implementation of data integrity protocols.
- POPCNT — This instruction calculates the number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register).
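Intel documents the CRC32 instruction as using the CRC-32C (Castagnoli) polynomial 11EDC6F41H, not the polynomial used by zlib. A bitwise C reference like the sketch below can help when checking results. The function names are hypothetical. The instruction itself applies no initial or final inversion, so the `~0` start value and the final `~` are the caller's job.

```c
#include <stddef.h>
#include <stdint.h>

/* One CRC32 step on a byte: reflected CRC-32C polynomial 0x82F63B78,
   no inversion (matching what the instruction accumulates). */
uint32_t crc32c_u8(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    return crc;
}

/* Standard CRC-32C of a buffer: start from ~0 and invert at the end. */
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = crc32c_u8(crc, *p++);
    return ~crc;
}
```

`crc32c("123456789", 9)` should give the standard CRC-32C check value 0xE3069283.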
- JMP — Jump.
- Jcc — Jump if condition is met (RIP-relative operand).
- LOOP / LOOPcc — Loop with `rcx` counter.
- CALL — Call procedure.
- RET — Return.
- IRET / IRETD / IRETQ — Return from interrupt.
- INT n / INTO / INT 3 — Call to interrupt procedure (INTO is not valid in 64-bit mode).
- ENTER — High-level procedure entry.
- LEAVE — High-level procedure exit.
- MOVS / MOVSB / MOVSW / MOVSD / MOVSQ — Move data from string to string.
- CMPS / CMPSB / CMPSW / CMPSD / CMPSQ — Compare string operands.
- SCAS / SCASB / SCASW / SCASD / SCASQ — Scan string.
- LODS / LODSB / LODSW / LODSD / LODSQ — Load string.
- STOS / STOSB / STOSW / STOSD / STOSQ — Store string.
- REP / REPE / REPZ / REPNE / REPNZ — Repeat string operation prefix.
- STC — Set carry flag.
- CLC — Clear the carry flag.
- CMC — Complement the carry flag.
- CLD — Clear the direction flag.
- STD — Set direction flag.
- LAHF — Load flags into `ah` register.
- SAHF — Store `ah` register into flags.
- PUSHF / PUSHFQ — Push `rflags` onto stack.
- POPF / POPFQ — Pop `rflags` from stack.
- STI — Set interrupt flag.
- CLI — Clear the interrupt flag.
- LEA — Load effective address.
- NOP — No operation.
- UD — Undefined instruction.
- XLAT / XLATB — Table lookup translation.
- CPUID — Processor identification.
- MOVBE — Move data after swapping data bytes.
- PREFETCHW — Prefetch data into cache in anticipation of write.
- CLFLUSH — Flushes and invalidates a memory operand and its associated cache line from all levels of the processor’s cache hierarchy.
- CLFLUSHOPT — Flushes and invalidates a memory operand and its associated cache line from all levels of the processor’s cache hierarchy with optimized memory system throughput.
- RDRAND — Retrieves a random number generated from hardware.
- RDSEED — Seed the random number generator from hardware.
- XSAVE — Save processor extended states to memory.
- XSAVEC — Save processor extended states with compaction to memory.
- XSAVEOPT — Save processor extended states to memory, optimized.
- XRSTOR — Restore processor extended states from memory.
- XGETBV — Reads the state of an extended control register.
- ANDN — Bitwise AND of the inverted first source operand with the second source operand.
- BEXTR — Contiguous bitwise extract.
- BLSI — Extract lowest set bit.
- BLSMSK — Set all lower bits below first set bit to 1.
- BLSR — Reset lowest set bit.
- BZHI — Zero high bits starting from specified bit position.
- LZCNT — Count the number of leading zero bits.
- MULX — Unsigned multiply without affecting arithmetic flags.
- PDEP — Parallel deposit of bits using a mask.
- PEXT — Parallel extraction of bits using a mask.
- RORX — Rotate right without affecting arithmetic flags.
- SARX / SHLX / SHRX — Shift arithmetic/logic left/right without affecting flags.
- TZCNT — Count the number of trailing zero bits.
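Several of these bit-manipulation instructions have short portable equivalents, which help when reading or testing code that uses them. The C sketch below is a set of reference models with hypothetical names, 64-bit operands, and bit 0 as the least significant bit. A compiler will not necessarily turn them into the actual instructions.

```c
#include <stdint.h>

uint64_t andn64(uint64_t a, uint64_t b) { return ~a & b; }        /* ANDN: ~src1 & src2 */
uint64_t blsi64(uint64_t x)   { return x & (~x + 1); }            /* BLSI: isolate lowest set bit */
uint64_t blsr64(uint64_t x)   { return x & (x - 1); }             /* BLSR: clear lowest set bit */
uint64_t blsmsk64(uint64_t x) { return x ^ (x - 1); }             /* BLSMSK: ones up to lowest set bit */

/* PEXT: gather the bits of src selected by mask into the low bits of the result. */
uint64_t pext64(uint64_t src, uint64_t mask)
{
    uint64_t out = 0, bit = 1;
    for (; mask; mask &= mask - 1, bit <<= 1)
        if (src & mask & (~mask + 1))       /* test the lowest remaining mask bit */
            out |= bit;
    return out;
}

/* PDEP: scatter the low bits of src to the positions selected by mask. */
uint64_t pdep64(uint64_t src, uint64_t mask)
{
    uint64_t out = 0, bit = 1;
    for (; mask; mask &= mask - 1, bit <<= 1)
        if (src & bit)
            out |= mask & (~mask + 1);      /* lowest remaining mask bit */
    return out;
}
```

For example, with `x = 0x2C` (binary 101100), BLSI gives 0x04, BLSR gives 0x28 and BLSMSK gives 0x07. PEXT with mask 0xFF00 pulls bits 8 to 15 down to bits 0 to 7, and PDEP with the same mask undoes it.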
- x87 FPU state is aliased to the MMX state, so care must be taken when making transitions to MMX instructions to prevent incoherent or unexpected results.
- FLD — Load floating-point value.
- FST / FSTP — Store floating-point value without/with pop.
- FILD — Load integer.
- FIST / FISTP — Store integer without/with pop.
- FBLD — Load BCD.
- FBSTP — Store BCD and pop.
- FXCH — Exchange registers.
- FCMOVcc — Floating-point conditional move.
- FADD / FADDP / FIADD — Add floating-point.
- FSUB / FSUBP / FISUB — Subtract floating-point.
- FSUBR / FSUBRP / FISUBR — Subtract floating-point reverse.
- FMUL / FMULP / FIMUL — Multiply floating-point.
- FDIV / FDIVP / FIDIV — Divide floating-point.
- FDIVR / FDIVRP / FIDIVR — Divide floating-point reverse.
- FPREM — Partial remainder.
- FPREM1 — IEEE Partial remainder.
- FABS — Absolute value.
- FCHS — Change sign.
- FRNDINT — Round to integer.
- FSCALE — Scale by power of two.
- FSQRT — Square root.
- FXTRACT — Extract exponent and significand.
- FCOM / FCOMP / FCOMPP — Compare floating-point.
- FUCOM / FUCOMP / FUCOMPP — Unordered compare floating-point.
- FICOM / FICOMP — Compare integer.
- FCOMI / FCOMIP / FUCOMI / FUCOMIP — Compare floating-point and set `rflags`.
- FTST — Test floating-point (compare with 0.0).
- FXAM — Examine floating-point.
- FSIN — Sine.
- FCOS — Cosine.
- FSINCOS — Sine and cosine.
- FPTAN — Partial tangent.
- FPATAN — Partial arctangent.
- F2XM1 — 2^x − 1.
- FYL2X — y ∗ log2(x).
- FYL2XP1 — y ∗ log2(x + 1).
- FLD1 / FLDL2T / FLDL2E / FLDPI / FLDLG2 / FLDLN2 / FLDZ — Load constants.
- FINCSTP — Increment FPU register stack pointer.
- FDECSTP — Decrement FPU register stack pointer.
- FFREE — Free floating-point register.
- FINIT / FNINIT — Initialize FPU.
- FCLEX / FNCLEX — Clear floating-point exception flags.
- FSTCW / FNSTCW — Store FPU control word.
- FLDCW — Load FPU control word.
- FSTENV / FNSTENV — Store FPU environment.
- FLDENV — Load FPU environment.
- FSAVE / FNSAVE — Save FPU state.
- FRSTOR — Restore FPU state.
- FSTSW / FNSTSW — Store FPU status word.
- WAIT / FWAIT — Wait for FPU.
- FNOP — FPU no operation.
- SIMD execution model to handle 64-bit packed integer data.
- Eight new 64-bit data registers, called MMX registers.
- Three new packed data types:
- 64-bit packed byte integers (signed and unsigned)
- 64-bit packed word integers (signed and unsigned)
- 64-bit packed doubleword integers (signed and unsigned)
- MMX state is aliased to the x87 FPU state, so care must be taken when making transitions to x87 FPU instructions to prevent incoherent or unexpected results.
- MOVD / MOVQ — Move doubleword/quadword from/to MMX registers.
- PACKSSWB / PACKSSDW — Pack words/doublewords into bytes with signed saturation.
- PACKUSWB — Pack words into bytes with unsigned saturation.
- PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ — Unpack high-order bytes/words/doublewords.
- PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ — Unpack low-order bytes/words/doublewords.
- PADDB / PADDW / PADDD — Add packed byte/word/doubleword integers.
- PADDSB / PADDSW — Add packed signed byte/word integers with signed saturation.
- PADDUSB / PADDUSW — Add packed unsigned byte/word integers with unsigned saturation.
- PSUBB / PSUBW / PSUBD — Subtract packed byte/word/doubleword integers.
- PSUBSB / PSUBSW — Subtract packed signed byte/word integers with signed saturation.
- PSUBUSB / PSUBUSW — Subtract packed unsigned byte/word integers with unsigned saturation.
- PMULHW — Multiply packed signed word integers and store high result.
- PMULLW — Multiply packed signed word integers and store low result.
- PMADDWD — Multiply and add packed word integers.
- PCMPEQB / PCMPEQW / PCMPEQD — Compare packed bytes/words/doublewords for equal.
- PCMPGTB / PCMPGTW / PCMPGTD — Compare packed signed byte/word/doubleword integers for greater than.
- PAND — Bitwise logical AND.
- PANDN — Bitwise logical AND NOT.
- POR — Bitwise logical OR.
- PXOR — Bitwise logical exclusive OR.
- PSLLW / PSLLD / PSLLQ — Shift packed words/doublewords/quadwords left logical.
- PSRLW / PSRLD / PSRLQ — Shift packed words/doublewords/quadwords right logical.
- PSRAW / PSRAD — Shift packed words/doublewords right arithmetic.
- EMMS — Empty MMX state.
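The PACK instructions apply signed saturation, described earlier, when they narrow values. The C sketch below models PACKSSWB in its 64-bit MMX form. Arrays stand in for MMX registers, and the names are hypothetical.

```c
#include <stdint.h>

/* Signed saturation of one word to a byte. */
static int8_t sat_s8(int16_t w)
{
    if (w > INT8_MAX) return INT8_MAX;   /* 7FH */
    if (w < INT8_MIN) return INT8_MIN;   /* 80H */
    return (int8_t)w;
}

/* Model of packsswb mm1, mm2: the four destination words become the low
   four result bytes, the four source words the high four result bytes. */
void packsswb(const int16_t dst[4], const int16_t src[4], int8_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = sat_s8(dst[i]);
        out[i + 4] = sat_s8(src[i]);
    }
}
```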
- Expand the SIMD execution model by adding facilities for handling packed and scalar single-precision floating-point values contained in 128-bit registers.
- Sixteen (eight for 32-bit mode) new 128-bit packed single-precision floating-point XMM registers available.
- 128-bit packed and scalar single-precision floating-point instructions.
- Enhancements to MMX instruction set with new operations on packed integer operands located in MMX registers.
- Explicit prefetching of data, control of the cacheability of data, control of the ordering of store operations.
- MOVAPS — Move four aligned packed single-precision floating-point values between XMM registers or between XMM register and memory.
- MOVUPS — Move four unaligned packed single-precision floating-point values between XMM registers or between XMM register and memory.
- MOVHPS — Move two packed single-precision floating-point values to and from the high quadword of an XMM register and memory.
- MOVHLPS — Move two packed single-precision floating-point values from the high quadword of an XMM register to the low quadword of another XMM register.
- MOVLPS — Move two packed single-precision floating-point values to and from the low quadword of an XMM register and memory.
- MOVLHPS — Move two packed single-precision floating-point values from the low quadword of an XMM register to the high quadword of another XMM register.
- MOVMSKPS — Extract sign mask from four packed single-precision floating-point values.
- MOVSS — Move scalar single-precision floating-point value between XMM registers or between an XMM register and memory.
- ADDPS — Add packed single-precision floating-point values.
- ADDSS — Add scalar single-precision floating-point values.
- SUBPS — Subtract packed single-precision floating-point values.
- SUBSS — Subtract scalar single-precision floating-point values.
- MULPS — Multiply packed single-precision floating-point values.
- MULSS — Multiply scalar single-precision floating-point values.
- DIVPS — Divide packed single-precision floating-point values.
- DIVSS — Divide scalar single-precision floating-point values.
- RCPPS — Compute reciprocals of packed single-precision floating-point values.
- RCPSS — Compute reciprocal of scalar single-precision floating-point values.
- SQRTPS — Compute square roots of packed single-precision floating-point values.
- SQRTSS — Compute square root of scalar single-precision floating-point values.
- RSQRTPS — Compute reciprocals of square roots of packed single-precision floating-point values.
- RSQRTSS — Compute reciprocal of square root of scalar single-precision floating-point values.
- MAXPS — Return maximum packed single-precision floating-point values.
- MAXSS — Return maximum scalar single-precision floating-point values.
- MINPS — Return minimum packed single-precision floating-point values.
- MINSS — Return minimum scalar single-precision floating-point values.
- CMPPS — Compare packed single-precision floating-point values.
- CMPSS — Compare scalar single-precision floating-point values.
- COMISS — Perform ordered comparison of scalar single-precision floating-point values and set flags in the `rflags` register.
- UCOMISS — Perform unordered comparison of scalar single-precision floating-point values and set flags in the `rflags` register.
- ANDPS — Perform bitwise logical AND of packed single-precision floating-point values.
- ANDNPS — Perform bitwise logical AND NOT of packed single-precision floating-point values.
- ORPS — Perform bitwise logical OR of packed single-precision floating-point values.
- XORPS — Perform bitwise logical XOR of packed single-precision floating-point values.
- SHUFPS — Shuffles values in packed single-precision floating-point operands.
- UNPCKHPS — Unpacks and interleaves the two high-order values from two single-precision floating-point operands.
- UNPCKLPS — Unpacks and interleaves the two low-order values from two single-precision floating-point operands.
- CVTPI2PS — Convert packed doubleword integers to packed single-precision floating-point values.
- CVTSI2SS — Convert doubleword integer to scalar single-precision floating-point value.
- CVTPS2PI — Convert packed single-precision floating-point values to packed doubleword integers.
- CVTTPS2PI — Convert with truncation packed single-precision floating-point values to packed doubleword integers.
- CVTSS2SI — Convert a scalar single-precision floating-point value to a doubleword integer.
- CVTTSS2SI — Convert with truncation a scalar single-precision floating-point value to a scalar doubleword integer.
- PAVGB / PAVGW — Compute average of packed unsigned byte/word integers.
- PEXTRW — Extract word.
- PINSRW — Insert word.
- PMAXUB — Maximum of packed unsigned byte integers.
- PMAXSW — Maximum of packed signed word integers.
- PMINUB — Minimum of packed unsigned byte integers.
- PMINSW — Minimum of packed signed word integers.
- PMOVMSKB — Move byte mask.
- PMULHUW — Multiply packed unsigned integers and store high result.
- PSADBW — Compute sum of absolute differences.
- PSHUFW — Shuffle packed integer word in MMX register.
- MASKMOVQ — Non-temporal store of selected bytes from an MMX register into memory.
- MOVNTQ — Non-temporal store of quadword from an MMX register into memory.
- MOVNTPS — Non-temporal store of four packed single-precision floating-point values from an XMM register into memory.
- PREFETCHh — Load 32 or more bytes from memory to a selected level of the processor’s cache hierarchy.
- SFENCE — Serializes store operations.
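PSADBW is easier to grasp from a scalar model. The sketch below covers the 64-bit MMX form, with arrays standing in for registers and hypothetical names. The 128-bit SSE2 form does the same thing independently on each 8-byte half.

```c
#include <stdint.h>

/* Model of psadbw mm1, mm2: sum of |a[i] - b[i]| over 8 unsigned bytes.
   The sum lands in the low word of the destination; the remaining bits
   of the destination are zeroed. */
uint64_t psadbw(const uint8_t a[8], const uint8_t b[8])
{
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += a[i] > b[i] ? (unsigned)(a[i] - b[i]) : (unsigned)(b[i] - a[i]);
    return sum;   /* at most 8 * 255 = 2040, so it always fits in 16 bits */
}
```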
- Packed and scalar 128-bit double-precision floating-point instructions.
- Additional 64-bit and 128-bit packed byte/word/doubleword/quadword integers instructions.
- 128-bit versions of integer instructions introduced with MMX and SSE.
- Additional cacheability-control and instruction-ordering instructions.
- MOVAPD — Move two aligned packed double-precision floating-point values between XMM registers or between an XMM register and memory.
- MOVUPD — Move two unaligned packed double-precision floating-point values between XMM registers or between an XMM register and memory.
- MOVHPD — Move high packed double-precision floating-point value to and from the high quadword of an XMM register and memory.
- MOVLPD — Move low packed double-precision floating-point value to and from the low quadword of an XMM register and memory.
- MOVMSKPD — Extract sign mask from two packed double-precision floating-point values.
- MOVSD — Move scalar double-precision floating-point value between XMM registers or between an XMM register and memory.
- ADDPD — Add packed double-precision floating-point values.
- ADDSD — Add scalar double-precision floating-point values.
- SUBPD — Subtract packed double-precision floating-point values.
- SUBSD — Subtract scalar double-precision floating-point values.
- MULPD — Multiply packed double-precision floating-point values.
- MULSD — Multiply scalar double-precision floating-point values.
- DIVPD — Divide packed double-precision floating-point values.
- DIVSD — Divide scalar double-precision floating-point values.
- SQRTPD — Compute packed square roots of packed double-precision floating-point values.
- SQRTSD — Compute scalar square root of scalar double-precision floating-point values.
- MAXPD — Return maximum packed double-precision floating-point values.
- MAXSD — Return maximum scalar double-precision floating-point values.
- MINPD — Return minimum packed double-precision floating-point values.
- MINSD — Return minimum scalar double-precision floating-point values.
- ANDPD — Perform bitwise logical AND of packed double-precision floating-point values.
- ANDNPD — Perform bitwise logical AND NOT of packed double-precision floating-point values.
- ORPD — Perform bitwise logical OR of packed double-precision floating-point values.
- XORPD — Perform bitwise logical XOR of packed double-precision floating-point values.
- CMPPD — Compare packed double-precision floating-point values.
- CMPSD — Compare scalar double-precision floating-point values.
- COMISD — Perform ordered comparison of scalar double-precision floating-point values and set flags in the `rflags` register.
- UCOMISD — Perform unordered comparison of scalar double-precision floating-point values and set flags in the `rflags` register.
- SHUFPD — Shuffles values in packed double-precision floating-point operands.
- UNPCKHPD — Unpacks and interleaves the high values from two packed double-precision floating-point operands.
- UNPCKLPD — Unpacks and interleaves the low values from two packed double-precision floating-point operands.
- CVTPD2PI — Convert packed double-precision floating-point values to packed doubleword integers.
- CVTTPD2PI — Convert with truncation packed double-precision floating-point values to packed doubleword integers.
- CVTPI2PD — Convert packed doubleword integers to packed double-precision floating-point values.
- CVTPD2DQ — Convert packed double-precision floating-point values to packed doubleword integers.
- CVTTPD2DQ — Convert with truncation packed double-precision floating-point values to packed doubleword integers.
- CVTDQ2PD — Convert packed doubleword integers to packed double-precision floating-point values.
- CVTPS2PD — Convert packed single-precision floating-point values to packed double-precision floating-point values.
- CVTPD2PS — Convert packed double-precision floating-point values to packed single-precision floating-point values.
- CVTSS2SD — Convert scalar single-precision floating-point values to scalar double-precision floating-point values.
- CVTSD2SS — Convert scalar double-precision floating-point values to scalar single-precision floating-point values.
- CVTSD2SI — Convert scalar double-precision floating-point values to a doubleword integer.
- CVTTSD2SI — Convert with truncation scalar double-precision floating-point values to scalar doubleword integers.
- CVTSI2SD — Convert doubleword integer to scalar double-precision floating-point value.
- CVTDQ2PS — Convert packed doubleword integers to packed single-precision floating-point values.
- CVTPS2DQ — Convert packed single-precision floating-point values to packed doubleword integers.
- CVTTPS2DQ — Convert with truncation packed single-precision floating-point values to packed doubleword integers.
- MOVDQA — Move aligned double quadword.
- MOVDQU — Move unaligned double quadword.
- MOVQ2DQ — Move quadword integer from MMX to XMM registers.
- MOVDQ2Q — Move quadword integer from XMM to MMX registers.
- PMULUDQ — Multiply packed unsigned doubleword integers.
- PADDQ — Add packed quadword integers.
- PSUBQ — Subtract packed quadword integers.
- PSHUFLW — Shuffle packed low words.
- PSHUFHW — Shuffle packed high words.
- PSHUFD — Shuffle packed doublewords.
- PSLLDQ — Shift double quadword left logical.
- PSRLDQ — Shift double quadword right logical.
- PUNPCKHQDQ — Unpack high quadwords.
- PUNPCKLQDQ — Unpack low quadwords.
- CLFLUSH — Flush cacheline.
- LFENCE — Serializes load operations.
- MFENCE — Serializes load and store operations.
- PAUSE — Improves the performance of “spin-wait loops”.
- MASKMOVDQU — Non-temporal store of selected bytes from an XMM register into memory.
- MOVNTPD — Non-temporal store of two packed double-precision floating-point values from an XMM register into memory.
- MOVNTDQ — Non-temporal store of double quadword from an XMM register into memory.
- MOVNTI — Non-temporal store of a doubleword from a general-purpose register into memory.
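PMULUDQ is easy to misread because it uses only every other doubleword. The scalar C model below covers the 128-bit form with two 64-bit lanes, using hypothetical names, and makes that explicit.

```c
#include <stdint.h>

/* Model of pmuludq xmm1, xmm2: in each 64-bit lane, multiply the LOW
   unsigned doublewords and keep the full 64-bit product; the high
   doublewords of both inputs are ignored. */
void pmuludq(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
    for (int i = 0; i < 2; i++)
        out[i] = (uint64_t)(uint32_t)a[i] * (uint32_t)b[i];
}
```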