loicmolinari/x86-64_asm_sheet.md

## x86-64_asm_sheet.md

      
    x86-64 ASM sheet

Addressing


No segmentation (except for fs and gs for special purposes like threading)


Relative to base register

used for data on the stack, arrays, structs and class members
[base + index * scale + immediate_offset]
base is mandatory, can be any 64-bit register
index can be any 64-bit register except rsp
scale can be 1, 2, 4, or 8
immediate_offset (called displacement with Gas) relative to the base register
Gas syntax is immediate_offset(base, index, scale)
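
The address computation above can be modeled in a few lines of C. This is only a sketch: `effective_address` and its parameter names are made up for illustration, and the 32-bit sign-extended offset is an assumption about the widest encodable displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Model of [base + index * scale + immediate_offset].
 * Hypothetical helper, not an existing API. */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  unsigned scale, int32_t immediate_offset)
{
    /* Only these four scale factors can be encoded. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    /* The offset is sign-extended to 64 bits; the sum wraps modulo 2^64. */
    return base + index * scale + (uint64_t)(int64_t)immediate_offset;
}
```

For example, `effective_address(0x1000, 3, 4, 8)` corresponds to `[rbx + rcx*4 + 8]` with `rbx = 0x1000` and `rcx = 3`, and yields `0x1014`.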


RIP-relative (a.k.a. PC-relative)

used for static data
contains a 32-bit sign-extended offset relative to the instruction pointer
explicitly specified using mov eax, [rel label] or the default rel / default abs directives with NASM (uses 32-bit absolute addressing otherwise)
explicitly specified using mov label(%rip), %eax with Gas
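
As a rough C model of RIP-relative addressing: the sketch below assumes the offset is measured from the address of the following instruction (called `rip_next` here, a hypothetical name), and both helper functions are illustrative only.

```c
#include <stdint.h>

/* Target of a RIP-relative operand: next-instruction address plus the
 * 32-bit offset sign-extended to 64 bits. Hypothetical helper. */
static uint64_t rip_relative_target(uint64_t rip_next, int32_t offset)
{
    return rip_next + (uint64_t)(int64_t)offset;
}

/* Returns 1 if target lies within reach of a signed 32-bit offset
 * (roughly +/- 2 GiB) from rip_next, 0 otherwise. */
static int rip_relative_reachable(uint64_t rip_next, uint64_t target)
{
    int64_t delta = (int64_t)(target - rip_next);
    return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

The reach check is why static data placed within 2 GiB of the code can always be addressed this way, wherever the image is loaded.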


32-bit absolute

32-bit constant address sign-extended to 64 bits
works for addresses below 2^31
don't use for simple memory operands since RIP-relative addressing is shorter, faster (no need for relocations) and works everywhere
used to access static arrays with an index register, as in mov ebx, [intarray + rsi*4]; this doesn't work for Windows and Linux DLLs, nor for Mac OS X executables and DLLs, because addresses are above 2^32 (gcc and clang use it for Linux executables; MASM uses image-base-relative addressing for Windows executables)
an alternative that works everywhere is to first load the static array address into rbx using lea with a RIP-relative address, then address relative to this base register (lea rbx, [rel array1] then mov eax, [rbx + rcx*4]); other static arrays can then be accessed relative to the same base (mov [(array2-array1) + rbx + rcx*4], eax)
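
The sign extension explains the 2^31 limit. A small C sketch, with hypothetical helper names, of which 64-bit addresses a 32-bit absolute field can express:

```c
#include <stdint.h>

/* Address produced by a 32-bit absolute field: the encoded constant is
 * sign-extended to 64 bits. Hypothetical helper for illustration. */
static uint64_t absolute32_address(uint32_t encoded)
{
    return (uint64_t)(int64_t)(int32_t)encoded;
}

/* Returns 1 if the address survives the truncate/sign-extend round trip.
 * That holds below 2^31 (and, as a side effect of sign extension, in the
 * top 2 GiB of the address space). */
static int absolute32_encodable(uint64_t address)
{
    return absolute32_address((uint32_t)address) == address;
}
```

An address such as `0x80000000` would be encoded with its top bit set and come back as `0xFFFFFFFF80000000`, so it cannot be reached this way.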


64-bit absolute

mov eax, dword [qword a]
can only be used with mov and registers al, ax, eax or rax (src or dst)
can't contain a segment, base or index register


Position-Independent Code (PIC)


Easier and faster than the 32-bit Global Offset Table (GOT) technique since RIP-relative is position independent (note that the technique to access static arrays with an index register described earlier is position independent too)

General purpose registers


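| bit 0 - 63 | bit 0 - 31 | bit 0 - 15 | bit 8 - 15 | bit 0 - 7 |
|---|---|---|---|---|
| `rax` | `eax` | `ax` | `ah` | `al` |
| `rbx` | `ebx` | `bx` | `bh` | `bl` |
| `rcx` | `ecx` | `cx` | `ch` | `cl` |
| `rdx` | `edx` | `dx` | `dh` | `dl` |
| `rsi` | `esi` | `si` | | `sil` |
| `rdi` | `edi` | `di` | | `dil` |
| `rbp` | `ebp` | `bp` | | `bpl` |
| `rsp` | `esp` | `sp` | | `spl` |
| `r8` | `r8d` | `r8w` | | `r8b` |
| `r9` | `r9d` | `r9w` | | `r9b` |
| `r10` | `r10d` | `r10w` | | `r10b` |
| `r11` | `r11d` | `r11w` | | `r11b` |
| `r12` | `r12d` | `r12w` | | `r12b` |
| `r13` | `r13d` | `r13w` | | `r13b` |
| `r14` | `r14d` | `r14w` | | `r14b` |
| `r15` | `r15d` | `r15w` | | `r15b` |
| `rflags` | | `flags` | | |
| `rip` | | | | |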


rflags register


CF (Carry Flag, bit 0) — Set if an arithmetic operation generates a carry or a borrow out of the most-significant bit of the result; cleared otherwise. This flag indicates an overflow condition for unsigned-integer arithmetic. It is also used in multiple-precision arithmetic.
PF (Parity Flag, bit 2) — Set if the least-significant byte of the result contains an even number of 1 bits; cleared otherwise.
AF (Auxiliary carry Flag, bit 4) — Set if an arithmetic operation generates a carry or a borrow out of bit 3 of the result; cleared otherwise. This flag is used in binary-coded decimal (BCD) arithmetic.
ZF (Zero Flag, bit 6) — Set if the result is zero; cleared otherwise.
SF (Sign Flag, bit 7) — Set equal to the most-significant bit of the result, which is the sign bit of a signed integer. (0 indicates a positive value and 1 indicates a negative value.)
OF (Overflow Flag, bit 11) — Set if the integer result is too large a positive number or too small a negative number (excluding the sign-bit) to fit in the destination operand; cleared otherwise. This flag indicates an overflow condition for signed-integer (two’s complement) arithmetic.
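
To make the flag definitions concrete, here is a C sketch that derives the six status flags for an 8-bit addition. The `struct flags` type and `add8_flags` function are invented for this illustration; they model the descriptions above and are not a hardware interface.

```c
#include <stdint.h>

/* One field per status flag described above (hypothetical type). */
struct flags { int cf, pf, af, zf, sf, of; };

/* Computes a + b on 8 bits and the resulting status flags. */
static struct flags add8_flags(uint8_t a, uint8_t b, uint8_t *result)
{
    unsigned wide = (unsigned)a + b;   /* 9-bit true sum */
    uint8_t r = (uint8_t)wide;         /* truncated result */
    struct flags f;

    f.cf = wide > 0xFF;                /* carry out of bit 7 */

    uint8_t p = r;                     /* fold the byte down to one parity bit */
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    f.pf = !(p & 1);                   /* set for an even number of 1 bits */

    f.af = ((a ^ b ^ r) >> 4) & 1;     /* carry out of bit 3 */
    f.zf = (r == 0);
    f.sf = r >> 7;                     /* copy of the sign bit */
    /* Signed overflow: operands share a sign that the result does not. */
    f.of = ((~(a ^ b) & (a ^ r)) >> 7) & 1;

    *result = r;
    return f;
}
```

Adding `0x7F + 0x01` gives `0x80` with OF and SF set but CF clear (signed overflow only), while `0xFF + 0x01` gives `0x00` with CF and ZF set but OF clear (unsigned overflow only).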

Saturation and wraparound modes (of the instruction set)


Wraparound arithmetic — With wraparound arithmetic, a true out-of-range result is truncated (that is, the carry or overflow bit is ignored and only the least significant bits of the result are returned to the destination). Wraparound arithmetic is suitable for applications that control the range of operands to prevent out-of-range results. If the range of operands is not controlled, however, wraparound arithmetic can lead to large errors. For example, adding two large signed numbers can cause positive overflow and produce a negative result.
Signed saturation arithmetic — With signed saturation arithmetic, out-of-range results are limited to the representable range of signed integers for the integer size being operated on. For example, if positive overflow occurs when operating on signed word integers, the result is saturated to 7FFFH, which is the largest positive integer that can be represented in 16 bits; if negative overflow occurs, the result is saturated to 8000H.
Unsigned saturation arithmetic — With unsigned saturation arithmetic, out-of-range results are limited to the representable range of unsigned integers for the integer size. So, positive overflow when operating on unsigned byte integers results in FFH being returned and negative overflow results in 00H being returned.
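
The three modes can be compared side by side on a single lane. The C functions below are a sketch with made-up names; they mirror what one element of PADDW, PADDSW, PADDUSB and PSUBUSB would compute.

```c
#include <stdint.h>

/* Wraparound: keep only the low 16 bits of the true sum (one PADDW lane). */
static int16_t add_wrap_s16(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* Signed saturation: clamp to [8000H, 7FFFH] (one PADDSW lane). */
static int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Unsigned saturation: clamp to FFH on overflow (one PADDUSB lane). */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 0xFF ? 0xFF : (uint8_t)sum;
}

/* Unsigned saturation: clamp to 00H on underflow (one PSUBUSB lane). */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}
```

With wraparound, `0x7FFF + 1` becomes the most negative word; with signed saturation it stays at `0x7FFF`.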

Stack frames

Data transfer instructions


MOV — Move data between general-purpose registers; move data between memory and general-purpose or segment registers; move immediates to general-purpose registers.
CMOVcc — Conditional move.
XCHG — Exchange.
BSWAP — Byte swap.
XADD — Exchange and add.
CMPXCHG — Compare and exchange.
CMPXCHG8B / CMPXCHG16B — Compare and exchange 8/16 bytes.
PUSH — Push onto stack.
POP — Pop off of stack.
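PUSHA / PUSHAD — Push general-purpose registers onto stack (not available in 64-bit mode).
POPA / POPAD — Pop general-purpose registers from stack (not available in 64-bit mode).
CWD / CDQ / CQO — Convert word to doubleword / doubleword to quadword / quadword to octword (sign-extends ax, eax or rax into dx, edx or rdx).
CBW / CWDE / CDQE — Convert byte to word / word to doubleword / doubleword to quadword (sign-extends within the rax register).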
MOVSX / MOVSXD — Move and sign extend.
MOVZX — Move and zero extend.
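
A C sketch of what a few of these instructions compute, using hypothetical function names. It contrasts zero extension with sign extension and shows the byte reordering done by BSWAP on a 32-bit value.

```c
#include <stdint.h>

/* MOVZX-like: upper bits are filled with zeros. */
static uint64_t movzx_8_to_64(uint8_t v)
{
    return v;
}

/* MOVSX-like: upper bits are filled with copies of the sign bit. */
static uint64_t movsx_8_to_64(uint8_t v)
{
    return (uint64_t)(int64_t)(int8_t)v;
}

/* BSWAP-like: reverse the byte order of a 32-bit value. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

The byte `0x80` becomes `0x0000000000000080` when zero-extended and `0xFFFFFFFFFFFFFF80` when sign-extended.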

Binary arithmetic instructions


ADCX — Unsigned integer add with carry.
ADOX — Unsigned integer add with overflow.
ADD — Integer add.
ADC — Add with carry.
SUB — Subtract.
SBB — Subtract with borrow.
IMUL — Signed multiply.
MUL — Unsigned multiply.
IDIV — Signed divide.
DIV — Unsigned divide.
INC — Increment.
DEC — Decrement.
NEG — Negate.
CMP — Compare.
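
ADD followed by ADC is the usual building block for multiple-precision arithmetic. The sketch below models a 128-bit addition on two 64-bit limbs in C; `struct u128` and `add128` are illustrative names, and the carry is recomputed by comparison since C has no carry flag.

```c
#include <stdint.h>

/* 128-bit unsigned integer as two 64-bit limbs (hypothetical type). */
struct u128 { uint64_t lo, hi; };

static struct u128 add128(struct u128 a, struct u128 b)
{
    struct u128 r;
    r.lo = a.lo + b.lo;             /* ADD: low limbs, wraps modulo 2^64 */
    uint64_t carry = r.lo < a.lo;   /* what CF would hold after the ADD */
    r.hi = a.hi + b.hi + carry;     /* ADC: high limbs plus carry */
    return r;
}
```

Adding 1 to a value whose low limb is all ones clears the low limb and propagates the carry into the high limb.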

Logical instructions


AND — Perform bitwise logical AND.
OR — Perform bitwise logical OR.
XOR — Perform bitwise logical exclusive OR.
NOT — Perform bitwise logical NOT.

Shift and rotate instructions


SAL / SAR / SHL / SHR — Shift arithmetic/logical left/right.
SHLD — Shift left double.
SHRD — Shift right double.
RCL / RCR / ROL / ROR — Rotate left/right and rotate left/right through carry.
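
The difference between logical and arithmetic right shifts, and between shifts and rotates, can be sketched in C for 32-bit operands. Function names are hypothetical; the masking of the count to 5 bits reflects how 32-bit shifts and rotates treat their count operand.

```c
#include <stdint.h>

/* SHR-like: vacated high bits are filled with zeros. */
static uint32_t shr32(uint32_t v, unsigned count)
{
    return v >> (count & 31);
}

/* SAR-like: vacated high bits are filled with copies of the sign bit. */
static uint32_t sar32(uint32_t v, unsigned count)
{
    count &= 31;
    uint32_t shifted = v >> count;
    if ((v & 0x80000000u) && count)
        shifted |= ~(0xFFFFFFFFu >> count);
    return shifted;
}

/* ROL-like: bits shifted out on the left re-enter on the right. */
static uint32_t rol32(uint32_t v, unsigned count)
{
    count &= 31;
    return count ? (v << count) | (v >> (32 - count)) : v;
}
```

Shifting `0x80000000` right by 4 gives `0x08000000` logically and `0xF8000000` arithmetically.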

Bit and byte instructions


BT — Bit test.
BTS — Bit test and set.
BTR — Bit test and reset.
BTC — Bit test and complement.
BSF — Bit scan forward.
BSR — Bit scan reverse.
SETcc — Set byte on condition.
TEST — Logical compare.
CRC32 — Provides hardware acceleration to calculate cyclic redundancy checks for fast and efficient implementation of data integrity protocols.
POPCNT — This instruction calculates the number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register).
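
Portable C models of three of these instructions, as a sketch with made-up names. Returning -1 for a zero input is a choice of this model only: BSF and BSR leave the destination undefined in that case.

```c
#include <stdint.h>

/* POPCNT-like: number of bits set to 1. */
static unsigned popcnt64(uint64_t v)
{
    unsigned count = 0;
    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* BSF-like: index of the lowest set bit, or -1 for zero (model choice). */
static int bsf64(uint64_t v)
{
    if (v == 0) return -1;
    int index = 0;
    while (!(v & 1)) { v >>= 1; index++; }
    return index;
}

/* BSR-like: index of the highest set bit, or -1 for zero (model choice). */
static int bsr64(uint64_t v)
{
    if (v == 0) return -1;
    int index = 63;
    while (!(v >> 63)) { v <<= 1; index--; }
    return index;
}
```

For `0x28` (bits 3 and 5 set), the forward scan finds 3, the reverse scan finds 5, and the population count is 2.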

Control transfer instructions


JMP — Jump.
Jcc — Jump if condition is met (RIP-relative operand).
LOOP / LOOPcc — Loop with rcx counter.
CALL — Call procedure.
RET — Return.
IRET / IRETD / IRETQ — Return from interrupt.
INT n / INTO / INT 3 — Call to interrupt procedure.
ENTER — High-level procedure entry.
LEAVE — High-level procedure exit.

String instructions


MOVS / MOVSB / MOVSW / MOVSD / MOVSQ — Move data from string to string.
CMPS / CMPSB / CMPSW / CMPSD / CMPSQ — Compare string operands.
SCAS / SCASB / SCASW / SCASD / SCASQ — Scan string.
LODS / LODSB / LODSW / LODSD / LODSQ — Load string.
STOS / STOSB / STOSW / STOSD / STOSQ — Store string.
REP / REPE / REPZ / REPNE / REPNZ — Repeat string operation prefix.

rflags control instructions


STC — Set carry flag.
CLC — Clear the carry flag.
CMC — Complement the carry flag.
CLD — Clear the direction flag.
STD — Set direction flag.
LAHF — Load flags into ah register.
SAHF — Store ah register into flags.
PUSHF / PUSHFQ — Push rflags onto stack.
POPF / POPFQ — Pop rflags from stack.
STI — Set interrupt flag.
CLI — Clear the interrupt flag.

Miscellaneous instructions


LEA — Load effective address.
NOP — No operation.
UD0 / UD1 / UD2 — Undefined instruction (raises an invalid-opcode exception).
XLAT / XLATB — Table lookup translation.
CPUID — Processor identification.
MOVBE — Move data after swapping data bytes.
PREFETCHW — Prefetch data into cache in anticipation of write.
CLFLUSH — Flushes and invalidates a memory operand and its associated cache line from all levels of the processor’s cache hierarchy.
CLFLUSHOPT — Flushes and invalidates a memory operand and its associated cache line from all levels of the processor’s cache hierarchy with optimized memory system throughput.
RDRAND — Retrieves a random number generated from hardware.
RDSEED — Seed the random number generator from hardware.

User-mode extended states save/restore instructions


XSAVE — Save processor extended states to memory.
XSAVEC — Save processor extended states with compaction to memory.
XSAVEOPT — Save processor extended states to memory, optimized.
XRSTOR — Restore processor extended states from memory.
XGETBV — Reads the state of an extended control register.

Bit manipulation instructions (BMI1, BMI2)


ANDN — Bitwise AND of the inverted first source operand with the second source operand.
BEXTR — Contiguous bitwise extract.
BLSI — Extract lowest set bit.
BLSMSK — Get a mask of all bits up to and including the lowest set bit.
BLSR — Reset lowest set bit.
BZHI — Zero high bits starting from specified bit position.
LZCNT — Count the number of leading zero bits.
MULX — Unsigned multiply without affecting arithmetic flags.
PDEP — Parallel deposit of bits using a mask.
PEXT — Parallel extraction of bits using a mask.
RORX — Rotate right without affecting arithmetic flags.
SARX / SHLX / SHRX — Shift arithmetic/logic left/right without affecting flags.
TZCNT — Count the number of trailing zero bits.
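
Several of these instructions reduce to one-line bit tricks, and PDEP / PEXT are easy to model with a loop. The C functions below are a sketch with names chosen to match the mnemonics; they are plain software models, not compiler intrinsics.

```c
#include <stdint.h>

/* BLSI-like: isolate the lowest set bit. */
static uint64_t blsi(uint64_t v)   { return v & (0 - v); }

/* BLSMSK-like: mask up to and including the lowest set bit. */
static uint64_t blsmsk(uint64_t v) { return v ^ (v - 1); }

/* BLSR-like: clear the lowest set bit. */
static uint64_t blsr(uint64_t v)   { return v & (v - 1); }

/* PEXT-like: gather the bits of src selected by mask into the low bits. */
static uint64_t pext(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    unsigned k = 0;                       /* next output bit position */
    for (unsigned i = 0; i < 64; i++) {
        if ((mask >> i) & 1) {
            out |= ((src >> i) & 1) << k;
            k++;
        }
    }
    return out;
}

/* PDEP-like: scatter the low bits of src to the positions set in mask. */
static uint64_t pdep(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    unsigned k = 0;                       /* next input bit position */
    for (unsigned i = 0; i < 64; i++) {
        if ((mask >> i) & 1) {
            out |= ((src >> k) & 1) << i;
            k++;
        }
    }
    return out;
}
```

With the mask `0xF0F0`, `pext(0xABCD, mask)` packs the two selected nibbles into `0xAC`, and `pdep(0xAC, mask)` spreads them back out to `0xA0C0`.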

x87 FPU overview


x87 FPU state is aliased to the MMX state; care must be taken when making transitions to MMX instructions to prevent incoherent or unexpected results.

x87 FPU data transfer instructions


FLD — Load floating-point value.
FST / FSTP — Store floating-point value without/with pop.
FILD — Load integer.
FIST / FISTP — Store integer without/with pop.
FBLD — Load BCD.
FBSTP — Store BCD and pop.
FXCH — Exchange registers.
FCMOVcc — Floating-point conditional move.

x87 FPU basic arithmetic instructions


FADD / FADDP / FIADD — Add floating-point.
FSUB / FSUBP / FISUB — Subtract floating-point.
FSUBR / FSUBRP / FISUBR — Subtract floating-point reverse.
FMUL / FMULP / FIMUL — Multiply floating-point.
FDIV / FDIVP / FIDIV — Divide floating-point.
FDIVR / FDIVRP / FIDIVR — Divide floating-point reverse.
FPREM — Partial remainder.
FPREM1 — IEEE Partial remainder.
FABS — Absolute value.
FCHS — Change sign.
FRNDINT — Round to integer.
FSCALE — Scale by power of two.
FSQRT — Square root.
FXTRACT — Extract exponent and significand.

x87 FPU comparison instructions


FCOM / FCOMP / FCOMPP — Compare floating-point.
FUCOM / FUCOMP / FUCOMPP — Unordered compare floating-point.
FICOM / FICOMP — Compare integer.
FCOMI / FCOMIP / FUCOMI / FUCOMIP — Compare floating-point and set rflags.
FTST — Test floating-point (compare with 0.0).
FXAM — Examine floating-point.

x87 FPU transcendental instructions


FSIN — Sine.
FCOS — Cosine.
FSINCOS — Sine and cosine.
FPTAN — Partial tangent.
FPATAN — Partial arctangent.
F2XM1 — 2^x − 1.
FYL2X — y ∗ log2(x).
FYL2XP1 — y ∗ log2(x + 1).

x87 FPU load constants instructions


FLD1 / FLDL2T / FLDL2E / FLDPI / FLDLG2 / FLDLN2 / FLDZ — Load constants.

x87 FPU control instructions


FINCSTP — Increment FPU register stack pointer.
FDECSTP — Decrement FPU register stack pointer.
FFREE — Free floating-point register.
FINIT / FNINIT — Initialize FPU.
FCLEX / FNCLEX — Clear floating-point exception flags.
FSTCW / FNSTCW — Store FPU control word.
FLDCW — Load FPU control word.
FSTENV / FNSTENV — Store FPU environment.
FLDENV — Load FPU environment.
FSAVE / FNSAVE — Save FPU state.
FRSTOR — Restore FPU state.
FSTSW / FNSTSW — Store FPU status word.
WAIT / FWAIT — Wait for FPU.
FNOP — FPU no operation.

x87 FPU and SIMD state management instructions


FXSAVE — Save x87 FPU and SIMD state.
FXRSTOR — Restore x87 FPU and SIMD state.

MMX overview


SIMD execution model to handle 64-bit packed integer data.
Eight new 64-bit data registers, called MMX registers.
Three new packed data types:

64-bit packed byte integers (signed and unsigned)
64-bit packed word integers (signed and unsigned)
64-bit packed doubleword integers (signed and unsigned)


MMX state is aliased to the x87 FPU state; care must be taken when making transitions to x87 FPU instructions to prevent incoherent or unexpected results.

MMX data transfer instructions


MOVD / MOVQ — Move doubleword/quadword from/to MMX registers.

MMX conversion instructions


PACKSSWB / PACKSSDW — Pack words/doublewords into bytes with signed saturation.
PACKUSWB — Pack words into bytes with unsigned saturation.
PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ — Unpack high-order bytes/words/doublewords.
PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ — Unpack low-order bytes/words/doublewords.

MMX packed arithmetic instructions


PADDB / PADDW / PADDD — Add packed byte/word/doubleword integers.
PADDSB / PADDSW — Add packed signed byte/word integers with signed saturation.
PADDUSB / PADDUSW — Add packed unsigned byte/word integers with unsigned saturation.
PSUBB / PSUBW / PSUBD — Subtract packed byte/word/doubleword integers.
PSUBSB / PSUBSW — Subtract packed signed byte/word integers with signed saturation.
PSUBUSB / PSUBUSW — Subtract packed unsigned byte/word integers with unsigned saturation.
PMULHW — Multiply packed signed word integers and store high result.
PMULLW — Multiply packed signed word integers and store low result.
PMADDWD — Multiply and add packed word integers.
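
PMADDWD is the least obvious entry in this list, so here is a scalar C sketch of its 64-bit MMX form. The function name mirrors the mnemonic, and representing the registers as arrays is an assumption of this model.

```c
#include <stdint.h>

/* PMADDWD-like: multiply four pairs of signed words, then add adjacent
 * products to form two signed doublewords. */
static void pmaddwd(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    for (int i = 0; i < 2; i++) {
        /* Each product fits in 32 bits; the sum is done on unsigned values
         * so that it wraps the way the hardware result would. */
        uint32_t even = (uint32_t)((int32_t)a[2 * i] * b[2 * i]);
        uint32_t odd  = (uint32_t)((int32_t)a[2 * i + 1] * b[2 * i + 1]);
        out[i] = (int32_t)(even + odd);
    }
}
```

With `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}`, the two results are `1*5 + 2*6 = 17` and `3*7 + 4*8 = 53`, which is why the instruction is handy for dot products.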

MMX comparison instructions


PCMPEQB / PCMPEQW / PCMPEQD — Compare packed bytes/words/doublewords for equal.
PCMPGTB / PCMPGTW / PCMPGTD — Compare packed signed byte/word/doubleword integers for greater than.

MMX logical instructions


PAND — Bitwise logical AND.
PANDN — Bitwise logical AND NOT.
POR — Bitwise logical OR.
PXOR — Bitwise logical exclusive OR.
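
The packed compare instructions produce all-ones or all-zeros per element, which combines with the logical instructions above into a branchless select. The C sketch below models this for four words; the function names are hypothetical, and the registers are represented as arrays.

```c
#include <stdint.h>

/* PCMPGTW-like: each lane becomes FFFFH if a > b (signed), else 0000H. */
static void pcmpgtw(const int16_t a[4], const int16_t b[4], uint16_t mask[4])
{
    for (int i = 0; i < 4; i++)
        mask[i] = (a[i] > b[i]) ? 0xFFFF : 0x0000;
}

/* Picks a where the mask is set and b elsewhere. */
static void select_words(const uint16_t mask[4], const int16_t a[4],
                         const int16_t b[4], int16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        uint16_t from_a = mask[i] & (uint16_t)a[i];             /* PAND  */
        uint16_t from_b = (uint16_t)~mask[i] & (uint16_t)b[i];  /* PANDN */
        out[i] = (int16_t)(from_a | from_b);                    /* POR   */
    }
}
```

Feeding the mask from `pcmpgtw(a, b, mask)` into `select_words` yields the per-lane signed maximum of `a` and `b` without any branch.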

MMX shift and rotate instructions


PSLLW / PSLLD / PSLLQ — Shift packed words/doublewords/quadwords left logical.
PSRLW / PSRLD / PSRLQ — Shift packed words/doublewords/quadwords right logical.
PSRAW / PSRAD — Shift packed words/doublewords right arithmetic.

MMX state management instructions


EMMS — Empty MMX state.

SSE overview


Expand the SIMD execution model by adding facilities for handling packed and scalar single-precision floating-point values contained in 128-bit registers.
Sixteen (eight for 32-bit mode) new 128-bit packed single-precision floating-point XMM registers available.
128-bit packed and scalar single-precision floating-point instructions.
Enhancements to MMX instruction set with new operations on packed integer operands located in MMX registers.
Explicit prefetching of data, control of the cacheability of data, control of the ordering of store operations.

SSE data transfer instructions


MOVAPS — Move four aligned packed single-precision floating-point values between XMM registers or between XMM register and memory.
MOVUPS — Move four unaligned packed single-precision floating-point values between XMM registers or between XMM register and memory.
MOVHPS — Move two packed single-precision floating-point values to and from the high quadword of an XMM register and memory.
MOVHLPS — Move two packed single-precision floating-point values from the high quadword of an XMM register to the low quadword of another XMM register.
MOVLPS — Move two packed single-precision floating-point values to and from the low quadword of an XMM register and memory.
MOVLHPS — Move two packed single-precision floating-point values from the low quadword of an XMM register to the high quadword of another XMM register.
MOVMSKPS — Extract sign mask from four packed single-precision floating-point values.
MOVSS — Move scalar single-precision floating-point value between XMM registers or between an XMM register and memory.

SSE packed arithmetic instructions


ADDPS — Add packed single-precision floating-point values.
ADDSS — Add scalar single-precision floating-point values.
SUBPS — Subtract packed single-precision floating-point values.
SUBSS — Subtract scalar single-precision floating-point values.
MULPS — Multiply packed single-precision floating-point values.
MULSS — Multiply scalar single-precision floating-point values.
DIVPS — Divide packed single-precision floating-point values.
DIVSS — Divide scalar single-precision floating-point values.
RCPPS — Compute reciprocals of packed single-precision floating-point values.
RCPSS — Compute reciprocal of scalar single-precision floating-point values.
SQRTPS — Compute square roots of packed single-precision floating-point values.
SQRTSS — Compute square root of scalar single-precision floating-point values.
RSQRTPS — Compute reciprocals of square roots of packed single-precision floating-point values.
RSQRTSS — Compute reciprocal of square root of scalar single-precision floating-point values.
MAXPS — Return maximum packed single-precision floating-point values.
MAXSS — Return maximum scalar single-precision floating-point values.
MINPS — Return minimum packed single-precision floating-point values.
MINSS — Return minimum scalar single-precision floating-point values.

SSE comparison instructions


CMPPS — Compare packed single-precision floating-point values.
CMPSS — Compare scalar single-precision floating-point values.
COMISS — Perform ordered comparison of scalar single-precision floating-point values and set flags in rflags register.
UCOMISS — Perform unordered comparison of scalar single-precision floating-point values and set flags in rflags register.

SSE logical instructions


ANDPS — Perform bitwise logical AND of packed single-precision floating-point values.
ANDNPS — Perform bitwise logical AND NOT of packed single-precision floating-point values.
ORPS — Perform bitwise logical OR of packed single-precision floating-point values.
XORPS — Perform bitwise logical XOR of packed single-precision floating-point values.

SSE shuffle and unpack instructions


SHUFPS — Shuffles values in packed single-precision floating-point operands.
UNPCKHPS — Unpacks and interleaves the two high-order values from two single-precision floating-point operands.
UNPCKLPS — Unpacks and interleaves the two low-order values from two single-precision floating-point operands.
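
The element selection done by these three instructions is easier to see in code. The C sketch below treats each XMM register as an array of four floats (element 0 is the lowest); the function names mirror the mnemonics but are only models.

```c
#include <stdint.h>

/* SHUFPS-like: the two low results come from dst, the two high ones from
 * src; each 2-bit field of imm8 picks the source element. */
static void shufps(float dst[4], const float src[4], uint8_t imm8)
{
    float r[4];
    r[0] = dst[imm8 & 3];
    r[1] = dst[(imm8 >> 2) & 3];
    r[2] = src[(imm8 >> 4) & 3];
    r[3] = src[(imm8 >> 6) & 3];
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}

/* UNPCKLPS-like: interleave the two low elements of dst and src. */
static void unpcklps(float dst[4], const float src[4])
{
    float r[4] = { dst[0], src[0], dst[1], src[1] };
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}

/* UNPCKHPS-like: interleave the two high elements of dst and src. */
static void unpckhps(float dst[4], const float src[4])
{
    float r[4] = { dst[2], src[2], dst[3], src[3] };
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}
```

With `dst = {0, 1, 2, 3}`, `src = {4, 5, 6, 7}` and `imm8 = 0x1B`, the shuffle produces `{3, 2, 5, 4}`.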

SSE conversion instructions


CVTPI2PS — Convert packed doubleword integers to packed single-precision floating-point values.
CVTSI2SS — Convert doubleword integer to scalar single-precision floating-point value.
CVTPS2PI — Convert packed single-precision floating-point values to packed doubleword integers.
CVTTPS2PI — Convert with truncation packed single-precision floating-point values to packed doubleword integers.
CVTSS2SI — Convert a scalar single-precision floating-point value to a doubleword integer.
CVTTSS2SI — Convert with truncation a scalar single-precision floating-point value to a scalar doubleword integer.

SSE MXCSR management instructions


LDMXCSR — Load MXCSR register.
STMXCSR — Save MXCSR register state.

SSE 64-bit integer instructions (MMX enhancements)


PAVGB / PAVGW — Compute average of packed unsigned byte/word integers.
PEXTRW — Extract word.
PINSRW — Insert word.
PMAXUB — Maximum of packed unsigned byte integers.
PMAXSW — Maximum of packed signed word integers.
PMINUB — Minimum of packed unsigned byte integers.
PMINSW — Minimum of packed signed word integers.
PMOVMSKB — Move byte mask.
PMULHUW — Multiply packed unsigned integers and store high result.
PSADBW — Compute sum of absolute differences.
PSHUFW — Shuffle packed integer word in MMX register.
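
Two of these are worth spelling out. The C sketch below models one PAVGB lane, which rounds the average up, and the 64-bit PSADBW, which sums eight absolute differences. The function names are illustrative.

```c
#include <stdint.h>

/* One PAVGB lane: average of two unsigned bytes, rounded up. */
static uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1) >> 1);
}

/* PSADBW-like (64-bit form): sum of absolute differences of eight
 * unsigned bytes; the result fits in 16 bits. */
static uint16_t psadbw(const uint8_t a[8], const uint8_t b[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i]) : (unsigned)(b[i] - a[i]);
    return (uint16_t)sum;
}
```

The average of 1 and 2 is 2 because of the rounding, and the intermediate sum is wide enough that averaging 255 with 255 still gives 255.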

SSE cacheability control, prefetch and ordering instructions


MASKMOVQ — Non-temporal store of selected bytes from an MMX register into memory.
MOVNTQ — Non-temporal store of quadword from an MMX register into memory.
MOVNTPS — Non-temporal store of four packed single-precision floating-point values from an XMM register into memory.
PREFETCHh — Load 32 or more bytes from memory to a selected level of the processor’s cache hierarchy.
SFENCE — Serializes store operations.

SSE2 overview


Packed and scalar 128-bit double-precision floating-point instructions.
Additional 64-bit and 128-bit packed byte/word/doubleword/quadword integers instructions.
128-bit versions of integer instructions introduced with MMX and SSE.
Additional cacheability-control and instruction-ordering instructions.

SSE2 FP64 data movement instructions


MOVAPD — Move two aligned packed double-precision floating-point values between XMM registers or between an XMM register and memory.
MOVUPD — Move two unaligned packed double-precision floating-point values between XMM registers or between an XMM register and memory.
MOVHPD — Move high packed double-precision floating-point value to and from the high quadword of an XMM register and memory.
MOVLPD — Move low packed double-precision floating-point value to and from the low quadword of an XMM register and memory.
MOVMSKPD — Extract sign mask from two packed double-precision floating-point values.
MOVSD — Move scalar double-precision floating-point value between XMM registers or between an XMM register and memory.

SSE2 FP64 packed arithmetic instructions


ADDPD — Add packed double-precision floating-point values.
ADDSD — Add scalar double-precision floating-point values.
SUBPD — Subtract packed double-precision floating-point values.
SUBSD — Subtract scalar double-precision floating-point values.
MULPD — Multiply packed double-precision floating-point values.
MULSD — Multiply scalar double-precision floating-point values.
DIVPD — Divide packed double-precision floating-point values.
DIVSD — Divide scalar double-precision floating-point values.
SQRTPD — Compute packed square roots of packed double-precision floating-point values.
SQRTSD — Compute scalar square root of scalar double-precision floating-point values.
MAXPD — Return maximum packed double-precision floating-point values.
MAXSD — Return maximum scalar double-precision floating-point values.
MINPD — Return minimum packed double-precision floating-point values.
MINSD — Return minimum scalar double-precision floating-point values.

SSE2 FP64 logical instructions


ANDPD — Perform bitwise logical AND of packed double-precision floating-point values.
ANDNPD — Perform bitwise logical AND NOT of packed double-precision floating-point values.
ORPD — Perform bitwise logical OR of packed double-precision floating-point values.
XORPD — Perform bitwise logical XOR of packed double-precision floating-point values.

SSE2 FP64 compare instructions


CMPPD — Compare packed double-precision floating-point values.
CMPSD — Compare scalar double-precision floating-point values.
COMISD — Perform ordered comparison of scalar double-precision floating-point values and set flags in rflags register.
UCOMISD — Perform unordered comparison of scalar double-precision floating-point values and set flags in rflags register.

SSE2 FP64 shuffle and unpack instructions


SHUFPD — Shuffles values in packed double-precision floating-point operands.
UNPCKHPD — Unpacks and interleaves the high values from two packed double-precision floating-point operands.
UNPCKLPD — Unpacks and interleaves the low values from two packed double-precision floating-point operands.

SSE2 FP64 conversion instructions


CVTPD2PI — Convert packed double-precision floating-point values to packed doubleword integers.
CVTTPD2PI — Convert with truncation packed double-precision floating-point values to packed doubleword integers.
CVTPI2PD — Convert packed doubleword integers to packed double-precision floating-point values.
CVTPD2DQ — Convert packed double-precision floating-point values to packed doubleword integers.
CVTTPD2DQ — Convert with truncation packed double-precision floating-point values to packed doubleword integers.
CVTDQ2PD — Convert packed doubleword integers to packed double-precision floating-point values.
CVTPS2PD — Convert packed single-precision floating-point values to packed double-precision floating-point values.
CVTPD2PS — Convert packed double-precision floating-point values to packed single-precision floating-point values.
CVTSS2SD — Convert scalar single-precision floating-point values to scalar double-precision floating-point values.
CVTSD2SS — Convert scalar double-precision floating-point values to scalar single-precision floating-point values.
CVTSD2SI — Convert scalar double-precision floating-point values to a doubleword integer.
CVTTSD2SI — Convert with truncation scalar double-precision floating-point values to scalar doubleword integers.
CVTSI2SD — Convert doubleword integer to scalar double-precision floating-point value.

SSE2 FP32 instructions (SSE enhancements)


CVTDQ2PS — Convert packed doubleword integers to packed single-precision floating-point values.
CVTPS2DQ — Convert packed single-precision floating-point values to packed doubleword integers.
CVTTPS2DQ — Convert with truncation packed single-precision floating-point values to packed doubleword integers.

SSE2 integer instructions


MOVDQA — Move aligned double quadword.
MOVDQU — Move unaligned double quadword.
MOVQ2DQ — Move quadword integer from MMX to XMM registers.
MOVDQ2Q — Move quadword integer from XMM to MMX registers.
PMULUDQ — Multiply packed unsigned doubleword integers.
PADDQ — Add packed quadword integers.
PSUBQ — Subtract packed quadword integers.
PSHUFLW — Shuffle packed low words.
PSHUFHW — Shuffle packed high words.
PSHUFD — Shuffle packed doublewords.
PSLLDQ — Shift double quadword left logical.
PSRLDQ — Shift double quadword right logical.
PUNPCKHQDQ — Unpack high quadwords.
PUNPCKLQDQ — Unpack low quadwords.

SSE2 cacheability control and ordering instructions


CLFLUSH — Flush cacheline.
LFENCE — Serializes load operations.
MFENCE — Serializes load and store operations.
PAUSE — Improves the performance of “spin-wait loops”.
MASKMOVDQU — Non-temporal store of selected bytes from an XMM register into memory.
MOVNTPD — Non-temporal store of two packed double-precision floating-point values from an XMM register into memory.
MOVNTDQ — Non-temporal store of double quadword from an XMM register into memory.
MOVNTI — Non-temporal store of a doubleword from a general-purpose register into memory.

References


https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
http://www.agner.org/optimize/optimizing_assembly.pdf
https://www.nasm.us/xdoc/2.13.03/nasmdoc.pdf
https://godbolt.org/
https://www.lri.fr/~filliatr/ens/compil/x86-64.pdf
https://0xax.github.io/categories/assembler/

Instruction tables


http://www.agner.org/optimize/instruction_tables.pdf

Examples


https://github.com/torvalds/linux/tree/master/arch/x86
https://gist.github.com/rygorous/bf1659bf6cd1752ed114367d4b87b302
https://www.csee.umbc.edu/portal/help/nasm/sample_64.shtml

Utils


https://software.intel.com/sites/landingpage/IntrinsicsGuide/
https://git.ffmpeg.org/gitweb/ffmpeg.git/blob_plain/HEAD:/libavutil/x86/x86inc.asm
https://gist.github.com/rygorous/f729919ff64526a46e591d8f8b52058e