Notes on x86-64 Assembly and Machine Code

Mike's x86-64 Assembly (ASM) Notes

Assembling Binary Machine Code

Operating Modes:

These determine the assumed/default size of instruction operands, restrict which opcodes are available, and govern how they are used.

Modern operating systems boot in Real mode and must escalate first to Protected mode, and then to Long mode, as support and capability are detected. This is done to remain backward-compatible.

This means modern applications run exclusively in Long 64-bit mode.

| Mode | Default Operand Size | Default Address Size | Description |
|---|---|---|---|
| Long | 32-bit | 64-bit | Latest. |
| Protected | 32-bit | 32-bit | Legacy. Introduced segment registers (protected virtual addresses). |
| Real | 16-bit | 16-bit | Legacy. Unlimited direct access to addressable memory. Compatible with the oldest x86 CPUs. |

There are also modes called Virtual 8086 and Long Compatibility, which are middle steps that emulate the previous mode. They are meant for backward-compatibility and provide fast context-switching for multi-tasking. (ie. So you can run 32-bit applications in a 64-bit operating system.)

Data Types:

Common variations you'll see:

| Type | Bits | Bytes | Aliases |
|---|---|---|---|
| n/a | 4 | ½ | nibble, semioctet (rarely mentioned) |
| BYTE | 8 | 1 | byte, octet, char |
| WORD | 16 | 2 | word, short |
| DWORD | 32 | 4 | long, doubleword, longword, int, int32 |
| QWORD | 64 | 8 | longword, long long, quadword, int64 |
| n/a | 128 | 16 | octaword, double quadword (for data-heavy maths) |
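
If you are writing NASM, these widths map directly onto its data directives; a minimal sketch (label names are illustrative):

```nasm
a_byte:  db 0x11                ; BYTE,  8 bits
a_word:  dw 0x2222              ; WORD,  16 bits
a_dword: dd 0x33333333          ; DWORD, 32 bits
a_qword: dq 0x4444444444444444  ; QWORD, 64 bits
```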

TRIVIA: The WORD type actually refers to the largest integer the CPU can process in a single instruction, but that was back when Intel 8086 processors were 16-bit. Though processor capabilities have improved, the Intel manuals, and therefore just about everything else, still refer to it as in the table above. However, you may find specialized processor documentation that applies the original definition to very new or very old hardware. Just read the manufacturer's manual to be sure you know what you are working with.

x86 Instruction Data Structure:

The length of any instruction must not exceed 15 bytes, or the processor will trigger an exception.

Data structure of a single instruction:

| Prefix | Opcode | Mod-Reg R/M | Scale-Index-Base (SIB) | Displacement | Immediate |
|---|---|---|---|---|---|
| 0-4 bytes | 1-3 bytes | 0-1 byte | 0-1 byte | 0, 1, 2, or 4 bytes | 0, 1, 2, 4, or 8 bytes |

References:

The Prefix

Each prefix byte is optional, but must appear in the following order:

| Prefix | Bytes | Effect |
|---|---|---|
| Legacy | 0xf0, 0xf2, 0xf3, 0x2e, 0x36, 0x3e, 0x26, 0x64, 0x65, 0x2e, 0x3e, 0x66, 0x67 | Mandatory for some older instructions. |
| REX | 0b0100WRXB | Enables 64-bit operand size and extended registers. |
| VEX/XOP | 2-3 bytes, complex | Vector [math] extensions (3 operands). |
Segment Register Prefix Byte

These are mostly relevant to Real and Protected modes, which have a related Global Descriptor Table (GDT).

| Mnemonic | Byte | Name | Legacy x86 Purpose | x64 Purpose |
|---|---|---|---|---|
| SS | 0x36 | Stack Segment | Pointer to process stack. | Pointer to 0x0; unused. |
| CS | 0x2e | Code Segment | Pointer to process code. | Pointer to 0x0; unused. |
| DS | 0x3e | Data Segment | Pointer to process data. (ie. strings) | Pointer to 0x0; unused. |
| ES | 0x26 | Extra Segment | Pointer to extra data. (User defined) | Pointer to 0x0; unused. |
| FS | 0x64 | F Segment | Pointer to extra data. (User defined) | Pointer to thread-local process data. |
| GS | 0x65 | G Segment | Pointer to extra data. (User defined) | Pointer to thread-local process data. |

These were designed for extended range, userland stability, and security--but they were eventually outmoded by the immense/unfathomable range provided by the 64-bit address space, and superseded by paging tables.

References:

REX Prefix Byte Data Structure (8 bits)

| Field | Bit Length | Effect |
|---|---|---|
| 0b0100 | 4 | Constant; recognizable magic prefix. |
| W | 1 | 1: 64-bit operand size (ie. RAX). 0: Default operand size (usually 32-bit, but per-instruction). |
| R | 1 | 1: Prefixes a 4th (most significant) bit to MODRM.(R)eg, mapping registers R8-R15. |
| X | 1 | 1: Prefixes a 4th bit to SIB.inde(X), mapping registers R8-R15. |
| B | 1 | 1: Prefixes a 4th bit to MODRM.rm and SIB.(B)ase, mapping registers R8-R15. |
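
To see the REX byte appear, here is a minimal sketch (NASM, 64-bit); the bytes a typical assembler emits are shown in the comments:

```nasm
mov eax, ecx    ; 89 c8      no REX needed: default 32-bit operand size
mov rax, rcx    ; 48 89 c8   REX.W=1 (0x48) selects 64-bit operands
mov r8,  rcx    ; 49 89 c8   REX.W=1 + REX.B=1 (0x49) maps MODRM.rm to R8
```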

Trivia:

  • In theory, only one REX prefix should be used. In practice, only the last one is taken into account.
  • A REX prefix before a Legacy prefix is silently ignored.

References:

The Operation Code (Opcode)

You can think of these as hardware level functions. When there are bugs in these functions, we have to wait until the next model of CPU is out to replace them. (ie. Meltdown and Spectre vulnerabilities announced in 2018)

Knowledge of fewer than 25 mnemonics from the very first set of 8086 instructions from 1978 is all that is required to build a basic program. Learn these first: ADD, CALL, CMP, DEC, DIV, HLT, IDIV, IMUL, INC, INT, INTO, IRET, JNZ, JMP, LEA, MOV, MUL, POP, PUSH, RET, RETN, SUB, XOR. In total there are around 560 unique mnemonics, with more added each year through extensions such as MMX, SIMD, 3DNow, and the latest hardware-level AES and SHA cryptography.

When converting a mnemonic like XOR to the correct byte in machine code, you realize it is not a single function--but a collection of more than 20 separate function overloads--where each implementation is specialized by the type of operands it can accept. So, if you were to browse a table showing all function overloads by opcode byte, you would find more than 1,070 in total, not including undocumented opcodes which people continue to discover through reverse engineering.

  • Primary Opcodes: In the first release of x86, we had only 1-byte opcodes.
  • Secondary Opcodes: Future opcodes made room by prefixing the escape byte 0x0f. These are 2-byte opcodes.
  • Opcode Extension: If the instruction does not require a second operand, then the 3-bit MODRM.reg field is considered an extension of the opcode. Since it can only be a value 0-7, it is noted as /digit (Opcode) like 0xda/0 FIADD, where 0 is the value of the opcode extension.
  • Multi-Byte Opcodes: Eventually, escape sequences 0x0f38 and 0x0f3a made way for 3-byte opcodes.

So, the operation code can be 1-3 bytes in length, but the last byte is considered primary.
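
To make the distinction concrete, a minimal sketch (NASM, 64-bit) showing one instruction of each flavor; the exact bytes can vary for instructions with more than one valid encoding, but these are what a typical assembler emits:

```nasm
add  ecx, edx    ; 01 d1     primary 1-byte opcode 0x01
imul eax, ecx    ; 0f af c1  secondary 2-byte opcode: escape 0x0f, then 0xaf
not  ecx         ; f7 d1     opcode extension: 0xf7 /2, MODRM.reg=0b010 selects NOT
```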

References:

Opcode Special Fields in the Primary Opcode Data structure (8-bits)

Not every primary opcode byte has special fields, but when one does, it's important to understand its meaning and possible values:

| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Special Field | Meaning | Example |
|---|---|---|---|---|---|---|---|---|---|---|
| . | . | . | . | . | . | . | w | PO.w | Width of operands. w=0: 8-bit BYTE. w=1: full width (16/32-bit), based on the Operand-Size Prefix. | 0x04 ADD |
| . | . | . | . | . | . | d | . | PO.d | Direction. d=0: target operand2 (from MODRM.reg to MODRM.rm). d=1: target operand1 (from MODRM.rm to MODRM.reg). | 0x00 ADD |
| . | . | . | . | . | . | s | . | PO.s | Sign-extend. s=0: no effect. s=1: the 8-bit immediate is sign-extended to fill the 16/32-bit destination. | 0x6b IMUL |
| . | . | . | . | t | t | t | n | PO.tttn | Condition test (ie. JMP IF tttn), maps to 16 variations. | 0x70 JO |
| . | . | . | . | . | r | e | g | PO.reg | General Register (0-7) | 0x40 INC |
| . | . | . | . | . | e | e | e | PO.eee | Debug Register (0-7) | |
| . | . | . | s | s | . | . | . | PO.sreg2 | Segment Register (0-3) (Legacy) | 0x06 PUSH |
| . | . | s | s | s | . | . | . | PO.sreg3 | Segment Register incl. Extras (0-7) (Legacy) | 0x0fa0 PUSH |

NOTE: When the alias is shown with mixed case letters, lowercase are 0 and uppercase are 1. (ie. tTtN is 0b0101)
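
As a quick illustration of the d field, a minimal sketch; ADD EAX, ECX has two equally valid encodings, and a typical assembler happens to emit the first:

```nasm
add eax, ecx
; 01 c8 : opcode 0x01 (d=0, w=1), value flows from MODRM.reg (ECX) to MODRM.rm (EAX)
; 03 c1 : opcode 0x03 (d=1, w=1), value flows from MODRM.rm (ECX) to MODRM.reg (EAX)
```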

References:

Operand types

Some opcodes accept 0, 1, 2, or 3 operands.
You will see these referred to by how they are passed via the Mod-Reg R/M byte,
in which case there are 3 types of operands an opcode can accept:

| Operand Type | Notation | Description |
|---|---|---|
| Immediate | imm<bits> | Binary value fitting entirely within the instruction. |
| Register | r<bits> | 3-bit reference to one of eight on-processor General Purpose Registers, which is expected to already hold a valid value. |
| Memory | m<bits> | A pointer to system address space, where another value begins. |

Where <bits> is one of 8, 16, 32, 64, or 128.

The Immediate Operand Type

We will discuss this type first because it is the simplest.

Some instructions use data encoded in the instruction itself as a source operand. Arithmetic instructions allow the source operand to be an immediate value. The maximum value allowed for an immediate operand varies among instructions, but can never be greater than the maximum value of an unsigned doubleword integer (2³²−1).

For example, 0x142f is the immediate operand in this instruction:

ADD EAX, 142fh

The size of the immediate operand is determined by the opcode.
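
For instance, a minimal sketch (NASM) of the instruction above next to its 8-bit cousin; the immediate is simply appended to the instruction, least significant byte first:

```nasm
add eax, 0x142f   ; 05 2f 14 00 00   ADD EAX, imm32 (opcode 0x05)
add al,  0x7f     ; 04 7f            ADD AL, imm8   (opcode 0x04)
```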

The Register Operand Type

This type is the next simplest. It only requires 1 byte, the Mod-Reg R/M byte,
which can specify one of the following tuple combinations:

| MODRM.mod (2-bits, 0-3) | MODRM.reg (3-bits, 0-7) (reg/opcode) | MODRM.rm (3-bits, 0-7) (register/memory) |
|---|---|---|
| 0b11 | opcode extension | register |
| 0b11 | register | register |
| 0b00, 0b01, or 0b10 | register | memory addressing mode (via subsequent Scale-Index-Base byte) |

When we reference a register in MODRM.reg or MODRM.rm, we are expecting that the register holds the value the operation needs.

Example:

MOV EAX, ECX

But in the third case above, we can also place references to a register in SIB.index and SIB.base, which means the register holds a [memory] address that the CPU will dereference, returning the value held at that address instead.

Example:

MOV EAX, [ECX]

Mapping the Width of an Operand

The width of a register or memory address operand (8/16/32/64/128 bits)
is determined by several factors, of which these are some, highest precedence first:

  • REX.W=1 Prefix
  • L Flag in Code Segment Descriptor
  • 0x66 Operand-Size Prefix
  • 0x67 Address-Size Prefix
  • 64-bit Long operating mode

In 64-bit Long mode, the combinations work out as follows:

| REX.W=1 Prefix | | | | | yes | yes | yes | yes |
|---|---|---|---|---|---|---|---|---|
| 0x66 Operand-Size Prefix | | | yes | yes | | | yes | yes |
| 0x67 Address-Size Prefix | | yes | | yes | | yes | | yes |
| Effective Operand Size | 32 | 32 | 16 | 16 | 64 | 64 | 64 | 64 |
| Effective Address Size | 64 | 32 | 64 | 32 | 64 | 32 | 64 | 32 |

What the opcode defines as acceptable operand widths also matters.
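
A minimal sketch (NASM, 64-bit Long mode) of the same MOV at the three operand sizes; the assembler adds the 0x66 prefix or REX.W for you, and may pick an equivalent encoding for the 64-bit form:

```nasm
mov ax,  1    ; 66 b8 01 00            0x66 prefix drops the default 32-bit size to 16
mov eax, 1    ; b8 01 00 00 00         default operand size (32-bit)
mov rax, 1    ; 48 c7 c0 01 00 00 00   REX.W=1 promotes the operand size to 64-bit
```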

The General Purpose Registers

Once you know the width of the register holding a value or an address to dereference, it's simply a matter of mapping 3 bits to one of eight registers (A, B, C, D, BP, SP, SI, DI). In Long mode there is an extra 4th bit provided by the REX/VEX/XOP prefixes, which unlocks eight additional registers (R8-15). All of these are 64-bit registers, but the operand width (discussed above) determines how many bits you are actually reading/writing per-instruction.

The exact meaning of the values held by each register is imbued by a combination of the opcodes, and calling conventions determined in the context of your operating system and the compiler that assembled your program. But it is helpful to know a few general meanings that are universal:

| Register | Name | Commonly used as |
|---|---|---|
| A | Accumulator | Return value, especially the sum of arithmetic operations. |
| B | Base index | Starting point of an array or list structure. |
| C | Counter | Used by loops, ie. the i in for(int i=0; i<9; i++) |
| D | Data | Extended space for the accumulator. (ie. 32-bit mode will combine EAX+EDX to work on 64-bit values) |
| BP | Base Pointer | Pointer to the address of the current stack frame. (where function parameters end, and local variables begin) |
| SP | Stack Pointer | Pointer to the address of the last bytes PUSHed to memory. |
| SI | Source Index | Starting point of unbounded stream data, especially a string. |
| DI | Destination Index | Ending point of unbounded data, especially in slicing operations. |

As a helpful mnemonic convention when programming assembly and referencing documentation, Intel defines a set of prefixes (R=64-bit, E=32-bit, none=16-bit) and suffixes (X/D=DWORD, W=WORD, L/B=Low BYTE, H=High BYTE) for referring to these registers, which describe both a) the operand width, and b) where those bits are located within the full register.

                            | If least significant byte first (little-endian)
                A register [0100011101001111010011110100010001001010010011110100001000100001]
                    offset  0       8       16             32                             64
          (Low 8-bits)  AL  |<----->|       |              |                               |
         (High 8-bits)  AH          |<----->|              |                               |
         (Low 16-bits)  AX  |<------------->|              |                               |
         (Low 32-bits) EAX  |<---------------------------->|                               |
(Full 64-bit register) RAX  |<------------------------------------------------------------>|

While there are several places you may reference a register, including MODRM.reg, MODRM.rm, SIB.index, SIB.base, and PO.reg, you'll find they all use the same 3 or 4-bit mapping convention, as follows:

| Register Reference (3-bit / 4th bit=0b1) | Low 8-bits³ | High 8-bits¹ ³ | Low 16-bits | Low 32-bits⁴ | Full 64-bit Register |
|---|---|---|---|---|---|
| 0b000 | AL/R8B | | AX/R8W | EAX/R8D | RAX/R8 |
| 0b001 | CL/R9B | | CX/R9W | ECX/R9D | RCX/R9 |
| 0b010 | DL/R10B | | DX/R10W | EDX/R10D | RDX/R10 |
| 0b011 | BL/R11B | | BX/R11W | EBX/R11D | RBX/R11 |
| 0b100 | SPL²/R12B | AH | SP/R12W | ESP/R12D | RSP/R12 |
| 0b101 | BPL²/R13B | CH | BP/R13W | EBP/R13D | RBP/R13 |
| 0b110 | SIL²/R14B | DH | SI/R14W | ESI/R14D | RSI/R14 |
| 0b111 | DIL²/R15B | BH | DI/R15W | EDI/R15D | RDI/R15 |

NOTES:

  1. The high 8-bit registers (AH, CH, DH, BH ) are not addressable when a REX prefix is used.
  2. These low 8-bit registers (SPL, BPL, SIL, DIL) are only addressable when a REX prefix is used.
    This is because the 3-bit mappings used for them are overlapping, as seen in the footnote and table above.
    In fact, the lower 8 bits of SP, BP, SI, and DI were not even addressable before x64 Long mode.
  3. Both high and low 8-bit registers are only directly addressable from Real mode or Virtual 8086 mode,
    but you can always grab the larger-width version of the same register, and it will contain those bytes, of course.
  4. WARNING: writing to a 32-bit register zero-extends the result into the full 64-bit register in Long mode.
    (ie. INC EAX will zero the upper 32 bits of RAX, but INC AL or INC AX will not; see the sketch below.)
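
A minimal sketch (NASM, 64-bit) of footnote 4 in action; the starting constant is arbitrary, and the comments show what RAX holds after each write:

```nasm
mov rax, 0x1122334455667788
mov al,  0xff         ; only bits 0-7 change:      rax = 0x11223344556677ff
mov ax,  0xffff       ; only bits 0-15 change:     rax = 0x112233445566ffff
mov eax, 0xffffffff   ; 32-bit write zero-extends: rax = 0x00000000ffffffff
```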

References:

The Memory Address Operand

This is the most complex type of operand, but not too complex.
If either or both of your source and destination operands are inside system address space,
you will have to use these 2-3 bytes:

| Data Structure | Size | Presence |
|---|---|---|
| Mod-Reg R/M | 8-bits | Required |
| Scale-Index-Base (SIB) | 8-bits | Required when MODRM.rm=0b100 |
| Displacement | 0/8/16/32-bits | Optional |

The structure of SIB is, briefly:

| Scale | Index | Base |
|---|---|---|
| 2-bits | 3-bits | 3-bits |

When calculating the address, the formula is, generally:

Real Address = Segment + SIB.base + (SIB.index × SIB.scale) + Displacement

Where:

| Variable | Meaning |
|---|---|
| Segment | Augend to the following variables. Remember most segments are mapped to 0x00 in Long mode. |
| SIB.base | Refers to a register whose value holds the augend to the product of SIB.scale and SIB.index. |
| SIB.scale | Multiplier of SIB.index: 0b00=×1, 0b01=×2, 0b10=×4, 0b11=×8 |
| SIB.index | Refers to a register whose value holds the multiplicand scaled by SIB.scale. |
| Displacement | Literal value holding an actual relative address; an addend to all previous variables. When MODRM.mod=0b00 and MODRM.rm=0b101 (no SIB byte) in Long mode, the displacement is relative to the RIP/EIP instruction pointer. |

While the order always remains the same, certain variables are omitted according to the current addressing mode. This is determined by MODRM.mod; when one of its three encodings references a memory address--0b00, 0b01, or 0b10--it is then combined with the MODRM.rm field, for a total of 24 possibilities, and these specify the various memory addressing modes, as follows:

With 16-bit registers (Real or Protected modes):

| MODRM.mod \ MODRM.rm | 0b000 (AX) | 0b001 (CX) | 0b010 (DX) | 0b011 (BX) | 0b100 (SP) | 0b101 (BP¹) | 0b110 (SI) | 0b111 (DI) |
|---|---|---|---|---|---|---|---|---|
| 0b00 | [BX+SI] | [BX+DI] | [BP+SI] | [BP+DI] | [SI] | [DI] | disp16² | [BX] |
| 0b01 | [BX+SI]+disp8³ | [BX+DI]+disp8 | [BP+SI]+disp8 | [BP+DI]+disp8 | [SI]+disp8 | [DI]+disp8 | [BP]+disp8 | [BX]+disp8 |
| 0b10 | [BX+SI]+disp16 | [BX+DI]+disp16 | [BP+SI]+disp16 | [BP+DI]+disp16 | [SI]+disp16 | [DI]+disp16 | [BP]+disp16 | [BX]+disp16 |

NOTES:

  1. The default segment register is SS for the BP register, DS for everything else.
  2. disp<bits> means Displacement with a width of said <bits>.
  3. Warning: disp8 is sign-extended wherever it is allowed to be used.
  4. The SIB byte cannot be used in Real mode.

With 32-bit (Protected or Long modes) and 64-bit registers (Long mode):

| MODRM.mod \ MODRM.rm/B¹ | 0b000/1 (EAX/R8) | 0b001/1 (ECX/R9) | 0b010/1 (EDX/R10) | 0b011/1 (EBX/R11) | 0b100/1 (ESP/R12) | 0b101/1 (EBP/R13) | 0b110/1 (ESI/R14) | 0b111/1 (EDI/R15) |
|---|---|---|---|---|---|---|---|---|
| 0b00 | [EAX/R8] | [ECX/R9] | [EDX/R10] | [EBX/R11] | [SIB] | [RIP/EIP]²+disp32 | [ESI/R14] | [EDI/R15] |
| 0b01 | [EAX/R8]+disp8 | [ECX/R9]+disp8 | [EDX/R10]+disp8 | [EBX/R11]+disp8 | [SIB] | [EBP/R13]+disp8 | [ESI/R14]+disp8 | [EDI/R15]+disp8 |
| 0b10 | [EAX/R8]+disp32 | [ECX/R9]+disp32 | [EDX/R10]+disp32 | [EBX/R11]+disp32 | [SIB] | [EBP/R13]+disp32 | [ESI/R14]+disp32 | [EDI/R15]+disp32 |

Where SIB equals:

| Formula | MODRM.mod | B¹+SIB.base | X³+SIB.index |
|---|---|---|---|
| disp32 | 0b00 | 0d5,13 | 0d4 |
| [SIB.index × SIB.scale] + disp32 | 0b00 | 0d5,13 | 0d0-3,5-15 |
| [SIB.base] | 0b00 | 0d0-4,6-12,14-15 | 0d4 |
| [SIB.base] + [SIB.index × SIB.scale] | 0b00 | 0d0-4,6-12,14-15 | 0d0-3,5-15 |
| [SIB.base] + disp8 | 0b01 | 0d0-15 | 0d4 |
| [SIB.base] + [SIB.index × SIB.scale] + disp8 | 0b01 | 0d0-15 | 0d0-3,5-15 |
| [SIB.base] + disp32 | 0b10 | 0d0-15 | 0d4 |
| [SIB.base] + [SIB.index × SIB.scale] + disp32 | 0b10 | 0d0-15 | 0d0-3,5-15 |

NOTES:

  1. Variable B represents that a prefix REX.B, VEX.B, or XOP.B is present, enabling R8-R15 MODRM.rm and SIB.base registers.
  2. In Protected mode, this is actually just zero-based 0+disp32 displacement addressing.
    But Long mode changes this to RIP-relative by default, or EIP-relative (when the 0x67 Address-Size Prefix is also present).
    If you want zero-based behavior in Long mode, you must use one of the SIB byte forms and make its address effectively zero.
  3. Variable X represents that a prefix REX.X, VEX.X, or XOP.X is present, enabling R8-R15 SIB.index registers.
  4. Format of this column is a list of 4-bit unsigned decimal ranges, to keep the table compact.
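
Putting the formula and tables together, here is a minimal sketch (NASM, 64-bit) of the common memory addressing forms; `my_var` is a hypothetical variable added only to demonstrate the RIP-relative form:

```nasm
section .data
my_var: dd 0

section .text
addressing_demo:
    mov eax, [rbx]               ; base only
    mov eax, [rbx + 8]           ; base + disp8
    mov eax, [rbx + rcx*4]       ; base + (index x scale), encoded with a SIB byte
    mov eax, [rbx + rcx*4 + 16]  ; base + (index x scale) + disp8
    mov eax, [rel my_var]        ; RIP-relative disp32 (the Long mode default for globals)
    ret
```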

References:


Appendix: Let's Manually Assemble an Instruction!

Let's translate the following NASM-compatible assembly instruction into 32/64-bit compatible machine code:

opcode operand1 operand2
XOR CL, [12H]

Beginning with the opcode byte first, consulting the Intel IA-32 manual, Volume 2C, Chapter 5, "XOR"--we find 0x32 XOR, which states a) it requires 2 operands, b) the operands have a direction, and the first operand is the destination, c) the first operand is a register of 8-bit width, d) the second operand is also 8-bit but can be either a register or memory address, and e) the destination register CL will be overwritten to contain the result of the operation. This fits our case above, because the first operand is CL (L meaning the lower 8 bits of the C register), and the second operand is a reference to the value stored in memory at 0x12 (a direct/absolute pointer or address reference). It doesn't look like we need any prefix bytes to get the operand sizes we want.

As an interesting observational aside, this opcode has special fields of 001100dw:

  • d=1 because the register is the destination.
  • w=0 because the operands (r8, r/m8) are 8-bit.

Now we know we need a ModR/M byte, because the opcode requires it: a) it requires more than zero operands, b) they are not defined within the opcode or any prefix, and c) there is no Immediate operand. So again we consult the Intel manual, Volume 2A, Chapter 2, Section 2.1.5 "Addressing-Mode Encoding of ModR/M and SIB Bytes", Table 2-2 "32-Bit Addressing Forms with the ModR/M Byte". We know the first operand is going to be our destination register, CL, so we see that maps to REG=001b. Next we look for an Effective Address formula which matches our second operand, which is a displacement with no register (and therefore no segment, base, scale, or index). The nearest match is going to be disp32, but reading the table is tricky because of the footnotes. Basically our formula is not in that table; the one we want requires a SIB byte, noted as [--][--], which tells us we need to specify Mod=0b00, R/M=0b100 to enable the SIB byte. Our second byte is therefore 0b00001100 or 0x0C.

We know the SIB byte, if it is used, always follows the ModR/M byte, so we continue to the next Table 2-3 "32-Bit Addressing Forms with the SIB Byte" in the Intel manual, and look for the combination of Scale, Index, and Base values which will give us the disp32 formula we need. Notice there is a footnote [*]; it basically tells us to specify Scale=00b, Index=100b, Base=101b, which means disp32 with no index, no scale, and no base. So our third byte is now 0x25.

We know the Displacement byte, if used, always follows the ModR/M and SIB byte, so here we simply specify our 32-bit unsigned integer value in little-endian, meaning our next four bytes are 0x12000000.

Finally, we have our machine code:

XOR CL, [12H] = 00110010 00001100 00100101 00010010 00000000 00000000 00000000 = 32 0c 25 12 00 00 00

This instruction works in both 32-bit Protected mode and 64-bit Long mode.

And here is the 16-bit version for Real mode:

XOR CL, [12H] = 00110010 00001110 00010010 00000000 = 32 0e 12 00
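
If you want to double-check these bytes, here is a minimal sketch you can feed to NASM (e.g. `nasm -f bin check.asm -o check.bin`; file names are illustrative) and inspect with a hex dump. NASM picks the same SIB-based form in 64-bit mode because the plain disp32 encoding is repurposed there as RIP-relative:

```nasm
BITS 64
xor cl, [0x12]    ; 32 0c 25 12 00 00 00
BITS 16
xor cl, [0x12]    ; 32 0e 12 00
```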

References:


Appendix: x86 Extensions

As new models of the x86 family are released, the instruction set is extended with new features. Here we provide a chronologically ordered summary of what was added, when, and why.

History of the FPU

The floating point featureset deserves its own history.

In 1978, Intel introduced the 8086 CPU architecture. Processors at the time performed integer math only. This meant floating-point precision had to be emulated per-application in the software layer, which was slow, and difficult for the average programmer.

In 1980, Intel released the 8087 math co-processor, a separate chip designed to be installed alongside the 8086 CPU, exclusively for carrying out hardware-optimized mathematical operations with floating point numbers. This introduced +83 new hardware-optimized instructions, all beginning with the letter F.

It would be another 9 years before the Intel 80486, the first CPU with a built-in math co-processor. This introduced +8 new 80-bit registers called ST0-ST7. The instruction set remains the same for backward-compatibility.

Single Instruction, Multiple Data (SIMD) + Digital Signal Processing

SIMD is a classification of parallel processing strategy, where multiple processors perform the same operation on multiple data points simultaneously; allowing you to scale by processing N data points in the same number of clock cycles as just one.

Such machines exploit data level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.

Many ISA/PCI peripheral manufacturers (e.g., Creative Sound Blaster 16) were becoming popular for providing specialized digital signal processors which utilized SIMD.

Eventually, Intel reasoned that it made sense to centralize that technology into the CPU.

Intel MMX vs. AMD 3DNow!

In 1997, Intel released the P5-based Pentium line of microprocessors, designated as "Pentium with MMX Technology". This was effectively SIMD/DSP technology built into the CPU. It introduced +60 new instructions, but re-used the 80-bit FPU registers ST0-7, renaming the lower 64 bits MMX0-7.

The following year, AMD answered with the K6-2 processor featuring all the MMX instructions plus a few enhancements. The two companies competed in court fiercely over naming and rights to use the technology. AMD would eventually brand theirs as 3DNow!

These implementations both proved unpopular and are basically now deprecated, though you can still find their registers and instructions usable in modern processors.

Intel Streaming SIMD Extensions (SSE)

By 1999, Intel announced SSE as the successor to Intel MMX, which added +70 new instructions and +8 new 128-bit registers XMM0-7... later, when amd64 introduced +8 more registers XMM8-15, Intel followed suit.

This addressed two main problems: a) MMX only worked with integers, and b) switching between MMX/FPU instructions was too inefficient for practical use, because they had to share the same FPU registers.

There have been several versions of SSE to date, including SSE, SSE2, SSE3, SSSE3, SSE4a, SSE4.1, and the latest as of this writing SSE4.2. The latest version is backward compatible to the first version.

There is also the aborted AMD bastard child SSE5 or XOP which only existed briefly in one processor and was then abandoned after it was rejected by Intel.

While the names, implementations, and their exact instruction sets are different, the concept has remained the same--SIMD; whether you're doing video encoding, audio synthesis, or streaming textures to a GPU--you optimize by performing a single operation across a nice matrix/vector of floating point data whenever possible.

Advanced Vector Extensions (AVX / AVX2 / AVX512)

Today, certain processors designed for heavy workloads offer SIMD instruction sets that operate on even bigger registers:

  • AVX: Sixteen new 256-bit registers (YMM0-15), with the XMM registers occupying the lower 128 bits of the same numbered YMM register.
  • AVX-512: Thirty-two new 512-bit registers (ZMM0-31), with same numbered YMM and XMM registers occupying the lower 256 and 128 bits of the ZMM register.

Virtualization (Intel VT-x / AMD-V)

Leveraged by popular virtual machines / hypervisors to get closer-to-native performance for their guest OS.

Cryptography (AES-256, SHA-1)

Recently, Intel CPUs come with hardware implementations of these popular crypto functions for an easy performance boost.

References:


Floating Point Numbers (IEEE-754)

Floats come in various sizes. When serialized for compact transmission over the network, a clever dev may try to encode them as a string, or a tuple of 1-byte integers (integer and mantissa, optionally an exponent). But when you need the processor to do really quick, especially bulk, binary floating point math, the following is the standard form used everywhere.

SIMD instructions operate almost exclusively on ST0-7, MMX0-7, and more recently the XMM0-15 registers. When utilizing the high-precision 80/128-bit values, you may need to perform multiple MOV and PUSH operations to fill the entire register, since the other registers and immediate operands are much smaller. As an optimization, some instructions accept a memory pointer operand to read/write a long array of floats to/from a block of memory in one operation.

Data Structure:

  • 1-bit Sign (0=positive)
  • 8-bit base2 Exponent, stored with a +127 bias (why not signed two's complement?)
    Take the whole-number (integer) part, convert it to binary, strip any leading zeros, and count its digits minus one; that is the binary exponent. Add the +127 bias and encode the sum in these 8 bits.
  • 23-bit Mantissa a.k.a. Significand
    This is the integer and fractional parts concatenated, with the leading 1 dropped (it is implicit for normalized values).
    The fractional part is encoded as a base2 binary fraction, which commonly results in a repeating (continued) fraction pattern,
    which gets truncated--and can lead to infamous FPU rounding errors if not handled carefully.
    ex: 3.1f = 0b11 + 0b000 1100 1100 1100 1100 110... (the pattern would repeat infinitely if not truncated)
    The fraction bits are left-aligned in the 23-bit field, so any zero-fill happens on the right side.

Let's manually encode 1.0f!

  • sign: 0b0 = a positive number
  • mantissa: 1.0 normalized is 0b1.0; the leading 1 is implicit, so the stored 23 bits are all zeros
    (It is easier to calculate in this order because the mantissa value informs the exponent value.)
  • exponent: 0d0 + 0d127 = 0d127 = 0b01111111
IEEE-754 32-bit (single precision) Floating Point (x86; little-endian)

  offset  0  1          9                          32
  single [0  0111 1111  0000 0000 0000 0000 0000 000] = 0x3f800000 = 1.0f
          |  |       |  |                          |
   sign   1  |       |  |                          |
exponent     |<--8-->|  |                          |
mantissa                |<-----------23----------->|

The structure is the same for 64-bit (double precision) floats except the exponent has 11 bits, and a bias of +1023.
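
You rarely have to hand-encode these; a minimal sketch (NASM) that lets the assembler do the work, so you can compare a hex dump of the output against the encoding above:

```nasm
one_single: dd 1.0    ; bytes 00 00 80 3f  (0x3f800000, as derived above)
pi_single:  dd 3.1    ; bytes 66 66 46 40  (0x40466666; the repeating fraction is rounded)
one_double: dq 1.0    ; 0x3ff0000000000000 (11-bit exponent, +1023 bias)
```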

The exponent field has four magic values which have reserved special meanings:

| Exponent | Mantissa | Meaning |
|---|---|---|
| 0b0 | 0b0 | zero (0d0) |
| 0b0 | non-zero | denormalized |
| all 0b1's | 0b0 | Infinity |
| all 0b1's | non-zero | NaN ¹ |

NOTES:

  1. You can hide data inside the mantissa of NaN structures.
    Some compilers use this to specify more precise reason codes (ie. if NaN resulted from failed computation.)

References:


Appendix: Stack vs. Heap

The stack is a data structure in memory that the processor can understand and maintain, used for holding variables that wouldn't fit in CPU registers. Its structure is Last-In, First-Out (LIFO), growing from the bottom (highest address range) toward the top (approaching zero), like plates returning to a dishwasher in a cafeteria.

Typical candidates for the stack include CPU register data which is:

  • Too long or too many to fit in the desired registers.
  • Backed up prior and then restored after, so that your function may run without leaving unwanted traces or side-effects on functions that will follow.
  • Stateful data with a lifetime longer than a single opcode instruction, which includes almost every higher-than-assembly programming language feature (ie. concepts like function, for...loop, multi-variable expressions, etc.) and the Stack Frame, explained below.

The Stack Frame data structure

Pretend we have a function:

function playSound(name:string, volume:int, wait:bool):bool {
  var basePath = "C:\Sounds\";
  var delay = 1000;
  // ...
  return true;
}

and we execute it like:

playSound('moo.wav', 20, false);

Your operating system, compiler, and calling convention determine exactly how these should be laid out in the stack, but let's look at the common right-to-left C Declaration (cdecl) convention, and we'll assume we're operating in 32-bit Protected mode.

| Memory address | Little-endian value | Variable name | Relative offset | Length | Significance |
|---|---|---|---|---|---|
| 0x00000000 | ? | ? | ? | ? | Random data, not ours |
| ... | | | | | |
| 0xabcd3FE8 | 0x03e80000 | delay | [EBP-8] | 32-bits | 2nd local variable |
| 0xabcd3FEC | Address of "C:\Sounds\" string in DS | basePath | [EBP-4] | 32-bits | 1st local variable |
| 0xabcd3FF0 | ? | Frame Pointer (FP) | [EBP] | 32-bits | Backup of the EBP value from before our function began |
| 0xabcd3FF4 | ? | Return Address (RA) | [EBP+4] | 32-bits | Backup of the Instruction Pointer (IP) value; the address where we should JMP to return control to the calling function, once we are done executing ours |
| 0xabcd3FF8 | Address of "moo.wav" string in DS | | [EBP+8] | 32-bits | 1st argument |
| 0xabcd3FFC | 0x14000000 | | [EBP+12] | 32-bits | 2nd argument |
| 0xabcd4000 | 0x01000000 | | [EBP+16] | 32-bits | 3rd argument |

PUSH and POP instructions add/remove stack data, and decrement/increment the SP register which points to the top of the stack; the most recent byte written. The BP register is for use by the programmer, conventionally pointing at the byte occurring just prior to the current function's first local variable, a quick reference which you can offset positively to reach function arguments, or negatively to reach local variables.

By the time the function returns, everything it added has been removed again. Registers that held important values before the function began are now returned to their original values. The only thing remaining on the stack from this function is maybe a return value. This means any data which you wish to persist beyond the lifetime of a function cannot exist on the stack. (Unless you get creative with the return value or referencing data from a calling function occurring earlier in the calling hierarchy.)
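
A minimal sketch (NASM, 32-bit Protected mode, cdecl) of how a caller and callee would build and tear down the frame in the table above; labels and constant values are illustrative:

```nasm
caller:
    push dword 0            ; 3rd argument (wait=false), pushed right-to-left
    push dword 20           ; 2nd argument (volume)
    push dword sound_name   ; 1st argument (address of the name string)
    call playSound          ; PUSHes the Return Address, then jumps
    add  esp, 12            ; cdecl: the caller removes its 3 arguments
    ret

playSound:
    push ebp                ; back up the caller's frame pointer
    mov  ebp, esp           ; [EBP] now marks the base of our stack frame
    sub  esp, 8             ; reserve room for the 2 local variables
    mov  dword [ebp-8], 1000 ; 2nd local variable (delay)
    mov  eax, [ebp+8]       ; read the 1st argument (name)
    mov  eax, 1             ; return value: true
    mov  esp, ebp           ; discard the locals
    pop  ebp                ; restore the caller's frame pointer
    ret                     ; POP the Return Address and JMP back to it

sound_name: db "moo.wav", 0
```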

The Heap data structure

By now you'll see that the heap is the only place for long-lived data structures, which have no means to persist in either the registers or the stack; the heap is the place for "everything else."

The structure of the heap is determined by the programmer. It is nothing more than a blank slice of bytes for writing from a random section of free memory, typically reserved to an application upon malloc() request, and recycled upon free() or process end by the operating system, or virtual machine, depending on the environment.

Some applications like Java will reserve a large block of memory on process start, and have a very complex implementation of garbage collection so that they can work entirely in that single allocation for the life of the process. Others like the typical C/C++ application will reserve and free many small blocks of memory, repeatedly throughout the life of the process, relying on the operating system to try to keep it organized--which can lead to problems with alignment, fragmentation, and performance--as going back to the OS for more memory can be slow, and the OS is allowed to say "no", ie.:

  • Out-of-memory (OOM): Your extreme inefficiency, or that of another process, has exhausted the machine's resources.
  • Segmentation fault (segfault): Security/stability related; you're requesting an address within a code or data segment of a process that does not belong to you.

References:


Appendix: Big vs. Little Endianness

This only applies at the byte level. It is the order in which bytes are stored and read by the processor. The x86 processor expects little-endian, which means the least significant byte comes first (at the lowest address), so it appears leftmost when memory is dumped in address order.

ie. 0d2 is 0x02000000 in 32-bit little-endian, and 0d-2 is 0xfeffffff in 32-bit little-endian,
whereas the same values in big-endian would be 0x00000002 and 0xfffffffe.
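
A minimal sketch (NASM) you can assemble flat (`nasm -f bin`) and hex-dump to see the byte order for yourself:

```nasm
pos: dd 2     ; appears in the output file as 02 00 00 00
neg: dd -2    ; appears in the output file as fe ff ff ff
```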

WARNING: Sometimes tools like debuggers, disassemblers, calculators, etc. will print the values opposite to what you are expecting for the architecture in context. In these cases, they are simply trying to be too helpful. Be aware of the byte order, and maybe check with a hex editor or multiple tools to be certain when it matters.

QUIRK: Registers are typically drawn with EAX, AX, AH, AL on the right-hand side, but in fact if you set a value like 0d24 in RAX and then print the values of RAX, EAX, AX, AL you will see they all equal 0d24, and AH equals 0d0, which means that their slices all actually begin from the least significant byte first. I like to think that the registers are stored little-endian too, for consistency, and that all those drawings are backwards. It's uncommon to set RAX only to select EAX, so it may not matter, but it's a little trivia to be aware of.

References:


Appendix: Other Registers

As you master your understanding of x86 architecture, there are a few registers which exist but don't typically get talked about until the very end:


Appendix: Addressing Modes and Pointers

There is a common vernacular across all cpu architectures when describing pointers, which we'll attempt to summarize here.

Addressing Modes

| Term | Description |
|---|---|
| implied | Pre-determined by opcode; no way to affect. |
| stack | Implied, but affected by stack PUSH/POP. |
| register | The src/dst operand is a register. Pro: fast; within the CPU. |
| pc-relative | Signed (-128, +127) constant disp8 from the IP program counter (short jmp/addr). Pro: fast; within the instruction; ideal for jmp, branching, threading, fwd/bkwd. Con: limited max range. |
| direct | Memory address constant via displacement or immediate. Pro: fast; within the instruction. Con: unchangeable; the addr should not be modified once running/cpu-cached. |
| indirect | A [register or memory] address is a pointer to another memory address. Only variations of JMP and CALL will automatically dereference an indirect address; otherwise, manual dereferencing requires multiple instructions. Pro: can change the address pointed to at runtime. Con: slow; requires two or more memory accesses, and the memory to store them. |
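
A minimal sketch (NASM, 64-bit) of the pc-relative and indirect flavors applied to JMP; labels are illustrative:

```nasm
    jmp  short skip      ; pc-relative: signed disp8 forward (a "short" jump)
    nop
skip:
    lea  rax, [done]     ; load an address into a register
    jmp  rax             ; indirect: the CPU jumps to wherever RAX points
done:
    ret
```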

Definitions

An effective address is any operand to an instruction that references memory.

Calculated in some of the following ways:

  • indexed offset:

    segment + base + (scale * index) + displacement
    

    A segment address is always implied unless you override the selector.
    The rest is optional.
    Scale defaults to 1, while base, index, and displacement default to 0.

  • near pointer "segment_register:offset" or just "offset":
    address is relative to given segment register,
    otherwise relative to the default segment--which is usually DS but may vary by instruction.
    The SEGMENT REGISTERS are: CS, DS, SS, ES, FS, and GS

  • far pointer "segment_selector:offset" data type:
    two addr concat in single operand
    the segment_selector refers to the GDT which refers to a protected memory page
    the offset is the address relative to that.

References:


Appendix: Brief History of Assemblers

One of the earliest commercial-grade assembler tools was Microsoft Macro Assembler (MASM) in 1981. It was initially marketed for commercial use, and included documentation. Beginning with v7 (1991) it was only available packaged with various Microsoft SDKs and C compilers, and its license required you to own a copy of Visual Studio. Since then its documentation has also become sparse and difficult to get ahold of.

Its early influence led to many derivatives; importantly, it inspired the open-source Netwide Assembler (NASM) project, which is basically MASM with improvements that allow it to work across all platforms.

Some hardcore enthusiasts still author primarily in MASM and hone their techniques by collecting, preserving, and resharing rare code artifacts from fellow enthusiasts.

Today there are numerous assemblers to choose from, including the GNU Assembler (GAS) which ships with GNU binutils on Linux, but MASM and NASM remain the most common choices.

References:


Appendix: Reverse Engineering & Malware Analysis

References:


Appendix: Windows PE/COFF Binary format

Windows executables (*.exe, *.dll) use the Portable Executable (PE) format, which is a wrapper around the Common Object File Format (COFF), which is used by binary linker files (*.obj, *.lib). Technically, Windows 64-bit uses a version internally called PE32+.

A linker (ie. link.exe, cl.exe, ld, etc.) is basically designed to parse one or more COFF files, and wrap them into a single executable with a PE header.

Here is some useful trivia about that:

  • .obj is Windows COFF, .o is the equivalent Linux ELF; same purpose, different formats.
  • Microsoft COFF is an extended version of the original by AT&T.
  • .obj and .lib files contain a simple table data structure mapping unique ASCII string symbol names to code or address offsets in another file.
  • .lib may include source code (static), but most of the time (e.g., in Visual Studio) they are just header stubs (dynamic) with pointers to address offsets in a .dll which must match the exact release version and compiler used.
  • Confusingly, there is no trivial way to tell static and dynamic .lib files apart, except that [dynamic] import libraries for DLLs will be much smaller than the matching static library would be.
  • .lib files may only be used at compile time to build statically linked binaries.
  • .dll files are intended to only be used at runtime as dynamically linked binaries.
  • Technically .dll files contain enough information that a reverse engineer could statically link them without a .lib, if they wanted to.
  • If you only have a .dll, you may be missing the compile-time constants passed as function arguments. These are typically shared in the form of a C header (*.h) file, as part of an SDK (e.g, windows sdk , opengl sdk), if the developer wants you to have them. The other thing you may not have is the documentation about what inputs are valid, when, and what effect they have on the .dll functions. Though a determined hacker could successfully guess them by looking at example code which uses the .dll, or via fuzz testing.
  • The GNU linker (ld) from the gcc toolchain, as ported to Windows, can link using .dll inputs directly, which means it is able to implicitly synthesize the normally required but missing .lib stubs automagically!
  • Decorated names or mangled names are a symbol naming convention used in the COFF files. They are a series of ASCII prefix and suffixes which guarantee that each function is named uniquely when merged into the same flat COFF table format. The additional data mangled into the name includes:
    • The function name.
    • The class name that the function is a member of, if it is a member function.
      This may include the class that encloses the class that contains the function, and so on.
    • The namespace the function belongs to, if it is part of a namespace.
    • The C function parameter types, in order.
    • The calling convention.
    • The return type of the function.
  • You can decode decorated/mangled names using supplied tools, like so:
    "> C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\dumpbin.exe" /symbols "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\lib\amd64\msvcrt.lib"
    "> C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\undname.exe" "??$?RUTlsDtorNode@@@__crt_internal_free_policy@@QEBAXQEBUTlsDtorNode@@@Z"
    Undecoration of :- "??$?RUTlsDtorNode@@@__crt_internal_free_policy@@QEBAXQEBUTlsDtorNode@@@Z"
    is :- "public: void __cdecl __crt_internal_free_policy::operator()<struct TlsDtorNode>(struct TlsDtorNode const * __ptr64 const)const __ptr64"
    

References:


Appendix: Linux ELF Binary format

References:


Appendix: Writing a Compiler

References:


Appendix: Miscellaneous Tools & References
