## 86Bugs.txt
(C) Copyright 1993, 1994 By Harald Feldmann Revision 04, Nov 3rd 1994.


Hamarsoft's 86BUGS list, (C) 1993/94 By Hamarsoft (R)
──────────────────────────────────────────────────────────────────────────────

The 86BUGS list, distributed with Ralf Brown's Interrupt list, is maintained
and provided to you by Hamarsoft, the maker of the HAP & PAH data compression
program. Latest version of HAP & PAH is 3.14e. If you like this list you are
encouraged to register the HAP 3.00 shareware program. You will receive
the latest, registered, version of HAP 3.14e by air-mail on 3.5" diskette.
FTP to garbo.uwasa.fi and get pc/arcers/hap300re.zip   for more info.
────────────────────────────────┬───────────────────────────────────────────
To contact Hamarsoft, write to  │ or send e-mail over Internet to:
                                │ harald.feldmann@almac.co.uk
Hamarsoft,         New Address! ├───────────────────────────────────────────
Harald Feldmann,                │ or send e-mail to HARALD FELDMANN over
P.o. Box 451,                   │ Ilink in the international COMPRESS echo
6400 AL  Heerlen,               │ The p.o. box will be maintained if e-mail
The Netherlands                 │ should no longer be possible.
────────────────────────────────┴───────────────────────────────────────────
Various people have contributed to this list. They are mentioned in a
separate page, click on <acknowledgements> to see their names and e-mail
addresses. These people are not employed by or affiliated with Hamarsoft.

Hamarsoft and all people who contributed to the 86BUGS list do not accept
any liability whatsoever regarding the use, inability to use, correctness
or completeness of the information presented in the 86BUGS list.

Attention authors: if you mention this list in your article or book, please
send a courtesy copy to the P.o. box address by airmail. Thank you.

This is 86BUGS list revision level 04, issued November 3rd 1994.
(C) Copyright 1993, 1994 By Harald Feldmann.


Acknowledgements
──────────────────────────────────────────────────────────────────────────────

This file lists undocumented and buggy instructions of the Intel 80x86
family of processors as well as features of processors compatible with
Intel products. Note that Intel does not support the special features and
may decide to drop opcode variants and instructions in future products.
Wherever the notation 88,86,87,186,286,287,287xl,386,386sx,387,387sx,
486,486sx,487 and Pentium is used, Intel CPUs are referenced unless
noted otherwise.

All mentioned trademarks and/or tradenames are owned by the respective
owners and are acknowledged.

I would like to give credit to those who provided useful information or
who in another way contributed to the 86BUGS list.

9308 Chris Lueders  (chris_lueders@zaphod.fido.de) iAPX program & mul bugs
9311 Anthony Naggs  (amn@ubik.demon.co.uk) NEC differences and CPU tests
9407 Christian Ludloff (Ludwig-Khn-Str. 15, 09123 Chemnitz, Germany)
                    Discovered CPUID instruction on 486.
9410 Robert Mashlan (rmashlan@r2m.com) BOUND difference on NEC V20
9410 Anthony Naggs  (amn@ubik.demon.co.uk) POP CS & MOV CS on 86/88
                    SETALC on NEC & i186 BOUND difference, NEC specific
                    instructions.
9410 Christian Ludloff (see above for address) Pentium extensions (MSRs),
                    INFO and STAT programs.

If you contributed, but are not listed, please send a note.


AAA   Adjust After BCD Addition
──────────────────────────────────────────────────────────────────────────────

Mnemonic: AAA
Opcode  : 37  (88=8, 86=8, 286=3, 386=4, 486=3 clocks)
Bug in  : Different implementation in 88 and 86 versus 286+

Function:
On the 88 and 86, a carry out of AL is not added into AH when AL holds an
invalid operand (such as FF). The 286 and newer processors _do_ add it, so
the same _invalid_ operand yields different results. Execution is
effectively the same when valid operands are loaded.
The highest 4 bits of AL are always cleared.
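
The difference can be modelled in C. This is only a sketch: the names
aaa_state, aaa_8086 and aaa_286 are invented here, and the 286+ variant
assumes the adjustment is applied to AX as a whole (AX + 0106h), which is
consistent with the carry behaviour described above.

```c
#include <stdint.h>

/* Register state touched by AAA: AL, AH and the AF/CF flags. */
typedef struct {
    uint8_t al, ah;
    int af, cf;
} aaa_state;

/* 88/86 model: AL and AH are adjusted separately, so a carry out of
   AL + 6 is lost. */
aaa_state aaa_8086(aaa_state s)
{
    if ((s.al & 0x0F) > 9 || s.af) {
        s.al = (uint8_t)(s.al + 6);
        s.ah = (uint8_t)(s.ah + 1);
        s.af = s.cf = 1;
    } else {
        s.af = s.cf = 0;
    }
    s.al &= 0x0F;               /* highest 4 bits of AL always cleared */
    return s;
}

/* 286+ model (assumption): AX is adjusted as one 16 bit value, so a
   carry out of AL propagates into AH. */
aaa_state aaa_286(aaa_state s)
{
    if ((s.al & 0x0F) > 9 || s.af) {
        uint16_t ax = (uint16_t)(((s.ah << 8) | s.al) + 0x0106);
        s.al = (uint8_t)ax;
        s.ah = (uint8_t)(ax >> 8);
        s.af = s.cf = 1;
    } else {
        s.af = s.cf = 0;
    }
    s.al &= 0x0F;
    return s;
}
```

With AL = FF and AH = 00 the first model ends with AH = 01, the second
with AH = 02. For a valid operand such as AL = 0B both give AX = 0101h.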


AAD    Adjust After BCD Division
──────────────────────────────────────────────────────────────────────────────

Mnemonic: AAD
Opcode  : D5 imm8  (88=60, 86=60, 286=14, 386=19, 486=14 clocks)
Bug in  : Is an opcode variant on Intel's 88,86,286,386,486
          Variant does not work on NEC's V-series, probably not on AMD CPUs

Function:
This instruction regularly performs the following action:
  - unpacked BCD in AX   example (AX = 0104h)
  - AL = AH * 10d + AL   (AL = 0eh )
  - AH = 00              (AH = 00h )

The normal opcode decodes as follows: d5,0a
The instruction itself is an instruction plus operand. By replacing the
second byte with any number in the range 00 - ff you can build your own
AAD instruction for any number base in that range. For example, by coding
d5,10 you get an instruction that performs:

  - AL = AH * 16d + AL.
  - AH = 00

This feature of Intel's chips can be used to determine whether there is
a true Intel CPU installed in a system.
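
The operation for an arbitrary second byte can be written as a small C
model. This is a sketch, and the function name aad is made up for the
illustration:

```c
#include <stdint.h>

/* Model of D5 imm8: AL = AH * base + AL (low 8 bits kept), AH = 0.
   Takes AX and returns the new AX. */
uint16_t aad(uint16_t ax, uint8_t base)
{
    uint8_t ah = (uint8_t)(ax >> 8);
    uint8_t al = (uint8_t)ax;
    return (uint8_t)(ah * base + al);   /* AH is zero in the result */
}
```

aad(0x0104, 10) gives 000Eh as in the example above, while
aad(0x0104, 16) gives 0014h.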

(NEC difference supplied by Anthony Naggs)


AAM   Adjust After BCD Multiplication
──────────────────────────────────────────────────────────────────────────────

Mnemonic: AAM
Opcode  : D4 imm8  (88=83, 86=83, 286=16, 386=17, 486=15 clocks)
Bug in  : Is an opcode variant on Intel's 88,86,286,386,486

Function:
This instruction regularly performs the following action:
  - binary number in AL
  - AH = AL / 10d
  - AL = AL MOD 10d

Thus creating an unpacked BCD in AX. The normal opcode decodes as follows:
d4,0a. The instruction itself is an instruction plus operand. By replacing
the second byte with any number in the range 01 - ff you can build your own
AAM instruction for any number base in that range (d4,00 divides by zero).
For example, by coding d4,07 you get an instruction that performs:
  - binary number in AL
  - AH = AL / 07d
  - AL = AL MOD 07d
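
As a sketch in C, with the invented function name aam, the general form
looks like this. The caller is assumed to pass a non-zero base, since a
zero divisor raises a divide error on the real instruction.

```c
#include <stdint.h>

/* Model of D4 imm8: AH = AL / base, AL = AL MOD base. Returns AX.
   base must be non-zero. */
uint16_t aam(uint8_t al, uint8_t base)
{
    return (uint16_t)(((al / base) << 8) | (al % base));
}
```

aam(14, 10) returns 0104h (unpacked BCD 14) and aam(20, 7) returns 0206h.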


AAS   Adjust After BCD Subtraction
──────────────────────────────────────────────────────────────────────────────

Mnemonic: AAS
Opcode  : 3F
Bug in  : Intel's documentation

Function:
Adjusts result of two subtracted BCD numbers to form a valid new BCD number.
Highest 4 bits of AL are always cleared.


ADD4S   Addition of packed BCD strings (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: ADD4S
Opcode  : 0F 20  (7+19n clocks, n is the number of bytes per operand)
Bug in  : Rarely documented, except in NEC manuals

Function:
Adds the packed BCD string at DS:SI to the packed BCD string at ES:DI. The
length of the string, in BCD digits, is specified in CL. Unlike Intel string
operations CL, DI & SI are unchanged by the operation. The Zero Flag (ZF) is
set if both operands are zero.  The Carry Flag (CF) and Overflow Flag (OF)
appear to be set by the addition of the most significant digits.

Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See also SUB4S, CMP4S, ROL4, ROR4


BOUND  Checks register against limits
──────────────────────────────────────────────────────────────────────────────

Mnemonic: BOUND reg,mem
Opcode  : 62 [mod:reg:r/m]
Bug in  : NEC V20 handles it differently from Intel 286+. But apparently,
          according to Intel documentation, equal to 186.

Function:
Bound checks a register against limits and generates exception 5 if the
value falls outside the limit. On NEC CPUs the mnemonic is apparently also
referred to as 'CHKIND'.
Note that the mem component refers to two consecutive memory locations, of
size 'reg' which contain the lower and upper limit for the value in 'reg'
as [low limit][high limit].

'reg' size:     'mem' specifies address of:

    word            dword
    dword           qword

Normally, on Intel 286+ CPUs, the exception saves the CS:IP pointing TO the
BOUND instruction. On the NEC V20, the saved CS:IP point to the instruction
following the BOUND instruction.

According to Intel's documentation the 186 handles this exception the same
way the NEC does. It has been verified on a 486 that the CS:IP of BOUND on
that CPU indeed points TO the instruction itself and not the following one.

Also, contrary to what one might expect, BOUND only allows word or dword
registers to be tested. Byte registers are invalid.
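
The check itself can be sketched in C for the word case. The function
name bound16_faults is hypothetical, and the signed, inclusive comparison
is an assumption taken from Intel's description of BOUND, not from this
list.

```c
#include <stdint.h>

/* Model of BOUND reg16,mem: limits[0] is the low limit, limits[1] the
   high limit, stored consecutively as [low limit][high limit].
   Returns 1 when exception 5 would be generated. */
int bound16_faults(int16_t reg, const int16_t limits[2])
{
    return reg < limits[0] || reg > limits[1];
}
```

What differs between the CPUs is only the CS:IP saved by the exception,
not the comparison.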

(V20 supplied by Robert Mashlan)
(186 difference & 'CHKIND' supplied by Anthony Naggs)


Breakpoint errors while debugging
──────────────────────────────────────────────────────────────────────────────

Mnemonic: N/A
Opcode  : N/A
Bug in  : some 386, some 486

Function:
Breakpoints are used in the process of debugging programs.
On the 386+, debug registers may be used instead of a one byte opcode.

386 specific debugging bugs occurring on some 386s:
Breakpoints are missed under the following conditions:

- A data breakpoint set to a mem16 operand of a VERR, VERW, LSL or LAR while
  the segment with selector at mem16 is not accessible.

- A data breakpoint is set to the write operand of a REP MOVS instruction
  and the read cycle of the next iteration generates a fault.

- A code or data breakpoint is set on the instruction following a MOV or
  POP to SS while the instruction needs more than two clocks.
  (see <MOV> and <POP>)

Random breakpoints may occur under the following condition:

- Breakpoints set using debug registers DR0 to DR3 may produce spurious
  breaks if breakpoints were enabled before a MOV from CR3, TR6 or TR7 took
  place. These unreliable breaks may continue to occur until the next JMP
  instruction is executed. A workaround would be to:
  = disable breakpoints before any MOV from CR3, TR6 or TR7
  = MOV the values
  = perform a JMP
  = enable breakpoints.

Single stepping is not disabled in the handler for a TSS fault if the code
that caused the fault was being single-stepped and a task gate was used to
handle the fault.

486 specific debugging bugs occurring on some 486s:

A code breakpoint set on control transfer instructions (like CALL, RET, JMP
etc.) will clear the lowest four bits of DR6 when the breakpoint is taken.

A code breakpoint set on an instruction immediately following a RETN, JCXZ,
intrasegment indirect CALL (CALL word ptr [bx] for example) or
intrasegment indirect JMP (JMP word ptr [bx] for example) will always be
satisfied, even when the control instruction is taken. A breakpoint set at
the target of these control transfer instructions will not be taken,
even if control is transferred to them, because the buggy breakpoint sets
the RF (Resume Flag). There is said to be no workaround other than to avoid
the situation. However, coding a NOP after the control transfer instruction
and setting the breakpoint on the instruction following the NOP may, in my
view, very well solve the problem (untested).


BRKEM   Break for emulation (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: BRKEM   imm
Opcode  : 0F FF imm  (38 clocks)
Bug in  : Rarely documented, except in NEC manuals

Function:
(8080 is written here as 8O8O to avoid visual confusion with the 8088).
This is the basic instruction used to switch to 8O8O emulation mode.
The BRKEM instruction is used in a similar way to an INT instruction,
(referred to as BRK by NEC). The mode flag (MD) is set to zero, the Flags,
CS & IP are pushed onto the stack then CS & IP are loaded from the
specified interrupt vector.

In 8O8O emulation mode the V30 registers and flags are mapped to 8O8O
registers and flags.

    General purpose register names:
                    ┌───┬───┬───┬───┬───┬───┬───┬───┬───┐
    8O8O name───────┤ A │ B │ C │ D │ E │ H │ L │ SP│ PC│
    Intel x86 name──┤ AL│ CH│ CL│ DH│ DL│ BH│ BL│ BP│ IP│
    V30 name────────┤ AL│ CH│ CL│ DH│ DL│ BH│ BL│ BP│ PC│
                    └───┴───┴───┴───┴───┴───┴───┴───┴───┘

    Individual flag names:
                    ┌───┬───┬───┬───┬───┐
    8O8O name───────┤ C │ Z │ S │ P │ AC│
    Intel x86 name──┤ CF│ ZF│ SF│ PF│ AF│
    V30 name────────┤ C │ Z │ S │ P │ AC│
                    └───┴───┴───┴───┴───┘

In 8O8O emulation mode the segment used for instructions is determined
by the CS (PS) register. The DS (DS0) register determines the segment
used for data.

When an interrupt occurs during 8O8O emulation the CPU switches to native
V30 mode to process the interrupt. When the interrupt handler is complete
the IRET, (RETI in NEC nomenclature), will return to 8O8O emulation mode.

From 8O8O emulation mode RETEM (Return from Emulation, opcode ED FD) returns
to native mode, setting MD flag and restoring flags, CS & IP from the native
stack. Alternatively CALLN imm8 (Call Native, opcode ED ED imm) can be used
to call native V30 interrupts, (just like a regular INT).

Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)


BSF, Bit Scan Forward
──────────────────────────────────────────────────────────────────────────────

Mnemonic: BSF op1,op2
Opcode  : 0F BC
Bug in  : Intel's documentation

Function:
Finds the first (lowest) bit set to 1 in op2, sets ZF=0 and returns the bit
position in op1. If op2 is 0, ZF=1 and the value of op1 is undefined:
some 386s leave the old value in op1, some early 486s load garbage into
op1 and later 486s leave op1 unchanged.
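
A C sketch of the scan is given below. It follows the hardware convention
that ZF is set only when op2 is zero. The name bsf32 is invented here, and
leaving the destination untouched for a zero source is just one of the
behaviours listed above, picked for the model.

```c
#include <stdint.h>

/* Model of BSF with a 32 bit source. Returns the resulting ZF.
   *dst is written only when src is non-zero. */
int bsf32(uint32_t src, uint32_t *dst)
{
    uint32_t i = 0;

    if (src == 0)
        return 1;               /* ZF=1, destination not defined */
    while (((src >> i) & 1u) == 0)
        i++;
    *dst = i;                   /* index of the lowest set bit */
    return 0;                   /* ZF=0 */
}
```

Portable code should test ZF after BSF and never rely on op1 when the
source was zero.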


BSWAP reg32   Byte Swap
──────────────────────────────────────────────────────────────────────────────

Mnemonic: BSWAP reg32
Opcode  : 0F C8+reg# (00001111 11001rrr)
Bug in  : 486

Function:
Swaps all bytes in a 32 bit register, changing the sequence from ABCD to
DCBA. Handy for converting numbers to a CPU format where the byte order
is reversed. The bug appears when BSWAP runs with a 16 bit operand size,
that is, when it is not preceded by prefix 66h in 16 bit mode or when it
IS preceded by 66h in 32 bit mode.
Do not use this instruction with 16 bit registers as operand.
Results are undefined in that case. Use XCHG reg8,reg8 instead (for example
XCHG AL,AH) if you need to swap the 2 bytes of a 16 bit register like AX.
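
For reference, the two well-defined operations expressed in C. This is a
sketch, and the function names bswap32 and swap16 are illustrative only.

```c
#include <stdint.h>

/* 32 bit byte swap, ABCD -> DCBA: the defined behaviour of BSWAP reg32. */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* 16 bit swap: what exchanging the two 8 bit halves of a 16 bit
   register does. */
uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}
```

bswap32(0x11223344) yields 44332211h and swap16(0x1234) yields 3412h.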


BT op1,op2  Bit Test
──────────────────────────────────────────────────────────────────────────────

Mnemonic: BT
Opcode  : 0F A3 op1,op2
Bug in  : No bug, avoid use on ports in 386, 486

Function:
Basically copies bit(op2) from op1 into CY. Memory variant is more complex.
Do not use on memory mapped I/O ports or memory operands that span into or
lie completely within nonexistent memory.
In the case of memory mapped I/O ports, use MOV and TEST instead.
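
Why the memory variant is more complex can be illustrated with a sketch.
It assumes the addressing rule from Intel's documentation for a register
bit offset with a 16 bit operand: the word accessed lies at the operand
address plus 2 * (offset DIV 16), and the bit tested is offset MOD 16,
with the offset taken as a signed number. The names bt_target and
bt_mem16_target are invented for the illustration.

```c
#include <stdint.h>

/* Which word does BT/BTC/BTR/BTS mem16,reg16 really touch? */
typedef struct {
    int32_t byte_delta;   /* displacement added to the operand address */
    int     bit;          /* bit number within that word, 0..15 */
} bt_target;

bt_target bt_mem16_target(int16_t bit_offset)
{
    bt_target t;

    t.bit = bit_offset & 15;                          /* offset MOD 16 */
    t.byte_delta = 2 * ((bit_offset - t.bit) / 16);   /* 2 * (offset DIV 16) */
    return t;
}
```

A bit offset of 35 touches the word 4 bytes above the operand address,
and an offset of -1 touches the word 2 bytes below it, so the access can
land on a neighbouring port or in nonexistent memory.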


BTC op1,op2   Bit Test and Complement
──────────────────────────────────────────────────────────────────────────────

Mnemonic: BTC op1,op2
Opcode  : 0F BB [mod:reg:r/m]
          0F BA [mod:111:r/m] imm8
Bug in  : No bug, avoid use on ports in 386, 486

Function:
Basically copies bit(op2) from op1 into CY and complements bit(op2) of op1.
Memory variant is more complex. Do not use on memory mapped I/O ports or
memory operands that span into or lie completely within nonexistent memory.
In the case of memory mapped I/O ports, use MOV and TEST instead.


BTR op1,op2   Bit Test and Reset
──────────────────────────────────────────────────────────────────────────────

Mnemonic: BTR op1,op2
Opcode  : 0F B3 [mod:reg:r/m]
          0F BA [mod:110:r/m] imm8
Bug in  : No bug, avoid use on ports in 386, 486

Function:
Basically copies bit(op2) from op1 into CY and sets bit(op2) of op1 to 0.
Memory variant is more complex. Do not use on memory mapped I/O ports or
memory operands that span into or lie completely within nonexistent memory.
In the case of memory mapped I/O ports, use MOV and TEST instead.


BTS op1,op2   Bit Test and Set
──────────────────────────────────────────────────────────────────────────────

Mnemonic: BTS
Opcode  : 0F BA [mod:101:r/m] imm8 / 0F AB [mod:reg:r/m]
Bug in  : No bug, avoid use on ports in 386, 486

Function:
Basically copies bit(op2) from op1 into CY and sets bit(op2) of op1 to 1.
Memory variant is more complex. Do not use on memory mapped I/O ports or
memory operands that span into or lie completely within nonexistent memory.
In the case of memory mapped I/O ports, use MOV and TEST instead.


Chip Step information for Intel CPUs
──────────────────────────────────────────────────────────────────────────────

CPUs are manufactured in models (like the 80386). While these models are
manufactured, errors in the mask layout and mask design may become
apparent. These errors may be corrected before a new batch of chips is
made. To distinguish between these revisions an identification code is
placed within the mask design on 386+ CPUs. By testing the CPU with CPUID
or by performing a RESET, this information is copied to specific registers.

The register used to hold mask info after a RESET is DX (apparently also
sometimes the high word of EDX on some 486s).

This page lists some component and revision ID's found in the DX register
for the 386SX, 386DX, 486SX and 486DX models from Intel.


        CPU:        DX:     Step:
        386SX       2304h   A0
                    2305h   B
                    2306h   C
                    2308h   D1

        386DX       0303h   B0 - B10
                    0305h   D0
                    0308h   D1 & D2

        486SX       0420h   A0

        486DX       0000h   A1
                    0401h   Bn
                    0402h   C0
                    0404h   D0
                    0410h   cAn
                    0411h   cBn


CLEAR1  Clears a specific bit to 0 (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: CLEAR1 reg/mem,CL/immediate
Opcode  : CLEAR1 r/m8,CL   : 0F 12 [mod:000:r/m]      (5/14 clocks)
          CLEAR1 r/m8,imm3 : 0F 1A [mod:000:r/m] imm  (6/15 clocks)
          CLEAR1 r/m16,CL  : 0F 13 [mod:000:r/m]      (5/14 clocks)
          CLEAR1 r/m16,imm4: 0F 1B [mod:000:r/m] imm  (6/15 clocks)
          CLEAR1 CY        : F8   (NEC nomenclature for Intel's CLC)
          CLEAR1 DIR       : FC   (NEC nomenclature for Intel's CLD)
Bug in  : Rarely documented, except in NEC manuals

Function:
Clears the specified bit in the register/memory operand. The bit number (CL
or immediate) is ANDed with 07 (for 8-bit operands) or 0F (for 16-bit
operands) to get a valid bit number. No flags are affected by this
operation, except by CLEAR1 CY and CLEAR1 DIR.

The first (smaller) clock count of each pair is for register operands.
Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: NECINS, EXT, TEST1, NOT1, SET1


CMP4S   Comparison of packed BCD strings (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: CMP4S
Opcode  : 0F 26  (7+19n clocks, n is the number of bytes per operand)
Bug in  : Rarely documented, except in NEC manuals

Function:
Subtracts the packed BCD string at DS:SI from the packed BCD string at
ES:DI, but does not store the result. The length of the string, in BCD
digits, is specified in CL. Unlike Intel string operations CL, DI & SI are
unchanged by the operation. The Zero Flag (ZF) is set if the result is zero.
The Carry Flag (CF) and Overflow Flag (OF) appear to be set by the
subtraction of the most significant digits.

Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: ADD4S, SUB4S, ROL4, ROR4


CMPS Compare String Bytes, Word or Dword
──────────────────────────────────────────────────────────────────────────────

Mnemonic: CMPS
Opcode  : A6 (Bytes)
          A7 (Words)
          66 A6 (Bytes)
          66 A7 (DWords)
Bug in  : Early 286 in protected mode

Function:
Compares two strings in memory.
Repeated version (REP CMPS) in early 286 protected mode has a bug that
shows when, during execution, a segment limit exception or IO Privilege
Level Exception occurs.
In that case the exception handler sees the value of CX as it was at the
start of the REP instruction. SI and DI however reflect the correct index
of the elements currently scanned at the time of the exception.

Workaround: Do not scan beyond segment limits or into memory mapped I/O
areas.


CMPXCHG op1,op2   Compare and Exchange
──────────────────────────────────────────────────────────────────────────────

Mnemonic: CMPXCHG
Opcode  : 0F B0 reg,mem/reg (Byte)
          0F B1 reg,mem/reg (Word)
          66 0F b0/b1 (Byte / DWord)
Bug in  : pre-B step 486

Function:
Compares the accumulator (8, 16 or 32 bit form) with op1 by internally
subtracting op1 from the accumulator and setting ZF according to the result.
If ZF=1 (equal), op2 is copied to op1; otherwise op1 is loaded into the
accumulator.
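
The data flow can be sketched in C for the 32 bit form. The function name
cmpxchg32 is hypothetical, and the model ignores the other flags and any
bus locking.

```c
#include <stdint.h>

/* Model of CMPXCHG op1,op2 with EAX as accumulator.
   Returns the resulting ZF: 1 if the accumulator equalled *op1. */
int cmpxchg32(uint32_t *acc, uint32_t *op1, uint32_t op2)
{
    if (*acc == *op1) {
        *op1 = op2;         /* equal: op2 is stored into op1 */
        return 1;
    }
    *acc = *op1;            /* not equal: op1 is loaded into accumulator */
    return 0;
}
```

A typical use is a retry loop: load the old value into the accumulator,
compute the new value, and repeat while the returned ZF is 0.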

On the A-step of the 486, this mnemonic was coded using the opcodes of the
discarded A- to B0-step 386 instructions XBTS (0F A6) and IBTS (0F A7).
Because of conflicts with software written for the early 386 DX, the
opcodes for the 486 were changed to the ones above starting with the B step.

Note that some 386 software won't run on older 386es and some 486
software will not run on early 486es when using this instruction.


CPUID Identify CPU on 486 and higher CPUs
──────────────────────────────────────────────────────────────────────────────

Mnemonic: CPUID
Opcode  : 0F A2
Bug in  : Is undocumented for 486, seems not to work on tested AMD 486s
          Officially introduced as a new instruction with the Pentium.

Function:
Identifies CPU and revision information for the installed CPU. Note that
Intel officially introduced CPUID only with the Pentium processor.
It seems the instruction was unofficially introduced in the later
486 CPUs as well. Discovered by Christian Ludloff (see acknowledgements).
Supported by the UMC U5S 486 clones as well.

Executing it on an early 486 yields an Invalid Opcode Exception.
To safely use this instruction, an exception handler must be installed.
A safer approach though is to test whether the ID bit (bit 21) in EFLAGS
can be toggled. If so, the CPU supports CPUID. See <EFLAGS> image.

The instruction expects input in the EAX register and outputs information
in the EAX, EBX, ECX and EDX registers.

Input:  EAX = 0000 0000 : Check CPU 486+ installed

Output: after CPUID:
        EAX = 0000 0001 : highest input value supported (1 here)
        EBX = 756e 6547 : 'uneG'
        EDX = 4965 6e69 : 'Ieni'
        ECX = 6c65 746e : 'letn'
        effectively the CPU says 'GenuineIntel'

Officially this returns a 'vendor string', which may indicate other than
Intel strings for OEMs.
The UMC U5S-33 returns 'UMC UMC UMC ' or ' UMC UMC UMC' (untested).
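
The register order is easy to get wrong, so here is a sketch in C of how
the three registers combine into the vendor string. The function name
cpuid_vendor is invented here. The register values are passed in as plain
numbers, so the sketch does not execute CPUID itself.

```c
#include <stdint.h>

/* Build the 12 character vendor string from the registers returned for
   input EAX=0. Each register is read low byte first, in the order
   EBX, EDX, ECX. out must have room for 13 bytes. */
void cpuid_vendor(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    int r, b;

    for (r = 0; r < 3; r++)
        for (b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFF);
    out[12] = '\0';
}
```

Feeding in the values listed above (756e6547, 49656e69, 6c65746e) gives
"GenuineIntel".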

Input:  EAX = 0000 0001 : Obtain model specific information

Output: after CPUID:
        EAX = RRRR RFMS : revision information
            R = Reserved  Zero, but reserved
            F = Family    (4=486, 5=Pentium)
            M = Model     (3 on tested 486DX-2/66, 1 on tested Pentium/60)
            S = Stepping  (5 on tested 486DX-2/66, 3 on tested Pentium/60)
        EBX = RRRR RRRR
            R = Reserved  Zero, but reserved
        ECX = RRRR RRRR
            R = Reserved  Zero, but reserved
        EDX = xxxx xxxx : Bitmapped features, 1 means option available
            Bit 0 =       FPU built-in (supported on 486 and Pentium)
            Bit 1 =       V-86 mode extensions present
            Bit 2 =       I/O breakpoints possible
            Bit 3 =       4 MB paging supported
            Bit 4 =       Time Stamp Counter present
            Bit 5 =       Has Pentium compatible Model Specific Registers
            Bit 6 =       Reserved (0)
            Bit 7 =       Machine Check Exception supported (P5 only)
            Bit 8 =       CMPXCHG8B supported (apparently Pentium only)
            Bits 9-31     Reserved
            Assume zero if bit is not mentioned.
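
Decoding these values is plain bit masking. The sketch below uses made-up
names (cpu_signature, cpuid_signature, cpuid_has and the CPUID_* masks)
and only covers the fields and bits listed above.

```c
#include <stdint.h>

/* Family, model and stepping as found in EAX after CPUID with EAX=1. */
typedef struct {
    unsigned family;
    unsigned model;
    unsigned stepping;
} cpu_signature;

cpu_signature cpuid_signature(uint32_t eax)
{
    cpu_signature s;

    s.family   = (eax >> 8) & 0xF;   /* F nibble */
    s.model    = (eax >> 4) & 0xF;   /* M nibble */
    s.stepping = eax & 0xF;          /* S nibble */
    return s;
}

/* EDX feature bits, numbered as in the list above. */
#define CPUID_FPU (1u << 0)   /* FPU built-in */
#define CPUID_VME (1u << 1)   /* V-86 mode extensions */
#define CPUID_DE  (1u << 2)   /* I/O breakpoints */
#define CPUID_PSE (1u << 3)   /* 4 MB paging */
#define CPUID_TSC (1u << 4)   /* Time Stamp Counter */
#define CPUID_MSR (1u << 5)   /* Model Specific Registers */
#define CPUID_MCE (1u << 7)   /* Machine Check Exception */
#define CPUID_CX8 (1u << 8)   /* CMPXCHG8B */

int cpuid_has(uint32_t edx, uint32_t feature)
{
    return (edx & feature) != 0;
}
```

An EAX of 00000435h decodes to family 4, model 3, stepping 5, which
matches the tested 486DX-2/66 above.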

Note that this instruction is not supported on all 486 CPUs. However,
Christian Ludloff has tested it on some 486 DX and 486 SX models, in
addition to the Pentium/60, and found it to be present on those machines.
Any step and model information you find this instruction to run on is
welcomed. Please forward it to Christian.

Apparently all new(er) Intel CPUs are equipped with (some) of these
extensions, not just the Pentium.


CR0-4 register layout (386+)
──────────────────────────────────────────────────────────────────────────────

    = CR0: Some bits remain from the Machine Status Word of the 286.

      Bit 31                         16                              0
      ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
      │P│C│N│r│r│r│r│r│r│r│r│r│r│A│r│W│r│r│r│r│r│r│r│r│r│r│n│e│t│E│m│p│
      └┬┴┬┴┬┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴┬┴─┴┬┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴┬┴┬┴┬┴┬┴┬┴┬┘
       │ │ │                     │   └───────────────────┐ │ │ │ │ │ │
       │ │ │                     └─────────────────────┐ │ │ │ │ │ │ │
       │ │NW Not Write through (1=no write through)    │ │ │ │ │ │ │ │
       │ └CD Cache Disable (1 if disabled)             │ │ │ │ │ │ │ │
       └──PG Paging Enabled                            │ │ │ │ │ │ │ │
          AM Alignment mask (1=enable)─────────────────┘ │ │ │ │ │ │ │
          WP Write Protect (1 if read-only pages protected)│ │ │ │ │ │
          NE Numeric Error (1=native FPU error reporting)──┘ │ │ │ │ │
          ET Extension Type (1=387 type FPU,0=287 type FPU)──┘ │ │ │ │
          TS Task Switch (1=task switch has occurred)──────────┘ │ │ │
          EM Emulate Processor Extension ────────────────────────┘ │ │
             (1=execute exception 7 on FPU codes)                  │ │
          MP Math Present (1=_FPU_ will handle FPU codes)──────────┘ │
          PE Protection Enabled (1=Protected mode activated)─────────┘

      If EM=1 and MP=0, the FPU codes will be handled by software routines
      via exception 7. Coprocessor emulators use this property.
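
      The bit positions in the diagram translate into masks as sketched
      below. The CR0_* names and the helper fpu_is_emulated are
      illustrative only. CR0_EM denotes bit 2, the emulate processor
      extension bit, and the value of CR0 is passed in as a plain number.

```c
#include <stdint.h>

#define CR0_PE (1u << 0)    /* Protection Enabled */
#define CR0_MP (1u << 1)    /* Math Present */
#define CR0_EM (1u << 2)    /* Emulate Processor Extension */
#define CR0_TS (1u << 3)    /* Task Switch */
#define CR0_ET (1u << 4)    /* Extension Type */
#define CR0_NE (1u << 5)    /* Numeric Error */
#define CR0_WP (1u << 16)   /* Write Protect */
#define CR0_AM (1u << 18)   /* Alignment mask */
#define CR0_NW (1u << 29)   /* Not Write through */
#define CR0_CD (1u << 30)   /* Cache Disable */
#define CR0_PG (1u << 31)   /* Paging Enabled */

/* True when FPU opcodes are handed to software via exception 7,
   that is, bit 2 set and MP clear. */
int fpu_is_emulated(uint32_t cr0)
{
    return (cr0 & CR0_EM) != 0 && (cr0 & CR0_MP) == 0;
}
```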

    = CR1: Is reserved
    = CR2: Linear 32-bit address of Page Fault


    = CR3: Page Directory Base Register (386+)

      Bit 31                         16                              0
      ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
      │x│x│x│x│x│x│x│x│x│x│x│x│x│x│x│x│x│x│x│x│r│r│r│r│r│r│r│p│P│r│r│r│
      └┬┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴┬┴─┴─┴─┴─┴─┴─┴─┴┬┴┬┴─┴─┴─┘
       └─────Page Directory Base Register────┘               │ │       PDBR
       (used in the Paging process implemented on the 386+)  │ │
                                                             │ │
       Page-level Cache Disable (486+)───────────────────────┘ │       PCD
       Page-level Write-Through (486+)─────────────────────────┘       PWT


    = CR4: Extended Machine Control (Pentium+)

      Bit 31                         16                              0
      ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
      │r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│r│M│r│p│D│T│P│V│
      └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴┬┴─┴┬┴┬┴┬┴┬┴┬┘
       Machine Check Enable (1=enabled)──────────────────┘   │ │ │ │ │ MCE
       Page Size Extension (1=4 MB paging instead of 4 KB)───┘ │ │ │ │ PSE
       Debugging Extension (1=breakpoints also valid for I/O)──┘ │ │ │ DE
       Time Stamp instruction Disable (1=RDTSC only with CPL=0)──┘ │ │ TSD
       Protected mode Virtual Interrupts (1=use VI flag in PM)─────┘ │ PVI
       Virtual86 mode Virtual Interrupts (1=use VI flag in VM)───────┘ VME


       The VME bit allows a V86 (or VM) task to use the 'virtual' interrupt
       flag. Setting and clearing the interrupt flag (IF) in EFLAGS is no
       longer intercepted by the V86 Monitor program (a very time consuming
       procedure). Instead, the Pentium+ sets and clears the VI flag in
       EFLAGS rather than the IF flag. This saves task switches to the
       monitor to handle the CLI and STI instructions and thus a lot
       of time in general purpose 8086 programs running in V86 mode.

       The PVI bit allows the same for Protected Mode procedures that would
       otherwise need supervision by a different task. That is:
       Tasks with CPL>0 may now call tasks with CPL=0 without crashing
       the system, but only under specific circumstances.

       The TSD bit changes the CPL-sensitivity of the RDTSC (Read Time
       Stamp Counter) instruction, which reads a built-in CPU counter that
       is incremented every internal clock pulse.
       When TSD is 0, <RDTSC> is accessible for all CPL levels.
       With TSD set to 1 however, RDTSC is available only to tasks with
       CPL=0.

       The DE bit allows the Pentium+ to set breakpoints in I/O space
       using the breakpoint registers. The R/W coding 10b is used to
       indicate that the breakpoint is in I/O space on the Pentium+.
       The 10b encoding was marked as 'invalid' for pre-Pentium CPUs.

       The PSE bit determines the size of the pages controlled by the
       Paging Unit. With PSE = 0, the Paging mechanism uses 4 KB pages.
       With PSE set to 1 however, the Paging mechanism uses 4 MB pages.

       The MCE bit is used to allow generation of a Machine Check Exception.
       This exception is the result of a Parity error _within_ the Pentium
       or an active BUSCHK signal (low) on pin T3 (upper right hand corner,
       fourth pin from right, third from top when pin A1 is upper left
       corner, TOP view). The exception is vectored through interrupt 18d
       (or 12h). Execution after this exception may void system integrity.
       The Machine Check Address register holds the value of the address
       bus at the moment the event took place.
       The Machine Check Type register holds the type of bus access at the
       time the event took place.
       Both these registers are internal 64 bit registers which can only be
       read through the instruction <RDMSR> (Read Model Specific Register).
       See also <WRMSR> (Write Model Specific Register).


EFLAGS register layout (8088 to Pentium & NEC)
──────────────────────────────────────────────────────────────────────────────

      Bit 31                         16                              0
      ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
      │r│r│r│r│r│r│r│r│r│r│c│p│v│a│V│R│M│N│IOP│O│D│I│T│S│Z│r│A│r│P│r│C│
      └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴─┴┬┴┬┴┬┴┬┴┬┴┬┴─┴┬┴─┴┬┴─┴┬┘
                           │ │ │ │ │ │ │ │ │   │ │ │ │ │ │   │   │Carry
      CPUID available ─────┘ │ │ │ │ │ │ │ │   │ │ │ │ │ │   │   Parity
      Virtual Interrupt Pending│ │ │ │ │ │ │   │ │ │ │ │ │   └Aux carry
      Virtual Interrupt flag ──┘ │ │ │ │ │ │   │ │ │ │ │ └──────── Zero
      Alignment check ───────────┘ │ │ │ │ │   │ │ │ │ └────────── Sign
      Virtual-86 mode enabled ─────┘ │ │ │ │   │ │ │ └ Trap (step mode)
      Resume flag ───────────────────┘ │ │ │   │ │ └── Interrupt enable
      Mode Flag ───────────────────────┘ │ │   │ └──── Direction (1=down)
      Nested Task ───────────────────────┘ │   └────────────── Overflow
                                           └── I/O privilege level 0..3

   Note: the Mode Flag is supported only on the NEC V20/30,
   it is reserved on Intel CPUs.

   The diagram below shows the names for each bit as referred to in most
   books, along with the CPU in which the bit was =officially= introduced.

      Description:                  Name:   CPU introduced:

      CPUID available───────────────ID      Pentium
      Virtual Interrupt Pending─────VIP     Pentium
      Virtual Interrupt flag────────VI      Pentium
      Alignment Check Flag──────────AC      486
      Virtual-86 Mode Flag──────────VM      386
      Resume Flag───────────────────RF      386
      Mode Flag (8O8O emulation on)─MD      V20/V30 only
      Nested Task───────────────────NT      286
      I/O privilege level 0..3──────IOPL    286
      Overflow Flag─────────────────OF       86
      Direction Flag (1=down)───────DF       86
      Interrupt Flag (1=enabled)────IF       86
      Trap Flag (single step mode)──TF       86
      Sign Flag─────────────────────SF       86
      Zero Flag─────────────────────ZF       86
      Auxiliary carry Flag──────────AF       86
      Parity Flag───────────────────PF       86
      Carry Flag────────────────────CF       86

(8080 is written here as 8O8O to avoid visual confusion with the 8088).
(Mode Flag supplied by Anthony Naggs)
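
The bit layout above can be turned into a small decoder. The following is
only a sketch in Python, not part of any CPU documentation: the names
EFLAGS_BITS and decode_eflags are made up for this illustration, and the MD
bit is meaningful on the NEC V20/V30 only.

```python
# Bit numbers follow the EFLAGS diagram (bit 0 = Carry, bit 21 = CPUID).
# MD (bit 15) exists on the NEC V20/V30 only; it is reserved on Intel CPUs.
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9,
    "DF": 10, "OF": 11, "NT": 14, "MD": 15, "RF": 16, "VM": 17,
    "AC": 18, "VIF": 19, "VIP": 20, "ID": 21,
}
IOPL_SHIFT = 12   # IOPL is a two-bit field occupying bits 12 and 13
IOPL_MASK = 0x3


def decode_eflags(value):
    """Return (sorted names of the single-bit flags that are set, IOPL)."""
    names = [name for name, bit in EFLAGS_BITS.items() if (value >> bit) & 1]
    iopl = (value >> IOPL_SHIFT) & IOPL_MASK
    return sorted(names), iopl
```

For example, decode_eflags(0x3246) reports IF, PF and ZF set with an IOPL
of 3; reserved bits are simply ignored by this sketch.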


EXT   Extract bit field (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: EXT reg8,reg8 / EXT reg8,imm4
Opcode  : 0F 33 [mod:reg:r/m]  (26-55 clocks)
Bug in  : Rarely documented, except in NEC manuals

Function:
Loads AX with bit field data. The bit field length is specified by the lowest
four bits of the second operand; the more significant bits in AX are set to
zero. DS:SI specifies the first memory location to read, and the low 4 bits
of the first operand specify the bit start position. The bit field can
cross a byte boundary. After each complete data transfer, SI and the first
operand are automatically updated to point to the next bit field.
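
A rough behavioural model of the extraction, written in Python, may help to
picture it. This is a sketch under stated assumptions and has not been
checked against a real V20/V30: the function name ext_bit_field is made up,
the length code is ASSUMED to mean 1..16 bits for the values 0..15, and the
bit offset is ASSUMED to wrap at 16 with SI advancing by one word.

```python
def ext_bit_field(memory, si, bit_offset, length_code):
    """Model of EXT: return (ax, new_si, new_bit_offset)."""
    offset = bit_offset & 0x0F          # low 4 bits of the first operand
    length = (length_code & 0x0F) + 1   # ASSUMPTION: code 0..15 = 1..16 bits
    # offset + length is at most 31, so a 4-byte little-endian window is
    # always wide enough, even when the field crosses byte boundaries.
    raw = bytes(memory[si:si + 4]).ljust(4, b"\x00")
    window = int.from_bytes(raw, "little")
    ax = (window >> offset) & ((1 << length) - 1)   # upper bits of AX = 0
    # ASSUMPTION: the offset wraps at 16 and SI then advances by one word.
    next_offset = offset + length
    new_si = si + 2 * (next_offset >> 4)
    return ax, new_si, next_offset & 0x0F
```

Calling the function repeatedly with the returned SI and offset walks through
consecutive bit fields, which is what the automatic update is for.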

Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: NECINS, TEST1, NOT1, CLEAR1, SET1


FPO2   Floating Point Operation 2 (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FPO2  fp-op / FPO2 fp-op,mem
Opcode  : 0110011X [mod:XXX:r/m]  (2/11 clocks)
Bug in  : Rarely documented, except in NEC manuals

Function:
Intended to communicate with NEC maths co-processors. The NEC "FPO1" opcode
corresponds to Intel's "ESC" prefix for co-processor instructions. Although
data sheets exist for NEC maths co-processors, they have never been
manufactured.

Note that the 386+ CPUs implement the opcodes 66 and 67 as Operand Size and
Address Size prefixes respectively.

(Supplied by Anthony Naggs)


HLT  Halt the processor
──────────────────────────────────────────────────────────────────────────────

Mnemonic: HLT
Opcode  : F4
Bug in  : No bug, handy use of instruction described below

Function:
Halts the processor. The CPU restarts only when an external event takes
place, such as RESET activation, an NMI request on the NMI line, or a maskable
interrupt request on INTR while interrupts are enabled.
Handy to use with the following piece of code:

          STI        ; enable interrupts
     lazy:
          HLT        ; suspend CPU internal bus clock
          IN AL,60h  ; Key pressed !
          CMP AL,whatever_key
          JNE lazy   ; was not our key, just go back to sleep.

If the CPU is not going to be used for any processing tasks (hence is idle)
one may execute the code above to cool down the CPU because it stops the
internal CPU bus clock. It also saves (some) energy.


IBTS op1,op2   Insert Bit String
──────────────────────────────────────────────────────────────────────────────

Mnemonic: IBTS op1,op2
Opcode  : 0F A7
Bug in  : 386, 486 conflicting instruction opcode.

Function:
Obsolete instruction which was introduced on the A step of the 386 and
removed on the B1 step of the 386. The opcode 0F A7 was reused by the A step
486 as part of the CMPXCHG instruction. Because of software conflicts (some
compilers generated code for IBTS and its counterpart XBTS), Intel changed
the opcode for CMPXCHG on the B step of the 486.
Do NOT use IBTS in general purpose 386 or 486 applications.


IMUL  Integer, signed, Multiply
──────────────────────────────────────────────────────────────────────────────

Mnemonic: IMUL op
          IMUL op1,op2
          IMUL op1,op2,op3
          IMUL op1,op3
Opcode  : F6w [mod:101:r/m] disp
Bug in  : Apparently no bug, timing formula may be handy

Function:
It is mentioned here because of the timing formula.
The clocks used on 386 and 486 equal the larger of 9 and
ceiling(log2(multiplier))+6.
Add an additional 3 clocks if the multiplier is a memory operand.
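
The timing formula can be written out as a short Python function. This is
merely an illustration of the arithmetic: the name imul_clocks is made up,
and using the magnitude of a negative multiplier, as well as returning the
minimum of 9 clocks for a multiplier of zero, are assumptions.

```python
def imul_clocks(multiplier, memory_operand=False):
    """Estimated 386/486 IMUL clocks according to the formula above."""
    m = abs(multiplier)                  # ASSUMPTION: magnitude is used
    # For m >= 1, ceiling(log2(m)) equals (m - 1).bit_length(); integer
    # arithmetic avoids floating point rounding near powers of two.
    log2_ceiling = (m - 1).bit_length() if m > 0 else 0
    clocks = max(9, log2_ceiling + 6)
    return clocks + 3 if memory_operand else clocks
```

Small multipliers therefore always cost 9 clocks, while a multiplier of
0FFFFh costs 22 clocks, or 25 when it is a memory operand.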

See <MUL> for 32-bit MUL bugs.


INS  Input String from IO port
──────────────────────────────────────────────────────────────────────────────

Mnemonic: INS, INSB, INSW, INSD
Opcode  : 6C, 6D
Bug in  : early 286, some 386, early 486, NEC conflicting mnemonic: INS

Function:
Reads values from the port address in DX into a string at ES:DI or ES:EDI
in memory. When used with the REP prefix, CX or ECX contains the
number of values to read.

There is also a NEC specific instruction with the conflicting mnemonic INS,
see <NECINS> or select <NEC specific instructions> from the mnemonic list
page for more information regarding that instruction.

Bugs in the 286:
If, in protected mode, ES contains a null selector or ES:DI points beyond
the segment limit when the single INS is executed, causing exception 0dh,
the CS:IP saved for the 0dh exception handler points to the instruction
following INS and not to INS itself.

If, in protected mode, during the repeated version of the instruction, a
segment limit or IOPL exception occurred, the exception handler would see
the CX value as it was before the start of the instruction, DI would reflect
the proper index at the time of the exception though. This type of bug
also occurs with the CMPS instruction.

Bugs in the 386:
The value of CX or ECX after the REP version is not correct when
the instruction is followed by a PUSH, POP or memory reference. After
REP INS the value of (E)CX is -1, not 0. Do not assume (E)CX to be 0.

When REP INS or INS is followed by an instruction that uses a different
address size or when they are followed by an instruction that references
the stack implicitly while the B bit of the SS descriptor is different than
the address size used by the instruction, INS will not properly update
the (E)DI and REP INS will not properly update the (E)CX register.
The actual address size used will be the one of the instruction following
the (REP) INS.
A workaround for this bug is to code a NOP with the same address size as the
INS right behind it by using the address size prefix byte 67h (when needed).

Bugs in the 486:
Early 486s may hang if the INS destination address spans a doubleword
boundary while BS16# or BS8# is not asserted.
To avoid this, always align the string on a doubleword boundary.


INS  (NECINS) Insert bit field (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: INS reg8,reg8 / INS reg8,imm4
Opcode  : 0F 31 [mod:reg:r/m]  (31-117 clocks)
Bug in  : Rarely documented, except in NEC manuals

Function:
Stores bit field data from AX into memory. The bit field length is specified
by the lowest four bits of the second operand. ES:DI specifies the first
memory location to write, and the low 4 bits of the first operand specify the
bit offset position. The bit field can cross a byte boundary. After each
complete data transfer, DI and the first operand are automatically updated
to point to the next bit field.

This mnemonic (INS) conflicts with the Intel mnemonic INS, which reads
a string from an I/O port. This Intel instruction has bugs which are listed
with the entry for <INS>. For clarity, this NEC version is referred to as
"NECINS" where possible in this list.

Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: EXT, TEST1, NOT1, CLEAR1, SET1


INVD Invalidate internal and external caches
──────────────────────────────────────────────────────────────────────────────

Mnemonic: INVD
Opcode  : 0F 08
Bug in  : some 486

Function:
INVD tells the processor that all data in both the internal as well as the
external caches is invalid. Data held in external write-back caches is
discarded.

If, on some 486s, a cache line fill is in progress while the INVD instruction
is being executed, that line is NOT invalidated and the buffer contents
are moved into the cache. Valid cache lines are ALWAYS used to satisfy
read requests on all 486s, regardless of whether the cache is enabled or not.

Workaround is to disable the cache prior to flushing it like this:

        MOV EAX,CR0
        OR  EAX,60000000h  ; cache disable bits
        PUSHFD
        CLI
        MOV BL,CS:here
        OUT dummyport,dummydata
        MOV CR0,EAX
here:
        INVD
        AND EAX,9fffffffh  ; cache enable, write-through
        MOV CR0,EAX
        POPFD


JMP   Jump unconditionally.
──────────────────────────────────────────────────────────────────────────────

Mnemonic: JMP dest
Opcode  : EB disp8
Bug in  : A to C0 step of 486

Function:
The short form of JMP transfers execution to a location within -128 to +127
bytes from the end of the jump instruction. The bug occurs when the jump
causes a General Protection Violation while an NMI or INTR occurs at exactly
the same clock pulse.

Although very unlikely to occur, it is listed for completeness.
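
As a side note on the EB disp8 encoding itself (unrelated to the bug), the
target of a short jump can be worked out as in the Python sketch below. The
function name short_jmp_target is made up, and a 16-bit IP that wraps around
at 64 KB is assumed.

```python
def short_jmp_target(instr_addr, disp8):
    """Target IP of a two-byte 'EB disp8' jump located at instr_addr."""
    # disp8 is a signed byte: values 80h..FFh mean -128..-1.
    disp = disp8 - 0x100 if disp8 & 0x80 else disp8
    # The displacement is relative to the instruction FOLLOWING the jump.
    # ASSUMPTION: 16-bit IP, so the result wraps around at 64 KB.
    return (instr_addr + 2 + disp) & 0xFFFF
```

With this model EB FE jumps to itself and EB 00 falls through to the next
instruction.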


LAR   Load Access Rights (Protected Mode)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: LAR reg1,reg/mem
Opcode  : 0F 02
Bug in  : some 386

Function:
LAR loads the access rights of a descriptor in the Global Descriptor Table,
whose selector is given by reg/mem, into reg1. When successful, ZF=1,
otherwise ZF=0.

Some 386es allow access to selector 0 in the GDT leaving ZF=1.
Normally this should not be possible and produce the condition ZF=0.

Workaround would be to create an entry 0 in the GDT that consists of only
zeroes. This will cause access with a selector of 0 to fail and
produce ZF=0.

A data breakpoint set to the mem16 operand of LAR can be missed on some
386es if the segment with the selector at mem16 is not accessible.
(see also <debugging>)


286-LOADALL / 386-LOADALL
──────────────────────────────────────────────────────────────────────────────

Mnemonic: LOADALL
Opcode  : 286 : 0F 05  (195 clocks)
          386+: 0F 07  (  ? clocks)
Bug in  : Is an undocumented opcode on 286,some 386,some early 486 ?
          Support for this instruction has been dropped with the 486.

Function:
Loads virtually all processor registers with defined values from memory.
Initialises processor to specified state. Apparently aliased on the 286 by
opcode 0f 04.

The 286 LOADALL instruction reads a block of 102 bytes into the chip,
starting at address 000800 hex.

          Memory description for LOADALL read area on 286:
          (addresses are in hexadecimal, lengths in decimal)

          0800:  6   N/A
          0806:  2   MSW (Machine Status Word)
          0808: 14   N/A
          0816:  2   TR (Task Register)
          0818:  2   FLAGS (286 Flags)
          081a:  2   IP (Instruction Pointer)
          081c:  2   LDT (Local Descriptor Table)
          081e:  2   DS (Data Segment)
          0820:  2   SS (Stack Segment)
          0822:  2   CS (Code Segment)
          0824:  2   ES (Extra Segment)
          0826:  2   DI (Destination Index)
          0828:  2   SI (Source Index)
          082a:  2   BP (Base Pointer)
          082c:  2   SP (Stack Pointer)
          082e:  2   BX (BX register)
          0830:  2   DX (DX register)
          0832:  2   CX (CX register)
          0834:  2   AX (AX register)
          0836:  6   ES cache (ES descriptor _cache_)
          083c:  6   CS cache (CS descriptor _cache_)
          0842:  6   SS cache (SS descriptor _cache_)
          0848:  6   DS cache (DS descriptor _cache_)
          084e:  6   GDTR (Global Descriptor Table)
          0854:  6   LDT cache (Local Descriptor Table _cache_)
          085a:  6   IDTR (Interrupt Descriptor table)
          0860:  6   TSS cache (Task State Segment _cache_)

          Descriptor caches layout:
          3 bytes    24 bit physical address of segment
          1 byte     access rights byte, same format as access right byte
                     in a regular descriptor. The 'present' bit now
                     represents a 'valid' bit. If this bit is cleared
                     (zero) the segment is invalid and accessing it will
                     trigger exception 0dh.
                     The DPL (Descriptor Privilege Level) fields of the CS
                     and SS descriptor caches determine the CPL
                     (Current Privilege Level).
          2 bytes    16 bit segment length limit.

This layout is the same for the GDTR and IDTR registers,
except that the access rights byte must be zero.

The register caches are internal CPU registers containing a copy of the last
'composed' address and access information loaded for a particular register
in protected mode (e.g. ES). An outline of the basics of 286 protected
mode register caching and register layout is beyond the scope of this file.
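
The 286 memory image can be described as a table of (name, length) pairs,
from which the addresses follow. The Python sketch below only restates the
list above and checks that it adds up to 102 bytes; all names in it are made
up for this illustration, and little-endian byte order within the descriptor
cache fields is an assumption.

```python
LOADALL_286_BASE = 0x0800

# (name, length in bytes), in the order of the memory description above.
LOADALL_286_FIELDS = [
    ("reserved_0", 6), ("MSW", 2), ("reserved_1", 14), ("TR", 2),
    ("FLAGS", 2), ("IP", 2), ("LDT", 2), ("DS", 2), ("SS", 2), ("CS", 2),
    ("ES", 2), ("DI", 2), ("SI", 2), ("BP", 2), ("SP", 2), ("BX", 2),
    ("DX", 2), ("CX", 2), ("AX", 2), ("ES_cache", 6), ("CS_cache", 6),
    ("SS_cache", 6), ("DS_cache", 6), ("GDTR", 6), ("LDT_cache", 6),
    ("IDTR", 6), ("TSS_cache", 6),
]


def field_addresses(fields, base):
    """Return ({name: address}, total length) for a list of fields."""
    addresses = {}
    address = base
    for name, length in fields:
        addresses[name] = address
        address += length
    return addresses, address - base


def pack_descriptor_cache_286(base24, access, limit16):
    """Build one 6-byte cache entry: 3 bytes base, 1 access, 2 limit."""
    # ASSUMPTION: multi-byte fields are stored little-endian.
    return (base24.to_bytes(3, "little") + bytes([access])
            + limit16.to_bytes(2, "little"))
```

field_addresses(LOADALL_286_FIELDS, LOADALL_286_BASE) reproduces the
addresses in the list, for example 0834 for AX and 0860 for the TSS cache.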


The 386 LOADALL loads 204 (dec) bytes from the address at ES:EDI and resumes
execution in the specified state.

          Memory description for LOADALL read area on 386+:
          (addresses are in hexadecimal, lengths in decimal)

          relative offset: Bytes:   Registers:
          0000:  4   CR0
          0004:  4   EFLAGS
          0008:  4   EIP
          000c:  4   EDI
          0010:  4   ESI
          0014:  4   EBP
          0018:  4   ESP
          001c:  4   EBX
          0020:  4   EDX
          0024:  4   ECX
          0028:  4   EAX
          002c:  4   DR6
          0030:  4   DR7
          0034:  4   TR
          0038:  4   LDT
          003c:  4   GS (zero extended)
          0040:  4   FS (zero extended)
          0044:  4   DS (zero extended)
          0048:  4   SS (zero extended)
          004c:  4   CS (zero extended)
          0050:  4   ES (zero extended)
          0054: 12   TSS descriptor cache
          0060: 12   IDT descriptor cache
          006c: 12   GDT descriptor cache
          0078: 12   LDT descriptor cache
          0084: 12   GS descriptor cache
          0090: 12   FS descriptor cache
          009c: 12   DS descriptor cache
          00a8: 12   SS descriptor cache
          00b4: 12   CS descriptor cache
          00c0: 12   ES descriptor cache

          Descriptor caches layout:
          1 byte     zero
          1 byte     access rights byte, same as 286
          2 bytes    zero
          4 bytes    32 bit physical base address of segment
          4 bytes    32 bit segment length limit


LSL   Load Segment Limit
──────────────────────────────────────────────────────────────────────────────

Mnemonic: LSL reg1,reg/mem
Opcode  : 0F 03
Bug in  : some 386

Function:
Loads the limit of a segment in protected mode by reading the GDT entry
selected by reg/mem into reg1. Proper completion sets ZF=1, otherwise ZF=0.

Some 386es allow access to selector 0 in the GDT leaving ZF=1.
Normally this should not be possible and produce the condition ZF=0.

Workaround would be to create an entry 0 in the GDT that consists of only
zeroes. This will cause access with a selector of 0 to fail and
produce ZF=0.

Some 386es leave SP/ESP corrupted after successful completion of LSL, when
LSL is followed by an explicit stack reference, using instructions like
CALL, ENTER, LEAVE, IRET, RET, PUSH, POP, PUSHA, POPA, PUSHF and POPF.
System-induced exceptions or interrupts however do not corrupt SP/ESP in
that case. A workaround is to code a NOP after LSL.

A data breakpoint set to the mem16 operand of LSL can be missed on some
386es if the segment with the selector at mem16 is not accessible.
(see also <debugging>)


MOV   Move data to and from registers and or memory
──────────────────────────────────────────────────────────────────────────────

Mnemonic: MOV involving CRx, DRx or TRx, MOV to SS, CS
Opcode  : 0F 2n [mod:rrr:r/m], 8E [mod:sreg:r/m]
Bug in  : some 88,some 86,some 386,all 386,A to C0 step of 486

Function:
MOV Moves data in and out of (special) registers and memory.

Some _very early_ 88 and 86 processors do not disable interrupts following
a MOV sreg,reg. This causes them to crash when an interrupt uses the stack
between MOV SS,reg and MOV SP,op. These versions carry a copyright message
for 1978 on the package. Later, corrected revisions, carry both 1978 and
1981 as the copyright year.
Normally interrupts would be disabled between the move to SS and execution
of the instruction following it on 88 and 86es. A workaround is to manually
disable the interrupts when reloading SS. The 286 and higher processors only
disable interrupts after a MOV SS, in contrast to earlier CPUs, including
the NECs, which do this with all MOV sreg,op instructions.

An unsolvable problem occurs when a non-maskable interrupt or exception
takes place while executing the instruction pair on an old 88 or 86.
Reports conflict, though, on whether this type of interrupt has any effect
on the bug.

On the 86 and 88, but not on the C-MOS versions 80C86 and 80C88, the
instruction MOV CS,op is valid and causes an unconditional jump.
The C-MOS versions, as well as the NEC V20 and V30 ignore this coding.
This may also be the case on the 186 but has not been tested.
The 286+ CPUs consider CS an invalid operand for this instruction and
generate exception 6 (Invalid opcode).
The opcode for the MOV CS,op is: 8e [mod:001:r/m] See also <POP CS>.

On some 386es, random breakpoint breaks occur from the debug registers
DR0-DR3 when a MOV from CR3, TR6 or TR7 is executed. This will continue until
after a jump instruction is executed. The actual contents of DR0-DR3 are not
affected. Workaround is to disable breakpoints before the MOV from CR3, TR6
or TR7, execute a JMP right after the move and then enable breakpoints again.
See also <debugging>

On some 386es a MOV to SS may cause a code or data breakpoint set to the
instruction following the MOV to be missed if the instruction takes more
than two clocks. (see <debugging>)

On all 386es a MOV to or from CRx, TRx or DRx executes correctly regardless
of the mod field (the first two bits in the third byte of the opcode).
The mod should be 11b. Intel documentation for the 386 stated it was
undefined.
Some 386 assemblers and compilers may generate values other than 11b for
mod and fail on early 486es, causing an Invalid Opcode Exception, since they
do require the mod field to be correct. More recent 486es recognize the
aliased instructions as valid and execute them accordingly.
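
To make the role of the mod field concrete, the third byte of a MOV to or
from CRx can be assembled as in this Python sketch. The helper names are
made up for the illustration; the opcodes 0F 20 (MOV reg32,CRx) and 0F 22
(MOV CRx,reg32) are the standard Intel encodings.

```python
def modrm(mod, reg, rm):
    """Assemble a [mod:reg:r/m] byte from its three fields."""
    return ((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7)


def encode_mov_from_cr(cr, gpr, mod=0b11):
    """MOV reg32,CRx: 0F 20 [mod:CRx:reg32]."""
    return bytes([0x0F, 0x20, modrm(mod, cr, gpr)])


def encode_mov_to_cr(cr, gpr, mod=0b11):
    """MOV CRx,reg32: 0F 22 [mod:CRx:reg32]."""
    return bytes([0x0F, 0x22, modrm(mod, cr, gpr)])


def has_canonical_mod(encoding):
    """True if the mod field (top two bits of the third byte) is 11b."""
    return encoding[2] >> 6 == 0b11
```

MOV EAX,CR0 thus encodes as 0F 20 C0. An encoding such as 0F 20 00, with mod
set to 00b, is the kind of alias that the 386 executes but early 486es
reject; has_canonical_mod() could be used to scan for such bytes.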

On all 386es, moves to or from DR4 and DR5 are aliased to DR6 and DR7.
On the early 486es these encodings are not recognized and generate an
Invalid Opcode Exception. More recent 486es do recognize these aliases and
execute them correctly.

On the A to C0 steps of the 486, loading TR5 with a reg32 operand may hang
the CPU if bits 0 and 1 (control bits) activate cache read, cache write or
flush. A workaround is:

JMP fetcher

ALIGN 16
fetcher:
     NOP
     IN AL,port   ; Note that this corrupts EAX...
     MOV TR5,EBX  ; EBX contained the new TR5 value.
     NOP
     NOP

On the A to C0 step of the 486 loading a value into CR0 which disables the
cache may corrupt the cache. Forcing a prefetch will avoid this.

     PUSHFD
     CLI
     MOV BL,CS:label
     MOV CR0,EAX
   label:
     POPFD
     NOP

Using EBX:
Note that using EBX for 32-bit addressing in real or virtual 86 mode is
likely to crash the system under the Microsoft Windows 3.0 DOS box in
standard mode, or after Windows 3.0 has terminated after running in standard
mode. Apparently the Windows 3.0 DOS box trashes EBX while servicing
interrupts, setting bit 18 of EBX to 1 and thus causing unwanted segment
violation errors. Use of EBX in calculations is likely to cause spurious
errors and may yield unpredictable behaviour of your code under the
aforementioned circumstances.

(MOV CS,op for NEC and 88/86, C88/C86, & 1978 copyright message
 supplied by Anthony Naggs).


MOVS  Move string of bytes, words or doublewords in memory
──────────────────────────────────────────────────────────────────────────────

Mnemonic: MOVSB / MOVSW / MOVSD
Opcode  : A4    / A5    / 66 A5
Bug in  : early 286 in PM, some 386

Function:
MOVS moves strings in memory. Possible units to move are byte, word and
doubleword. Typically the source is DS:(E)SI, the target ES:(E)DI.

If the single instruction MOVS (not prefixed by REPx) is executed with a
NULL selector in ES, or with ES:DI pointing beyond the segment limit,
causing exception 0dh, the CS:IP saved for the 0dh exception handler will
point after the MOVS instruction instead of to it on some 286s.

If a segment limit exception or IOPL violation exception occurs during the
REPx prefixed form of MOVS in Protected Mode, some early 286s will reset CX
to its initial setting (before the REPx started) instead of showing CX as
it was at the time of the exception. SI and DI are not affected and show the
values they had at the time of the exception.

During debugging with breakpoints set, REP MOVS can cause data breakpoints
to be missed on some 386, see <debugging>.

If, on some 386es, MOVS is followed by an instruction which uses a different
address size, or by an instruction which implicitly references the stack
(like POP, PUSH, IRET, RET, CALL, ENTER, LEAVE, PUSHA, POPA, PUSHF and POPF)
while the D-bit for the stack is different from the current address size
used by the MOVS instruction, the destination register updated will depend
on the address size of the instruction that follows, rather than that of
the MOVS. This can result in the updating of only DI when EDI was meant
or EDI when only DI was meant.

The repeated form REPx MOVS has the same bug, but in addition to (E)DI,
also (E)SI is affected.

A workaround is to always code a NOP with the same address size after MOVS
and REPx MOVS.

Example:

    (16-bit code segment)
    MOVSW       ; 16-bit addressing MOVS
    NOP         ; 16-bit addressing NOP
    db 67h
    MOVSW       ; 32-bit addressing MOVS
    db 67h
    NOP         ; 32-bit addressing NOP

    (32-bit code segment)
    MOVSD       ; 32-bit addressing MOVS
    NOP         ; 32-bit addressing NOP
    db 67h
    MOVSD       ; 16-bit addressing MOVS
    db 67h
    NOP         ; 16-bit addressing NOP


MUL  Unsigned Multiply 16 & 32-bit versions
──────────────────────────────────────────────────────────────────────────────

Mnemonic: MUL reg
Opcode  : (66) F7 Ex
Bug in  : 386

Function:
MUL multiplies ax with a 16-bit operand to form a 32-bit result in dx:ax.
The 32-bit version multiplies eax with a 32-bit operand to form a 64-bit
result in edx:eax.

Some 386es have a problem which redirects output from the 32-bit MUL
to the wrong parts of the wrong registers.

Typically the following happens:

Properly operating 32-bit version:  Properly operating 16-bit version:

  EAX: 'A':'B'                        EAX: 'A':'B'
  EBX: 'C':'D'                        EBX: 'C':'D'
  EDX: 'E':'F'                        EDX: 'E':'F'

CD x AB gives a result in EF:AB     D x B gives a result in F:B

While executing the 32-bit MUL, the faulty CPU takes CD times AB and puts
the value it should have added to 'A' into 'F' while at the same time
adding the value it should have put into EF to AB.

No workaround other than to use 16-bit multiply.
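
How a 32-bit multiply can be composed from 16-bit multiplies is sketched
below in Python. This models only the arithmetic, not any particular
assembly sequence; the function names mul16 and mul32_via_mul16 are made up
for the illustration.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a, b):
    """Model of 16-bit MUL: return (DX, AX) for a * b."""
    product = (a & MASK16) * (b & MASK16)
    return product >> 16, product & MASK16


def mul32_via_mul16(a, b):
    """Return (EDX, EAX) for a * b using four 16-bit multiplies."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    def wide(x, y):
        dx, ax = mul16(x, y)
        return (dx << 16) | ax

    # Schoolbook multiplication with 16-bit "digits": the two cross
    # products are shifted by 16 bits, the high product by 32 bits.
    product = (wide(a_lo, b_lo)
               + ((wide(a_lo, b_hi) + wide(a_hi, b_lo)) << 16)
               + (wide(a_hi, b_hi) << 32))
    return (product >> 32) & MASK32, product & MASK32
```

In real code the additions need carry propagation (ADD/ADC); Python's
unbounded integers hide that detail here.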

Some 386es have a bug which generates incorrect values in 16-bit mode.
The iAPX program from IGEL (Chris Lueders) tests for this bug.

Intel apparently organized a replacement project to get the faulty chips
returned to the factory for screening. After testing at Intel the faulty CPUs
were sold again to bulk buyers who installed them in 16-bit only machines.
These tested and failed chips carry the text "16-bit S/W only" or a single
sigma. The tested and passed chips carry a double sigma (ΣΣ) on the package.

(supplied by Chris Lueders)


NEC V20/V30 introduction
──────────────────────────────────────────────────────────────────────────────

The NEC V series microprocessors are functionally similar to the 8086 design
which NEC licensed from Intel. The internal microcode and most NEC mnemonics
are different from Intel's, to avoid Intel copyright claims. Only the
NEC V20 & V30, pin compatible with 8088 & 8086 respectively, are usually
found in IBM compatible PCs.
The V20 and V30 are often supplied as an "upgrade kit" for PCs originally
equipped with an 88/86, as they execute most instructions in fewer clocks
and can be used at a higher clock rate than the Intel parts.

Occasionally single board PCs use the V40 & V50, which are based on the same
CPU core and have integrated peripheral functions. Other V series family
members diverge further from the Intel x86 series and are used in
controllers and instrumentation rather than PCs.

The V20 and V30 have four classes of extra instructions beyond those
present on the 86/88:
*   the instructions Intel introduced on the 186/188
*   unique instructions for the NEC V series
*   instructions to switch to/from 8O8O emulation mode
*   8O8O instructions in 8O8O emulation mode

(8080 is written here as 8O8O to avoid visual confusion with the 8088).
Since the 188/186 instructions are widely documented, and the 8O8O
instructions are of use only if you are writing a CP/M emulator or similar,
these instructions are not listed. The special instructions which can be
used in Intel x86 mode are listed in the <NEC mnemonics page>.

(Supplied by Anthony Naggs)


NEC V20/V30-specific mnemonics list
──────────────────────────────────────────────────────────────────────────────

  Bit field instructions:

  <INS>     (NECINS) Insert bit field  <EXT>    Extract bit field
  <TEST1>   Test a specific bit        <NOT1>   Invert a specific bit
  <CLEAR1>  Clear a specific bit       <SET1>   Set a specific bit

  Packed BCD support:

  <ADD4S>   Add packed BCD numbers     <SUB4S>  Subtract BCD strings
  <CMP4S>   Compare BCD strings (subtract without storing)
  <ROL4>    Rotate left 4 bits         <ROR4>   Rotate right 4 bits

  Instruction prefixes:

  <REPC>    Repeat while Carry         <REPNC>  Repeat while No Carry

  Floating point escape:               Start 8O8O emulation:

  <FPO2>    NEC equivalent of ESC      <BRKEM>  Break to 8O8O emulation mode


(Supplied by Anthony Naggs)


NOT1 Invert a specific bit (NOT operation) (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: NOT1 reg/mem,CL/immediate
Opcode  : NOT1 r/m8,CL   : 0F 16 [mod:000:r/m]      (4/18 clocks)
          NOT1 r/m8,imm3 : 0F 1E [mod:000:r/m] imm  (5/19 clocks)
          NOT1 r/m16,CL  : 0F 17 [mod:000:r/m]      (4/18 clocks)
          NOT1 r/m16,imm4: 0F 1F [mod:000:r/m] imm  (5/19 clocks)
          NOT1 CY        : F5  (NEC nomenclature for Intel's CMC)
Bug in  : Rarely documented, except in NEC manuals

Function:
NOTs the specified bit in the register/memory operand. The bit number (CL
or immediate) is ANDed with 07 (for 8-bit operands) or 0F (for 16-bit
operands) to get a valid bit number. No flags are affected by this
operation, except by NOT1 CY.
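
The masking of the bit number can be illustrated with a short Python model.
It is only a sketch of the behaviour described above; the function name not1
and its width parameter are made up for the illustration.

```python
def not1(value, bit, width=8):
    """Model of NOT1: invert one bit of an 8-bit or 16-bit operand."""
    if width not in (8, 16):
        raise ValueError("width must be 8 or 16")
    # The bit number is ANDed with 07 (8-bit) or 0F (16-bit), so an
    # out-of-range bit number wraps around instead of being rejected.
    bit_mask = 0x07 if width == 8 else 0x0F
    return (value ^ (1 << (bit & bit_mask))) & ((1 << width) - 1)
```

Note that bit number 11 inverts bit 3 of an 8-bit operand, but bit 11 of a
16-bit operand.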

The first (smaller) clock count in each pair is for register operands.
Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: NECINS, EXT, TEST1, CLEAR1, SET1


POP  Pop register from stack
──────────────────────────────────────────────────────────────────────────────

Mnemonic: POP
Opcode  : 58+reg (01011rrr) for general purpose registers, 0F for POP CS
Bug in  : POP CS is a valid opcode for 88, 86, invalid opcode for 186
          0F is prefix byte on NEC V20/30 and 286+
          POP SS and breakpoints on some 386

Function:
POP retrieves data from the stack while adjusting the stackpointer.

The 88 and 86 allow the encoding of 0f for POP CS. The NEC V20 and V30,
as well as the 286+ CPUs use that encoding to indicate new instructions.
On the 88 and 86 POP CS causes an unconditional jump. Executing 0F on
the 186 generates an Invalid opcode exception (6).

On some 386es a code or data breakpoint set to the instruction following
POP SS will not be taken if the instruction takes more than two clocks.
(see also <debugging>)

(POP CS supplied by Anthony Naggs)


POPA / POPAD Pop all general purpose registers
──────────────────────────────────────────────────────────────────────────────

Mnemonic: POPA / POPAD
Opcode  : 61 / 66 61
Bug in  : some 386

Function:
POPA and POPAD pop all general purpose registers from the stack.
POPA pops 16-bit registers and POPAD pops 32-bit registers. The opcode is
the same. POPAD is POPA with an operand size prefix (66h).

If either POPA or POPAD is followed by an instruction which uses an
effective address calculation consisting of a base register and another
register other than (E)AX as an index, the contents of EAX are corrupted.

Also, if POPA or POPAD in 16-bit mode is followed by an instruction which
uses an effective address using EAX as a base or index, the CPU will hang.

The workaround is to always code a NOP after POPA as well as POPAD.


Prefetch queue, bus & cache parameters per CPU
──────────────────────────────────────────────────────────────────────────────

              NEC         NEC          sx  dx  sx  dx
           88 V20 188  86 V30 186 286 386 386 486 486 Pentium
         ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───────┐
  SPQB───┤ 4 │ 4 │ 4 │ 6 │ 6 │ 6 │ 6 │16 │16 │32 │32 │32 x 2 │
 NEBIPQ──┤ 1 │ 1 │ 1 │ 2 │ 2 │ 2 │ 2 │ 2 │ 4 │16 │16 │     ? │
 MPBRMP──┤ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │16b│16b│    32a│
  DIQL───┤ - │ - │ - │ - │ - │ - │ 3 │ 3 │ 3 │ - │ - │     ? │
  OCSKB──┤ - │ - │ - │ - │ - │ - │ - │ - │ - │ 8 │ 8 │ 8 x 2 │
  DBSB───┤ 8 │ 8 │ 8 │16 │16 │16 │16 │16 │32 │32 │32 │    64 │
  ABSB───┤20 │20 │20 │20 │20 │20 │24 │24 │32 │32 │32 │    32 │
         └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───────┘
  Legend:

   SPQB = Size of the Prefetch Queue (PQueue) in Bytes
 NEBIPQ = Number of Empty Bytes In PQueue to initiate prefetch cycle
*MPBRMP = Minimum possible number of Bytes to Read from Memory to Prefetch
   DIQL = Decoded Instruction Queue Length, measured in instructions
  OCSKB = On-chip Cache Size in KiloBytes
   DBSB = Data Bus Size in Bits
   ABSB = Address Bus Size in Bits
      - = None
      b = 16-byte burst mode cache line fill
      a = 32-byte burst mode cache line fill

* note that starting with the 486, prefetches are read from the cache.
  A cache line fill is performed in case of a cache miss and starts to
  read on paragraph boundaries only. A cache line on the 486 is 16 bytes
  in size. On the Pentium, a line fill starts on a boundary which lies
  at an even number of paragraphs (32-byte chunks).

(NEC & 188/186 prefetches supplied by Anthony Naggs)


PUSH  Pushes value or register onto the stack.
──────────────────────────────────────────────────────────────────────────────

Mnemonic: PUSH reg / PUSH mem
Opcode  : 01010rrr / FF [mod:110:r/m]
Bug in  : PUSH (E)SP different operation on 286+, PUSH mem on some 286 in PM

Function:
PUSH pushes a value or register onto the stack.

Normally, (E)SP is first decremented by a word or dword, after which the
value is stored at the location now pointed to by SS:SP (or SS:ESP on 386+).

When pushing any other register or value, the difference between 286+ and
previous CPUs is not visible and causes no problems.
However, when pushing SP (or ESP on 386+) the value pushed is different
between 286+ and previous CPUs.

On CPUs prior to the 286, the value stored is SP as it is after the
decrement. On the 286+, however, the value stored is SP as it was before the
decrement, leaving a different value on the stack for SP. On the 386+ the
same is in effect when pushing ESP.
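
The difference is often used to tell a 286+ from earlier CPUs (PUSH SP,
POP AX, then compare AX with SP). The Python sketch below models both
behaviours with a dictionary standing in for stack memory; all function
names are made up for the illustration.

```python
def push_sp_pre_286(sp, stack):
    """88/86/186 style: the already decremented SP is stored."""
    sp = (sp - 2) & 0xFFFF
    stack[sp] = sp
    return sp


def push_sp_286_plus(sp, stack):
    """286+ style: the value SP had before the decrement is stored."""
    value = sp
    sp = (sp - 2) & 0xFFFF
    stack[sp] = value
    return sp


def looks_like_pre_286(push_sp):
    """Mimic PUSH SP / POP AX / CMP AX,SP using one of the models."""
    stack = {}
    original_sp = 0x0100
    new_sp = push_sp(original_sp, stack)
    popped = stack[new_sp]          # POP AX; SP is back at original_sp
    return popped != original_sp    # differs only on pre-286 CPUs
```

looks_like_pre_286(push_sp_pre_286) is true and
looks_like_pre_286(push_sp_286_plus) is false.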

If PUSH mem on the 286 in Protected Mode causes a stack limit violation
(exception 0ch), the saved CS:IP will point _after_ the PUSH instead of _to_
it on some early 286s.


RDTSC Read Time Stamp Counter
──────────────────────────────────────────────────────────────────────────────

Mnemonic: RDTSC
Opcode  : 0F 31
Bug in  : Poorly documented for Pentium Processor

Function:
RDTSC reads a Pentium internal 64-bit register which is incremented
from 0000 0000 0000 0000 at every CPU internal clock cycle. Note that this
gives a clock-cycle-accurate timer with a range of more than 8800 years at
66 MHz...

The instruction places the counter in the EDX:EAX register pair.
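
The quoted range is easy to verify. The Python sketch below is only a
back-of-the-envelope check; the function names are made up, and a year of
365.25 days is assumed.

```python
def tsc_range_years(clock_hz, counter_bits=64,
                    seconds_per_year=365.25 * 24 * 3600):
    """Years until a counter of counter_bits wraps at clock_hz."""
    # ASSUMPTION: one year = 365.25 days.
    return (2 ** counter_bits) / clock_hz / seconds_per_year


def split_edx_eax(counter):
    """Split a 64-bit counter value into the (EDX, EAX) halves."""
    return (counter >> 32) & 0xFFFFFFFF, counter & 0xFFFFFFFF
```

tsc_range_years(66e6) comes out between 8800 and 8900 years, in line with
the figure given above.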


REPNC / REPC  Repeat next string operation while (No) Carry
──────────────────────────────────────────────────────────────────────────────

Mnemonic: REPC / REPNC
Opcode  : 65   / 64 (  ? clocks) (GS/FS override on 386+)
Bug in  : Rarely documented except in NEC manuals, invalid on Intel CPUs
          Conflicting opcode for GS and FS segment override for 386+

Function:
REPC repeats the following string instruction while the Carry Flag is set.
REPNC repeats the following string instruction while the Carry Flag is
clear. CX should hold the maximum number of iterations,
just as with REPZ/REPNZ.

Note that since these instructions work with the Carry Flag, they have no
special effect on MOVS and LODS. A simple REP should be used in these cases.

These instructions are NEC specific. They are not implemented on the Intel
CPUs. Note that the 386+ implements the listed opcodes 64 and 65 for the
segment override instructions FS and GS respectively.

If your software will run on a NEC, they may be handy.


ROL4   Rotate left 4 bits (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: ROL4  reg8/mem8
Opcode  : 0F 28 [mod:000:r/m]  (25/28 clocks)
Bug in  : Rarely documented, except in NEC manuals

Function:
Rotates a BCD digit (4 bits) left out of the operand, through the low 4 bits
of AL.

                    AL                   reg/mem
             7 . . . . . . 0         7 . . . . . . 0
            ┌───────┬───────┐       ┌───────┬───────┐
            │       │       │<──────┤       │       │<───┐
            └───────┴───┬───┘       └───────┴───────┘    │
                        └──>─────────────────────────────┘
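
The diagram can be read as the following C model. This is a sketch only:
the function name is hypothetical, and it assumes that the high 4 bits of
AL are left unchanged, which the text above does not state.

```c
#include <stdint.h>

/* Model of ROL4 as drawn in the diagram: the operand's high nibble moves
 * into the low nibble of AL, the operand's low nibble moves up, and the
 * old low nibble of AL enters the operand from the right.
 * Assumption: the high nibble of AL is preserved. */
void rol4_model(uint8_t *al, uint8_t *operand)
{
    uint8_t old_al_low = (uint8_t)(*al & 0x0F);
    uint8_t out_nibble = (uint8_t)(*operand >> 4);

    *operand = (uint8_t)((*operand << 4) | old_al_low);
    *al = (uint8_t)((*al & 0xF0) | out_nibble);
}
```

With AL = 03h and an operand of 12h, the model leaves 23h in the operand
and 01h in AL.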

The first (smaller) clock count is for a register operand.
Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: ADD4S, SUB4S, CMP4S, ROR4


ROR4   Rotate right 4 bits (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: ROR4  reg8/mem8
Opcode  : 0F 2A [mod:000:r/m]  (29/33 clocks)
Bug in  : Rarely documented, except in NEC manuals

Function:
Rotates a BCD digit (4 bits) right out of the operand, through the low 4
bits of AL.

                    AL                   reg/mem
             7 . . . . . . 0         7 . . . . . . 0
            ┌───────┬───────┐       ┌───────┬───────┐
            │       │       ├──────>│       │       │>───┐
            └───────┴───┬───┘       └───────┴───────┘    │
                        └──<─────────────────────────────┘
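
The diagram can be read as the following C model. This is a sketch only:
the function name is hypothetical, and it assumes that the high 4 bits of
AL are left unchanged, which the text above does not state.

```c
#include <stdint.h>

/* Model of ROR4 as drawn in the diagram: the operand's low nibble moves
 * into the low nibble of AL, the operand's high nibble moves down, and the
 * old low nibble of AL enters the operand from the left.
 * Assumption: the high nibble of AL is preserved. */
void ror4_model(uint8_t *al, uint8_t *operand)
{
    uint8_t old_al_low = (uint8_t)(*al & 0x0F);
    uint8_t out_nibble = (uint8_t)(*operand & 0x0F);

    *operand = (uint8_t)((old_al_low << 4) | (*operand >> 4));
    *al = (uint8_t)((*al & 0xF0) | out_nibble);
}
```

With AL = 03h and an operand of 12h, the model leaves 31h in the operand
and 02h in AL.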

The first (smaller) clock count is for a register operand.
Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: ADD4S, SUB4S, CMP4S, ROL4


SET1 Set a specific bit (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: SET1 reg/mem,CL/immediate
Opcode  : SET1 r/m8,CL   : 0F 14 [mod:000:r/m]      (4/13 clocks)
          SET1 r/m8,imm3 : 0F 1C [mod:000:r/m] imm  (5/14 clocks)
          SET1 r/m16,CL  : 0F 15 [mod:000:r/m]      (4/13 clocks)
          SET1 r/m16,imm4: 0F 1D [mod:000:r/m] imm  (5/14 clocks)
          SET1 CY        : F9   (NEC nomenclature for Intel's STC)
          SET1 DIR       : FD   (NEC nomenclature for Intel's STD)
Bug in  : Rarely documented, except in NEC manuals

Function:
Sets the specified bit in the register/memory operand. The bit number (CL
or immediate) is ANDed with 07 (for 8-bit operands) or 0F (for 16-bit
operands) to get a valid bit number. No flags are affected by this
operation, except the Carry and Direction Flag with SET1 CY and SET1 DIR.
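
The masking of the bit number can be sketched in C as follows. The
function names are hypothetical; the model covers only the register or
memory forms, not SET1 CY and SET1 DIR.

```c
#include <stdint.h>

/* Bit number actually used: only the low 3 bits of the count matter for
 * 8-bit operands, the low 4 bits for 16-bit operands. */
unsigned set1_bit_number(unsigned count, unsigned operand_bits)
{
    return count & (operand_bits == 16 ? 0x0Fu : 0x07u);
}

/* SET1 on an 8-bit operand. */
uint8_t set1_byte(uint8_t value, unsigned count)
{
    return (uint8_t)(value | (1u << set1_bit_number(count, 8)));
}

/* SET1 on a 16-bit operand. */
uint16_t set1_word(uint16_t value, unsigned count)
{
    return (uint16_t)(value | (1u << set1_bit_number(count, 16)));
}
```

A count of 9 therefore sets bit 1 of a byte operand, but bit 9 of a word
operand.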

The first (smaller) clock count in each pair is for register operands.
Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: NECINS, EXT, TEST1, NOT1, CLEAR1


SETALC   Set AL according to Carry
──────────────────────────────────────────────────────────────────────────────

Mnemonic: SETALC
Opcode  : D6  (  ? clocks)
Bug in  : Is an undocumented opcode on 88,86,286,386,486
          Does not work on NEC and Sony V20+ (is alias for XLATB there)

Function:
This instruction copies the Carry Flag to the AL register without changing
any flags. In case of a CY, AL becomes ffh. When the Carry Flag is cleared,
AL becomes 00.
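
In C terms the result can be modelled as below; the function name is
hypothetical. The value is the same as the one SBB AL,AL would leave in
AL, but unlike SBB the instruction leaves the flags alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value left in AL by SETALC for a given state of the Carry Flag. */
uint8_t setalc_model(bool carry_flag)
{
    return carry_flag ? 0xFF : 0x00;
}
```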

(NEC & Sony difference, and 86/88 availability supplied by Anthony Naggs)


Shift and Rotate operand limitations
──────────────────────────────────────────────────────────────────────────────

Mnemonic: SHL, SAL, SHR, SAR, ROL, RCL, ROR, RCR, and all xxxD variants
Opcode  : various
Bug in  : 186+ will AND the shift- or rotate count with 1f before execution
          NEC V20 and V30 act like 88 / 86 and do not limit the count.

Function:
The instructions mentioned above AND the shift or rotate count with 1f
before execution, so only the low 5 bits of the count are used. A shift
count of 21h will therefore actually result in a shift of 1.

This is also the case for the double shifts on 386+.
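
The effect on a 16-bit SHL can be sketched in C as follows. The function
names and the masks_count parameter are hypothetical; masks_count is true
for the 186+ behaviour and false for the 88 / 86 and NEC V20 / V30
behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of bit positions actually shifted for a given count. */
unsigned effective_shift_count(unsigned count, bool masks_count)
{
    return masks_count ? (count & 0x1Fu) : count;
}

/* SHL of a 16-bit operand by an 8-bit count (as held in CL). */
uint16_t shl16_model(uint16_t value, uint8_t count, bool masks_count)
{
    unsigned n = effective_shift_count(count, masks_count);

    /* Shifting a 16-bit operand by 16 or more positions clears it. */
    return n >= 16 ? 0 : (uint16_t)((uint32_t)value << n);
}
```

With a count of 21h, a value of 0001h becomes 0002h when the count is
masked, and 0000h when it is not.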

(186 and NEC difference supplied by Anthony Naggs)


SUB4S   Subtraction of packed BCD strings (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: SUB4S
Opcode  : 0F 22  (7+19n clocks, n is the number of bytes per operand)
Bug in  : Rarely documented, except in NEC manuals, is conflicting opcode
          on 386+ (MOV)

Function:
Subtracts the packed BCD string at DS:SI from the packed BCD string at
ES:DI. The length of the string, in BCD digits, is specified in CL. Unlike
Intel string operations CL, DI & SI are unchanged by the operation. The
Zero Flag (ZF) is set if the result is zero. The Carry Flag (CF) and
Overflow Flag (OF) appear to be set by the subtraction of the most
significant digits.

Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+ CPUs.

(Supplied by Anthony Naggs)

See Also: ADD4S, CMP4S, ROL4, ROR4


TEST1 Test a specific bit (NEC V20/30 only)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: TEST1 reg/mem,CL/immediate
Opcode  : TEST1 r/m8,CL   : 0F 10 [mod:000:r/m]      (3/12 clocks)
          TEST1 r/m8,imm3 : 0F 18 [mod:000:r/m] imm  (4/13 clocks)
          TEST1 r/m16,CL  : 0F 11 [mod:000:r/m]      (3/12 clocks)
          TEST1 r/m16,imm4: 0F 19 [mod:000:r/m] imm  (4/13 clocks)
Bug in  : Rarely documented, except in NEC manuals, opcodes 0f 10 and
          0f 11 are conflicting opcodes on 386+ (MOV aliases for 88-8b)

Function:
Tests the specified bit in the register/memory operand. If the bit is zero
the Z flag is set, otherwise it is cleared. The bit number (CL or immediate)
is ANDed with 07 (for 8-bit operands) or 0F (for 16-bit operands) to get a
valid bit number.
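
A C sketch of the resulting Zero Flag, with hypothetical function names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Zero Flag after TEST1 on an 8-bit operand: set when the selected bit
 * is 0. Only the low 3 bits of the count select the bit. */
bool test1_byte_zf(uint8_t value, unsigned count)
{
    unsigned bit = count & 0x07u;
    return ((value >> bit) & 1u) == 0;
}

/* Zero Flag after TEST1 on a 16-bit operand. Only the low 4 bits of the
 * count select the bit. */
bool test1_word_zf(uint16_t value, unsigned count)
{
    unsigned bit = count & 0x0Fu;
    return ((value >> bit) & 1u) == 0;
}
```

For a byte operand of 02h, a count of 9 selects bit 1 and clears the Zero
Flag, while a count of 8 selects bit 0 and sets it.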

The first (smaller) clock count in each pair is for register operands.
Note that 0F is treated as <POP CS> on the 88/86 and prefixes newer
instructions on 286+.

(Supplied by Anthony Naggs)

See Also: NECINS, EXT, NOT1, CLEAR1, SET1


UNKNOWN opcode, info wanted
──────────────────────────────────────────────────────────────────────────────

Mnemonic: UNKNOWN
Opcode  : 0F 04  (  ? clocks)
Bug in  : Is an unknown opcode on 286

Function:
Exact purpose unknown. When executed it hangs the machine, likely by
bringing it into protected mode. Anyone with a hardware debugger may be
able to find out. This instruction is likely to be an alias for LOADALL
on the 286. It does not generate an exception. >> info wanted <<


VERR / VERW Verify a segment selector for Reading or Writing
──────────────────────────────────────────────────────────────────────────────

Mnemonic: VERR op / VERW op
Opcode  : 0F 00 [mod:100:r/m] / 0f 00 [mod:101:r/m]
Bug in  : some 386

Function:
VERR verifies that the segment selector in memory, pointed to by op, is
readable and accessible with the current privilege level (CPL).
If so, the Zero Flag is set to 1, if not, the Zero Flag is cleared.

VERW verifies that the segment selector in memory, pointed to by op, is
writable and accessible with the current privilege level (CPL).
If so, the Zero Flag is set to 1, if not, the Zero Flag is cleared.

On some 386 both instructions allow a NULL selector to be specified,
accessing selector zero in the GDT, instead of failing unconditionally with
ZF=0, which would be the normal procedure. Workaround is to fill descriptor
zero in the GDT with all zeroes. Accessing it will then always fail and
produce the desired effect.

On some 386 both VERR and VERW can hang the CPU until an INTR, NMI or RESET
occurs. This bug occurs when there is no memory operand, JMP or CALL
instruction in the <prefetch queue> along with the VERR or VERW.
Workaround is to code a JMP or Jcondition instruction right after the VERR
or VERW, with the added condition that _the last byte_ of the VERR / VERW
and the _complete_ JMP instruction must fit in the same aligned doubleword.
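
The placement rule can be checked with a small helper such as the C sketch
below. The function name and parameters are hypothetical, and the helper
assumes the JMP immediately follows the VERR / VERW.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the last byte of the VERR / VERW and the complete
 * JMP that follows it lie in the same aligned doubleword.
 * verr_addr: address of the first byte of the VERR / VERW
 * verr_len : length of the VERR / VERW in bytes
 * jmp_len  : length of the JMP or Jcondition in bytes */
bool verr_jmp_in_same_dword(uint32_t verr_addr, unsigned verr_len,
                            unsigned jmp_len)
{
    uint32_t last_verr_byte = verr_addr + verr_len - 1;
    uint32_t last_jmp_byte = last_verr_byte + jmp_len;

    return (last_verr_byte >> 2) == (last_jmp_byte >> 2);
}
```

With a 2-byte short JMP, the last byte of the VERR / VERW has to sit at
offset 0 or 1 within its doubleword for the check to succeed.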

A data breakpoint set to the mem16 operand of either VERR or VERW can be
missed on some 386s if the segment with the selector at mem16 is not
accessible. (see also <debugging>)


WBINVD Write back & invalidate both internal & external caches
──────────────────────────────────────────────────────────────────────────────

Mnemonic: WBINVD
Opcode  : 0F 09
Bug in  : some 486

Function:
WBINVD tells the processor that all data in both the internal as well as the
external caches is invalid. Data held in external write-back caches is
written back to memory before the flush.

If on some 486s a cache line fill is in progress while the WBINVD
instruction is being executed, that line is NOT invalidated and the buffer
contents are moved into the cache. Valid cache lines are ALWAYS used to
satisfy read requests on all 486s, regardless of whether the cache is
enabled or not.

Workaround is to disable the cache prior to flushing it like this:

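        MOV EAX,CR0
        OR  EAX,60000000h  ; cache disable bits (CD and NW)
        PUSHFD             ; save the interrupt flag
        CLI
        MOV BL,CS:here     ; read a byte at label 'here'
        OUT dummyport,dummydata
        MOV CR0,EAX        ; cache disabled from here on
here:
        WBINVD
        AND EAX,9fffffffh  ; cache enable, write-through
        MOV CR0,EAX
        POPFD              ; restore the interrupt flag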


Write / Read Model Specific Register (Pentium+ compatible)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: WRMSR / RDMSR
Opcode  : 0F 30 / 0f 32
Bug in  : Are minimally documented opcodes for Pentium+ compatible CPUs

Function:
It should be possible to use the WRMSR & RDMSR instructions on any CPU which
A: supports the CPUID instruction and
B: has the extension bit 5 in the feature bitmap of EDX set after
   executing function 1 (EAX=1) with CPUID.

WRMSR writes to a Model Specific Register. EDX:EAX contain the value to
write into the register whose number is given in ECX.

RDMSR reads from a Model Specific Register. EDX:EAX will receive the value
from the MSR whose number is given in ECX.
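
The feature test and the register pair can be sketched in C as follows.
The function names are hypothetical; the sketch works on plain values and
does not execute CPUID, RDMSR or WRMSR itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition B above: bit 5 of the EDX feature bitmap returned by CPUID
 * function 1. */
bool msr_feature_bit_set(uint32_t cpuid1_edx)
{
    return ((cpuid1_edx >> 5) & 1u) != 0;
}

/* Split a 64-bit value into the EDX (high) and EAX (low) halves used
 * by WRMSR. */
void msr_split(uint64_t value, uint32_t *edx, uint32_t *eax)
{
    *edx = (uint32_t)(value >> 32);
    *eax = (uint32_t)(value & 0xFFFFFFFFu);
}

/* Join the EDX:EAX pair returned by RDMSR into one 64-bit value. */
uint64_t msr_join(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}
```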

  List of Model Specific Registers:

  00h   Machine Check Exception-Address register (Read-only)
  01h   Machine Check Exception-Type register (Read-only)
  02h   Unknown
  ..
  0dh   Unknown
  0eh   Test register TR12
  0fh   Unknown
  10h   Time Stamp Counter (See RDTSC)
  11h   Counter / Event Selection register (See CESR Map)
  12h   Counter #0 (40 bit resolution)
  13h   Counter #1 (40 bit resolution)


      CESR Map. Note that CESR is a 64-bit register, of which only the
      bottom 32 bits are currently known to be used.

      Bit 31                         16                              0
      ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
      │r│r│r│r│r│r│r│c│3│2│t│t│t│t│t│t│r│r│r│r│r│r│r│C│3│2│T│T│T│T│T│T│
      └─┴─┴─┴─┴─┴─┴─┴┬┴┬┴┬┴┬┴─┴─┴─┴─┴┬┴─┴─┴─┴─┴─┴─┴─┴┬┴┬┴┬┴┬┴─┴─┴─┴─┴┬┘
                     │ │ │ └─────┬───┘               │ │ │ └────┬────┘
      Counting method┴─│─│───────│───────────────────┘ │ │      │
                       │ └─────┐ │                     │ │      │
                       └┬──────│─│─────────────────────┘ │      │
      Allow counting in CPL3   │ │                       │      │
      Allow counting in CPL0-2─┴─│───────────────────────┘      │
      Event type (what to count)─┴──────────────────────────────┘
      (see list below)
      └───────────┬──────────────────┘└─────────────┬────────────────┘
      Counter #1:─┘                     Counter #0:─┘

      Counting methods:         1= count CPU cycles     0= count events
      Allow count in CPL3:      1= Yes                  0= No
      Allow count in CPL0-2:    1= Yes                  0= No

      Event Type List:
        00h data read
        01h data write
        02h data TLB miss
        03h data read miss
        04h data write miss
        05h Write (hit) to M (modified) or E (exclusive) cacheline
            (MESI protocol)
        06h data cache lines written back
        07h data cache snoops
        08h data cache snoop hits
        09h memory accesses in both pipes
            (cumulative ?)
        0ah data bank access conflicts (U & V pipe access same data line in
            data cache).
        0bh misaligned data memory references
        0ch code read
        0dh code TLB miss
        0eh code cache miss
        0fh any segment register load
        10h segment descriptor cache accesses
        11h segment descriptor cache hits
        12h branches
        13h Branch Target Buffer (BTB) hits
        14h taken branch or BTB hit
        15h pipeline flushes
        16h instructions executed
        17h instructions executed in V pipe
        18h bus utilization (apparently events in which the CPU has to wait
            for bus access).
        19h pipeline stalled by write backups
        1ah pipeline stalled by data memory read
        1bh pipeline stalled by write to M or E line
        1ch locked bus cycle (for instance during xchg)
        1dh I/O read or write cycles
        1eh noncacheable memory references
        1fh pipeline stalled by Address Generation Interlock (AGI)
        20h unknown
        21h unknown
        22h floating point operations
        23h breakpoint 0 match
        24h breakpoint 1 match
        25h breakpoint 2 match
        26h breakpoint 3 match
        27h hardware interrupts
        28h data read or data write
        29h data read miss or data write miss

    (All info provided by Christian Ludloff)
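
As an illustration of the CESR map, the C sketch below assembles a CESR
value from the fields shown in the diagram. The function names are
hypothetical, and the bit positions are read off the diagram: bits 0-5
event type, bit 6 counting in CPL0-2, bit 7 counting in CPL3, bit 8
counting method, with the same layout 16 bits higher for counter #1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build the 16-bit control field for one counter. Reserved bits stay 0. */
uint32_t cesr_field(unsigned event, bool cpl0_2, bool cpl3,
                    bool count_cycles)
{
    return (uint32_t)(event & 0x3Fu)
         | ((uint32_t)cpl0_2 << 6)
         | ((uint32_t)cpl3 << 7)
         | ((uint32_t)count_cycles << 8);
}

/* Counter #0 uses the low 16 bits of CESR, counter #1 the high 16 bits. */
uint32_t cesr_value(uint32_t counter0_field, uint32_t counter1_field)
{
    return (counter0_field & 0xFFFFu)
         | ((counter1_field & 0xFFFFu) << 16);
}
```

Counting event 16h (instructions executed) at all privilege levels on
counter #0 gives a field value of 00D6h.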


All mentioned x86 CPU instructions by Mnemonic
──────────────────────────────────────────────────────────────────────────────

  Click on any instruction mnemonic to see details.
  See <Breakpoint errors> for CPU bugs relating to debugging.
  See <Chip Step info> for a summary on revision codes.
  See <General FPU bugs> for FPU bugs unrelated to instructions.
  See <FPU mnemonics> for FPU bugs related to FPU instructions.
  See <List of NEC mnemonics> for a list of NEC instructions.
  See <NEC general info> for a summary of special features in NECs.


  <AAA>     Adjust after addition      <AAD>    Adjust after division
  <AAM>     Adjust after multiply      <AAS>    Adjust after subtraction
  <BOUND>   Bounds check
  <BSF>     Bit scan forward           <BSWAP>  4-Byte swap (e-registers)
  <BT>      Bit test                   <BTC>    Bit test & complement
  <BTR>     Bit test & reset           <BTS>    Bit test & set
  <CHKIND>  Alias mnemonic for BOUND on NEC

  <CMPS> CMPSB CMPSW CMPSD  String compare, Byte, Word, Doubleword

  <CMPXCHG> Compare & exchange        <CPUID>   Identify CPU (486+)

  <CR0> CR1 CR2 CR3 CR4 Map of control registers

  <EFLAGS>  Map of EFLAGS register

  <HLT>     Halt the CPU              <IBTS>    Insert bit string
  <IMUL>    Integer multiply

  <INS> INSB INSW INSD Input of string from I/O port, Byte, Word, Doubleword

  <INVD>    Invalidate cache          <JMP>     Unconditional jump
  <LAR>     Load access rights        <LOADALL> Load all registers.
  <LSL>     Load segment limit        <MOV>     Move data to/from registers
  <MOVS>    Move string               <MUL>     Multiply unsigned
  <POP>     Pop data from stack       <POPA>    Pop all registers
  <PUSH>    Push value onto stack     <RDTSC>   Read time stamp counter

  <RDMSR>   Read Model Specific Register (Pentium+)

  <Rotate and Shift>   Concerns all Rotation and Shift instructions

  <SETALC>  Carry bit to all of al    <UNKNOWN> An unknown opcode
  <VERR>    Verify segment for Read   <VERW>    Verify segment for Write

  <WBINVD>  Write Back and Invalidate Cache (486+)
  <WRMSR>   Write Model Specific Register (Pentium+)


All mentioned FPU instructions by Mnemonic
──────────────────────────────────────────────────────────────────────────────

Alphabetic listing of FPU mnemonics for instructions behaving differently
than expected. Instructions marked with * are considered undocumented.

* <FCOS>              FPU Cosine in radians on IIT math coprocessor

  <FDISI / FNDISI>    Disable Floating point interrupts
  <FDIV  /  FDIVP>    Divide
  <FDIVR / FDIVRP>    Divide reversed
  <FENI  /  FNENI>    Enable Floating point interrupts

  <FLDENV>            Load Floating point Environment
  <FMUL4X4>           Matrix multiply on IIT math coprocessor
  <FPREM>             Modulus of ST by ST(1) into ST
  <FPTAN>             Tangent ratio of ST into ST & ST(1)
  <FRSTPM>            Tells the FPU to use Real (or V86) Mode formats
  <FRSTOR>            Loads the FPU state from memory see FSAVE
  <FSAVE>             Saves the FPU state to memory see FRSTOR
* <FSBP0,1,2,3>       Bankswitching on IIT math coprocessor
  <FSCALE>            Adds the value in ST to the exponent in ST(1)
  <FSETPM>            Tells the FPU to use Protected Mode formats
* <FSIN>              FPU Sine in radians on IIT math coprocessor
  <FSINCOS>           calculates FPU sine and cosine in radians
  <FSTENV>            Store Floating point Environment


General Intel FPU bugs, unrelated to opcodes
──────────────────────────────────────────────────────────────────────────────

Mnemonic: N/A
Opcode  : N/A
Bug in  : some 486 / 487

Function:
While using a maths coprocessor (also referred to as floating point
unit FPU), errors may occur and invalid numbers may be generated.
While most FPUs don't have any problem handling these situations, some
steps may lock up or misbehave otherwise. The list below shows known
malfunctions which may arise during FPU operations on some systems.

    True bugs:
    <FERR# not handled correctly by FPU>
    <FPU performance degradation because IGNNE# active>

    Incompatibilities between different types of FPU:
    <Four indications for 'empty' in Condition Code Bits after FXAM>

    '87 to 287 specific differences:
    <Error signal does not go through PIC on 287+>
    <Exceptions are different>
    <Exception pointers saved by 287+ save prefixes>

    <287+ need no synchronization>
    <287 & 387 use reserved I/O ports>


FERR# not handled correctly by FPU
──────────────────────────────────────────────────────────────────────────────
 <Back> (General Intel FPU bugs, unrelated to opcodes)

* FERR# not handled correctly by FPU:

    In some cases an FPU operation may generate a floating point error,
    which will not be recognized by the CPU.
    The workaround for this is to replace all FWAIT with FNOP or follow
    all FWAIT with a NOP, while masking all floating point errors.


FPU performance degradation because IGNNE# active
──────────────────────────────────────────────────────────────────────────────
 <Back> (General Intel FPU bugs, unrelated to opcodes)

* FPU performance degradation because IGNNE# active:

    If an unmasked exception occurs with bit NE (Numeric Error or Numeric
    Exception) in CR0 cleared (recognize exceptions), while IGNNE# is
    active, all following FPU instructions will require an additional 17 to
    22 clocks. This is because the exception remains pending due to the
    logic conflict caused by the contradicting signals. The 486/487
    executes microcode in order to classify and analyze the exception, but
    is not allowed to handle it, prior to executing the next FPU opcode.
    A workaround is to clear all unmasked exceptions with FCLEX or FINIT
    within an exception handler before it finishes or to make sure IGNNE#
    is not made active so exceptions are recognized and handled immediately
    as they occur (when NE is cleared).


Four indications for 'empty' in Condition Code Bits after FXAM
──────────────────────────────────────────────────────────────────────────────
 <Back> (General Intel FPU bugs, unrelated to opcodes)

* Four different indications for 'empty' in Condition Code Bits after FXAM:

    The various FPUs use different bit patterns to indicate an empty FPU
    register after the FXAM instruction. You should rely only on bits C0
    and C3 to be 1 in case an FPU register is to be considered empty.
    (See <FPU Condition Code Bits>)


Error signal does not go through PIC on 287+
──────────────────────────────────────────────────────────────────────────────
 <Back> (General Intel FPU bugs, unrelated to opcodes)

* Error signal does not go through PIC on 287+

    On the 86, an FPU error is signalled through the PIC (Programmable
    Interrupt Controller). Starting with the 287, FPU errors are
    signalled over a dedicated pin on the CPU / FPU combination,
    namely ERROR#. There may be code which depends on the PIC handling
    the error. These error handlers will need to be rewritten.


Exceptions are different
──────────────────────────────────────────────────────────────────────────────
 <Back> (General Intel FPU bugs, unrelated to opcodes)

* Exceptions are different

    The coprocessor segment overrun exception (09) is issued when the
    FPU attempts to read the second or subsequent words of a data
    operand beyond a segment limit on a 286. On a 386 it is not normally
    used. The 486 signals exception 0dh instead.

    The segment wraparound exception (General Protection exception 0dh)
    will be issued if the FPU attempts to execute an instruction that
    spans into or lies beyond a segment limit.

    All other errors are signalled through interrupt 10h in 286 systems.


Exception pointers saved by 287+ save prefixes
──────────────────────────────────────────────────────────────────────────────
 <Back> (General Intel FPU bugs, unrelated to opcodes)

* Exception pointers saved by 287+ save prefixes

    The exception pointers on the 87 would point to the ESC instruction
    itself, regardless of any segment overrides (or other prefixes for
    that matter). The 287+ pointers point to the first prefix before
    the ESC instruction, if any.


287+ need no synchronization
──────────────────────────────────────────────────────────────────────────────
 <Back> (General Intel FPU bugs, unrelated to opcodes)

* 287+ need no synchronization

    On the 87, the FPU and CPU worked independently of each other. Any
    communication between the FPU and CPU had to be coordinated with
    WAITs. On the 287+, no WAITs are required except for control
    instructions. The CPU examines the BUSY# signal before communicating
    with the FPU to assure the FPU can accept commands.

    The 387 also examines BUSY# before sending commands to the FPU.
    Data transfers are regulated by monitoring the PEREQ# pin.


287 & 387 use reserved I/O ports
──────────────────────────────────────────────────────────────────────────────
 <Back> (General Intel FPU bugs, unrelated to opcodes)

* 287 & 387 use reserved I/O ports

    On the 287, FPU instructions and data are sent to and received from
    the FPU via I/O ports. These ports are f0-ff on the 286 / 287.
    This property is important to consider when the number of I/O
    waitstates on the mainboard can be changed. To safely increase the
    FPU performance some experimentation may be necessary, but a 25%
    speed increase has been accomplished on a 12 MHz 286 with 20 MHz
    IIT 2c87 by decreasing the number of I/O waitstates from 6 to 4.

    On the 387, FPU instructions and data are sent to and received from
    the FPU via I/O ports too. These ports are 800000f0 - 800000ff.
    Note that the I/O waitstate trick may very well work on 386 / 387
    systems as well.


FPU Condition Code Bits after a test, compare or reduction
──────────────────────────────────────────────────────────────────────────────

Various FPU test instructions set the Condition Code bits C0 to C3 based
on the values tested. Below is a list of possible bit combinations.

These C-bits map to the flags register as follows after FSTSW AX and SAHF:

Eflags map: ZF  PF  -   CF  (C1 has no flag assigned to it)
            C3  C2  C1  C0

Examine     0   0   0   0   +Unnormal (positive, valid, unnormalized)
            0   0   0   1   +NaN      (positive, invalid, exponent all 1s)
            0   0   1   0   -Unnormal (negative, valid, unnormalized)
            0   0   1   1   -NaN      (negative, invalid, exponent all 1s)
            0   1   0   0   +Normal   (positive, valid, normalized)
            0   1   0   1   +Infinity (positive, infinity)
            0   1   1   0   -Normal   (negative, valid, normalized)
            0   1   1   1   -Infinity (negative, infinity)
            1   0   0   0   +Zero     (positive, zero)
            1   0   0   1   Empty     (empty register)
            1   0   1   0   -Zero     (negative, zero)
            1   0   1   1   Empty     (empty register)
            1   1   0   0   +Denormal (positive, invalid, exponent is 0)
            1   1   0   1   Empty     (empty register)
            1   1   1   0   -Denormal (negative, invalid, exponent is 0)
            1   1   1   1   Empty     (empty register)

FCOM or
FTST        0   0   ?   0   ST > Source with FCOM or ST > 0 with FTST
            0   0   ?   1   ST < Source with FCOM or ST < 0 with FTST
            1   0   ?   0   ST = Source with FCOM or ST = 0 with FTST
            1   1   ?   1   ST cannot be compared or tested

Reduction   b1  0   b0  b2  If reduction was complete, bits 0,1 and 2
                            equal the three lowest bits of the quotient
            ?   1   ?   ?   Reduction was incomplete
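
The compare rows of the table can be decoded from a stored status word
with a C sketch like the one below. The names are hypothetical. The bit
positions assumed are C0 = bit 8, C2 = bit 10 and C3 = bit 14 of the
status word; C1 is ignored, as in the table.

```c
#include <stdint.h>

enum fpu_compare {
    FPU_GREATER,     /* ST > source (or ST > 0) */
    FPU_LESS,        /* ST < source (or ST < 0) */
    FPU_EQUAL,       /* ST = source (or ST = 0) */
    FPU_UNORDERED,   /* ST cannot be compared or tested */
    FPU_UNLISTED     /* combination not listed in the table */
};

/* Classify the result of a compare or test from the FPU status word. */
enum fpu_compare fpu_compare_result(uint16_t status_word)
{
    unsigned c0 = (status_word >> 8) & 1u;
    unsigned c2 = (status_word >> 10) & 1u;
    unsigned c3 = (status_word >> 14) & 1u;

    if (!c3 && !c2 && !c0) return FPU_GREATER;
    if (!c3 && !c2 &&  c0) return FPU_LESS;
    if ( c3 && !c2 && !c0) return FPU_EQUAL;
    if ( c3 &&  c2 &&  c0) return FPU_UNORDERED;
    return FPU_UNLISTED;
}
```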


FPU Status Word, Control Word and Tag Word layout
──────────────────────────────────────────────────────────────────────────────

The layout of the Status-, Control- and Tag Word of the FPU.

      FPU Status Word

      Bit 15                8                        0
      ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
      │ B│c3│  ST n  │c2│c1│c0│ES│sf│Pe│Ue│Oe│Ze│De│Ie│
      └─┬┴─┬┴─┬┴──┴─┬┴─┬┴──┴─┬┴─┬┴─┬┴─┬┴─┬┴─┬┴─┬┴─┬┴─┬┘
        │  │  └──┬──┘  └──┬──┘  │  │  │  │  │  │  │  │
      Busy └──────────────┤     │  │  │  │  │  │  │  │
      Stack Top──┘        │     │  │  │  │  │  │  │  │
      Condition Code Bits─┘     │  │  │  │  │  │  │  │
      Exception Summary * ──────┘  │  │  │  │  │  │  │
      Stack fault──────────────────┘  │  │  │  │  │  │
      Precision exception (1=occurred)┘  │  │  │  │  │
      Underflow exception (1=occurred)───┘  │  │  │  │
      Overflow exception (1=occurred)───────┘  │  │  │
      Zero division exception (1=occurred)─────┘  │  │
      Denormalized operand exception (1=occurred)─┘  │
      Invalid operation exception (1=occurred)───────┘

      * The Exception summary is called Interrupt request on 8087.
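
The fields of the Status Word can be extracted with a C sketch such as
the one below. The function names are hypothetical; the bit positions
follow the diagram above.

```c
#include <stdint.h>

/* Busy flag, bit 15. */
unsigned fpu_sw_busy(uint16_t sw)
{
    return (sw >> 15) & 1u;
}

/* Stack Top, bits 13-11. */
unsigned fpu_sw_top(uint16_t sw)
{
    return (sw >> 11) & 7u;
}

/* Condition Code Bits packed as C3 C2 C1 C0 (C3 in bit 3 of the result).
 * C3 is bit 14 of the status word, C2-C0 are bits 10-8. */
unsigned fpu_sw_condition_codes(uint16_t sw)
{
    return (((sw >> 14) & 1u) << 3) | ((sw >> 8) & 7u);
}

/* The six exception flags Pe..Ie, bits 5-0. */
unsigned fpu_sw_exceptions(uint16_t sw)
{
    return sw & 0x3Fu;
}
```

For a status word of 7901h this gives Stack Top 7, condition codes 1001
(C3 and C0 set) and the Invalid operation flag set.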

      FPU Control Word

      Bit 15                8                        0
      ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
      │ r│ r│ r│ic│round│prec.│ie│ r│Pm│Um│Om│Zm│Dm│Im│
      └──┴──┴──┴─┬┴──┴─┬┴─┬┴──┴─┬┴──┴─┬┴─┬┴─┬┴─┬┴─┬┴─┬┘
      Infinity   │     │  │     │     │  │  │  │  │  │
      control────┘     │  │     │     │  │  │  │  │  │
      Rounding control─┘  │     │     │  │  │  │  │  │
      Precision control───┘     │     │  │  │  │  │  │
      Interrupt enable mask─────┘     │  │  │  │  │  │
                                      └┐ │  │  │  │  │
      Precision exception Mask 1=masked┘ │  │  │  │  │
      Underflow exception Mask 1=masked──┘  │  │  │  │
      Overflow exception Mask 1=masked──────┘  │  │  │
      Zero division exception Mask 1=masked────┘  │  │
      Denormalized operand exception Mask 1=masked┘  │
      Invalid operation exception Mask 1=masked──────┘

    Infinity control is supported on the 8087 and 287 only.
    The 87 and 287 (not the 287xl) have ic cleared by default and then
    support projective closure. The 287xl+ only support affine closure.
    To make sure an 87 or 287 will handle the numbers in the same way
    as the 287xl+, set bit ic to make 87 & 287 support affine closure
    as well. Note that a FINIT will clear ic again.
    The ic setting is ignored on 287xl+.

    Rounding control is set to 00 by default.
    00 = Round to nearest or even
    01 = Round down (towards negative infinity)
    10 = Round up (towards positive infinity)
    11 = Chop towards zero

    Precision control is set to 11 by default.
    00 = 24 bit precision (mantissa)
    01 = reserved
    10 = 53 bit precision (mantissa)
    11 = 64 bit precision (mantissa)

    Note: lesser precision does not significantly reduce execution time.
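
    A Control Word can be assembled from these fields with a C sketch
    like the one below. The function names are hypothetical. The
    interrupt enable mask (bit 7) and the reserved bits are left at 0
    here, which may differ from what an FPU stores with FSTCW.

```c
#include <stdint.h>

/* Assemble a Control Word from its fields:
 * exception_masks: bits 5-0 (Pm Um Om Zm Dm Im, 1 = masked)
 * precision      : bits 9-8
 * rounding       : bits 11-10
 * infinity       : bit 12 (ic) */
uint16_t fpu_control_word(unsigned exception_masks, unsigned precision,
                          unsigned rounding, unsigned infinity)
{
    return (uint16_t)((exception_masks & 0x3Fu)
                    | ((precision & 3u) << 8)
                    | ((rounding & 3u) << 10)
                    | ((infinity & 1u) << 12));
}

/* Rounding control field of an existing Control Word. */
unsigned fpu_cw_rounding(uint16_t cw)
{
    return (cw >> 10) & 3u;
}

/* Precision control field of an existing Control Word. */
unsigned fpu_cw_precision(uint16_t cw)
{
    return (cw >> 8) & 3u;
}
```

    With all exceptions masked, 64 bit precision (11) and chop towards
    zero (11), the sketch yields 0F3Fh.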


      FPU Tag Word

      Bit 15                8                        0
      ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
      │ x  x│ x  x│ x  x│ x  x│ x  x│ x  x│ x  x│ x  x│
      └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
           7     6     5     4     3     2     1     0 Tag number

      The tag number n corresponds to physical register n. The tag for
      ST(i) is therefore tag number (Stack Top + i) AND 7.
      The bits for each tag have the same meaning:

       0  0  Valid
       0  1  Zero
       1  0  Special (NaN,Infinity,Denormal,Unnormal,Unsupported)
       1  1  Empty
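
      Extracting a single tag from the Tag Word can be sketched in C as
      follows. The names are hypothetical; tag number n is taken to
      occupy bits 2n+1 and 2n, as in the diagram above.

```c
#include <stdint.h>

enum fpu_tag_value {
    FPU_TAG_VALID = 0,
    FPU_TAG_ZERO = 1,
    FPU_TAG_SPECIAL = 2,
    FPU_TAG_EMPTY = 3
};

/* Return the 2-bit tag with the given tag number (0-7). */
unsigned fpu_tag(uint16_t tag_word, unsigned tag_number)
{
    return (tag_word >> (2u * (tag_number & 7u))) & 3u;
}

/* Name of a tag value, for display purposes. */
const char *fpu_tag_name(unsigned tag)
{
    switch (tag & 3u) {
    case FPU_TAG_VALID:   return "Valid";
    case FPU_TAG_ZERO:    return "Zero";
    case FPU_TAG_SPECIAL: return "Special";
    default:              return "Empty";
    }
}
```

      For a Tag Word of FFF4h, tag 0 is Valid, tag 1 is Zero and tags 2
      to 7 are Empty.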


IIT bankswitching  (IIT math coprocessor)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FSBP0, FSBP1, FSBP2, FSBP3
Opcode  : DB E8, DB EB, DB EA, DB E9  (6 clocks)
Bug in  : Are IIT 2c87+ instructions

Function:
FSBP0 Selects the original bank. (default)
FSBP1 Selects bank 1 from <FMUL4X4> instruction diagram
FSBP2 Selects bank 2 from FMUL4X4 instruction diagram
FSBP3 Selects the scratchpad bank3 used by the FMUL4X4 internally.

The FSBP3 instruction is not publicly supported by IIT, it can be used to
select the last bank of registers, which unfortunately cannot be used for
regular operation. However, it is listed for completeness.


FSIN / FCOS   Floating point sine and cosine
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FSIN  / FCOS
Opcode  : D9 FE / D9 FF
Bug in  : Undocumented instructions on IIT 2c87 math chips

Function:
FSIN calculates the sine of the value in ST(0), taken in radians, leaving
the result in ST(0). Apparently the IIT FSIN functions according to Intel's
287xl and 387+ specifications.

FCOS calculates the cosine of the value in ST(0), taken in radians, leaving
the result in ST(0). Apparently the IIT FCOS functions according to Intel's
287xl and 387+ specifications.

Neither instruction is officially supported by IIT for the 2c87.
Both instructions are available on Intel 287xl and 387+ processors using the
listed opcodes.


FDIV / FDIVP  Floating point division / divide & POP
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FDIV / FDIVP
Opcode  : various
Bug in  : some 486

Function:
FDIV divides destination by source and returns the result in destination.
FDIVP does the same but pops the FPU stack afterwards.

The bug occurs when the instruction operates on an FPU register which is
tagged as empty, but holds a nonzero value and the next FPU instruction
occurs within 35 FPU clock counts. In that case, the current instruction
will use the invalid number in the empty location, producing an invalid
result and causing the following instruction to generate an invalid
result as well. There is no workaround.


FDIVR / FDIVRP  Floating point division reversed / divide & POP
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FDIVR / FDIVRP
Opcode  : various
Bug in  : some 486

Function:
FDIVR divides source by destination and returns the result in destination.
FDIVRP does the same but pops the FPU stack afterwards.

The bug occurs when the instruction operates on an FPU register which is
tagged as empty, but holds a nonzero value and the next FPU instruction
occurs within 35 FPU clock counts. In that case, the current instruction
will use the invalid number in the empty location, producing an invalid
result and causing the following instruction to generate an invalid
result as well. There is no workaround.


FLDENV  Load Floating point Environment
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FLDENV
Opcode  : D9 [mod:100:r/m] disp
Bug in  : some 387

Function:
FLDENV loads the entire FPU environment from the address given by the
memory operand. See <FPU environment layout>.

If either of the last two bytes of the environment cannot be read for
whatever reason, the instruction cannot be restarted on some 387s.

A workaround is to attempt to read those bytes before the FLDENV is
executed, or to align the environment on a 128 byte boundary so it is
unlikely to cross a segment or page boundary.
If the bytes are inaccessible, the read by the integer unit raises the
exception, or brings a swapped-out page into memory, before FLDENV starts.
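
The alignment half of this workaround comes down to simple arithmetic. The
Python sketch below is illustrative only. The function names are
hypothetical, and the 4096 byte page size and the 28 and 108 byte image
sizes are assumptions taken from Intel's documentation, not from this list.
It does not model segment limits.

```python
PAGE_SIZE = 4096  # assumed 386 page size in bytes


def align_up(address, boundary=128):
    """Round address up to the next multiple of boundary."""
    return (address + boundary - 1) // boundary * boundary


def crosses_page(address, size, page=PAGE_SIZE):
    """True if the byte range [address, address + size) spans two pages."""
    return address // page != (address + size - 1) // page


# An unaligned 28 byte environment can straddle a page boundary...
print(crosses_page(4090, 28))            # True
# ...but it cannot once it is aligned on a 128 byte boundary.
print(crosses_page(align_up(4090), 28))  # False
```

Because 128 divides the page size evenly, any image of 128 bytes or less
that starts on such a boundary stays within one page.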


FMUL4X4 Matrix Multiply (IIT math coprocessor)
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FMUL4X4 or F4X4
Opcode  : DB F1  (2c87=242, 3c87sx=242, 3c87=242 clocks)
Bug in  : Is an IIT special instruction

Function:
This instruction is available only on the IIT (Integrated Information
Technology Inc.) math processors. The instruction performs a 4x4 matrix
multiply in one instruction using three banks of 8 floating point registers.
The operands must be loaded to a specific bank in a specific order, as
shown in the table further down. The instruction calculates:

Xn = (A00 * Xo) + (A01 * Yo) + (A02 * Zo) + (A03 * Vo)
Yn = (A10 * Xo) + (A11 * Yo) + (A12 * Zo) + (A13 * Vo)
Zn = (A20 * Xo) + (A21 * Yo) + (A22 * Zo) + (A23 * Vo)
Vn = (A30 * Xo) + (A31 * Yo) + (A32 * Zo) + (A33 * Vo)

Where Xo stands for the original X value and Xn for the result. Operands
must be loaded to the following registers in the specified banks in the
specified order.

          Before FMUL4X4             After FMUL4X4

                   bank              bank
          Register: 0    1    2      0

          ST(0)     Xo   A33  A31    Xn
          ST(1)     Yo   A23  A21    Yn
          ST(2)     Zo   A13  A11    Zn
          ST(3)     Vo   A03  A01    Vn
          ST(4)          A32  A30     ?
          ST(5)          A22  A20     ?
          ST(6)          A12  A10     ?
          ST(7)          A02  A00     ?
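
The register layout in the table can be modelled in a few lines of Python.
This is only a sketch: it assumes the usual row-by-column product of a 4x4
matrix with the vector (Xo, Yo, Zo, Vo), and the function names load_banks
and fmul4x4 are hypothetical. It says nothing about the precision or
rounding of the real chip.

```python
def load_banks(a, vec):
    """Arrange matrix a[row][col] and vec = (Xo, Yo, Zo, Vo) as in the table."""
    bank0 = list(vec) + [None] * 4                 # ST(0)..ST(3) = Xo..Vo
    bank1 = [a[3][3], a[2][3], a[1][3], a[0][3],   # ST(0)..ST(3) = A33..A03
             a[3][2], a[2][2], a[1][2], a[0][2]]   # ST(4)..ST(7) = A32..A02
    bank2 = [a[3][1], a[2][1], a[1][1], a[0][1],   # ST(0)..ST(3) = A31..A01
             a[3][0], a[2][0], a[1][0], a[0][0]]   # ST(4)..ST(7) = A30..A00
    return bank0, bank1, bank2


def fmul4x4(bank0, bank1, bank2):
    """Return [Xn, Yn, Zn, Vn] as they would appear in ST(0)..ST(3) of bank 0."""
    vec = bank0[:4]
    # Each half bank holds one matrix column, stored from row 3 down to row 0.
    column = {3: bank1[:4], 2: bank1[4:], 1: bank2[:4], 0: bank2[4:]}
    return [sum(column[c][3 - row] * vec[c] for c in range(4))
            for row in range(4)]


matrix = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(fmul4x4(*load_banks(matrix, (1, 0, 0, 1))))  # [5, 13, 21, 29]
```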

All four banks can be selected by using the bankswitching instructions,
but only bank 0, 1 and 2 make sense since bank 3 is an internal scratchpad.
The separate banks can contain 8 floating point numbers and may be used
with normal instructions. Each bank acts like an independent 287.
Provided the status word is saved in between and restored properly after a
bankswitch, each bank can be used simultaneously.

Alternatively you could keep an eye on the TOP and STACKPOINTER indicators,
making sure they are the same as before when initiating a bankswitch.
By using FFREE, FFREEP and FINCSTP or FDECSTP instructions you may manually
manipulate the stack.

This feature of the IIT chips can be used to perform complex operations
in registers when many components remain the same for a large data set.
Only intermediary results are saved, to one memory location. The program
then bankswitches to the next series of operands, loads that one operand
and continues the calculation with the next set of operands already in that
bank. This does require another read into the new bank, but may save time
and memory space compared to memory based operands or multiple pass
algorithms with multiple arrays of intermediary results.


FENI / FDISI  Enable / Disable Floating point interrupts
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FENI     / FNENI / FDISI    / FNDISI
Opcode  : 9B DB E0 / DB E0 / 9B DB E1 / DB E1
Bug in  : Opcodes have no meaning on 287+ (are ignored there)

Function:
FENI clears the interrupt enable mask in the FPU Control Word, effectively
allowing the FPU to generate interrupts. FNENI does not issue a WAIT
before doing this. These instructions only have a meaning on 87s.

FDISI sets the interrupt enable mask in the FPU Control Word, effectively
preventing the FPU from generating interrupts. FNDISI does not issue a WAIT
before doing this. These instructions only have a meaning on 87s.

All these instructions are effectively ignored on the 287+.
They do not cause an invalid opcode exception.


FPREM  Calculate modulus of ST by ST(1), store in ST
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FPREM
Opcode  : D9 F8
Bug in  : all 87 and 287

Function:
FPREM calculates the modulus remainder of ST divided by ST(1) and stores
the result into ST. The procedure can also be seen as a repeated
subtraction of ST by ST(1).

There are several interesting things about this instruction:

The exponent magnitude difference should be no more than 63 or else the
instruction cannot reduce the ST properly in one execution. This means
you would have to execute the instruction several times to get a correct
result for large magnitude differences.
If this is the case, condition code bit C2 is set until the result in ST
is ok. Storing the Status Word and checking C2 should be done if the
condition could occur in your data set.
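
The repeat-until-C2-clears procedure can be illustrated with a small Python
model. This is a sketch under stated assumptions, not a description of the
silicon: fprem_step and fprem are hypothetical names, doubles stand in for
temporary reals, and each partial step is assumed to reduce the exponent
difference to at most 63.

```python
import math


def fprem_step(st0, st1):
    """Model one FPREM execution; returns (new ST, C2)."""
    diff = math.frexp(st0)[1] - math.frexp(st1)[1]
    if diff < 64:
        # Small exponent difference: the remainder is final, C2 is clear.
        return math.fmod(st0, st1), False
    # Large difference: reduce against a scaled-up divisor and leave C2 set.
    partial_divisor = math.ldexp(st1, diff - 63)
    return math.fmod(st0, partial_divisor), True


def fprem(st0, st1):
    """Repeat the instruction until C2 is clear; returns (remainder, steps)."""
    steps = 0
    c2 = True
    while c2:
        st0, c2 = fprem_step(st0, st1)
        steps += 1
    return st0, steps


print(fprem(7.0, 2.0))                                   # (1.0, 1)
remainder, steps = fprem(1e300, 3.0)
print(steps > 1, remainder == math.fmod(1e300, 3.0))     # True True
```

In real code the loop condition would come from storing the Status Word
after each FPREM and testing C2, as described above.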

In addition to that, if the instruction is done, the least-significant
three bits of the quotient are stored in C3,C1 and C0.
If arguments to the tangent function are reduced by PI/4 the codes
represent one of the eight octants of a radius for which the tangent is
to be calculated.
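
Extracting the three quotient bits from a stored Status Word might look
like the Python sketch below. The bit positions (C0 = bit 8, C1 = bit 9,
C2 = bit 10, C3 = bit 14) and the assignment of quotient bits (C0 = bit 2,
C3 = bit 1, C1 = bit 0) are assumptions taken from Intel's documentation.
This list only names the three flags. The function name is hypothetical.

```python
C0, C1, C2, C3 = 1 << 8, 1 << 9, 1 << 10, 1 << 14


def fprem_quotient_bits(status_word):
    """Return the low three quotient bits (0..7), or None while C2 is set."""
    if status_word & C2:
        return None  # reduction incomplete, FPREM must be executed again
    q2 = 1 if status_word & C0 else 0
    q1 = 1 if status_word & C3 else 0
    q0 = 1 if status_word & C1 else 0
    return (q2 << 2) | (q1 << 1) | q0


print(fprem_quotient_bits(0x4100))  # 6
print(fprem_quotient_bits(0x0400))  # None
```

When the divisor is PI/4, the returned value would be the octant number.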

FPREM does not operate according to the IEEE 754 standard. FPREM1, with
opcode D9 F5, does, but is about 15-25 clocks slower than FPREM.

The bug appears on the 87 and 287 when 64^a+b is performed with a>=1
and b==1 or 2. In that case the condition code bits represent an
incorrect value. There is no FP workaround. Test to prevent the situation.
Apparently this bug does not appear in the FPREM1 instruction.


FPTAN  Calculate tangent of ST
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FPTAN
Opcode  : D9 F2
Bug in  : some 486 / 487, difference between pre-287xls and 287xl+

Function:
FPTAN calculates the ratio between y and x in the following formula:

   y
   -  = TAN(original ST)
   x

The y result replaces the original argument in ST and x is then pushed
onto the stack. On pre-287xl FPUs, the values for y and x may be anything;
only their ratio is correct. On 287xl+ FPUs, x is always 1, so ST(1)
holds the tangent itself there.
To generate the same set of results on all FPUs, the FPTAN should be
followed by FDIV and FLD1. Note that this reproduces the original
results on the 287xl+.
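
The effect of the FDIV and FLD1 pair can be shown with a toy stack model in
Python. This is only a sketch: the tangent is taken to be y divided by x,
FDIV is assumed to be the no-operand form that divides ST(1) by ST and
pops, and the arbitrary x produced by an older FPU is simulated with a
made-up scale factor. All names are hypothetical.

```python
import math


def fptan_old(stack, scale=3.0):
    """Model a pre-287xl FPTAN: y/x is the tangent, x is arbitrary."""
    angle = stack.pop()
    stack.append(math.tan(angle) * scale)  # y replaces the argument
    stack.append(scale)                    # x is pushed and becomes ST


def fdiv_then_fld1(stack):
    """Model FDIV (ST(1) / ST, then pop) followed by FLD1."""
    x = stack.pop()        # ST
    y = stack.pop()        # ST(1)
    stack.append(y / x)    # the quotient is the tangent itself
    stack.append(1.0)      # FLD1 pushes the constant 1


stack = [0.5]              # last element is ST
fptan_old(stack)
fdiv_then_fld1(stack)
print(stack[-1], math.isclose(stack[-2], math.tan(0.5)))  # 1.0 True
```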

Note that ST(7) must be free or an invalid operation exception may occur
because x is pushed onto the stack.

The 486 bug occurs when a specific set of code is executed with a specific
set of data. There is no way you can anticipate this and the workaround
should always be implemented if code will run on a 486/487.
The bug corrupts the FPU stack without signalling it to either FPU or CPU.
Data corruption is usually the result.
Workaround: FPTAN should always be followed by: FCLEX, FINIT, FLDCW, FSTSW,
FSTSWAX, <FSAVE> or <FSTENV> or by a WAIT and a non-FPU instruction.
Do note that some of these FPU instructions contain bugs themselves.


FRSTOR  Restore FPU state saved to memory by FSAVE
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FRSTOR
Opcode  : DD [mod:100:r/m] disp
Bug in  : some 387

Function:
FRSTOR loads the FPU internal registers (including ST-registers) and the
environment from the memory operand. See <FPU State image layout>.

If either of the last two bytes of the image being read by FRSTOR cannot
be read for whatever reason, the instruction cannot be restarted on
some 387s.

A workaround is to attempt to read those bytes before the FRSTOR is
executed, or to align the image on a 128 byte boundary so it is
unlikely to cross a segment or page boundary.
If the bytes are inaccessible, the read by the integer unit raises the
exception, or brings a swapped-out page into memory, before FRSTOR starts.


FSAVE  Save FPU state to memory
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FSAVE / FNSAVE
Opcode  : (9B) DD [mod:110:r/m] disp
Bug in  : some 387, some 386

Function:
FSAVE saves the FPU internal registers (including ST-registers) and the
environment to the memory operand. See <FPU State image layout>.

The FPU does not execute this instruction until all pending FPU
operations have completed (decoded instructions have been processed).
After completion, FSAVE initializes the FPU as if it had executed FINIT.

Apparently on all FPUs, the contents of the data pointer field is
undefined if the last FPU arithmetic instruction did not use a memory
operand.

On some 386s operating in Real or V86 mode, the opcode saved is incorrect.
The linear address saved for the opcode's address however is correct and
can be used to retrieve the opcode. No opcode is saved in Protected mode.

If either of the last two bytes of the image being saved by FSAVE cannot
be accessed for whatever reason, the instruction cannot be restarted on
some 387s.

A workaround is to attempt to write to those bytes before the FSAVE is
executed, or to align the image on a 128 byte boundary so it is
unlikely to cross a segment or page boundary.
If the bytes are inaccessible, the write by the integer unit raises the
exception, or brings a swapped-out page into memory, before FSAVE starts.


FSETPM  Make FPU use Protected Mode format in FSAVE and FSTENV
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FSETPM
Opcode  : DB E4
Bug in  : no bug, it only works on 287 and 287xl. ignored on 386+

Function:
FSETPM tells the FPU to use the data format specified in the Protected
Mode format of the <FSTENV> and <FSAVE> instructions.
These instructions save different types of data depending on the current
operating mode of the FPU.

The instruction only has a meaning on the 287 and 287xl.


FRSTPM  Make FPU use Real-Mode format in FSAVE and FSTENV
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FRSTPM
Opcode  : DB F4
Bug in  : no bug, it only works on 287 and 287xl. ignored on 386+

Function:
FRSTPM tells the FPU to use the data format specified in the Real-Mode
format of the <FSTENV> and <FSAVE> instructions.
These instructions save different types of data depending on the current
operating mode of the FPU.

The instruction only has a meaning on the 287 and 287xl.


FSCALE  Adds the integer number in ST(1) to the exponent of ST
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FSCALE
Opcode  : D9 FD
Bug in  : some 486

Function:
FSCALE multiplies the value in ST by two raised to the power given in ST(1).
Pre-387s assume the value in ST(1) to be an integer in the range
-2^15 <= ST(1) < +2^15. 387+ do not assume anything about the value.
The value in ST(1) is always chopped to an integer, rounding toward zero.
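
As a rough illustration of the chopping behaviour, the following Python
sketch models the 387+ case with doubles. The function name fscale is
hypothetical, and the model ignores range limits, denormals and exceptions.

```python
import math


def fscale(st0, st1):
    """Model FSCALE on a 387+: ST * 2**(ST(1) chopped toward zero)."""
    return math.ldexp(st0, math.trunc(st1))


print(fscale(3.0, 2.9))    # 12.0, because 2.9 is chopped to 2
print(fscale(3.0, -1.5))   # 1.5, because -1.5 is chopped to -1
```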

There is a bug in some 486s which allows denormal or pseudo-denormals to
be returned as a result, apparently without issuing an Invalid Operation
exception. For this to happen, ST(1) must be within the range
-1 < ST(1) < 1 and ST must be a pseudo-denormal or denormal while
underflow exceptions must not be masked. When it occurs, the value from
ST is returned as the result.

There is no workaround other than to avoid the situation. Leaving
underflow exceptions masked may prevent this bug from showing up.


FSINCOS  Calculate both Sine and Cosine of ST
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FSINCOS
Opcode  : D9 FB
Bug in  : some 486, invalid on pre-287xl and IIT

Function:
FSINCOS calculates both Sine and Cosine of an argument in ST.
The first result, sine, is stored into the original ST, destroying the
source value. The second result, cosine, is then pushed onto the stack.

Note that ST(7) must be free or an invalid operation exception may occur
because the cosine is pushed onto the stack.

The 486 bug occurs when a specific set of code is executed with a specific
set of data. There is no way you can anticipate this and the workaround
should always be implemented if code will run on a 486/487.
The bug corrupts the FPU stack without signalling it to either FPU or CPU.
Data corruption is usually the result.
Workaround: FSINCOS should always be followed by: FCLEX, FINIT, FLDCW,
FSTSW, FSTSWAX, <FSAVE> or <FSTENV> or by a WAIT
and a non-FPU instruction. Do note that some of these FPU instructions
contain bugs themselves.


FSTENV  Store Floating point Environment
──────────────────────────────────────────────────────────────────────────────

Mnemonic: FSTENV
Opcode  : (9B) D9 [mod:110:r/m] disp
Bug in  : some 386

Function:
FSTENV saves the FPU environment to the memory operand.
See <FPU environment image layout>.
This environment does not include the FPU stack, but does include
Control Word, Status Word, Tag Word and exception pointers.

The FPU does not execute this instruction until all pending FPU
operations have completed (decoded instructions have been processed).
After completion, FSTENV masks all FPU exceptions. Unlike <FSAVE>, it does
not initialize the FPU.

Apparently on all FPUs, the contents of the data pointer field is
undefined if the last FPU arithmetic instruction did not use a memory
operand.

On some 386s operating in Real or V86 mode, the opcode saved is incorrect.
The linear address saved for the opcode's address however is correct and
can be used to retrieve the opcode. No opcode is saved in Protected mode.

If either of the last two bytes of the image being saved by FSTENV cannot
be accessed for whatever reason, the instruction cannot be restarted on
some 387s.

A workaround is to attempt to write to those bytes before the FSTENV is
executed, or to align the image on a 128 byte boundary so it is
unlikely to cross a segment or page boundary.
If the bytes are inaccessible, the write by the integer unit raises the
exception, or brings a swapped-out page into memory, before FSTENV starts.


Layout of environment & state stored by FSTENV and FSAVE
──────────────────────────────────────────────────────────────────────────────

The environment area saved by <FSTENV> and loaded by <FLDENV> depends on the
current operating mode of the FPU. Apart from the mode, the current
default addressing mode within the operating mode is also important.

The state information saved by <FSAVE> and loaded by <FRSTOR>
consists of the environment mentioned above but also has the eight FPU
stack registers appended to it in temporary real format starting with the
current ST register. Note that which physical register represents ST
depends on the TOP field (bits 11-13) of the Status Word.
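
Reading the appended stack registers back out of a saved image could be
sketched in Python as follows. The temporary real encoding used here (a
64-bit mantissa with explicit integer bit, a 15-bit exponent biased by
16383 and a sign bit) and the 14 or 28 byte environment size are
assumptions taken from Intel's documentation. The function names are
hypothetical, and only finite values are handled.

```python
import struct
from fractions import Fraction


def decode_temp_real(raw):
    """Decode a 10 byte little-endian temporary real into an exact Fraction."""
    mantissa, sign_exp = struct.unpack("<QH", raw)
    sign = -1 if sign_exp & 0x8000 else 1
    exponent = sign_exp & 0x7FFF
    if exponent == 0x7FFF:
        raise ValueError("infinity or NaN is not handled in this sketch")
    if exponent == 0:
        exponent = 1  # zero and denormals use the minimum exponent
    return sign * Fraction(mantissa) * Fraction(2) ** (exponent - 16383 - 63)


def stack_registers(image, env_size=14):
    """Return ST(0)..ST(7) from an FSAVE image with an env_size byte header."""
    regs = image[env_size:env_size + 80]
    return [decode_temp_real(regs[i:i + 10]) for i in range(0, 80, 10)]


one = struct.pack("<QH", 0x8000000000000000, 0x3FFF)
image = bytes(14) + one + bytes(70)
print(stack_registers(image)[0])  # 1
```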

There are four states in which the 387+ FPU can operate:

  16-bit real or V86 mode (like in DOS)
  16-bit Protected Mode (16-bit code segment)
  32-bit real or V86 mode (using 66h and 67h prefixes)
  32-bit Protected Mode (32-bit code segment)

        16-bit real or V86 mode:

    15     12      8       4       0
    ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
    │d│d│d│d│0│0│0│0│0│0│0│0│0│0│0│0│ d = Data pointer bits 16 - 19
    ├─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┤
    │ Data pointer bits 0-15        │
    ├─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┤    bit 11 is zero, not a typo.
    │i│i│i│i│0│o│o│o│o│o│o│o│o│o│o│o│ i = Instruction pointer bits 16 - 19
    ├─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┤ o = Opcode highest 11 bits
    │ Instruction pointer bits 0-15 │
    ├───────────────────────────────┤
    │ Tag Word (16 bit)             │
    ├───────────────────────────────┤
    │ Status Word (16 bit)          │
    ├───────────────────────────────┤
    │ Control Word (16 bit)         │ Low memory
    └───────────────────────────────┘
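
A decoder for this 14 byte layout might look like the Python sketch below,
reading the words from low memory upward. The function name and dictionary
keys are hypothetical. Rebuilding the first opcode byte as D8 plus the top
three stored bits, and taking TOP from bits 11-13 of the Status Word, are
assumptions taken from Intel's documentation.

```python
import struct


def decode_real16_env(raw):
    """Decode a 16-bit real or V86 mode environment (14 bytes)."""
    cw, sw, tw, ip_lo, ip_hi_op, dp_lo, dp_hi = struct.unpack("<7H", raw)
    opcode11 = ip_hi_op & 0x07FF
    return {
        "control": cw,
        "status": sw,
        "tag": tw,
        "top": (sw >> 11) & 7,
        # 20-bit linear addresses: bits 16-19 sit in the top nibble.
        "instruction_pointer": ((ip_hi_op >> 12) << 16) | ip_lo,
        "data_pointer": ((dp_hi >> 12) << 16) | dp_lo,
        # All FPU opcodes start with 11011b, so only 11 bits are stored.
        "opcode": bytes([0xD8 | (opcode11 >> 8), opcode11 & 0xFF]),
    }


raw = struct.pack("<7H", 0x037F, 0x3800, 0xFFFF,
                  0x2345, 0x11FE, 0xABCD, 0xF000)
env = decode_real16_env(raw)
print(hex(env["instruction_pointer"]), env["opcode"].hex())  # 0x12345 d9fe
```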


        16-bit Protected Mode:

    15     12      8       4       0
    ┌───────────────────────────────┐
    │ Data selector                 │
    ├───────────────────────────────┤
    │ Data offset                   │
    ├───────────────────────────────┤
    │ Instruction selector          │
    ├───────────────────────────────┤
    │ Instruction offset            │
    ├───────────────────────────────┤
    │ Tag Word (16 bit)             │
    ├───────────────────────────────┤
    │ Status Word (16 bit)          │
    ├───────────────────────────────┤
    │ Control Word (16 bit)         │ Low memory
    └───────────────────────────────┘


        32-bit real or V86 mode:

    31     28      24      20       15     12       8       4       0
    ┌─┬─┬─┬─┬───────────────────────────────┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
    │0│0│0│0│  Data pointer bits 16-31      │0│0│0│0│0│0│0│0│0│0│0│0│
    ├─┴─┴─┴─┴───────────────────────┬───────┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┤
    │- - - - - - - - - - - - - - - -│ Data pointer bits 0-15        │
    ├─┬─┬─┬─┬───────────────────────┴───────────────────────────────┤
    │0│0│0│0│ Instruction pointer bits 16-31│0│ Opcode top 11 bits  │
    ├─┴─┴─┴─┴───────────────────────┬───────────────────────────────┤
    │- - - - - - - - - - - - - - - -│ Instruction pointer 0-15      │
    ├───────────────────────────────┼───────────────────────────────┤
    │- - - - - - - - - - - - - - - -│ Tag Word (16 bit)             │
    ├───────────────────────────────┼───────────────────────────────┤
    │- - - - - - - - - - - - - - - -│ Status Word (16 bit)          │
    ├───────────────────────────────┼───────────────────────────────┤
    │- - - - - - - - - - - - - - - -│ Control Word (16 bit)         │
    └───────────────────────────────┴───────────────────────────────┘
                                                          Low memory


        32-bit Protected Mode:

    31     28      24      20       15     12       8       4       0
    ┌───────────────────────────────┬───────────────────────────────┐
    │- - - - - - - - - - - - - - - -│ Data selector                 │
    ├───────────────────────────────┴───────────────────────────────┤
    │                      Data offset (32-bit)                     │
    ├───────────────────────────────┬───────────────────────────────┤
    │- - - - - - - - - - - - - - - -│ Instruction selector          │
    ├───────────────────────────────┴───────────────────────────────┤
    │                  Instruction offset (32-bit)                  │
    ├───────────────────────────────┬───────────────────────────────┤
    │- - - - - - - - - - - - - - - -│ Tag Word (16 bit)             │
    ├───────────────────────────────┼───────────────────────────────┤
    │- - - - - - - - - - - - - - - -│ Status Word (16 bit)          │
    ├───────────────────────────────┼───────────────────────────────┤
    │- - - - - - - - - - - - - - - -│ Control Word (16 bit)         │
    └───────────────────────────────┴───────────────────────────────┘
                                                          Low memory

     - = Don't care.
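
For the 32-bit Protected Mode layout, a decoder could be sketched in Python
as below. The function name and dictionary keys are hypothetical. The upper
halves marked as don't care in the diagram are simply masked off.

```python
import struct


def decode_prot32_env(raw):
    """Decode a 32-bit Protected Mode environment (28 bytes)."""
    cw, sw, tw, ip_off, ip_sel, dp_off, dp_sel = struct.unpack("<7I", raw)
    return {
        "control": cw & 0xFFFF,
        "status": sw & 0xFFFF,
        "tag": tw & 0xFFFF,
        "instruction": (ip_sel & 0xFFFF, ip_off),  # (selector, offset)
        "data": (dp_sel & 0xFFFF, dp_off),         # (selector, offset)
    }


raw = struct.pack("<7I", 0xFFFF037F, 0xFFFF0000, 0xFFFFFFFF,
                  0x00401000, 0xABCD001B, 0x00402000, 0x12340023)
env = decode_prot32_env(raw)
print(hex(env["control"]), [hex(v) for v in env["instruction"]])
# 0x37f ['0x1b', '0x401000']
```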