@Recoskie
Last active March 24, 2024 00:42
Building your own programming language, or compilers.

Introduction

This introductory text explains what assembly is and the design process behind modern compilers and programming languages, along with a basic understanding of application container formats on different operating systems, including relocation, loading, and linking.

Central processing unit (CPU)

The most important part of any system is the central processing unit, which actually runs your program's instructions.

A central processing unit is defined by its core type, which determines the kind of instructions it understands.

A processor understands what we call operation code numbers, called opcodes for short.

A core that is x86 means it understands x86 instructions and runs x86 programs without translation.

It does not matter if Intel or AMD designs this core, as it must be able to run x86 operation codes to be able to run x86 software.

The instructions a processor understands are called the processor's instruction set architecture, shortened to ISA. It is also important to know that some compilers will call it the arch type, followed by the core type, such as x86.

Assembly/Core code

A tool called an assembler makes it so you do not have to remember the operation code numbers in order to perform a specific operation on a selected ISA (instruction set architecture) such as x86.

The assembler associates a name with each instruction code number for its operation on an x86 core.

The names given to each operation code are called mnemonic names.

MOV EAX,12345678 is translated as 184, 12345678

The number after the operation code is called the operand.

The operation code 184 sets the value of the register EAX to 12345678.

This is called an immediate input operand.

On the assembler, we shorten this to a meaningful name called MOV, which is short for move.
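As a sketch (in Python rather than a real assembler), this is all the encoder has to do for this instruction: emit opcode 184 (B8 hex), then the immediate as four little-endian bytes.

```python
import struct

def mov_eax_imm32(value):
    """Encode MOV EAX, imm32: opcode 0xB8 (184) followed by the
    32-bit immediate in little-endian byte order."""
    return bytes([0xB8]) + struct.pack("<I", value)

# MOV EAX, 0x12345678 -> B8 78 56 34 12
print(mov_eax_imm32(0x12345678).hex())
```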

ADD EAX,EBX is translated as numbers 3, 3, 0, 3.

3 (opcode), 3 (memory/register), 0 (reg1), 3 (reg2)

Generally, all arithmetic operations use this operand form, allowing us to choose if reg2 is used as an address location.

There are four values for memory/register: 0 = Memory, 1 = Memory+disp8, 2 = Memory+disp32, 3 = register. You would also have to remember the order that the eight registers go in by number: EAX=0, ECX=1, EDX=2, EBX=3, ESP=4, EBP=5, ESI=6, EDI=7.
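The byte holding these three fields is called the ModR/M byte. A minimal Python sketch of packing it, using the standard x86 register numbering (EAX=0, ECX=1, EDX=2, EBX=3, and so on):

```python
# Standard x86 register numbers, in encoding order.
EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI = range(8)

def modrm(mod, reg, rm):
    """Pack the ModR/M byte: mod (2 bits), reg (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

# ADD EAX,EBX: opcode 0x03, then ModR/M with mod=3 (register mode),
# reg1=EAX (destination), reg2=EBX (source).
encoding = bytes([0x03, modrm(3, EAX, EBX)])
print(encoding.hex())  # 03c3
```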

ADD EAX,[EBX] is 3, 0, 0, 3
ADD EAX,[EBX+12345678] is 3, 2, 0, 3, 12345678

When a register is placed in [], it uses the value of the register as an address location to read a value from RAM memory.

Memory plus displacement adds a value to the register in the address. The disp8 allows us to use a shorter number, a signed byte from -128 to 127. For anything outside that range, we use disp32. This is useful if multiple values are in the same general area, reached by adding a displacement.
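The choice between disp8 and disp32 can be sketched as a small helper that picks the shortest encoding that fits:

```python
import struct

def encode_disp(disp):
    """Choose the shortest displacement encoding.
    Returns (mod, bytes): mod=1 with a signed byte (disp8) when the
    value fits in -128..127, otherwise mod=2 with a signed 32-bit
    little-endian value (disp32)."""
    if -128 <= disp <= 127:
        return 1, struct.pack("<b", disp)   # disp8
    return 2, struct.pack("<i", disp)       # disp32

print(encode_disp(16))          # small value fits in one byte
print(encode_disp(0x12345678))  # large value needs four bytes
```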

Each operation that uses the ModR/M operand also has a paired operation code that is the same, except it writes the result to the memory location rather than reading it.

ADD [EBX],EAX is 1, 0, 0, 3

The order that reg1 and reg2 are displayed in the assembly instruction is flipped because the result is written to the memory location instead of being read.

Additionally, there is one value we cannot use for reg2 when addressing memory, as it makes the processor read another byte holding three more values, adding more flexibility to what we can encode as an address location.

ADD [EAX+EAX*8],EAX is 1, 0, 0, 4, 3, 0, 0

1 (opcode), 0 (memory/register), 0 (reg1), 4 (reg2 4=SIB), 3 (scale), 0 (index), 0 (base)

What reg2 value 4 does is let us select two registers we want to add together in the address: a base and an index. The scale is what we want to multiply the index register by in the address. Scale can be set to any of the following values: 0 = none, 1 = *2, 2 = *4, 3 = *8.
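The extra byte is called the SIB byte, and it packs the same way as ModR/M: scale (2 bits), index (3 bits), base (3 bits). A sketch of the full encoding of the example above:

```python
def modrm(mod, reg, rm):
    """Pack the ModR/M byte: mod (2 bits), reg (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

def sib(scale, index, base):
    """Pack the SIB byte: scale (2 bits), index (3 bits), base (3 bits).
    Scale 0..3 multiplies the index register by 1, 2, 4, or 8."""
    return (scale << 6) | (index << 3) | base

EAX = 0
# ADD [EAX+EAX*8],EAX: opcode 0x01 (memory destination),
# ModR/M rm=4 selects a SIB byte, scale 3 means *8.
encoding = bytes([0x01, modrm(0, EAX, 4), sib(3, EAX, EAX)])
print(encoding.hex())  # 0104c0
```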

If the instruction takes an immediate input, it comes after the ModR/M, optional displacement, and the optional SIB when reg2 is 4.

This is the basic encoding of all instructions on an x86-type core.

You can view a mapping of the ModR/M encoding and SIB as a table x86asm.net(ModR/M).

All arithmetic operations generally select from the /r register column of the ModR/M. The /r registers change names based on the size of the operation, such as a 16-bit, 32-bit, or 64-bit ADD.

On the ModR/M encoding, you will notice that there are shaded areas where the registers change when R=1 or B=1. This is because the limit of eight selectable registers per encoded instruction was not enough, so register extension operations were created.

Operation code 44 hex sets REX.R.

ADD [RAX+RAX*8],R8D is 44, 1, 0, 0, 4, 3, 0, 0

44 hex (REX.R prefix), 1 (opcode), 0 (memory/register), 0 (reg1, extended to R8D by REX.R), 4 (reg2 4=SIB), 3 (scale), 0 (index), 0 (base)
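The REX prefix itself is just one byte with a fixed high nibble of 0100 binary, followed by the four extension bits W, R, X, and B. A sketch of building it:

```python
def rex(w=0, r=0, x=0, b=0):
    """Pack a REX prefix byte: fixed 0100 high nibble, then W R X B.
    W selects 64-bit operand size, R extends the reg field,
    X extends the SIB index, B extends the rm/base field."""
    return 0b01000000 | (w << 3) | (r << 2) | (x << 1) | b

print(hex(rex(r=1)))  # 0x44 -> REX.R
print(hex(rex(w=1)))  # 0x48 -> 64-bit operand size
```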

There are other prefix operations that can modify the instruction code that comes after the prefix. For example, 66, F2, and F3 hex change the format of vector operations. Luckily for you, these are all shown in the x86asm.net reference.

The x86asm.net reference shows the opcode and mnemonic instruction name, and the order the operands are displayed in as columns op1, op2, op3, op4.

x86asm.net(x86 instruction map).

The r/m operand is the [] section of the instruction, including the optional displacement and SIB. The r operand is the reg1 in the ModR/M. The numbers r/m16/32/64 mean the operation can do a 16-bit, 32-bit, 64-bit add. The operation's size depends on the CPU's bit mode setting.

An instruction that is xmm/m means it selects from the xmm column of the ModR/M. We would also use /xmm instead of /r for the register operand.

Note that some instructions may only use a memory address, so they are displayed as m16/32/64. Setting the mode to register in the ModR/M will cause the operation to fail unless there are two separate instructions under the same operation code, one in register-only mode and one in memory-only mode.

The imm operand is the immediate input operand that comes after the ModR/M, optional displacement, and SIB.

Note that some instructions only use immediate input or relative immediate input. The rel immediate input adds the immediate to the address of the next instruction, allowing your code to be placed in different locations and still jump to the same position in the code.
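A sketch of how an assembler computes a relative immediate, using JMP rel32 (opcode E9 hex, five bytes long in total) as the example:

```python
import struct

def encode_jmp_rel32(instr_addr, target_addr):
    """Encode JMP rel32 (opcode 0xE9): the 32-bit immediate is the
    target minus the address of the *next* instruction, since the
    whole instruction is 5 bytes long."""
    rel = target_addr - (instr_addr + 5)
    return bytes([0xE9]) + struct.pack("<i", rel)

code = encode_jmp_rel32(0x1000, 0x1020)
print(code.hex())  # e91b000000  (0x1020 - 0x1005 = 0x1B)
```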

Instructions with capitalized register names, such as rAX, mean the register is part of the operation code by default and can not be changed.

ADD EAX,12345678 is 5 (opcode), 12345678 (immediate)

The operation code had a limit of 0 to 255, which was eventually exceeded, so operation code 15 (0F hex) was used to read the next 0 to 255 value as a new operation code; these became known as two-byte operation codes.

Because of this, you will see some operation codes starting with 0F hex and then the opcode.

Some instructions use only reg2 in the memory/register ModR/M, as the operations take one input, such as shifting a number's bits to the left or right, known as left shift and right shift.

The unused reg1 value is used to select from 0 to 7 operations under one operation code. On the x86 instruction map these are called instruction code groups. You will see the numbers 0 to 7 used under column o.
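A sketch of a group encoding, using the shift group under opcode C1 hex, where (as documented in the standard x86 opcode tables) reg1 value 4 selects SHL and 5 selects SHR:

```python
def modrm(mod, reg, rm):
    """Pack the ModR/M byte: mod (2 bits), reg (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

EAX = 0
SHL, SHR = 4, 5  # group selectors carried in the unused reg1 field

def shift_eax_imm8(group_op, count):
    """Opcode 0xC1 is a shift group: the reg1 field of the ModR/M
    selects which of the eight operations to perform, and the
    shift count follows as an immediate byte."""
    return bytes([0xC1, modrm(3, group_op, EAX), count])

print(shift_eax_imm8(SHL, 2).hex())  # c1e002 -> SHL EAX,2
print(shift_eax_imm8(SHR, 2).hex())  # c1e802 -> SHR EAX,2
```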

You can go all the way back to the very first x86 core, the 8086.

The encoding for all operation codes is the same; it is just that the instruction set architecture (ISA) is an older version of x86 and has fewer usable instructions. You can encode a two-byte operation code under 0F hex, but the older ISA will not know what to do with this new instruction code that did not exist on the 8086. The newer the ISA version is, the more instructions the processor can do. Every new instruction code that is added to a processor instruction set architecture, such as x86, has to have a unique encoding that does not conflict with any prior instruction encodings. This makes writing an assembler for all x86 cores very easy. This also means that you can not mistake an instruction code for a different instruction code, as all instruction codes use new operation code numbers or prefix codes that adjust the operation code as previously shown.

These same rules apply to all other processor instruction set architectures; otherwise, you would have to build new compilers and operating systems every time a new ARM-type or x86-type ISA core comes out. So, all the old instructions are kept while new encodable instruction sequences that did nothing in older cores are added. Also, companies can build their own ARM-type ISA core to run existing ARM binary software, but any instructions they add must be licensed and registered as part of the ARM-type ISA. The same rules apply to x86 cores; because of this, it is easy to take software apart one instruction at a time no matter what platform you are on, even Apple silicon.

ARM instruction architecture

ARM processors use a much simpler instruction encoding for all instructions.

The following is part of an ARM instruction encoding course and is only a few short pages of introduction to ARM technology.

ARM is typically used in cell phones and other mobile devices. You can get the full instruction encoding of all instructions from ARM developer zone developer.arm.com(cpu-architecture).

All the assembler does is make it easier to write processor operation codes without remembering their operation codes and the operand encoding it takes.

Building an assembler

You only have to build an assembler once, and it will always function the same on all newer versions of a processor instruction set architecture type.

An assembler is a dictionary for a computer processor's encodable binary operations, and the dictionary type is locked in by the core type regardless of the company that manufactures the core.

Switching dictionaries only happens when you switch processor instruction architecture type from x86 cores to ARM cores that understand ARM-encoded binary operations.

You will find that there are blank operation codes with no defined operation.

These are used to add new operation codes without breaking compatibility to older x86 software and operating systems.

All companies have to be very careful not to use existing defined operation codes to implement new operations.

Some developers find that using undefined operation codes can trick the CPU into doing two operation codes at the same time to speed up their code. Still, it is important not to use undefined operations in your compiler, as they may eventually be defined as different operations and will cause unpredictable behaviour of your compiled programs on newer x86 cores.

Adding in the new operation codes is optional and is not necessary to make a working compiler that runs on all x86 systems.

You can get the full listing of x86 instructions from Intel, AMD, and sandpile.org.

There is not an infinite number of processor instruction architecture types. Actually, there are very few in use today. The main instruction set types used today are ARM and x86.

If you create an ARM and x86 assembler, you can compile code that will run on anything, from cell phones to tablets, PCs, and even game consoles.

However, each operating system recognizes binary applications that can run directly on the CPU in a different way. This means you would have to write the file out in the format that the operating system uses to recognize it as a directly runnable binary application. We call these application container formats. Adding the different application container formats is very easy and straightforward.

Companies can build their own ARM or x86 cores but must stick to the defined behaviour of encodable instructions for the processor instruction set architecture type. Assembly language for the encodable instructions on ARM-dictionary, or x86-dictionary, never changes and stays the same based on what can be encoded as an ARM instruction or x86 instruction. The only thing that can change is the speed and performance of newer cores.

Compilers

Assemblers are great for building code that pushes the limits of what a CPU can do per second.

However, developers want to add lines to their code explaining what some steps do, lines that are not part of the executable code (comments).

They want to be able to create names for data and use parenthesized expressions such as val1=val2*(PI/180), instead of having the number PI be a relative address position in their binary code and setting registers to each value.

They also prefer to keep calculations in one line rather than multiple processor instructions.

Well, the solution to this is to build a programming language.

With a compiled programming language, we can define as many names as we like for values without dealing with registers, as the compiler records the variable names and the location it placed the variables at the end of your code.

This makes code easy to read and condenses your calculations into single-line calculations compiled as operations between registers and memory locations.

The compiler strung assembly instructions together, and the assembler generated the instruction encoding for each assembly instruction.

This created your compiled code, which was added into the application container format understood by the target operating system.

This began the development of modern programming languages, which developers can compile to different instruction set architecture types or container formats for different operating systems.

It made the code a lot less messy: easier to read, and easier to write loops and conditional code sequences.

The programming language also includes library code that does complex calculations or UI/graphics. You would have to write this by hand if the library set of functions or tools is not built into the programming language.

Compiler without assembler

Eventually, it became apparent that all things done in a programming language became the same sequence of bytes over and over with reference locations to variable names in your program.

So, instead of assembling the same instructions repeatedly, we could store the instructions already assembled and insert the location of your variable names or arrays into the byte sequences.

This allowed compilers to generate binaries extremely fast; this is where we are now with our software coding tools.
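The idea can be sketched in Python. The template and variable layout below are purely illustrative: the pre-assembled sequence is MOV EAX, moffs32 (opcode A1 hex, which loads EAX from a fixed address), and the compiler only patches in each variable's address.

```python
import struct

# Hypothetical pre-assembled template: MOV EAX, [addr] is 0xA1
# followed by a 32-bit little-endian address.
LOAD_EAX_TEMPLATE = bytes([0xA1])

def emit_load(var_offsets, name, data_base):
    """Instead of re-assembling the same instruction repeatedly,
    patch the variable's address into the stored byte sequence.
    var_offsets maps names to offsets in the data section."""
    addr = data_base + var_offsets[name]
    return LOAD_EAX_TEMPLATE + struct.pack("<I", addr)

offsets = {"val1": 0, "val2": 4}
print(emit_load(offsets, "val2", 0x404000).hex())  # a104404000
```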

A decompiler can generally decompile such byte sequences back into your source code, though it has to guess which programming language you were using. Additionally, advanced methods can detect the order of operations as steps and use a control flow graph for conditional code.

Boot-sector

The boot sector holds the CPU's first instructions when the system starts or restarts.

It is typically mapped to a fixed address in system memory that the CPU begins executing from.

Game consoles and cell phones all have one, including PCs.

Once you know the core type instruction set architecture (ARM or x86), you then know the machine code operations the processor uses and can throw a disassembler at it.

This allows you to see the inner workings of any system or even game console and typically makes it easy to build emulators.

The nice thing about desktop computers is they have an integrated boot system that can select a different disk to boot from.

This allows us to replace the boot code with a different one, allowing us to switch operating systems easily.

Operating systems

Building an assembler is hardware independent. All x86 cores understand x86-encoded instructions. All ARM cores understand ARM-encoded instructions.

When you design the operating system, it will either be a system that runs on all x86 or on all ARM, unless you want to maintain two different versions of your operating system.

What sets your operating system apart from other systems is how binary applications are recognized, which is called the application container format.

Relocation section

Not all processors have support for relative addresses. This means that your compiler must record the locations of all your variables and write them out as a relocation list. Your loader must then read the relocation list and add an offset, the location where the binary is placed in RAM memory, to each recorded address.

This allows you to load more than one application at a time and space them apart so that they do not all have to start at address zero.
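A minimal sketch of a loader applying such a relocation list, assuming a simplified flat format where each entry is just a byte offset of a 32-bit address to patch:

```python
import struct

def apply_relocations(image, reloc_offsets, load_base):
    """For each recorded offset, add the load base to the 32-bit
    little-endian address stored there."""
    image = bytearray(image)
    for off in reloc_offsets:
        value, = struct.unpack_from("<I", image, off)
        struct.pack_into("<I", image, off, value + load_base)
    return bytes(image)

# MOV EAX, [0x10] compiled as if the binary were loaded at address 0;
# the address field starts at byte offset 1.
code = bytes([0xA1]) + struct.pack("<I", 0x10)
fixed = apply_relocations(code, [1], 0x400000)
print(fixed.hex())  # a110004000 -> now reads from 0x400010
```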

Export section

The next important part is being able to define a section of code as a particular method name which stores the name and location of the section in an export list. This will make it so that you can get the memory location of a method name in another binary. Our compiler creates an export list in the binary container that the loader will use to know the location of each function/method, which is added to the address where the binary is placed in RAM memory.

Import section

When you want to call a method from another binary, your compiler makes an import list. Your loader can replace the import list entries with the location of the external method's code. Your binary code uses a relative jump to the import entry in your binary application, and it is your loader's job to ensure the import entry points to the right export address.

Why do all of this?

All applications can have import and export lists. This is called loading and linking and is very important so we can load and link different chunks of code for sending commands and communicating with different hardware components. An important note is that operating systems like Linux and macOS call these lists symbol tables.

This also keeps our binaries compact and smaller in size. However, it has a few drawbacks.

This makes our applications runnable only on our operating system or console firmware that contains our linkable binary methods and knows how to read and run the container format we made for storing the relocation list, export, and import sections. Emulators get around this issue by running the video game console firmware directly, which is legal. It is, however, illegal to install a video game console's firmware on our own systems and run it directly as the operating system, even though the console may be x86 and its firmware would give us everything needed to run its software directly. This is in the licensing agreements for video game console software and macOS.

The more advanced the emulator is, the more that the console's firmware is replaced with code that knows how to read and load the binary applications/games. The linkable binary methods start to add some extra code before calling similar linkable methods on your current system to make it behave like the linkable methods and binaries on the game console. This is called high-level emulation. High-level emulation is much faster than running the console's boot firmware directly or running an operating system on a virtual machine from its boot firmware.

Operating system (software drivers)

Building an operating system on cell phones or any computer device without drivers is possible.

It is called building software drivers.

By design, all chipsets (consoles, cell phones, PC) have display output converters connected to a memory chip.

The onboard converter's job is to convert that memory into a VGA, HDMI, DisplayPort, or internal display output signal.

All digital devices have a video memory location, and you, as the designer for the system's hardware, are responsible for adding the right converter.

Software-rendered graphics can run on any system and does not need a graphics card. Instead, we write three values for the intensity of red, green, and blue color per pixel.

Changing the first three values changes the color of the top-left square (pixel) of your display device.

When you reach the end of the first line, you start on the next line.

The larger the display you have, the more RAM memory your operating system will use for the pixel colors.

During the boot code on your operating system, you can choose the memory location you wish to use for the connected monitor or display.

The y (up and down) is multiplied by the length of each line on the display; the x (distance across the screen) is added to that, and the total is multiplied by three, as each color uses three numbers in memory. We are then able to write our three colors to the square (pixel) we wish to change.
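The formula above, (y * width + x) * 3, can be sketched directly against a byte array standing in for video memory:

```python
def put_pixel(framebuffer, width, x, y, r, g, b):
    """Write one pixel: the byte index is (y * width + x) * 3,
    because each pixel stores three color intensities in a row."""
    i = (y * width + x) * 3
    framebuffer[i:i + 3] = bytes([r, g, b])

width, height = 4, 3
fb = bytearray(width * height * 3)   # all black to start
put_pixel(fb, width, 2, 1, 255, 0, 0)  # red pixel at x=2, y=1
print(fb[(1 * width + 2) * 3])  # 255
```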

On a graphics card, we can send commands to change a pixel color at x and y in the graphics card video memory. We can also disable memory-mapped output to the display. We can also run graphics functions inside the GPU, as a GPU is a separate computer with its own video-mapped memory.

Doing graphics functions on the CPU is slower, and sending commands to a graphics card that does the function for you is wise.

So graphics can be done on the CPU side without graphics drivers but is very slow.

Operating systems default to RAM memory-mapped software-rendered graphics when it does not have a driver to issue the function calls to the graphics card.

I have built my own software-rendered graphics functions for fun that I could boot into across different systems, whether the system had an operating system or not. The only place on the internet where I have found an accurate description of video memory is techtarget.com (storage: video memory).

The BMP (bitmap) picture format is based on the raw binary form of graphics memory and is a hardware-independent picture format.

Audio output is also done completely in software and does not change format between systems, as we can send as many sample points per second as we like, using higher- or lower-precision numbers, creating an audio wave or sound.

A wave audio file stores audio as is without compression and is easy to read and play.
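A sketch of generating such an audio wave and storing it uncompressed, using Python's standard-library wave module to write one second of a 440 Hz tone:

```python
import math
import struct
import wave

rate = 8000  # sample points per second
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit precision per sample
    w.setframerate(rate)
    # One second of a 440 Hz sine wave, stored as-is, no compression.
    for n in range(rate):
        sample = int(20000 * math.sin(2 * math.pi * 440 * n / rate))
        w.writeframes(struct.pack("<h", sample))
```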

Everything has a fallback driver that allows you to use an operating system regardless of whether you have hardware-accelerated drivers installed or not.

All keyboards follow a standard code number for every encodable text character (the Unicode operating system standard), and the characters are stored as a list of pictures in a font file.

Because of the base formats, everything works without needing special drivers.

Creating a bootable OS that runs on all x86, or all ARM is actually easy. Adding in all the hardware-accelerated functions that replace your software driver functions is also important if you want your operating system to be fast.

Linux on the Nintendo Switch

The Switch is an ARM arch type, so a Linux OS assembled with ARM-encoded instructions is all you need. Also, you do not have to worry too much about drivers, because software drivers run on the CPU.

Common compilers

Operating systems Windows, Linux, and macOS are built in C/Assembly. C/C++ allows us to organize code better and give the code more meaning, variable names, and sections of code function names. Raw binary processor instructions are done in assembly to speed up parts of the operating system that we want to be optimized as much as possible.

The C programming language is best if you are designing code to run on an instruction set architecture such as x86, or ARM and allows you to mix in assembly. Because of this, we are able to optimize code directly per CPU operation and are also able to use debuggers that disassemble core code (disassembly).

Comparison of compilers

It does not matter what programming language you are using. Everything you can do at the arithmetic level can be done in any compiled language, even JavaScript.

If you want to really test your knowledge, you can write everything in raw arithmetic in assembly, or use a language that was invented to test the skills and knowledge of software developers, such as the Brainfuck programming language.

You could even create a compiler in the Brainfuck programming language if you wanted, and do pretty much anything you want.

The only true limit that exists is what you can build and create.

As a professional developer, you do not want to be limited by callable code and functions made for you or included in your developer tools or programming language. It is good practice to understand the inner workings of the tools you use and to be able to implement them elsewhere or in variation.

A quick read of a programming language's function reference is enough to show you what you can shrink down using the language's shortcuts.

Once you know what compilers are, switching between programming languages is very easy. The language syntax or compiler you like is entirely up to your preference and coding style.

It is not the number of programming language syntaxes that you know that makes you a good software developer, as the tools all compile to the same machine code instructions.

The knowledge of building and creating different functions and tools is what makes a good software developer: an understanding of graphics functions, trigonometry, physics, science, algebra and simplification, calculus, statistics, algorithms, and data science, combined with a bit of creativity to create new code, functions, and tools.

Additional tools and reading

This is meant to be an introductory document so that developers can better understand their tools and how they were developed to better design their code or to create basic compilers and software drivers for a custom bootable operating system.

This is the sort of introduction developers should have when learning a new programming language so they have a better perspective of what they are learning. There is no limit to what you can design or build in any programming language or coding tool. Picking the right coding tool that fits your style is what is important.

This introduction leaves out some details, such as data types and the differences between interpreted code and code that can be run directly by the CPU.

The disassembler project wiki covers all of the basics in even greater detail, including ideas and thoughts about good security practices to implement in systems.
