(This is a translation of the original article in Japanese by moratorium08.)
(UPDATE (22/3/2019): Added some corrections provided by the original author.)
Writing your own OS to run on a handmade CPU is a pretty ambitious project, but I managed to get it working well, so I'm going to write some notes about how I did it.
For those of you who want to skip straight to the code, it's on GitHub:
https://github.com/moratorium08/osmium/
I only wrote the actual OS. ATP wrote the core and dai implemented the multiply instruction.
https://github.com/hitohira/yokyo
I implemented an OS from scratch in Rust that runs on RISC-V (rv32im). When writing an OS you need to take care of things like paging, kernel/usermode privilege separation, interrupts, etc., and I implemented these based on the RISC-V Privileged spec document. There might be a few things in the RISC-V spec that I didn't implement, but there's nothing in the implementation that isn't in the spec (in other words, there's no code based on a spec I invented myself). For paging, I implemented the Sv32 scheme.
The major parts of the OS that I implemented are:
- bootloader
- memory mapped UART I/O
- RISC-V Privileged Architecture, Sv32 paging
- Supervisor Mode/User Mode privilege separation
- ELF file parsing and process management
- preemptive multitasking via clock interrupt trap handling
- Copy on Write fork and system calls: fork, execve, exit, mmap, etc.
- inter-process communication
- a simple filesystem
I also made a shell as an example of a usermode application that makes use of these kernel features. It's quite a simple program, but it manages its child processes in almost the same way as Linux shells do (fork, execve, wait, exit). In other words, it's proof that the OS has enough features to support something as powerful as a shell.
- Summary
- What's RISC-V?
- Why Rust?
- About the implementation
- Building Rust programs to target the ISA
- About the Privilege Architecture
- UART I/O
- Paging
- Parsing ELF files
- Copy on Write fork
- Trap handling and preemptive multitasking
- Inter-process communication
- Filesystem (simple)
- System calls
- Shell
- Try it in an emulator
- Final thoughts
- References
- Appendix: about the CPU assignment
You should probably just google it. It's an ISA (instruction set architecture) that's gaining a lot of interest recently. Fundamentally it's similar to MIPS but it's a clean, modern RISC architecture without all the legacy crud of MIPS.
Writing a CISC core such as x86 is really tough, so if you're going to write an OS for fun and run it on an FPGA, I think RISC-V is a natural choice. Compiler tooling with support for this architecture, e.g. gcc and llvm, already exists, and there is also an emulator called Spike (although I didn't use it myself).
A lot of real-world kernel security vulnerabilities are caused by things like buffer overflows and type errors. Using Rust means more powerful static analysis, allowing you to build a more secure OS. Of course, in the process of writing an OS you need to use a few unsafe features, so you might not achieve the same level of safety as you would when writing a normal Rust program. But it's easy to distinguish between the safe and unsafe parts of the code, and by minimising the use of unsafe you can focus your efforts on the unsafe parts of the system.
Furthermore, even if you can't use the features of std, you can still make use of Iterator, Result, Option, pattern matching, traits, etc. to make your code beautiful.
Anyway, although I enumerated some reasons above, the real reason I adopted Rust for this project is that I just felt like writing it in Rust.
Some notes on things I struggled to implement, important implementation points, etc.
Rust has officially supported RISC-V as a target since last year. However, RISC-V has a few different base ISAs (rv32i, rv64i, etc.) and defines a number of extensions, so when somebody says "RISC-V" they are actually talking about one of many variants.
The core I used to run the OS has a 32bit architecture. Currently there are 2 officially supported 32bit ISAs: rv32imac and rv32imc. The letters after the number stand for spec extensions: m stands for Integer Multiplication and Division, a for Atomic Instructions and c for Compressed Instructions.
For this project, we decided making a multicore CPU (in RISC-V terminology, a CPU with multiple harts) was a bit too ambitious, so our core has only one hart and we went for the architecture without atomic instructions. We also decided not to bother with compressed instructions. "Compressed instructions" is a 16-bit subset of the instruction set that only supports frequently used instructions, frequently used constants (e.g. 0) and frequently used registers (e.g. the zero register). According to the RISC-V spec, this subset of instructions makes up 50-60% of most programs and use of compressed instructions can reduce code size by 25-30%.
In hindsight, all you need to do to support compressed instructions is fiddle with the decoder and the instruction fetching code a bit, so maybe it wouldn't have been that hard. But, as we'll see later, the CPU also had to implement the MMU and CSRs (Control and Status Registers), so we wanted to keep things as simple as possible and focus on those.
As I mentioned above, rv32im does not exist as an official target, so I had to set up my own build environment for cross-compilation. I did this with help from Writing an OS in Rust, the Rust official docs, The Embedded Rust Book and RustでRISC-V OS自作!はじめの一歩. For more details, take a look at my previous post or the GitHub repo.
The result is that I'm able to make a build that doesn't use the standard library.
The implementation follows the RISC-V Privileged Architecture spec. RISC-V currently defines three privilege levels: Machine, Supervisor and User. According to the spec, whatever your hardware, you have to implement support for Machine mode.
For this project I wanted to implement paging, so Supervisor mode was also a must, meaning I had to implement the whole Machine-Supervisor-User set. However, to keep things simple, I decided to make the machine switch to Supervisor mode immediately after startup. If we assume that interrupts always have the delegate bit set, the Supervisor has the necessary permissions to perform all the processing.
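As a rough sketch of what that setup might look like (my own illustration, not code from the repo; the CSR names come from the privileged spec), the boot path can delegate everything while still in Machine mode:

use core::arch::asm;

// Delegate all exceptions and interrupts to Supervisor mode.
// medeleg/mideleg are WARL, so bits the core doesn't support are ignored.
unsafe fn delegate_all_to_supervisor() {
    asm!("csrw medeleg, {}", in(reg) u32::MAX);
    asm!("csrw mideleg, {}", in(reg) u32::MAX);
}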
I used nearly all of the Supervisor CSRs. ATP (the person who wrote the CPU) took care of implementing the corresponding instructions (csrrw, csrrc, csrrs, etc.) and system-level instructions (sret, ecall) in the CPU.
I decided to use UART for I/O, so first of all I had to write I/O processing that uses UART. I made the UART memory-mapped, so you just need to read or write the correct memory location one byte at a time, but I wanted to provide slightly richer I/O support. So I implemented print! and println! macros so you can do formatted output just like in regular Rust.
This is pretty easy to do in Rust if you implement the core::fmt::Write trait. (0x80000000 is the UART input address I chose.)
use core::fmt::Write;

const UART_RX: *const u8 = 0x80000000 as *const u8;
const UART_TX: *mut u8 = 0x80000004 as *mut u8;

struct UART;

fn write_byte(byte: u8) {
    unsafe {
        *UART_TX = byte;
    }
}

impl Write for UART {
    fn write_str(&mut self, s: &str) -> core::fmt::Result {
        for c in s.bytes() {
            write_byte(c);
        }
        Ok(())
    }
}

pub fn print(arg: ::core::fmt::Arguments) {
    UART.write_fmt(arg).expect("failed to send by UART");
}

#[macro_export]
macro_rules! print {
    ($($arg:tt)*) => ($crate::uart::print(format_args!($($arg)*)));
}

#[macro_export]
macro_rules! println {
    () => (print!("\n"));
    ($arg:expr) => (print!(concat!($arg, "\n")));
    ($fmt:expr, $($arg:tt)*) => (print!(concat!($fmt, "\n"), $($arg)*));
}
RISC-V provides a spec for paging. For this project the architecture is 32bit, so the only choice of paging scheme is Sv32. Putting it very roughly and ignoring superpages, it's a 2-step jump: you use the first 10 bits and the next 10 bits of the virtual address to do a double table lookup, then combine the physical page number from the resulting leaf entry with the bottom 12 bits of the virtual address (the page offset) to get the physical address. For more details, refer to the spec.
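As a minimal sketch of that decomposition (my own illustration, not code from the repo):

// Split an Sv32 virtual address into its two table indices and page offset.
fn sv32_split(va: u32) -> (u32, u32, u32) {
    let vpn1 = (va >> 22) & 0x3ff; // top 10 bits: index into the first-level table
    let vpn0 = (va >> 12) & 0x3ff; // next 10 bits: index into the second-level table
    let offset = va & 0xfff;       // bottom 12 bits: offset within the 4KiB page
    (vpn1, vpn0, offset)
}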
When it came to implementing this, I struggled a bit with accessing the intermediate page tables.
As mentioned above, conversion from a virtual address to a physical address involves a 2-step table lookup (again, we're ignoring superpages). Now, let T0 be the first-level table and T1 the second-level tables. When you try to implement the table lookup you hit a problem, which is that after you've turned on the MMU it's difficult to access T1. For the T0 table, I was able to pre-assign one entry per process at startup time, so I knew the table's virtual address. But the T1 tables are assigned dynamically (at startup time, not all the memory is mapped into the kernel's page space). This means that even if you know a table's physical address, the virtual address you need to access it hasn't been defined.
One solution to this, for example with x86, is to assign one entry in the T0 table to point to itself, meaning you can avoid the need for 2 jumps (I think this is known as recursive paging). So x86 paging makes it possible to access intermediate tables without having to sacrifice precious memory space.
In Sv32, however, the spec dictates that every page table entry must be either a leaf, i.e. the physical address of the head of a page, or a pointer to the next page table, and if you try to do more than 2 jumps you get a page fault. (I'm a bit unsure about this stuff, so please correct me if I've misunderstood.)
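For reference, here's how that leaf-versus-pointer distinction is encoded in an Sv32 PTE (bit meanings from the spec; the helper function is my own sketch):

const PTE_V: u32 = 1 << 0; // valid
const PTE_R: u32 = 1 << 1; // readable
const PTE_W: u32 = 1 << 2; // writable
const PTE_X: u32 = 1 << 3; // executable

// A valid entry is a leaf iff any of R/W/X is set; otherwise it points
// to the next-level page table.
fn is_leaf(pte: u32) -> bool {
    pte & PTE_V != 0 && pte & (PTE_R | PTE_W | PTE_X) != 0
}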
Because of this, it's not possible to do recursive paging as described above. This is because the page table entry pointing to the page table you want to access is not a leaf. To work around this problem I made another table T0', which is a duplicate of T0. One entry in T0 maps to T0', but all the other entries in the two tables are identical.
(see original post for diagram)
Since this design needs one more page for paging, it wastes PAGESIZE memory space per process compared with recursive paging, but it achieves the same result. It can't really be described as beautiful though. If there's anyone out there who understands this stuff properly, I'd appreciate some advice.
Rust has a zero-copy ELF parsing library called goblin, but I wasn't able to use it, for reasons to do with its use of synchronisation (atomic) instructions, which rv32im lacks. So I wrote my own simple parser.
Assuming you can fit the whole ELF in memory, it has a structure that's easy to parse. If you define a struct like in the code below, you can simply decode the bytes into the struct (although this code might be missing some validation).
#[derive(Debug)]
#[repr(C)]
pub struct ElfHeader {
    pub magic: u32,
    pub elf: [u8; 12],
    pub etype: u16,
    pub machine: u16,
    pub version: u32,
    pub entry: u32,
    pub phoff: u32,
    pub shoff: u32,
    pub flags: u32,
    pub size: u16,
    pub phentsize: u16,
    pub phnum: u16,
    pub shentsize: u16,
    pub shnum: u16,
    pub shstrndx: u16,
}

/* ... */

pub fn new(bytes: *const [u8]) -> Result<Elf<'a>, ElfError> {
    let bytes = unsafe { &*bytes };
    let elf = unsafe {
        let data: *const ElfHeader = bytes.as_ptr() as *const ElfHeader;
        &*(data)
    };
    if elf.magic != ELF_MAGIC {
        return Err(ElfError::InvalidMagic);
    }
    Ok(Elf { bytes, elf })
}
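The program headers that phoff points to decode the same way. For reference, a sketch of the standard ELF32 program header layout (my addition, not code from the repo):

#[repr(C)]
pub struct ProgramHeader {
    pub ptype: u32,  // PT_LOAD (1) entries are the segments to map
    pub offset: u32, // where the segment's bytes start in the file
    pub vaddr: u32,  // virtual address to map the segment at
    pub paddr: u32,  // physical address (typically unused here)
    pub filesz: u32, // bytes present in the file...
    pub memsz: u32,  // ...and the total size in memory (the rest is zeroed)
    pub flags: u32,  // R/W/X permissions
    pub align: u32,  // required alignment
}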
For more details take a look at the code. Maybe this stuff is common knowledge to experts in the field.
I implemented fork using Copy on Write. In order to start a new process, you need a command that takes an exact clone of the current process's memory and uses it to create a new process, i.e. fork.
You need a way to copy all of a given process's memory. Of course it's possible to make a copy of all of the process's mapped memory, but this would involve a huge amount of memory access and thus would be extremely expensive.
We can improve the situation by taking a copy of the page table, setting a Copy on Write flag on the writable pages, then only copying a page when one of the processes (either the original process or the forked one) actually makes a write to that page. This is known as Copy on Write and can be thought of as lazily evaluating the copying of pages. Often after a fork has taken place, a call to execve will replace the address space, so not bothering to copy a load of pages unnecessarily is a good optimisation.
(see original post for diagram)
In order to implement Copy on Write I used 1 bit of the RSW field in page table entries.
(see original post for diagram)
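A minimal sketch of the idea (the PTE bit positions follow the Sv32 layout; which RSW bit to use and the helper itself are my own illustration):

const PTE_W: u32 = 1 << 2;   // writable
const PTE_COW: u32 = 1 << 8; // one of the two software-reserved (RSW) bits

// At fork time, a writable page becomes read-only and is tagged CoW.
fn make_cow(pte: u32) -> u32 {
    (pte & !PTE_W) | PTE_COW
}

// Later, a store to that page traps; if the handler sees PTE_COW set, it
// copies the physical page, points the faulting entry at the copy,
// restores W and clears PTE_COW.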
As mentioned above, the system is set up so that the supervisor is delegated to handle all interrupts, so the CSRs (Control and Status Registers) are all managed by the supervisor.
In order to handle a trap, you first have to set the address of the instruction you want to jump to when a trap happens. You do that by writing the address to the CSR called stvec. (By the way, I'm using Direct mode, not Vectored mode.) Having set that, when a trap occurs the program counter is set to that address and trap handling begins. When that happens, the stack is replaced with a dedicated trap-handling stack and the usermode registers are saved; the CSR called sscratch is used to point to the location where the usermode registers are temporarily stored. Also, sepc holds the program counter at the time the trap occurred, so that is saved too.
https://github.com/moratorium08/osmium/blob/master/kernel/src/trap.rs#L394-L437
After that, the supervisor looks at the scause and stval registers and handles the trap appropriately.
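A rough sketch of that dispatch (my simplification; the handler names are hypothetical, and the cause codes are from the privileged spec):

use core::arch::asm;

fn handle_trap() {
    let scause: u32;
    unsafe { asm!("csrr {}, scause", out(reg) scause) };
    let is_interrupt = scause >> 31 == 1; // top bit: interrupt vs exception
    let code = scause & 0x7fff_ffff;
    match (is_interrupt, code) {
        (true, 5) => timer_interrupt(),    // supervisor timer interrupt
        (false, 8) => environment_call(),  // ecall from User mode
        (false, 15) => store_page_fault(), // possibly a Copy on Write fault
        _ => panic!("unhandled trap"),
    }
}

fn timer_interrupt() { /* switch to another Runnable process */ }
fn environment_call() { /* decode the system call number and a0, a1, ... */ }
fn store_page_fault() { /* copy the page if it was marked CoW */ }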
To return to usermode after trap handling is complete, you basically do all of the above in reverse: set all the registers appropriately, put the saved program counter back in sepc and call sret.
https://github.com/moratorium08/osmium/blob/master/kernel/src/trap.rs#L461-L544
This trap-handling mechanism can also be used for clock interrupts. RISC-V uses 2 memory-mapped registers, mtime and mtimecmp, to control timer interrupts.
The process for handling clock interrupts is pretty simple: put the currently running process into the Runnable state, and start running one of the currently Runnable processes. I didn't have time to implement any funky scheduling algorithms, so I used a pretty simple rule: I sort the processes in order of how recently the scheduler has checked whether they are Runnable, then check them from the oldest one until I find one that is in the Runnable state.
https://github.com/moratorium08/osmium/blob/master/kernel/src/trap.rs#L340-L345
https://github.com/moratorium08/osmium/blob/master/kernel/src/proc.rs#L260-L279
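In sketch form, the selection rule looks something like this (my own condensation; the types are illustrative rather than the actual ones in proc.rs):

#[derive(PartialEq)]
enum Status { Running, Runnable, Sleeping }

struct Process { last_checked: u64, status: Status }

// Pick the Runnable process whose status was checked longest ago.
fn pick_next(procs: &mut [Process]) -> Option<usize> {
    procs.sort_unstable_by_key(|p| p.last_checked); // oldest check first
    procs.iter().position(|p| p.status == Status::Runnable)
}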
There are 2 standard ways to achieve IPC: message-passing using a queue, and memory sharing, i.e. allowing multiple processes to share the same area of physical memory. For this project I implemented both.
When using message passing, you have to make a system call every time you want to send a message. This means that, compared with memory sharing, it's not very performant when passing large amounts of data (but, according to Operating System Concepts, this is not always true when you have a large number of cores, because of the effect of cache line bouncing). My implementation is heavily influenced by xv6: there is one mailbox per process, and each mailbox can be read only by that process but written by many processes. Writing to and reading from mailboxes is asynchronous, so if you want to know whether the message has actually been received or not, you need the recipient to send you another message.
If we assume people will use this message-passing mechanism for sending lightweight synchronisation messages between processes, the use case for memory sharing is inter-process sharing of large amounts of data. In order to use memory sharing you only need to make a single system call to create the shared memory zone, and after that you just do plain memory reads and writes, so there are no system calls involved and it performs well.
To showcase IPC I made a toy example, catch_ball.rs. This program first does some message passing, then sets up a shared memory zone and manipulates data inside it.
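To give a flavour of the two styles from userland (every call name here is hypothetical, for illustration only, not the actual osmium API):

// Hypothetical wrappers (not the real osmium system call names).
fn ipc_demo(peer_id: u32) {
    // Message passing: every send is a system call into the peer's mailbox.
    syscall::sys_send(peer_id, 42);
    // Reading our own mailbox is asynchronous: no delivery acknowledgement.
    let _msg = syscall::sys_receive();
    // Memory sharing: one system call to set up the zone, then plain loads
    // and stores with no further kernel involvement.
    let shared: *mut u8 = syscall::sys_make_shared(peer_id, 4096);
    unsafe { *shared = 0xff };
}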
Unfortunately I wasn't able to implement a proper filesystem in time for our final presentation of the assignment (19th Feb), so I just hacked something together to provide files to use as input for calls to execve etc. I hardcoded some static files and just looked up the file content by name. This sucked, so after the final presentation I started work on a real filesystem.
Microkernels are kernels that implement things like memory management, process management and IPC in kernel mode but provide things like filesystems and the networking stack in userland. This OS follows that design.
To briefly explain my homegrown filesystem spec: just like most mainstream filesystems, it manages "blocks". The very first block is a "superblock" containing information about the filesystem, and the next is a management block recording which blocks are in use. After that come the blocks containing actual file data.
The filesystem supports regular files and directories, and directories can recursively contain both regular files and other directories.
Both regular files and directories have their own metadata sections, and a double-jump pointer to the actual data. For example you can access the data in a regular file by traversing the 2 tables as shown in the diagram below.
(see original post for diagram)
Directories are structured in the same way and hold pointers to their child files.
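To illustrate the general shape only (these field names are my guesses, not the real on-disk layout, which is defined by the disk formatter mentioned below):

#[repr(C)]
struct SuperBlock {
    magic: u32,       // identifies the filesystem
    block_size: u32,  // size of each block in bytes
    n_blocks: u32,    // total number of blocks on the disk
    mgmt_start: u32,  // first management (block-usage) block
    data_start: u32,  // first block of actual file data
}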
I wrote the disk formatter in Python. To be honest, I was getting tired so I didn't bother adding it into the OS properly. Writing all this stuff on your own is pretty tough, so if anyone wants to help out, that would be great. I guess I'll get around to implementing it properly soon.
In RISC-V, system calls are known as environment calls, and you can use the ecall instruction to raise an environment call exception. The arguments to the call are passed via a0, a1, ... and the result is returned in a0.
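A minimal sketch of a userland wrapper following that convention (the text only specifies the argument and return registers, so where the call number goes is my assumption):

use core::arch::asm;

// Issue an environment call with two arguments; the result comes back in a0.
fn sys_call2(num: u32, arg0: u32, arg1: u32) -> u32 {
    let ret: u32;
    unsafe {
        asm!(
            "ecall",
            in("a7") num,                // assumption: call number in a7
            inlateout("a0") arg0 => ret, // a0 carries an argument in, the result out
            in("a1") arg1,
        );
    }
    ret
}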
As an example of a user program that makes use of the features described above, I built a shell. I didn't have much time so it only includes basic features, but it's able to perform the basic fork -> execve -> exit flow.
pub extern "C" fn _start() -> ! {
    let mut buf = [0u8; 256];
    println!("** Osh **");
    loop {
        print!("$ ");
        let (len, b) = uart::buffered_readline(&mut buf);
        if len == 0 {
            continue;
        }
        if !b {
            println!("Sorry. too long. Please enter shorter command");
            continue;
        }
        // Only the len bytes actually read are valid input.
        let cmd = match str::from_utf8(&buf[..len]) {
            Ok(s) => s,
            Err(_) => {
                println!("Failed to parse your input. Try again.");
                continue;
            }
        };
        if cmd == "exit" {
            syscall::sys_exit(0);
        }
        match syscall::sys_fork() {
            syscall::ForkResult::Parent(id) => {
                loop {
                    match syscall::sys_check_process_status(id) {
                        syscall::ProcessStatus::Running | syscall::ProcessStatus::Runnable => {
                            syscall::sys_yield();
                        },
                        _ => {
                            break;
                        }
                    }
                }
            },
            syscall::ForkResult::Fail => {
                println!("fork failed");
            },
            syscall::ForkResult::Child => {
                syscall::sys_execve(cmd, len as u32, &[], &[]);
            }
        }
    }
}
I went into a lot of detail about the implementation, but if you want to see it running in an emulator you can use the docker image.
docker run -it moratorium08/osmium
It's an image I made from the build environment so it's pretty massive. It might take a while to download.
The graphics are not pretty, but at least you can see that it's actually working.
That's pretty much all I have to say. I didn't implement any networking stuff but I probably should, as it would unlock a lot more interesting userland applications. It would also be fun to implement some graphics support using VGA, maybe even a windowing system. Also the whole implementation is pretty rough around the edges (memory is not cleaned up properly, you can't clean up processes, the privileges stuff is sometimes wrong, etc.) so there's plenty of tidying up and polishing I could do.
My background regarding operating systems is pretty much:
- I read the source code of Unix V6
- I did all the exercises for the MIT xv6 course
- I read Writing an OS in Rust (and did exactly what it told me to)
- I did liva's systems programming assignment (and took the operating systems course)
Before this I'd never implemented my own OS, so it was a good experience. The OS I made is pretty basic and doesn't break any new ground, so I'd like to add some more interesting features to it in the future.
Also ATP writing an amazing core was a lifesaver, and hacking on this in a small group with a short deadline was a lot of fun.
I haven't explained it properly so far, but this was an optional extra part of a CPU assignment for my computer science course at university, so it was just for fun. Previous students who tried this extra assignment have done a lot of different things, including porting xv6 and building an OS from scratch. It's quite exciting when you see your handmade CPU work for the first time, and I think quite a lot of people feel the urge to build more on top of that foundation.
In the main part of the CPU assignment, some people focussed heavily on performance, and others were experts at building cores. (ATP, who got raytracing working on 3 different architectures and wrote the core my OS runs on, is an expert on cores, while uenoku, who wrote min-caml from scratch in 2 languages and wrote an LLVM backend for his original ISA, is an expert on compilers.) I can't claim that level of expertise, but I thought it was worth jotting down my thoughts.
For the CPU assignment we had to build a core and an FPU in Verilog, write a compiler in OCaml, and use them to run a raytracing program on an FPGA, competing on execution speed.
Once you've got it working, the idea is to optimise it by designing cores with higher clock speeds, reducing the number of dynamically executed instructions, increasing the instruction level parallelism in the core, etc.
I was in charge of the core for this main part of the assignment, and I also helped to optimise the compiler. The architecture was a simple pipeline running at 210MHz (without instruction parallelism, but with register forwarding and out-of-order completion of load instructions).
After doing some optimisation work to reduce the instruction count, we ended up with around 1.88 billion instructions, which took 20.7 seconds to execute. The first time they did this exercise back in 2006 the record was set at 18.8 seconds, and the record as of 2016 was 4 seconds, so ours was not a spectacular result. This was probably because we didn't have instruction-level parallelism, and we could have optimised our multi-cycle floating point operations more.
The point is, as we can plainly see from looking at modern commercial CPUs, that parallelism is the key to getting good performance.