@mcclure
Last active March 27, 2021 20:42

Somebody asked my advice on writing an independent CLR (C#/.NET) VM implementation. Here's what I said.

Clarifications:

  • Download ECMA-334 + ECMA-335 PDFs first.
  • "Get" C# bytecode-compiler / disassembler means download them-- don't write your own.
  • Author of JSIL is @antumbral.

You want to get a c# -> bytecode compiler, and you want to get a bytecode disassembler

mono's compiler is mcs, and its disassembler is monodis

mono is moving over to the coreclr toolchain, so roslyn and [???], but i don't know how far along they are with that

but it doesn't much matter. there's minor coreclr/pre-coreclr differences but that doesn't mean i know what they are

maybe you won't run into them

the biggest differences are probably in the BCL and how the BCL is found/loaded

if you start parsing stuff, i recommend only interpreting the bytecode

the container format for CLR bytecode is actually PE. They're just tables in normal ass DLLs.

So first thing you'll want to either write, or find, a PE decoder. Either should be fine. Writing a PE decoder might be fun.
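To give a feel for how small the first step of a PE decoder can be, here is a rough Python sketch that walks the DOS header, PE signature, COFF header and optional header to find the data directory slot that points at the CLI header. The function name `find_cli_header` is made up for this example, and the offsets are the standard PE32 / PE32+ layouts as I understand them; check them against ECMA-335 Partition II before relying on them.

```python
import struct

# Data directory slot 14 is the one reserved for the CLI (CLR runtime) header.
CLI_HEADER_DIRECTORY_INDEX = 14

def find_cli_header(image: bytes):
    """Return (rva, size) of the CLI header directory entry, or None if absent.

    Hypothetical helper for illustration; does no bounds checking.
    """
    if image[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The DOS header stores the file offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF header follows the signature; the optional header follows that.
    optional_header = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", image, optional_header)
    if magic == 0x10B:      # PE32
        directories = optional_header + 96
    elif magic == 0x20B:    # PE32+
        directories = optional_header + 112
    else:
        raise ValueError(f"unknown optional header magic {magic:#x}")
    # NumberOfRvaAndSizes is the 4-byte field just before the directory array.
    (count,) = struct.unpack_from("<I", image, directories - 4)
    if count <= CLI_HEADER_DIRECTORY_INDEX:
        return None
    rva, size = struct.unpack_from(
        "<II", image, directories + 8 * CLI_HEADER_DIRECTORY_INDEX)
    return (rva, size) if rva else None
```

The value you get back is an RVA, not a file offset, so the next step would be reading the section table to translate it before you can read the CLI header and the metadata tables behind it.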

Once you've got that done and are interpreting content, there's two main phases (thinking abstractly) you'll want to think about: Metadata, and runnable code

idk if "metadata" is the right word, that's mono's word. but you're going to have to create structures that describe types: classes, methods, method specializations (mono creates a separate metadata object for every specialization of a given thing, like thing<A> is different from thing<B>, and this may in fact be unavoidable, at least if you choose to JIT). and you have to do at least a little of this before you can run code, because if you can't parse classes you can't find public static void Main.
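A very rough sketch of what "one metadata object per specialization" could look like, in Python. Every name here (`TypeDesc`, `MethodDesc`, `TypeRegistry`) is invented for the example and is not mono's actual design. It also finds the entry point by name for simplicity, whereas a real assembly records its entry point as a token in the CLI header.

```python
from dataclasses import dataclass, field

@dataclass
class MethodDesc:
    name: str
    is_static: bool = False

@dataclass
class TypeDesc:
    name: str
    type_args: tuple = ()
    methods: dict = field(default_factory=dict)

class TypeRegistry:
    """Toy registry: one TypeDesc per (definition, type arguments) pair."""

    def __init__(self):
        self._definitions = {}  # type name -> list of MethodDesc
        self._instances = {}    # (type name, type args) -> TypeDesc

    def define(self, name, methods=()):
        self._definitions[name] = list(methods)

    def get(self, name, type_args=()):
        # Each distinct specialization gets its own descriptor, created lazily
        # and cached so repeated lookups return the same object.
        key = (name, tuple(type_args))
        if key not in self._instances:
            methods = {m.name: m for m in self._definitions[name]}
            self._instances[key] = TypeDesc(name, key[1], methods)
        return self._instances[key]

    def find_static_method(self, method_name):
        # Simplification: scan every definition for a static method by name.
        for type_name, methods in self._definitions.items():
            for m in methods:
                if m.name == method_name and m.is_static:
                    return type_name, m
        return None
```

With this, `registry.get("List", ("int",))` and `registry.get("List", ("string",))` come back as two different descriptors, which is the property the paragraph above is describing.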

once you're running code, you should think about the distinction between value and reference types. remember C#/CLR have the value types (int, float, structs) which are or can be stored on stack and the reference types (classes, arrays, strings) which are always stored on heap and can be garbage collected. you should know what "boxing" is.
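In case boxing is new to you: it copies a value type into a fresh heap object so it can be handled through a reference, and unboxing copies it back out after a type check. Here is a toy Python model of that idea. `ToyHeap`, `BoxedValue`, `box` and `unbox` are all names made up for this sketch, and a "reference" is just a list index.

```python
import copy
from dataclasses import dataclass

@dataclass
class BoxedValue:
    """Heap object wrapping a copy of a value type."""
    type_name: str
    value: object

class ToyHeap:
    def __init__(self):
        self._objects = []

    def alloc(self, obj) -> int:
        self._objects.append(obj)
        return len(self._objects) - 1  # the "reference" is just an index

    def deref(self, ref: int):
        return self._objects[ref]

def box(heap: ToyHeap, type_name: str, value) -> int:
    # Boxing copies the value, so later changes to the original local
    # do not show up in the boxed object.
    return heap.alloc(BoxedValue(type_name, copy.copy(value)))

def unbox(heap: ToyHeap, ref: int, expected_type: str):
    obj = heap.deref(ref)
    if not isinstance(obj, BoxedValue) or obj.type_name != expected_type:
        raise TypeError(f"cannot unbox as {expected_type}")
    return copy.copy(obj.value)
```

The thing to notice is that only `box` allocates. Code that never boxes and never creates objects can run entirely on stack slots.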

since Main() is a static function, you can get flow control working with just the value types before you need a garbage collector at all!
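To make that concrete, here is a minimal stack-machine loop in Python that handles integer locals and branches with no heap at all. The mnemonics are borrowed from CIL (`ldc.i4`, `ldloc`, `stloc`, `add`, `br`, `blt`, `ret`), but the encoding is an assumption of this sketch: instructions are `(opcode, operand)` tuples and branch targets are instruction indexes, not the byte offsets real bytecode uses.

```python
def to_int32(x: int) -> int:
    """Wrap a Python int to signed 32-bit, the way int32 `add` overflows."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def run(code, n_locals):
    stack, local_vars, pc = [], [0] * n_locals, 0
    while True:
        op, arg = code[pc]
        pc += 1
        if op == "ldc.i4":
            stack.append(arg)
        elif op == "ldloc":
            stack.append(local_vars[arg])
        elif op == "stloc":
            local_vars[arg] = stack.pop()
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(to_int32(a + b))
        elif op == "br":
            pc = arg
        elif op == "blt":  # branch if value1 < value2 (value2 is on top)
            b, a = stack.pop(), stack.pop()
            if a < b:
                pc = arg
        elif op == "ret":
            return stack.pop() if stack else None
        else:
            raise NotImplementedError(op)

# Roughly: int sum = 0; for (int i = 0; i < 10; i++) sum += i; return sum;
SUM_TO_TEN = [
    ("ldc.i4", 0), ("stloc", 0),                              # sum = 0
    ("ldc.i4", 0), ("stloc", 1),                              # i = 0
    ("br", 13),                                               # jump to the loop test
    ("ldloc", 0), ("ldloc", 1), ("add", None), ("stloc", 0),  # sum += i
    ("ldloc", 1), ("ldc.i4", 1), ("add", None), ("stloc", 1), # i += 1
    ("ldloc", 1), ("ldc.i4", 10), ("blt", 5),                 # if i < 10, loop again
    ("ldloc", 0), ("ret", None),
]

print(run(SUM_TO_TEN, n_locals=2))  # prints 45
```

Nothing here allocates, so there is nothing to collect yet. The GC only becomes necessary once you add opcodes that create objects.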

as long as you don't load libraries (i.e., the BCL) i think implementing a bytecode executor will be a pretty straightforward vm impl. the problem with loading libraries is

  1. unexpected items in the standard library may be "deep"-- IE a class might import a class that imports a class* that imports a significant portion of the stdlib, or something in its import tree might contain one of the "weird" parts of C#, such as LINQ.
  2. a number of fairly basic classes that may be surprising to you at first are not written in pure C#, but call out to special hooks implemented in the VM. Worse, Mono, Microsoft CLR, and Unity compile-to-cpp each do this in a different way! Mono uses what it calls "icalls", which are a feature unique to mono. So, you may think "i'll just support strings" or "i'll just support System.IO.FileSystem" but then realize these are the hardest things to support. Make a decision about which parts of the standard lib you want to support, and don't go into this project assuming you'll be able to run general C# software instead of examples you wrote yourself (because general software probably imports arbitrary parts of the BCL). If you make it a goal to run one or more pieces of "real" C# software, don't be afraid to just re-implement parts of the BCL (such as string or io.filesystem) implementing only the subset of methods you feel like supporting, so that you can keep the import tree shallow.
  * A friend of mine wrote an independent CLR implementation called JSIL, and she talks about what she calls "The XML parser problem"-- which is that most import trees when you grab a particular class, if they're not totally trivial, will wind up importing an XML parser, and now suddenly you're building in half the standard library
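One way to "re-implement only the subset you feel like supporting" is a table of VM-provided methods that the interpreter consults before it ever tries to load the real BCL. The Python sketch below is an assumption about how you might structure that, not how mono's icalls actually work; the `Type::Method` key format and the names `ICALLS`, `icall` and `call_internal` are invented for illustration.

```python
# Registry of methods implemented by the VM itself rather than by C# code.
ICALLS = {}

def icall(qualified_name):
    """Decorator that registers a host function under a 'Type::Method' key."""
    def register(fn):
        ICALLS[qualified_name] = fn
        return fn
    return register

@icall("System.String::get_Length")
def string_get_length(this):
    # Caveat: .NET counts UTF-16 code units; Python's len counts code points.
    return len(this)

@icall("System.String::Concat")
def string_concat(a, b):
    return a + b

def call_internal(qualified_name, *args):
    try:
        fn = ICALLS[qualified_name]
    except KeyError:
        # Fail loudly so you find out exactly which part of the BCL a program needs.
        raise NotImplementedError(
            f"BCL method not supported by this VM: {qualified_name}") from None
    return fn(*args)
```

Failing loudly on an unknown method is the useful part: running a test program then tells you which methods it really touches, and you can decide one by one whether each is worth supporting.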

three things which may cause problems if you decide to implement them late without thinking about them early on:

  1. AppDomains
  2. Pinning
  3. Finalizers

AppDomains are the easiest to skip-- Unity AOT never supported these, I don't think. Finalizers, I THINK you can actually decline to run at all, ever, and still be a compliant C# VM! Pinning however is very tricky for several kinds of GC, and there is "normal" code-- particularly any code that does C interop-- that uses pinning. So maybe at least read up on pinning when you start writing your GC, and decide whether you're planning ahead for it.
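Here is a toy illustration of why pinning is awkward for a moving collector. It models a sliding compaction pass over live objects where pinned objects must keep their address, so everything else has to be packed around them. This is a simplified sketch with invented names (`Obj`, `compact`); it only rewrites addresses and does not copy memory or update references the way a real GC must.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    addr: int
    size: int
    pinned: bool = False

def compact(live_objects):
    """Slide unpinned objects toward address 0 without moving pinned ones.

    Assumes the input layout is valid (no overlapping objects).
    Returns the first free address after the last movable object.
    """
    pinned = sorted((o for o in live_objects if o.pinned), key=lambda o: o.addr)
    movable = sorted((o for o in live_objects if not o.pinned), key=lambda o: o.addr)
    cursor, next_pinned = 0, 0
    for obj in movable:
        # If the object would overlap the next pinned object, jump past it.
        while (next_pinned < len(pinned)
               and cursor + obj.size > pinned[next_pinned].addr):
            p = pinned[next_pinned]
            cursor = max(cursor, p.addr + p.size)
            next_pinned += 1
        obj.addr = cursor
        cursor += obj.size
    return cursor
```

For example, with `a` at 0 (size 2), a pinned `p` at 4 (size 2) and `b` at 8 (size 3), compaction leaves `a` at 0, `p` at 4, and moves `b` to 6. The two units between `a` and `p` stay wasted because `b` does not fit there, which is the kind of fragmentation pinning forces on you.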

Anyway good luck with this and I hope I'm not making things sound too complicated-- I'm focusing on the "weird parts" here on purpose to give you a warning of where you might get tripped up, so some of these things I'm talking about, you may encounter late or not at all.

And remember-- ECMA-334 and ECMA-335-- these documents are good and they are free to download.
