@JarrettBillingsley
Created October 5, 2012 03:22
Blah blah blah!
Motivation:
C has long been the king of "systems" languages, but it has problems. It is a relic of a past age, one where memory
was limited and programs were small, where there were as many computer architectures as there were companies that
made them, where complex program analysis was too computationally expensive to be practical.
C and its ABI have become the lingua franca of computing, for better or worse. The C ABI has become the virtual
machine which virtually all hardware now implements. However, the language is old, creaky, and burdened by decades
of legacy compatibility issues. Language development has slowed to a crawl due to a design-by-committee development
model, and compared to most of the other languages that are popular today, it feels very, very limited.
C has no real data abstraction capabilities besides "casting to a void pointer." It has virtually no functional
abstraction. It has no module system or namespacing mechanisms. Its compilation model is a carryover from the days
when programs were read from one tape reel and spit out on another. It is riddled with silly restrictions and
anachronisms to make it possible to implement compilers on computers with only a few kilobytes of memory. Much of
the language is under- or unspecified to allow for platform independence, but much of that implementation freedom
limits development in C because many things are now so widespread that it seems absurd to leave them undefined. Its
standard library is a joke. Its string representation alone has probably caused billions of dollars worth of
economic loss due to security vulnerabilities.
The computer hardware and software landscape is drastically different from the age of Kernighan and Ritchie. The
scope and scale of the programs being written today is far, far greater than it was then. The architectural zoo has
been whittled down to a small handful of practically-identical designs. We now live in a world where even embedded
processors give you a flat 32- or 64-bit memory space manipulated by a von Neumann CPU. Byte and word sizes have
standardized on powers of two, and integer arithmetic is always 2's complement. Everyone uses IEEE 754. Most CPUs,
even those in phones, have some kind of SIMD capabilities. Memory is plentiful. Processing is cheap and fast. High-
level languages are now widespread and practically useful instead of being confined to academia.
But all the software we use today is built on a shaky, outdated foundation. Don't we need something new?
C++ is dead. Lots of places use it because they need something that's C but not as awful and antiquated as C, but
they can't use higher level languages for performance or platform reasons. Its growth is also just as burdened by
red tape as C's. VMs such as the JVM and the CLR solve an entirely different set of problems. Languages like D and
Rust attempt to strike a balance between low-level and high-level concepts, but are limited in their use by reliance
on garbage collection or non-standard process models. What we need is a language that fills in the same niche as C,
but does so in a modern, intelligent way.
<the language> tries to do this. In many ways you can think of it as a subset of Rust, as it shares many features
and operational semantics with it. However <the language> is instead meant to do what C does: act as a relatively
thin layer over the bare hardware, allowing you to write the very lowest levels of your system in as efficient a
manner as possible.
Perhaps somewhat importantly, *<the language> does not try to be all things to all people.* It targets a specific
set of programming problems and tries to solve them well. The developers acknowledge that each language has
strengths and weaknesses, and that they should be used in concert to get the best possible software stack, rather
than trying to use one language to do everything.
Here are some goals:
* It should be compatible with the C ABI to the fullest extent possible. This means being able to call and be
called by C code in a completely seamless fashion. However, the language specification should be far more
concrete where C's is vague, to facilitate implementations and enable cross-platform compatibility of certain
techniques.
* It should have performance on par with that of raw C. You should be able to take any C code, convert it to
<this language> code, and the performance should be almost the same (or identical).
* Like C, the only code that runs is the code that you write. We have explicitly avoided operations that happen
"magically," such as most forms of operator overloading, and complex copying semantics. This makes it much
easier to reason about the function and performance of a given piece of code, even if it comes with some extra
typing.
* To go with the previous point, it should allow you to use more complex runtime functionality *but only if you
request it.* This covers things such as run-time type identification and stack tracing (e.g. for writing a GC),
debugging information, dynamic binding and so on.
* It should use modern programming language theory and analysis to help you write correct code the first time;
to allow you to avoid repeating yourself by providing various forms of data and process abstraction; and to
do all this while still maintaining a high level of performance.
And here are some non-goals:
* It does not try to solve memory management. This is the language that you use to *write* memory managers. That
being said, it does give you the tools to perform partly- or fully-automatic memory management if you so desire,
and many language features make it easier to write memory-correct code without any kind of memory management
assumptions.
* It does not try to solve multiprocessing. Again, this is the language that you use to *write* multiprocessing
systems. The process model is identical to C's. And once again, it does give you tools to write things more
safely if you so desire.
What is <the language> good for?
* OSes.
* Kernel modules (daemons, drivers, and the like).
* Runtime libraries.
* Low-level APIs.
* Compilers.
* Interpreters.
* Real-time or semi-real-time systems such as video games, firmware, etc.
* Native applications (for the platforms that support it).
What is <the language> not so good for?
* Scripting tasks.
* Hot-swappable/fault-tolerant/distributed systems. These are better handled by high-level languages like Erlang.
* "Safe" applications. Many user-level apps being developed today are being done on the web and on restricted
platforms using safe, platform-independent languages. These languages work fine for what they do.
Types:
Primitive:
void ("unit")
bool
i8/16/32/64/..
u8/16/32/64/..
int (alias for native signed int)
uint (alias for native unsigned int)
f32/64/(80/16/128?)/..
float (alias for native largest float.. do we ignore x87 80-bit floats?)
char (unicode codepoint, subtype of u32)
Derived:
[T * N] (array of N Ts)
[mut T * N] (mutable array of N Ts)
(T1, T2) (tuple)
struct
variant
enum
?T (shorthand for Option[T])
Pointers:
@T (pointer to T, much like C, arithmetic OK)
&T (borrowed pointer to T -- range/borrow checked, no arithmetic allowed)
@mut T (mutable pointer to T)
&mut T (mutable borrowed pointer to T)
Variable-sized (can only be used as elem type of pointer):
[T] (slice of T -- basically { ptr: $T, length: uint } where $ stands for pointer type before array)
[mut T] (mutable slice of T -- { ptr: $mut T, length: uint })
string (alias for { data: $[u8], length: uint }, where $ stands for pointer type before string)
string8 (alias for string)
string16 (alias for { data: $[u16], length: uint })
string32 (alias for $[u32])
Function pointers:
function(params..):return (universal funcptr, can only be called, not stored)
*function(params..):return (concrete funcptr, can be stored and called, implicitly convertible to universal)
&function(params..):return (borrowed funcptr, can only be called, implicitly convertible to universal)
Pointer types:
There are two kinds of pointers: regular pointers, written @T, and borrowed pointers, written &T. Borrowed pointers
have some similarities to reference types in C++ but have lifetimes tied to the stack which are statically checked
by the compiler. Regular pointers are very much like in C, but there is no "pointers are also arrays" bullshit.
Because there are two pointer types, there are also two address-of operators: @expr gets a regular pointer to expr,
and &expr gets a borrowed pointer to expr. There is only one dereference operator, like in C: *ptr. There is no ->
for accessing members of pointers; all member access is done with dot notation.
Generally speaking, @-pointers are used to point to the beginnings of things allocated on the heap, while &-pointers
are used for everything else: addresses of stack values, pointers into the insides of objects/arrays, and so on.
@-pointers can be implicitly converted to &-pointers, but the opposite direction is not possible. When writing
functions that take pointers, then, it's best to use &-pointers so that anything can be used.
Array and String types:
[T * N] and [mut T * N] have sizes known at compile time and can therefore be allocated in-place (on the stack or
inside composite types). You can also write [T * ?] as the type of a slot whose length should be inferred from its
initializer, like:
let arr: [int * ?] = [1, 2, 3, 4, 5]
This allocates an [int * 5] on the stack named arr.
Types whose size is not known at compile time (slices and strings) must always be preceded by a pointer sigil,
and the sigil immediately before the variable-size type is treated somewhat specially. A slot of type
@[int] is a sort of fat pointer, a tuple of data pointer and data length, and this tuple is passed around
by value. The sigil on the array type (in this case @) indicates what kind of pointer its data pointer is. Thus
@[int] really means { ptr: @int, length: uint }. If you then attach another sigil, it applies to the array reference
itself: &@[int] means &{ ptr: @int, length: uint }. The same goes for mutable slices and strings.
Known-size array types are implicitly convertible to slice types.
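For anyone who thinks in C, here is a rough sketch of what the fat pointer described above might look like. IntSlice, SLICE_OF, slice_get and slice_sum are names made up for this illustration and are not part of <the language>; the assert is only a stand-in for whatever range checking the compiler would do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of @[int]: a data pointer plus a length,
   passed around by value. */
typedef struct {
    const int *ptr;   /* the sigil on the slice type picks this pointer's kind */
    size_t length;    /* number of elements, not bytes */
} IntSlice;

/* Mimics the implicit [int * N] -> [int] conversion. Only valid on real
   arrays, not on pointers, because it relies on sizeof. */
#define SLICE_OF(arr) ((IntSlice){ (arr), sizeof(arr) / sizeof((arr)[0]) })

/* Element access with a run-time bounds check. */
static int slice_get(IntSlice s, size_t i) {
    assert(i < s.length);
    return s.ptr[i];
}

/* Takes the fat pointer by value, as a function taking a slice might. */
static int slice_sum(IntSlice s) {
    int total = 0;
    for (size_t i = 0; i < s.length; i++) {
        total += s.ptr[i];
    }
    return total;
}
```

The point of the sketch is that the length travels with the pointer, so the callee never needs a separate count argument.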
What type are string literals? Seems that they should be @string -- the data lives in ROM, and it should be possible
to treat them as &strings as well.
What about array literals that you want to construct on the heap (like @[1, 2, 3] in Rust)? That seems trickier..
there's no built-in "allocate" operator/keyword/whatever, as all memory management is handled by library functions.
Array literals could be made const-only and work like string literals, but that seems overly restrictive since many
times you want to use non-constant exprs. Maybe something like the way they're handled in D -- construct the array
on the stack and dup it to the heap, so you'd have function newArray[T](arr: &[T]): @[T] and call it like
newArray(&[1, 2, 3]). &[1, 2, 3] is like doing:
let tmp: [int * ?] = [1, 2, 3]
let val: &[int] = &tmp
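The "build on the stack, then dup to the heap" idea could look something like this in C. This is only a sketch: HeapIntSlice and new_int_array are hypothetical names, the element type is fixed to int instead of being generic, and malloc stands in for whatever library allocator would actually be used.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical C rendering of @[int] whose data lives on the heap. */
typedef struct {
    int *ptr;
    size_t length;
} HeapIntSlice;

/* Rough analogue of newArray[T](arr: &[T]): @[T] for T = int.
   Copies `length` ints out of a borrowed buffer into fresh heap storage.
   Returns { NULL, 0 } if the allocation fails. */
static HeapIntSlice new_int_array(const int *src, size_t length) {
    HeapIntSlice out = { NULL, 0 };
    assert(src != NULL || length == 0);

    int *copy = malloc(length * sizeof *copy);
    if (copy == NULL) {
        return out;
    }
    if (length > 0) {
        memcpy(copy, src, length * sizeof *copy);
    }
    out.ptr = copy;
    out.length = length;
    return out;
}
```

The caller owns the returned storage and has to free it, which matches the notes' position that memory management is left to library code.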
Tuples:
Basically just like struct types, without named fields and without alignment control. All elements of tuples are
always immutable.
Option Shorthand:
Option types are super useful and common, so why not have shorthand for them?
?T is shorthand for Option[T].
null is shorthand for Option.None.
T can be implicitly converted to ?T by wrapping it in an Option.Just.
"x is null" where x is a ?T is the same as writing "tag(x) == Option.None" (and same for !is).
some kind of shortcut for testing/using options that doesn't involve a match().. hmmm...
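As a point of comparison, this is roughly what ?int would have to be spelled out as in C. All names here (OptionInt, option_none, option_just, option_is_null, option_unwrap) are invented for the sketch, and the actual representation in <the language> is not specified by these notes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical C rendering of ?int, i.e. Option[int]. */
typedef enum { OPTION_NONE, OPTION_JUST } OptionTag;

typedef struct {
    OptionTag tag;
    int value;        /* meaningful only when tag == OPTION_JUST */
} OptionInt;

/* Plays the role of `null`. */
static OptionInt option_none(void) {
    return (OptionInt){ OPTION_NONE, 0 };
}

/* Plays the role of the implicit T -> ?T conversion (wrap in Option.Just). */
static OptionInt option_just(int v) {
    return (OptionInt){ OPTION_JUST, v };
}

/* Plays the role of "x is null", i.e. tag(x) == Option.None. */
static bool option_is_null(OptionInt x) {
    return x.tag == OPTION_NONE;
}

/* Extracts the payload; calling this on a null option is a bug. */
static int option_unwrap(OptionInt x) {
    assert(!option_is_null(x));
    return x.value;
}
```

Every one of these helpers is boilerplate that the ?T shorthand, null, and "is null" are meant to make unnecessary.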
Structs:
type S =
{
x: i16
y: i32
}
Structs are very straightforward. By default, they are ABI-compatible with C. You can, however, explicitly control
the alignment of the members within the struct.
By default each type has an "alignment," which means that values of that type always start at a multiple of their
alignment. For instance, i32 has an alignment of 4, meaning that any i32 value will always start at a multiple of 4
bytes from the beginning of the struct. In the above struct, x starts at offset 0 and y starts at offset 4. There
are two unused padding bytes between x and y. The total size of the struct is 8.
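Since structs are supposed to be ABI-compatible with C by default, the layout described above can be checked against the equivalent C struct. This sketch assumes a mainstream ABI where int32_t is 4-byte aligned; the static_asserts would fail to compile on a platform where that is not true.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C equivalent of: type S = { x: i16  y: i32 } */
struct S {
    int16_t x;
    int32_t y;
};

static_assert(offsetof(struct S, x) == 0, "x starts the struct");
static_assert(offsetof(struct S, y) == 4, "two padding bytes precede y");
static_assert(sizeof(struct S) == 8, "total size is 8 bytes");
```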
However you can override the default type alignments by using an "align" directive just inside the struct, like so:
type T =
{
align(2)
x: i16
y: i32
}
Now, x is at offset 0 and y is at offset 2, for a total struct size of 6. Normally you wouldn't want to do this, as
the types have the alignments they do for performance. On some architectures, loading or storing a 32-bit integer
at an address that isn't 4-byte aligned requires multiple loads, stores, and bit-shifting. However sometimes you
need to do this for interacting with certain APIs and hardware.
If you need absolutely precise control over the layout of a struct, you can use align(1) which will never auto-
insert padding, and place padding explicitly. For this, you can insert unnamed fields by naming them a single
underscore (used in many other places as a "throw-away" variable name):
type U =
{
align(1)
x: i16
_: [u8 * 3] // three bytes of padding
y: u8
}
This defines a structure with x at offset 0 and y at offset 5, for a total size of 6 bytes.
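The closest thing C has to the align directive is #pragma pack, which is a compiler extension (GCC, Clang and MSVC all accept the form used here) and not part of standard C. Assuming one of those compilers, the T and U layouts above can be reproduced and checked like this; pad_ is a made-up field name standing in for the unnamed `_` field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counterpart of align(2): cap every member's alignment at 2 bytes. */
#pragma pack(push, 2)
struct T {
    int16_t x;
    int32_t y;
};
#pragma pack(pop)

/* Counterpart of align(1): no automatic padding, so padding is spelled out. */
#pragma pack(push, 1)
struct U {
    int16_t x;
    uint8_t pad_[3];   /* three bytes of explicit padding */
    uint8_t y;
};
#pragma pack(pop)

static_assert(offsetof(struct T, y) == 2, "y follows x with no padding");
static_assert(sizeof(struct T) == 6, "T is 6 bytes");
static_assert(offsetof(struct U, y) == 5, "y sits after the explicit padding");
static_assert(sizeof(struct U) == 6, "U is 6 bytes");
```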
One last thing to mention is the size of the struct overall. Consider this:
type V =
{
d: f64
i: i32
}
d is at offset 0, and i is at offset 8, as you would expect, but the size of V is actually 16 bytes. Why? Because
the size of a structure is rounded up to the next multiple of the largest alignment among its members. Therefore
the compiler inserts four bytes of padding after the last member. Furthermore the struct's alignment will also be
set to that same largest alignment, so that no matter where the struct is allocated, all the members will be
properly aligned.
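The rounding rule can be written down as a one-line helper and checked against the C equivalent of V. This sketch assumes an ABI where double is 8-byte aligned (x86-64 and AArch64, for example); on 32-bit x86 System V the struct would be 12 bytes and the static_asserts would not compile. round_up is a made-up helper name.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* C equivalent of: type V = { d: f64  i: i32 } */
struct V {
    double  d;
    int32_t i;
};

/* Rounds n up to the next multiple of align (align must be non-zero). */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

static_assert(offsetof(struct V, i) == 8, "i follows d directly");
static_assert(sizeof(struct V) == 16, "four bytes of tail padding");
static_assert(alignof(struct V) == alignof(double), "struct takes the largest alignment");
```

The members occupy 12 bytes, and round_up(12, 8) gives the 16 that sizeof reports. The tail padding is what keeps the d of every element aligned when V is stored in an array.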
Variants:
type X = variant
{
Nullary
Unary(int)
Structy { x: int, y: int }
}
X is the variant type itself.
X.Nullary, X.Unary, and X.Structy are "tags" and can be used as comparison values against the result of tag().
X.Nullary can also be implicitly converted to an X when necessary (so you don't have to put parens on it).
Actually, the type X is shorthand for X<Nullary, Unary, Structy> -- that is, it's an X that can be any one of those
things. You can also do X<Nullary, Unary> or so to just allow a subset of X's constructors.
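For comparison, here is approximately what the variant X has to be written as by hand in C: a tag plus a union of payloads. XTag, x_tag and x_sum are hypothetical names, and nothing here says how <the language> would actually lay a variant out in memory.

```c
#include <assert.h>

/* One tag per constructor; compare against the result of x_tag(). */
typedef enum { X_NULLARY, X_UNARY, X_STRUCTY } XTag;

typedef struct {
    XTag tag;
    union {
        int unary;                      /* payload of Unary(int) */
        struct { int x, y; } structy;   /* payload of Structy { x, y } */
    } as;                               /* Nullary carries no payload */
} X;

/* Plays the role of tag(). */
static XTag x_tag(X v) {
    return v.tag;
}

/* What a match over all three constructors boils down to. */
static int x_sum(X v) {
    switch (v.tag) {
    case X_NULLARY: return 0;
    case X_UNARY:   return v.as.unary;
    case X_STRUCTY: return v.as.structy.x + v.as.structy.y;
    }
    assert(0 && "unknown tag");
    return 0;
}
```

Nothing in C stops code from reading as.unary while the tag says X_STRUCTY, which is the kind of mistake a real variant type is there to rule out.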
Often you will end up with situations in which all (or a subset) of the constructors have common fields, and it'd be
nice to access those common fields without having to destructure with a match. Sure, you could separate out the
common fields into a struct and embed the variant part inside it, but that can obscure your intent.
First, for any variant type T<Ctor1, Ctor2, .. CtorN>, if Ctor1 .. CtorN are struct-style, then any shared fields at
the beginning of those ctors can be accessed without destructuring, provided the fields have the exact same names
and types.
type ASTNode = variant
{
Int { line: int, value: int }
Var { line: int, name: string }
Add { line: int, lhs: ASTNode, rhs: ASTNode }
Sub { line: int, lhs: ASTNode, rhs: ASTNode }
}
alias BinOp = ASTNode<Add, Sub>
Since all the constructors of ASTNode have "line: int" as their first member, you can access a.line from any ASTNode
without having to match it. You can't access a.lhs however since only some of the constructors have it; but from a
BinOp, you can, since all the ctors in ASTNode<Add, Sub> have lhs (and rhs).
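C has a loosely related trick: structs inside a union that share a common initial sequence can have that shared prefix read through any of them. The sketch below uses it to mimic a.line on any ASTNode and a.lhs on a BinOp. Everything here is an assumption made for illustration: the names are invented, lhs and rhs are pointers because C cannot embed a type inside itself by value, and the assert in node_lhs is a run-time stand-in for what would be a compile-time check on ASTNode<Add, Sub>.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { NODE_INT, NODE_VAR, NODE_ADD, NODE_SUB } NodeTag;

typedef union ASTNode ASTNode;
union ASTNode {
    /* Every member starts with the same { tag, line } prefix. */
    struct { NodeTag tag; int line; } common;
    struct { NodeTag tag; int line; int value; } int_;
    struct { NodeTag tag; int line; const char *name; } var;
    struct { NodeTag tag; int line; ASTNode *lhs; ASTNode *rhs; } binop; /* Add and Sub */
};

/* Like a.line: valid for every constructor, no match needed. */
static int node_line(const ASTNode *n) {
    return n->common.line;
}

/* True for the constructors in the BinOp subset. */
static bool is_binop(const ASTNode *n) {
    return n->common.tag == NODE_ADD || n->common.tag == NODE_SUB;
}

/* Like a.lhs: only meaningful on the BinOp subset. */
static const ASTNode *node_lhs(const ASTNode *n) {
    assert(is_binop(n));
    return n->binop.lhs;
}
```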
This is fine and all, but keeping track of all the common fields like this is tedious work. It's also very common to
have subsets with their own common fields, as shown above. There are two bits of syntactic sugar to deal with this.
First is that any common struct fields can be placed within braces just inside the variant's braces:
type ASTNode = variant
{
{ line: int }
Int { value: int }
Var { name: string }
Add { lhs: ASTNode, rhs: ASTNode }
Sub { lhs: ASTNode, rhs: ASTNode }
}
What this does is automatically prepend { line: int } to the beginning of every constructor in the variant. If you
use this feature, you cannot have any tuple-style constructors.
Secondly, you can create subgroups of constructors like so:
type ASTNode = variant
{
{ line: int }
Int { value: int }
Var { name: string }
BinOp =
{
{ lhs: ASTNode, rhs: ASTNode }
Add {}
Sub {}
}
}
The syntax of a subgroup is a name, followed by the equals sign, and then what looks like the body of a variant. All
this does, however, is group together the contained constructors and create an alias. This last example is exactly
the same as the very first one.
For cases where you want to have overlapping categories, you can declare them with aliases manually.
Enums:
type Y = enum : int
{
A
B
C = 10
}
This works similarly to a C enum. No two names can have the same value. No implicit conversion between Y and int is
done. You can *explicitly* cast a Y to an int, but you must use Y_fromInteger to go the other direction (and that
can fail -- it returns ?Y). There are also Y_toString(Y): string and Y_fromString(string): ?Y methods auto-
generated.
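Here is a hand-written C sketch of what the auto-generated helpers might do. The function names come from the text above, but the signatures are an assumption: C has no ?Y, so the fallible ones return a bool and write through an out parameter. It also assumes A and B get the values 0 and 1, as they would in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { Y_A = 0, Y_B = 1, Y_C = 10 } Y;

/* Succeeds only for integers that name a declared member. */
static bool Y_fromInteger(int raw, Y *out) {
    assert(out != NULL);
    switch (raw) {
    case Y_A:
    case Y_B:
    case Y_C:
        *out = (Y)raw;
        return true;
    default:
        return false;   /* corresponds to returning null */
    }
}

/* Returns the member's name, or NULL for a value outside the enum. */
static const char *Y_toString(Y v) {
    switch (v) {
    case Y_A: return "A";
    case Y_B: return "B";
    case Y_C: return "C";
    }
    return NULL;
}

/* Succeeds only for strings that name a declared member. */
static bool Y_fromString(const char *s, Y *out) {
    assert(s != NULL && out != NULL);
    if (strcmp(s, "A") == 0) { *out = Y_A; return true; }
    if (strcmp(s, "B") == 0) { *out = Y_B; return true; }
    if (strcmp(s, "C") == 0) { *out = Y_C; return true; }
    return false;
}
```

Y_fromInteger has to be able to fail because most integers, 2 through 9 here for instance, do not correspond to any member.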
Flags:
Maybe? Seems like a useful type..
type F = enum(flags) : int
{
Forks
Knives
Spoons
}
Kinda like a C enum, but for declaring sets of integer flag values meant to be ORed together. A slot of type F can
only hold values from or composed of those from F. There are built-in F_toString(F): string and
F_fromString(string): ?F methods auto-generated. You can also explicitly cast an F to an integer, but you must
use F_fromInteger(meta(basetype, F)): ?F to go the other direction.
Maybe auto-generate F.None for no flags set and F.All for all flags set?
Difference between enums and flags is that an int enum slot will only ever have ONE of the values given, whereas
flags can have ZERO OR MORE of the values given. In order to get flag-like behavior with an int enum, you would have
to declare every possible combination as options, which quickly gets out of hand.
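In C, the flags idea is usually done by hand with one bit per member. The sketch below assumes each member of F gets the next power of two, which the notes above do not actually specify, and it uses an unsigned base type to keep the bit operations well-defined. F_has is a made-up helper, and F_fromInteger returns a bool with an out parameter because C has no ?F.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned F;

enum {
    F_None   = 0,
    F_Forks  = 1 << 0,
    F_Knives = 1 << 1,
    F_Spoons = 1 << 2,
    F_All    = F_Forks | F_Knives | F_Spoons
};

static_assert(F_All == 7, "three flags fill the low three bits");

/* Succeeds only if raw is composed entirely of declared flags. */
static bool F_fromInteger(unsigned raw, F *out) {
    assert(out != NULL);
    if ((raw & ~(unsigned)F_All) != 0) {
        return false;   /* a bit outside the declared set is present */
    }
    *out = raw;
    return true;
}

/* True if every bit of `flag` is set in `set`. */
static bool F_has(F set, F flag) {
    return (set & flag) == flag;
}
```

With three members there are already eight legal values, which is the combinatorial blow-up a plain enum would force you to list by hand.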
For both enums and flags, you can have a member like "Name alias OtherName", which makes both Name and OtherName
aliases for the same value. This is useful for stuff like:
type Alignment = enum
{
Left
Center alias Centre
Right
}
American programmers can use Alignment.Center and British/Canadian programmers can use Alignment.Centre. Wee!
DEFAULT VALUES (used, for instance, when initializing arrays):
all primitives default to 0 or the equivalent thereof (false, U+000000 etc.)
enums default to their FIRST CONSTRUCTOR (with any values in it default-inited)
Pointers/references *cannot be default-inited*