@cotto
Created June 27, 2011 02:12
M0 overlay thoughts
Probably the biggest question is what language we want to use. We'll be
spending a huge amount of time writing whatever it is. We need some criteria:
* how easy is the language to learn for people accustomed to C
* does the language have an object system? We'll be implementing cmop (6model)
in it, so the object system needs to be either self-hosting or non-existent
* how efficiently does the language map to M0? We want to generate efficient
M0 and to have a clear idea of what the M0 for a given snippet looks like
* it shouldn't allow things that don't make sense in M0 (not sure what this means)
* the language should allow CPS stuff, either directly or indirectly.
Indirectly is probably easier.
* need distinction between compile-time and runtime constructs
* need typed variables
* something struct-like
* a way to define M1 ops composed from M0 ops
* light syntax, easy to implement, no optimizations
* if we want macros, they need to be way smaller than C's
* function-like casts will make the language much easier to parse visually
* int* x, y; means two int*
* easy to parse
* question: INSP only or something more fine-grained?
*
We need some kind of overlay language for M0 that we can use to reimplement
Parrot. Writing poke_caller was hard and writing a factorial program will be
harder. Anything less trivial will require an actual compiler or code
generator. We have the following options:
1 (PIR/M0) emit M0 from PIR
2 (nqp/M0) emit M0 from nqp
3 (winxed/M0) emit M0 from winxed
4 (new-nqp/M0) write a new compiler targeting M0 using nqp
5 (new-custom/M0) write a new compiler targeting M0 using a separate toolchain
6 (steal/retarget) take someone's existing compiler, retarget it to M0
PIR/M0 has the advantage that we'll need to do something similar later anyway.
Being able to translate from PIR to M0 will be necessary if we want to continue
to support PIR, and we do. I'm not sure if we'll want to replace Parrot's
current C code with PIR, though. This approach is worth considering.
I don't like the idea of nqp/M0. nqp is already quite slow and I don't see it
being feasible to get a speed improvement by using it more internally. It
might be the case that we don't need to generate inefficient code to deal with
lvalue semantics if the translation is well-designed. There's also the concern
that using nqp almost universally means using a bunch of pir:: garbage, which
would make M0 translation less efficient. Overall it's a fairly nice language,
but I'm not certain that nqp/M0 is the best way forward.
winxed/M0 is a nice option. The compiler already exists and has an alternate
version (winxedxx) that targets C++. Unfortunately winxed isn't designed to
support multiple codegen backends, so we'd have to either refactor codegen into
a separate step (probably hurting performance) or just fork it and write a
new backend. The language itself is quite pleasant, but the compiler needs
work. I'm still not convinced that this is a bad way forward.
new-nqp/M0 brings with it the speed issues of nqp. nqp is very much designed
to support a highly flexible compilation workflow, so using it to generate M0
is a reasonable approach. I'm not a fan of the language's speed and quirks,
though. This approach could be made to work but it doesn't sound like the best
approach.
new-custom/M0 is almost included for the sake of completeness. I'm tired of
writing meta-things and want to get some real work done. Writing a new
compiler from scratch is decidedly non-lazy.
steal/retarget is a generalization of the winxed/M0 approach. Instead of
retargeting winxed, we'd take an unrelated compiler for some language (js comes
to mind) and target that at M0 using CPS for control flow. This has the
advantage that we're not writing (and debugging) a whole new compiler from
scratch, but it depends on us finding an appropriately-licensed compiler for a
suitable language and constructed in a modular fashion.
possible syntax for mole ("M0 Overlay LanguagE")
*******
*types*
*******
I propose that we have 5 types: INSP for registers and cs, which is a C-like
string. (This probably won't be exactly like a C-string, but close enough that
C code can use it if needed.)
registers: I, N, S, P
primitive string: cs
*****************
*constant values*
*****************
This describes what kind of constants can be used in mole code.
int: [1-9]\d*
float: ...
hex: 0[xX][0-9a-fA-F]+
octal: 0[0-7]+
string: "[...]" (with escapes)
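As a rough illustration of how these rules could drive a lexer, here is a
minimal Python sketch that classifies integer-like tokens. The names
classify_literal and literal_value are hypothetical, and the float and string
rules are left out because the notes above do not spell them out. One thing the
sketch makes visible: as written, a bare 0 matches none of the three rules.

```python
import re

# Patterns copied from the constant-value rules above. re.ASCII keeps \d to 0-9.
# The float rule is elided in the notes ("..."), so it is deliberately omitted.
LITERAL_PATTERNS = [
    ("hex", re.compile(r"0[xX][0-9a-fA-F]+", re.ASCII)),
    ("octal", re.compile(r"0[0-7]+", re.ASCII)),
    ("int", re.compile(r"[1-9]\d*", re.ASCII)),
]

# Numeric base used to convert each kind of literal.
BASES = {"hex": 16, "octal": 8, "int": 10}


def classify_literal(text):
    """Return the name of the first rule matching the whole token, or None."""
    for name, pattern in LITERAL_PATTERNS:
        if pattern.fullmatch(text):
            return name
    return None


def literal_value(text):
    """Convert a token to its integer value, or raise ValueError."""
    kind = classify_literal(text)
    if kind is None:
        raise ValueError(f"not an integer literal: {text!r}")
    return int(text, BASES[kind])


if __name__ == "__main__":
    for token in ["42", "0x1F", "017", "0"]:
        print(token, classify_literal(token))
```

If a bare 0 should be a legal constant, the int rule would need to become
something like 0|[1-9]\d*; whether that is intended is not stated here.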
************************
*compile-time constants*
************************
**********************
*working with strings*
**********************
Strings pretend to be 0-indexed. They actually also store their length and
encoding in an 8-byte header ahead of the character data. The length is stored
as a 4B int and the encoding is stored in one byte, with 3 unused bytes for
padding. The string for "hello, worlds?" (14 characters, so length 0xE) would
look as follows in memory. The top row shows physical offsets and the bottom
row shows the offsets that mole code sees:
0x0             0x4             0x8             0xC             0x10            0x14
---------------------------------------------------------------------------------------------
|0x0|0x0|0x0|0xE|0x0|0x0|0x0|0x1| h | e | l | l | o | , |   | w | o | r | l | d | s | ? |\0 |
---------------------------------------------------------------------------------------------
size            encoding        0x0             0x4             0x8             0xC
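To make the layout concrete, here is a small Python sketch that packs and
unpacks a string with this kind of header. Everything about it is an
assumption drawn from the description above: a 4-byte big-endian length, 3
padding bytes, then a 1-byte encoding id, followed by the data and a trailing
NUL. The function names and the meaning of encoding id 1 are made up for
illustration.

```python
import struct

# Assumed header: 4-byte big-endian length, 3 pad bytes, 1 encoding byte.
# Byte order and field order are guesses, not something M0 specifies here.
HEADER = struct.Struct(">I3xB")


def pack_mole_string(text, encoding_id=1):
    """Return header + ASCII data + trailing NUL for the given text."""
    data = text.encode("ascii")
    return HEADER.pack(len(data), encoding_id) + data + b"\x00"


def unpack_mole_string(buf):
    """Return (text, encoding_id) read back from a packed buffer."""
    length, encoding_id = HEADER.unpack_from(buf, 0)
    start = HEADER.size  # index 0 of the "pretend" string
    return buf[start:start + length].decode("ascii"), encoding_id


if __name__ == "__main__":
    packed = pack_mole_string("hello, worlds?")
    print(packed.hex(" "))
    print(unpack_mole_string(packed))
```

Keeping the trailing NUL means C code can be handed a pointer to offset 8 and
treat it as an ordinary C string, which seems to be the intent of "close
enough that C code can use it".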
*********
*structs*
*********
Structs may be defined as below. Once a struct is defined, it can be used
wherever any other type can be used. If a register is of a struct type, it is
assumed to point to a region of memory with the specified layout. Struct
members are accessed using the '->' notation, as in C. sizeof() can be used to
determine the number of bytes required by the struct. This is similar to C,
except that sizeof() is purely a compile-time construct and cannot be used to
calculate the length of an array.
struct {
I int_thingy;
N n_thingy;
} struct_thingy;
var I quux;
var struct_thingy st;
st = m0::sys_alloc sizeof(struct_thingy);
st->int_thingy = 39292934;
st->n_thingy = 332.66;
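Since sizeof() and '->' are meant to be resolved at compile time, the compiler
only needs a table of field offsets per struct. The Python sketch below shows
one way that could work. The 8-byte field sizes are purely an assumption (these
notes do not fix register widths), and ASSUMED_SIZES and layout() are
hypothetical names.

```python
# Assumed size in bytes of each register type; not specified in these notes.
ASSUMED_SIZES = {"I": 8, "N": 8, "S": 8, "P": 8}


def layout(fields):
    """Compute ({field_name: byte_offset}, total_size) for (type, name) pairs."""
    offsets = {}
    offset = 0
    for type_name, field_name in fields:
        offsets[field_name] = offset
        offset += ASSUMED_SIZES[type_name]
    return offsets, offset


# Mirrors the struct_thingy definition above.
struct_thingy = [("I", "int_thingy"), ("N", "n_thingy")]
offsets, size = layout(struct_thingy)

if __name__ == "__main__":
    print("sizeof(struct_thingy) =", size)
    print("offsets:", offsets)
```

With a table like this, st->n_thingy could compile down to a load or store at
a fixed offset from the pointer held in st, with no runtime lookup.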
********
*chunks*
********
Chunks are similar to functions. They have a constants table, a metadata table
and a bytecode segment. Values can be added to the constants table by
declaring a value with the keyword "const". Annotations may be added
automatically by the mole compiler and can also be added manually with the .ann
"key" "value" syntax.
chunk main (I a1, I a2, I a3) {
const I stdout 1;
const cs hello "ohai. im in ur m0";
// annotation for the right file will be added by m1 compiler
m0::print_i stdout, hello;
var I i_thingy;
i_thingy += a3++;
c::fprintf(stdout, "asdfw %d\n", i_thingy);
call_chunk "chunk_name", arg_array;
}
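For anyone thinking about the compiler side, a chunk could be modeled roughly
like the Python sketch below while it is being built. This is only a guess at
a convenient in-memory shape: the Chunk class, its method names, and the idea
of keying annotations by bytecode offset are all assumptions rather than part
of M0.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """In-memory model of a chunk: constants, metadata and bytecode."""
    name: str
    constants: list = field(default_factory=list)
    metadata: list = field(default_factory=list)
    bytecode: list = field(default_factory=list)

    def add_const(self, value):
        """Append a value to the constants table and return its index."""
        self.constants.append(value)
        return len(self.constants) - 1

    def annotate(self, key, value):
        """Record an annotation at the current bytecode offset (assumed keying)."""
        self.metadata.append((len(self.bytecode), key, value))

    def emit(self, op, *args):
        """Append one op with its operands to the bytecode segment."""
        self.bytecode.append((op, *args))


if __name__ == "__main__":
    main = Chunk("main")
    stdout = main.add_const(1)
    hello = main.add_const("ohai. im in ur m0")
    main.annotate("file", "example.mole")
    main.emit("print_i", stdout, hello)
    print(main)
```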
*********************
*calling conventions*
*********************
I don't know. There are a few options:
1) The first is that all calling conventions need to be dealt with explicitly.
This isn't nearly as bad as it'd be under M0 because of composed ops and it
would allow a very high degree of control without requiring the management of
all the minutiae of the calling conventions more than once.
2) The second option is to have a default set of calling conventions that are
used with a simple minimalist syntax, but to allow them to be overridden with
composed ops.
3) The third option is to say that only the builtins can be used for control
flow. For a very experimental language like mole, this approach is probably
insane.
**************
*composed ops*
**************
mole supports syntax to create composed ops which behave similarly to built-in
M0 ops. The syntax is similar to chunks with a few differences. Composed ops
are declared using the "composed" keyword and do not have return statements.
Any values that the composed op needs to modify should be passed as arguments.
Using a return statement in a composed op is a syntax error. Composed ops may
take an arbitrary number of arguments. Variables may be declared in composed
ops as in functions. Composed ops are similar to inlined functions in C.
composed init_cf(P new_cf, I retpc_label) {
alloc_cf:
I cf_size = 256;
I flags = 0;
new_cf = m0::gc_alloc cf_size, flags;
init_cf_copy:
new_cf[INTERP] = cf[INTERP];
new_cf[CHUNK] = cf[CHUNK];
new_cf[CONSTS] = cf[CONSTS];
new_cf[MDS] = cf[MDS];
new_cf[BCS] = cf[BCS];
new_cf[PCF] = cf[CF];
new_cf[CF] = new_cf;
init_cf_zero:
new_cf[EH] = 0;
new_cf[RETPC] = 0;
new_cf[SPILLCF] = 0;
RETPC = retpc_label;
init_cf_pc:
new_cf[PC] = post_set;
CF = new_cf;
post_set:
}
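Because composed ops behave like inlined functions, the compiler can expand
them at each call site by textual substitution. The Python sketch below shows
that idea under heavy assumptions: the body is a list of source lines,
parameters are replaced by the caller's arguments, and labels get a per-call
suffix so two expansions in one chunk don't collide. expand_composed and the
suffix scheme are hypothetical, and a real implementation would work on a
parsed tree rather than raw text (this version would also rewrite identifiers
inside string literals).

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def expand_composed(params, body, args, call_id):
    """Inline a composed op body: substitute args and uniquify labels."""
    if len(params) != len(args):
        raise ValueError("argument count mismatch")
    mapping = dict(zip(params, args))
    # A label is any line of the form "name:".
    labels = {line[:-1] for line in body if line.endswith(":")}

    def rename(match):
        word = match.group(0)
        if word in mapping:
            return mapping[word]
        if word in labels:
            return f"{word}_{call_id}"
        return word

    return [IDENTIFIER.sub(rename, line) for line in body]


if __name__ == "__main__":
    body = ["new_cf[PC] = post_set;", "CF = new_cf;", "post_set:"]
    for line in expand_composed(["new_cf"], body, ["P3"], 7):
        print(line)
```

Passing everything the op modifies as an argument, as the text above
requires, is what makes this kind of substitution sufficient: there is no
return value to thread back to the caller.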
@soh-cah-toa

I already mentioned these things to you, cotto, but I'm wondering what others may think as well so this is really directed at the other Parrot developers instead.

First, the syntax for allocating variables/registers may need to be distinguished. For instance:

struct {
    I int_thingy;
    N n_thingy;
} struct_thingy;

var I quux;                  # Point A
var struct_thingy st;    # Point B

At point A, it appears that you are declaring a named variable - like in PIR - that refers to an integer register. However, at point B, it looks like you are declaring a variable of type "struct_thingy". I say "variable" instead of "register" because there is no "struct_thingy" register. Does the struct syntax define a new register type or is it merely an alternative syntax for referring to a group of registers?

Next is the syntax for constants. Take the following declaration:

const I stdout 1;

The "const" statement creates a new entry in the symbol table. I'm wondering though what the 1 does in this statement. cotto says that it initializes the entry with that value. However, when you assign a value to anything else, like a register, you use the "=" operator. For the sake of consistency, I think it may be better to change the syntax to:

const I stdout = 1;

Next, the syntax for calling m0 opcodes is:

m0::foo bar, baz;

Would it be worth adding support for m0 blocks, much like Q:PIR {} in NQP? For instance,

m0 {
    print_i stdout, "Hello ";
    print_i stdout, "world!\n";
}

Lastly, there is the syntax for calling a chunk:

    call_chunk "chunk_name", arg_array;

This syntax kind of makes "chunk_name" look like a string rather than a function (chunk, whatever). This is something that's always annoyed me about the syntax for calling subroutines/methods in PIR. Why not keep it consistent with the way every other language uses for calling functions?

chunk_name(arg_array);

Or, if you wanted to be different, I think the alternate syntax that Perl 6 uses isn't too bad:

chunk_name: arg_array;

Give me your feedback.

@lucian1900

I really like the "fat" strings, with size information baked in. I'd also like similar features for arrays/pointers instead of raw pointer arithmetic, either classical fat pointers, or a memory view like Go's slices.

I think inline assembly is essential, I like soh-cah-toa's proposal.

Would it be desirable to rename composed ops to procedures and chunks to functions? That's vaguely what they're doing, and functions could easily be built on top of procedures. I don't know M0 very well, so there may be problems with this idea there.

@Benabik

Benabik commented Jul 4, 2011

call_chunk "chunk_name"

is a good idea iff chunk_name is actually a string. (Like how methods are looked up by name in Parrot today.)
