daokoder/The-path-toward-a-new-system-programming-language.md Secret

## The-path-toward-a-new-system-programming-language.md

      
    The path to a new system programming language!

By Limin Fu, 2014.10.22
The starting point would be the Dao programming language and its
implementation. The end point would be a new, statically
compiled system language that supports much of the current syntax
and features of Dao. With a frontend based on Dao and a backend
based on LLVM, it should be relatively easy to implement a compiler
for such a language.
Please note that this by no means makes the current Dao obsolete
or less important. Dao is for application programming, which is its
primary application domain, and that will not change or become less
important. The new language should also have excellent interoperability
with Dao, as well as with C. So if everything works out as expected, we
will have two languages with similar syntax and features, but at
different levels: one at a lower level for system programming, and the
other at a higher level for application programming. (Merging them would
not be possible because of their different technical requirements.)
For convenience, I will name this new language NeoDao for now.
Basing it on Dao means a large portion of the Dao implementation
could be reused, in particular, the parser, the type system, the
bytecode format and the virtual machine. It seems quite natural
that each source file of NeoDao should be compiled into a bytecode
file, which could then be further compiled into bitcode files and
linked into native executable files by a backend based on LLVM.
This way, it should be possible to build large projects very efficiently.
Of course, such bytecode should also be executable on the virtual
machine. So the first step would be expanding Dao with more
primitive data types for NeoDao. It will also be necessary to
remove some of the built-in types and move their implementation
from C to NeoDao. All of this must be done in a self-sufficient way,
such that the compiler and VM are able to compile and
execute NeoDao code without ever generating native code.
As a system language, NeoDao will need to support manual memory
management. It will use ownership to greatly simplify memory
deallocation. Ownership may not be able to handle all cyclically referenced
objects. This means manual intervention will be necessary, either
by breaking such references manually, or by deallocating the objects individually.
But optional garbage collection could be supported as a library.
A task-local garbage collector can be created, which can obtain
object references in order to manage those objects.
All memory errors, such as dereferencing a null pointer, double free,
freeing objects still in use, and leaking, must be caught at runtime.
The type system could also eliminate as many of them as possible. It is
not necessary to enhance the type system to do such things at first; such
abilities can be added later without any impact on the language and
the backend.
Concurrency in NeoDao should support both the shared-memory model
and message passing through channels, more or less as Dao does now.


Random ideas below:

###############################
class K
{
    var    m1 : list<int>  # object
    var  & m2 : list<int>  # reference, shared;
}
Bitwise copying
Owned variables are automatically deallocated when they go out
of scope.
"shared" variables are typed as "type&", and are handled in a similar way
to "var" and "invar" types. Values of variables with local scope
(lifetime) cannot be assigned to a shared field of a variable whose
lifetime exceeds that of those variables.
"shared" variables are reference counted. They are only
deallocated by their owner; freeing by others will only reduce
their reference count. When a shared variable is deallocated
by its owner, its reference count will be checked; if it is not zero,
a memory leak warning can be issued.
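The owner/non-owner rule above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the `Shared` struct, the `ReleaseResult` codes and the choice to leave the memory alive when a leak is detected are hypothetical and not part of any existing Dao code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical shared value: a refcount plus the address of its owner. */
typedef struct Shared {
    int   refcount;  /* references held by non-owners (assumption) */
    void *owner;     /* address of the owning variable (assumption) */
    int   payload;
} Shared;

typedef enum {
    RELEASE_KEPT,          /* a non-owner dropped its reference */
    RELEASE_FREED,         /* the owner freed the value */
    RELEASE_LEAK_WARNING   /* the owner released while others still refer */
} ReleaseResult;

Shared *shared_new(void *owner, int payload)
{
    Shared *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->refcount = 0;
    s->owner = owner;
    s->payload = payload;
    return s;
}

void shared_retain(Shared *s)
{
    assert(s != NULL);
    s->refcount++;
}

/* Release on behalf of `holder`. Only the owner may actually free. */
ReleaseResult shared_release(Shared *s, void *holder)
{
    assert(s != NULL);
    if (holder != s->owner) {
        assert(s->refcount > 0);
        s->refcount--;
        return RELEASE_KEPT;
    }
    if (s->refcount != 0) {
        /* Assumption: warn and keep the memory alive rather than free it. */
        fprintf(stderr, "warning: shared value still referenced (%d)\n",
                s->refcount);
        return RELEASE_LEAK_WARNING;
    }
    free(s);
    return RELEASE_FREED;
}
```

A caller would pass the address of the variable that holds the value as `holder`, so that the owner can be distinguished from every other reference holder.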
All non-primitive values have an additional reference counting
field that precedes the value. All values that are moved to a
variable of an inexact type, such as "any" or a variant type, will
be boxed with a type field preceding the refcount field (which is
added for primitive values).
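One possible reading of this boxed layout, as a minimal C sketch: the type field comes first, then the refcount, then the value bytes. The names `TypeInfo`, `Box`, `box_value` and `box_unbox` are made up for illustration, and the initial refcount of 1 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical static type descriptor. */
typedef struct TypeInfo {
    const char *name;
    size_t      size;
} TypeInfo;

static const TypeInfo INT_TYPE    = { "int",    sizeof(int) };
static const TypeInfo DOUBLE_TYPE = { "double", sizeof(double) };

/* Boxed layout: type field, then refcount, then the value itself. */
typedef struct Box {
    const TypeInfo *type;
    int             refcount;
    unsigned char   value[];   /* value bytes follow the header */
} Box;

/* Box a value of the given type by copying its bytes after the header. */
Box *box_value(const TypeInfo *type, const void *value)
{
    assert(type != NULL && value != NULL);
    Box *box = malloc(sizeof(Box) + type->size);
    if (box == NULL) return NULL;
    box->type = type;
    box->refcount = 1;   /* assumption: the boxing variable holds one count */
    memcpy(box->value, value, type->size);
    return box;
}

/* Copy the value out if the box holds the expected type; returns 1 on success. */
int box_unbox(const Box *box, const TypeInfo *expected, void *out)
{
    assert(box != NULL && expected != NULL && out != NULL);
    if (box->type != expected) return 0;
    memcpy(out, box->value, expected->size);
    return 1;
}
```

Because the type pointer is stored in the box, a variable of type "any" can check at runtime which concrete type it currently holds before unboxing.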
Primitive variables cannot be shared, as they do not support
reference counting.
Reference Boxing:

var object = K()    # An object, not a reference;
var & ref = object  # A reference;
In C language:
struct Object
{
    ...
};
struct RefData
{
    Object *object;
    Type   *type;     // statically created;
    void   *owner;    // owner address;
    int     refcount; // must be atomic;
};

enum RefType
{
    ORIGINAL,      // the one in ObjectBox
    NORMAL,        // copy of an original reference
    TRANSIENT,     // temporary reference for embedded object
    TRANSIENT_LOC, // copy of a local transient reference
    TRANSIENT_EXT  // copy of an external transient reference
};

struct Reference
{
    RefType  reftype;
    RefData *refdata;
};

struct ObjectHeader
{
    Reference ref;
    RefData   refdata;
};

// An individually allocated object will allocate the following struct
// to include extra data. A reference to the object will use
// the preallocated and initialized "ref" as the reference object.
// For transient references, an "ObjectHeader" will be created (on stack?),
// and its "ref" field will be used.
//
// When a reference is assigned to a variable, it will be copied
// with refcount in RefData increased.
// If the variable is on stack, reference can also be created on stack.
//
// refcount of a transient reference can be checked for illegal use
// of transient reference (hence an illegal use of the embedded object).
//
struct ObjectBox
{
    Reference ref;
    RefData   refdata;
    Object    object;
};
owner will be NULL for locally owned objects.
In other cases, it will be the address of its container.
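A minimal sketch of how these structs might be initialized and how a reference might be copied. The helper names `object_box_new` and `reference_copy` are hypothetical, `Object` is given a placeholder body, the refcount is assumed to count copies only (so it starts at zero), and mapping a copy of an ORIGINAL reference to NORMAL is a guess based on the enum comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct Object { int value; } Object;   /* placeholder body */
typedef struct Type Type;                       /* opaque, statically created */

typedef struct RefData {
    Object    *object;
    Type      *type;
    void      *owner;      /* NULL for locally owned objects */
    atomic_int refcount;   /* atomic, as the notes require */
} RefData;

typedef enum RefType {
    ORIGINAL, NORMAL, TRANSIENT, TRANSIENT_LOC, TRANSIENT_EXT
} RefType;

typedef struct Reference {
    RefType  reftype;
    RefData *refdata;
} Reference;

typedef struct ObjectBox {
    Reference ref;
    RefData   refdata;
    Object    object;
} ObjectBox;

/* Allocate an object together with its preinitialized original reference. */
ObjectBox *object_box_new(Type *type, void *owner)
{
    ObjectBox *box = malloc(sizeof *box);
    if (box == NULL) return NULL;
    box->object.value = 0;
    box->refdata.object = &box->object;
    box->refdata.type = type;
    box->refdata.owner = owner;
    atomic_init(&box->refdata.refcount, 0);  /* assumption: counts copies only */
    box->ref.reftype = ORIGINAL;
    box->ref.refdata = &box->refdata;
    return box;
}

/* Copy a reference: the shared RefData gets its refcount increased. */
Reference reference_copy(const Reference *src)
{
    Reference copy;
    assert(src != NULL && src->refdata != NULL);
    atomic_fetch_add(&src->refdata->refcount, 1);
    copy.reftype = (src->reftype == ORIGINAL) ? NORMAL : src->reftype;
    copy.refdata = src->refdata;
    return copy;
}
```

Since a copied `Reference` is just two words, it can live on the stack when the variable holding it is a local, as the comments above suggest.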
Ownership

class Klass
{
    var value = 123
}
class Klass2
{
    var & value : Klass;
}

routine Test( & param: Klass2 )
{
    var stackObject = Klass()
    # Allocation on stack;
    # Local variable stackObject has the ownership;
    #
    # DVM_CALL : ...; mode = INIT_STACK;
    
    var & heapObject = Klass()
    # Allocation on heap;
    # Local reference variable has the ownership;
    #
    # DVM_CALL : ...; mode = INIT_HEAP;
    # DVM_MOVE : ...; mode = OBJ_TO_REF;
    
    var stackObject2 = heapObject;
    # Bitwise copying;
    #
    # DVM_MOVE : ...; mode = REF_TO_OBJ;
    
    param.value = stackObject;
    # Valid; assigning an object to a reference type
    # means automatically boxing the reference;
    # But the object and the reference are local;
    # param.value should be reset before exiting the scope;
    # Otherwise a runtime error will be issued;
    #
    # DVM_MOVE : ...; mode = OBJ_TO_REF;

    param.value = heapObject;
    # Valid.
    # The reference is local, but the object is not;
    #
    # DVM_MOVE : ...; mode = REF_TO_REF;

    param.value = Klass();
    # It should also allocate on heap,
    # because "param.value" expects a non stack object reference;
    #
    # DVM_CALL : ...; mode = INIT_HEAP;
    # DVM_MOVE : ...; mode = OBJ_TO_REF;
    
    # When exiting the scope:
    # Destructors of stack objects will be called;
    # Local references will have their reference counts decreased;
    #
    # If an object pointer indicates that the object was
    # allocated on stack, and the reference count did not
    # reduce to zero, a runtime error will be issued.
}
After an object is created, it will be automatically owned
by any explicit variable which comes to hold a reference of
the object. To change the ownership, use := assignment.
When a new reference is assigned to a variable that already holds a
reference, the existing reference will be unreferenced: its
ownership is checked, and the object is deleted if it is owned
by that variable and has zero refcount.
Transient Reference

A transient reference is a reference referring to an object that
is embedded in another object (or is allocated on stack?).
Such a reference is only for temporary local use.
Ordinary references are used for individually allocated objects.
Using a transient reference is no different from using an ordinary
reference, and programmers can be completely oblivious to
the differences.
Individually allocated objects could allocate additional space for
reference handling, which makes catching memory errors possible
when deallocating them. But this is not possible with objects
that are embedded in other objects. Transient references are employed
to ensure local and temporary use of such objects as references,
which will ensure that such objects are not referenced anywhere
when they are deallocated along with their host objects.
class One
{
...
}
class Two
{
    var value = 123
    var object = One()
}
routine Test( & one: One )
{
    # Do something with one;
    
    # The compiler will add code to handle all the variables
    # that will go out of scope.
    # The code will check transient references to ensure that
    # such references are not held by any other variables
    # that may survive beyond the current scope.
}

var two = Two()
Test( two.object )  # Transient reference;

routine Test2()
{
    var two = Two()
    var & one = two.object  # Transient reference;
    Test( one )
}
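The scope-exit check on transient references could look roughly like the following C sketch. It assumes a stack-allocated header with a plain counter; `TransientRef` and its helpers are hypothetical names, and `test2_scope` only loosely mirrors `Test2()` above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct One { int data; } One;
typedef struct Two { int value; One object; } Two;

/* Hypothetical header created on the stack for a transient reference. */
typedef struct TransientRef {
    One *target;    /* the embedded object being referred to */
    int  refcount;  /* copies of this reference still held elsewhere */
} TransientRef;

TransientRef transient_begin(One *embedded)
{
    TransientRef t = { embedded, 0 };
    return t;
}

void transient_copy(TransientRef *t)
{
    assert(t != NULL);
    t->refcount++;
}

void transient_drop(TransientRef *t)
{
    assert(t != NULL && t->refcount > 0);
    t->refcount--;
}

/* Scope-exit check: true when no copy survives beyond the scope. */
bool transient_end(const TransientRef *t)
{
    assert(t != NULL);
    return t->refcount == 0;
}

/* Loosely mirrors Test2(): `leak_copy` simulates a callee keeping its copy. */
bool test2_scope(bool leak_copy)
{
    Two two = { 123, { 0 } };
    TransientRef one = transient_begin(&two.object);
    transient_copy(&one);                   /* reference passed to a callee */
    if (!leak_copy) transient_drop(&one);   /* callee released its copy */
    return transient_end(&one);             /* false would be a runtime error */
}
```

In a real implementation the failing case would raise the runtime error described above instead of returning `false`.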


Deallocation

class Node
{
    var value = 123
    var next : Node|none = none
    
    routine delete(){
        next = none
    }
}

if (1) {
    var node = Node()  # Allocation on stack;
    node.next = node
}
# The parser will generate DVM_REMOVE for variables that
# have gone out of scope:
# DVM_REMOVE: <node>, 0, 0;
# The inferencer will expand DVM_REMOVE into:
# DVM_FINALIZE: <node>, 0, 0; Invoke destructor;
# DVM_DEALLOCA: ...;

if (1) {
    var & node = Node()  # Allocation on heap;
    node.next = node
}
# DVM_UNREF: <node>, 0, 0; Un-reference;
# DVM_OWNER: 
# DVM_TEST: ownership
# DVM_FINALIZE: <node>, 0, 0; Invoke destructor;
# DVM_TEST: refcount
# DVM_DELETE:
# When UNREF is executed, the ownership of <node>
# will be checked. If it is locally owned:
# 1. the memory of <node> will be freed if its refcount has become zero;
# 2. otherwise, a memory leak error will be raised.
More syntaxes

var a: int[3] = { 1, 2, 3 }  # built-in list type;
var a         = { 1, 2, 3 }  # int[3];

class Array<@T>
{
    routine Array<@T>( invar a: @T[] );  # Invocable with an initializer list;
}
var array: Array<float> = { 1.0F, 2, 3 }

var t = [ 1, "abc" ]  # tuple?
var t: [int,string]

class Node
{
    var value = 123
    var next: Node|none = none
}

routine Func()
{
    var & node = Node()
    node.next = node
    
    node.next = none
    # Breaking the cycle;
    # Or:
    delete node
}
More instructions

Opcode and operand types should become unsigned int.
typedef  unsigned int  opcode_t;
typedef  unsigned int  operand_t;

struct DaoVmCode
{
    opcode_t   code;
    operand_t  a, b, c;
};
DVM_ALLOCA : size_high_bits, size_low_bits, handle;
DVM_ALLOC  : size_high_bits, size_low_bits, handle;
DVM_DELETE : handle, 0, 0;
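As an illustration of the `size_high_bits` / `size_low_bits` operands, the sketch below splits a 64-bit allocation size across two 32-bit operands. It assumes `unsigned int` is 32 bits wide (written as `uint32_t` here); the opcode value and the helper names `encode_alloc` and `decode_alloc_size` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t opcode_t;    /* assumes unsigned int is 32 bits wide */
typedef uint32_t operand_t;

typedef struct DaoVmCode {
    opcode_t  code;
    operand_t a, b, c;
} DaoVmCode;

enum { DVM_ALLOC = 1 };       /* hypothetical opcode value */

/* DVM_ALLOC : size_high_bits, size_low_bits, handle; */
DaoVmCode encode_alloc(uint64_t size, operand_t handle)
{
    DaoVmCode vmc;
    vmc.code = DVM_ALLOC;
    vmc.a = (operand_t)(size >> 32);           /* high 32 bits of the size */
    vmc.b = (operand_t)(size & 0xFFFFFFFFu);   /* low 32 bits of the size */
    vmc.c = handle;
    return vmc;
}

/* Reassemble the 64-bit size from the two operands. */
uint64_t decode_alloc_size(const DaoVmCode *vmc)
{
    assert(vmc != NULL);
    return ((uint64_t)vmc->a << 32) | vmc->b;
}
```

With this encoding, a single instruction can describe allocations larger than 4 GiB even though each operand is only 32 bits wide.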

Some Notes:

Automatic return will be disabled.