rikkimax/DIP-type-state-analysis.md

## DIP-type-state-analysis.md

      
    Raw
  

              DIP-type-state-analysis.md
            
          
    Type State Analysis

Analysing the type state that an expression is in, enables providing guarantees to the callee about the object to prevent errornous logic using static analysis.
To enable this we introduce a change to identifiers in the context of variable declarations to allow the use of medial character as part of it. This is a UAX31 compliant change in profile, although its usage is novel.
As part of this is a framework being developed to enable new memory analysis techniques to be described and applied to guarantee temporal safety within @safe code.
Changes

Identifier Changes

The following change in identifier profile is compliant to the Unicode Annex 31 (UAX31).
For this we take the nominal rule for an Identifier, then specify the Medial rule.
- Identifier = Start Continue*
+ IdentifierPrimary = Start Continue*
+ IdentifierSecondary = Medial Continue+
+ Identifier = IdentifierPrimary IdentifierSecondary*
Start = XID_Start | '_'
Continue = XID_Continue
+ Medial = '\''
At this stage it is important to understand that what is considered to be an identifier to all language rules apart from variable declarations (including function parameters), includes only the prior primary component of Start and Continue rules.
The Medial Continue rules produce a second token, a modifier upon the identifier that identify the input and output type states of the value stored within the variable.
So when an identifier first'second'third is seen, three tokens will be produced.

first which is the identifier
second a modifier for starting type state
third a modifier for the end type state

Specifying a type state on a variable is only required at a function declaration, type or field, anywhere else should not require it. However it should be supported.
Variable Declarations

For succinctness of this proposal we consider variable declarations to also contain function parameters, the this pointer and return value of a function.
To specify the this pointer type states, or return values type state when returned provide them as part of function parameters like so:
Obj func(this'initialized, Obj var'initialized, return'initialized) {
    Obj var'initialized;
}
These are ignored as far as the parameter list is concerned, they are extracted during parsing and may appear anywhere in the parameter list once.
SafeD

Introduction of type state analysis provides new guarantees beyond what a single function receives currently with @safe.
This has the consequence that any @safe function that has type state analysis applied to it cannot call a function that does not have type state analysis even if it is @safe.
Internally to the compiler this will require splitting @safe based upon the edition it is found in.
Newer can go into older, but not older into newer.
The behavior of @system functions remains unchanged from its behavior today.
However one type state will be verified in @system functions as this is built into the language, this is reachable.
The reachability of variables is of important note, as it is used in other memory analysis solutions such as isolated.
In @trusted functions the variable declarations will behave similarly to @system functions, their type state will only be enforced for reachable, and any variable declarations that track known values.
Both @trusted and @safe share the same behavior for function signatures.
Each variable has the default type state of initialized except for the this pointer which has a minimum type state of nonnull.
All function signatures have type state information associated with it.
This applies to both new and old editions.
However old editions will not type check against this.
It is not possible in a @safe outer function to take to a pointer to a @safe nested function.
Doing so would prevent the ability to do analysis on when called, and there may not be enough benefit to enabling a limited capability of this.
Type States

The indevidual type states for a given type are an ordered sequence of integers with identifiers for names. For each type apart from the base offerings are user supplied.
The base offerings of type states that the language knows how to reason with is unreachable, reachable, initialized, default, and nonnull.

Unrechable: cannot be accessed, to do so from this point on is an error.
Reachable: can be accessed, but could be uninitialized.
Initialized: has been initialized to something.
Default: has been initialized to the default value for a type (0, NaN, null).
Non-null: when the type is a pointer type and has an allocated object stored in it. This could be via a Garbage Collector new, or is a pointer into Read Only Memory (ROM).

After the base offering is the user supplied type states.
The full order is as follows: unreachable < reachable < initialized < default < nonnull < userdefined.
The highest type state that can be guaranteed for a given pathway should always be picked when determining what type state a variable is in.
If there are multiple pathways possible (such as an if statement), the lowest is picked out of each path way.
Corrections may need to be injected at the end of a pathway if it does not match the alternatives for the purpose of cleanup.
The type states reachable and initialized may be written to but not read.
unreachable may not be read or written to. All others may be read or written to.
If the third type state is not provided by the identifier, then it will be the second.
If the second type state is not provided by the identifier, it is the default.
User Supplied

To supply type states that a struct, class or union may be in, use the typestate keyword to open a block that provides an identifier with its runtime verification function.
struct File {
    typestate {
	    open => isOpen,
    }
}
Sometimes its useful to provide a friendlier name to a predefined or user type state.
To do this you may use = instead of =>.
Neither are required.
struct File {
    typestate {
	    closed = initialized,
    }
}
It may be desirable in some cases to disable a predefined type state.
Using @disable will work here.
struct File {
    typestate {
    @disable
	    default,
    }
}
When the default type state is disabled, it is not possible to have a variable that has been default initialized.
This does have a side effect of allowing a struct to have a constructor that does not have any parameters in its parameter list.
As it will require initialization by the means of an explicit function call to the constructor.
Expressions

To get an integer that represents a given type state use the following:
PrimaryExpression:
+	typestate Identifier
+	typestate ( Type ) Identifier
If no type is provided, it assumes typeof(this).
The typestate keyword may be used as a type that will map to an integer type of appropriete size.
An example of this utilizing a user defined type state function:
struct Foo {
	typestate {
		open,
	}

	typestate opTypeState() {
		return typestate open;
	}
}
If a user supplied opTypeState method is not defined, a generated one will be created.
It will go from largest value type state to the lowest in checks based upon the check function supplied in the typestate block.
All user supplied type states must have a check function if a type state function is generated.
To test an expression against a type state, use the is expression.
IsExpression:
+	is typestate Identifier
+	is typestate < Identifier
+	is typestate > Identifier
+	is typestate >= Identifier
+	is typestate <= Identifier
It does not need a type to be provided, as it will be taken based upon the left hand side of the binary expression.
Semantic 4

To perform the verification of the guarantees of type state analysis a new semantic pass is introduced that runs only on the edition that introduced it or newer.
This pass will start by constructing by function an Intermediate Representation (IR) that simplifies language constructs into a form that is reasoning capable for many analysis solutions.
The IR will need to understand scopes, with multiple scopes being in list for a given language construct.
An example of this is the if statement, each path (true and false), with each path having its own scope.
Some of the IR entries will be:

A list of scopes
A load from a variable and then store into another variable
A store given a set of properties (such as type state)
Load from a variable, that validates against a function parameter
Variable type state change

Some additional attibutes may be required to be stored on an entry, for example a load may take a pointer to the variable or have a context variable.
The state a variable is in depends upon at which point in the entries it is in.
It is dynamic that will change based upon the analysis being done.
An example of this is before a variable is declared it is in unreachable state, but once declared it is in reachable state.
The opposite of this may also be true, a prior analysis pass may inject a type state change to make a reachable variable into an unreachable one.
Nested Functions

A nested function is represented in the IR by two different means.

If outer or inner are not both @safe, the inner is treated as any other function.
If outer and inner are both @safe, the function body scope is used as part of a scope IR entry in between the function arg/param and the return value store.
The inner function will not be processed as normal.

The scope will need to prevent mutation passes occuring more than once.
Goto's

There are two kinds of goto's those that got forward, and those that go backwards.
Those that go backwards when it is encountered, can save the state of variables, restart from label and it either errors or when it hits that goto again (ignoring all previous gotos) take the lowest type state per variable. Restarting back at label, and ignoring that goto next time.
For those that go forward an almost opposite approach is taken.
Find the scope call node contained within the most parent scope that contains both it and the label.
Save the type state for the label into a list.
Restart from the previous scope call node that was found, using the current type states, ignore the goto.
Compare the type states once it reaches label, and use the lowest values for each variable.
Continue on.
Returning

A return statement is treated as a backwards goto.
The ending type state must match the stated type state of all function parameters and return value.
Exceptions

Exceptions are described in one of two ways.
The first is an exception has a matching catch within the IR that is already active.
This will be treated as a forward goto.
The second is an exception that does not have a matching catch within the IR that is already active.
This implies that it is a return and the return value is ignored as reachable.
Value Tracking

When memory is allocated, the value is tracked throughout the graph.
If an object is move'd into the function (so the old value is not valid) for a function parameter or return value, it is tracked.
There may be other reasons to track a value stored in variables other than the above two, provided by other memory analysis solutions.
When tracked, any variable that has the value has type state analysis performed upon it in @trusted code.
When merging multiple states and looking for the lowest value, the variable must either still have the value or become unreachable.
Examples

Writing to null

The following is an example from a Rust project that aims to introduce program insecurity in fully safe code.
This too can be done in D using @safe.
This DIP introduces type state analysis to give the compiler the ability at compile time determine what your assumptions are for nullabilty of variables which prevent this particular bug.
T* makeNull(T)() @safe {
    return null;
}

void useNull() @safe {
    int* var = makeNull!int();
    // var is in type state initialized as per makeNull return state
    
    *var = 42;
    // segfault due to var being null
}
What we want to happen instead:
T* makeNull(T)(/* return'initialized */) @safe {
    return null;
    // type state default is more than the type state initialized
    // so it is accepted
}

void useNull() @safe {
    int* var = makeNull!int();
    // var is in type state initialized as per MakeNull return state
    
    // perform load via var variable
    // this will error due to initialized is less than the nonnull type state
    // Error: Variable var is in type state initialized which could be null, cannot write to it
    *var = 42;
}
To fix, simply check for null!
void useNull() @safe {
    int* var = makeNull!int();
    // var is in type state initialized as per MakeNull return state

    if (var !is null) {
        // in scope, assume var is in type state nonnull
        *var = 42;
    }
}
Breaking Changes

This proposal does not break existing code that is annotated with an already released edition.
It does prevent copying old @safe code and expecting it to work without modification into a newer edition.
This is a major limitation of this proposal, as the goal is to eventually solve temporal safety and prevent some serious memory safety issues this pain is worth adapting to.