Analysing the type state that an expression is in, enables providing guarantees to the callee about the object to prevent errornous logic using static analysis.
To enable this we introduce a change to identifiers in the context of variable declarations to allow the use of medial character as part of it. This is a UAX31 compliant change in profile, although its usage is novel.
As part of this is a framework being developed to enable new memory analysis techniques to be described and applied to guarantee temporal safety within @safe
code.
The following change in identifier profile is compliant to the Unicode Annex 31 (UAX31).
For this we take the nominal rule for an Identifier
, then specify the Medial
rule.
- Identifier = Start Continue*
+ IdentifierPrimary = Start Continue*
+ IdentifierSecondary = Medial Continue+
+ Identifier = IdentifierPrimary IdentifierSecondary*
Start = XID_Start | '_'
Continue = XID_Continue
+ Medial = '\''
At this stage it is important to understand that what is considered to be an identifier to all language rules apart from variable declarations (including function parameters), includes only the prior primary component of Start
and Continue
rules.
The Medial
Continue
rules produce a second token, a modifier upon the identifier that identify the input and output type states of the value stored within the variable.
So when an identifier first'second'third
is seen, three tokens will be produced.
first
which is the identifiersecond
a modifier for starting type statethird
a modifier for the end type state
Specifying a type state on a variable is only required at a function declaration, type or field, anywhere else should not require it. However it should be supported.
For succinctness of this proposal we consider variable declarations to also contain function parameters, the this pointer and return value of a function.
To specify the this pointer type states, or return values type state when returned provide them as part of function parameters like so:
Obj func(this'initialized, Obj var'initialized, return'initialized) {
Obj var'initialized;
}
These are ignored as far as the parameter list is concerned, they are extracted during parsing and may appear anywhere in the parameter list once.
Introduction of type state analysis provides new guarantees beyond what a single function receives currently with @safe
.
This has the consequence that any @safe
function that has type state analysis applied to it cannot call a function that does not have type state analysis even if it is @safe
.
Internally to the compiler this will require splitting @safe
based upon the edition it is found in.
Newer can go into older, but not older into newer.
The behavior of @system
functions remains unchanged from its behavior today.
However one type state will be verified in @system
functions as this is built into the language, this is reachable
.
The reachability of variables is of important note, as it is used in other memory analysis solutions such as isolated.
In @trusted
functions the variable declarations will behave similarly to @system
functions, their type state will only be enforced for reachable
, and any variable declarations that track known values.
Both @trusted
and @safe
share the same behavior for function signatures.
Each variable has the default type state of initialized except for the this pointer which has a minimum type state of nonnull
.
All function signatures have type state information associated with it. This applies to both new and old editions. However old editions will not type check against this.
It is not possible in a @safe
outer function to take to a pointer to a @safe
nested function.
Doing so would prevent the ability to do analysis on when called, and there may not be enough benefit to enabling a limited capability of this.
The indevidual type states for a given type are an ordered sequence of integers with identifiers for names. For each type apart from the base offerings are user supplied.
The base offerings of type states that the language knows how to reason with is unreachable
, reachable
, initialized
, default
, and nonnull
.
- Unrechable: cannot be accessed, to do so from this point on is an error.
- Reachable: can be accessed, but could be uninitialized.
- Initialized: has been initialized to something.
- Default: has been initialized to the default value for a type (
0
,NaN
,null
). - Non-null: when the type is a pointer type and has an allocated object stored in it. This could be via a Garbage Collector
new
, or is a pointer into Read Only Memory (ROM).
After the base offering is the user supplied type states.
The full order is as follows: unreachable < reachable < initialized < default < nonnull < userdefined
.
The highest type state that can be guaranteed for a given pathway should always be picked when determining what type state a variable is in. If there are multiple pathways possible (such as an if statement), the lowest is picked out of each path way. Corrections may need to be injected at the end of a pathway if it does not match the alternatives for the purpose of cleanup.
The type states reachable
and initialized
may be written to but not read.
unreachable
may not be read or written to. All others may be read or written to.
If the third type state is not provided by the identifier, then it will be the second. If the second type state is not provided by the identifier, it is the default.
To supply type states that a struct
, class
or union
may be in, use the typestate
keyword to open a block that provides an identifier with its runtime verification function.
struct File {
typestate {
open => isOpen,
}
}
Sometimes its useful to provide a friendlier name to a predefined or user type state.
To do this you may use =
instead of =>
.
Neither are required.
struct File {
typestate {
closed = initialized,
}
}
It may be desirable in some cases to disable a predefined type state.
Using @disable
will work here.
struct File {
typestate {
@disable
default,
}
}
When the default
type state is disabled, it is not possible to have a variable that has been default initialized.
This does have a side effect of allowing a struct to have a constructor that does not have any parameters in its parameter list.
As it will require initialization by the means of an explicit function call to the constructor.
To get an integer that represents a given type state use the following:
PrimaryExpression:
+ typestate Identifier
+ typestate ( Type ) Identifier
If no type is provided, it assumes typeof(this)
.
The typestate
keyword may be used as a type that will map to an integer type of appropriete size.
An example of this utilizing a user defined type state function:
struct Foo {
typestate {
open,
}
typestate opTypeState() {
return typestate open;
}
}
If a user supplied opTypeState
method is not defined, a generated one will be created.
It will go from largest value type state to the lowest in checks based upon the check function supplied in the typestate
block.
All user supplied type states must have a check function if a type state function is generated.
To test an expression against a type state, use the is expression.
IsExpression:
+ is typestate Identifier
+ is typestate < Identifier
+ is typestate > Identifier
+ is typestate >= Identifier
+ is typestate <= Identifier
It does not need a type to be provided, as it will be taken based upon the left hand side of the binary expression.
To perform the verification of the guarantees of type state analysis a new semantic pass is introduced that runs only on the edition that introduced it or newer. This pass will start by constructing by function an Intermediate Representation (IR) that simplifies language constructs into a form that is reasoning capable for many analysis solutions.
The IR will need to understand scopes, with multiple scopes being in list for a given language construct. An example of this is the if statement, each path (true and false), with each path having its own scope.
Some of the IR entries will be:
- A list of scopes
- A load from a variable and then store into another variable
- A store given a set of properties (such as type state)
- Load from a variable, that validates against a function parameter
- Variable type state change
Some additional attibutes may be required to be stored on an entry, for example a load may take a pointer to the variable or have a context variable.
The state a variable is in depends upon at which point in the entries it is in. It is dynamic that will change based upon the analysis being done.
An example of this is before a variable is declared it is in unreachable state, but once declared it is in reachable state.
The opposite of this may also be true, a prior analysis pass may inject a type state change to make a reachable variable into an unreachable one.
A nested function is represented in the IR by two different means.
- If outer or inner are not both
@safe
, the inner is treated as any other function. - If outer and inner are both
@safe
, the function body scope is used as part of a scope IR entry in between the function arg/param and the return value store. The inner function will not be processed as normal.
The scope will need to prevent mutation passes occuring more than once.
There are two kinds of goto's those that got forward, and those that go backwards.
Those that go backwards when it is encountered, can save the state of variables, restart from label and it either errors or when it hits that goto again (ignoring all previous gotos) take the lowest type state per variable. Restarting back at label, and ignoring that goto next time.
For those that go forward an almost opposite approach is taken. Find the scope call node contained within the most parent scope that contains both it and the label. Save the type state for the label into a list. Restart from the previous scope call node that was found, using the current type states, ignore the goto. Compare the type states once it reaches label, and use the lowest values for each variable. Continue on.
A return statement is treated as a backwards goto. The ending type state must match the stated type state of all function parameters and return value.
Exceptions are described in one of two ways.
The first is an exception has a matching catch within the IR that is already active. This will be treated as a forward goto.
The second is an exception that does not have a matching catch within the IR that is already active. This implies that it is a return and the return value is ignored as reachable.
When memory is allocated, the value is tracked throughout the graph.
If an object is move'd into the function (so the old value is not valid) for a function parameter or return value, it is tracked.
There may be other reasons to track a value stored in variables other than the above two, provided by other memory analysis solutions.
When tracked, any variable that has the value has type state analysis performed upon it in @trusted
code.
When merging multiple states and looking for the lowest value, the variable must either still have the value or become unreachable.
The following is an example from a Rust project that aims to introduce program insecurity in fully safe code.
This too can be done in D using @safe
.
This DIP introduces type state analysis to give the compiler the ability at compile time determine what your assumptions are for nullabilty of variables which prevent this particular bug.
T* makeNull(T)() @safe {
return null;
}
void useNull() @safe {
int* var = makeNull!int();
// var is in type state initialized as per makeNull return state
*var = 42;
// segfault due to var being null
}
What we want to happen instead:
T* makeNull(T)(/* return'initialized */) @safe {
return null;
// type state default is more than the type state initialized
// so it is accepted
}
void useNull() @safe {
int* var = makeNull!int();
// var is in type state initialized as per MakeNull return state
// perform load via var variable
// this will error due to initialized is less than the nonnull type state
// Error: Variable var is in type state initialized which could be null, cannot write to it
*var = 42;
}
To fix, simply check for null!
void useNull() @safe {
int* var = makeNull!int();
// var is in type state initialized as per MakeNull return state
if (var !is null) {
// in scope, assume var is in type state nonnull
*var = 42;
}
}
This proposal does not break existing code that is annotated with an already released edition.
It does prevent copying old @safe
code and expecting it to work without modification into a newer edition.
This is a major limitation of this proposal, as the goal is to eventually solve temporal safety and prevent some serious memory safety issues this pain is worth adapting to.