rikkimax/DIP-escape-analysis.md

## DIP-escape-analysis.md

      
    Raw
  

              DIP-escape-analysis.md
            
          
    Escape Analysis


Field
Value


DIP:
(number/id -- assigned by DIP Manager)


Author:
Richard (Rikki) Andrew Cattermole firstname@lastname.co.nz


Implementation:
(links to implementation PR if any)


Status:
Draft


Abstract

The movement of pointers within a program graph easily escapes known points that own that memory within the call stack of a given thread. This logical error can result in program termination or undetected corruption of the program state. This proposal is a new attempt at preventing this corruption within the @safe code.
Contents


Rationale
Prior Work
Description
Breaking Changes and Deprecations
Reference
Copyright & License
Reviews

Rationale

In a review of the existing escape analysis solution implemented in D's reference compiler DIP1000, there is one major limitation of what it models and assumption growth to facilitate functionality.
The implementation of DIP1000 models a single output variable per function, this is the return value or if void the first parameter (could be the this pointer). In practice functions typically have more than one output, this includes mutable pointers in, ref and out function parameters.
int* /* output */ func();

struct S {
	int* /* output */ method1();
	void method2() /* output */;
}
The relationship between parameters is modelled using the return ref and return scope attributes. These communicate to the compiler the varying input and how it relates to the output for that parameter.
Needing two different attributes to determine the relationship status between parameters has been highly incommunicable to experienced programmers.
Due to it not being able to model multiple outputs, a lot of typical D code cannot be safely represented using DIP1000. The design does not protect you from extending past the modelled subset of the language.
To resolve both of these core issues in the existing design, an escape set must be modelled per parameter. While this resolves the callee's side, it does not protect the caller from misusing the callee.  The design DIP1000 attempts to solve this by modelling the relationship between parameters using the two different attributes.
Another solution to this problem is to utilize the information provided by escapes and inverse it, given an output and given the inputs that form it, protect the inputs so that nothing can invalidate the output. This resulted in the proposal that was @live, an opt-in analysis that does not communicate to either the callee or caller any guarantees cross-function, making it functionally irrelevant to the guarantees of DIP1000.
An opt-in solution to ownership does not allow for reference counting to occur safely. To safely do this, the referenced counted owner must be pinned and made effectively read-only so that both a reference to it and the borrowed resource may be passed around. This was a blocker determined by Walter and Timon for adding reference counting to the language.
Furthermore without the entry point to escape analysis having analysis associated with it, there is no differentiation of what can constitute of a safe to borrow from source and what can't be. An example of this is with a global, in the case of a variable thread local storage, it is possible in fully @safe code with DIP1000 turned on to cause a segfault.
import std;

int* tlsGlobal;

@safe:

void main() {
    tlsGlobal = new int(2);
    assert(*tlsGlobal == 2);
    
    toCall(tlsGlobal);
    assert(*tlsGlobal == 2); // Segfault
}

void toCall(ref int* data) {
    data.destroy;
}
Prior Work

Escape analysis as a subject matter is primarily an analysis of graphs. How they are mutated and who owns them at what places. Modelling this can be an expensive set of operations in the form of data flow analysis. For sufficient and best experience, a full program analysis is needed with a full graph of manipulation and logic therein analysed.
Native programming languages do not align themselves to full program analysis, due to the separate compilation model. D is a native programming language that uses this model almost exclusively. For this reason, it cannot use a full program analysis and full program graph analysis for modelling escaping. Instead, a flattened view of the graph must be representable inside a function signature.
At the time of this proposal, a solution for escape analysis has been implemented in the D reference compiler that is commonly referred to by its DIP number, DIP1000. This does not cover memory ownership guarantees, instead @live as an opt-in attribute enables some localized to the given function guarantees.
In Rust ownership is a transfer based system, so that only one variable has any ownership of memory. In contrast to D, where this is modelled and attempting to enforce this would not match how garbage collected memory would be used. Further guarantees are given, in that when a borrow occurs from an owner, only one mutable borrow is allowed in a given scope. This complements the ownership transfer system as it guarantees nobody else has the potential for aliasing.
Description

In this proposal a new attribute is introduced that will be inferred under normal usage for non-virtual functions that have a body. This attribute will describe the escape set per parameter of a function to allow modelling what inputs are required to be valid as long as an output is.
The types this proposal analyses the movement of can be categorized by them being some sort of pointer:

Raw pointers: T*
Slices: T[]
Associative arrays: T[U]
Any struct with pointer fields

The main focus of movement is on variables within a function. Unfortunately, the modelling of pointer movement within global variables and thread-local storage is not possible. They can appear anywhere within the call stack, possibly multiple times. This disables the owner to escape analysis from providing guarantees to both the caller and callee.
The primary concern for globals and thread-local storage is the act of taking a pointer to the variable. As the underlying memory would not be modelled this is not a safe operation and therefore is disallowed in @safe.
shared int global;

void func() @safe {
	auto ptr = &global; // Error: Variable `global` is a global and has an unmodeled memory backing it, it is not safe to take a pointer to it
}
Some memory should always be able to be escaped, currently only immutable qualified types would be allowed to do so. Future expansion into reference counting and ownership transfer schemes would be highly desirable as they share this trait.
Escape Set

The escape set modelled on function parameters is provided by the attribute @escapevia with an optional set described using identifiers. The set is on the input, that contributes to the listed output parameter.
AtAttribute:
+    @ EscapeViaAttribute

ParameterAttributes:
+    @ EscapeViaAttribute

+ EscapeViaAttribute:
+    escapevia ( Identifiers )
+    escapevia ( )
+    escapevia
An example of the syntax addition:
struct S {
	int* field;

	T func(@escapevia U input1, @escapevia(return) V input2) @escapevia(return);
}
Valid arguments in the @escapevia(...) set are:

return
this, also applies to the context pointer of a delegate.
__unknown, for exceptions, and globals.
__parameters, all function parameters.
Any function parameter names.

Without the bracketed list in the @escapevia annotation it defaults to @escapevia(return, this, __unknown, __parameters). When the annotation is missing, this will indicate that it is to be inferred.
alias D = T delegate(@escapevia U input1, @escapevia(return) V input2) @escapevia(return);

T freeFunction(@escapevia U input1, @escapevia(return) V input2);

class C {
	T method(@escapevia U input1, @escapevia(return) V input2) @escapevia(return);
}
It is the programmer's responsibility to annotate virtual (function pointers and delegates) and @trusted functions accordingly as their body will not be validated if available. Methods with be both validated and inferred.
Overriden methods in classes must have an escape set per parameter that is less than or equal to the parent method's set.
class Parent {
	int* method() @escapevia(return);
}

class Child : Parent {
	override int* method() @escapevia(return, throw); // Error: the escape set for the `this` pointer on `method` must be equal or lesser than the parent which is `return` not `return, throw`
}
The new semantic analysis capability that performs analysis upon the program graph with inferring must be able to be disabled using a switch similar to --disable-memorysafety. Otherwise, it is turned on by default for any module for a given edition and above, any edition below this will have inferring but no validation with implicit casts for the @escapevia attribute.
Functions that are @trusted have their function signatures inferred for escapes, but will not error within the body or when the body does not match the signature. For @safe functions these are inferred but will error within the body and when the signature does not match the body. Lastly @system functions will not be analysed for escapes and any annotation of escapes upon its signature will be ignored.
If a function without a body is not fully annotated for escapes but is marked as @safe or @trusted it is downgraded to @system, this only applies to the edition and above that applies this proposal not below it. This includes function pointers and delegates.
int* adder(int* a, @escapevia(return) int* b) @safe;
Is actually:
int* adder(int* a, int* b) @system;
If @escapevia attribute were to be added to mangling, the mangleof property value must not be accessible during compilation as the result will not reflect its inferred value. Giving the value returned from the property a symbol similar to the trait __traits(initSymbol, sym) would work around this issue as that will not be accessible during compilation.
Escape Analysis

The compiler tracks potential escape destinations for a given variable by considering assignments, function calls and its scope.
An example of two scopes, whereupon assignment resets the escape set of an inner variable:
int* outer;

{
	int* inner = ...;
	outer = inner;
	// @escapevia(outer) inner
	
	// converge `outer` with any owners of `inner` lifetimes
	inner = ...;
	// @escapevia() inner
}
An example of convergence at the end of multiple scopes:
int* func(int* input) {
	if (input is null) {
		return new int;
		// @escapevia() input
	} else {
		return input;
		// @escapevia(return) input
	}
	// @escapevia(return) input
}
This does not offer any guarantees by itself. Guarantees are provided by pinning variables as not being able to be escaped. This is analysed by owner escape analysis.
Elements in an array, fields in a class/struct/union are conflated with the variable that stores them in and may not function as an owner. The variable that stores it would be able to function as an owner in its place.
struct S {
	int* field;
}

void func(S* owner) @safe {
	int* borrow = owner.field;
	// variable `owner` is protected from mutation
}
This is important for the escape set, as the inverse is also true. You do not escape one variable into a field or an element of an array, you escape it into the variable containing it.
struct S {
	int* field;
}

void handle(int* ptr) {
	S s;
	s.field = ptr;
	// @escapevia(s) ptr
}
Root Variables

A root variable is any variable that has an empty escape set that cannot grow in size when convergence has been completed. This is provided by the user by annotating with the scope attribute upon a variable.
In the following example, a root variable demonstrates how escaping works, using a garbage collecter memory allocation, that can be replaced with malloc and it would function the same:
int* escape;

{
	// Is a root that cannot be escaped from.
	scope owner = new int;
	// @escapevia() owner
	
	ref aborrow = owner;
	// @escapevia(aborrow) owner
	
	escape = owner; // Error: `owner` cannot be escaped into a variable with a longer lifetime `escape`
	// @escapevia(aborrow, borrow) owner
}
Taking a pointer to another variable results in a scope pointer:
int stack;
/*scope*/ int* var = &stack;
The point of convergence matters for lifetime analysis. It occurs like regular function destructor cleanup. It happens in reverse order of the declarations. This has consequences, it allows a pointer that cannot grow in its escape set, grow during its scope, but not at end.
struct S {
    int* field;
}

int* acquire(ref S s) @safe {
    return s.field;
}

void caller() @safe {
    int x = 2;
    S s = S(&x);
    
    *acquire(s) = 3;
}
Is equivalent to:
struct S {
    int* field;
    
    this(@escapevia(this) int* field) @safe {
        this.field = field;
    }
}

int* acquire(@escapevia(return) ref S s) @safe {
    return s.field;
}

void caller() @safe {
    int x = 2;

    scope xPtr = &x;
    // @escapevia(xPtr) x, escape set cannot grow

    S s = S(xPtr);
    // @escapevia(s) xPtr
    
    int* fooReturn = acquire(s);
    // @escapevia(fooReturn) s
    
    *fooReturn = 3;

    __cleanup(fooReturn);
    // @escapevia() s
    __cleanup(s);
    // @escapevia() xPtr
    __cleanup(xPtr);
    // @escapevia() x
    __cleanup(x);
    // x escape set is empty, therefore ok
}
Owner Escape Analysis

To offer guarantees, there are two things required, owners and borrows from owners. An owner is any root variable, or what an output from a function call had been contributed to by the arguments.
Roots:
scope int* owner = ...;
/*scope*/ int* borrowed = owner;
Function calls:
int* func(@escapevia(return) int* input) {
	return input;
}

int* owner = ...;
int* borrowed = func(owner);
Borrowed memory is only ever valid, as long as the owners are not mutated. Mutation of the owners could unmap the borrowed memory, or change it in such a way that the program becomes corrupted. When a borrow is seen, the compiler protects the owner from mutation by requiring it be "effectively const" as long as borrows exist. It cannot be assigned to, or be passed to methods or functions mutably.
struct Top {
	int* field;
}

void func(ref Top owner) @safe {
	int* field = owner.field;
	// owner is now effectively const, it cannot be mutated
	
	owner = Top.init; // Error: The variable `owner` has a borrow and cannot be mutated
	owner.field = null; // Error: The variable `owner` has a borrow and cannot be mutated

	if (field !is null) {
		writeln(*field);
		*field = 2; // ok, fully mutable
	}
}
Mutable and non-mutable function calls are checked:
struct S {
	int field;

@safe:

	bool isNull() const {
		return false;
	}

	void makeNull() {
	}
}

S s;
int* field = &s.field;

writeln(s.isNull); // ok
s.makeNull(); // Error: Variable `s` has a borrow and may not be mutated by calling `makeNull`.
If a pointer has a known runtime lifetime it may be escaped. Currently, no features in D offer this description and therefore it is only noted as an expansion point.
Removal of Existing Language Elements

The language design elements that are being removed are DIP1000 and @live. Together these attempted to do this proposal but in a non-integrated way that has shown minimal adoption.
Attribute:
-   return

AtAttribute:
-    @ live

FuncAttr:
- FuncAttrReturn
- FuncAttrLive

- FuncAttrReturn:
-	Nj

- FuncAttrLive:
-	Nm
No timeline is specified for removal.
Breaking Changes and Deprecations

Any code that uses DIP1000 syntax will break. This can be gradually deprecated and phased out with preferential treatment to any new syntax.
Any new semantic analysis would only cause errors to be applied to a new edition and would not affect the base D2 language.
Reference


Shape Analysis (type state & memory escapes)
@system variables DIP
# Compositional pointer and escape analysis for Java programs
4.1.  What is Ownership?
4.2.  References and Borrowing

Copyright & License

Copyright (c) 2024 by the D Language Foundation
Licensed under Creative Commons Zero 1.0
History

The DIP Manager will supplement this section with links to forum discussions and a summary of the formal assessment.
Field	Value
DIP:	(number/id -- assigned by DIP Manager)
Author:	Richard (Rikki) Andrew Cattermole firstname@lastname.co.nz
Implementation:	(links to implementation PR if any)
Status:	Draft