Lecture Notes for 21025 Compiler Construction by Olivier Flückiger
A compiler is a function f : S -> T
-- a function from source language to target language.
It is correct iff s = f(s), where the equality has to be extensional (see
[Fe] chapter 5.6 for an introduction to observational equivalence).
Therefore every transformation our compiler does has to be correct. Since transformations are based on the results of analyses, those analyses have to be sound.
As compiler writers we are always looking for stronger facts about a program. With more information we can produce better error messages for the user, and we can write compilers which output faster code.
Last week we saw SSA, an IR which encodes data dependencies. It transforms variables into a data dependency graph (there is a particular IR style called Sea of Nodes [Cl] which extends on the idea and also understands control flow as a data dependency problem).
But it's not all we want to know about a function. There are many things a compiler is interested in. Just some examples:
- Which variables point to the same object. (points-to analysis, alias analysis)
- Any kind of property about data in a function. (dataflow analyses like value range analysis, liveness, reaching definitions)
- Interprocedural properties. (control flow analysis (CFA), call graph construction)
- Are the type annotations consistent. (type checking, type inference)
Data types can be understood as sets. For example the type int is the set of all
numbers from the smallest to the biggest representable integer. The type Point (int, int)
is the set of all points on a discrete 2D plane.
A type system is supposed to rule out type errors in a program. For example passing a floating point number to a function expecting an integer should be caught.
Depending on the language some type errors can be found statically. For example consider the following program in Java:
1 ColoredPoint x;
2 x = new Point();
3 x.color;
The assignment on line 2 should be rejected because we assign a value of a bigger type to a reference of a smaller type. Note how the access on line 3 does not make sense for a point without a color.
Other errors are checked dynamically. Consider:
1 ColoredPoint[] cpa = new ColoredPoint[10];
2 Point[] pa = cpa;
3 assert(pa[0] == null);
4 pa[0] = new Point();
In Java arrays are (by design) covariant. A reference to a ColoredPoint[]
can be passed as a Point[]. Storing a Point into a ColoredPoint[] is
forbidden. The error occurs at runtime: this program type checks statically, but
a type error (an ArrayStoreException) is thrown at line 4. (Note: this is still
sound because the static loophole is accounted for at runtime.)
The static type is a conservative approximation of the values seen at runtime. We call it conservative because it errs on the side of safety. For example the following (accidentally) correctly behaving program is ruled out:
1 ColoredPoint x;
2 if (false) x = new Point();
Conservative also implies that the approximation is correct (i.e. no type error goes unreported). If a type system can give this guarantee we call it sound.
Type soundness theorem (see [Wr]):
if e has type t, then one of three things happens
1. eval(e) diverges
2. eval(e) = v where v is a value of type t
3. eval(e) throws a (known) exception
For example in C it is possible to violate type soundness:
float f = 3;
int* i = (int*)&f;
printf("%d\n", *i); // Won't print 3
Note: Soundness and static types are orthogonal. A dynamically typed language
is a special case with one possible static type t -- the type that contains
all others. For example in Smalltalk it's Object.
Note: only sound types can be used to optimize a program.
Typing rules are usually written as natural deductions (Gerhard Gentzen). (It's not an accident. See [Wa] for a connection between types and logical propositions). There exists a formal type system of MiniJava [Pa]. Let's look at a simplified version.
We ignore classes and have methods with one argument, one local variable and one statement in the body.
A = (x₁ ↦ t₁, x₂ ↦ t₂)
A ⊢ s A ⊢ e : t
---------------------------------------------
⊢ public t name (t₁ x₁) {t₂ x₂; s; return e }
Rules for some expressions and statements:
A ⊢ e : t₂
(assign) ----------- A(x) = t₁, t₂ ≤ t₁
A ⊢ x = e
(lookup) --------- A(x) = t
A ⊢ x : t
(lit) -----------
⊢ num : int
A ⊢ e₁ : int A ⊢ e₂ : int
(exp-<) ----------------------------
A ⊢ e₁ < e₂ : bool
Let's type check the following program:
public bool test (int i) {
int j;
j = 2;
return i < j;
}
A possible deduction is

A := (i ↦ int, j ↦ int)

A ⊢ 2 : int   int ≤ int        A ⊢ i : int   A ⊢ j : int
-----------------------        --------------------------
       A ⊢ j = 2                    A ⊢ i < j : bool
----------------------------------------------------------
⊢ public bool test (int i) { int j; j = 2; return i < j; }
For a more complete version covering the full Java language see Featherweight Java [Ig].
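The rules above can be sketched as a tiny checker. This is a minimal sketch with our own class and method names, not the MiniJava reference implementation; types are represented as strings and subtyping is reduced to equality plus one assumed ColoredPoint ≤ Point edge for the earlier examples.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of the typing rules above (hypothetical helper names).
class Checker {
    // t2 <= t1: reflexivity plus one assumed subtyping edge.
    static boolean subtype(String t2, String t1) {
        return t2.equals(t1) || (t2.equals("ColoredPoint") && t1.equals("Point"));
    }

    // (lookup): A |- x : A(x)
    static String lookup(Map<String, String> A, String x) {
        return A.get(x);
    }

    // (assign): A |- x = e  holds if  A |- e : t2 and t2 <= A(x)
    static boolean assign(Map<String, String> A, String x, String t2) {
        return subtype(t2, A.get(x));
    }

    // (exp-<): A |- e1 < e2 : bool  if both operands are int
    static String less(String t1, String t2) {
        if (t1.equals("int") && t2.equals("int")) return "bool";
        throw new IllegalStateException("type error");
    }

    public static void main(String[] args) {
        // Check: public bool test (int i) { int j; j = 2; return i < j; }
        Map<String, String> A = new HashMap<>();
        A.put("i", "int");  // argument
        A.put("j", "int");  // local
        boolean stmtOk = assign(A, "j", "int");                // j = 2; (lit) gives int
        String retType = less(lookup(A, "i"), lookup(A, "j")); // i < j
        System.out.println(stmtOk && retType.equals("bool"));  // prints true
    }
}
```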
In a compiler we might want to know more properties of a program than just types. For example in
int array[10]
int index = compute()
array[index]
is the index guaranteed to be between 0 and 9?
We know from Rice's theorem that any non-trivial semantic property of a program is undecidable. However, as we have seen with types, we can find conservative approximations for certain properties.
One of the most generic and fundamental ways of approximating the behavior of a program is called abstract interpretation [Co]. As the original paper states, a program denotes computations in some universe of objects. Abstract interpretation of programs consists of using that denotation to describe computations in another universe of abstract objects, so that the results of abstract execution give some information on the actual computations.
Note: For other example driven explanations of abstract interpretation see [Ai].
First some terminology:
A partially ordered set is a set S and a binary relation ≤ on S such that
- a ≤ a (reflexivity)
- if a ≤ b and b ≤ a, then a = b (antisymmetry)
- if a ≤ b and b ≤ c, then a ≤ c (transitivity)
Note: elements do not have to be comparable.
A semilattice S is a partially ordered set such that for all elements x and y of S,
the least upper bound of x and y exists.
Alternatively we define a join operation v
such that
- x v (y v z) = (x v y) v z (associativity)
- x v y = y v x (commutativity)
- x v x = x (idempotency)
x v y is called the supremum (least upper bound) of x and y.
We similarly define meet ∧ to be the greatest lower bound (i.e. the infimum).
A lattice is a partially ordered set with meet and join.
A complete lattice is a lattice where every subset has a supremum and infimum.
It follows that there exists a supremum and infimum of the whole lattice. We call them
top (⊤) and bottom (⊥). ⊥ is the identity element for v and ⊤ the identity for ∧
(i.e. x v ⊥ = x and x ∧ ⊤ = x).
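The definitions can be checked on a concrete instance. A sketch (class and method names are ours) using the powerset of {a, b} ordered by inclusion, where join is union, meet is intersection, ⊥ = ∅ and ⊤ = {a, b}:

```java
import java.util.HashSet;
import java.util.Set;

// The powerset of {"a", "b"} ordered by inclusion is a complete lattice:
// join = union, meet = intersection, bottom = {} and top = {"a", "b"}.
class PowersetLattice {
    static Set<String> join(Set<String> x, Set<String> y) {
        Set<String> r = new HashSet<>(x);
        r.addAll(y);
        return r;
    }

    static Set<String> meet(Set<String> x, Set<String> y) {
        Set<String> r = new HashSet<>(x);
        r.retainAll(y);
        return r;
    }

    public static void main(String[] args) {
        Set<String> bot = new HashSet<>();
        Set<String> top = Set.of("a", "b");
        Set<String> x = Set.of("a");
        // Idempotency, and bottom/top as identities for join and meet.
        System.out.println(join(x, x).equals(x));   // x v x = x
        System.out.println(join(x, bot).equals(x)); // x v bottom = x
        System.out.println(meet(x, top).equals(x)); // x ^ top = x
    }
}
```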
Given a control flow graph (CFG) we observe that every node has input- and output states. The semantics of the language describe how the input state is transformed into the output state.
Consider the program
x := 1
the in-state is {}, the out-state is {x ↦ 1}.
For many properties we want to know about a program it is enough to know the set of all possible states. To collect them all we assume every branch gets taken every time. When two branches merge we combine the information from both of them.
For example
if e then
x := 1 # x ↦ 1
else
x := 2 # x ↦ 2
y := x + 1 # x ↦ {1, 2}, y ↦ {2, 3}
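The merge above can be replayed with explicit value sets. A sketch with our own helper names, assuming every branch is taken and merged states are unions:

```java
import java.util.HashSet;
import java.util.Set;

// Collecting semantics for the branch example above: every branch is taken,
// and merged states are the union of the per-branch value sets.
class Collecting {
    // merge point: combine the value sets of both branches
    static Set<Integer> merge(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new HashSet<>(a);
        r.addAll(b);
        return r;
    }

    // y := x + 1, applied to every possible value of x
    static Set<Integer> addOne(Set<Integer> xs) {
        Set<Integer> r = new HashSet<>();
        for (int v : xs) r.add(v + 1);
        return r;
    }

    public static void main(String[] args) {
        Set<Integer> x = merge(Set.of(1), Set.of(2)); // then: x := 1, else: x := 2
        Set<Integer> y = addOne(x);                   // y := x + 1
        System.out.println(x); // x in {1, 2}
        System.out.println(y); // y in {2, 3}
    }
}
```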
This is a static approximation of what happens at runtime. There are two problems with it:
- It amounts to running the full program
- The set of states might grow unbounded (think while (true) x++)
We define an abstraction of the values α and an abstraction of the semantics.
As an example, to calculate the sign (+/-) of a number we define the following abstraction:
α(x) = if x < 0 then - else +
a + b =
| +, + -> +
| -, - -> -
| +, - -> ∓
| ∓, x -> ∓
Now we run on the same example
if e then
x := 1 # x ↦ +
else
x := 2 # x ↦ +
y := x + 1 # x ↦ +, y ↦ +
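The abstract run above can be sketched in code. The enum and method names are ours; UNKNOWN plays the role of the ∓ element:

```java
// A sketch of the sign abstraction above (enum and method names are ours).
// UNKNOWN plays the role of the -/+ element: the sign could be anything.
class SignDomain {
    enum Sign { POS, NEG, UNKNOWN }

    // alpha: abstract a concrete number to its sign
    static Sign alpha(int x) {
        return x < 0 ? Sign.NEG : Sign.POS;
    }

    // abstract addition, following the table above
    static Sign plus(Sign a, Sign b) {
        if (a == Sign.UNKNOWN || b == Sign.UNKNOWN) return Sign.UNKNOWN;
        return a == b ? a : Sign.UNKNOWN;
    }

    // join merges the information from two branches
    static Sign join(Sign a, Sign b) {
        return a == b ? a : Sign.UNKNOWN;
    }

    public static void main(String[] args) {
        Sign xThen = alpha(1);           // x := 1  ->  +
        Sign xElse = alpha(2);           // x := 2  ->  +
        Sign x = join(xThen, xElse);     // merge: + v + = +
        Sign y = plus(x, alpha(1));      // y := x + 1  ->  + + + = +
        System.out.println(x + " " + y); // prints POS POS
    }
}
```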
Abstract interpretation is defined by (A, v, i) where A is the set of abstract states and
v is a relation such that A is a complete v-semilattice.
The abstract semantics i operates on the abstract values and transforms an abstract
input state to an output state. States are merged by v. We search for the
≤-least fixed point for every set of states.
Cousot and Cousot showed that A being a complete lattice (with monotone abstract semantics) suffices for a unique least fixed point to exist. For static analysis abstract interpretation is mostly performed over finite lattices (which are trivially complete), so the iteration also terminates.
The question remains whether an abstraction is consistent with the concrete semantics.
Given two ordered sets L and L' of concrete values and abstract values,
α maps an element x from the concrete set L to the abstract element α(x) in the
abstract set L'. We say α(x) is the abstraction of x. Conversely, the
concretization function γ maps elements from L' to L. The element γ(x') is called a
concretization of x'. Let f be a function from L to L and f' a function
from L' to L'. f' is called a valid abstraction of f iff for all x' in L',
(f ∘ γ)(x') ≤ (γ ∘ f')(x').
x <----- γ ----- x'
| |
| f | f'
| |
y ≤ z <--- γ --- y'
We use the following lattice to represent the sign of a value
∓
/ \
- +
\ /
⊥
⊥ means unknown, as in the value was never observed; ∓ means the value can be anything.
What happens if we execute
int i = 10
while (exp)
i += 1;
print i
and what if we execute
int i = 10
while (exp)
i -= 1;
print i
We can arbitrarily increase the precision. For example to identify constants the following domain works:
⊤
.... -3 -2 -1 0 1 2 3 ...
⊥
For range analysis we can use all ranges (a, b] and the lattice with the
structure (a, b] v (c, d] := (min(a,c), max(b,d)].
Note: this lattice has infinite height
and a naive fixed-point iteration is not guaranteed to terminate.
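The termination problem can be seen in a small sketch. The class and method names are ours, and we use closed intervals [lo, hi] instead of half-open ranges for simplicity; iterating the body of `while (true) x++` widens the range on every round:

```java
// A sketch of the range lattice (class and method names are ours). Joining
// two ranges takes the smallest lower and largest upper bound; iterating
// the loop body of `while (true) x++` then never stabilizes.
class Range {
    final int lo, hi; // represents the interval [lo, hi]

    Range(int lo, int hi) { this.lo = lo; this.hi = hi; }

    Range join(Range o) {
        return new Range(Math.min(lo, o.lo), Math.max(hi, o.hi));
    }

    Range plusOne() { // abstract semantics of x := x + 1
        return new Range(lo + 1, hi + 1);
    }

    public static void main(String[] args) {
        Range x = new Range(0, 0); // x := 0
        // Naive fixed-point iteration for `while (true) x++`:
        // each round widens the range by one, so it never reaches a fixed point.
        for (int round = 0; round < 5; round++) {
            x = x.join(x.plusOne()); // merge loop entry with the loop body
        }
        System.out.println("[" + x.lo + ", " + x.hi + "]"); // prints [0, 5], still growing
    }
}
```

In practice interval analyses use a widening operator to force termination on such lattices.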
Question: What is the following lattice useful for?
⊤
|
⊥
Flow based analysis is a particular instance of abstract interpretation (see [Sc] for a good introduction). We consider a control flow graph of a program. For every node n we define
outₙ = transₙ(inₙ)
inₙ = join(outₚ ∀ p ∈ predecessors(n))
For example given a fragment of a CFG with a node n which has two predecessors a and b this is how it looks:
[ l ] [ k ]
\ outₗ / outₖ
\ /
join
|
| inₙ
transₙ
| outₙ
|
Given a value domain which is a partial order of finite height (hence a complete partial order) and a monotonic (trans ∘ join), an iterative analysis is guaranteed to reach a fixed point.
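The iteration itself can be sketched generically (helper names are ours): apply the transfer function until the state stops changing. The toy clamp function stands in for a monotonic (trans ∘ join).

```java
import java.util.function.UnaryOperator;

// A sketch of naive fixed-point iteration (helper names are ours): apply
// the function until the state stops changing. Over a finite-height domain
// with a monotonic function this terminates.
class FixedPoint {
    static <T> T iterate(T init, UnaryOperator<T> f) {
        T cur = init;
        while (true) {
            T next = f.apply(cur);
            if (next.equals(cur)) return cur; // fixed point reached
            cur = next;
        }
    }

    public static void main(String[] args) {
        // Toy stand-in for a monotone transfer function: clamp at 5.
        // 7 -> 5 -> 5, so 5 is the fixed point.
        int fp = iterate(7, x -> Math.min(x, 5));
        System.out.println(fp); // prints 5
    }
}
```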
1 x := read()
2 y := 0
3 if (x==0)
4 y := 3
5 else
6 y := 4
7 z := y
8 print (z)
For reaching definitions, the states are a mapping from variables to the line numbers where they are defined
transₙ(inₙ) = (inₙ \ defined_vars(n)) ∪ {v ↦ n | v ∈ defined_vars(n)}
join(l₁, l₂) = {x ↦ l₁(x) ∪ l₂(x) | ∀ x ∈ dom(l₁) ∪ dom(l₂)}
1 x := read() # x ↦ 1
2 y := 0 # x ↦ 1, y ↦ 2
3 if (x==0)
4 y := 3 # x ↦ 1, y ↦ 4
5 else
6 y := 4 # x ↦ 1, y ↦ 6
7 z := y # x ↦ 1, y ↦ {4, 6}, z ↦ 7
8 print (z)
The domain is finite because there are finitely many lines of code.
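The transfer and join functions above can be sketched directly (helper names are ours), replaying the example program:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A sketch of the reaching-definitions transfer and join functions above
// (helper names are ours), replayed on the example program.
class ReachingDefs {
    // trans: kill the old definitions of the assigned variable, add line n
    static Map<String, Set<Integer>> trans(Map<String, Set<Integer>> in, String var, int line) {
        Map<String, Set<Integer>> out = new HashMap<>(in);
        out.put(var, Set.of(line));
        return out;
    }

    // join: union the definition sites per variable
    static Map<String, Set<Integer>> join(Map<String, Set<Integer>> a, Map<String, Set<Integer>> b) {
        Map<String, Set<Integer>> out = new HashMap<>();
        Set<String> vars = new HashSet<>(a.keySet());
        vars.addAll(b.keySet());
        for (String v : vars) {
            Set<Integer> defs = new HashSet<>(a.getOrDefault(v, Set.of()));
            defs.addAll(b.getOrDefault(v, Set.of()));
            out.put(v, defs);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> s = new HashMap<>();
        s = trans(s, "x", 1);                               // 1: x := read()
        s = trans(s, "y", 2);                               // 2: y := 0
        Map<String, Set<Integer>> thenS = trans(s, "y", 4); // 4: y := 3
        Map<String, Set<Integer>> elseS = trans(s, "y", 6); // 6: y := 4
        s = join(thenS, elseS);                             // merge the two branches
        s = trans(s, "z", 7);                               // 7: z := y
        System.out.println(s.get("y"));                     // y is defined at lines 4 and 6
    }
}
```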
For liveness analysis, the states are the sets of live variables. This time we run the analysis backwards, from the last line to the beginning.
transₙ(inₙ) = (inₙ \ defined_vars(n)) ∪ used_vars(n)
join(x₁, x₂) = x₁ ∪ x₂
1 x := read() # []
2 y := 0 # [x]
3 if (x==0) # [x]
4 y := 3 # []
5 else
6 y := 4 # []
7 z := y # [y]
8 print (z) # [z]
The domain is finite because there are finitely many variables.
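The backward run above can likewise be sketched in code (helper names are ours), walking the example program from the last line upwards:

```java
import java.util.HashSet;
import java.util.Set;

// A sketch of the backward liveness transfer function above (helper names
// are ours), replayed on the example program from the last line upwards.
class Liveness {
    // trans: a variable is live before n if it is used at n, or live after n
    // and not overwritten at n
    static Set<String> trans(Set<String> liveAfter, Set<String> used, Set<String> defined) {
        Set<String> live = new HashSet<>(liveAfter);
        live.removeAll(defined);
        live.addAll(used);
        return live;
    }

    public static void main(String[] args) {
        Set<String> live = new HashSet<>();                     // nothing live at the end
        live = trans(live, Set.of("z"), Set.of());              // 8: print(z)
        live = trans(live, Set.of("y"), Set.of("z"));           // 7: z := y
        Set<String> thenB = trans(live, Set.of(), Set.of("y")); // 4: y := 3
        Set<String> elseB = trans(live, Set.of(), Set.of("y")); // 6: y := 4
        live = new HashSet<>(thenB);                            // join = union
        live.addAll(elseB);
        live = trans(live, Set.of("x"), Set.of());              // 3: if (x == 0)
        live = trans(live, Set.of(), Set.of("y"));              // 2: y := 0
        live = trans(live, Set.of(), Set.of("x"));              // 1: x := read()
        System.out.println(live);                               // prints [] -- nothing live at entry
    }
}
```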
[Fe] Matthias Felleisen, Matthew Flatt
Programming Languages and Lambda Calculi
https://www.cs.utah.edu/~mflatt/past-courses/cs7520/public_html/s06/notes.pdf
[Cl] Click and Paleczny
A Simple Graph-Based Intermediate Representation
http://grothoff.org/christian/teaching/2007/3353/papers/click95simple.pdf
[Wr] Wright and Felleisen
A syntactic approach to type soundness
https://pdfs.semanticscholar.org/c19c/45531a5f6c114e9da1dbfea8aedc31c31e0d.pdf
[Wa] Philip Wadler
Propositions as Types
https://www.youtube.com/watch?v=IOiZatlZtGU
[Sc] Michael I. Schwartzbach
Lecture Notes on Static Analysis
http://lara.epfl.ch/w/_media/sav08:schwartzbach.pdf
[Ig] Igarashi, Atsushi, Benjamin C. Pierce, and Philip Wadler
Featherweight Java: a minimal core calculus for Java and GJ
https://www.cis.upenn.edu/~bcpierce/papers/fj-toplas.pdf
[Co] P. Cousot and R. Cousot
Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints
https://dl.acm.org/citation.cfm?id=512973
[Pa] Jens Palsberg
The MiniJava Type System
http://web.cs.ucla.edu/~palsberg/course/cs132/miniJava-typesystem.pdf
Weblinks
[Ai] https://wiki.mozilla.org/Abstract_Interpretation,
https://www.di.ens.fr/~cousot/AI/IntroAbsInt.html