Create a gist now

Instantly share code, notes, and snippets.

@o- /analysis.md Secret
Last active Mar 31, 2017

What would you like to do?

Program Analysis

Lecture Notes for 21025 Compiler Construction by Olivier Flückiger

A compiler is f : S -> T -- a function from source language to target language. It is correct iff s = f(s). Note the equality has to be extensional (see [Fe] chapter 5.6 for an introduction to observational equivalence).

Therefore every transformation our compiler does has to be correct. Since they are based on the result of some analysis the analysis has to be sound.

Motivation

As a compiler writer we are always looking for stronger facts about a program. With more information we can produce better error messages to the user. And we can write compilers which output faster code.

Last week we saw SSA, an IR which encodes data dependencies. It transforms variables into a data dependency graph (there is a particular IR style called Sea of Nodes [Cl] which extends on the idea and also understands control flow as a data dependency problem).

But, its not all we want to know about a function. There are many things a compiler is interested in. Just some examples:

  • Which variables point to the same object. (points to analysis, alias analysis)
  • Any kind of properties about data in a function. (dataflow analysis like value range analysis, livenes, reaching definitions)
  • Interprocedural properties. (control flow analysis (CFA), call graph construction)
  • Are the type annotations consistent. (type checking, type inference)

Types

Data types can be understood as sets. For example the type int is the set of all numbers from the smallest to the biggest integer. The type Point (int, int) is the set of all points on a discrete 2D pane.

A type system is supposed to rule out type errors in a program. For example passing a floating point number to a function expecting an integer should be caught.

Static vs. Dynamically checked

Depending on the language some type errors can be found statically. For example consider the following program in Java:

1  ColoredPoint x;
2  x = new Point();
3  x.color;

The assignment on line 2 should be rejected because we assign a value of a smaller type to a reference of a bigger type. Note how the access on line 3 does not make sense for a point without a color.

Other errors are checked dynamically. Consider:

1  ColoredPoint[] cpa = new ColoredPoint[10];
2  Point[] pa = cpa;
3  assert(pa[0] == null);
4  pa[0] = new Point();

In Java arrays are (by design) covariant. A reference to a ColoredPoint[] can be passed as a Point[]. Storing a Point into a ColoredPoint[] is forbidden. The error occurs at runtime. This program will type check statically, but a type error (an ArrayStoreException) is thrown at line 4 (Note: this is still sound because the static loophole is accounted for at runtime).

Soundness

The static type is a conservative approximation of the values seen at runtime. We call it conservative because sometimes it errors on the side of safety. For example the following (accidentally) correctly behaving programs is ruled out:

1  ColoredPoint x;
2  if (false) x = new Point();

Conservative also implies that the approximation is correct (ie. no type error goes unreported). If a type system can give this guarantee we call it sound.

Type soundness theorem (see [Wr])

if e has type t, either one of the three happen
1. eval(e) diverges
2. eval(e) = v  where v is a value of type t
3. eval(e) throws a (known) exception

For example in C it is possible to violate type soundness:

float f = 3;
int* i = (int*)&f;
printf ("%d\n", i);   // Won't print 3

Note: Soundness and static types are orthogonal. A dynamically typed language is a special case with one possible static type t -- the type that contains all others. For example in Smalltalk it's Object.

Note: only sound types can be used to optimize a program.

Type checking

Typing rules are usually written as natural deductions (Gerhard Gentzen). (It's not an accident. See [Wa] for a connection between types and logical propositions). There exists a formal type system of MiniJava [Pa]. Let's look at a simplified version.

We ignore classes and have methods with one argument, one local variable and one statement in the body.

A = (x₁ ↦ t₁, x₂ ↦ t₂)
A ⊢ s                     A ⊢ e : t
---------------------------------------------
⊢ public t name (t₁ x₁) {t₂ x₂; s; return e }

Rules for some expressions and statements:

         A ⊢ e : t₂
(assign) -----------  A(x) = t₁, t₂ ≤ t₁
         A ⊢ x = e


(lookup) ---------  A(x) = t
         A ⊢ x : t


   (lit) -----------
         ⊢ num : int


         A ⊢ e₁ : int    A ⊢ e₂ : int
 (exp-<) ----------------------------
         A ⊢ e₁ < e₂ : bool

Let's type check the following program:

public bool test (int i) {
  int j;
  j = 2;
  return i < j;
}

A possible deduction is


                           A ⊢ 2 : int ≤ int         A ⊢ i : int    A ⊢ j : int
                           -----------------         --------------------------
A := (i ↦ int, j ↦ int)    A ⊢ j = 2                 A ⊢ i < j : bool
---------------------------------------------------------------------------------
⊢ public bool test (int i) { int j; j = 2; return i < j; }

For a more complete version covering the full Java language see Featherweight Java [Ig].

Abstract Interpretation

In a compiler we might want to know more properties of a program than just types. For example in

int array[10]
int index = compute()
array[index]

is the index guaranteed to be between 0 and 9?

We know from Rice's theorem that any non-trivial property about a program is undecidable. However as we have seen with types we can find conservative approximations for certain properties.

One of the most generic and fundamental ways of approximating the behavior of a program is called abstract interpretation [Co]. As the original paper states, a program denotes computations in some universe of objects. Abstract interpretation of programs consists of using that denotation to describe computations in another universe of abstract objects, so that the results of abstract execution give some information on the actual computations.

Note: For other example driven explanations of abstract interpretation see [Ai].

Lattices

First some terminology:

A partially ordered set is a set S and a binary relation such that

  1. a ≤ a (reflexivity)
  2. if a ≤ b and b ≤ a, then a = b (antisymmetry)
  3. if a ≤ b and b ≤ c, then a ≤ c (transitivity)

Note: elements do not have to be comparable.

A semilattice S is a partially ordered set such that for all elements x and y of S, the least upper bound of the x and y exists. Alternatively we define a join relation v such that

  1. x v (y v z) = (x v y) v z (Associativity)
  2. x v y = y v x (Commutativity)
  3. x v x = x (Idempotency)

x v x is called the supremum.

We similarly define meet to be the greatest lower bound (ie. the infimum).

A lattice is a partially ordered set with meet and join.

A complete lattice is a lattice where every subset has a supremum and infimum. It follows that there exists a supremum and infimum of the whole lattice. We call them top (⊤) and bottom (⊥). ⊤ and ⊥ are the identity element for v and ∧ (ie. x v 0 = x).

Approach

Given a control flow graph (CFG) we observe that every node has input- and output states. The semantics of the language describe how the input state is transformed into the output state.

Consider the program

x := 1

the in-state is {}, the out-state is {x ↦ 1}.

For many properties we want to know about a program it is enough to know the set of all possible states. To collect them all we assume every branch gets taken every time. When two branches merge we combine the information from both of them.

For example

  if e then
    x := 1      # x ↦ 1
  else
    x := 2      # x ↦ 2
  y := x + 1    # x ↦ {1, 2}, y ↦ {2, 3}

This is a static approximation of what happens at runtime. There are two problems with it

  1. It amounts to running the full program
  2. The set of states might grow unbounded (think while (true) x++)

We define an abstraction of the values α and an abstraction of the semantics.

As an example to calculate the sign (+/-) of a number we define the following abstraction:

α : if x < 0 then + else -
a + b =
  | +, +  -> +
  | -, -  -> -
  | +, -  -> ∓
  | ∓, x  -> ∓

Now we run on the same example

  if e then
    x := 1      # x ↦ +
  else
    x := 2      # x ↦ +
  y := x + 1    # x ↦ +, y ↦ +

Abstract interpretation is defined by A, v, i where A is the set of abstract states, v is a relation such that A is a complete v-semilattice. The abstract semantics i operates on the abstract values and transforms abstract input to output state. States are merged by v. We search for the -least fixed-point for every set of states.

Cousot and Cousot showed that A being a complete lattice suffices that there exists a unique fixed point and abstract interpretation terminates. For static analysis abstract interpretation is mostly performed over finite lattices (which are trivially complete).

Consistent Abstraction

The question remains whether an abstraction is consistent with the concrete semantics.

Given two ordered set L and L' of concrete values and abstract values. α maps element x from the concrete set L to the abstract element α(x) in the abstract set L'. We say α(x) is the abstraction of x. Secondly the concretization function γ maps elements from L' to L. Element γ(x') in called a concretization of x'. Let f be a function from L to L and f' be a function from L' to L'. f' is called a valid abstraction of f iff for all x' in L' (f ° γ)(x') ≤ (γ ° f')(x').

x  <----- γ -----  x'
|                  |
| f                | f'
|                  |
y ≤ z  <--- γ ---  y'

Examples

We use the following lattice to represent the sign of a value

   ∓
  / \
 -   +
  \ /
   ⊥

⊥ means unknown as in the value was never observed ∓ means the value can be anything

What happens if we execute

int i = 10
while (exp)
  i += 1;
print i

and what if we execute

int i = 10
while (exp)
  i -= 1;
print i

We can arbitrarily increase the precision. For example to identify constants the following domain works:

                   ⊤

 .... -3  -2  -1   0   1   2   3  ...

                   ⊥

For range analysis we can use all ranges (a, b] and the latice with the structure (a, b] v (c, d] := (min(a,b), max(b,d)]. Note: this lattice has infinite height and a naive fixed point iteration is not guaranteed to terminate.

Question: What is the following lattice useful

⊤
|
⊥

Dataflow

Flow based analysis is a particular instance of abstract interpretation (see [Sc] for a good introduction). We consider a control flow graph of a program. For every node n we define

outₙ = transₙ(inₙ)
inₙ  = join(outₚ ∀ p ϵ predecessors(n))

For example given a fragment of a CFG with a node n which has two predecessors a and b this is how it looks:

[ l ]       [ k ]
   \ outₗ    /  outₖ
    \       /
      join
       |
       | inₙ
     transₙ
       | outₙ
       |

Given a value domain which is a partial order with finite size (ie. a complete partial order) and a (trans ° join) which is monotonic then an iterative analysis is guaranteed to reach a fixed point.

Examples

1  x := read()
2  y := 0
3  if (x==0)
4    y := 3
5  else
6    y := 4
7  z := y
8  print (z)

Forward analysis, reaching definition

The states are a mapping from variables to line numbers where they are defined

transₙ(inₙ) = inₙ \ defined_vars(n) ∪ {defined_vars(n) ↦ n}
join(l₁, l₂) = {x ↦ l₁(x) ∪ l₂(x) | ∀ x ϵ dom(l₁, l₂) }


1  x := read()    # x ↦ 1
2  y := 0         # x ↦ 1, y ↦ 2
3  if (x==0)
4    y := 3       # x ↦ 1, y ↦ 4
5  else
6    y := 4       # x ↦ 1, y ↦ 6
7  z := y         # x ↦ 1, y ↦ {4, 6}, z ↦ 7
8  print (z)

The domain is finite because there are finitely many lines of code.

Backwards analysis, live variables

The states are the list of live variables. This time we run the analysis backwards, from the last line to the beginning.

transₙ(inₙ) = inₙ ∪ used_vars(n) \ defined_vars(n)
join(x₁, x₂) = x₁ ∪ x₂


1  x := read()    # []
2  y := 0         # [x]
3  if (x==0)      # [x]
4    y := 3       # []
5  else
6    y := 4       # []
7  z := y         # [y]
8  print (z)      # [z]

The domain is finite because there are finitely many variables.

Bibliography

[Fe] Matthias Felleisen, Matthew Flatt
     Programming Languages and Lambda Calculi
     https://www.cs.utah.edu/~mflatt/past-courses/cs7520/public_html/s06/notes.pdf

[Cl] Click and Paleczny
     A Simple Graph-Based Intermediate Representation
     http://grothoff.org/christian/teaching/2007/3353/papers/click95simple.pdf

[Wr] Wright and Felleisen
     A syntactic approach to type soundness
     https://pdfs.semanticscholar.org/c19c/45531a5f6c114e9da1dbfea8aedc31c31e0d.pdf

[Wa] Philip Wadler
     Propositions as Types
     https://www.youtube.com/watch?v=IOiZatlZtGU

[Sc] Michael I. Schwartzbach
     Lecture Notes on Static Analysis
     http://lara.epfl.ch/w/_media/sav08:schwartzbach.pdf

[Ig] Igarashi, Atsushi, Benjamin C. Pierce, and Philip Wadler
     Featherweight Java: a minimal core calculus for Java and GJ
     https://www.cis.upenn.edu/~bcpierce/papers/fj-toplas.pdf

[Co] P. Cousot and R. Cousot
     Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints
     https://dl.acm.org/citation.cfm?id=512973

[Pa] Jens Palsberg
     The MiniJava Type System
     http://web.cs.ucla.edu/~palsberg/course/cs132/miniJava-typesystem.pdf

Weblinks

[Ai] https://wiki.mozilla.org/Abstract_Interpretation,
     https://www.di.ens.fr/~cousot/AI/IntroAbsInt.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment