o-/analysis.md Secret

## analysis.md

      
    Raw
  

              analysis.md
            
          
    Program Analysis

Lecture Notes for 21025 Compiler Construction by Olivier Flückiger
A compiler is f : S -> T -- a function from source language to target language.
It is correct iff s = f(s). Note the equality has to be extensional (see
[Fe] chapter 5.6 for an introduction to observational equivalence).
Therefore every transformation our compiler does has to be correct. Since they are
based on the result of some analysis the analysis has to be sound.
Motivation

As a compiler writer we are always looking for stronger facts about a program. With more
information we can produce better error messages to the user. And we can write
compilers which output faster code.
Last week we saw SSA, an IR which encodes data dependencies. It transforms variables into a data
dependency graph (there is a particular IR style called Sea of Nodes [Cl] which extends on the idea and also
understands control flow as a data dependency problem).
But, its not all we want to know about a function. There are many things a compiler
is interested in. Just some examples:

Which variables point to the same object.
(points to analysis, alias analysis)
Any kind of properties about data in a function.
(dataflow analysis like value range analysis, livenes, reaching definitions)
Interprocedural properties.
(control flow analysis (CFA), call graph construction)
Are the type annotations consistent.
(type checking, type inference)

Types

Data types can be understood as sets. For example the type int is the set of all
numbers from the smallest to the biggest integer. The type Point (int, int)
is the set of all points on a discrete 2D pane.
A type system is supposed to rule out type errors in a program. For example
passing a floating point number to a function expecting an integer should be
caught.
Static vs. Dynamically checked

Depending on the language some type errors can be found statically. For example
consider the following program in Java:
1  ColoredPoint x;
2  x = new Point();
3  x.color;

The assignment on line 2 should be rejected because we assign a value of a
smaller type to a reference of a bigger type. Note how the access on line 3
does not make sense for a point without a color.
Other errors are checked dynamically. Consider:
1  ColoredPoint[] cpa = new ColoredPoint[10];
2  Point[] pa = cpa;
3  assert(pa[0] == null);
4  pa[0] = new Point();

In Java arrays are (by design) covariant. A reference to a ColoredPoint[]
can be passed as a Point[]. Storing a Point into a ColoredPoint[] is
forbidden. The error occurs at runtime. This program will type check statically, but
a type error (an ArrayStoreException) is thrown at line 4 (Note: this is still
sound because the static loophole is accounted for at runtime).
Soundness

The static type is a conservative approximation of the values seen at runtime.
We call it conservative because sometimes it errors on the side of safety.
For example the following (accidentally) correctly behaving programs is ruled out:
1  ColoredPoint x;
2  if (false) x = new Point();

Conservative also implies that the approximation is correct (ie. no type error goes unreported). If a
type system can give this guarantee we call it sound.
Type soundness theorem (see [Wr])
if e has type t, either one of the three happen
1. eval(e) diverges
2. eval(e) = v  where v is a value of type t
3. eval(e) throws a (known) exception

For example in C it is possible to violate type soundness:
float f = 3;
int* i = (int*)&f;
printf ("%d\n", i);   // Won't print 3

Note: Soundness and static types are orthogonal. A dynamically typed language
is a special case with one possible static type t -- the type that contains
all others. For example in Smalltalk it's Object.
Note: only sound types can be used to optimize a program.
Type checking

Typing rules are usually written as natural deductions (Gerhard Gentzen). (It's
not an accident. See [Wa] for a connection between types and logical
propositions). There exists a formal type system of MiniJava [Pa]. Let's look
at a simplified version.
We ignore classes and have methods with one argument,
one local variable and one statement in the body.
A = (x₁ ↦ t₁, x₂ ↦ t₂)
A ⊢ s                     A ⊢ e : t
---------------------------------------------
⊢ public t name (t₁ x₁) {t₂ x₂; s; return e }

Rules for some expressions and statements:
         A ⊢ e : t₂
(assign) -----------  A(x) = t₁, t₂ ≤ t₁
         A ⊢ x = e


(lookup) ---------  A(x) = t
         A ⊢ x : t


   (lit) -----------
         ⊢ num : int


         A ⊢ e₁ : int    A ⊢ e₂ : int
 (exp-<) ----------------------------
         A ⊢ e₁ < e₂ : bool

Let's type check the following program:
public bool test (int i) {
  int j;
  j = 2;
  return i < j;
}

A possible deduction is

                           A ⊢ 2 : int ≤ int         A ⊢ i : int    A ⊢ j : int
                           -----------------         --------------------------
A := (i ↦ int, j ↦ int)    A ⊢ j = 2                 A ⊢ i < j : bool
---------------------------------------------------------------------------------
⊢ public bool test (int i) { int j; j = 2; return i < j; }

For a more complete version covering the full Java language see Featherweight
Java [Ig].
Abstract Interpretation

In a compiler we might want to know more properties of a program than just types. For example in
int array[10]
int index = compute()
array[index]

is the index guaranteed to be between 0 and 9?
We know from Rice's theorem that any non-trivial property about a program is
undecidable. However as we have seen with types we can find conservative
approximations for certain properties.
One of the most generic and fundamental ways of approximating the behavior of a program
is called abstract interpretation [Co]. As the original paper states, a program
denotes computations in some universe of objects. Abstract interpretation of
programs consists of using that denotation to describe computations in another
universe of abstract objects, so that the results of abstract execution give
some information on the actual computations.
Note: For other example driven explanations of abstract interpretation see [Ai].
Lattices

First some terminology:
A partially ordered set is a set S and a binary relation ≤ such that

a ≤ a (reflexivity)
if a ≤ b and b ≤ a, then a = b (antisymmetry)
if a ≤ b and b ≤ c, then a ≤ c (transitivity)

Note: elements do not have to be comparable.
A semilattice S is a partially ordered set such that for all elements x and y of S,
the least upper bound of the x and y exists.
Alternatively we define a join relation v such that

x v (y v z) = (x v y) v z (Associativity)
x v y = y v x (Commutativity)
x v x = x (Idempotency)

x v x is called the supremum.
We similarly define meet ∧ to be the greatest lower bound (ie. the infimum).
A lattice is a partially ordered set with meet and join.
A complete lattice is a lattice where every subset has a supremum and infimum.
It follows that there exists a supremum and infimum of the whole lattice. We call them
top (⊤) and bottom (⊥). ⊤ and ⊥ are the identity element for v and ∧ (ie. x v 0 = x).
Approach

Given a control flow graph (CFG) we observe that every node has input- and output states.
The semantics of the language describe how the input state is transformed into the
output state.
Consider the program
x := 1

the in-state is {}, the out-state is {x ↦ 1}.
For many properties we want to know about a program it is enough to know the set of all
possible states. To collect them all we assume every branch gets taken every time.
When two branches merge we combine the information from both of them.
For example
  if e then
    x := 1      # x ↦ 1
  else
    x := 2      # x ↦ 2
  y := x + 1    # x ↦ {1, 2}, y ↦ {2, 3}

This is a static approximation of what happens at runtime. There are two problems with it

It amounts to running the full program
The set of states might grow unbounded (think while (true) x++)

We define an abstraction of the values α and an abstraction of the semantics.
As an example to calculate the sign (+/-) of a number we define the following abstraction:
α : if x < 0 then + else -
a + b =
  | +, +  -> +
  | -, -  -> -
  | +, -  -> ∓
  | ∓, x  -> ∓

Now we run on the same example
  if e then
    x := 1      # x ↦ +
  else
    x := 2      # x ↦ +
  y := x + 1    # x ↦ +, y ↦ +

Abstract interpretation is defined by A, v, i where A is the set of abstract states,
v is a relation such that A is a complete v-semilattice.
The abstract semantics i operates on the abstract values and transforms abstract input to
output state. States are merged by v. We search for the ≤-least fixed-point for every set of states.
Cousot and Cousot showed that A being a complete lattice suffices that there exists a unique fixed point and abstract interpretation
terminates. For static analysis abstract interpretation is mostly performed over finite lattices (which are trivially complete).
Consistent Abstraction

The question remains whether an abstraction is consistent with the concrete semantics.
Given two ordered set L and L' of concrete values and abstract values.
α maps element x from the concrete set L to the abstract element α(x) in the
abstract set L'. We say α(x) is the abstraction of x. Secondly the
concretization function γ maps elements from L' to L. Element γ(x') in called a
concretization of x'. Let f be a function from L to L and f' be a function
from L' to L'. f' is called a valid abstraction of f iff for all x' in L'
(f ° γ)(x') ≤ (γ ° f')(x').
x  <----- γ -----  x'
|                  |
| f                | f'
|                  |
y ≤ z  <--- γ ---  y'

Examples

We use the following lattice to represent the sign of a value
   ∓
  / \
 -   +
  \ /
   ⊥

⊥  means unknown as in the value was never observed
∓  means the value can be anything
What happens if we execute
int i = 10
while (exp)
  i += 1;
print i

and what if we execute
int i = 10
while (exp)
  i -= 1;
print i

We can arbitrarily increase the precision. For example to identify constants
the following domain works:
                   ⊤

 .... -3  -2  -1   0   1   2   3  ...

                   ⊥

For range analysis we can use all ranges (a, b] and the latice with the
structure (a, b] v (c, d] := (min(a,b), max(b,d)]. Note: this lattice has infinite height
and a naive fixed point iteration is not guaranteed to terminate.
Question: What is the following lattice useful
⊤
|
⊥

Dataflow

Flow based analysis is a particular instance of abstract interpretation (see [Sc] for a
good introduction). We consider a control flow graph of a program. For every node n we define
outₙ = transₙ(inₙ)
inₙ  = join(outₚ ∀ p ϵ predecessors(n))

For example given a fragment of a CFG with a node n which has two predecessors a and b this is how
it looks:
[ l ]       [ k ]
   \ outₗ    /  outₖ
    \       /
      join
       |
       | inₙ
     transₙ
       | outₙ
       |

Given a value domain which is a partial order with finite size (ie. a complete partial order) and a (trans ° join) which is monotonic then
an iterative analysis is guaranteed to reach a fixed point.
Examples

1  x := read()
2  y := 0
3  if (x==0)
4    y := 3
5  else
6    y := 4
7  z := y
8  print (z)

Forward analysis, reaching definition

The states are a mapping from variables to line numbers where they are defined
transₙ(inₙ) = inₙ \ defined_vars(n) ∪ {defined_vars(n) ↦ n}
join(l₁, l₂) = {x ↦ l₁(x) ∪ l₂(x) | ∀ x ϵ dom(l₁, l₂) }


1  x := read()    # x ↦ 1
2  y := 0         # x ↦ 1, y ↦ 2
3  if (x==0)
4    y := 3       # x ↦ 1, y ↦ 4
5  else
6    y := 4       # x ↦ 1, y ↦ 6
7  z := y         # x ↦ 1, y ↦ {4, 6}, z ↦ 7
8  print (z)

The domain is finite because there are finitely many lines of code.
Backwards analysis, live variables

The states are the list of live variables. This time we run the analysis backwards, from the last line to the beginning.
transₙ(inₙ) = inₙ ∪ used_vars(n) \ defined_vars(n)
join(x₁, x₂) = x₁ ∪ x₂


1  x := read()    # []
2  y := 0         # [x]
3  if (x==0)      # [x]
4    y := 3       # []
5  else
6    y := 4       # []
7  z := y         # [y]
8  print (z)      # [z]

The domain is finite because there are finitely many variables.
Bibliography

[Fe] Matthias Felleisen, Matthew Flatt
     Programming Languages and Lambda Calculi
     https://www.cs.utah.edu/~mflatt/past-courses/cs7520/public_html/s06/notes.pdf

[Cl] Click and Paleczny
     A Simple Graph-Based Intermediate Representation
     http://grothoff.org/christian/teaching/2007/3353/papers/click95simple.pdf

[Wr] Wright and Felleisen
     A syntactic approach to type soundness
     https://pdfs.semanticscholar.org/c19c/45531a5f6c114e9da1dbfea8aedc31c31e0d.pdf

[Wa] Philip Wadler
     Propositions as Types
     https://www.youtube.com/watch?v=IOiZatlZtGU

[Sc] Michael I. Schwartzbach
     Lecture Notes on Static Analysis
     http://lara.epfl.ch/w/_media/sav08:schwartzbach.pdf

[Ig] Igarashi, Atsushi, Benjamin C. Pierce, and Philip Wadler
     Featherweight Java: a minimal core calculus for Java and GJ
     https://www.cis.upenn.edu/~bcpierce/papers/fj-toplas.pdf

[Co] P. Cousot and R. Cousot
     Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints
     https://dl.acm.org/citation.cfm?id=512973

[Pa] Jens Palsberg
     The MiniJava Type System
     http://web.cs.ucla.edu/~palsberg/course/cs132/miniJava-typesystem.pdf

Weblinks

[Ai] https://wiki.mozilla.org/Abstract_Interpretation,
     https://www.di.ens.fr/~cousot/AI/IntroAbsInt.html