serras/program-analysis.md

## program-analysis.md

      
    Raw
  

              program-analysis.md
            
          
    Program Analysis, a Big Happy Family

The idea behind program analysis is simple, right? You just want to know stuff about your program before it runs, usually because you don't want unexpected problems to arise (those are better in movies.) Then why looking at Wikipedia gives you headaches? Just so many approaches, tools, languages 🤯
In this article I would like to give a glimpse of an overarching approach to program analysis, based on ideas from abstract interpretation. My goal is not to pinpoint a specific technique, but rather show how they have common core concepts, the differences being due mostly to algorithmic challenges. In other words, static analysis have a shared goal, but it's a challenge to make them precise and performant.
Code is meant to be executed by a computer. Take the following very simple function:
fun cantulupe(x) = {
  y = x * 2
  return y + 1
}
This code performs some computation for each concrete value of x. For example, when given 35 as input, it returns 71 as output. If we were to describe the input/output behavior by such pairs, we have a problem: the domain of inputs and outputs is infinite! (or at least very large.) Program analysis abstracts the domain to be able to give a resonable answer, and do so in reasonable time; in doing so it necessarily has to drop some information. A type system, for example, may declare that cantulupe expects and returns an integer.
fun cantulupe(x: Int): Int
This has already carved out a subset of all possible values in a program: we know that cantulupe does not return a non-integral number, for example.
Many Shades of Analysis

Different analysis techniques abstract different aspects of the program. Type systems focus on categorizing the data that flows in the system. The initial goal was to have a broad classification to prevent "stupid" mistakes while keeping the overhead low. With the pass of time, though, type systems have become more powerful.

Subtyping adds the ability of types to form a hierarchy: some functions ingest a broader sets of types, whereas other target a very narrow subset.
Refinement and dependent types adds the ability to talk about values in a more refined way, like "the set of integers greater than 1."

Still the performance is one of the main drivers of type systems, since they are often used in an interactive loop with the programmer (think of any IDE or your like). Nowadays many type systems are implemented with custom solvers, OutsideIn(X) -- the solver in GHC -- is a prime example of this trend.
Data flow analyses share the same concrete aspect: the data flowing in the program. However, instead of fixating on attaching a type to each value or variable in the program, they keep different amounts of information. For example, keeping a Boolean flag indicating whether each variable has been used in a particular block of code. In that case, the analysis of cantalupe would be { x -> used, y -> used }. In most cases data flow analysis approaches the program as a graph, where nodes represent points in the program and edges represent that one program point is immediately reachable from other.
This brings us to one important concept in our overarching framework: how to join the information represented by two abstractions, that is, how to have a single abstract element which contains all the concrete elements of the original abstractions. In Java, for example, the join of Integer and String would be Object, its common parent. In our data flow analysis about usage, the join would take each variable as used if it has been used in any of the joined abstractions.
Until now we have been talking about how to abstract values or a particular aspect of each variable. But what if instead of looking simply as data, we took whole program traces as our concrete element of study? Then we can speak not only of values, but about properties of our whole computations!
One prime example of this idea is Hoare logic. This kind of analysis shares with data flow analysis the representation of a program as a graph, but instead of some information about values in each node, it attaches two logical formulae: one that explains what holds before the node is entered, and one about the state afterwards. Each { precondition, instruction, postcondition } is called a Hoare triple.
fun cantalupe(x) = {
            { } y = x * 2    { y = x * 2 }
  { y = x * 2 } return y + 1 { result = y + 1 }
}
Model checking goes one step further, and abstract the entire trace, usually using some form of finite state automaton. This means that the possibly infinite domain of execution paths of a function is summarized in a finite description. Many model checkers use temporal logic -- an extension of regular Boolean logic with operators for "always", "after", and similar constructors -- to define properties of the system. This ability to model sequences of steps have been successfully used to check concurrent systems with software such as Alloy or TLA+.


Technique
Concrete
Abstract
Join


Type systems
Data
 Types
None


 Subtyping 
Data
Hierarchy of types
Common parent


 Refined 
  Dependent 
Data
Type + logical condition
Disjunction of conditions


Data flow
Data
Per analysis
Per analysis


 Hoare logic
Code blocks
 Hoare triples: pre + postcondition


Model checking
 Traces
 Temporal logic 
 Finite state automata
 

Non-standard Interpreters

We usually think of static analyses as something separate from regular code execution. There is one way to link them: interpreters which work on the abstract elements of the analysis instead of the concrete values. Let me use a type system as an example, in a regular execution starting with value 35, the trace looks as follows.
cantalupe(35) {
  y = 35 * 2    // y = 70
  return y + 1  // return 71
}
Now, instead of using 35 as argument, we are going to use Int as argument. Here Int is really the type itself, an abstract element taken from the types domain. The operators * and + also have to be abstracted: now they both take Int arguments to the Int abstract element.
cantalupe(Int) {
  y = Int * 2   // y = Int
  return y + 1  // return Int
}
This is what we call a non-standard interpreter: it is executing the code in a different way that "regular" execution does. We use the word "interpreter" here simply because it's customary to use "execution" only for "actual" execution in the machine.
The same trick works with other analyses. Take the example of data flow analysis of usage; here we create a small graph of the variables, in which an edge is created whenever a variable uses another one. Once we reach a return block we can traverse the graph to find all those variables which have been used.
cantalupe({ x }) {
  y = x * 2     // { y -> x }
  return y + 1  // { result -> y -> x }
}
Symbolic execution is a very powerful technique to develop non-standard intepreters. Instead of regular values or abstract elements, this technique uses symbolic variables which represent all possible arguments. That way a description of the function or code block can be obtained.
cantalupe("x") {
  y = "x" * 2   // y = "x * 2"
  return y + 1  // return "(x * 2) + 1"
}
Formalization as Abstract Interpretation

We need to dress our formal attire to better pinpoint the commonalities among all of these techniques. As mentioned above, we are going to wear our abstract interpretation hat, although we won't go very deep into details. As many others in computer science, abstract interpretation is an old idea, pioneered in the 1970s by Cousot and Cousot, who also maintain a huge list of links to resources.
The core idea is the following: every analysis ties concrete values C and abstract values A together. For example, our concrete values could be the set of integer numbers, and our abstract values a subset of { ↑, 0, + } which indicate whether we have negative numbers, zero, or positive numbers. The relation between C and A is formalized by two functions:


The abstraction function, conventially called α, moves from concrete to abstract. Some examples of our abstraction function in action:
α({1, 2, 3, 4}) = { ↑ }
α({-1, 0})      = { ↓, 0 }


The concretization function, conventially called γ, going in the other direction.
γ({ ↑ }) = {1, 2, 3, ...}      (all positive numbers)
γ({ 0 }) = {0}
γ({ ↓, 0 }) = {0, -1, -2, ...} (zero + all negative numbers)


The combination of the two functions allows us to give a sharp definition of the more blurry idea of "analysis lose some information". Take one concrete set of values {1, 2}, if we abstract those we get {↑}, and if then we concretize we have the (much larger) set {1, 2, 3, ...}. This is one of the definition properties of an analysis:
for all v in C, v ⊆ γ(α(v))  (v is a subset of γ(α(v)))

Abstraction and concretization also provide a recipe to abstract every single operation in your program, effectively telling you how to build a non-standard interpreter. Let's take for example the + operation, in order to get the result working on the abstract values { ↑ } and { ↑, 0 }:


we first concretize the elements,
γ({ ↑ })    = {1, 2, 3, ...}
γ({ ↑, 0 }) = {0, 1, 2, 3, ...}


we then apply the regular + operation on every combination of arguments,
result = {1 + 0, 1 + 1, 1 + 2, ..., 2 + 1, 2 + 2, ...}
       = {1, 2, 3, ...}


we then abstract the resulting values, so we get
abstract_+({ ↑ }, { ↑, 0 }) = { ↑ }


Note that this is usually not done by tools. Instead, the designer of the tool performs uses this abstraction machinery on some primitive elements, which are used to bootstrap the whole analysis.
There is one piece I introduced before but not mentioned with respect to abstract interpretation: join. Such an operation is needed when a certain code block has more than path of execution. Each path contributes some set of values to the final possibilities, each represented by an abstract value, the complete result is thus the join of all of them. In our example, if one path says we get { ↑ } and the other says { 0 }, the result of all possibilities is abstracted by { 0, ↑ }.
The great feature of abstract interpretation is how independent it is of the particular concrete and abstract values. You can do abstract interpretation over all possible traces of execution, leaving you with something similar to model checking. You can abstract using types, regaining type systems. Have I mention that abstract interpretation can even infer your data type definitions?
This has been really handwavy, I know. If you want to deep dive, you can look for one of the many courses about abstract interpretation lying around in the web, but my personal recommendation is to read Principles of Program Analysis. There is another new book which looks promising, but I have not read it. Just a suggestion from a friend: don't try to read the Cousot's papers, at least not at first. They are very very dense!
Galois and Categories

These are just random thoughts which might be interested for people who like category theory or abstract algebra. Feel free to skip it.
To make the whole abstract interpretation framework work, we actually need a second condition on the abstraction-concretization pair of functions.
for all v in C, v ⊆ γ(α(v))
for all x in A, α(γ(x)) ⊆ x

In many cases, the second condition is made stronger as α(γ(x)) = x, which means that if you concretize and then abstract, you are back were you started. Any two functions which satisfy these two conditions form a so-called Galois connection. This is the place in which mathematics is injected into abstract interpretation: having Galois connections as ground allows us to re-use many of the theorems coming from a branch called order theory.
Category theorists may have also raised an eyebrow, since the two Galois conditions are very similar to one of the ways of describing an adjunction. In fact, a Galois connection is nothing else than an adjunction in the specific case in which our concrete and abstract values for a partially-ordered set!
SMT Solvers

There are lots of mentions to "logic formulae" in the previous descriptions. Refinement types attach them to types, Hoare logic uses them as components of their triples, and so on. In every case we need some algorithm which is able to declare that having x > 0 and x = 0 is false or inconsistent, which would translate to type errors, unreachable states, or other output depending on the analysis
One of the best developments of the last decade in computer science is the widespread availability of SMT solvers, Z3 being maybe the best-known of all. Those systems are simply awesome at "small-scale" logic; they are in fact a combination of many very specific solvers for different areas: arithmetic, vectors, bit manipulation, and so on. Nowadays the barrier to entry of a new analysis is much lower, since many problems can be simply outsourced to SMT solvers.
Technique	Concrete	Abstract	Join
Type systems	Data	Types	None
Subtyping	Data	Hierarchy of types	Common parent
Refined Dependent	Data	Type + logical condition	Disjunction of conditions
Data flow	Data	Per analysis	Per analysis
Hoare logic	Code blocks	Hoare triples: pre + postcondition
Model checking	Traces	Temporal logic Finite state automata