
@watersofoblivion
Created June 17, 2024 18:14
k-CFA and Safe for Space Complexity Closure Conversion

CFA and the State Explosion Problem

I've read (among other things) both Shivers's dissertation on CFA and Might's dissertation on extending CFA, as well as Might's m-CFA paper. In Shivers's dissertation, he gets the naive exponential CFA down to polynomial by using a single-threaded store. In Might's dissertation, he extends this with various analyses on CFA environments, such as Gamma CFA, where you garbage-collect dead elements of the environment to reduce processing time. Then, in the m-CFA paper, he shows that you can actually produce a family of polynomial CFA algorithms by using what is tantamount to flat closure conversion instead of linked closure conversion.
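
To make the state-explosion point concrete, here's a minimal sketch of the difference between every abstract state carrying its own store and all states sharing one global, monotonically widened store. The names and encoding are my own, not Shivers's or Might's exact formulation:

```ocaml
type var = string
type label = int                        (* call-site / lambda label *)
type context = label list               (* the last k call sites *)
type addr = var * context               (* abstract address *)

module AddrMap = Map.Make (struct type t = addr let compare = compare end)

type avalue = Clo of label * context    (* abstract closure: lambda + its context *)
module AValSet = Set.Make (struct type t = avalue let compare = compare end)

type store = AValSet.t AddrMap.t

(* Naive k-CFA: each state carries its own store, so the number of distinct
   states is bounded by the number of distinct stores -- exponential. *)
type naive_state = { n_expr : label; n_ctx : context; n_store : store }

(* Single-threaded store: states only carry expression and context; there is
   one store, widened monotonically, shared by the whole analysis. *)
type widened_state = { w_expr : label; w_ctx : context }

let global_store : store ref = ref AddrMap.empty

let join (a : addr) (v : avalue) =
  let old = Option.value ~default:AValSet.empty (AddrMap.find_opt a !global_store) in
  global_store := AddrMap.add a (AValSet.add v old) !global_store
```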

The running theme here seems to be that CFA has a state explosion problem, and that memory management is crucial to the runtime of any CFA analysis, so much so that it can affect the algorithmic complexity and not just a constant factor.

Safe for Space Complexity

On the other hand, there are Appel's papers on the Safe for Space Complexity (SSC) rule. He points out that linked closures can cause a space leak, since they keep values reachable (and therefore live to the collector) well after their last use. In this paper, he shows that flat closure conversion is SSC while linked closure conversion is not, nor is stack allocation. (Aside: his canonical example of a space leak is O(n^2). The exponential-case example for CFA is very similar to the exponential-case example for unification. I'm wondering if that same example could be tweaked to show an exponential space leak for linked closures, based on the fact that CFA's state explosion problem is due to keeping values live.)
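
As a rough illustration of the shape of such a leak (not Appel's actual example, and a real OCaml compiler need not leak here; the point is about the hypothetical linked representation): each continuation below needs only one small integer, but a linked environment chain would keep every enclosing frame, including the big dead arrays, reachable until the last closure dies.

```ocaml
(* Sketch only: under flat closure conversion each [k] captures just [x];
   under linked closure conversion each [k] would hold its whole enclosing
   frame chain, keeping every [big] array live long past its last use. *)
let rec build n acc =
  if n = 0 then acc
  else
    let big = Array.make 1_000 n in   (* dead after this iteration under SSC *)
    let x = big.(0) in
    let k () = x in                   (* only [x] is actually needed later *)
    build (n - 1) (k :: acc)

let () =
  let ks = build 100 [] in
  Printf.printf "%d\n" (List.fold_left (fun s k -> s + k ()) 0 ks)
```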

Then, in this paper, Appel introduces Safe for Space Complexity Closure Conversion, which is a linked closure conversion that meets the SSC rule. The gist of the algorithm is that you build the CPS call graph, use T1/T2 interval analysis (I can't find the original paper, but these slides present it) to find nested loops, then build a flat closure at each loop level for just the variables that are scoped to that loop level. The reasoning is that those variables share a common lifetime and become dead at the same time. The "actual" closures that you pass to functions are then a single semi-flat closure: they have one top-level field per loop level with a pointer to the flat closure for that loop level, plus top-level fields for any free variables immediately local to the function not covered by those loop-level closures. (There's more to it than that, and a lot of the paper is about packing these closures into registers instead of the heap a la Orbit, but that's the crux of the idea.)
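
To keep the shapes straight, here's a sketch of the three representations as OCaml record types. The encoding is my own (the paper additionally packs pieces of the semi-flat closure into registers):

```ocaml
type value = Int of int | Closure of ssc_closure

(* Linked: code pointer plus one frame, each frame pointing at its parent;
   access may chase an arbitrary number of parent links. *)
and linked_frame = { slots : value array; parent : linked_frame option }
and linked_closure = { l_code : linked_closure -> value array -> value; l_frame : linked_frame }

(* Flat: code pointer plus a copy of every free variable. *)
and flat_closure = { f_code : flat_closure -> value array -> value; f_free : value array }

(* SSC semi-flat: code pointer, the free variables immediately local to the
   function, and one pointer per enclosing loop level to a flat record of
   that level's variables (shared by everything at that level). *)
and loop_frame = { level_vars : value array }
and ssc_closure = {
  code   : ssc_closure -> value array -> value;
  locals : value array;
  levels : loop_frame array;
}
```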

This has a lot of advantages: it meets the SSC rule, since each loop-level closure is only live during its loop and any nested loops. Closure creation is fast, because it only has to build a closure with slots for the local variables plus one extra slot per enclosing loop level. Closure access is constant time, since every variable is either base + offset or base + offset + offset away. And, from the compilation perspective, you can use the information from the interval analysis to decide which values to put in registers, saving memory accesses and even those minimal address calculations.
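
Continuing the record sketch above, the access costs fall out directly:

```ocaml
(* A locally free variable is one offset away; a loop-level variable is two. *)
let local_var (c : ssc_closure) i = c.locals.(i)                      (* base + offset *)
let level_var (c : ssc_closure) lvl i = c.levels.(lvl).level_vars.(i) (* base + offset + offset *)
```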

SSC-CFA

The SSC result seems to run parallel to the m-CFA result, which leads me to ask if we can rephrase the question: is x-CFA Safe for Space Complexity? For k-CFA, the answer seems to be a resounding "no". For m-CFA, the answer seems to be a qualified "no": stack allocation is not SSC because values can be kept live past their last use, and m-CFA uses the last m stack frames for context. I'll admit to getting a little lost in the formalisms, but if m-CFA is using the stack frames to track variables then it's not SSC, while if it's only using them to track calls (effectively return addresses) then it is. I believe it's the former.

The obvious next question is: if x-CFA is not Safe for Space Complexity, can it be made so? I think the SSC Closure Conversion algorithm would address the k-CFA memory issue better than flat closure conversion. The closures it generates still have the constant-time access that Might points out is critical to the m-CFA analysis being polynomial. He has to use the top m stack frames for analysis because flat closures erase context; if I'm following correctly, SSC closures would retain that context, so the analysis could be done purely in terms of the closure.

The Gamma CFA process could also be done precisely, since the live range of each loop-level closure is known a priori. Plus, within a loop, the portion of the closure from the outer loop levels is invariant; the code being analyzed can't affect those values until it "returns" by exiting the inner loop (calling the continuation). So the analysis for the portions of the closures from the outer loop levels could be performed once before entering a nested loop and then effectively memoized, leaving only the variables within the current loop level needing deeper analysis. (The outer loop variables may still need some analysis, but certain properties may hold constant, making their analysis simpler.)

Finally, flat closure conversion copies shared variables into each closure, meaning they are analyzed as separate variables. SSC Closure Conversion shares as much as possible, so these variables would truly be treated as the same variable. This means their results could propagate farther and/or wouldn't have to be re-computed independently for each closure.
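
Purely as speculation, a loop-level-partitioned abstract environment might look something like this. All names are hypothetical, and summarize and step stand in for whatever transfer functions the analysis actually uses:

```ocaml
module StringMap = Map.Make (String)

type abs_value = Lambdas of int list        (* abstract value: set of lambda labels *)
type level_env = abs_value StringMap.t      (* the variables scoped to one loop level *)

type abs_closure = {
  lam    : int;                (* label of the lambda *)
  locals : level_env;          (* variables at the current loop level *)
  outer  : level_env array;    (* enclosing loop levels; invariant inside the loop *)
}

(* Hypothetical driver: the enclosing levels are summarized once on loop entry
   and reused, while only the current level is iterated to a fixed point
   (assuming [step] is monotone so the iteration terminates). *)
let analyze ~summarize ~step (c : abs_closure) =
  let outer_summaries = Array.map summarize c.outer in   (* computed once, memoized *)
  let rec fixpoint env =
    let env' = step outer_summaries env in
    if StringMap.equal (=) env env' then env else fixpoint env'
  in
  fixpoint c.locals
```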

Conversely, the SSC paper looks like it was written in 1994, which is only three years after Shivers's thesis. While it uses compile-time control and data flow analysis to perform its work, it doesn't actually use k-CFA. This is speculation, but it's possible that k-CFA could feed back into the algorithm and be used to make better decisions. Effectively: run the interval analysis, use it to make SSC abstract closures so you can abstractly interpret the program, then use those results to construct concrete SSC closures so you can execute the concrete program.

Reusing the Interval Analysis

But, just as SSC Closure Conversion uses the interval analysis for more than structuring the closures (it also uses it to prioritize which values should live in registers), I think the interval analysis step buys us more here as well. In this case, I think it gets us a better, dynamic value for k.

To motivate this, imagine you have a 4-node loop in your CFG. Along the loop path, the k value should really be 3, since once you have 4 call sites (the current one plus 3 back) you have an exhaustive view of all the calls in the loop. It may be possible to make stronger guarantees about the analysis when you can prove you have the maximal call history, asserting it reaches some kind of fixed point.
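
A small sketch of that intuition (my own code, not from any of the papers): enumerate the contexts k = 3 produces around a 4-call-site loop and watch the context set saturate.

```ocaml
(* Keep the current call site plus the previous k; count the distinct
   contexts seen around a loop of 4 call sites. *)
let take k xs =
  let rec go n = function
    | [] -> []
    | _ when n = 0 -> []
    | x :: rest -> x :: go (n - 1) rest
  in
  go k xs

let contexts ~k ~loop ~iters =
  let sites = List.concat (List.init iters (fun _ -> loop)) in
  let _, seen =
    List.fold_left
      (fun (hist, seen) site ->
        let hist' = take (k + 1) (site :: hist) in   (* current + k previous *)
        (hist', hist' :: seen))
      ([], []) sites
  in
  List.sort_uniq compare seen

let () =
  let n = List.length (contexts ~k:3 ~loop:["a"; "b"; "c"; "d"] ~iters:10) in
  (* Prints 7: three short warm-up contexts plus the four saturated ones. *)
  Printf.printf "distinct contexts with k=3: %d\n" n
```

The set stops growing after the warm-up prefixes plus one full circuit, no matter how many more iterations run; the warm-up is also where the limited re-processing mentioned below comes from.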

This may also explain some of the results in the various CFA papers, where 1-CFA finds more than 0-CFA but going beyond that doesn't seem to help. If loop sizes have a somewhat bimodal distribution (loops tend to be either tight loops, like iterating over a list, where one previous call site is exhaustive, or large multi-node loops where it isn't, with not much in between), then 1-CFA would find all of those tight loops but not the larger ones. And because the value of k is a global parameter, the algorithm chokes before you can reach a high enough k to climb out of the bimodal valley and reap the benefits on the large loops.

We may be able to use the results of the interval analysis to turn the value of k into a local parameter, transforming it from a sledgehammer into a scalpel.

The idea is to annotate each node/edge in the CFG with its current loop level and a k value that it should use for analysis. The k value would be the size of the current loop, measured in calls. The algorithm could then use that value to locally increase the analysis depth where increasing it would have benefits and reduce it where the depth is not needed. For nodes not in a loop, k would be zero. For nodes in a tight loop, it would be 1. For nodes in the large, 4-node loop, it would be 3. (This may require double processing when changing k values. For example, in an n-node loop, you'd have to make two circuits around the loop to build up the call history. But I believe it's at most double re-processing, not a variable amount of re-processing.)
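
A minimal sketch of that per-node annotation, assuming the interval analysis already tells us each node's innermost loop and how many calls that loop contains (the names here are hypothetical):

```ocaml
type loop_info = { loop_id : int; calls_in_loop : int }   (* from the interval analysis *)
type cfg_node = { node_id : int; innermost : loop_info option }

(* Local k: just enough history to cover the node's innermost loop. *)
let local_k (n : cfg_node) =
  match n.innermost with
  | None -> 0                                  (* straight-line code: 0-CFA locally *)
  | Some l -> max 0 (l.calls_in_loop - 1)      (* tight loop: 1; a 4-call loop: 3 *)
```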

This helps tame the exponential or high-polynomial runtime. When k goes up, we're only processing a subset of the CFG, so the non-linear growth of the algorithm is localized to a small sub-problem; it flares up instead of exploding. And the extra runtime is worth it, because we know we will achieve an exhaustive analysis of those nodes and can make stronger guarantees about the results.

This thinking may only apply to reducible CFGs, but algorithms like node splitting/T3 interval analysis, DJ graphs, or the algorithm described in this paper could transform irreducible regions into reducible ones. It might also be worth investigating replacing the T1/T2 analysis from earlier with one of these for other reasons, as it might provide more usable information overall. (I have not done a deep dive on whether that is the case.) The tradeoff is that it increases the implementation complexity: you would have to compute dominators, then do this more complex interval analysis, then use it to transform the CFG, and only then could you do the CFA analysis. I would argue that it just requires implementors to be selective about where they place their CFA pass. They may already want to compute dominators or do this interval analysis for other reasons (e.g., converting to SSA), so it may not be much extra work. And if they do all this work, they have already done much of the hard work for implementing SSC Closure Conversion (if they're willing to run stackless).

Improving Higher-Order Inlining

It's also possible that the closure sharing of SSC Closure Conversion could benefit other uses of CFA. In this paper, Bergstrom, Fluet, Le, Reppy, and Sandler show a practical variation of Super-Beta higher-order inlining, first introduced in Shivers's dissertation. They perform 0-CFA, then turn the inlining problem into a reachability analysis: is there a path between a closure's creation and its use that passes through a rebinding of any of its free variables?
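
In CFG terms, that question decomposes into two plain reachability queries. A hedged sketch of the check, in my own encoding rather than the paper's formulation:

```ocaml
module IntSet = Set.Make (Int)
module StrSet = Set.Make (String)

type cfg = {
  nodes   : int list;              (* all node ids *)
  succs   : int -> int list;       (* successors of a node *)
  rebinds : int -> StrSet.t;       (* variables (re)bound at a node *)
}

(* Standard worklist reachability over the CFG. *)
let reachable g ~from =
  let rec go visited = function
    | [] -> visited
    | n :: rest ->
        if IntSet.mem n visited then go visited rest
        else go (IntSet.add n visited) (g.succs n @ rest)
  in
  go IntSet.empty [ from ]

(* A free variable can be rebound between creation and use iff some node that
   rebinds it is reachable from the creation site and can itself reach the use. *)
let rebinding_on_path g ~creation ~use ~free_vars =
  let from_creation = reachable g ~from:creation in
  List.exists
    (fun r ->
      IntSet.mem r from_creation
      && not (StrSet.is_empty (StrSet.inter (g.rebinds r) free_vars))
      && IntSet.mem use (reachable g ~from:r))
    g.nodes

(* Inlining is allowed, by this criterion, when no such path exists. *)
let can_inline g ~creation ~use ~free_vars =
  not (rebinding_on_path g ~creation ~use ~free_vars)
```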

By using SSC Closure Conversion, the analysis may be able to use the loop-level invariance to rule out rebindings for some variables and reduce the search space. For example, if a closure doesn't escape upward out of the loop level it is created in, it should not have to worry about variables at enclosing loop levels being rebound.
