A list of papers in programming languages and CS that I want to check out

CS Papers (PL mostly)

Differential program analysis (relational equivalence)

Memory models

Weakest precondition

  • Satish Chandra, Stephen J. Fink, and Manu Sridharan. 2009. Snugglebug: A powerful approach to weakest preconditions. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’09). ACM, New York, NY, 363–374.

  • Ivan Jager and David Brumley. 2010. Efficient Directionless Weakest Preconditions. Technical Report CMU-CyLab-10-002. CyLab, Carnegie Mellon University.

Specifications

Interesting uses of topology

Moy, M., R. Cardona, R. Green, J. Cleveland, A. Hylton, and R. Short (2020). "Path Optimization Sheaves." Preprint. https://arxiv.org/pdf/2012.05974.pdf

The authors build a sheaf P over a graph, which encodes paths through the graph. The graphs are ordinary undirected graphs with a distinguished source and sink, and every vertex other than the source and sink has degree at least 2. To each internal (non-source/sink) vertex, P assigns the set of all pairs of edges incident to it (think of each pair as a way of passing through the vertex: an edge in and an edge out). To the source and the sink, P assigns the set of edges incident to them. So P assigns to each vertex the set of local routing choices available at it. To each edge, P assigns the two-element set {top, bottom}. The restriction maps work like this: for each edge e incident to a vertex v, there is a map from P(v) to P(e) that sends an element of P(v) to top if it involves e (i.e., e is one of the pair, or e is the chosen edge at the source/sink), and to bottom otherwise. In other words, the map records whether e is the "active" edge for a given local choice at v.

The authors next build another sheaf, called DP, which encodes the distance along paths in the graph. The underlying graph is the same, except that each edge is also assigned a cost function, and the functor takes this cost into account as it makes its assignments.

A local section of P (or DP) essentially picks out a fragment of a path in the graph. A global section of P (or DP) essentially picks out a path from source to sink.
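Here is a minimal Python sketch of this setup (my own encoding and naming, not the authors' code), over a small hypothetical graph; the gluing check at the end is how local choices patch together into a path fragment:

```python
# A minimal reconstruction (mine, not the authors' code) of the path sheaf P
# from Moy et al. 2020, over a tiny hypothetical undirected graph.
from itertools import combinations

edges = {('s', 'a'), ('s', 'b'), ('a', 'b'), ('a', 't'), ('b', 't')}
source, sink = 's', 't'

def incident(v):
    """Edges attached to vertex v."""
    return {e for e in edges if v in e}

def stalk(v):
    """P(v): incident edges at the source/sink; pairs of incident edges
    (an edge in, an edge out) at every other vertex."""
    if v in (source, sink):
        return incident(v)
    return set(combinations(sorted(incident(v)), 2))

def restrict(v, e, choice):
    """Restriction map P(v) -> P(e) = {'top', 'bottom'}: 'top' iff edge e
    participates in the local routing choice made at v."""
    active = {choice} if v in (source, sink) else set(choice)
    return 'top' if e in active else 'bottom'

def agree(e, choice_u, choice_v):
    """The gluing condition over an edge e = (u, v): both endpoints'
    choices must restrict to the same value in P(e)."""
    u, v = e
    return restrict(u, e, choice_u) == restrict(v, e, choice_v)

# Local choices tracing the path s -> a -> t; they agree over the shared
# edges, so they glue into a section fragment picking out that path.
choice_s, choice_a, choice_t = ('s', 'a'), (('a', 't'), ('s', 'a')), ('a', 't')
assert agree(('s', 'a'), choice_s, choice_a)
assert agree(('a', 't'), choice_a, choice_t)
```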

The authors then reconstruct Dijkstra's algorithm for finding the shortest path through a graph. The strategy is to start with local sections (path fragments) and systematically build them up into a global section (a full path). So, we are using sheaves to patch together a bigger picture, starting from smaller pieces.

There is also a sheaf morphism from DP to P, where DP effectively forgets its extra structure. This correspondence is important, because it lets us know that any paths we trace in DP correspond to paths in P, which of course correspond to paths in the graph.

Pnueli, A., M. Siegel, and E. Singerman (1998). "Translation Validation." In Tools and Algorithms for Construction and Analysis of Systems, TACAS ’98, volume 1384 of LNCS, pages 151–166.

This is the original translation validation (TV) paper.

Tate, R., M. Stepp, Z. Tatlock, and S. Lerner. "Equality Saturation: A New Approach to Optimization." Logical Methods in Computer Science, Vol. 7 (1:10) 2011, pp. 1–37.

This is a novel approach to compiler optimizations, basically using e-graphs.

Weise, D., R. Crew, M. Ernst, and B. Steensgaard. "Value Dependence Graphs: Representation without Taxation."

This paper describes the useful value dependence graph representation of a program (which is used by Tristan et al 2011, see below).

Tristan, J-B., P. Govereau, and G. Morrisett (2011). "Evaluating Value-Graph Translation Validation for LLVM." In PLDI'11.

The authors convert the two program fragments into value dependence graphs, then they normalize them with maximal sharing.

The authors first convert each function into what they call a "Monadic Gated SSA form," which essentially attaches a memory state variable to every assembly instruction (it's like wrapping each instruction in a state monad). This makes the assembly instructions referentially transparent, and it allows substitution of sub-graphs without having to worry about global aliasing and such. Phi nodes also become explicit, mu-nodes are added at loop headers, and eta-nodes at loop exits.

Then, the authors take those Monadic Gated SSA form graphs and merge them by computing a shared value-graph. To do this, they replace variables with their definitions and maximize sharing. Then they repeatedly normalize the graph and maximize sharing until the two functions merge into a single node or no further normalization applies. At the end, you can see whether the two functions reduce to the same value graph, at which point you can conclude that the optimized function is equivalent to the original.
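As a toy illustration of the "maximal sharing" step (my own sketch, not the paper's representation or algorithm), hash-consing interns structurally equal subterms so that the same expression built from either function becomes one shared node:

```python
# A toy illustration (mine, not the paper's representation) of "maximal
# sharing": hash-consing interns structurally equal subterms, so the same
# expression built from either function becomes one shared node.
_table = {}

def node(op, *children):
    """Return the unique interned node for (op, children)."""
    key = (op, children)
    if key not in _table:
        _table[key] = key        # a node here is just its interned key
    return _table[key]

# x + (y * 2), built once per "function"; sharing collapses them to one node
e1 = node('+', node('var', 'x'), node('*', node('var', 'y'), node('const', 2)))
e2 = node('+', node('var', 'x'), node('*', node('var', 'y'), node('const', 2)))
assert e1 is e2                  # both functions share the same value-graph node
```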

(The idea in this paper reminds me of something I was pondering the other day: if you calculate the data flow graph of the two functions, can you then do persistent homology to find the "holes" and "shape" of the data flow space? Could we identify loop invariants that way? Could we identify whether the data flows of the two functions are topologically equivalent in a meaningful sense?)

Godlin, Benny and Ofer Strichman (2013). "Regression Verification: Proving the Equivalence of Similar Programs." In Software Testing, Verification and Reliability 23: 241–258.

The authors argue that instead of trying to prove the equivalence of two fully different programs, i.e., the original program vs the fully updated and modified program, it seems easier to prove the equivalence of incrementally similar programs. So, don't wait until the refactor is over. Check for equivalence with each tiny change. Prove the equivalence of "successive, closely related versions of the program" (241).

These authors coin the term "regression verification," and they connect it to "change impact analysis": verifying the equivalence of a changed program shows that no regressions have been introduced, and doing so can also reveal what has changed about the program, and hence which parts of it need to be tested. The paper gives formal definitions of both notions.

Formally, the authors say that, given two programs (each consisting of a set of functions) and a definition of function equivalence, "regression verification" is the problem of finding a map that pairs equivalent functions in the two programs.

They define two functions f1 and f2 to be "partially equivalent" (p-equiv) if any two terminating executions of f1 and f2 on the same inputs emit the same outputs.

The authors propose a rule for recursive function equivalence called PROC-P-EQ, which essentially says: if assuming that the recursive calls are partially equivalent enables us to prove that the bodies are partially equivalent, then the functions themselves are partially equivalent. (This has the flavor of Hoare's REC rule.) The authors then simplify the rule further by replacing the recursive calls in the premise with calls to the same uninterpreted function.
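A minimal sketch of that trick with Z3's Python bindings (my own toy functions, not the paper's): both hypothetical bodies sum the integers 0..n but add the operands in different orders, and the recursive call in each is replaced by the same uninterpreted function U:

```python
# A minimal sketch of the PROC-P-EQ idea using Z3's Python bindings.
# The recursive calls in both bodies are replaced by the SAME uninterpreted
# function U, and the solver checks that the bodies then agree.
from z3 import Int, IntSort, Function, If, Solver, Not, unsat

n = Int('n')
U = Function('U', IntSort(), IntSort())    # stands in for the recursive call

body1 = If(n <= 0, 0, n + U(n - 1))        # f1(n) = n + f1(n - 1)
body2 = If(n <= 0, 0, U(n - 1) + n)        # f2(n) = f2(n - 1) + n

s = Solver()
s.add(Not(body1 == body2))                 # look for a distinguishing input
assert s.check() == unsat                  # none exists: bodies are p-equivalent
```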

The authors then claim that since every loop can be replaced with a recursive function and every recursive function can be replaced with an uninterpreted function, the premise of this rule is decidable for languages with finite domains such as C. [Is that really true?]

The RVT tool, developed as part of Godlin's master's thesis (see below), does some preprocessing on C programs. First, it replaces loops with recursive functions. Second, it constructs a mapping between the two programs (it maps global variables to global variables, and functions to functions).

Next, RVT decomposes the programs using the decomposition algorithm described in section 3.3. It traverses the call graph bottom-up, checking each mapped pair of functions as it walks up the graph, and it outputs a list of which function pairs are equivalent and which are not.
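A rough sketch of that bottom-up traversal (my own rendering, assuming an acyclic call graph and a black-box `check_pair` procedure; this is not RVT's actual code):

```python
# A sketch (mine, not RVT's code) of the bottom-up decomposition: walk the
# call graph leaves-first, check each mapped pair of functions, and record
# which pairs were proven equivalent so later checks can assume them.
def check_bottom_up(call_graph, mapping, check_pair):
    """call_graph: {fn: iterable of callees} for program 1 (assumed acyclic
    once loops have become recursion and recursion is abstracted);
    mapping: {fn in program 1: corresponding fn in program 2};
    check_pair(f, g, proven): hypothetical callback that tries to prove
    f ~ g, treating callee pairs already in `proven` as equal
    uninterpreted functions."""
    proven, order, seen = set(), [], set()

    def visit(f):                       # post-order: callees come first
        if f in seen:
            return
        seen.add(f)
        for callee in call_graph.get(f, ()):
            visit(callee)
        order.append(f)

    for f in mapping:
        visit(f)
    for f in order:
        if f in mapping and check_pair(f, mapping[f], proven):
            proven.add((f, mapping[f]))
    return proven
```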

Godlin, Benny and Ofer Strichman (2008). "Inference Rules for Proving the Equivalence of Recursive Procedures." Acta Informatica 45: 403–439.

This has much more theory and semantics than the above paper. Proofs and descriptions of the inference rules are here. However, in the above paper, the authors present the ideas more in a Hoare-style, and they believe it is clearer that way.

Godlin, Benny (2008). "Regression Verification: Theoretical and Implementation Aspects." Master’s Thesis, Technion, Israel Institute of Technology.

Here Benny Godlin describes previous literature and the RVT (Regression Verification Tool), as well as some of the theory in the above paper.

Kawaguchi M., S. K. Lahiri, and H. Rebelo (2010). "Conditional equivalence." Technical Report MSR-TR-2010-119, Microsoft Research.

This is the SymDiff tool, which allegedly was based on RVT (see Godlin 2008).

Lerner, S., T. Millstein, E. Rice, and C. Chambers. "Automated Soundness Proofs for Dataflow Analyses and Transformations via Local Rules."

This is the Rhodium paper (see Sorin Lerner's dissertation for the original work on Rhodium).

In prior work, the authors developed Cobalt, a DSL that lets you write optimizations and then check automatically that they are sound (where an optimization is sound if it preserves the semantics of the program it optimizes). The check is done by converting the optimization written in the DSL into a set of constraints and asking a solver to solve it. If the check succeeds, the optimization is guaranteed to preserve the semantics of any program it is applied to.

But the authors found Cobalt too limiting to extend to bigger programs, so they developed Rhodium. The key technical improvement in Rhodium is that dataflow facts are made explicit, and a separate, extensible set of local propagation and transformation rules allows Rhodium to reason about modifications to the dataflow.

Each dataflow fact is given a semantic meaning, in the form of a predicate over program states (i.e., a kind of post-condition). A solver then discharges a soundness lemma for each predicate (i.e., each dataflow fact), with respect to the program's semantics.
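As a toy rendering of "a fact is a predicate over states" (my own example, not Rhodium syntax), here is one such lemma discharged with Z3: the fact that x is the constant 5 survives an assignment to y, because the assignment does not touch x:

```python
# Toy soundness lemma (mine, not Rhodium's): the dataflow fact "x is the
# constant 5" is preserved across the statement  y := e,  for any e.
from z3 import Ints, Implies, Not, Solver, unsat

x, y, e = Ints('x y e')
x_after, y_after = x, e            # semantics of y := e (x is unchanged)

lemma = Implies(x == 5, x_after == 5)   # fact before  =>  fact after

s = Solver()
s.add(Not(lemma))                  # search for a violating state
assert s.check() == unsat          # none exists: the local rule is sound
```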

The authors prove (with a proof by hand) that altogether, these individual soundness proofs ensure the global soundness of the optimization. Note: the proof by hand here is based on abstract interpretation.

One nice feature of Rhodium is that it is compositional. You can compose analyses and transformations.

Mechtaev, S., J. Yi, and A. Roychoudhury (2016). "Angelix: Scalable Multiline Program Patch Synthesis via Symbolic Analysis." In Proceedings of the 38th International Conference on Software Engineering. ACM, New York, NY, pp. 691-701.

This is the Angelix paper.

Nguyen, H., D. Qi, A. Roychoudhury, and S. Chandra (2013). "SemFix: Program Repair via Semantic Analysis." In Proceedings of the 35th International Conference on Software Engineering (ICSE’13). IEEE, Los Alamitos, CA, 772–781.

This is the SemFix paper. The tool aims to do better than syntax-based automatic program repair tools, so it opts for a "semantics-based" (i.e., constraint-based) strategy. Like syntactic APR tools, it takes some code and a suite of tests. For a failing test, it uses a statistical fault isolation tool to get a ranked suspicion report of statements that are likely causing the bug. Then, going through the ranked list one statement at a time, it extracts a correct specification (constraint) for the buggy statement. Finally, it synthesizes an expression that satisfies the extracted spec (constraint).

To find the constraint, SemFix uses symbolic execution. Basically, SemFix starts executing at the start of the function with the concrete inputs of the failing test, until it gets to the suspicious statement. Then it replaces the live variables with symbolic values, and it begins symbolic execution. It walks all the paths that it can to the end of the function, building up the symbolic values of the live variables, and collecting the path constraints that sent it down each path. Then, it looks at the return values of the function, and it picks the one that passes the test. Then, it pulls out the path constraints that were required to get to that point in the program.

SemFix repeats this for all failing tests, thereby collecting a whole set of path constraints that should get the function to the particular end-point that passes all tests.

Then, SemFix attempts to solve this set of constraints, and if it can, it synthesizes a new statement from the solution, and it patches it in.
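A cut-down toy of that last step (my own illustration, not SemFix's component-based synthesis engine): suppose the collected constraints say what value the suspicious right-hand side must take on each test input, and we fit the hypothetical template a*x + b to all of them at once:

```python
# A toy of the final step (not SemFix's actual engine): the collected
# constraints say what value the suspicious right-hand side must produce
# for each test input; we fit the template a*x + b to all of them.
from z3 import Ints, Solver, sat

a, b = Ints('a b')
tests = [(1, 3), (2, 5), (4, 9)]      # (test input x, value the RHS must take)

s = Solver()
for x_val, required in tests:
    s.add(a * x_val + b == required)

if s.check() == sat:
    m = s.model()
    print(f"patch: replace the RHS with {m[a]}*x + {m[b]}")   # 2*x + 1
```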

Le, X-B., D-H. Chu, D. Lo, C. Le Goues, and W. Visser (2017). "S3: Syntax- and Semantic-Guided Repair Synthesis via Programming by Examples." In Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Paderborn, Germany, September 4–8, 2017 (ESEC/FSE’17), 12 pages.

The authors point out that overfitting is a problem in APR. In particular, semantics-based APR tools have trouble synthesizing a good patch because they have a sparse search space. S3 is intended to improve on this problem by expanding the search space, thereby giving the tool the opportunity to synthesize higher quality patches.

Much like SemFix, S3 first localizes the problem to some line of code, then it uses symbolic execution to collect path constraints that will lead the function to pass all failing tests. It then solves those path constraints, yielding a set of concrete values that will pass the tests.

Then, S3 enters the synthesis stage. For this, it begins with a sketch that will help it synthesize better patches. What sketch does it start with? The buggy statement under repair! Then, S3 constructs a set of solutions that are the same size as the original sketch, and then it ranks them by a number of factors. If it can't find a solution that's the same size as the original sketch, it tries for solutions incrementally bigger.
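A toy sketch of that ranking idea (my reading of the strategy, not S3's implementation or its actual ranking features): among candidate conditions of the same token length as the buggy sketch, prefer the one that changes the fewest tokens:

```python
# A toy ranking (mine, not S3's): among candidates the same size as the
# buggy sketch, prefer the smallest syntactic change to the original.
buggy = ['a', '<', 'b']                              # hypothetical buggy condition
candidates = [['a', '<=', 'b'], ['a', '<', 'c'], ['x', '>=', 'y']]

def token_changes(candidate, original):
    """Number of positions where the candidate differs from the sketch."""
    return sum(1 for c, o in zip(candidate, original) if c != o)

ranked = sorted(candidates, key=lambda c: token_changes(c, buggy))
print(ranked[0])      # ['a', '<=', 'b']: the smallest syntactic change wins
```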

Le Goues, C., M. Pradel, and A. Roychoudhury (2019). "Automated Program Repair." Communications of the ACM (CACM).

The authors survey automated program repair techniques.

They describe (p. 1b) how the "specification" is often a test suite: failing tests indicate a bug to be fixed, and passing tests indicate behavior that should be preserved. The end goal is a set of program changes (typically to source code) that leads to all tests passing.

The authors note (p. 1b) that the "grand challenge" in APR research is the problem of "weak specifications." Formal specs aren't usually available, so APR is usually driven by weaker specs, e.g., a suite of tests. The consequence is that generated patches may over-fit, and may not generalize outside the test-suite.

Smith, E., E. Barr, C. Le Goues, and Y. Brun (2015). "Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair." In ESEC/FSE'15.

A study on overfitting in APR.

Le, X. B. D., F. Thung, D. Lo, and C. Le Goues (2018). "Overfitting in Semantics-based Automated Program Repair." In Empir. Software Eng.

A study of "semantics-based" (i.e., constraint-based) overfitting in APR.

Xiong, Y., X. Liu, M. Zeng, L. Zhang, and G. Huang (2018). "Identifying Patch Correctness in Test-based Program Repair." In Proceedings of the International Conference on Software Engineering (ICSE’18). ACM, New York, NY, 789–799.

Suppose we have a test t that passes (call this test run t1). Then we apply a patch p, and we run the test t again (call this test run t2). Now, if the patch p is correct, then we would expect not only that t2 passes (just as it did before), but we also expect t2 to run similarly to the way t1 ran. In other words, when you apply a correct patch, we expect that the correct patch doesn't mess with the behavior of other correct functionality. Indeed, one might say that part of what makes the patch "correct" is precisely its non-interference with other behaviors.

Similarly, we expect a failing test to behave differently after a correct patch is applied.

Next, suppose that we automatically generate some new test, call it w, and we want to know if it should be a test that passes or fails. How do we know? Well, if it should be a passing test, we would expect that it behaves similarly to other passing tests, and if it should be a failing test, it should behave similarly to failing tests. One can use this observation to make an educated guess about whether a newly generated test should be classified as a passing or failing test.

How do these authors measure execution similarity? They compare the "complete-path spectrum" of the two executions, which is basically the sequence of executed statement IDs. So, presumably, two executions are maximally similar if they have the same sequence of executed statement IDs.

This, of course, will fail to identify patches that move statements around, or put things in a different order.
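For concreteness (my own toy, not the paper's actual spectrum distance), one cheap way to compare two such statement-ID sequences is an off-the-shelf sequence-similarity ratio:

```python
# A toy comparison of two executions' statement-ID traces (not the paper's
# actual spectrum-based distance): a ratio of 1.0 means identical traces.
from difflib import SequenceMatcher

trace_before = [1, 2, 3, 5, 6, 7]   # hypothetical run of passing test t, pre-patch
trace_after  = [1, 2, 3, 5, 6, 7]   # the same test after the patch is applied

similarity = SequenceMatcher(None, trace_before, trace_after).ratio()
print(similarity)                   # 1.0: the passing test behaves as it did before
```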

Gao, X., B. Wang, G. Duck, R. Ji, Y. Xiong, and A. Roychoudhury (2021). "Beyond Tests: Program Vulnerability Repair via Crash Constraint Extraction." In ACM Trans. Softw. Eng. Methodol., Vol. 30, No. 2, Article 14, 27 pages.

ExtractFix uses templates to match a small set of vulnerabilities, to construct constraints, find a fix location, and then CEGIS a patch in. ExtractFix is much like VIBES, but at the source level, and it is restricted to the templates that it can match.

One problem (noted by the authors of the ExtractFix paper) is that they perform backward propagation of the constraints via symbolic execution rather than true weakest precondition analysis, and as a consequence, they can miss some paths between the fix location and the crash site.

Another problem (also noted by the authors) is that ExtractFix has a limited number of templates that it can match. Sure, you can add more templates to extend it, but it's not possible to write a template for every possible kind of crash.

I feel like ExtractFix is kind of cheating. It uses a small number of hand-written templates that represent the underlying problem of the crash, and so it is no surprise that the constraints they generate from that yield higher quality patches. Other APR tools do not assume a set of pre-set templates to find. Other APR tools are trying to guess the right patch, given nothing but failing tests.

So, ExtractFix isn't really "beating them" at their own game, because it's not playing their game.

Benton, N. (2004). "Simple Relational Correctness Proofs for Static Analyses and Program Transformations." In POPL'04.

Benton begins by pointing out that static analyses are often proved correct. However, compilers (say) use those analyses to then make a transformation of a program (e.g., to optimize it). And, says Benton, these transformations are not usually checked for soundness.

Benton notes that to prove the correctness of an analysis-based transformation of this sort, we wish to establish that, "given the results of a static analysis, the original program and the transformed program are observationally equivalent."

Why is this hard?

The first reason for its difficulty: in a transformation, we replace some part P of a larger program C[P] with a new version P' that is not observationally equivalent to P on its own (i.e., not P ~ P'), even though in the context of C it is (i.e., C[P] ~ C[P']).
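A tiny illustration of that situation (mine, not Benton's example): P and P' disagree in general, but agree inside a context C that always supplies x = 0, which is exactly the kind of thing a static analysis might establish:

```python
# A tiny illustration (mine, not Benton's example) of contextual equivalence:
# P and P' disagree in general, but agree inside the context C that always
# supplies x = 0.
def P(x):
    return x * 2            # original fragment

def P_prime(x):
    return 0                # "optimized" fragment

def C(fragment):
    x = 0                   # the analysis knows x == 0 at this program point
    return fragment(x)

assert P(3) != P_prime(3)   # not observationally equivalent on their own
assert C(P) == C(P_prime)   # but equivalent inside the context C
```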

The second reason for its difficulty: program analyses are often described in a very "intensional" way, with phrases like "reaches this program point" or "where a variable was last assigned." These ideas have no real counterpart in operational or denotational semantics, so the analysis is often "proved correct" by means of a (somewhat bogus) instrumented semantics that tracks the extra information it needs. Unfortunately, the instrumented semantics has a weaker equational theory than the original one, and it's not even clear how (precisely) the instrumented semantics is actually tied to the original semantics anyway.

Benton argues in this paper that both difficulties can be overcome with some elementary ideas from formal PL semantics.

Difficulty 1 can be overcome, says Benton, by treating all typing judgments (and in fact all abstract properties) as relations, and then giving rules to derive typed equations in context. Transformations can be described in terms of those rules.

Difficulty 2 can be overcome by common, standard extensional analyses.

In the paper, Benton first constructs a simple system that he calls DDCC (Dependency, Dead Code, and Constants). It is a type system for while-programs that tracks dependency and constant information. The goal is to derive type equalities between expressions and commands.

Benton gives a number of rules for DDCC which essentially allow us to make judgments about expressions (which say that two expressions are equivalent in a given state type) and commands (which say that two commands are equivalent in a given state type).

Transformations can be proven correct by using the DDCC rules to construct proof trees to show that the two programs are equivalent, much in the style of natural deduction.

Benton shows that this set of rules gives us the ability to prove the correctness of dead code elimination, constant folding, and known branches optimizations.

(The author also points out that a secure information flow type system can be constructed here too.)

However, the simple DDCC system that Benton presents cannot cover everything. To address that, Benton introduces a relational Hoare logic (RHL). It is not a predicate transformer, but rather a relation transformer.

Benton points out that RHL is powerful, but not suitable for direct implementation in a compiler. But it can be used as a framework in which to develop simpler logics. So, one can embed a simpler logic inside RHL and implement those simpler logics directly.

Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2015. DirectFix: Looking for simple program repairs. In Proceedings of the 37th International Conference on Software Engineering (ICSE’15). IEEE, Los Alamitos, CA, 448–458.

This is the "DirectFix" paper, a predecessor to Angelix, I think.

Shubhani Gupta, Abhishek Rose, and Sorav Bansal. 2020. Counterexample-Guided Correlation Algorithm for Translation Validation. Proc. ACM Program. Lang. 4, OOPSLA, Article 221 (November 2020), 29 pages.

Zhen Huang, David Lie, Gang Tan, and Trent Jaeger. 2019. Using safety properties to generate vulnerability patches. In Proceedings of the 40th IEEE Symposium on Security and Privacy (S&P’19). 539–554.

The authors present a tool called Senx. Users can write a series of "properties" (like policies or specifications of safety properties), and Senx will then use these properties to do APR. This is kind of like the ExtractFix tool: the user is basically providing the specs up front, and then the APR tool uses those specs to go into your code and find violations, at which point it automatically repairs them.

Since the specifications are provided up front, I suspect the quality of these patches is higher than the usual semantics-based APR that relies on test cases for its specifications.

Qi Xin and Steven P. Reiss. 2017. Identifying test-suite-overfitted patches through test case generation. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’17). ACM, New York, NY, 226–236.

The authors present a tool called DiffTGen, which essentially generates new test cases, and it uses those to identify when a patch is an overfitting patch.

DiffTGen produces test cases that show a patched program to be overfitting. It works like this:

  • For input it takes: a buggy program, a patched program, a set of syntactic differences between them, and an oracle (e.g., a human-approved fixed version of the program).
  • For output it produces: a test case that shows that the patch is overfitting, or no output at all if no such test cases can be found.

The tool does its work by generating tests that exercise the syntactic differences. If the patched program fails the test, then they know it is overfitting. To get a correct value, DiffTGen can consult the oracle (which may not know the answer).

Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise condition synthesis for program repair. In Proceedings of the International Conference on Software Engineering (ICSE’17). IEEE, Los Alamitos, CA, 416–426.

The authors of this tool focus on producing better constraints/conditions for patches. Usually, program patches end up being a new "if" statement that causes the program to go down a different path than the buggy program did. So, the authors recognize that the key to a set of good constraints is to identify which variables should really be used in the synthesized "if" statement, and which predicates should really be used to test those variables. So, they basically collect information about which variables are available and which predicates seem to matter most, and they rank them by their likelihood of fixing the problem without overfitting.

There are three things the authors do to glean information about variables and predicates and how they should be ranked:

  • they do a dependency analysis on the variables
  • they mine documentation to get better knowledge of which variables should be used
  • they mine other projects that have similar fixes, and look at the predicates used in those cases

Jinqiu Yang, Alexey Zhikhartsev, Yuefei Liu, and Lin Tan. 2017. Better test cases for better automated program repair. In Proceedings of the 11th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’17). ACM, New York, NY, 831–841.

These authors attempt to enhance heuristic-based APR tools (also called "generate-and-validate" tools).

They use fuzzing to generate more test cases. They also use two oracles to help enhance their tests (a crash-safety oracle and a memory-safety oracle). To decide how much a patch overfits, they propose their own metric, which they call an O-measure (it essentially compares how many tests pass and fail on the patched program versus the buggy one). The basic idea behind the O-measure is that a patch should not make more tests fail than the buggy program does (the patch should be at least as good as the buggy program).

Kui Liu, Anil Koyuncu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein, and Yves Le Traon. 2019. You cannot fix what you cannot find! An investigation of fault localization bias in benchmarking automated program repair systems. In Proceedings of the 2019 12th IEEE Conference on Software Testing, Validation, and Verification (ICST’19). IEEE, Los Alamitos, CA, 102–113.

When an APR tool tries to fix a bug, it must identify where the bug actually occurs. This process is called fault localization (FL for short). Often it is done by statistical analysis, but sometimes it is done by, say, a weakest-precondition style of backwards analysis. In this paper, the authors conduct an empirical study of how good FL is in state-of-the-art APR tools, and how much the quality of FL affects the quality of the tools that use it.

Rohan Bavishi, Hiroaki Yoshida, and Mukul R. Prasad. 2019. Phoenix: Automated data-driven synthesis of repairs for static analysis violations. In Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’19). ACM, New York, NY, 613–624.

Phoenix basically looks at static analysis results in the wild, and learns good fixes from that. Phoenix is thus an interesting kind of APR tool because it does not need a test suite. It learns what is "correct" not from test suites, but from static analyses out there in the wild.
