0xdevalias/fingerprinting-minified-javascript-libraries-ast-fingerprinting-source-code-similarity-etc.md


      
    Fingerprinting Minified JavaScript Libraries / AST Fingerprinting / Source Code Similarity / Etc

Some notes and tools on fingerprinting minified JavaScript libraries, AST fingerprinting, source code similarity, etc.

Table of Contents


Original Notes

Link Dump 1
Link Dump 2
Link Dump 3


Unsorted
See Also

My Other Related Deepdive Gist's and Projects


Original Notes

This gist was created because there was too much content on this topic to keep tacking onto my older gist on Deobfuscating / Unminifying Obfuscated Web App / JavaScript Code. Until I move all of the relevant content from there to this gist, here is a link to the main notes I was keeping there (largely copies of my comments on various relevant GitHub repos exploring this topic, plus related research / tools / etc):

fingerprinting-minified-javascript-libraries.md


Fingerprinting Minified JavaScript Libraries


Link Dump 1

The content below was originally posted in this comment (Dec 7, 2023: Ref), and then copied over as the basis for a new issue in this comment (Dec 13, 2023: Ref).
It has been further refined/enhanced since, including fixing up the titles, adding abstracts, and removing irrelevant links.

Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc:
Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity

https://en.wikipedia.org/wiki/Program_dependence_graph


Program Dependence Graph - Wikipedia


In computer science, a Program Dependence Graph (PDG) is a representation of a program's control and data dependencies. It's a directed graph where nodes represent program statements, and edges represent dependencies between these statements. PDGs are useful in various program analysis tasks, including optimizations, debugging, and understanding program behavior.
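As a rough illustration of the data-dependency half of a PDG, here is a minimal sketch. It assumes a made-up toy statement format (a defined variable plus the variables it reads) and straight-line code only; `data_dependence_edges` and `program` are hypothetical names, and control dependencies are not modelled at all.

```python
from typing import Dict, List, Tuple

# Hypothetical toy statement format: (variable defined, variables used).
Statement = Tuple[str, List[str]]

def data_dependence_edges(statements: List[Statement]) -> List[Tuple[int, int, str]]:
    """Return (defining statement, using statement, variable) edges."""
    last_def: Dict[str, int] = {}  # most recent statement that defined each variable
    edges = []
    for index, (defined, used) in enumerate(statements):
        for var in used:
            if var in last_def:
                # This statement reads a value written by an earlier statement.
                edges.append((last_def[var], index, var))
        last_def[defined] = index
    return edges

program = [
    ("a", []),          # 0: a = 1
    ("b", []),          # 1: b = 2
    ("c", ["a", "b"]),  # 2: c = a + b
    ("a", ["c"]),       # 3: a = c * 2
    ("d", ["a", "b"]),  # 4: d = a + b
]
edges = data_dependence_edges(program)
```

Statement 4 depends on statement 3 for `a` (not statement 0), because the redefinition in statement 3 is the one that reaches it.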


https://en.wikipedia.org/wiki/Control-flow_graph


Control-Flow Graph - Wikipedia


In computer science, a control-flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed through a program during its execution.


In a control-flow graph each node in the graph represents a basic block, i.e. a straight-line piece of code without any jumps or jump targets; jump targets start a block, and jumps end a block. Directed edges are used to represent jumps in the control flow. There are, in most presentations, two specially designated blocks: the entry block, through which control enters into the flow graph, and the exit block, through which all control flow leaves.
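The basic-block rule quoted above (jump targets start a block, jumps end a block) can be sketched roughly as follows. This assumes a hypothetical toy instruction format of `(opcode, jump_target_index)` with made-up opcodes `jmp` (unconditional), `br` (conditional) and `ret`; it is not tied to any real instruction set or tool.

```python
from typing import List, Optional, Tuple

# Hypothetical toy instruction: (opcode, jump target index or None).
Instr = Tuple[str, Optional[int]]

def build_cfg(program: List[Instr]):
    """Split a linear program into basic blocks and connect them with edges."""
    # Leaders: the first instruction, every jump target, and every
    # instruction that directly follows a jump.
    leaders = {0}
    for i, (op, target) in enumerate(program):
        if op in ("jmp", "br"):
            leaders.add(target)
            if i + 1 < len(program):
                leaders.add(i + 1)
    starts = sorted(leaders)
    # Each block runs from its leader up to (not including) the next leader.
    blocks = list(zip(starts, starts[1:] + [len(program)]))
    block_of = {start: n for n, (start, _) in enumerate(blocks)}
    edges = set()
    for n, (_, end) in enumerate(blocks):
        op, target = program[end - 1]  # last instruction of the block
        if op in ("jmp", "br"):
            edges.add((n, block_of[target]))
        # Conditional branches and ordinary instructions can fall through.
        if op not in ("jmp", "ret") and end < len(program):
            edges.add((n, block_of[end]))
    return blocks, sorted(edges)

toy_program = [
    ("cmp", None),  # 0
    ("br", 4),      # 1: if condition holds, go to 4
    ("add", None),  # 2
    ("jmp", 5),     # 3
    ("sub", None),  # 4
    ("ret", None),  # 5
]
blocks, edges = build_cfg(toy_program)
```

For this toy program the result is a diamond: block 0 branches to blocks 1 and 2, and both reach block 3, which acts as the exit block.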


https://github.com/rudrOwO/control-flow-graph


Control-flow Graph
Generate control-flow graph (CFG) from any code consisting of C-like syntax


https://control-flow.vercel.app/


https://reverseengineering.stackexchange.com/questions/16557/building-a-control-flow-graph-from-machine-code


Building a control flow graph from machine code (2017)


https://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint


Stack Overflow: Assembly-level function fingerprint (2011)


https://stackoverflow.com/questions/15087195/data-flow-graph-construction


Stack Overflow: Data Flow Graph Construction (2013)


https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree


Code Review Stack Exchange: Call-flow graph from Python abstract syntax tree (2022)


https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/


CodeQL Documentation: About data flow analysis


Data flow analysis is used to compute the possible values that a variable can hold at various points in a program, determining how those values propagate through the program and where they are used.


https://codeql.github.com/docs/codeql-language-guides/analyzing-data-flow-in-javascript-and-typescript/#analyzing-data-flow-in-javascript-and-typescript


Analyzing data flow in JavaScript and TypeScript
This topic describes how data flow analysis is implemented in the CodeQL libraries for JavaScript/TypeScript and includes examples to help you write your own data flow queries.


https://clang.llvm.org/docs/DataFlowAnalysisIntro.html


Clang Documentation: Data flow analysis: an informal introduction


This document introduces data flow analysis in an informal way. The goal is to give the reader an intuitive understanding of how it works, and show how it applies to a range of refactoring and bug finding problems.


Data flow analysis is a static analysis technique that proves facts about a program or its fragment. It can make conclusions about all paths through the program, while taking control flow into account and scaling to large programs. The basic idea is propagating facts about the program through the edges of the control flow graph (CFG) until a fixpoint is reached.
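To make the "propagate facts along CFG edges until a fixpoint is reached" idea concrete, here is a minimal sketch of a reaching-definitions analysis. It is a textbook-style illustration rather than how Clang implements it; the block names, the `gen`/`kill` sets and the definition labels `d1`/`d2` are all made up for the example.

```python
from typing import Dict, List, Set

def reaching_definitions(
    succ: Dict[str, List[str]],
    gen: Dict[str, Set[str]],
    kill: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Iterate OUT[b] = gen[b] | (IN[b] - kill[b]) until nothing changes."""
    preds: Dict[str, List[str]] = {b: [] for b in succ}
    for block, targets in succ.items():
        for target in targets:
            preds[target].append(block)
    out: Dict[str, Set[str]] = {b: set() for b in succ}
    changed = True
    while changed:  # stop once a full pass changes nothing (the fixpoint)
        changed = False
        for block in succ:
            # IN[b] is the union of OUT over all predecessors.
            in_set = set().union(*(out[p] for p in preds[block]))
            new_out = gen[block] | (in_set - kill[block])
            if new_out != out[block]:
                out[block] = new_out
                changed = True
    return out

# Toy loop: entry (d1: x = 0) -> head -> body (d2: x = x + 1) -> head; head -> exit.
succ = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
gen = {"entry": {"d1"}, "head": set(), "body": {"d2"}, "exit": set()}
kill = {"entry": {"d2"}, "head": set(), "body": {"d1"}, "exit": set()}
out = reaching_definitions(succ, gen, kill)
```

Both definitions of `x` reach the exit block, because the loop body may run zero or more times.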


https://openreview.net/forum?id=BJxWx0NYPr


Adaptive Structural Fingerprints for Graph Attention Networks (2019)


Graph attention network (GAT) is a promising framework to perform convolution and message passing on graphs. Yet, how to fully exploit rich structural information in the attention mechanism remains a challenge. In the current version, GAT calculates attention scores mainly using node features and among one-hop neighbors, while increasing the attention range to higher-order neighbors can negatively affect its performance, reflecting the over-smoothing risk of GAT (or graph neural networks in general), and the ineffectiveness in exploiting graph structural details. In this paper, we propose an "adaptive structural fingerprint" (ADSF) model to fully exploit graph topological details in graph attention network. The key idea is to contextualize each node with a weighted, learnable receptive field encoding rich and diverse local graph structures. By doing this, structural interactions between the nodes can be inferred accurately, thus significantly improving subsequent attention layer as well as the convergence of learning. Furthermore, our model provides a useful platform for different subspaces of node features and various scales of graph structures to 'cross-talk' with each other through the learning of multi-head attention, being particularly useful in handling complex real-world data. Empirical results demonstrate the power of our approach in exploiting rich structural information in GAT and in alleviating the intrinsic oversmoothing problem in graph neural networks.


https://dl.acm.org/doi/10.1145/3486860


A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features (2022)


Binary code fingerprinting is crucial in many security applications. Examples include malware detection, software infringement, vulnerability analysis, and digital forensics. It is also useful for security researchers and reverse engineers since it enables high fidelity reasoning about the binary code such as revealing the functionality, authorship, libraries used, and vulnerabilities. Numerous studies have investigated binary code with the goal of extracting fingerprints that can illuminate the semantics of a target application. However, extracting fingerprints is a challenging task since a substantial amount of significant information will be lost during compilation, notably, variable and function naming, the original data and control flow structures, comments, semantic information, and the code layout. This article provides the first systematic review of existing binary code fingerprinting approaches and the contexts in which they are used. In addition, it discusses the applications that rely on binary code fingerprints, the information that can be captured during the fingerprinting process, and the approaches used and their implementations. It also addresses limitations and open questions related to the fingerprinting process and proposes future directions.


https://inria.hal.science/hal-01648996/document


BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables (2017)


Binary code fingerprinting is a challenging problem that requires an in-depth analysis of binary components for deriving identifiable signatures. Fingerprints are useful in automating reverse engineering tasks including clone detection, library identification, authorship attribution, cyber forensics, patch analysis, malware clustering, binary auditing, etc. In this paper, we present BinSign, a binary function fingerprinting framework. The main objective of BinSign is providing an accurate and scalable solution to binary code fingerprinting by computing and matching structural and syntactic code profiles for disassemblies. We describe our methodology and evaluate its performance in several use cases, including function reuse, malware analysis, and indexing scalability. Additionally, we emphasize the scalability aspect of BinSign. We perform experiments on a database of 6 million functions. The indexing process requires an average time of 0.0072 seconds per function. We find that BinSign achieves higher accuracy compared to existing tools.


https://hal.science/hal-00627811/document


Syntax tree fingerprinting: a foundation for source code similarity detection (2011)


Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits.
Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process.
In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.
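A minimal sketch of the general idea (hash every subtree, index the hashes, and treat equal hashes as exact clones with respect to an abstraction) might look like the following. It is not the paper's algorithm or hashing strategy: it uses Python's built-in `ast` module as a stand-in parser, SHA-1 as an arbitrary hash, and an abstraction that keeps only node types, so identifier names and literal values are ignored. The same shape of approach could presumably be applied to a JavaScript AST, where ignoring names matters because minifiers rename identifiers.

```python
import ast
import hashlib
from collections import defaultdict
from typing import DefaultDict, List

def fingerprint(node: ast.AST, index: DefaultDict[str, List[ast.AST]]) -> str:
    """Hash the shape of the subtree rooted at `node` and record it in `index`."""
    child_hashes = [fingerprint(child, index) for child in ast.iter_child_nodes(node)]
    # Only the node type is kept: names and literal values are abstracted away.
    label = type(node).__name__
    payload = label + "(" + ",".join(child_hashes) + ")"
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    index[digest].append(node)  # nodes sharing a digest form a clone cluster
    return digest

index: DefaultDict[str, List[ast.AST]] = defaultdict(list)
src_a = "def area(w, h):\n    return w * h + 1\n"
src_b = "def f(a, b):\n    return a * b + 2\n"   # same shape, renamed
src_c = "def f(a, b):\n    return a - b\n"       # different shape
fp_a = fingerprint(ast.parse(src_a), index)
fp_b = fingerprint(ast.parse(src_b), index)
fp_c = fingerprint(ast.parse(src_c), index)
```

Here `fp_a` and `fp_b` collide on purpose (the two functions differ only in names and a constant), while `fp_c` differs. Because every subtree is indexed, partial matches such as a shared expression could also be looked up in `index`.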


https://ieeexplore.ieee.org/document/5090050


Syntax tree fingerprinting for source code similarity detection (2009)


Numerous approaches based on metrics, token sequence pattern-matching, abstract syntax tree (AST) or program dependency graph (PDG) analysis have already been proposed to highlight similarities in source code: in this paper we present a simple and scalable architecture based on AST fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases.


https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf


https://ieeexplore.ieee.org/document/9960266


Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022)


An Abstract Syntax Tree (AST) is an abstract logical structure of source code represented as a tree. This research utilizes information of fingerprinting with AST to locate the similarities between source codes. The proposed method can detect plagiarism in source codes using the number of duplicated logical structures. The structural information of program is stored in the fingerprints format. Then, the fingerprints of source codes are compared to identify number of similar nodes. The final output is calculated from number of similar nodes known as similarities scores. The result shows that the proposed method accurately captures the common modification techniques from basic to advanced.


https://digitalcommons.calpoly.edu/theses/2040/


Cloneless: Code Clone Detection via Program Dependence Graphs with Relaxed Constraints (2019)


Code clones are pieces of code that have the same functionality. While some clones may structurally match one another, others may look drastically different. The inclusion of code clones clutters a code base, leading to increased costs through maintenance. Duplicate code is introduced through a variety of means, such as copy-pasting, code generated by tools, or developers unintentionally writing similar pieces of code. While manual clone identification may be more accurate than automated detection, it is infeasible due to the extensive size of many code bases. Software code clone detection methods have differing degree of success based on the analysis performed. This thesis outlines a method of detecting clones using a program dependence graph and subgraph isomorphism to identify similar subgraphs, ultimately illuminating clones. The project imposes few constraints when comparing code segments to potentially reveal more clones.


https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses


https://dl.acm.org/doi/abs/10.1145/1286821.1286826


Dynamic graph-based software fingerprinting (2007)


Fingerprinting embeds a secret message into a cover message. In media fingerprinting, the secret is usually a copyright notice and the cover a digital image. Fingerprinting an object discourages intellectual property theft, or when such theft has occurred, allows us to prove ownership.
The Software Fingerprinting problem can be described as follows. Embed a structure W into a program P such that: W can be reliably located and extracted from P even after P has been subjected to code transformations such as translation, optimization and obfuscation; W is stealthy; W has a high data rate; embedding W into P does not adversely affect the performance of P; and W has a mathematical property that allows us to argue that its presence in P is the result of deliberate actions.
In this article, we describe a software fingerprinting technique in which a dynamic graph fingerprint is stored in the execution state of a program. Because of the hardness of pointer alias analysis such fingerprints are difficult to attack automatically.


https://dl.acm.org/doi/pdf/10.1145/1286821.1286826


https://patents.google.com/patent/US9459861B1/en


Systems and methods for detecting copied computer code using fingerprints (2016)


Systems and methods of detecting copying of computer code or portions of computer code involve generating unique fingerprints from compiled computer binaries. The unique fingerprints are simplified representations of functions in the compiled computer binaries and are compared with each other to identify similarities between functions in the respective compiled computer binaries. Copying can be detected when there are sufficient similarities between fingerprints of two functions.


https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf


Software Fingerprinting in LLVM (2021)


Executable steganography, the hiding of software machine code inside of a larger program, is a potential approach to introduce new software protection constructs such as watermarks or fingerprints. Software fingerprinting is, therefore, a process similar to steganography, hiding data within other data. The goal of fingerprinting is to hide a unique secret message, such as a serial number, into copies of an executable program in order to provide proof of ownership of that program. Fingerprints are a special case of watermarks, with the difference being that each fingerprint is unique to each copy of a program. Traditionally, researchers describe four aims that a software fingerprint should achieve. These include the fingerprint should be difficult to remove, it should not be obvious, it should have a low false positive rate, and it should have negligible impact on performance. In this research, we propose to extend these objectives and introduce a fifth aim: that software fingerprints should be machine independent. As a result, the same fingerprinting method can be used regardless of the architecture used to execute the program. Hence, this paper presents an approach towards the realization of machine-independent fingerprinting of executable programs. We make use of Low-Level Virtual Machine (LLVM) intermediate representation during the software compilation process to demonstrate both a simple static fingerprinting method as well as a dynamic method, which displays our aim of hardware independent fingerprinting. The research contribution includes a realization of the approach using the LLVM infrastructure and provides a proof of concept for both simple static and dynamic watermarks that are architecture neutral.


https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W


Graph-of-Code: Semantic Clone Detection Using Graph Fingerprints (2023)


The code clone detection issue has been researched using a number of explicit factors based on the tokens and contents and found effective results. However, exposing code contents may be an impractical option because of privacy and security factors. Moreover, the lack of scalability of past methods is an important challenge. The code flow states can be inferred by code structure and implicitly represented using empirical graphs. The assumption is that modelling of the code clone detection problem can be achieved without the content of the codes being revealed. Here, a Graph-of-Code concept for the code clone detection problem is introduced, which represents codes into graphs. While Graph-of-Code provides structural properties and quantification of its characteristics, it can exclude code contents or tokens to identify the clone type. The aim is to evaluate the impact of graph-of-code structural properties on the performance of code clone detection. This work employs a feature extraction-based approach for unlabelled graphs. The approach generates a “Graph Fingerprint” which represents different topological feature levels. The results of code clone detection indicate that code structure has a significant role in detecting clone types. We found different GoC-models outperform others. The models achieve between 96% to 99% in detecting code clones based on recall, precision, and F1-Score. The GoC approach is capable in detecting code clones with scalable dataset and with preserving codes privacy.


https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf


Slides: Basic Program Analysis - Suman Jana


ChatGPT Summary / Abstract:


Title: Basic Program Analysis
Author: Suman Jana
Institution: Columbia University
Abstract:
This document delves into the foundational concepts and techniques involved in program analysis, particularly focusing on control flow and data flow analysis essential for identifying security bugs in source code. The objective is to equip readers with the understanding and tools needed to effectively analyze programs without building systems from scratch, utilizing existing frameworks such as LLVM for customization and enhancement of analysis processes.
The core discussion includes an overview of compiler design with specific emphasis on the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Analysis. These elements are critical in understanding the structure of source code and its execution flow. The document highlights the conversion of source code into AST and subsequently into CFG, where data flow analysis can be applied to optimize code and identify potential security vulnerabilities.
Additionally, the paper explores more complex topics like identifying basic blocks within CFG, constructing CFG from basic blocks, and advanced concepts such as loop identification and the concept of dominators in control flow. It also addresses the challenges and solutions related to handling irreducible Control Flow Graphs (CFGs), which are crucial for the analysis of less structured code.
Keywords: Program Analysis, Compiler Design, Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Analysis, LLVM, Security Bugs.


https://www.researchgate.net/publication/370980383_A_graph-based_code_representation_method_to_improve_code_readability_classification


A graph-based code representation method to improve code readability classification (2023)


Context: Code readability is crucial for developers since it is closely related to code maintenance and affects developers’ work efficiency. Code readability classification refers to the source code being classified as pre-defined certain levels according to its readability. So far, many code readability classification models have been proposed in existing studies, including deep learning networks that have achieved relatively high accuracy and good performance. Objective: However, in terms of representation, these methods lack effective preservation of the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method. Method: Firstly, the source code is parsed into a graph containing its abstract syntax tree (AST) combined with control and data flow edges to reserve the semantic structural information and then we convert the graph nodes’ source code and type information into vectors. Finally, we train our graph neural networks model composing Graph Convolutional Network (GCN), DMoNPooling, and K-dimensional Graph Neural Networks (k-GNNs) layers to extract these features from the program graph. Result: We evaluate our approach to the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% and 88% in three-class and two-class classification accuracy, respectively. Conclusion: We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.


https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html


Program Analysis Tools


Contents:
1 Representing Programs
1.1 Abstract Syntax Trees (ASTs)
1.2 Control Flow Graphs
2 Style and Anomaly Checking
2.1 Lint
2.2 Static Analysis by Compilers
2.3 CheckStyle
2.4 SpotBugs
2.5 PMD
3 Reverse-Engineering Tools
3.1 Reverse Compilers
3.2 Java Obfuscators
3.3 Obfuscation Example
4 Dynamic Analysis Tools
4.1 Pointer/Memory Errors
4.2 Profilers


https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html#control-flow-graphs


1.2 Control Flow Graphs


Represent each executable statement in the code as a node,
with edges connecting nodes that can be executed one after another.
Nodes for conditional statements have two or more outgoing edges.


https://www.cs.odu.edu/~zeil/cs350/latest/Public/analysis/index.html#data-flow-analysis


1.2.2 Data Flow Analysis


Link Dump 2

The content below was originally posted in the following comment (April 30, 2024: Ref):

This is potentially more of a generalised/'naive' approach to the problem, but it would also be interesting to see if/how well an embedding model tuned for code would do at solving this sort of problem space:

https://openai.com/blog/introducing-text-and-code-embeddings

https://platform.openai.com/docs/guides/embeddings


An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
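As a rough sketch of what "distance between two vectors" can mean in practice, the snippet below uses cosine distance (one common choice, assumed here rather than taken from the quoted text) and a brute-force nearest-neighbour lookup. The vectors and the `lib_a` / `lib_b` / `lib_c` labels are made-up toy values, not real embeddings; with a real model the vectors would come from an embeddings API, and a library such as Faiss would replace the brute-force loop at scale.

```python
import math
from typing import Dict, List

def cosine_distance(u: List[float], v: List[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest(query: List[float], known: Dict[str, List[float]]) -> str:
    """Brute-force search: return the label whose vector is closest to `query`."""
    return min(known, key=lambda label: cosine_distance(query, known[label]))

# Hypothetical toy "embeddings" of three known library builds.
known_vectors = {
    "lib_a": [0.9, 0.1, 0.0],
    "lib_b": [0.1, 0.9, 0.2],
    "lib_c": [0.0, 0.2, 0.9],
}
unknown_chunk = [0.8, 0.2, 0.1]  # stand-in for an embedded minified chunk
best_match = nearest(unknown_chunk, known_vectors)
```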


https://platform.openai.com/docs/api-reference/embeddings


https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/


Faiss: A library for efficient similarity search


Also, here's the latest version of my open tabs 'reading list' in this space of things, in case any of it is relevant/interesting/useful here:
Unsorted/Unreviewed Link Dump RE: 'AST fingerprinting' / Code Similarity (v2)

https://en.wikipedia.org/wiki/Content_similarity_detection


Content similarity detection


https://arxiv.org/abs/2306.16171


A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023)


Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance.


https://link.springer.com/article/10.1007/s10664-017-9564-7


A comparison of code similarity analysers (2017)


Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. 
The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code.


https://www.researchgate.net/publication/2840981_Winnowing_Local_Algorithms_for_Document_Fingerprinting


Winnowing: Local Algorithms for Document Fingerprinting (2003)


Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with Moss, a widely-used plagiarism detection service.
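A minimal sketch of winnowing as commonly described: hash every k-gram, slide a window of `w` consecutive hashes, and keep the minimum hash of each window (rightmost on ties). The choices here are assumptions for illustration: `zlib.crc32` stands in for a proper rolling hash, the `k` and `w` values are arbitrary, and no whitespace or case normalisation is done before hashing.

```python
import zlib
from typing import Set, Tuple

def winnow(text: str, k: int = 5, w: int = 4) -> Set[Tuple[int, int]]:
    """Return the selected fingerprints as (hash, position) pairs."""
    # Hash every k-gram (substring of length k) of the text.
    hashes = [zlib.crc32(text[i:i + k].encode("utf-8")) for i in range(len(text) - k + 1)]
    fingerprints = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, pick the rightmost occurrence of the minimum in the window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints

doc_a = "var x=1;function add(a,b){return a+b}console.log(x)"
doc_b = "let y=2;function add(a,b){return a+b}alert(y)"
hashes_a = {h for h, _ in winnow(doc_a)}
hashes_b = {h for h, _ in winnow(doc_b)}
shared = hashes_a & hashes_b
```

Any substring of at least `w + k - 1` characters that appears in both documents contains a full window of identical hashes, so at least one fingerprint hash is shared; that is why `shared` is non-empty for the two snippets above.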


https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf


https://www.researchgate.net/publication/375651686_Source_Code_Plagiarism_Detection_with_Pre-Trained_Model_Embeddings_and_Automated_Machine_Learning


Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023)


https://aclanthology.org/2023.ranlp-1.34.pdf


https://www.researchgate.net/publication/262322336_A_Source_Code_Similarity_System_for_Plagiarism_Detection


A Source Code Similarity System for Plagiarism Detection (2013)


Source code plagiarism is an easy to do task, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed the source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results as compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent samples t-test revealed that there was a statistically significant difference between average values of F-measures for the test sets that we used and for the experiments that we have done in the practically usable range of cut-off threshold values of 35–70%.


https://www.mdpi.com/2076-3417/10/21/7519


A Source Code Similarity Based on Siamese Neural Network (2020)


Finding similar code snippets is a fundamental task in the field of software engineering. Several approaches have been proposed for this task by using statistical language model which focuses on syntax and structure of codes rather than deep semantic information underlying codes. In this paper, a Siamese Neural Network is proposed that maps codes into continuous space vectors and try to capture their semantic meaning. Firstly, an unsupervised pre-trained method that models code snippets as a weighted series of word vectors. The weights of the series are fitted by the Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network trained model is constructed to learn semantic vector representation of code snippets. Finally, the cosine similarity is provided to measure the similarity score between pairs of code snippets. Moreover, we have implemented our approach on a dataset of functionally similar code. The experimental results show that our method improves some performance over single word embedding method.
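As a rough illustration of the TF-IDF weighting and cosine similarity steps mentioned in this abstract, here is a minimal sketch. It is not the paper's method: there are no learned word vectors and no Siamese network, the tokenizer regex is a simplification, and the smoothed IDF formula and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokens(code: str) -> list[str]:
    """Split code into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def tfidf_vectors(snippets: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (token -> weight) per snippet."""
    docs = [Counter(tokens(s)) for s in snippets]
    n = len(docs)
    # Document frequency: in how many snippets does each token occur?
    df = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Smoothed IDF (an assumption), so tokens present everywhere keep weight 1.
            tok: (count / total) * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, count in doc.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


snippets = [
    "function sum(a, b) { return a + b; }",
    "function add(x, y) { return x + y; }",
    "for (const k of keys) { console.log(k); }",
]
vecs = tfidf_vectors(snippets)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two snippets share most of their structural tokens, so their score comes out higher than the score against the unrelated loop.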


https://www.researchgate.net/publication/337196468_Detecting_Source_Code_Similarity_Using_Compression


Detecting Source Code Similarity Using Compression (2019)


Different forms of plagiarism make a fair assessment of student assignments more difficult. Source code plagiarisms pose a significant challenge especially for automated assessment systems aimed for students' programming solutions. Different automated assessment systems employ different text or source code similarity detection tools, and all of these tools have their advantages and disadvantages. In this paper, we revitalize the idea of similarity detection based on string complexity and compression. We slightly adapt an existing, third-party, approach, implement it and evaluate its potential on synthetically generated cases and on a small set of real student solutions. On synthetic cases, we showed that average deviation (in absolute values) from the expected similarity is less than 1% (0.94%). On the real-life examples of student programming solutions we compare our results with those of two established tools. The average difference is around 18.1% and 11.6%, while the average difference between those two tools is 10.8%. However, the results of all three tools follow the same trend. Finally, a deviation to some extent is expected as observed tools apply different approaches that are sensitive to other factors of similarities. Gained results additionally demonstrate open challenges in the field.
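To give a feel for compression-based similarity, here is a minimal sketch using the normalized compression distance (NCD), a well-known measure of this kind. This is an assumption for illustration: the abstract does not say which formula or compressor the paper adapts, and `zlib` is simply what ships with Python.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a rough complexity estimate."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower means more similar."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical inputs: a snippet, a renamed copy, and unrelated text.
original = "function add(a, b) { return a + b; }\n" * 20
renamed = "function add(x, y) { return x + y; }\n" * 20
unrelated = "".join(chr(33 + (i * 37) % 90) for i in range(800))

print(ncd(original, renamed), ncd(original, unrelated))
```

The idea is that if two inputs share a lot of content, compressing them together costs little more than compressing the larger one alone.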


https://ceur-ws.org/Vol-2508/paper-pri.pdf


https://www.nature.com/articles/s41598-023-42769-9


Binary code similarity analysis based on naming function and common vector space (2023)


Binary code similarity analysis is widely used in the field of vulnerability search where source code may not be available to detect whether two binary functions are similar or not. Based on deep learning and natural language processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes suffer from the shortcomings of large differences in instruction syntaxes across different target platforms, inability to align control flow graph nodes, and less introduction of high-level semantics of stability, which pose challenges for identifying similar computations between binary functions of different platforms generated from the same source code. We argue that extracting stable, platform-independent semantics can improve model accuracy, and a cross-platform binary function similarity comparison model N_Match is proposed. The model elevates different platform instructions to the same semantic space to shield their underlying platform instruction differences, uses graph embedding technology to learn the stability semantics of neighbors, extracts high-level knowledge of naming function to alleviate the differences brought about by cross-platform and cross-optimization levels, and combines the stable graph structure as well as the stable, platform-independent API knowledge of naming function to represent the final semantics of functions. The experimental results show that the model accuracy of N_Match outperforms the baseline model in terms of cross-platform, cross-optimization level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, the mAP exceeds the current graph embedding model by 66%. In addition, we also give several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match


https://arxiv.org/abs/2305.03843


REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023)


This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features as well as utilizing both similar and dissimilar examples during training. We present the first-ever code search method that encodes dynamic runtime information during training without the need to execute either the corpus under search or the search query at inference time and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and negative reference sample in the training process results in substantial performance improvements demonstrating both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine tuning, even when enhancing the largest available LLMs highlighting the importance for open-sourced models. To ensure the reproducibility and extensibility of our research, we present an open-sourced implementation of our tool and training procedures called REINFOREST.


https://www.usenix.org/conference/usenixsecurity21/presentation/ahmadi


Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021)


Probabilistic classification has shown success in detecting known types of software bugs. However, the works following this approach tend to require a large amount of specimens to train their models. We present a new machine learning-based bug detection technique that does not require any external code or samples for training. Instead, our technique learns from the very codebase on which the bug detection is performed, and therefore, obviates the need for the cumbersome task of gathering and cleansing training samples (e.g., buggy code of certain kinds). The key idea behind our technique is a novel two-step clustering process applied on a given codebase. This clustering process identifies code snippets in a project that are functionally-similar yet appear in inconsistent forms. Such inconsistencies are found to cause a wide range of bugs, anything from missing checks to unsafe type conversions. Unlike previous works, our technique is generic and not specific to one type of inconsistency or bug. We prototyped our technique and evaluated it using 5 popular open source software, including QEMU and OpenSSL. With a minimal amount of manual analysis on the inconsistencies detected by our tool, we discovered 22 new unique bugs, despite the fact that many of these programs are constantly undergoing bug scans and new bugs in them are believed to be rare.


https://www.usenix.org/system/files/sec21summer_ahmadi.pdf


https://theory.stanford.edu/~aiken/moss/


MOSS: A System for Detecting Software Similarity


https://github.com/fanghon/antiplag


antiplag similarity checking software for program codes, documents, and pictures
The software mainly checks and compares the similarities between electronic assignments submitted by students. It can analyze program code in multiple programming languages (such as Java, C/C++, Python, etc.), documents in multiple formats (txt, doc, docx, pdf, etc.) written in English or in simplified and traditional Chinese, and images in multiple formats (png, jpg, gif, bmp, etc.). It outputs the code, text, and images with high similarity, thereby helping to detect plagiarism between students.


https://github.com/dodona-edu/dolos


Dolos
Dolos is a source code plagiarism detection tool for programming exercises. Dolos helps teachers in discovering students sharing solutions, even if they are modified. By providing interactive visualizations, Dolos can also be used to sensitize students to prevent plagiarism.


https://dolos.ugent.be/
https://dolos.ugent.be/about/algorithm.html


How Dolos works
Conceptually, the plagiarism detection pipeline of Dolos can be split into four successive steps:

Tokenization
Fingerprinting
Indexing
Reporting


Tokenization
To be immune against masking plagiarism by techniques such as renaming variables and functions, Dolos doesn't directly process the source code under investigation. It starts by performing a tokenization step using Tree-sitter. Tree-sitter can generate syntax trees for many programming languages, converts source code to a more structured form, and masks specific naming of variables and functions.
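The masking idea can be sketched in a few lines. This is not how Dolos does it: Dolos uses Tree-sitter syntax trees, whereas this example swaps in Python's standard `tokenize` module on Python source as a stand-in, and the placeholder names `ID`, `NUM`, and `STR` are assumptions.

```python
import io
import keyword
import tokenize


def masked_tokens(source: str) -> list[str]:
    """Tokenize Python source, replacing names and literals with placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keep keywords (they carry structure), mask every other identifier.
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
        # Comments, newlines, and indentation tokens are dropped in this sketch.
    return out


first = masked_tokens("def total(a, b):\n    return a + b\n")
second = masked_tokens("def s(x, y):\n    return x + y  # renamed\n")
print(first)
print(first == second)
```

Because identifiers collapse to the same placeholder, renaming variables or functions leaves the token stream unchanged.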


Fingerprinting
To measure similarities between (converted) files, Dolos tries to find common sequences of tokens. More specifically, it uses subsequences of fixed length called k-grams. To efficiently make these comparisons and reduce the memory usage, all k-grams are hashed using a rolling hash function (the one used by Rabin-Karp in their string matching algorithm). The length k of k-grams can be altered with the -k option.
To further reduce the memory usage, only a subset of all hashes are stored. The selection of hashes is done by the Winnowing algorithm as described by (Schleimer, Wilkerson and Aiken). In short: only the hash with the smallest numerical value is kept for each window. The window length (in k-grams) can be altered with the -w option.
The remaining hashes are the fingerprints of the analyzed files. Internally, these are stored as simple integers.
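The fingerprinting step described above can be sketched as follows. This is a simplified illustration rather than Dolos's actual code: the hash base, the modulus, and the use of `zlib.crc32` to turn each token into an integer are assumptions, and ties are broken by taking the rightmost minimum, as in the Schleimer, Wilkerson and Aiken paper.

```python
import zlib

BASE = 257             # assumed polynomial base
MOD = (1 << 61) - 1    # assumed modulus (a large prime)


def kgram_hashes(tokens: list[str], k: int) -> list[int]:
    """Polynomial rolling hash over every window of k consecutive tokens."""
    values = [zlib.crc32(t.encode("utf-8")) + 1 for t in tokens]
    if len(values) < k:
        return []
    high = pow(BASE, k - 1, MOD)  # weight of the token leaving the window
    h = 0
    for v in values[:k]:
        h = (h * BASE + v) % MOD
    hashes = [h]
    for i in range(k, len(values)):
        # Drop the oldest token, shift, and append the newest one.
        h = ((h - values[i - k] * high) * BASE + values[i]) % MOD
        hashes.append(h)
    return hashes


def winnow(hashes: list[int], w: int) -> list[tuple[int, int]]:
    """Keep the minimum hash of each window of w hashes as (hash, position)."""
    picked = []
    for start in range(max(len(hashes) - w, 0) + 1):
        window = hashes[start:start + w]
        if not window:
            break
        smallest = min(window)
        offset = max(i for i, h in enumerate(window) if h == smallest)
        entry = (smallest, start + offset)
        if not picked or picked[-1] != entry:  # record each selection once
            picked.append(entry)
    return picked


toks = ["ID", "(", "ID", ")", "ID", "(", "ID", ")"]
print(winnow(kgram_hashes(toks, k=3), w=4))
```

Larger values of `w` keep fewer fingerprints per file, trading sensitivity to short matches for lower memory usage.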


Indexing
Because Dolos needs to compare all files with each other, it is more efficient to first create an index containing the fingerprints of all files. For each of the fingerprints encountered in any of the files, we store the file and the corresponding line number where we encountered that fingerprint.
As soon as a fingerprint is stored in the index twice, this is recorded as a match between the two files because they share at least one k-gram.
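A minimal sketch of such an index, assuming fingerprints are already available as `(hash, line)` pairs per file. The file names and values below are made up, and the data layout is an assumption rather than Dolos's internal representation.

```python
from collections import defaultdict
from itertools import combinations


def build_index(fingerprints: dict[str, list[tuple[int, int]]]) -> dict[int, list[tuple[str, int]]]:
    """Map each fingerprint value to the (file, line) places where it occurs."""
    index = defaultdict(list)
    for file, prints in fingerprints.items():
        for value, line in prints:
            index[value].append((file, line))
    return index


def matching_pairs(index: dict[int, list[tuple[str, int]]]) -> dict[tuple[str, str], set[int]]:
    """Collect, for every pair of files, the fingerprint values they share."""
    pairs = defaultdict(set)
    for value, occurrences in index.items():
        files = sorted({file for file, _ in occurrences})
        for a, b in combinations(files, 2):
            pairs[(a, b)].add(value)
    return pairs


# Hypothetical fingerprints: (hash value, line number) per file.
fingerprints = {
    "a.js": [(17, 1), (8, 4), (39, 9)],
    "b.js": [(8, 2), (39, 7), (50, 8)],
    "c.js": [(99, 1)],
}
index = build_index(fingerprints)
print(dict(matching_pairs(index)))
```

Building the index once avoids comparing every file against every other file fingerprint by fingerprint.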


Reporting
Dolos finally collects all fingerprints that occur in more than one file and aggregates the results into a report.
This report contains all file pairs that have at least one common fingerprint, together with some metrics:

similarity: the fraction of shared fingerprints between the two files
total overlap: the absolute value of shared fingerprints, useful for larger projects
longest fragment: the length (in fingerprints) of the longest subsequence of fingerprints matching between the two files, useful when not the whole source code is copied
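These three metrics could be computed roughly as below. The exact formulas are assumptions: similarity is taken here as shared over combined distinct fingerprints (Jaccard), and the longest fragment is read as the longest contiguous run of matching fingerprints. Dolos may define both differently.

```python
def pair_metrics(a: list[int], b: list[int]) -> dict[str, float]:
    """Compare two files given their fingerprint sequences (in file order)."""
    shared = set(a) & set(b)
    combined = set(a) | set(b)

    # Longest common contiguous run, using a rolling dynamic-programming row.
    longest = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                longest = max(longest, cur[j])
        prev = cur

    return {
        "similarity": len(shared) / len(combined) if combined else 0.0,
        "total_overlap": len(shared),
        "longest_fragment": longest,
    }


# Hypothetical fingerprint sequences for two files.
metrics = pair_metrics([17, 8, 39, 50, 23], [99, 8, 39, 50, 17])
print(metrics)
```

A high longest fragment with a low overall similarity hints that only one part of a file was copied.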


https://dolos.ugent.be/about/languages.html
https://dolos.ugent.be/about/publications.html


Publications
Dolos is developed by Team Dodona at Ghent University in Belgium. Our research is published in the following journals and conferences.


https://github.com/danielplohmann/mcrit


MinHash-based Code Relationship & Investigation Toolkit (MCRIT)
MCRIT is a framework created to simplify the application of the MinHash algorithm in the context of code similarity. It can be used to rapidly implement "shinglers", i.e. methods which encode properties of disassembled functions, to then be used for similarity estimation via the MinHash algorithm. It is tailored to work with disassembly reports emitted by SMDA.
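For context, the core of MinHash-based similarity estimation can be sketched as follows. This does not use MCRIT's API: the function names, the use of salted BLAKE2b as a family of hash functions, and the example shingles (short instruction mnemonic sequences) are all assumptions for illustration.

```python
import hashlib


def minhash_signature(shingles: set[str], num_hashes: int = 64) -> list[int]:
    """Signature of a non-empty shingle set: the minimum hash under each salted hash."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # one salt per simulated hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of agreeing signature positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical shingles for three functions.
func_a = {"push mov sub", "mov sub call", "sub call test", "call test jz", "test jz ret"}
func_b = {"push mov sub", "mov sub call", "sub call test", "call test jnz", "test jnz ret"}
func_c = {"xor inc cmp", "inc cmp jl", "cmp jl leave"}

sig_a, sig_b, sig_c = (minhash_signature(f) for f in (func_a, func_b, func_c))
print(estimate_jaccard(sig_a, sig_b), estimate_jaccard(sig_a, sig_c))
```

Signatures have a fixed length regardless of function size, which is what makes comparing large collections of functions cheap.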


https://github.com/BK-SCOSS/scoss


scoss
A Source Code Similarity System - SCOSS


https://github.com/island255/source2binary_dataset_construction


Source2binary Dataset Construction
This is the repository for the paper "One to One or One to many? What function inline brings to binary similarity analysis".


https://github.com/JackHCC/Pcode-Similarity


Pcode-Similarity
Algorithm for calculating similarity between function and library function.


https://github.com/JackHCC/Awesome-Binary-Code-Similarity-Detection-2021


Awesome Binary code similarity detection 2021
Awesome list for Binary Code Similarity Detection in 2021


https://github.com/Jaso1024/Semantic-Code-Embeddings


SCALE: Semantic Code Analysis via Learned Embeddings (2023)
3rd best paper on Artificial Intelligence track | presented at the 2023 International Conference on AI, Blockchain, Cloud Computing and Data Analytics
This repository holds the code and supplementary materials for SCALE: Semantic Code Analysis via Learned Embeddings. This research explores the efficacy of contrastive learning alongside large language models as a paradigm for developing a model capable of creating code embeddings indicative of code on a functional level.
Existing pre-trained models in NLP have demonstrated impressive success, surpassing previous benchmarks in various language-related tasks. However, when it comes to the field of code understanding, these models still face notable limitations. Code isomorphism, which deals with determining functional similarity between pieces of code, presents a challenging problem for NLP models. In this paper, we explore two approaches to code isomorphism. Our first approach, dubbed SCALE-FT, formulates the problem as a binary classification task, where we feed pairs of code snippets to a Large Language Model (LLM), using the embeddings to predict whether the given code segments are equivalent. The second approach, SCALE-CLR, adopts the SimCLR framework to generate embeddings for individual code snippets. By processing code samples with an LLM and observing the corresponding embeddings, we assess the similarity of two code snippets. These approaches enable us to leverage function-based code embeddings for various downstream tasks, such as code-optimization, code-comment alignment, and code classification. Our experiments on the CodeNet Python800 benchmark demonstrate promising results for both approaches. Notably, our SCALE-FT using Babbage-001 (GPT-3) achieves state-of-the-art performance, surpassing various benchmark models such as GPT-3.5 Turbo and GPT-4. Additionally, Salesforce's 350-million parameter CodeGen, when trained with the SCALE-FT framework, surpasses GPT-3.5 and GPT-4.


https://github.com/Aida-yy/binary-sim


binary similarity using Deep learning


Features: Function semantic information + control flow graph
Semantic feature extraction: extract the byte data, assembly instruction data, and integer data of the function respectively, use independent encoders (DPCNN, TextCNN) to encode the text representation, and obtain its Embedding representation.
Structural feature extraction: based on the CFG and the assembly instructions in each block, an ACFG is generated and encoded with a graph neural network to obtain its Embedding representation; in addition, considering that the node order of the control flow graph of similar functions is also similar, the CFG's adjacency matrix is taken as input and a CNN is used to obtain its Embedding representation.
Contrastive learning model structure: InfoNCE loss + In-batch negatives
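As a reference point for the last item, an InfoNCE loss with in-batch negatives can be written in plain Python as below. This is a generic sketch and not the repository's code: the temperature value, the use of cosine similarity, and the toy vectors are assumptions.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors: list[list[float]], positives: list[list[float]],
             temperature: float = 0.05) -> float:
    """Mean InfoNCE loss where positives[i] pairs with anchors[i].

    Every other positive in the batch serves as a negative for anchor i.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


# Toy embeddings: in the aligned batch each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce(anchors, aligned), info_nce(anchors, swapped))
```

The loss is small when each anchor is most similar to its own positive and grows when another item in the batch is closer.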

Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection
Investigating Graph Embedding Methods for Cross-Platform Binary Code Similarity Detection
SimCSE: Simple Contrastive Learning of Sentence Embeddings


https://github.com/jorge-martinez-gil/crosslingual-clone-detection


Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024)
Systematic study to determine the best methods to assess the similarity between code snippets in different programming languages


Link Dump 3


https://medium.com/stanford-cs224w/code-similarity-using-graph-neural-networks-1e58aa21bd92


Code Similarity Using Graph Neural Networks (2023)


https://proceedings-of-deim.github.io/DEIM2023/1b-9-4.pdf


Utilizing Abstract Syntax Tree Embedding to Improve the Quality of GNN-based Class Name Estimation (2023)


While giving comprehensible names to identifiers is essential in software development, it is sometimes difficult since it requires development experience and knowledge of the application domain. Among work to support the developer’s identifier naming, a GNN-based class name estimation approach learns a graph of relationships between program elements, i.e., classes, methods, and fields, but it ignores information within the methods. This study proposes an approach that exploits information from method bodies, which can help estimate correct class names. The proposed approach extends the existing GNN-based approach to use embeddings of the corresponding ASTs for method nodes. An evaluation experiment measures how correctly the proposed approach can estimate class names in large datasets of open-source Java projects. The experimental result shows that the proposed approach improves the estimation correctness compared to the existing approach.


https://arxiv.org/abs/2002.08653


Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (2020)


Code clones are semantically similar code fragments pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still can not fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. As far as we have concerned, we are the first to apply graph neural networks on the domain of code clone detection.
We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.


https://github.com/jacobwwh/graphmatch_clone


Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree


Code and data for paper "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree".


https://arxiv.org/abs/2004.02843


Improved Code Summarization via a Graph Neural Network (2020)


Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and outputs a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature still does not describe the using a graph neural network together with source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques, two from the software engineering literature, and two from machine learning literature.


Unsorted


https://chatgpt.com/c/d2713f5a-19ee-41fe-836d-0db4ba3daeac

Private ChatGPT conversation about various things related to AST fingerprinting/etc
TODO: Summarise/pull out the relevant parts from this and include them here


https://binary.ninja/2022/06/20/introducing-tanto.html#potential-uses-and-some-speculation


What I’ve found most interesting, and have been speculating about, is using variable slices like these (though not directly through the UI) in the function fingerprinting space. I’ve long suspected that a dataflow-based approach to fingerprinting might prove to be robust against compiler optimizations and versions, as well as source code changes that don’t completely redefine the implementation of a function. Treating each variable slice as a record of what happens to data within a function, a similarity score for two slices could be generated from the count of matching operations, matching constant interactions (2 + var_a), and matching variable interactions (var_f + var_a). Considering all slices, a confidence metric could be derived for whether two functions match. Significant research would be required to answer these questions concretely… and, if you could solve subgraph isomorphism at the same time, that’d be great!
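To make the speculation in that quote a little more concrete, here is a toy sketch of scoring slices by matching operations. Everything in it is hypothetical: the slice representation (a list of `(operation, operand)` records), the scoring formula, and the best-match averaging are assumptions, and none of it reflects Tanto or Binary Ninja APIs.

```python
from collections import Counter

# A slice is modelled as a list of (operation, operand kind) records,
# e.g. ("add", "const:2") for `2 + var_a` or ("add", "var") for `var_f + var_a`.
Slice = list[tuple[str, str]]


def slice_similarity(a: Slice, b: Slice) -> float:
    """Share of records that match between two slices (multiset overlap)."""
    if not a and not b:
        return 1.0
    matched = sum((Counter(a) & Counter(b)).values())
    return matched / max(len(a), len(b))


def function_confidence(f: list[Slice], g: list[Slice]) -> float:
    """Average, over the slices of f, of the best-matching slice in g."""
    if not f or not g:
        return 0.0
    best = [max(slice_similarity(s, t) for t in g) for s in f]
    return sum(best) / len(best)


# Hypothetical slices for two builds of the same function.
func_a = [
    [("add", "const:2"), ("load", "var"), ("mul", "var")],
    [("cmp", "const:0"), ("branch", "var")],
]
func_b = [
    [("add", "const:2"), ("load", "var"), ("shl", "const:1")],
    [("cmp", "const:0"), ("branch", "var")],
]
print(function_confidence(func_a, func_b))
```

This ignores the ordering of operations within a slice, which a real approach would probably need to account for.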


See Also

My Other Related Deepdive Gist's and Projects


https://github.com/0xdevalias/chatgpt-source-watch : Analyzing the evolution of ChatGPT's codebase through time with curated archives and scripts.

Reverse engineering ChatGPT's frontend web app + deep dive explorations of the code (0xdevalias gist)


Deobfuscating / Unminifying Obfuscated Web App Code (0xdevalias gist)
Reverse Engineering Webpack Apps (0xdevalias gist)
Reverse Engineered Webpack Tailwind-Styled-Component (0xdevalias gist)
React Server Components, Next.js v13+, and Webpack: Notes on Streaming Wire Format (__next_f, etc) (0xdevalias' gist)
Bypassing Cloudflare, Akamai, etc (0xdevalias gist)
Debugging Electron Apps (and related memory issues) (0xdevalias gist)
devalias' Beeper CSS Hacks (0xdevalias gist)
Reverse Engineering Golang (0xdevalias' gist)
Reverse Engineering on macOS (0xdevalias' gist)