Date: 2019-11-21T08:48:18+01:00 Unread emails: 61 Paper titles: 146 Unique paper titles: 96
-
"Evaluating Semantic Representations of Source Code" (4)
Abstract: Learned representations of source code enable various software developer tools, e.g. ...
..., to detect bugs or to predict program properties. At the core of code representations often are word embeddings of identifier names in source code, because identifiers account for the majority of source code vocabulary and convey important semantic information. Unfortunately, there currently is no generally accepted way of evaluating the quality of word embeddings of identifiers, and current evaluations are biased toward specific downstream tasks. This paper presents IdBench, the first benchmark for evaluating to what extent word embeddings of identifiers represent semantic relatedness and similarity. The benchmark is based on thousands of ratings gathered by surveying 500 software developers. We use IdBench to evaluate state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions, as these are often used in current developer tools. Our results show that the effectiveness of embeddings varies significantly across different embedding techniques and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing embedding provides a satisfactory representation of semantic similarities, e.g., because embeddings consider identifiers with opposing meanings as similar, which may lead to fatal mistakes in downstream developer tools. IdBench provides a gold standard to guide the development of novel embeddings that address the current limitations. -
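A benchmark like IdBench is typically used by correlating an embedding's pairwise similarities with human relatedness ratings. The following sketch illustrates that evaluation style with invented toy embeddings, identifier pairs, and ratings (none of these values come from IdBench itself), using a hand-rolled Spearman correlation:

```python
from math import sqrt

# Hypothetical sketch of an IdBench-style evaluation: compare an embedding's
# cosine similarities against human relatedness ratings via Spearman correlation.
# All embeddings, identifier pairs, and gold ratings below are invented toy data.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy identifier embeddings (2-dimensional for illustration only).
emb = {
    "count": [1.0, 0.1], "counter": [0.9, 0.2],
    "index": [0.5, 0.8], "maximum": [-0.7, 0.6],
}
# Invented human ratings for identifier pairs on a 0..1 relatedness scale.
gold = [(("count", "counter"), 0.95), (("count", "index"), 0.40),
        (("counter", "maximum"), 0.10)]

pred = [cosine(emb[a], emb[b]) for (a, b), _ in gold]
human = [score for _, score in gold]
print(round(spearman(pred, human), 2))
```

A real evaluation would run this over thousands of rated pairs; a high correlation means the embedding's notion of similarity tracks developers' judgments.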
"Learning to Fuzz from Symbolic Execution with Application to Smart Contracts" (4)
Abstract: Fuzzing and symbolic execution are two complementary techniques for discovering ...
... software vulnerabilities. Fuzzing is fast and scalable, but can be ineffective when it fails to randomly select the right inputs. Symbolic execution is thorough but slow and often does not scale to deep program paths with complex path conditions. In this work, we propose to learn an effective and fast fuzzer from symbolic execution, by phrasing the learning task in the framework of imitation learning. During learning, a symbolic execution expert generates a large number of quality inputs improving coverage on thousands of programs. Then, a fuzzing policy, represented with a suitable architecture of neural networks, is trained on the generated dataset. The learned policy can then be used to fuzz new programs. We instantiate our approach to the problem of fuzzing smart contracts, a domain where contracts often implement similar functionality (facilitating learning) and security is of utmost importance. We present an end-to-end system, ILF (for Imitation Learning based Fuzzer), and an extensive evaluation over >18K contracts. Our results show that ILF is effective: (i) it is fast, generating 148 transactions per second, (ii) it outperforms existing fuzzers (e.g., achieving 33% more coverage), and (iii) it detects more vulnerabilities than existing fuzzing and symbolic execution tools for Ethereum. -
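The core loop of imitation learning (behavioral cloning) that the abstract describes can be sketched in miniature: an expert labels states with good inputs, and a policy is fit to those demonstrations. In this toy version the "expert", the state encoding, and the trivially learnable policy are all invented stand-ins for symbolic execution and ILF's neural fuzzing policy:

```python
import random

# Minimal behavioral-cloning sketch of the ILF idea: an "expert" labels
# program states with good inputs, and a policy is fit to imitate it.
# The expert, state features, and policy here are invented toy stand-ins
# for symbolic execution and a neural fuzzing policy.

def expert(state):
    # Stand-in for a symbolic-execution expert: picks the input that
    # satisfies the branch encoded in the toy state (threshold, wants_greater).
    threshold, wants_greater = state
    return threshold + 1 if wants_greater else threshold - 1

# Collect a demonstration dataset from the expert.
random.seed(0)
demos = []
for _ in range(200):
    state = (random.randint(0, 100), random.random() < 0.5)
    demos.append((state, expert(state)))

# "Train" a trivial imitation policy: average the expert's offset per branch kind.
offsets = {True: [], False: []}
for (threshold, wants_greater), action in demos:
    offsets[wants_greater].append(action - threshold)
policy = {kind: sum(vals) / len(vals) for kind, vals in offsets.items()}

def fuzz(state):
    # The learned policy generalizes to states never seen during training.
    threshold, wants_greater = state
    return threshold + policy[wants_greater]

print(fuzz((50, True)), fuzz((50, False)))
```

The appeal, as in the paper, is that the expensive expert runs only at training time; the cheap learned policy fuzzes new programs at high throughput.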
"Generating precise error specifications for C: a zero shot learning approach" (4)
In C programs, error specifications, which specify the value range that each function ...
... returns to indicate failures, are widely used to check and propagate errors for the sake of reliability and security. Various kinds of C analyzers employ error specifications for different purposes, e.g., to detect error handling bugs, yet a general approach for generating precise specifications is still missing. This limits the applicability of those tools. In this paper, we solve this problem by developing a machine learning-based approach named MLPEx. It generates error specifications by analyzing only the source code, and is thus general. We propose a novel machine learning paradigm based on transfer learning, enabling MLPEx to require only one-time minimal data labeling from us (as the tool developers) and zero manual labeling efforts from users. To improve the accuracy of generated error specifications, MLPEx extracts and exploits project-specific information. We evaluate MLPEx on 10 projects, including 6 libraries and 4 applications. An investigation of 3,443 functions and 17,750 paths reveals that MLPEx generates error specifications with a precision of 91% and a recall of 94%, significantly higher than those of state-of-the-art approaches. To further demonstrate the usefulness of the generated error specifications, we use them to detect 57 bugs in 5 tested projects. -
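To make the notion of an "error specification" concrete, here is a toy sketch: given return values observed on paths believed to be failures vs. successes, infer which values exclusively signal errors. This is only an illustration of the concept; MLPEx actually learns specifications from source-code features via transfer learning, not from runtime observations, and the function and values below are invented:

```python
# Toy illustration of an error specification: the set of return values a
# C function uses exclusively to signal failure (e.g., -1 or NULL).
# The observations below are invented; MLPEx derives this statically.

def infer_error_spec(observations):
    # observations: (return_value, is_error_path) pairs for one function.
    error_vals = {v for v, is_err in observations if is_err}
    ok_vals = {v for v, is_err in observations if not is_err}
    # Keep only values that never appear on a success path.
    return sorted(error_vals - ok_vals)

# e.g., a function returning NULL (modeled as 0 here) or -1 on failure.
obs = [(0, True), (-1, True), (1, False), (42, False), (0, True)]
print(infer_error_spec(obs))
```

A checker armed with such a specification can then flag call sites that ignore a return value in the error range.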
-
Structural Language Models for Any-Code Generation (3)
-
AutoPandas: Neural-Backed Generators for Program Synthesis (3)
-
N-Grams as a Measure of Naturalness and Complexity (3)
-
A machine learning based automatic folding of dynamically typed languages (3)
-
Encodings for Enumeration-Based Program Synthesis (3)
-
Do People Prefer "Natural" code? (3)
-
MACHINE LEARNING FOR CODE SYNTHESIS AND ANALYSIS (3)
-
Relational Verification using Reinforcement Learning (3)
-
Modeling Security Weaknesses to Enable Practical Run-time Defenses (3)
-
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search (3)
-
DeepVS: An Efficient and Generic Approach for Source Code Modeling Usage (2)
-
Novel positional encodings to enable tree-based transformers (2)
-
Twin-Finder: Integrated Reasoning Engine for Pointer-related Code Clone Detection (2)
-
On The Quality of Identifiers in Test Code (2)
-
A Deep Learning Model for Source Code Generation (2)
-
zkay: Specifying and Enforcing Data Privacy in Smart Contracts (2)
-
Learning based Methods for Code Runtime Complexity Prediction (2)
-
Poster: Finding JavaScript Name Conflicts on the Web (2)
-
Learning Lenient Parsing & Typing via Indirect Supervision (2)
-
Deep Transfer Learning for Source Code Modeling (2)
-
Evaluating Lexical Approximation of Program Dependence (2)
-
Universal Approximation with Certified Networks (2)
-
Class Name Recommendation based on Graph Embedding of Program Elements (2)
-
Assessing the Generalizability of code2vec Token Embeddings (2)
-
Neural Program Synthesis By Self-Learning (2)
-
Compiler Auto-Vectorization with Imitation Learning (2)
-
Embedding Symbolic Knowledge into Deep Networks (1)
-
CLN2INV: Learning Loop Invariants with Continuous Logic Networks (1)
-
Enabling Efficient Parallelism for Applications with Dependences and Irregular Memory Accesses (1)
-
Keep It Simple: Graph Autoencoders Without Graph Convolutional Networks (1)
-
CPC: automatically classifying and propagating natural language comments via program analysis (1)
-
NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning (1)
-
Transferring Java Comments Based on Program Static Analysis (1)
-
LSC: Online Auto-Update Smart Contracts for Fortifying Blockchain-Based Log Systems (1)
-
Selective Monitoring Without Delay for Probabilistic Systems (1)
-
Improving Textual Network Learning with Variational Homophilic Embeddings (1)
-
Abstraction Mechanism on Neural Machine Translation Models for Automated Program Repair (1)
-
Neural Relational Inference with Fast Modular Meta-learning (1)
-
Graph Enhanced Cross-Domain Text-to-SQL Generation (1)
-
On the use of supervised machine learning for assessing schedulability: application to Ethernet TSN (1)
-
Online Robustness Training for Deep Reinforcement Learning (1)
-
Disentangling Interpretable Generative Parameters of Random and Real-World Graphs (1)
-
Sequence Model Design for Code Completion in the Modern IDE (1)
-
Testing Neural Program Analyzers (1)
-
Parallel Iterative Edit Models for Local Sequence Transduction (1)
-
Mode Personalization in Trip-Based Transit Routing (1)
-
Neural Attribution for Semantic Bug-Localization in Student Programs (1)
-
Why is Developing Machine Learning Applications Challenging? A Study on Stack Overflow Posts (1)
-
A Survey of Compiler Testing (1)
-
Deep Representation Learning for Code Smells Detection using Variational Auto-Encoder (1)
-
Inverse‐QSPR for de novo design: a review (1)
-
Static Detection of Event-Driven Races in HTML5-Based Mobile Apps (1)
-
NutBaaS: A Blockchain-as-a-Service Platform (1)
-
A Generative Model for Molecular Distance Geometry (1)
-
Memory Augmented Recursive Neural Networks (1)
-
Concealment of iris features based on artificial noises (1)
-
Code Generation as a Dual Task of Code Summarization (1)
-
Formal Verification of Workflow Policies for Smart Contracts in Azure Blockchain (1)
-
Coda: An End-to-End Neural Program Decompiler (1)
-
Program Synthesis by Type-Guided Abstraction Refinement (1)
-
Imitation-Projected Programmatic Reinforcement Learning (1)
-
Towards Robust Direct Perception Networks for Automated Driving (1)
-
A multi-stage anomaly detection scheme for augmenting the security in IoT-enabled applications (1)
-
Learning from Examples to Find Fully Qualified Names of API Elements in Code Snippets (1)
-
Augmented Example-based Synthesis using Relational Perturbation Properties (1)
-
Neural Speech Translation using Lattice Transformations and Graph Networks (1)
-
Progressive Processing of System-Behavioral Query (1)
-
Semantic Preserving Generative Adversarial Models (1)
-
An Evaluation of Programming Language Models' performance on Software Defect Detection (1)
-
Speech Recognition with Augmented Synthesized Speech (1)
-
Software Engineering Meets Deep Learning: A Literature Review (1)
-
The Internet of Things and Machine Learning, Solutions for Urban Infrastructure Management (1)
-
Exploring Robust Neural Methods in Inductive Program Synthesis (1)
-
Coding as another language: a pedagogical approach for teaching computer science in early childhood (1)
-
Program Synthesis for Programmers (1)
-
A (CO)ALGEBRAIC APPROACH TO PROGRAMMING AND VERIFYING COMPUTER NETWORKS (1)
-
Efficient Graph Generation with Graph Recurrent Attention Networks (1)
-
Combining Constraint Languages via Abstract Interpretation (1)
-
Towards neural networks that provably know when they don't know (1)
-
ASTToken2Vec: An Embedding Method for Neural Code Completion (1)
-
Beyond the Single Neuron Convex Barrier for Neural Network Certification (1)
-
Word Embedding Algorithms as Generalized Low Rank Models and their Canonical Form (1)
-
A comparison of end-to-end models for long-form speech recognition (1)
-
SoCodeCNN: Program Source Code for Visual CNN Classification Using Computer Vision Methodology (1)
-
Recognizing lines of code violating company-specific coding guidelines using machine learning (1)
Software developers in big and medium-sized companies work with millions of lines of code in their codebases. Assuring the quality of this code has shifted from simple defect management to proactive assurance of internal code quality. Although static code analysis and code reviews have been at the forefront of research and practice in this area, code reviews are still an effort-intensive and interpretation-prone activity. The aim of this research is to support code reviews by automatically recognizing company-specific code guideline violations in large-scale, industrial source code. In our action research project, we constructed a machine-learning-based tool for code analysis where software developers and architects in big and medium-sized companies can use a few examples of source code lines violating code/design guidelines (up to 700 lines of code) to train decision-tree classifiers to find similar violations in their codebases (up to 3 million lines of code). Our action research project consisted of (i) understanding the challenges of two large software development companies, (ii) applying the machine-learning-based tool to detect violations of Sun’s and Google’s coding conventions in the code of three large open-source projects implemented in Java, (iii) evaluating the tool on an evolving industrial codebase, and (iv) finding the best learning strategies to reduce the cost of training the classifiers. We were able to achieve an average accuracy of over 99% and an average F-score of 0.80 for the open-source projects when using ca. 40K lines for training the tool. We obtained a similar average F-score of 0.78 for the industrial code, but this time using only up to 700 lines of code as a training dataset. Finally, we observed that the tool performed visibly better for rules that require understanding a single line of code or the context of only a few lines (often reaching an F-score of 0.90 or higher).
Based on these results, we observe that this approach can give modern software development companies the ability to use examples to teach an algorithm to recognize violations of code/design guidelines, and thus to increase the number of reviews conducted before the product release. This, in turn, leads to increased quality of the final software.
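The train-from-a-few-examples setup can be sketched with a deliberately simplified learner. Instead of the paper's decision-tree classifier over token features, this toy version learns a single length threshold (a decision stump) from labeled example lines; the lines, labels, and the stand-in "guideline" are all invented:

```python
# Minimal sketch of learning to flag guideline-violating lines from a few
# labeled examples. A decision stump over line length stands in for the
# paper's decision-tree classifier over token features -- a simplification,
# not the authors' method. Training lines and labels are invented.

def train_stump(lines, labels):
    # Pick the length threshold best separating violations from clean lines.
    best_t, best_acc = 0, -1.0
    for t in range(1, 120):
        preds = [len(line) > t for line in lines]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

train_lines = [
    "x = 1",
    "y = compute(a, b)",
    "result = some_function(argument_one, argument_two, argument_three, argument_four)",
    "total = first_value + second_value + third_value + fourth_value + fifth",
]
train_labels = [False, False, True, True]  # True = violates a max-length guideline

threshold = train_stump(train_lines, train_labels)
print(threshold)
```

The paper's finding that single-line rules work best fits this picture: such rules depend only on features visible in one line, exactly what a per-line classifier can see.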
-
Multi-Modal Attention Network Learning for Semantic Source Code Retrieval (1)
Code retrieval techniques and tools play a key role in helping software developers retrieve existing code fragments from available open-source repositories given a user query (e.g., a short natural language text describing the functionality of a desired code snippet). Despite existing efforts to improve the effectiveness of code retrieval, two main issues still hinder accurate retrieval of satisfactory code fragments from large-scale repositories for complicated queries. First, existing approaches consider only shallow features of source code, such as method names and code tokens, while ignoring structured features such as abstract syntax trees (ASTs) and control-flow graphs (CFGs), which contain rich and well-defined semantics of source code. Second, although deep learning-based approaches represent source code well, they lack explainability, making it hard to interpret the retrieval results and almost impossible to understand which features of the source code contribute most to the final results. To tackle these two issues, this paper proposes MMAN, a novel Multi-Modal Attention Network for semantic source code retrieval. A comprehensive multi-modal representation is developed for representing unstructured and structured features of source code, with one LSTM for the sequential tokens of code, a Tree-LSTM for the AST, and a GGNN (Gated Graph Neural Network) for the CFG. Furthermore, a multi-modal attention fusion layer is applied to assign weights to different parts of each modality of source code and then integrate them into a single hybrid representation. Comprehensive experiments and analysis on a large-scale real-world dataset show that our proposed model can accurately retrieve code snippets and outperforms the state-of-the-art methods.
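The attention-fusion step described above can be sketched in isolation: given one encoding per modality, compute a softmax attention weight for each and take the weighted sum. The three-dimensional encodings and the scoring vector below are invented toy values, not MMAN's learned parameters:

```python
from math import exp

# Hypothetical sketch of multi-modal attention fusion: three modality
# encodings (standing in for the token LSTM, AST Tree-LSTM, and CFG GGNN
# outputs) are combined into one hybrid vector via softmax attention.
# All vectors below are invented toy values, not learned parameters.

def softmax(xs):
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def fuse(modalities, score_vec):
    # One attention score per modality: dot product with a scoring vector.
    scores = [sum(a * b for a, b in zip(m, score_vec)) for m in modalities]
    weights = softmax(scores)
    dim = len(modalities[0])
    # Hybrid representation: attention-weighted sum of the modality vectors.
    return [sum(w * m[i] for w, m in zip(weights, modalities)) for i in range(dim)]

tokens_enc = [0.2, 0.8, 0.1]   # stand-in for the token-sequence encoding
ast_enc    = [0.6, 0.1, 0.3]   # stand-in for the AST encoding
cfg_enc    = [0.1, 0.2, 0.9]   # stand-in for the CFG encoding
hybrid = fuse([tokens_enc, ast_enc, cfg_enc], [1.0, 1.0, 1.0])
print([round(x, 3) for x in hybrid])
```

Because the weights are explicit, this style of fusion also gives the explainability the abstract asks for: inspecting them shows which modality dominated a given retrieval.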
-
Translationese as a Language in" Multilingual" NMT (1)
Machine translation has an undesirable propensity to produce “translationese” artifacts, which can lead to higher BLEU scores while being liked less by human raters. Motivated by this, we model translationese and original (i.e. natural) text as separate languages in a multilingual model, and pose the question: can we perform zero-shot translation between original source text and original target text? There is no data with original source and original target, so we train sentence-level classifiers to distinguish translationese from original target text, and use this classifier to tag the training data for an NMT model. Using this technique we bias the model to produce more natural outputs at test time, yielding gains in human evaluation scores on both accuracy and fluency. Additionally, we demonstrate that it is possible to bias the model to produce translationese and game the BLEU score, increasing it while decreasing human-rated quality. We analyze these models using metrics to measure the degree of translationese in the output, and present an analysis of the capriciousness of heuristically-based train-data tagging.
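The train-data tagging step the abstract describes can be sketched as follows: a classifier decides whether each target sentence is translationese, and the source side is prefixed with a corresponding tag token so the NMT model can condition on it. The classifier here is a trivial invented rule, and the tag names and sentence pair are illustrative assumptions, not the paper's actual setup:

```python
# Toy sketch of tagging NMT training data by target-side style. The
# "classifier" is an invented stand-in rule; the real system trains a
# sentence-level classifier. Tag tokens and sentences are illustrative.

def looks_translationese(target):
    # Stand-in for the trained classifier: flags overly literal phrasing.
    return "as for" in target.lower()

def tag_pair(source, target):
    tag = "<translationese>" if looks_translationese(target) else "<original>"
    # The tag is prepended to the source, like a target-language token
    # in multilingual NMT, so the model learns to condition on style.
    return f"{tag} {source}", target

tagged = [
    tag_pair("Das Wetter ist gut.", "As for the weather, it is good."),
    tag_pair("Das Wetter ist gut.", "The weather is nice."),
]
for src, _ in tagged:
    print(src)
```

At inference time, requesting the `<original>` tag then biases the model toward natural-sounding output, which is the zero-shot trick the abstract poses.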