Machine Learning for Software Engineering papers at ICML 2023

ICML 2023

Workshops

Papers

LLMs on Code

Evaluations

Benchmarks

Metrics

Other

[
{
"title": "Planning with Large Language Models for Code Generation",
"abstract": "Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process. Although the programs they generate achieve high token-matching-based scores, they often fail to compile or generate incorrect outputs. The main reason is that conventional Transformer decoding algorithms may not be the best choice for code generation. In this work, we propose a novel Transformer decoding algorithm, Planning-Guided Transformer Decoding (PG-TD), that uses a planning algorithm to do lookahead search and guide the Transformer to generate better programs. Specifically, instead of simply optimizing the likelihood of the generated sequences, the Transformer makes use of a planner that generates candidate programs and tests them on public test cases. The Transformer can therefore make more informed decisions and generate tokens that will eventually lead to higher-quality programs. We also design a mechanism that shares information between the Transformer and the planner to make our algorithm computationally efficient. We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks, showing that 1) it can generate programs that consistently achieve higher performance compared with competing baseline methods; 2) it enables controllable code generation, such as concise codes and highly-commented codes by optimizing modified objective.",
"url": "https://openreview.net/forum?id=Lr8cOOtYbfL"
},
{
"title": "StarCoder: may the source be with you!",
"abstract": "The BigCode community, an open-scienti\ufb01c collaboration working on the responsi-\nble development of Large Language Models for Code (Code LLMs), introduces\nStarCoder and StarCoderBase: 15.5B parameter models with 8K context length,\nin\ufb01lling capabilities and fast large-batch inference enabled by multi-query attention.\nStarCoderBase is trained on 1 trillion tokens sourced from The Stack (Kocetkov\net al., 2022), a large collection of permissively licensed GitHub repositories with in-\nspection tools and an opt-out process. We \ufb01ne-tuned StarCoderBase on 35B Python\ntokens, resulting in the creation of StarCoder. We perform the most comprehensive\nevaluation of Code LLMs to date and show that StarCoderBase outperforms every\nopen Code LLM that supports multiple programming languages and matches or\noutperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder out-\nperforms every model that is \ufb01ne-tuned on Python, can be prompted to achieve 40%\npass@1 on HumanEval, and still retains its performance on other programming\nlanguages. We take several important steps towards a safe open-access model\nrelease, including an improved PII redaction pipeline and a novel attribution tracing\ntool, and make the StarCoder models publicly available under a more commercially\nviable version of the Open Responsible AI Model license.",
"keywords": [
[
"date starcoderbase",
0.4143
],
[
"openai code",
0.4175
],
[
"bigcode community",
0.4795
],
[
"tuned starcoderbase",
0.4909
],
[
"starcoder models",
0.5177
]
],
"url": "https://arxiv.org/abs/2305.06161"
},
{
"title": "\u201cIF IT\u2019S WHAT IWANTED THAT \u2019S GREAT ,BUT IF IT \u2019S NOT, I JUST WASTED TIME \u201d: E XAMINING THE PER -CEIVED COSTS /BENEFITS OF ML- ENHANCED DEVEL-OPER TOOLING",
"abstract": "ML-enhanced software development tooling is changing the way engineers write code. Even though development of such tools has accelerated, studies have primarily focused on the accuracy and performance of underlying models, rather than user experience. Recently however, researchers have begun to explore how developers interact with ML-enhanced tooling to better understand impact on workflows. We investigate what factors influence developers\u2019 decision to interact with ML-based suggestions by analyzing the perceived benefits and costs of ML-based assistance and then, compare developers\u2019 experiences with different prototypes. Our findings suggest that developers engage in a form of ad-hoc cost-benefit analysis mediated by their familiarity with the task and the complexity of the suggestion. We close by providing design guidance aimed at increasing the perceived benefits and lowering costs. This research is intended to spark a growing conversation about what makes successful interactions with ML-based software development tooling.",
"keywords": [
[
"tooling better",
0.3805
],
[
"enhanced software",
0.4206
],
[
"workflows investigate",
0.4242
],
[
"suggest developers",
0.4749
],
[
"developers engage",
0.4959
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/4.pdf"
},
{
"title": "CODEGEN2: LESSONS FOR TRAINING LLMS ON PROGRAMMING AND NATURAL LANGUAGES",
"abstract": "Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) in-fill sampling, and (4) data distributions. We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into four lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and, 16B parameters, along with the training framework as open-source.",
"keywords": [
[
"models llms",
0.3235
],
[
"synthesis understanding",
0.3609
],
[
"release codegen2",
0.3631
],
[
"parameters training",
0.3802
],
[
"large language",
0.4844
]
],
"url": "https://arxiv.org/abs/2305.02309"
},
{
"title": "R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents",
"abstract": "Large language models show impressive results at predicting structured text such as code, but also commonly introduce errors and hallucinations in their output. When used to assist software developers, these models may make mistakes that users must go back and \ufb01x, or worse, introduce subtle bugs that users may miss entirely. We propose Randomized Utility-driven Synthesis of Uncertain Regions (R-U-SURE), an approach for building uncertainty-aware suggestions based on a decision-theoretic model of goal-conditioned utility, using random samples from a generative model as a proxy for the unobserved possible intents of the end user. Our technique combines minimum-Bayes-risk decoding, dual decomposition, and decision diagrams in order to ef\ufb01ciently produce structured uncertainty summaries, given only sample access to an arbitrary generative model of code and an optional AST parser. We demonstrate R-U-SURE on three developer-assistance tasks, and show that it can be applied different user interaction patterns without retraining the model and leads to more accurate uncertainty estimates than token-probability baselines. We also release our implementation as an open-source library at https://github.com/google-research/r_u_sure.",
"keywords": [
[
"suggestions based",
0.2959
],
[
"arbitrary generative",
0.2961
],
[
"software developers",
0.3063
],
[
"introduce errors",
0.3254
],
[
"randomized utility",
0.3328
]
],
"url": "https://arxiv.org/abs/2303.00732"
},
{
"title": "CONVERSATIONAL AUTOMATED PROGRAM REPAIR",
"abstract": "Automated Program Repair (APR) can help developers automatically generate patches for bugs. ... In conversational APR, we iteratively build the input to the model by combining previously generated patches with validation feedback. ... We evaluate 10 different LLM including the newly developed ChatGPT model to demonstrate the improvement of conversational APR over the prior LLM for APR approach.",
"keywords": [
[
"automated program",
0.386
],
[
"improvement conversational",
0.4008
],
[
"apr help",
0.4131
],
[
"patches bugs",
0.4357
],
[
"chatgpt model",
0.4603
]
],
"url": "https://arxiv.org/abs/2301.13246"
},
{
"title": "A GENERIC PROMPT FOR AN LLM THAT ENABLES NL-TO-SQL ACROSS DOMAINS AND COMPOSITIONS",
"abstract": "Large Language Models (LLMs) pretrained for code generation have demonstrated remarkable performance for multiple high level reasoning tasks. One such challenging task is Cross-Domain and Cross-Compositional generalization of Text-to-SQL semantic parsing, where the model is expected to exhibit generalization to novel compositions not seen during training. In this paper, we evaluate the capabilities of Codex, the GPT3 model pretrained with code, in both zero-shot and few-shot settings. Existing LLM (Codex) based approaches for this task rely on inference-time retrieval of similar few-shot samples from the training set, to build a run-time prompt for test time SQL query generation. In contrast, we devise an algorithm to come up with a minimal set of few-shots from the available data with complete coverage of SQL clauses, operators and functions and maximum coverage of available domains. We combine these few-shots with the out-of-distribution test query to define what we term as a Generic Prompt (GP) , which is further used to generate the corresponding SQL. The GP, being common across distinct test queries, not only provides us with a more efficient solution avoiding test time retrieval, but also yields SOTA few-shot cross-domain generalization results with an execution accuracy of 70.64% on the Spider-Dev set. We also evaluate the Generic Prompt on the App-Dev split of Spider-CG dataset for compositional generalization and Kaggle DBQA zero-shot cross-domain generation dataset obtaining few shot execution accuracy of 55.41% and 32.61%, respectively, with 3.25% and 11.41% absolute improvement over the zero-shot performance with Codex. We also showcase that for all the three datasets, the GPleads to a better performance than a prompt constructed by equal number of randomly selected exemplars from the available data.",
"keywords": [
[
"domain generation",
0.3002
],
[
"pretrained code",
0.3044
],
[
"test query",
0.3075
],
[
"text sql",
0.3215
],
[
"compositional generalization",
0.3371
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/3.pdf"
},
{
"title": "END-TO-END NEURAL PERMUTATION PROGRAM SYNTHESIS",
"abstract": "Permutations are the most general abstraction describing finitary actions affecting finitely many elements, yet also the simplest class of programs imaginable. We propose a neural model that, given input-output pair examples, finds both a minimal set of atomic operations and the programs mapping inputs to outputs in terms of these atoms. Our model, DISPER, achieves 100% program reconstruction accuracy when the atoms are known and performs well even when tasked with identifying distinct groups of atomic operations in a single configuration. ...",
"keywords": [
[
"elements simplest",
0.36
],
[
"finitary actions",
0.3782
],
[
"neural model",
0.3819
],
[
"accuracy atoms",
0.3838
],
[
"programs mapping",
0.4571
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/5.pdf"
},
{
"title": "Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs",
"abstract": "Aiming to produce reinforcement learning (RL) policies that are human-interpretable and can generalize better to novel scenarios, Trivedi et al. (2021) present a method (LEAPS) that first learns a program embedding space to continuously parameterize diverse programs from a pre-generated program dataset, and then searches for a task-solving program in the learned program embedding space when given a task. Despite encouraging results, the program policies that LEAPS can produce are limited by the distribution of the program dataset. Furthermore, during searching, LEAPS evaluates each candidate program solely based on its return, failing to precisely reward correct parts of programs and penalize incorrect parts. To address these issues, we propose to learn a meta-policy that composes a series of programs sampled from the learned program embedding space. By composing programs, our proposed method can produce program policies that describe out-of-distributionally complex behaviors and directly assign credits to programs that induce desired behaviors. We design and conduct extensive experiments in the Karel domain. The experimental results show that our proposed framework outperforms baselines. The ablation studies confirm the limitations of LEAPS and justify our design choices.",
"keywords": [
[
"searching leaps",
0.3635
],
[
"produce reinforcement",
0.3713
],
[
"program policies",
0.4435
],
[
"learns",
0.4503
],
[
"program embedding",
0.4738
]
],
"url": "https://arxiv.org/abs/2301.12950"
},
{
"title": "LEVERAGING STATIC ANALYSIS FOR BUGREPAIR",
"abstract": "We propose a method combining machine learning with a static analysis tool (i.e. Infer) to automatically repair source code. ... Our method is able to \ufb01nd a function with the same semantics that does not raise a warning with around 97% precision and 66% recall.",
"keywords": [
[
"recall",
0.2576
],
[
"function semantics",
0.2958
],
[
"infer automatically",
0.3303
],
[
"analysis tool",
0.3736
],
[
"repair source",
0.4452
]
],
"url": "https://arxiv.org/abs/2304.10379"
},
{
"title": "CODEBPE: INVESTIGATING SUBTOKENIZATION OPTIONS FOR LARGE LANGUAGE MODEL PRETRAINING ON SOURCE CODE",
"abstract": "Recent works has widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives and investigated the applicability of various Transformer-based language model architectures for source code. This work investigates another important aspect of such models, the effect of different subtokenization options, and aims at identifying most effective and length-efficient subtokenizations, taking into account source code specifics. We propose subtokenziation that reduces average length by 17\u201340% without downstream performance drop, and show that a carefully chosen subtokenization may significantly improve quality by 0.5-2%, possibly with some length increase.",
"keywords": [
[
"code specifics",
0.327
],
[
"architectures source",
0.3279
],
[
"pretraining",
0.3518
],
[
"large language",
0.4202
],
[
"subtokenziation reduces",
0.4496
]
],
"url": "https://openreview.net/forum?id=rd-G1nO-Jbq"
},
{
"title": "CONTRA CLM: C ONTRASTIVE LEARNING FOR CAUSAL LANGUAGE MODEL",
"abstract": "Despite exciting progress in causal language models, the expressiveness of the representations is largely limited due to poor discrimination ability...",
"keywords": [
[
"poor",
0.1477
],
[
"largely limited",
0.1688
],
[
"progress causal",
0.4067
],
[
"discrimination ability",
0.4254
],
[
"language models",
0.5243
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/17.pdf"
},
{
"title": "BigIssue: A Realistic Bug Localization Benchmark",
"abstract": "As machine learning tools progress, the inevitable question arises: How can machine learning help us write better code? With significant progress being achieved in natural language processing with models like GPT-3 and Bert, the applications of natural language processing techniques to code are starting to be explored. We propose BigIssue: a benchmark for realistic bug localization. The goal of the benchmark is two-fold...",
"keywords": [
[
"localization",
0.2797
],
[
"bert",
0.2926
],
[
"bigissue benchmark",
0.3308
],
[
"learning tools",
0.3584
],
[
"realistic bug",
0.3591
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/19.pdf"
},
{
"title": "ObSynth: An Interactive Synthesis System for Generating Object Models from Natural Language Specifications",
"abstract": "We introduce ObSynth, an interactive system leveraging the domain knowledge embedded in large language models (LLMs) to help users design object models from high level natural language prompts. This is an example of specification reification, the process of taking a high-level, potentially vague specification and reifying it into a more concrete form. We evaluate ObSynth via a user study, leading to three key findings: first, object models designed using ObSynth are more detailed, showing that it often synthesizes fields users might have otherwise omitted. Second, a majority of objects, methods, and fields generated by ObSynth are kept by the user in the final object model, highlighting the quality of generated components. Third, ObSynth altered the workflow of participants: they focus on checking that synthesized components were correct rather than generating them from scratch, though ObSynth did not reduce the time participants took to generate object models.",
"keywords": [
[
"fields generated",
0.3447
],
[
"domain knowledge",
0.3582
],
[
"obsynth interactive",
0.359
],
[
"language prompts",
0.414
],
[
"specification reification",
0.4573
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/21.pdf"
},
{
"title": "TAST Y: A TRANSFORMER BASED APPROACH TO SPACE AND TIME COMPLEXITY",
"abstract": "Code based Language Models (LMs) have shown very promising results in the field of software engineering with applications such as code refinement, code completion and generation. However, the task of time and space complexity classification from code has not been extensively explored due to a lack of datasets, with prior endeavors being limited to Java. In this project, we aim to address these gaps by creating a labelled dataset of code snippets spanning multiple languages (Python and C++ datasets currently, with C, C#, and JavaScript datasets being released shortly). We find that existing time complexity calculation libraries and tools only apply to a limited number of use-cases. The lack of a well-defined rule based system motivates the application of several recently proposed code-based LMs. We demonstrate the effectiveness of dead code elimination and increasing the maximum sequence length of LMs. In addition to time complexity, we propose to use LMs to find space complexities from code, and to the best of our knowledge, this is the first attempt to do so. Furthermore, we introduce a novel code comprehension task, called cross-language transfer, where we fine-tune the LM on one language and run inference on another. Finally, we visualize the activation of the attention fed classification head of our LMs using Non-negative Matrix Factorization (NMF) to interpret our results.",
"keywords": [
[
"libraries tools",
0.358
],
[
"language models",
0.3738
],
[
"software engineering",
0.379
],
[
"space complexity",
0.4005
],
[
"refinement code",
0.4095
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/22.pdf"
},
{
"title": "SEMANTIC PREFIXES AS AN AUXILIARY TARGET FOR\nCODE UNDERSTANDING AND GENERATION",
"abstract": "Code understanding and generation require learning the mapping between human\nand programming languages. As human and programming languages are different\nin vocabulary, semantic, and, syntax, it is challenging for an autoregressive model\nto generate a sequence of tokens that is both semantically (i.e., carry the right\nmeaning) and syntactically correct (i.e., in the right sequence order). Inspired\nby this, we propose a prefix-based learning framework to lessen the burden of\nan autoregressive generation model by decoupling the learning of semantic and\nsyntactic dependencies. In particular, during the training we prepend the target\noutput with a semantic embedding that encodes the output sequence. In this\nway, a model learns to first predict the semantics of the output before generating\nthe sequence. Evaluating on 11 code understanding and generation datasets, we\nshow that our prefix-prepending approach improves the baseline by an average\nof 8.1% in exact match and 5.5% in CodeBLEU. It also either outperforms or is\non-par with the state-of-the-art methods across a variety of code understanding\ntasks. Our approach is general and can be used as a meta algorithm on top of any\nautoregressive language model.\n",
"keywords": [
[
"generation datasets",
0.3772
],
[
"syntactic dependencies",
0.3928
],
[
"tokens semantically",
0.4073
],
[
"code understanding",
0.412
],
[
"autoregressive language",
0.4178
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/23.pdf"
},
{
"title": "Learning Performance-Improving Code Edits",
"abstract": "The waning of Moore\u2019s Law has shifted the focus of the tech industry towards alternative methods for continued performance gains. While optimizing compilers are a standard tool to help increase program efficiency, programmers continue to shoulder much responsibility in crafting and refactoring code with better performance characteristics. In this paper, we investigate the ability of large language models (LLMs) to suggest functionally correct, performance improving code edits. We hypothesize that language models can suggest such edits in ways that would be impractical for static analysis alone. We investigate these questions by curating a large-scale dataset of Performance-Improving Edits, PIE. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program\u2019s performance. We use PIE to evaluate and improve the capacity of large language models. Specifically, we use examples from PIE to fine-tune multiple variants of CODEGEN, a billion-scale Transformer-decoder model. Additionally, we use examples from PIE to prompt OpenAI\u2019s CODEX using a few-shot prompting. By leveraging PIE, we find that both CODEX and CODEGEN can generate performance-improving edits, with speedups of more than 2.5 for over 25% of the programs, for C++ and Python, even after the C++ programs were compiled using the O3 optimization level. Crucially, we show that PIE allows CODEGEN, an open-sourced and 10 smaller model than CODEX, to match the performance of CODEX on this challenging task. Overall, this work opens new doors for creating systems and methods that can help programmers write efficient code.",
"keywords": [
[
"generate performance",
0.3997
],
[
"crafting refactoring",
0.4015
],
[
"large language",
0.4135
],
[
"compilers standard",
0.4388
],
[
"improving edits",
0.4514
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/28.pdf"
},
{
"title": "SCRIPTORIUM WS: A CODE GENERATION ASSISTANT FOR WEAK SUPERVISION",
"abstract": "Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. Instead of having domain experts write small programs to act as labeling functions (LFs) for weak supervision, this paper proposes using code-generation models to act as coding assistants for crafting weak supervision sources. The paper studies prompting strategies to maximize the quality of the generated sources and settles on a multi-tier strategy that incorporates multiple types of information. The insights gained from the study are then used to introduce ScriptoriumWS, a weak supervision system that greatly improves coverage while maintaining accuracy compared to hand-crafted sources.",
"keywords": [
[
"using code",
0.3438
],
[
"crafted sources",
0.355
],
[
"labeled data",
0.3578
],
[
"scriptoriumws weak",
0.3837
],
[
"supervision popular",
0.4224
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/30.pdf"
},
{
"title": "SYNTHESIS OF MATHEMATICAL PROGRAMS FROM NATURAL LANGUAGE SPECIFICATIONS",
"abstract": "Several decision problems that are encountered in various business domains can be modeled as mathematical programs, i.e. optimization problems. The process of conducting such modeling often requires the involvement of experts trained in operations research and advanced algorithms. Surprisingly, despite the significant advances in the methods for program and code synthesis, AutoML, learning to optimize etc., there has been little or no attention paid to automating the task of synthesizing mathematical programs. We imagine a scenario where the specifications for modeling, i.e. the objective and constraints are expressed in an unstructured form in natural language (NL) and the mathematical program has to be synthesized from such an NL specification. In this work we evaluate the efficacy of employing CodeT5 with data augmentation and post-processing of beams. We utilize GPT-3 with back translation for generation of synthetic examples. Further we apply rules of linear programming to score beams and correct beams based on common error patterns. We observe that with these enhancements CodeT5 base gives an execution accuracy of 0.73 which is significantly better than zero-shot execution accuracy of 0.41 by ChatGPT and 0.36 by Codex.",
"keywords": [
[
"programming score",
0.3725
],
[
"translation generation",
0.3884
],
[
"expressed unstructured",
0.3982
],
[
"specifications modeling",
0.4125
],
[
"learning optimize",
0.4126
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/32.pdf"
},
{
"title": "EXPLICIT KNOWLEDGE TRANSFER FOR WEAKLY - SUPERVISED CODE GENERATION",
"abstract": "Large language models (LLMs) can acquire strong code-generation capabilities through few-shot learning. In contrast, supervised \ufb01ne-tuning is still needed for smaller models to achieve good performance and such \ufb01ne-tuning demands a large number of task-speci\ufb01c NL-code pairs, which are expensive to obtain. In this paper, we attempt to transfer the code generation ability of an LLM to a smaller model with the aid of weakly-supervised data. More speci\ufb01cally, we propose explicit knowledge transfer (EKT), which uses the few-shot capabilities of a teacher LLM to create NL-code pairs that we then \ufb01lter for correctness and \ufb01ne-tune the student on. We evaluate EKT on the task of generating code solutions to math word problems from the GSM8k dataset. We \ufb01nd that EKT not only yields better performance than training with expert iteration, but also outperforms knowledge distillation, another form of knowledge transfer. A GPT-Neo 1.3B model trained using EKT with a GPT-J teacher achieves a 12.4% PASS @100on GSM8k, while the same student and teacher trained with knowledge distillation yield only a 3.7% PASS @100. We also show that it is possible for a student model to outperform the teacher using EKT.",
"keywords": [
[
"teacher trained",
0.3308
],
[
"generating code",
0.371
],
[
"large language",
0.3722
],
[
"knowledge distillation",
0.3772
],
[
"weakly supervised",
0.4111
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/33.pdf"
},
{
"title": "XLCoST: A Benchmark Dataset for Cross-Lingual Code Intelligence",
"abstract": "Recent advances in machine learning have significantly improved the understanding of source code data and achieved good performance on a number of downstream tasks. This paper introduces XLCoST, Cross-Lingual Code Snippet dataset, a new benchmark dataset for cross-lingual code intelligence. Our dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-lingual code tasks. To the best of our knowledge, it is the largest parallel dataset for source code both in terms of size and the number of languages. We believe this new dataset can be a valuable asset for the research community and facilitate the development and validation of new methods for cross-lingual code intelligence.",
"keywords": [
[
"languages commonly",
0.3632
],
[
"benchmark dataset",
0.3655
],
[
"code snippet",
0.4055
],
[
"parallel dataset",
0.4123
],
[
"cross lingual",
0.5065
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/34.pdf"
},
{
"title": "APERFCODE: A UTO CONVERSION TO PERFORMANT CODE",
"abstract": "Advanced developers often review code to improve the quality of readability, security, stability, cost, memory and compute performance, etc.... We seek to leverage the general-purpose knowledge and understanding being demonstrated by LLMs like ChatGPT to help automate the process of improving the performant quality of code. We have evaluated our approach against hand accelerated codebase of a proprietary dataset of a large... for 56.52% cases, the speedup is2\u00d7more than with hand-accelerated ones.",
"keywords": [
[
"speedup",
0.3785
],
[
"developers review",
0.3802
],
[
"like chatgpt",
0.3853
],
[
"codebase proprietary",
0.4326
],
[
"memory compute",
0.4353
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/35.pdf"
},
{
"title": "MULTILINGUAL CODE RETRIEVAL WITHOUT PAIRED DATA: A NEW BENCHMARK AND EXPERIMENTS",
"abstract": "We seek to overcome limitations to code retrieval quality posed by the scarcity of data containing pairs of code snippets and natural language queries in languages other than English. We correspondingly test the following hypothesis: if a model can map from English to code, and from other natural languages to English, then how well can the model directly map from those non-English languages into representations of code? To do so, we introduce two new datasets. For training models, we build a corpus corresponding to paired English/Code data and combine it with existing translation datasets given by pairs of English and other natural languages. For evaluation, we make a new benchmark available, dubbed M2CRB, containing pairs of text and code, for multiple natural and programming language pairs \u2013 namely: Spanish, Portuguese, German, and French, each paired with code snippets for: Python, Java, and JavaScript. Evaluation on both our new benchmark tasks as well as on an existing code-to-code search task confirms our hypothesis: models are able to generalize to unseen source/target language pairs they indirectly observed during training. We examine models which both generate and retrieve natural and programming languages and through ablations, we further verify the influence of different design choices and training tasks in terms of whether or not they contribute to generalization with unseen language pairs.",
"keywords": [
[
"pairs english",
0.3833
],
[
"unseen language",
0.405
],
[
"translation datasets",
0.4133
],
[
"code snippets",
0.4635
],
[
"code search",
0.5205
]
],
"url": "https://dl4c.github.io/assets/pdf/papers/12.pdf"
},
{
"title": "ReCode : Robustness Evaluation of Code Generation Models",
"abstract": "Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.",
"keywords": [
[
"comprehensive benchmark",
0.3576
],
[
"text code",
0.3922
],
[
"generation models",
0.4026
],
[
"code tasks",
0.406
],
[
"classification robustness",
0.4402
]
],
"url": "https://arxiv.org/abs/2212.10264"
},
{
"title": "CODEBERTS CORE : EVALUATING CODE GENERATION WITH PRETRAINED MODELS OF CODE",
"abstract": "Since the rise of neural models of code that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose Code-BERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching as BLEU, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We \ufb01nd that Code-BERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release \ufb01ve language-speci\ufb01c pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score. Our language-speci\ufb01c models have been downloaded more than 25,000 times from the Huggingface Hub.",
"keywords": [
[
"code reference",
0.4064
],
[
"builds bertscore",
0.412
],
[
"generated tokens",
0.4169
],
[
"bert score",
0.4203
],
[
"multilingual codebert",
0.4603
]
],
"url": "https://arxiv.org/abs/2302.05527"
}
]
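
Each entry above follows the same schema: a title, an abstract, a url, and (for most papers) a keywords field holding [phrase, score] pairs. A minimal sketch of working with the listing, assuming the JSON array above is saved as icml2023_ml4se.json (the file name is hypothetical):

import json

# Load the paper listing and print each entry with its URL and,
# when keywords are present, its highest-scoring keyphrase.
with open("icml2023_ml4se.json", encoding="utf-8") as f:
    papers = json.load(f)

for paper in papers:
    top = ""
    if paper.get("keywords"):
        phrase, score = max(paper["keywords"], key=lambda kw: kw[1])
        top = f" [{phrase}: {score:.2f}]"
    print(f"{paper['title']} - {paper['url']}{top}")

The listing does not say how the keyword/score pairs were produced; they look like cosine-similarity scores from an embedding-based keyphrase extractor. A hedged reconstruction with KeyBERT (an assumption, not necessarily the tool used here):

from keybert import KeyBERT

# Hypothetical reconstruction: extract five two-word keyphrases per abstract.
# Reuses the `papers` list loaded in the previous snippet.
kw_model = KeyBERT()
for paper in papers:
    keyphrases = kw_model.extract_keywords(
        paper["abstract"],
        keyphrase_ngram_range=(2, 2),  # the phrases above are mostly two words
        stop_words="english",
        top_n=5,
    )
    print(paper["title"], keyphrases)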