AuthEceSoftEng <> Towards Automatically Generating a Personalized Code Formatting Mechanism

The Problem

Source code readability and comprehensibility have gained increased interest in recent years, due to the wide adoption of component-based software development and the (re)use of software residing in code hosting platforms. Among the various approaches proposed, consistent code styling and formatting across a project have been shown to significantly improve both readability and the ability of developers to understand the context, the functionality and the purpose of a block of code. Most code formatting approaches rely on a set of rules defined by experts that aspire to model a commonly accepted formatting. This approach is usually based on the experts' expertise and best-practice knowledge, is time-consuming, and does not take into account the way a team develops software. Thus, it becomes too intrusive and, in many cases, is not adopted. In this work, we present an automated mechanism that, given a set of source code files (for example, a set of repositories that a team has developed), can be trained to recognize the formatting style used across a project and identify deviations from it, in a completely unsupervised manner. At first, source code is transformed into small meaningful pieces, called tokens. The tokens residing in the source code are used to train the Long Short-Term Memory and Support Vector Machine models of our mechanism, in order to predict the probability of a token being wrongly positioned. Preliminary evaluation on two different axes indicates that our approach can effectively detect deviations from the code styling used in a project and can provide actionable recommendations to the developer.

Team members

Tech Stack

Python was used as the main implementation language along with the libraries keras and sklearn.

Approach

We have divided our approach into six stages:

1. Manual Work

In this stage, we did some manual work to discover regular expressions that would label a Java file as containing a formatting error or not. In particular, we identified 22 regular expressions:

import re

# Patterns that flag common formatting deviations in Java source code
r1 = re.compile(r"[a-zA-Z0-9]+\([a-zA-Z0-9\s,]+[\r\n],")
r2 = re.compile(r"[a-zA-Z0-9]+\( ")
r3 = re.compile(r"package[\r\n]")
r4 = re.compile(r" [\r\n]")
r5 = re.compile(r"public[\r\n]")
r6 = re.compile(r"[a-zA-Z0-9]+\([a-zA-Z0-9 ,:\.\"_\(\)]*\) +\{")
r7 = re.compile(r"[\r\n];")
r8 = re.compile(r"[a-zA-Z0-9]+\([a-zA-Z0-9 ,:\.\"_\(\)]* \)")
r9 = re.compile(r"[a-zA-Z0-9]+\([a-zA-Z0-9 ,]* ,")
r10 = re.compile(r" ;")
r11 = re.compile(r",[^\s]")
r12 = re.compile(r"[\r\n]=")
r13 = re.compile(r"[a-zA-Z0-9]+ \(")
r14 = re.compile(r"[\t]")
r15 = re.compile(r" \.")
r16 = re.compile(r"\. ")
r17 = re.compile(r"@ ")
r18 = re.compile(r"=[^\s=]")
r19 = re.compile(r"[\r\n]\(")
r20 = re.compile(r"if \([a-zA-Z0-9 ,:\.\"_\(\)]* \)")
r21 = re.compile(r"switch \( ")
r22 = re.compile(r"super ")

2. Pristine Java Files

With these 22 regular expressions in hand, we downloaded the Java source files dataset from the paper "Syntax and Sensibility: Using language models to detect and correct syntax errors" by Santos et al. We applied the regular expressions to the dataset until we identified 10,000 "pristine" Java source code files that did not contain a formatting error.
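As a rough illustration, this filtering step could look like the sketch below; the dataset directory, the helper names and the traversal logic are our own assumptions, not the original implementation.

import os

# All 22 compiled patterns from the previous stage (assumed to be in scope)
REGEXPS = [r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11,
           r12, r13, r14, r15, r16, r17, r18, r19, r20, r21, r22]

def is_pristine(source):
    # A file is considered "pristine" if none of the formatting-error patterns match
    return not any(regex.search(source) for regex in REGEXPS)

def collect_pristine(dataset_dir, limit=10000):
    # Walk the (hypothetical) dataset folder and keep the first `limit` clean files
    pristine = []
    for root, _, files in os.walk(dataset_dir):
        for name in files:
            if not name.endswith(".java"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                if is_pristine(f.read()):
                    pristine.append(path)
            if len(pristine) >= limit:
                return pristine
    return pristine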

3. Tokenizer

The next stage was the creation of a tokenizer that would transform groups of characters of raw source code into tokens for further processing. For example, whenever the words true, false or null were identified, they were treated as the token LITERAL.

LITERAL = ["true", "false", "null"]
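The sketch below gives a rough idea of such a tokenizer; the token classes beyond LITERAL and the splitting pattern are simplified assumptions for illustration.

import re

LITERAL = ["true", "false", "null"]

# Split the source into words, single punctuation characters and single whitespace characters
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]|\s")

def tokenize(source):
    tokens = []
    for match in TOKEN_PATTERN.finditer(source):
        lexeme = match.group()
        if lexeme in LITERAL:
            tokens.append("LITERAL")
        elif lexeme == " ":
            tokens.append("SPACE")
        elif lexeme == "\t":
            tokens.append("TAB")
        elif lexeme in ("\r", "\n"):
            tokens.append("NEWLINE")
        elif lexeme.isidentifier():
            tokens.append("WORD")      # keywords and identifiers (simplified)
        else:
            tokens.append(lexeme)      # punctuation is kept as-is
    return tokens

# tokenize("return null ;") -> ['WORD', 'SPACE', 'LITERAL', 'SPACE', ';']

Keeping whitespace as explicit tokens is what allows the downstream models to reason about formatting.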

4. Training a generative model

Next, we trained a recurrent neural network based on LSTM nodes as a generative model that predicts the next token based on the previous ones. For training, we used the 10K "pristine" Java source files after tokenization, so the network learns a model of what "pristine" Java files look like. We used two layers of 400 LSTM nodes each. Testing it on the CodRep dataset, we obtained a Mean Reciprocal Rank (MRR) in the range of 0.6-0.7.
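A minimal Keras sketch of such a network is given below; the two layers of 400 LSTM nodes follow the description above, while the vocabulary size, context length, embedding size and training settings are assumptions.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 200      # number of distinct token types (assumed)
SEQ_LENGTH = 20       # number of preceding tokens used as context (assumed)

model = Sequential([
    Embedding(VOCAB_SIZE, 64, input_length=SEQ_LENGTH),
    LSTM(400, return_sequences=True),
    LSTM(400),
    Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# X: (n_samples, SEQ_LENGTH) windows of token ids, y: (n_samples,) next-token ids
# model.fit(X, y, batch_size=128, epochs=10)

def token_probability(context_ids, next_id):
    # Probability the model assigns to the token that actually follows the context;
    # a low value suggests the token may be wrongly positioned
    probs = model.predict(np.array([context_ids]), verbose=0)[0]
    return probs[next_id]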

5. Training an outlier detection model

An additional approach was to use n-grams, in particular 7-grams and 10-grams, and pose the problem as an outlier detection problem, where we trained a one-class SVM to predict whether a new 7-gram or 10-gram belongs to the "no formatting error" class or not. For this approach, we made use only of the n-grams belonging to the "no formatting error" class. A group of one-class SVMs with various ν and γ parameters was trained, each returning its own prediction for every new sample, and the final decision was made by majority voting. The one-class SVM models with the prediction probabilities reached an MRR of around 0.7.
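The sketch below shows how such an ensemble could be set up with scikit-learn; the specific ν and γ grids, the n-gram encoding and the voting details are assumptions on our part.

import numpy as np
from sklearn.svm import OneClassSVM

def to_ngrams(token_ids, n=7):
    # Slide a window of n consecutive token ids over a tokenized file
    return np.array([token_ids[i:i + n] for i in range(len(token_ids) - n + 1)])

def train_ensemble(X_pristine, nus=(0.01, 0.05, 0.1), gammas=(0.01, 0.1, 1.0)):
    # X_pristine: n-grams extracted only from the "no formatting error" files
    models = []
    for nu in nus:
        for gamma in gammas:
            svm = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)
            svm.fit(X_pristine)
            models.append(svm)
    return models

def vote(models, X_new):
    # Each model votes +1 ("looks pristine") or -1 ("outlier"); the majority wins.
    # A probability-like score could instead be derived from decision_function.
    votes = np.array([m.predict(X_new) for m in models])
    return np.sign(votes.sum(axis=0))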

6. The final pipeline

Our final pipeline consisted of the following steps:

Tokenize > LSTM | SVM > Aggregation

The Aggregation stage averages the probabilities of "wrongful presence" provided by the two models for each token and sorts the tokens in increasing order of that probability. The first character of a token is assigned the token's probability, while the rest of its characters get 0. With the aggregation stage in place, the complete mechanism improved the performance of the pipeline to an MRR of 0.85.
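A rough sketch of this aggregation, under our own assumptions about the shape of the two models' outputs, could look as follows.

import numpy as np

def aggregate(lstm_probs, svm_probs, token_offsets, file_length):
    # lstm_probs, svm_probs: per-token probabilities of "wrongful presence"
    # token_offsets: character offset of the first character of each token
    token_scores = (np.asarray(lstm_probs) + np.asarray(svm_probs)) / 2.0
    # The first character of each token receives the averaged probability,
    # all remaining characters stay at 0
    char_scores = np.zeros(file_length)
    for offset, score in zip(token_offsets, token_scores):
        char_scores[offset] = score
    # Character positions ordered by the averaged probability
    return np.argsort(char_scores)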
