Source code readability and comprehensibility have gained increased interest in recent years, due to the wide adoption of component-based software development and the (re)use of software residing in code hosting platforms. Among the various approaches proposed, consistent code styling and formatting across a project have been shown to significantly improve both readability and the ability of developers to understand the context, the functionality and the purpose of a block of code. Most code formatting approaches rely on a set of rules defined by experts, which aspire to model a commonly accepted formatting style. Such an approach depends on the experts' knowledge of best practices, is time consuming and does not take into account the way a team actually develops software. As a result, it often becomes too intrusive and, in many cases, is not adopted. In this work, we present an automated mechanism that, given a set of source code files (for example, the repositories a team has developed), can be trained to recognize the formatting style used across a project and to identify deviations from it, in a completely unsupervised manner. First, source code is transformed into small meaningful pieces, called tokens. The tokens residing in the source code are used to train the Long Short-Term Memory and Support Vector Machine models of our mechanism, which predict the probability of a token being wrongly positioned. Preliminary evaluation along two different axes indicates that our approach can effectively detect deviations from the code styling used in a project and can provide actionable recommendations to the developer.
Python was used as the main implementation language, along with the libraries keras and sklearn.
We have divided our approach into six stages:
In this stage we did some manual work to discover regular expressions that label a Java file as containing a formatting error or not. In particular, we identified 22 regular expressions:
import re

r1 = re.compile(r"[a-zA-Z0-9]+\([a-zA-Z0-9\s,]+[\r\n],")
r2 = re.compile(r"[a-zA-Z0-9]+\( ")
r3 = re.compile(r"package[\r\n]")
r4 = re.compile(r" [\r\n]")
r5 = re.compile(r"public[\r\n]")
r6 = re.compile(r"[a-zA-Z0-9]+\([a-zA-Z0-9 ,:\.\"_\(\)]*\) +\{")
r7 = re.compile(r"[\r\n];")
r8 = re.compile(r"[a-zA-Z0-9]+\([a-zA-Z0-9 ,:\.\"_\(\)]* \)")
r9 = re.compile(r"[a-zA-Z0-9]+\([a-zA-Z0-9 ,]* ,")
r10 = re.compile(r" ;")
r11 = re.compile(r",[^\s]")
r12 = re.compile(r"[\r\n]=")
r13 = re.compile(r"[a-zA-Z0-9]+ \(")
r14 = re.compile(r"[\t]")
r15 = re.compile(r" \.")
r16 = re.compile(r"\. ")
r17 = re.compile(r"@ ")
r18 = re.compile(r"=[^\s=]")
r19 = re.compile(r"[\r\n]\(")
r20 = re.compile(r"if \([a-zA-Z0-9 ,:\.\"_\(\)]* \)")
r21 = re.compile(r"switch \( ")
r22 = re.compile(r"super ")
With these 22 regular expressions in hand, we downloaded the Java source files dataset from the paper "Syntax and Sensibility: Using language models to detect and correct syntax errors" by Santos et al. We applied the regular expressions to the dataset until we identified 10000 "pristine" Java source code files that did not contain a formatting error.
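The labeling step above can be sketched as follows; the pattern subset and the sample snippets are illustrative, and a file counts as "pristine" only if none of the 22 patterns match:

```python
import re

# A small subset of the 22 patterns, shown for illustration.
PATTERNS = [
    re.compile(r"[a-zA-Z0-9]+\( "),  # r2: space right after an opening parenthesis
    re.compile(r" ;"),               # r10: space before a semicolon
    re.compile(r",[^\s]"),           # r11: comma not followed by whitespace
]

def is_pristine(source: str) -> bool:
    """A file is 'pristine' if none of the formatting-error patterns match."""
    return not any(p.search(source) for p in PATTERNS)

print(is_pristine("foo(a, b);"))   # no pattern matches
print(is_pristine("foo(a ,b) ;"))  # violates r10 and r11
```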
The next stage was the creation of a tokenizer that transforms groups of characters of raw source code into tokens for further processing. For example, whenever the words true, false or null were identified, they were treated as the token LITERAL.
LITERAL = ["true","false","null"]
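A minimal tokenizer along these lines might look as follows; the token classes beyond LITERAL (and the handling of whitespace, which the full tokenizer must preserve to detect formatting issues) are simplified assumptions here:

```python
import re

LITERAL = ["true", "false", "null"]

def tokenize(code: str) -> list:
    """Map raw characters to coarse token classes (simplified sketch)."""
    tokens = []
    for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\S", code):
        if word in LITERAL:
            tokens.append("LITERAL")
        elif word.isidentifier():
            tokens.append("IDENTIFIER")
        else:
            tokens.append(word)  # punctuation kept as-is
    return tokens

print(tokenize("flag = true ;"))
# → ['IDENTIFIER', '=', 'LITERAL', ';']
```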
Next, we trained a recurrent neural network based on LSTM nodes as a generative model that predicts the next token from the previous ones. For training we used the 10K "pristine" Java source files after tokenization, so that the network learns what "pristine" Java files look like. We used two layers of 400 LSTM nodes each. Testing on the CodRep dataset, we obtained a Mean Reciprocal Rank (MRR) in the range of 0.6 to 0.7.
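The network architecture can be sketched in keras as below; the vocabulary size, embedding dimension and context length are illustrative assumptions, while the two stacked layers of 400 LSTM nodes follow the setup described above:

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 100  # assumed number of distinct token classes
SEQ_LEN = 20      # assumed context window of previous tokens

# Two stacked layers of 400 LSTM nodes; the final softmax gives a
# probability distribution over the next token.
model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64),
    layers.LSTM(400, return_sequences=True),
    layers.LSTM(400),
    layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

A token whose observed continuation is assigned low probability by this model is a candidate formatting deviation.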
An additional approach was to use n-grams, in particular 7-grams and 10-grams, and pose the problem as an outlier detection problem, in which we trained a 1-class SVM to predict whether a new 7-gram or 10-gram belongs to the "no formatting error" class. For this approach, we made use of the n-grams belonging to the "no formatting error" class. A group of 1-class SVMs with various ν and γ parameters was trained, each returning its own prediction probability on each new sample, and the final decision was made by majority voting. The 1-class SVM models with the prediction probabilities reached an MRR in the range of 0.7.
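A sketch of such an ensemble with sklearn follows; the training data is a random stand-in for vectorized 7-grams, the ν/γ grid is illustrative, and the vote here uses the hard inlier/outlier predictions rather than calibrated probabilities:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Toy stand-in for vectorized 7-grams of "pristine" code.
X_train = rng.normal(0.0, 1.0, size=(200, 7))

# One model per (nu, gamma) combination; the grid is an assumption.
models = [
    OneClassSVM(nu=nu, gamma=gamma).fit(X_train)
    for nu in (0.01, 0.05, 0.1)
    for gamma in ("scale", 0.1)
]

def is_inlier(x) -> bool:
    """Majority vote: does this n-gram belong to the 'no formatting error' class?"""
    votes = [m.predict(x.reshape(1, -1))[0] == 1 for m in models]
    return sum(votes) > len(models) / 2

print(is_inlier(X_train[0]))
print(is_inlier(np.full(7, 10.0)))  # a point far from the training data
```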
Our final pipeline consisted of the following steps:
Tokenize > LSTM | SVM > Aggregation
The Aggregation stage averages the probabilities of "wrongful presence" provided by the two models for each token and sorts the tokens in increasing order of that probability. The first character of a token is assigned the token's probability, while the rest of its characters get 0. With the aggregation stage in place, the complete mechanism improved the performance of the pipeline to an MRR of 0.85.
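The aggregation step can be sketched as follows; the token sequence and per-model probabilities are made-up inputs for illustration:

```python
import numpy as np

def aggregate(lstm_probs, svm_probs, tokens):
    """Average the two models' per-token probabilities of wrongful presence
    and rank tokens in increasing order of that probability."""
    avg = (np.asarray(lstm_probs) + np.asarray(svm_probs)) / 2.0
    order = np.argsort(avg)
    return [(tokens[i], float(avg[i])) for i in order]

def to_char_probs(tokens, probs):
    """Assign each token's probability to its first character, 0 to the rest."""
    chars = []
    for tok, p in zip(tokens, probs):
        chars.append(p)
        chars.extend([0.0] * (len(tok) - 1))
    return chars

tokens = ["if", "(", "x", ")", "{"]
lstm_p = [0.1, 0.2, 0.05, 0.9, 0.1]
svm_p  = [0.2, 0.4, 0.05, 0.7, 0.2]
ranked = aggregate(lstm_p, svm_p, tokens)
print(ranked[-1][0])  # → ')' : the most suspicious token
```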