Skip to content

Instantly share code, notes, and snippets.

@kyrcha

kyrcha/CodRep-2019.md

Last active Nov 27, 2019
Embed
What would you like to do?
AuthEceSoftEng approach to CodRep-2019 competition

AuthEceSoftEng <> CodRep-2019

The Problem

The goal of the CodRep 2019 competition was to predict formatting errors in source code.

We were given 8000 Java files with one formatting error in each one and another file (out.txt) that contained the character position the formatting error existed. For example, in the snippet below, the error (unnecessary space) is found in position 30:

public class test{
  int a = 1 ;
}

Team members

Tech Stack

Python was used as the main implementation language along with the libraries keras and sklearn.

Approach

We have divided our approach into seven stages:

1. Manual Work

In this stage we did some manual work in order to discover regular expressions that will label Java files as containing an error or not. In particular we identified 22 regexps that identified the problems in 7735 out of the 8000 files. In Python these were:

r1 = re.compile("[a-zA-Z0-9]+\([a-zA-Z0-9\s,]+[\r\n],")
r2 = re.compile("[a-zA-Z0-9]+\( ")
r3 = re.compile("package[\r\n]")
r4 = re.compile(" [\r\n]")
r5 = re.compile("public[\r\n]")
r6 = re.compile("[a-zA-Z0-9]+\([a-zA-Z0-9 ,:\.\"_\(\)]*\) +\{")
r7 = re.compile("[\r\n];")
r8 = re.compile("[a-zA-Z0-9]+\([a-zA-Z0-9 ,:\.\"_\(\)]* \)")
r9 = re.compile("[a-zA-Z0-9]+\([a-zA-Z0-9 ,]* ,")
r10 = re.compile(" ;")
r11 = re.compile(",[^\s]")
r12 = re.compile("[\r\n]=")
r13 = re.compile("[a-zA-Z0-9]+ \(")
r14 = re.compile("[\t]")
r15 = re.compile(" \.")
r16 = re.compile("\. ")
r17 = re.compile("@ ")
r18 = re.compile("=[^\s=]")
r19 = re.compile("[\r\n]\(")
r20 = re.compile("if \([a-zA-Z0-9 ,:\.\"_\(\)]* \)")
r21 = re.compile("switch \( ")
r22 = re.compile("super ")

2. Pristine Java Files

With these 22 regexps expression in hand, we downloaded the Java source files dataset from the paper "Syntax and Sensibility: Using language models to detect and correct syntax errors" by Santos et al., . We applied the regular expressions in the dataset until we identified 10000 "pristine" Java source code files that did not contain a formatting error.

3. Tokenizer

The next stage was the creation of a tokenizer that would transform groups of characters of raw source code, into tokens for further processing. For example whenever the words true, false or null were identified, they were treated as the token LITERAL.

LITERAL = ["true","false","null"]

4. Training a generative model

Next, we trained a recurrent neural network based on LSTM nodes as a generative model for finding the next token based on the previous ones. For training we used the 10K "pristine" Java source files after being tokenized and the network learns a model of how "pristine" Java files look like. We used two layers of 400 LSTM nodes each. Testing it on the CodRep dataset we got a Mean Reciprocal Rank (MRR) in the areas of 0.6-0.7.

This was the model used in the intermediate ranking that got a MRR of 0.6903873787200523.

5. Training a discriminative model

An additional approach was to use n-grams and in particular 7-grams and pose the problems as a binary classification problem. From the CodRep dataset, we extracted 7-grams of tokens in which 50% of them contained a formatting error and 50% of them did not. Because in the initial 7-grams extraction the dataset was imbalanced with 98.24% of samples non containing a formatting error (0 class), we downsampled the 0 class and upsampled the 1 class for a totla of 548580 samples. We appled xgboost and improved MRR in the areas of 0.7-0.8. Experiments were also performed with 5-grams and 9-grams.

6. Training an outlier detection model

We also posed the problem as an outlier detection problem where we trained a 1-class SVM to predict whether a new 7-gram belongs or not to the class (no formatting error class). For this approach to be applied, we made use of the previously generated 7-grams, keeping only the ones belonging in the "no formatting error" class. A group of 1-class SVMs with various ν and γ parameters was trained and returned its own prediction probability on each new sample and the final decision was made upon majority voting. By adding the 1-class SVM models with the prediction probabilities we reached an MRR in the areas of 0.85-0.89.

7. The final pipeline

Our final pipeline consisted of the following steps.

Tokenize > LSTM | XGBOOST | SVM > Aggregation > RegExp post-processing

The Aggregation stage averages the probabilities of "wronful presence" provided by the three models for each token and sorts them in increasing order of that probability. The first character of a token is assigned the probability, while the erst of the characters get 0.

In the RegExp post-processing step, we run the 22 regexps in each file and if there was a character matched we promoted the character to the first position.

This was the model used in the final ranking that got a MRR of 0.9839710325062592.

Conclusions

Some conclusions are:

  • Initially we believed that with the generative model and a lot of pristine Java files we would get very good results (we even tried 100K "pristine" Java files). Perhaps a more exotic deep neural network architecture is needed to achieve this.
  • The ensemble technique with different types of models greatly helped to improve the performance of the pipeline, while on its own no algorithm was able to achieve a performance above 0.7 MRR.
@tdurieux

This comment has been minimized.

Copy link

@tdurieux tdurieux commented Nov 25, 2019

I would be really interested to see your 22 regex. Would you agree to share them?

@kyrcha

This comment has been minimized.

Copy link
Owner Author

@kyrcha kyrcha commented Nov 25, 2019

Sure @tdurieux! I updated the gist to include them in stage 1.

@tdurieux

This comment has been minimized.

Copy link

@tdurieux tdurieux commented Nov 25, 2019

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment