@LovelyBuggies
Created May 1, 2022 01:03
Tensor Representation of Code for Vulnerability Detection
{
"cells": [
{
"cell_type": "markdown",
"id": "73678dea-edb3-4ad2-a615-ea84f344542e",
"metadata": {},
"source": [
"# Code Vulnerability Detection on IBM/D2A "
]
},
{
"cell_type": "markdown",
"id": "c33dd29e-7bb1-4c39-8437-0fb3894cd736",
"metadata": {},
"source": [
"## 1. Introduction\n",
"\n",
"The current dataset contains samples generated from six open-source projects: OpenSSL, FFmpeg, HTTPD, NGINX, Libtiff, and Libav. For each project there are three `pickle.gz` files, such as `nginx_after_fix_extractor_0.pickle.gz`, `nginx_labeler_1.pickle.gz`, and `nginx_labeler_0.pickle.gz`, which are generated by two slightly different extractors. Each `pickle.gz` file contains compressed samples in JSON.\n",
"\n",
"The [fields and their descriptions](https://dax-cdn.cdn.appdomain.cloud/dax-d2a/1.0.0/data-preview/index.html?_ga=2.207765842.1169579792.1649770384-1526562817.1648132481) of the data are:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "cd232a8f-a906-4729-a862-0e8af1864a00",
"metadata": {},
"outputs": [],
"source": [
"# import split_data as sd\n",
"# data = sd.read_pickle_data_file(\"./dataset/httpd_labeler_0.pickle.gz\")"
]
},
{
"cell_type": "markdown",
"id": "a8628673-e813-46a4-91ef-67f9a65e6261",
"metadata": {},
"source": [
"## 2. Data Preparation and Preprocessing\n",
"\n",
"The leaderboard dataset contains four directories, one for each of the four leaderboard tasks. Each directory contains three CSV files corresponding to the train (80%), dev (10%), and test (10%) splits. The columns of the split files are identical, except that the test split does not contain the label column. In this project, we predict **whether a code snippet contains bugs or not** using the `Function` dataset.\n",
"\n",
"First, we load the code of the train, dev, and test splits."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "703bab51-b6d5-4d22-82f5-1213a048124b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded: 4643 samples in train; 596 samples in dev; 618 samples in test.\n"
]
}
],
"source": [
"import csv\n",
"\n",
"snippets_train = list()\n",
"snippets_dev = list()\n",
"snippets_test = list()\n",
"\n",
"# Train and dev rows are stored as (code, label): row[2] is the code, row[1] the label.\n",
"with open('./dataset/d2a_lbv1_function_train.csv') as csvfile:\n",
"    reader = csv.reader(csvfile)\n",
"    for i, row in enumerate(reader):\n",
"        if i == 0:  # skip the header row\n",
"            continue\n",
"        snippets_train.append((row[2], row[1]))\n",
"\n",
"with open('./dataset/d2a_lbv1_function_dev.csv') as csvfile:\n",
"    reader = csv.reader(csvfile)\n",
"    for i, row in enumerate(reader):\n",
"        if i == 0:  # skip the header row\n",
"            continue\n",
"        snippets_dev.append((row[2], row[1]))\n",
"\n",
"# The test split has no label column, so the code is in row[1].\n",
"with open('./dataset/d2a_lbv1_function_test.csv') as csvfile:\n",
"    reader = csv.reader(csvfile)\n",
"    for i, row in enumerate(reader):\n",
"        if i == 0:  # skip the header row\n",
"            continue\n",
"        snippets_test.append(row[1])\n",
"\n",
"print(f\"Loaded: {len(snippets_train)} samples in train; {len(snippets_dev)} samples in dev; {len(snippets_test)} samples in test.\")"
]
},
{
"cell_type": "markdown",
"id": "bacf61a4-2f32-4fe9-a259-112f3c5166d6",
"metadata": {},
"source": [
"## 3. Syntax Tree and Code Tensorization\n",
"\n",
"Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited. To use the C++ parser from Python:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4e40daa1-1978-4b64-85d6-4d2cad265da4",
"metadata": {},
"outputs": [],
"source": [
"from tree_sitter import Language, Parser\n",
"\n",
"# Compile the grammar checked out under vendor/ into a shared library.\n",
"# Note: Language.build_library belongs to older py-tree-sitter releases.\n",
"Language.build_library(\n",
"    'build/my-languages.so',\n",
"    [\n",
"        'vendor/tree-sitter-cpp'\n",
"    ]\n",
")\n",
"CPP_LANGUAGE = Language('build/my-languages.so', 'cpp')\n",
"parser = Parser()\n",
"parser.set_language(CPP_LANGUAGE)"
]
},
{
"cell_type": "markdown",
"id": "e5465da1-58f0-41b1-9377-6974792516f6",
"metadata": {},
"source": [
"Try pasting the second snippet of the training set into the [playground](https://tree-sitter.github.io/tree-sitter/playground)!"
]
},
{
"cell_type": "markdown",
"id": "26b3cd71-62a5-4722-bb4e-104a7eb8c6b9",
"metadata": {},
"source": [
"<details>\n",
"<summary>The second snippet:</summary>\n",
"\n",
"```cpp\n",
"static ngx_int_t\n",
"ngx_http_file_cache_lock(ngx_http_request_t *r, ngx_http_cache_t *c)\n",
"{\n",
"    ngx_msec_t now, timer;\n",
"    ngx_http_file_cache_t *cache;\n",
"\n",
"    if (!c->lock) {\n",
"        return NGX_DECLINED;\n",
"    }\n",
"\n",
"    now = ngx_current_msec;\n",
"\n",
"    cache = c->file_cache;\n",
"\n",
"    ngx_shmtx_lock(&cache->shpool->mutex);\n",
"\n",
"    timer = c->node->lock_time - now;\n",
"\n",
"    if (!c->node->updating || (ngx_msec_int_t) timer <= 0) {\n",
"        c->node->updating = 1;\n",
"        c->node->lock_time = now + c->lock_age;\n",
"        c->updating = 1;\n",
"        c->lock_time = c->node->lock_time;\n",
"    }\n",
"\n",
"    ngx_shmtx_unlock(&cache->shpool->mutex);\n",
"\n",
"    ngx_log_debug2(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,\n",
"                   \"http file cache lock u:%d wt:%M\",\n",
"                   c->updating, c->wait_time);\n",
"\n",
"    if (c->updating) {\n",
"        return NGX_DECLINED;\n",
"    }\n",
"\n",
"    if (c->lock_timeout == 0) {\n",
"        return NGX_HTTP_CACHE_SCARCE;\n",
"    }\n",
"\n",
"    c->waiting = 1;\n",
"\n",
"    if (c->wait_time == 0) {\n",
"        c->wait_time = now + c->lock_timeout;\n",
"\n",
"        c->wait_event.handler = ngx_http_file_cache_lock_wait_handler;\n",
"        c->wait_event.data = r;\n",
"        c->wait_event.log = r->connection->log;\n",
"    }\n",
"\n",
"    timer = c->wait_time - now;\n",
"\n",
"    ngx_add_timer(&c->wait_event, (timer > 500) ? 500 : timer);\n",
"\n",
"    r->main->blocked++;\n",
"\n",
"    return NGX_AGAIN;\n",
"}\n",
"```\n",
"\n",
"</details>"
]
},
{
"cell_type": "markdown",
"id": "f9c5f568-1fb2-4c22-9ceb-a14e10ed0015",
"metadata": {},
"source": [
"To tensorize the data, we need to obtain tokens from the leaf nodes of the syntax tree. A depth-first search (DFS) that visits children from left to right is the natural choice here, since it yields the leaves in the same order in which the tokens appear in the code."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e3cd7b63-a2da-413e-b73f-9efb2dda316e",
"metadata": {},
"outputs": [],
"source": [
"def code_tokenize(code: str):\n",
"    # tree-sitter works on bytes, so encode the source once and slice it below.\n",
"    source = code.encode()\n",
"    tree = parser.parse(source)\n",
"    tokens = list()\n",
"\n",
"    def dfs(node):\n",
"        # A node without children is a leaf, i.e. a single token.\n",
"        if not node.children:\n",
"            tokens.append(source[node.start_byte:node.end_byte].decode())\n",
"            return\n",
"\n",
"        # Visit children left to right to preserve the source order.\n",
"        for child in node.children:\n",
"            dfs(child)\n",
"\n",
"    dfs(tree.root_node)\n",
"    return tokens"
]
},
{
"cell_type": "markdown",
"id": "ccc813b0-e29a-460a-ac23-f128e05e4737",
"metadata": {},
"source": [
"After obtaining the tokens, we store them in a corpus and train a word2vec model to obtain a vector for each token."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "dbb47a96-2f56-496c-91b1-b3b167698e5f",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from gensim.models import Word2Vec\n",
"\n",
"corpus_train = [code_tokenize(code) for code, _ in snippets_train]\n",
"corpus_dev = [code_tokenize(code) for code, _ in snippets_dev]\n",
"\n",
"# Reuse a previously trained embedding model if one exists.\n",
"if os.path.exists(\"./model/word2vec.model\"):\n",
"    model = Word2Vec.load(\"./model/word2vec.model\")\n",
"\n",
"else:\n",
"    # min_count=1 keeps every token, so each train/dev token gets a 64-dimensional vector.\n",
"    model = Word2Vec(corpus_train + corpus_dev, min_count=1, vector_size=64, window=4)\n",
"    os.makedirs(\"./model\", exist_ok=True)\n",
"    model.save(\"./model/word2vec.model\")"
]
},
{
"cell_type": "markdown",
"id": "a44c78fd-eac0-4d3b-a4e6-58c85fca7f74",
"metadata": {},
"source": [
"The training tensors can now be generated. Each snippet becomes a matrix with one row per token and one column per embedding dimension. Because the length of the code snippets varies a lot, we use `tf.image.resize` (bilinear interpolation by default) to rescale these variable-sized matrices to tensors of the same size, `(64, 64)`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cd40a26d-3d22-41c0-b842-1160a9d7f565",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-04-30 20:41:10.693662: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n",
"To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
]
}
],
"source": [
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"# Labels are read from the CSV as strings, so cast them to float32.\n",
"y_train = tf.convert_to_tensor(np.asarray([s[1] for s in snippets_train]).astype('float32'))\n",
"y_dev = tf.convert_to_tensor(np.asarray([s[1] for s in snippets_dev]).astype('float32'))\n",
"\n",
"def tensorize(tokens):\n",
"    # Stack the token vectors into a (num_tokens, 64) matrix.\n",
"    code_matrix = np.asarray([model.wv[token] for token in tokens])\n",
"    # Add a channel axis so the matrix can be treated as a one-channel image.\n",
"    code_image = code_matrix[:, :, np.newaxis]\n",
"    # Rescale to a fixed (64, 64) size.\n",
"    return tf.image.resize(code_image, (64, 64))\n",
"\n",
"x_train = tf.stack([tensorize(tokens) for tokens in corpus_train])\n",
"x_dev = tf.stack([tensorize(tokens) for tokens in corpus_dev])"
]
},
{
"cell_type": "markdown",
"id": "e69c19d1-4bf7-4db5-b983-3ebe02e9197d",
"metadata": {},
"source": [
"## 4. Binary Classification\n",
"\n",
"The code tensors can then be used to train a convolutional neural network, just as in a binary image classification task. Here we track loss, accuracy, recall, precision, and F1 as metrics."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3ace61b9-98cd-44ba-ba9f-dadb5be9f70a",
"metadata": {},
"outputs": [],
"source": [
"from keras import backend as K\n",
"\n",
"# Rounding turns the sigmoid outputs into hard 0/1 predictions;\n",
"# K.epsilon() guards against division by zero.\n",
"\n",
"def recall_m(y_true, y_pred):\n",
"    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))\n",
"    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))\n",
"    recall = true_positives / (possible_positives + K.epsilon())\n",
"    return recall\n",
"\n",
"def precision_m(y_true, y_pred):\n",
"    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))\n",
"    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))\n",
"    precision = true_positives / (predicted_positives + K.epsilon())\n",
"    return precision\n",
"\n",
"def f1_m(y_true, y_pred):\n",
"    precision = precision_m(y_true, y_pred)\n",
"    recall = recall_m(y_true, y_pred)\n",
"    return 2*((precision*recall)/(precision+recall+K.epsilon()))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d148b990-96ba-4d1e-b014-fd682138d7da",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-04-30 20:47:03.108646: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:tensorflow:Assets written to: ./model/cls-model/assets\n",
"146/146 [==============================] - 2s 13ms/step - loss: 0.2433 - accuracy: 0.9164 - f1_m: 0.9142 - precision_m: 0.9836 - recall_m: 0.8584\n",
"19/19 [==============================] - 0s 13ms/step - loss: 3.3293 - accuracy: 0.5470 - f1_m: 0.4966 - precision_m: 0.5963 - recall_m: 0.4497\n"
]
},
{
"data": {
"text/plain": [
"[3.3292691707611084,\n",
" 0.5469798445701599,\n",
" 0.49656686186790466,\n",
" 0.5962578654289246,\n",
" 0.44966161251068115]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reuse a previously trained classifier if one exists; the custom metrics\n",
"# are passed so that Keras can resolve them when loading.\n",
"if os.path.exists(\"./model/cls-model\"):\n",
"    cls_model = tf.keras.models.load_model(\n",
"        \"./model/cls-model\",\n",
"        custom_objects={'f1_m': f1_m, 'precision_m': precision_m, 'recall_m': recall_m}\n",
"    )\n",
"\n",
"else:\n",
"    cls_model = tf.keras.models.Sequential([\n",
"        tf.keras.layers.Conv2D(16, (1,1), activation='relu', input_shape=(64, 64, 1)),\n",
"        tf.keras.layers.MaxPooling2D(2, 2),\n",
"        tf.keras.layers.Conv2D(32, (1,1), activation='relu'),\n",
"        tf.keras.layers.MaxPooling2D(2,2),\n",
"        tf.keras.layers.Conv2D(64, (1,1), activation='relu'),\n",
"        tf.keras.layers.MaxPooling2D(2,2),\n",
"        tf.keras.layers.Conv2D(64, (1,1), activation='relu'),\n",
"        tf.keras.layers.MaxPooling2D(2,2),\n",
"        tf.keras.layers.Conv2D(64, (1,1), activation='relu'),\n",
"        tf.keras.layers.MaxPooling2D(2,2),\n",
"        tf.keras.layers.Flatten(),\n",
"        tf.keras.layers.Dense(128, activation='relu'),\n",
"        tf.keras.layers.Dense(1, activation='sigmoid')\n",
"    ])\n",
"    cls_model.compile(\n",
"        loss='binary_crossentropy',\n",
"        optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),\n",
"        metrics=['accuracy', f1_m, precision_m, recall_m]\n",
"    )\n",
"    # `history` is only defined when the model is trained in this session.\n",
"    history = cls_model.fit(\n",
"        x_train,\n",
"        y_train,\n",
"        epochs=100,\n",
"        verbose=0\n",
"    )\n",
"    cls_model.save(\n",
"        './model/cls-model',\n",
"        overwrite=True,\n",
"        include_optimizer=True,\n",
"        save_format=None,\n",
"        signatures=None,\n",
"        options=None,\n",
"        save_traces=True\n",
"    )\n",
"\n",
"# Evaluate on the training split first, then on the dev split.\n",
"cls_model.evaluate(x_train, y_train)\n",
"cls_model.evaluate(x_dev, y_dev)"
]
},
{
"cell_type": "markdown",
"id": "e54d5633-36ad-440a-910c-ddd1372996d0",
"metadata": {},
"source": [
"The curves below show how each metric evolves over the training epochs."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b7ed7bd8-d95d-4d72-88d8-dbdbdae4c3db",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# `history` exists only if the model was trained in this session (not loaded from disk).\n",
"plt.plot(history.history['accuracy'], label='Accuracy')\n",
"plt.plot(history.history['loss'], label='Loss')\n",
"plt.plot(history.history['f1_m'], label='F1')\n",
"plt.plot(history.history['precision_m'], label='Precision')\n",
"plt.plot(history.history['recall_m'], label='Recall')\n",
"plt.xlabel('Epoch')\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6767fc2-6426-47ee-b0ba-421d44fe1f60",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "hist",
"language": "python",
"name": "hist"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
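
Section 3 of the notebook states that a depth-first traversal keeps the tokens in their original order. The sketch below illustrates this on a toy tree, without tree-sitter. The `Node` class, the two helper functions, and the example statement are hypothetical stand-ins introduced for illustration; they are not part of the notebook or of the tree-sitter API.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a syntax-tree node (not the tree-sitter API)."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def leaves_dfs(node: Node) -> List[str]:
    """Collect leaf texts depth-first, visiting children left to right."""
    if not node.children:
        return [node.text]
    tokens = []
    for child in node.children:
        tokens.extend(leaves_dfs(child))
    return tokens


def leaves_bfs(root: Node) -> List[str]:
    """Collect leaf texts level by level, for comparison only."""
    tokens, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node.children:
            queue.extend(node.children)
        else:
            tokens.append(node.text)
    return tokens


# Toy tree for the statement `x = f ( y ) ;`
call = Node(children=[Node("f"), Node("("), Node("y"), Node(")")])
assign = Node(children=[Node("x"), Node("="), call])
stmt = Node(children=[assign, Node(";")])

dfs_tokens = leaves_dfs(stmt)
bfs_tokens = leaves_bfs(stmt)
print(dfs_tokens)
print(bfs_tokens)
```

In this toy case the depth-first result reads like the source statement, while the breadth-first result emits the shallow `;` leaf before the deeper tokens, which is why a breadth-first traversal would scramble the input sequence for word2vec.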
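
The custom metrics in Section 4 of the notebook round the sigmoid outputs and then count true, predicted, and actual positives. The following is a minimal plain-Python sketch of the same arithmetic on a single batch, assuming binary labels; the function name `precision_recall_f1` and the sample values are made up for illustration, and `EPSILON` mirrors the default value of `K.epsilon()`.

```python
from typing import Sequence, Tuple

EPSILON = 1e-7  # mirrors the default of K.epsilon(); avoids division by zero


def precision_recall_f1(
    y_true: Sequence[int], y_prob: Sequence[float]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one batch of binary labels."""
    # Thresholding at 0.5 plays the role of rounding the sigmoid output.
    y_pred = [1 if p > 0.5 else 0 for p in y_prob]
    true_positives = sum(t * p for t, p in zip(y_true, y_pred))
    predicted_positives = sum(y_pred)
    possible_positives = sum(y_true)
    precision = true_positives / (predicted_positives + EPSILON)
    recall = true_positives / (possible_positives + EPSILON)
    f1 = 2 * precision * recall / (precision + recall + EPSILON)
    return precision, recall, f1


# Made-up example: four buggy snippets, two of them detected, one false alarm.
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.8, 0.3, 0.6, 0.1, 0.2]
precision, recall, f1 = precision_recall_f1(y_true, y_prob)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

One caveat when reading the reported numbers: Keras evaluates function-style metrics such as `f1_m` batch by batch and averages the results, so the value it prints can differ somewhat from an F1 score computed once over the whole split.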