zurk/sourced.ml via engine.ipynb

## sourced.ml via engine.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How old ast2vec tasks map to source{d} engine\n",
    "\n",
    "We have 3 main algorithms in our ~art2vec~ [**sourced.ml**](https://github.com/src-d/ml).\n",
    "One can find a description in the README.\n",
    "Now algorithms don't use our new cool [engine](https://github.com/src-d/engine).\n",
    "We need to understand how we can rearchitect our tool.\n",
    "At first, let's see how we can reproduce them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports and engine setup part"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Be sure you have some siva files in **../siva-files/** folder. You can find it, for example, here: https://github.com/src-d/engine/tree/master/examples/siva-files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from operator import add\n",
    "\n",
    "from sourced.engine import Engine\n",
    "from pyspark.sql import SparkSession\n",
    "from pyspark.sql.functions import *\n",
    "from pyspark.sql.types import IntegerType, ArrayType, StringType, BooleanType\n",
    "\n",
    "from token_parser import TokenParser\n",
    "token_parser = TokenParser()\n",
    "\n",
    "spark = SparkSession.builder.master(\"local[*]\").appName(\"Examples\").getOrCreate()\n",
    "engine = Engine(spark, \"../siva-files/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Helper functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get all files\n",
    "\n",
    "The most important columns for us are **path** and **repository_id**. \n",
    "\n",
    "**commit_hash** we can use to group all information from one repository together"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+--------------------+--------------------+---------+--------------------+--------------------+\n",
      "|         commit_hash|           file_hash|             content|is_binary|                path|       repository_id|\n",
      "+--------------------+--------------------+--------------------+---------+--------------------+--------------------+\n",
      "|55c65f93887341953...|8ee78c57bac69fa1e...|[2F 2A 0A 20 2A 0...|    false|app/src/main/java...|github.com/dotfen...|\n",
      "|55c65f93887341953...|93a0cf75865a4a7e0...|[3C 3F 78 6D 6C 2...|    false|app/src/main/res/...|github.com/dotfen...|\n",
      "|55c65f93887341953...|03eeb8d6a6c13aa38...|                  []|     true|app/src/main/res/...|github.com/dotfen...|\n",
      "|55c65f93887341953...|2ac3f84463e32b625...|[2F 2A 0A 20 2A 0...|    false|app/src/main/java...|github.com/dotfen...|\n",
      "|55c65f93887341953...|796b96d1c40232652...|[2F 62 75 69 6C 6...|    false|      app/.gitignore|github.com/dotfen...|\n",
      "+--------------------+--------------------+--------------------+---------+--------------------+--------------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "def get_all_files():\n",
    "    frc = engine.repositories.filter(\"is_fork = false\").references.head_ref.commits.first_reference_commit\n",
    "    return frc.files.join(frc.select(\"repository_id\", col(\"hash\").alias(\"commit_hash\")), \"commit_hash\")\n",
    "\n",
    "get_all_files().show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get TF-IDF\n",
    "\n",
    "Just helpers. It is easy to understand what they do.\n",
    "\n",
    "One that uses these helpers should place as **document** column what should be counted as document (repositories, files, or something custom).\n",
    "\n",
    "We use \n",
    "$$ \n",
    "\\mathrm{IDF} = \\log \\frac{1 + N}{1 + n}\n",
    "$$ \n",
    "normalization where $N$ is total number of documents and \n",
    "$$\n",
    "\\mathrm{TF} = \\log (n + 1)\n",
    "$$\n",
    "for TF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_tokens(uasts):\n",
    "    return uasts.query_uast('//*[@roleIdentifier and not(@roleIncomplete)]')\\\n",
    "        .extract_tokens()\\\n",
    "        .where(size(col(\"tokens\")) != 0)\n",
    "\n",
    "def get_DF(uasts):\n",
    "    # There is a way to not fall into rdd. \n",
    "    # One can write special UDF function and use select instead.\n",
    "    # I failed to code it first time. Unparsable errors.\n",
    "    return get_tokens(uasts).rdd\\\n",
    "        .flatMap(lambda r: [(t, r.document) for token in r.tokens for t in token_parser(token)])\\\n",
    "        .distinct()\\\n",
    "        .map(lambda r: (r[0], 1))\\\n",
    "        .reduceByKey(add).toDF([\"token\", \"count\"])\n",
    "        \n",
    "def get_IDF(uasts):\n",
    "    return get_DF(uasts)\\\n",
    "        .select(\"token\", log(( 1 + uasts.count()) / (col(\"count\") + 1)).alias('idf'))\n",
    "\n",
    "def get_TF(uasts):\n",
    "    return get_tokens(uasts).rdd\\\n",
    "        .flatMap(lambda r: [(t, r.document) for token in r.tokens for t in token_parser(token)])\\\n",
    "        .toDF([\"token\", \"document\"])\\\n",
    "        .groupby([\"document\", \"token\"])\\\n",
    "        .count()\\\n",
    "        .select(\"document\", \"token\", (1+log(col(\"count\"))).alias(\"TF\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Extract uasts and set repository as document.\n",
    "2. Calculate TF and IDF.\n",
    "3. Combine them to TF-IDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+----------------------------------------+------------------+-----------------+------------------+\n",
      "|  token|                                document|                TF|              IDF|             TFIDF|\n",
      "+-------+----------------------------------------+------------------+-----------------+------------------+\n",
      "|  input|ab0da50d140873e61318305a93c9faf1b5a564a9|3.4849066497880004|4.688285092698598|16.338235895647294|\n",
      "|  input|3bbb74a39a2d7e55bb6070e5a6ae0bb5f27de71e| 5.442651256490317|4.688285092698598|25.516700750560844|\n",
      "| column|3bbb74a39a2d7e55bb6070e5a6ae0bb5f27de71e| 2.386294361119891|5.093750200806762|12.155187381138488|\n",
      "|highest|3bbb74a39a2d7e55bb6070e5a6ae0bb5f27de71e|               1.0|5.093750200806762| 5.093750200806762|\n",
      "| import|3bbb74a39a2d7e55bb6070e5a6ae0bb5f27de71e|               1.0|5.093750200806762| 5.093750200806762|\n",
      "+-------+----------------------------------------+------------------+-----------------+------------------+\n",
      "only showing top 5 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "uasts = get_all_files().extract_uasts().select(\"path\", \"uast\", col(\"commit_hash\").alias(\"document\"))\n",
    "TF = get_TF(uasts)\n",
    "IDF = get_IDF(uasts)\n",
    "TFIDF = TF.join(IDF, \"token\").select(\"token\", \"document\", \"TF\", \"IDF\", (col(\"TF\") * col(\"IDF\")).alias(\"TFIDF\"))\n",
    "\n",
    "TFIDF.show(5, 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get TF-IDF with pyspark.ml\n",
    "\n",
    "Also, you can use standard TF-IDF converters from pyspark.ml.\n",
    "\n",
    "Here is an example how you can do it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+--------------------+--------------------+--------------------+\n",
      "|            document|              tokens|                  TF|               TFIDF|\n",
      "+--------------------+--------------------+--------------------+--------------------+\n",
      "|55c65f93887341953...|[com, antonioleiv...|(784,[3,5,9,15,16...|(784,[3,5,9,15,16...|\n",
      "|95bec4223682ba38a...|[com, antonioleiv...|(784,[3,5,9,15,16...|(784,[3,5,9,15,16...|\n",
      "|3ba1266862a13c471...|[com, antonioleiv...|(784,[3,5,9,15,16...|(784,[3,5,9,15,16...|\n",
      "|ab0da50d140873e61...|[subprocess, time...|(784,[0,1,2,4,5,6...|(784,[0,1,2,4,5,6...|\n",
      "|3bbb74a39a2d7e55b...|[mydict, dec, num...|(784,[0,1,2,4,5,6...|(784,[0,1,2,4,5,6...|\n",
      "|290440b64a73f5c7e...|[fibonacci, int, ...|(784,[16,17,28,62...|(784,[16,17,28,62...|\n",
      "+--------------------+--------------------+--------------------+--------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.feature import CountVectorizer, IDF\n",
    "# Also, there is tokenizer exists, so we can write our own.\n",
    "\n",
    "uasts = get_all_files().extract_uasts().select(\"path\", \"uast\", col(\"commit_hash\").alias(\"document\"))\n",
    "tokens = get_tokens(uasts).rdd\\\n",
    "        .flatMap(lambda r: [(r.document, t) for token in r.tokens for t in token_parser(token)])\\\n",
    "        .toDF((\"document\", \"tokens\"))\\\n",
    "        .groupBy(\"document\").agg(collect_list(col(\"tokens\")).alias(\"tokens\"))\n",
    "        \n",
    "CV = CountVectorizer(inputCol=\"tokens\", outputCol=\"TF\")\n",
    "TF_Model = CV.fit(tokens)\n",
    "TF = TF_Model.transform(tokens)\n",
    "\n",
    "idf = IDF(inputCol=\"TF\", outputCol=\"TFIDF\")\n",
    "idfModel = idf.fit(TF)\n",
    "TFIDF = idfModel.transform(TF)\n",
    "\n",
    "TFIDF = TFIDF.select(\"document\", \"tokens\", \"TF\", \"TFIDF\")\n",
    "TFIDF.show(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In TF and TFIDF columns there are tuple of 3 elements: token count, token index, TF (TFIDF) value "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TF-IDF summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code covers steps 1-6 in [Weighted Bag of Vectors](https://github.com/src-d/ml#weighted-bag-of-vectors) algorithm\n",
    "\n",
    "and 1-5 in [Topic Modeling of Repositories](https://github.com/src-d/ml/blob/master/topic_modeling.md). \n",
    "\n",
    "Other steps remain the same (model save, or model transformations, or run spark unrelated jobs)\n",
    "\n",
    "Outcomes:\n",
    "1. Don't see any sense to use TF-IDF from pyspark.ml\n",
    "2. It is an easy thing to compute with pyspark, thing that there is no need to store it as an intermediate result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Identifier embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import importlib\n",
    "from bblfsh.sdkversion import VERSION\n",
    "\n",
    "Node = importlib.import_module(\n",
    "        \"bblfsh.gopkg.in.bblfsh.sdk.%s.uast.generated_pb2\" % VERSION).Node\n",
    "\n",
    "class Roles:\n",
    "    pass\n",
    "\n",
    "_ROLE = importlib.import_module(\"bblfsh.gopkg.in.bblfsh.sdk.%s.uast.generated_pb2\" % VERSION)._ROLE\n",
    "for desc in _ROLE.values:\n",
    "    setattr(Roles, desc.name, desc.index)\n",
    "del _ROLE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokens to index DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "tokens = get_tokens(uasts).rdd\\\n",
    "    .flatMap(lambda r: [t for token in r.tokens for t in token_parser(token)])\\\n",
    "    .distinct()\n",
    "tokens_number = tokens.count()\n",
    "tokens_list = tokens.take(tokens_number)\n",
    "tokens = tokens.zipWithIndex().toDF((\"token\", \"id\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Traverse uast algorithm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "from bblfsh.sdkversion import VERSION\n",
    "\n",
    "def _flatten_children(root):\n",
    "    ids = []\n",
    "    stack = list(root.children)\n",
    "    for node in stack:\n",
    "        if Roles.IDENTIFIER in node.roles and Roles.QUALIFIED not in node.roles:\n",
    "            ids.append(node)\n",
    "        else:\n",
    "            stack.extend(node.children)\n",
    "    return ids\n",
    "\n",
    "def _traverse_uast(uast):\n",
    "    \"\"\"\n",
    "    Traverses UAST.\n",
    "    \"\"\"\n",
    "    if isinstance(uast, bytearray):\n",
    "        uast = importlib.import_module(\n",
    "            \"bblfsh.gopkg.in.bblfsh.sdk.%s.uast.generated_pb2\" % VERSION).Node.FromString(uast)\n",
    "    \n",
    "    stack = [uast]\n",
    "    new_stack = []\n",
    "    \n",
    "    while stack:\n",
    "        for node in stack:\n",
    "            children = _flatten_children(node)\n",
    "            tokens = []\n",
    "            for ch in children:\n",
    "                tokens.extend(token_parser(ch.token))\n",
    "            if (node.token.strip() is not None and node.token.strip() != \"\" and\n",
    "                    Roles.IDENTIFIER in node.roles and Roles.QUALIFIED not in node.roles):\n",
    "                tokens.extend(token_parser(node.token))\n",
    "            for pair in itertools.permutations(tokens, 2):\n",
    "                yield pair\n",
    "            \n",
    "            new_stack.extend(children)\n",
    "            \n",
    "        stack = new_stack\n",
    "        new_stack = []\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "if you want to test traverse_uast method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "defaultdict(<class 'int'>, {('runtim', 'create'): 1, ('apps', 'improp'): 2, ('runtim', 'improp'): 1, ('load', 'popul'): 2, ('create', 'isinst'): 1, ('create', 'popul'): 1, ('label', 'create'): 2, ('app', 'runtim'): 4, ('improp', 'load'): 2, ('isinst', 'self'): 1, ('runtim', 'app'): 4, ('apps', 'configur'): 2, ('apps', 'app'): 8, ('app', 'config'): 20, ('load', 'label'): 4, ('error', 'instal'): 2, ('config', 'configur'): 4, ('runtim', 'self'): 1, ('entry', 'config'): 16, ('ready', 'configur'): 2, ('instal', 'load'): 4, ('instal', 'ready'): 4, ('load', 'improp'): 2, ('error', 'popul'): 1, ('runtim', 'instal'): 2, ('config', 'ready'): 8, ('improp', 'configur'): 2, ('error', 'app'): 4, ('popul', 'error'): 1, ('isinst', 'entry'): 4, ('apps', 'config'): 8, ('instal', 'improp'): 2, ('ready', 'instal'): 4, ('load', 'instal'): 4, ('error', 'entry'): 4, ('popul', 'apps'): 2, ('app', 'label'): 8, ('self', 'create'): 1, ('app', 'ready'): 8, ('error', 'improp'): 1, ('create', 'label'): 2, ('self', 'isinst'): 1, ('improp', 'label'): 2, ('improp', 'entry'): 4, ('popul', 'label'): 2, ('entry', 'popul'): 4, ('popul', 'config'): 4, ('load', 'app'): 8, ('label', 'isinst'): 2, ('label', 'error'): 2, ('app', 'isinst'): 4, ('popul', 'create'): 1, ('entry', 'apps'): 8, ('popul', 'configur'): 1, ('label', 'app'): 8, ('configur', 'error'): 1, ('instal', 'popul'): 2, ('configur', 'config'): 4, ('improp', 'create'): 1, ('apps', 'label'): 4, ('entry', 'entry'): 12, ('entry', 'load'): 8, ('apps', 'apps'): 2, ('load', 'error'): 2, ('improp', 'app'): 4, ('error', 'isinst'): 1, ('self', 'label'): 2, ('self', 'improp'): 1, ('create', 'error'): 1, ('config', 'app'): 20, ('ready', 'apps'): 4, ('create', 'ready'): 2, ('self', 'instal'): 2, ('runtim', 'configur'): 1, ('ready', 'runtim'): 2, ('apps', 'entry'): 8, ('load', 'create'): 2, ('ready', 'create'): 2, ('runtim', 'error'): 2, ('load', 'runtim'): 2, ('isinst', 'error'): 1, ('label', 'ready'): 4, ('error', 'config'): 4, ('config', 'popul'): 4, ('popul', 'runtim'): 1, ('popul', 'improp'): 1, ('popul', 'ready'): 2, ('apps', 'isinst'): 2, ('isinst', 'load'): 2, ('isinst', 'ready'): 2, ('label', 'self'): 2, ('runtim', 'ready'): 2, ('ready', 'error'): 2, ('apps', 'popul'): 2, ('isinst', 'label'): 2, ('entry', 'configur'): 4, ('error', 'ready'): 2, ('error', 'label'): 2, ('isinst', 'apps'): 2, ('entry', 'create'): 4, ('isinst', 'runtim'): 1, ('entry', 'label'): 8, ('entry', 'instal'): 8, ('configur', 'app'): 4, ('error', 'apps'): 2, ('label', 'apps'): 4, ('create', 'improp'): 1, ('app', 'error'): 4, ('isinst', 'popul'): 1, ('app', 'instal'): 8, ('apps', 'error'): 2, ('load', 'entry'): 8, ('isinst', 'instal'): 2, ('ready', 'app'): 8, ('config', 'error'): 4, ('instal', 'create'): 2, ('error', 'load'): 2, ('label', 'config'): 8, ('configur', 'create'): 1, ('apps', 'ready'): 4, ('self', 'load'): 2, ('config', 'label'): 8, ('config', 'create'): 4, ('configur', 'instal'): 2, ('error', 'configur'): 1, ('app', 'app'): 12, ('instal', 'isinst'): 2, ('self', 'config'): 4, ('self', 'apps'): 2, ('error', 'runtim'): 2, ('isinst', 'configur'): 1, ('apps', 'runtim'): 2, ('popul', 'instal'): 2, ('self', 'runtim'): 1, ('label', 'entry'): 8, ('improp', 'isinst'): 1, ('self', 'ready'): 2, ('instal', 'instal'): 2, ('label', 'configur'): 2, ('improp', 'ready'): 2, ('entry', 'isinst'): 4, ('entry', 'self'): 4, ('ready', 'improp'): 2, ('popul', 'isinst'): 1, ('label', 'improp'): 2, ('load', 'configur'): 2, ('improp', 'error'): 1, ('entry', 'runtim'): 4, ('ready', 'entry'): 8, ('create', 'app'): 4, ('configur', 'isinst'): 1, ('apps', 'instal'): 6, ('entry', 'app'): 16, ('runtim', 'popul'): 1, ('create', 'self'): 1, ('isinst', 'app'): 4, ('instal', 'runtim'): 2, ('instal', 'app'): 8, ('configur', 'entry'): 4, ('isinst', 'improp'): 1, ('ready', 'popul'): 2, ('config', 'entry'): 16, ('improp', 'instal'): 2, ('popul', 'load'): 2, ('configur', 'load'): 2, ('load', 'apps'): 4, ('self', 'error'): 1, ('label', 'load'): 4, ('entry', 'improp'): 4, ('popul', 'entry'): 4, ('apps', 'load'): 4, ('create', 'runtim'): 1, ('config', 'apps'): 8, ('self', 'popul'): 1, ('configur', 'apps'): 2, ('create', 'configur'): 1, ('label', 'popul'): 2, ('self', 'entry'): 4, ('app', 'self'): 4, ('ready', 'ready'): 2, ('ready', 'load'): 4, ('configur', 'runtim'): 1, ('ready', 'self'): 2, ('improp', 'config'): 4, ('runtim', 'config'): 4, ('runtim', 'apps'): 2, ('config', 'improp'): 4, ('self', 'app'): 4, ('load', 'isinst'): 2, ('instal', 'apps'): 6, ('label', 'label'): 2, ('popul', 'self'): 1, ('improp', 'runtim'): 1, ('isinst', 'create'): 1, ('instal', 'configur'): 2, ('app', 'improp'): 4, ('runtim', 'load'): 2, ('app', 'load'): 8, ('app', 'create'): 4, ('create', 'instal'): 2, ('ready', 'config'): 8, ('instal', 'label'): 4, ('error', 'create'): 1, ('entry', 'ready'): 8, ('instal', 'self'): 2, ('app', 'apps'): 8, ('configur', 'self'): 1, ('create', 'config'): 4, ('config', 'self'): 4, ('instal', 'error'): 2, ('config', 'isinst'): 4, ('app', 'entry'): 16, ('config', 'config'): 12, ('runtim', 'label'): 2, ('error', 'self'): 1, ('create', 'load'): 2, ('improp', 'popul'): 1, ('configur', 'ready'): 2, ('apps', 'create'): 2, ('configur', 'label'): 2, ('load', 'config'): 8, ('improp', 'self'): 1, ('runtim', 'isinst'): 1, ('create', 'apps'): 2, ('app', 'configur'): 4, ('ready', 'isinst'): 2, ('entry', 'error'): 4, ('label', 'runtim'): 2, ('apps', 'self'): 2, ('popul', 'app'): 4, ('runtim', 'entry'): 4, ('instal', 'entry'): 8, ('load', 'self'): 2, ('isinst', 'config'): 4, ('load', 'load'): 2, ('load', 'ready'): 4, ('self', 'configur'): 1, ('config', 'runtim'): 4, ('create', 'entry'): 4, ('config', 'load'): 8, ('configur', 'popul'): 1, ('instal', 'config'): 8, ('configur', 'improp'): 2, ('label', 'instal'): 4, ('config', 'instal'): 8, ('improp', 'apps'): 2, ('app', 'popul'): 4, ('ready', 'label'): 4})\n"
     ]
    }
   ],
   "source": [
    "# test _traverse_uast\n",
    "from bblfsh import BblfshClient\n",
    "filepath = \"traverse_test.py\"\n",
    "\n",
    "client = BblfshClient(\"0.0.0.0:9432\")\n",
    "uast = client.parse(filepath).uast\n",
    "    \n",
    "from collections import defaultdict\n",
    "d = defaultdict(int)\n",
    "for x in _traverse_uast(uast):\n",
    "    d[x] += 1\n",
    "print(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------+---+---+\n",
      "|weight| id|id2|\n",
      "+------+---+---+\n",
      "|   220|168|168|\n",
      "|    24|262|168|\n",
      "|     4|216|168|\n",
      "|     1| 18|168|\n",
      "|    25|513|168|\n",
      "|     1|767|168|\n",
      "|     3|430|168|\n",
      "|     7|341|168|\n",
      "|     6|347|168|\n",
      "|     3|304|168|\n",
      "+------+---+---+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "uasts = get_all_files().extract_uasts().select(\"path\", \"uast\", col(\"commit_hash\").alias(\"document\"))\n",
    "tokens_matrix = uasts.where(size(col(\"uast\")) != 0).rdd\\\n",
    "        .flatMap(lambda raw: [((token1, token2), 1) for token1, token2 in _traverse_uast(raw.uast[0])])\\\n",
    "        .reduceByKey(add).map(lambda raw: (raw[0][0], raw[0][1], raw[1])).toDF((\"token\", \"token2\", \"weight\"))\n",
    "\n",
    "# Can be optimized a little if map in _traverse_uast\n",
    "matrix = tokens_matrix\\\n",
    "        .join(tokens, \"token\")\\\n",
    "        .join(tokens.select(col(\"token\").alias(\"token2\"), col(\"id\").alias(\"id2\")), \"token2\")\\\n",
    "        .drop(\"token\", \"token2\")\n",
    "matrix.show(10)        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save asdf model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "matrix_count = matrix.count()\n",
    "matrix_list = matrix.take(matrix_count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "mat_weights = []\n",
    "mat_row = []\n",
    "mat_col = []\n",
    "for row in matrix_list:\n",
    "    mat_weights.append(row.weight)\n",
    "    mat_row.append(row.id)\n",
    "    mat_col.append(row.id2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save model \n",
    "\n",
    "import modelforge\n",
    "from modelforge.meta import generate_meta\n",
    "from modelforge.model import merge_strings, write_model\n",
    "\n",
    "output = \"cooc.asdf\"\n",
    "\n",
    "write_model(generate_meta(\"co-occurrences\", (1, 0, 0)),\n",
    "            {\"tokens\": merge_strings(tokens_list),\n",
    "             \"matrix\": {\n",
    "                \"shape\": (tokens_number, tokens_number),\n",
    "                \"format\": \"coo\",\n",
    "                \"data\": (mat_weights, (mat_row, mat_col))}\n",
    "            },\n",
    "            output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Identifier embeddings Summary\n",
    "\n",
    "This code covers 1-7 steps in [identifier embeddings algorithm](https://github.com/src-d/ml#identifier-embeddings).\n",
    "The rest is about swivel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Outcomes and quections\n",
    "\n",
    "1. I think that we dont need all intermideate models since we have spark. We can collect the final dataset for model run (nBOW, cooccurrence matrix, etc) and model results.\n",
    "2. I think that arcitecture should be fuctional-style. For example utils module to run some frequent engine code (as first several cells) and separate class (or functions set) for each algorithm.\n",
    "3. There are two main stages. First get data, where engine is involved and second is application of ml-method (can be code using spark.ml but usually not).\n",
    "4. Should we keep the same model format or switch to distributed hdfs format for large-weighted datasets? or use both?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

## token_parser.py
import re

import Stemmer


class TokenParser:
    """
    Common utilities for splitting and stemming tokens.
    """
    NAME_BREAKUP_RE = re.compile(r"[^a-zA-Z]+")  #: Regexp to split source code identifiers.
    STEM_THRESHOLD = 6  #: We do not stem splitted parts shorter than or equal to this size.
    MAX_TOKEN_LENGTH = 256  #: We cut identifiers longer than thi value.

    def __init__(self, stem_threshold=STEM_THRESHOLD, max_token_length=MAX_TOKEN_LENGTH):
        self._stemmer = Stemmer.Stemmer("english")
        self._stemmer.maxCacheSize = 0
        self._stem_threshold = stem_threshold
        self._max_token_length = max_token_length

    def __call__(self, token):
        return self.process_token(token)

    def process_token(self, token):
        for word in self.split(token):
            yield self.stem(word)

    def stem(self, word):
        if len(word) <= self._stem_threshold:
            return word
        return self._stemmer.stemWord(word)

    def split(self, token):
        token = token.strip()[:self._max_token_length]
        prev_p = [""]

        def ret(name):
            r = name.lower()
            if len(name) >= 3:
                yield r
                if prev_p[0]:
                    yield prev_p[0] + r
                    prev_p[0] = ""
            else:
                prev_p[0] = r

        for part in self.NAME_BREAKUP_RE.split(token):
            if not part:
                continue
            prev = part[0]
            pos = 0
            for i in range(1, len(part)):
                this = part[i]
                if prev.islower() and this.isupper():
                    yield from ret(part[pos:i])
                    pos = i
                elif prev.isupper() and this.islower():
                    if 0 < i - 1 - pos <= 3:
                        yield from ret(part[pos:i - 1])
                        pos = i - 1
                    elif i - 1 > pos:
                        yield from ret(part[pos:i])
                        pos = i
                prev = this
            last = part[pos:]
            if last:
                yield from ret(last)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_stemmer"]
        return state

    def __setstate__(self, state):
        self.__dict__ = state
        self._stemmer = Stemmer.Stemmer("english")


class NoTokenParser:
    """
    One can use this class if he or she does not want to do any parsing.
    """

    def process_token(self, token):
        return [token]
	import re

	import Stemmer


	class TokenParser:
	"""
	Common utilities for splitting and stemming tokens.
	"""
	NAME_BREAKUP_RE = re.compile(r"[^a-zA-Z]+") #: Regexp to split source code identifiers.
	STEM_THRESHOLD = 6 #: We do not stem splitted parts shorter than or equal to this size.
	MAX_TOKEN_LENGTH = 256 #: We cut identifiers longer than thi value.

	def __init__(self, stem_threshold=STEM_THRESHOLD, max_token_length=MAX_TOKEN_LENGTH):
	self._stemmer = Stemmer.Stemmer("english")
	self._stemmer.maxCacheSize = 0
	self._stem_threshold = stem_threshold
	self._max_token_length = max_token_length

	def __call__(self, token):
	return self.process_token(token)

	def process_token(self, token):
	for word in self.split(token):
	yield self.stem(word)

	def stem(self, word):
	if len(word) <= self._stem_threshold:
	return word
	return self._stemmer.stemWord(word)

	def split(self, token):
	token = token.strip()[:self._max_token_length]
	prev_p = [""]

	def ret(name):
	r = name.lower()
	if len(name) >= 3:
	yield r
	if prev_p[0]:
	yield prev_p[0] + r
	prev_p[0] = ""
	else:
	prev_p[0] = r

	for part in self.NAME_BREAKUP_RE.split(token):
	if not part:
	continue
	prev = part[0]
	pos = 0
	for i in range(1, len(part)):
	this = part[i]
	if prev.islower() and this.isupper():
	yield from ret(part[pos:i])
	pos = i
	elif prev.isupper() and this.islower():
	if 0 < i - 1 - pos <= 3:
	yield from ret(part[pos:i - 1])
	pos = i - 1
	elif i - 1 > pos:
	yield from ret(part[pos:i])
	pos = i
	prev = this
	last = part[pos:]
	if last:
	yield from ret(last)

	def __getstate__(self):
	state = self.__dict__.copy()
	del state["_stemmer"]
	return state

	def __setstate__(self, state):
	self.__dict__ = state
	self._stemmer = Stemmer.Stemmer("english")


	class NoTokenParser:
	"""
	One can use this class if he or she does not want to do any parsing.
	"""

	def process_token(self, token):
	return [token]