userzimmermann/Pygmented coalangs.ipynb

## Pygmented coalangs.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**-- A Python Notebook by [Stefan Zimmermann](https://zimmermann.co/)**\n",
    "\n",
    "> **[mailto:user@zimmermann.co]()**\n",
    ">\n",
    "> **[Tweet @zimmermanncode](https://twitter.com/intent/tweet?text=@zimmermanncode)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'3.5.3 |Continuum Analytics, Inc.| (default, May 15 2017, 10:43:23) [MSC v.1900 64 bit (AMD64)]'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import sys\n",
    "sys.version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## coalangs?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [coala](https://coala.io) code analysis tool contains a registry of\n",
    "manually maintained programming language definitions for:\n",
    "\n",
    "* Defining which coala [Bears](https://github.com/coala/bears) support which languages\n",
    "* Defining generalized syntax features of those languages for language-independent code analysis\n",
    "* Doing custom parsing of language code based on those generalized syntax features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pygments?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Pygments](https://pypi.python.org/pypi/pygments)\n",
    "is a Python library for syntax highlighting of source code.\n",
    "\n",
    "For that purpose it ships lexer definitions for a large number of programming languages\n",
    "and a custom lexer framework for tokenizing source code in all those languages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## So why Pygmenting coala?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comparing the two descriptions above,\n",
    "it becomes obvious that coala is reinventing the wheel\n",
    "regarding custom programming language syntax definitions and source code parsing features.\n",
    "\n",
    "coala's main tasks are complex code analysis by reusing external linting tools,\n",
    "analysis report generation, automatic code fixing,\n",
    "and integration into source code management, software project management,\n",
    "and continuous integration platforms.\n",
    "\n",
    "Custom language syntax definitions and source code parsing\n",
    "are only a small part of coala's feature set,\n",
    "but the main task of Pygments.\n",
    "\n",
    "coala's developer community has limited resources for dealing with the latter,\n",
    "while Pygments' developer community is fully devoted to it.\n",
    "\n",
    "Therefore it's more than reasonable to reuse Pygments features as much as possible\n",
    "instead of reimplementing what Pygments is already capable of."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## coala language definitions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from coalib.bearlib.languages import (\n",
    "    Language, definitions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All specific language classes are defined in separate modules under\n",
    "`coalib.bearlib.languages.definitions`\n",
    "and get auto-registered by using `@Language` as class decorator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "from coalib.bearlib.languages.Language import Language\n",
      "\n",
      "\n",
      "@Language\n",
      "class Python:\n",
      "    aliases = 'py',\n",
      "    versions = 2.7, 3.3, 3.4, 3.5, 3.6\n",
      "\n",
      "    extensions = '.py',\n",
      "    comment_delimiter = '#'\n",
      "    multiline_comment_delimiters = {}\n",
      "    string_delimiters = {'\"': '\"', \"'\": \"'\"}\n",
      "    multiline_string_delimiters = {'\"\"\"': '\"\"\"', \"'''\": \"'''\"}\n",
      "    indent_types = ':',\n",
      "    encapsulators = {'(': ')', '[': ']', '{': '}'}\n",
      "\n"
     ]
    }
   ],
   "source": [
    "with open(definitions.Python.__file__) as file:\n",
    "    print(file.read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "coalib.bearlib.languages.Language.Python"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Language.Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that a language has lots of attributes\n",
    "used for finding syntax elements in source code,\n",
    "like strings or comments,\n",
    "which is for example done by the [AnnotationBear](\n",
    "  https://github.com/coala/coala-bears/blob/master/bears/general/AnnotationBear.py).\n",
    "Many of them could become obsolete by just parsing with Pygments.\n",
    "Also `aliases` and file `extensions` are available in Pygments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pygmenting coala"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pygments.lexers import get_lexer_by_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Pygments, every language is defined as a lexer,\n",
    "which you can look up by any of its aliases (case-insensitively):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pygments.lexers.PythonLexer>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pylexer = get_lexer_by_name('py')\n",
    "pylexer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It has aliases and file extensions defined..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['python', 'py', 'sage']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pylexer.aliases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['*.py', '*.pyw', '*.sc', 'SConstruct', 'SConscript', '*.tac', '*.sage']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pylexer.filenames"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... and even MIME types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['text/x-python', 'application/x-python']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pylexer.mimetypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To not break any existing coala functionality\n",
    "and to make language definitions from Pygments further customizable,\n",
    "it makes sense to just merge them into coala's existing definitions,\n",
    "for example by adding a cached `.pygment` meta property to the `Language` class,\n",
    "returning the corresponding Pygments lexer,\n",
    "which can then be used to retrieve additional definitions\n",
    "and perform parsing actions.\n",
    "\n",
    "And every Pygments language not defined by coala yet\n",
    "could just be dynamically created as a new `@Language` class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from functools import lru_cache"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "class PygmentedLanguageMeta(type(Language)):\n",
    "    @property\n",
    "    @lru_cache()\n",
    "    def pygment(cls):\n",
    "        # Look up the Pygments lexer named like the language class\n",
    "        return get_lexer_by_name(cls.__name__)\n",
    "\n",
    "    @lru_cache()\n",
    "    def __getattr__(cls, name):\n",
    "        try:\n",
    "            return super().__getattr__(name)\n",
    "        except AttributeError:\n",
    "            # Not a coala language yet: create one from the Pygments lexer\n",
    "            pygment = get_lexer_by_name(name)\n",
    "            return Language(\n",
    "                type(pygment.name, (), {}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Language.__class__ = PygmentedLanguageMeta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now I just use a little ugly hack ;)\n",
    "because every language class gets its own `SubLanguageMeta`\n",
    "derived from the metaclass of `Language`..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Language.Python.__class__.pygment = (\n",
    "    PygmentedLanguageMeta.pygment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And here is the `.pygment`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pygments.lexers.PythonLexer>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Language.Python.pygment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And some dynamically created language class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "coalib.bearlib.languages.Language.COBOL"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Language.Cobol"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pygments.lexers.CobolLexer>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Language.Cobol.pygment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This should be enough as proof of concept...\n",
    "The real implementation will happen in a pull request based on\n",
    "https://github.com/coala/coala/issues/4775"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pygmenting the AnnotationBear"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned before, the [AnnotationBear](\n",
    "  https://github.com/coala/coala-bears/blob/master/bears/general/AnnotationBear.py)\n",
    "extracts comment and string ranges from source code\n",
    "based on coala language definitions.\n",
    "Let's take a look at the bear's results\n",
    "by using the convenient `Unleashed` bear wrapper from [bearsh](\n",
    "  https://github.com/coala/bearsh):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bearsh import Unleashed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "bear = Unleashed.AnnotationBear()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(and let's hide unnecessary logging messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import logging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "logging.getLogger().setLevel(logging.CRITICAL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<HiddenResult object(id=0xe3382b2c62e545ce87caf396e846f4b8, origin='Unleashed[AnnotationBear]', message='', contents={'strings': (<SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=3, column=1) at 0x203bd0505c0>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=3, column=25) at 0x203bd050518>) at 0x203bd050320>, <SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=5, column=1) at 0x203bd050198>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=9, column=3) at 0x203bd050358>) at 0x203bd050f28>), 'comments': (<SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=1, column=1) at 0x203bd050e10>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=1, column=15) at 0x203bd0505f8>) at 0x203bd050588>,)}) at 0x203bd0502b0>]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(bear.run(r\"\"\"\n",
    "# some comment\n",
    "\n",
    "\"some escaped \\\"string\\\"\"\n",
    "\n",
    "'''\n",
    "some\n",
    "multiline\n",
    "string\n",
    "'''\n",
    "\"\"\".strip(), language='py'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We correctly get one `SourceRange` for `'comments'` and two for `'strings'`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But it also shows a limitation of the current implementation:\n",
    "It can handle escaped string delimiters inside of Python strings,\n",
    "but there are no language-specific escaping rules defined.\n",
    "It just always assumes backslashes,\n",
    "but not all languages use them for escaping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So all those different escaping rules\n",
    "would also have to be defined for all supported languages...\n",
    "But Pygments already knows them.\n",
    "So let's try to get a similar `contents` result from Pygments..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A Pygments lexer can parse source code into a sequence of\n",
    "`pygments.token.Token` types and actually matched sub-strings..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(Token.Comment.Single, '# some comment'),\n",
       " (Token.Text, '\\n'),\n",
       " (Token.Text, '\\n'),\n",
       " (Token.Literal.String.Double, '\"'),\n",
       " (Token.Literal.String.Double, 'some escaped '),\n",
       " (Token.Literal.String.Escape, '\\\\\"'),\n",
       " (Token.Literal.String.Double, 'string'),\n",
       " (Token.Literal.String.Escape, '\\\\\"'),\n",
       " (Token.Literal.String.Double, '\"'),\n",
       " (Token.Text, '\\n'),\n",
       " (Token.Text, '\\n'),\n",
       " (Token.Literal.String.Doc, \"'''\\nsome\\nmultiline\\nstring\\n'''\"),\n",
       " (Token.Text, '\\n')]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(pylexer.get_tokens(r\"\"\"\n",
    "# some comment\n",
    "\n",
    "\"some escaped \\\"string\\\"\"\n",
    "\n",
    "'''\n",
    "some\n",
    "multiline\n",
    "string\n",
    "'''\n",
    "\"\"\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "... where every token type, like `Token.String.Double` for the string delimiters\n",
    "(`Token.String` is a shortcut for `Token.Literal.String`),\n",
    "can be checked for being a sub-type of a more general type with the `in` operator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pygments.token import Token"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Token.String.Double in Token.String"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So by just combining the consecutive `Token.Comment` and `Token.String` elements,\n",
    "it should be possible to create the desired bear result `contents`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from coalib.results.AbsolutePosition import (\n",
    "    AbsolutePosition)\n",
    "from coalib.results.SourceRange import (\n",
    "    SourceRange)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def contents(source, token_types):\n",
    "    # Split into lines, keeping the line endings\n",
    "    source_lines = re.findall('.*\\n?', source)\n",
    "    # The lexer yields (token type, matched text) pairs\n",
    "    itokens = pylexer.get_tokens(source)\n",
    "    contents_dict = {}\n",
    "    # Absolute character offsets of the range being collected\n",
    "    start = end = 0\n",
    "    for token, text in itokens:\n",
    "        for name, token_type in (\n",
    "                token_types.items()):\n",
    "            # Consume all consecutive tokens of this type\n",
    "            while token in token_type:\n",
    "                end += len(text)\n",
    "                token, text = next(\n",
    "                    itokens, (None, ''))\n",
    "            # No token of this type at the current position\n",
    "            if not end > start:\n",
    "                continue\n",
    "            contents_dict.setdefault(\n",
    "                name, []\n",
    "            ).append(\n",
    "                SourceRange.from_absolute_position(\n",
    "                    ':notebook:',\n",
    "                    AbsolutePosition(\n",
    "                        source_lines, start),\n",
    "                    AbsolutePosition(\n",
    "                        source_lines, end - 1)))\n",
    "            start = end\n",
    "        # Skip the first token that matched none of the types\n",
    "        end += len(text)\n",
    "        start = end\n",
    "    return contents_dict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'comments': [<SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=1, column=1) at 0x203bd079470>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=1, column=14) at 0x203bd0794a8>) at 0x203bd079518>],\n",
       " 'strings': [<SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=3, column=1) at 0x203bd0794e0>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=3, column=25) at 0x203bd079588>) at 0x203bd0795c0>,\n",
       "  <SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=5, column=1) at 0x203bd079550>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=9, column=3) at 0x203bd079630>) at 0x203bd079668>]}"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contents(r\"\"\"\n",
    "# some comment\n",
    "\n",
    "\"some escaped \\\"string\\\"\"\n",
    "\n",
    "'''\n",
    "some\n",
    "multiline\n",
    "string\n",
    "'''\n",
    "\"\"\".strip(), {\n",
    "    'comments': Token.Comment,\n",
    "    'strings': Token.String,\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tada! Much less effort than the bear's current implementation :)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is only one little difference:\n",
    "the `end` position of the comment range,\n",
    "because the bear treats the newline as part of the comment.\n",
    "But this should be rather unimportant ;)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again... enough as proof of concept :)\n",
    "And this time the real implementation will happen in a pull request based on\n",
    "https://github.com/coala/coala-bears/issues/2092"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
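
The range merging from the AnnotationBear proof of concept can also be sketched without any dependencies. This is only a minimal sketch under assumptions: token types are modeled as plain tuples instead of real `pygments.token` types, the token stream is written out by hand to mirror the lexer output shown in the notebook, and `is_subtype`, `merge_ranges`, and `line_column` are hypothetical helper names that exist neither in coala nor in Pygments.

```python
def is_subtype(token_type, parent):
    # Mimics the `in` check on token types: a type is a tuple of name
    # parts, and a sub-type starts with all parts of its parent.
    return token_type[:len(parent)] == parent


def merge_ranges(tokens, wanted):
    """Merge consecutive tokens of a wanted kind into offset ranges.

    `tokens` is an iterable of (token_type, text) pairs covering the
    whole source; `wanted` maps a result name to a parent token type.
    Offsets are 0-based and the end offset is inclusive.
    """
    ranges = {name: [] for name in wanted}
    offset = 0
    current = None  # (name, start offset) of the run being collected
    for token_type, text in tokens:
        if not text:
            continue
        name = next((n for n, parent in wanted.items()
                     if is_subtype(token_type, parent)), None)
        # Close the running range as soon as the kind changes
        if current is not None and current[0] != name:
            ranges[current[0]].append((current[1], offset - 1))
            current = None
        if name is not None and current is None:
            current = (name, offset)
        offset += len(text)
    if current is not None:
        ranges[current[0]].append((current[1], offset - 1))
    return ranges


def line_column(source, offset):
    """Convert a 0-based character offset to a 1-based (line, column)."""
    line = source.count('\n', 0, offset) + 1
    column = offset - (source.rfind('\n', 0, offset) + 1) + 1
    return line, column


# Hand-written token stream mirroring the lexer output for the first
# part of the notebook's sample source (an assumption for illustration)
COMMENT = ('Comment', 'Single')
TEXT = ('Text',)
DOUBLE = ('Literal', 'String', 'Double')
ESCAPE = ('Literal', 'String', 'Escape')

tokens = [
    (COMMENT, '# some comment'),
    (TEXT, '\n'),
    (TEXT, '\n'),
    (DOUBLE, '"'),
    (DOUBLE, 'some escaped '),
    (ESCAPE, '\\"'),
    (DOUBLE, 'string'),
    (ESCAPE, '\\"'),
    (DOUBLE, '"'),
    (TEXT, '\n'),
]
source = ''.join(text for _, text in tokens)

ranges = merge_ranges(tokens, {
    'comments': ('Comment',),
    'strings': ('Literal', 'String'),
})
for name, spans in sorted(ranges.items()):
    for start, end in spans:
        print(name, line_column(source, start), line_column(source, end))
```

For this sample, the computed positions agree with the line and column values in the notebook output: the comment ends at line 1, column 14, and the string spans line 3, columns 1 to 25.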
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"-- A Python Notebook by [Stefan Zimmermann](https://zimmermann.co/)\n",
	"\n",
	"> [mailto:user@zimmermann.co]()\n",
	">\n",
	"> [Tweet @zimmermanncode](https://twitter.com/intent/tweet?text=@zimmermanncode)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"'3.5.3 \|Continuum Analytics, Inc.\| (default, May 15 2017, 10:43:23) [MSC v.1900 64 bit (AMD64)]'"
	]
	},
	"execution_count": 1,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"import sys\n",
	"sys.version"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## coalangs?"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The [coala](https://coala.io) code analysis tool contains a registry of\n",
	"manually maintainted programming language definitions for:\n",
	"\n",
	"* Defining which coala [Bears](https://github.com/coala/bears) support which languages\n",
	"* Defining generalized syntax features of those languages for language independent code analysis\n",
	"* Doing custom parsing of language code based on those generalized syntax features"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Pygments?"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"[Pygments](https://pypi.python.org/pypi/pygments)\n",
	"is a Python library for syntax highlighting of source code\n",
	"\n",
	"Therefore it contains complete syntax definitions of any relevant programming language\n",
	"and a custom parser framework for tokenizing source code of all those languages"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## So why Pygmenting coala?"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"By thinking about the above descriptions,\n",
	"it becomes obvious that coala is reinventing the wheel\n",
	"regarding custom programming language syntax definitions and source code parsing features\n",
	"\n",
	"coala's main tasks are complex code analysis by reusing external linting tools,\n",
	"analysis report generation, auto code fixing,\n",
	"and integration into source code management, software project management,\n",
	"and continuous integration platforms\n",
	"\n",
	"Custom language syntax definitions and source code parsing\n",
	"is only a small part of coala's feature set,\n",
	"but the main task of Pygments\n",
	"\n",
	"coala's developer community has limited resources for dealing with the latter,\n",
	"Pygments' developer community is fully devoted to it\n",
	"\n",
	"Therefore it's more than reasonable to reuse Pygments features as much as possible\n",
	"instead of reimplementing what Pygments is already capable of"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## coala language definitions"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"from coalib.bearlib.languages import (\n",
	" Language, definitions)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"All specific language classes are defined in separate modules under\n",
	"`coalib.bearlib.languages.definitions`\n",
	"and get auto-registered by using `@Language` as class decorator:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"from coalib.bearlib.languages.Language import Language\n",
	"\n",
	"\n",
	"@Language\n",
	"class Python:\n",
	" aliases = 'py',\n",
	" versions = 2.7, 3.3, 3.4, 3.5, 3.6\n",
	"\n",
	" extensions = '.py',\n",
	" comment_delimiter = '#'\n",
	" multiline_comment_delimiters = {}\n",
	" string_delimiters = {'\"': '\"', \"'\": \"'\"}\n",
	" multiline_string_delimiters = {'\"\"\"': '\"\"\"', \"'''\": \"'''\"}\n",
	" indent_types = ':',\n",
	" encapsulators = {'(': ')', '[': ']', '{': '}'}\n",
	"\n"
	]
	}
	],
	"source": [
	"with open(definitions.Python.__file__) as file:\n",
	" print(file.read())"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"coalib.bearlib.languages.Language.Python"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"Language.Python"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"You can see that a language has lots of attributes\n",
	"used for finding syntax elements in source code,\n",
	"like strings or comments,\n",
	"which is for example done by the [AnnotationBear](\n",
	" https://github.com/coala/coala-bears/blob/master/bears/general/AnnotationBear.py).\n",
	"Many of them could become obsolete by just parsing with Pygments.\n",
	"Also `aliases` and file `extensions` are availabe in Pygments"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Pygmenting coala"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"from pygments.lexers import get_lexer_by_name"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In Pygments, every language is defined as a lexer,\n",
	"which you can lookup by full or alias name:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<pygments.lexers.PythonLexer>"
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pylexer = get_lexer_by_name('py')\n",
	"pylexer"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"It has aliases and file extensions defined..."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['python', 'py', 'sage']"
	]
	},
	"execution_count": 7,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pylexer.aliases"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['.py', '.pyw', '.sc', 'SConstruct', 'SConscript', '.tac', '*.sage']"
	]
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pylexer.filenames"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"... and even MIME types:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['text/x-python', 'application/x-python']"
	]
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pylexer.mimetypes"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"To not break any existing coala funcionality\n",
	"and to make language definitions from Pygments further customizable,\n",
	"it makes sense to just merge them into coala's existing definitions,\n",
	"for example by adding a cached `.pygment` meta property to the `Language` class,\n",
	"returning the according Pygments lexer,\n",
	"which can then be used to retrieve additional definitions\n",
	"and perform parsing actions.\n",
	"\n",
	"And every Pygments language not defined by coala yet\n",
	"could just be dynamically created as a new `@Language` class:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [],
	"source": [
	"from functools import lru_cache"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [],
	"source": [
	"class PygmentedLanguageMeta(type(Language)):\n",
	" @property\n",
	" @lru_cache()\n",
	" def pygment(cls):\n",
	" return get_lexer_by_name(cls.__name__)\n",
	"\n",
	" @lru_cache()\n",
	" def __getattr__(cls, name):\n",
	" try:\n",
	" return super().__getattr__(name)\n",
	" except AttributeError:\n",
	" pygment = get_lexer_by_name(name)\n",
	" return Language(\n",
	" type(pygment.name, (), {}))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"Language.__class__ = PygmentedLanguageMeta"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Now I just use a little ugly hack ;)\n",
	"because every language class gets an own `SubLanguageMeta`\n",
	"derived from the metaclass of `Language`..."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"Language.Python.__class__.pygment = (\n",
	" PygmentedLanguageMeta.pygment)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"And here is the `.pygment`:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 14,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<pygments.lexers.PythonLexer>"
	]
	},
	"execution_count": 14,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"Language.Python.pygment"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"And some dynamically created language class:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 15,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"coalib.bearlib.languages.Language.COBOL"
	]
	},
	"execution_count": 15,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"Language.Cobol"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 16,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<pygments.lexers.CobolLexer>"
	]
	},
	"execution_count": 16,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"Language.Cobol.pygment"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This should be enough as proof of concept...\n",
	"The real implementation will happen in a pull request based on\n",
	"https://github.com/coala/coala/issues/4775"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Pygmenting the AnnotationBear"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As mentioned before, the [AnnotationBear](\n",
	" https://github.com/coala/coala-bears/blob/master/bears/general/AnnotationBear.py)\n",
	"extracts comment and string ranges from source code\n",
	"based on coala language definitions.\n",
	"Let's take a look at the bear's results\n",
	"by using the convenient `Unleashed` bear wrapper from [bearsh](\n",
	" https://github.com/coala/bearsh):"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 17,
	"metadata": {},
	"outputs": [],
	"source": [
	"from bearsh import Unleashed"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 19,
	"metadata": {},
	"outputs": [],
	"source": [
	"bear = Unleashed.AnnotationBear()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"(and let's hide unnecessary logging messages)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 20,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import logging"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 21,
	"metadata": {},
	"outputs": [],
	"source": [
	"logging.getLogger().setLevel(logging.CRITICAL)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 22,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"[<HiddenResult object(id=0xe3382b2c62e545ce87caf396e846f4b8, origin='Unleashed[AnnotationBear]', message='', contents={'strings': (<SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=3, column=1) at 0x203bd0505c0>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=3, column=25) at 0x203bd050518>) at 0x203bd050320>, <SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=5, column=1) at 0x203bd050198>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=9, column=3) at 0x203bd050358>) at 0x203bd050f28>), 'comments': (<SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=1, column=1) at 0x203bd050e10>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:bearsh-input:', line=1, column=15) at 0x203bd0505f8>) at 0x203bd050588>,)}) at 0x203bd0502b0>]"
	]
	},
	"execution_count": 22,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"list(bear.run(r\"\"\"\n",
	"# some comment\n",
	"\n",
	"\"some escaped \\\"string\\\"\"\n",
	"\n",
	"'''\n",
	"some\n",
	"multiline\n",
	"string\n",
	"'''\n",
	"\"\"\".strip(), language='py'))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We correctly get one `SourceRange` for `'comments'` and two for `'strings'`"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"But it also shows a limitation of the current implementation:\n",
	"It can handle escaped string delimiters inside of Python strings,\n",
	"but only because it always assumes backslashes as escape characters.\n",
	"There are no language-specific escaping rules defined,\n",
	"and not all languages use backslashes for escaping."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"So all those different escaping rules\n",
	"would also have to be defined for all supported languages...\n",
	"But Pygments already knows them.\n",
	"So let's try to get similar result `contents` from Pygments..."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"A Pygments lexer can parse source code into a sequence of pairs,\n",
	"each consisting of a `pygments.token.Token` type and the sub-string it actually matched..."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 23,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"[(Token.Comment.Single, '# some comment'),\n",
	" (Token.Text, '\\n'),\n",
	" (Token.Text, '\\n'),\n",
	" (Token.Literal.String.Double, '\"'),\n",
	" (Token.Literal.String.Double, 'some escaped '),\n",
	" (Token.Literal.String.Escape, '\\\\\"'),\n",
	" (Token.Literal.String.Double, 'string'),\n",
	" (Token.Literal.String.Escape, '\\\\\"'),\n",
	" (Token.Literal.String.Double, '\"'),\n",
	" (Token.Text, '\\n'),\n",
	" (Token.Text, '\\n'),\n",
	" (Token.Literal.String.Doc, \"'''\\nsome\\nmultiline\\nstring\\n'''\"),\n",
	" (Token.Text, '\\n')]"
	]
	},
	"execution_count": 23,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"list(pylexer.get_tokens(r\"\"\"\n",
	"# some comment\n",
	"\n",
	"\"some escaped \\\"string\\\"\"\n",
	"\n",
	"'''\n",
	"some\n",
	"multiline\n",
	"string\n",
	"'''\n",
	"\"\"\"))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"... whereby every token type like `Token.String.Double`\n",
	"can be checked for being a sub-type of a more general one like `Token.String`\n",
	"(which is a short-cut for `Token.Literal.String`) with the `in` operator:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 24,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"from pygments.token import Token"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 25,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"True"
	]
	},
	"execution_count": 25,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"Token.String.Double in Token.String"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"So by just combining the consecutive `Token.Comment` and `Token.String` elements,\n",
	"it should be possible to create the desired bear result `contents`:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 26,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"import re"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 27,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"from coalib.results.AbsolutePosition import (\n",
	" AbsolutePosition)\n",
	"from coalib.results.SourceRange import (\n",
	" SourceRange)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 28,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"def contents(source, token_types):\n",
	"    # split into lines, keeping the line endings\n",
	"    source_lines = re.findall('.*\\n?', source)\n",
	"    itokens = pylexer.get_tokens(source)\n",
	"    contents_dict = {}\n",
	"    # absolute character offsets of the current range\n",
	"    start = end = 0\n",
	"    for token, text in itokens:\n",
	"        for name, token_type in (\n",
	"                token_types.items()):\n",
	"            # consume all consecutive tokens of this type\n",
	"            while token in token_type:\n",
	"                end += len(text)\n",
	"                token, text = next(\n",
	"                    itokens, (None, ''))\n",
	"            if not end > start:\n",
	"                continue\n",
	"            contents_dict.setdefault(\n",
	"                name, []\n",
	"            ).append(\n",
	"                SourceRange.from_absolute_position(\n",
	"                    ':notebook:',\n",
	"                    AbsolutePosition(\n",
	"                        source_lines, start),\n",
	"                    AbsolutePosition(\n",
	"                        source_lines, end - 1)))\n",
	"            start = end\n",
	"        # skip the current token, which was not collected into a range\n",
	"        end += len(text)\n",
	"        start = end\n",
	"    return contents_dict"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 29,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"{'comments': [<SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=1, column=1) at 0x203bd079470>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=1, column=14) at 0x203bd0794a8>) at 0x203bd079518>],\n",
	" 'strings': [<SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=3, column=1) at 0x203bd0794e0>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=3, column=25) at 0x203bd079588>) at 0x203bd0795c0>,\n",
	" <SourceRange object(start=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=5, column=1) at 0x203bd079550>, end=<SourcePosition object(file='C:\\\\Users\\\\Zimmermann\\\\Projects\\\\coala\\\\gist\\\\Pygmented coalangs\\\\:notebook:', line=9, column=3) at 0x203bd079630>) at 0x203bd079668>]}"
	]
	},
	"execution_count": 29,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"contents(r\"\"\"\n",
	"# some comment\n",
	"\n",
	"\"some escaped \\\"string\\\"\"\n",
	"\n",
	"'''\n",
	"some\n",
	"multiline\n",
	"string\n",
	"'''\n",
	"\"\"\".strip(), {\n",
	" 'comments': Token.Comment,\n",
	" 'strings': Token.String,\n",
	"})"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Tada! Much less effort than the bear's current implementation :)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"There is only one small difference:\n",
	"The `end` position of the comment range is column 14 instead of 15,\n",
	"because the bear treats the newline as part of the comment.\n",
	"But this should be rather unimportant ;)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Again... enough as a proof of concept :)\n",
	"And this time the real implementation will happen in a pull request based on\n",
	"https://github.com/coala/coala-bears/issues/2092"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.5.3"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}