Skip to content

Instantly share code, notes, and snippets.

@ptmcg
Last active October 30, 2019 05:28
Show Gist options
  • Save ptmcg/6f8baa1105c6517b268f1d3f7417d872 to your computer and use it in GitHub Desktop.
Save ptmcg/6f8baa1105c6517b268f1d3f7417d872 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Parsing Problem, Using PyParsing\n",
"\n",
"To work through this tutorial, we will address a hypothetical problem. Given a string of alphabetic characters, expand any ranges to return a list of all the individual characters.\n",
"\n",
"For instance given:\n",
"\n",
" A-C,X,M-P,Z\n",
"\n",
"return the list of individual characters:\n",
"\n",
" A,B,C,X,M,N,O,P,Z\n",
"\n",
"For simplicity of this tutorial, we will limit ourselves to single, upper-case, alphabetic characters. At the end, I'll suggest some enhancements that you'll be able to tackle on your own.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parsing Options in Python\n",
"\n",
"Python offers several built-in mechanisms for slicing and dicing text strings:\n",
"\n",
" -- `str` class - the built-in `str` class itself includes a number of methods such as `split`, `join`, `startswith`, and `endswith` that are suitable for many basic string problems\n",
"\n",
" -- `re` and `regex` modules - the built-in `re` module and the externally available enhancement `regex` bring all the power, speed, complexity, and arcana of regular expressions to crack many difficult parsing problems\n",
"\n",
" -- `shlex` module\n",
"\n",
"These features are good for many quick text-splitting and scanning tasks, but can suffer when it comes to ease of maintenance or extensibility. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Approaching a problem with Pyparsing\n",
"\n",
"Whether using pyparsing or another parsing library, it is always good to start out with a plan for the parser. In parsing terms, this is called a BNF, for Backus-Naur Form. This plan helps structure your thinking and the resulting implementation of your parser. Here is a simplified BNF for the letter range problem:\n",
"\n",
" letter ::= 'A'..'Z'\n",
" letter_range ::= letter '-' letter\n",
" letter_range_item ::= letter_range | letter\n",
" letter_range_list ::= letter_range_item [',' letter_range_item]...\n",
"\n",
"This translates almost directly to pyparsing Python code:\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pyparsing as pp\n",
"\n",
"letter = pp.oneOf(list(pp.alphas))\n",
"letter_range = letter + '-' + letter\n",
"letter_range_item = letter_range | letter\n",
"letter_range_list = letter_range_item + (',' + letter_range_item)[...]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `parseString` to run this parser:\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['A', '-', 'C', ',', 'X', ',', 'M', '-', 'P', ',', 'Z']\n"
]
}
],
"source": [
"test = \"A-C,X,M-P,Z\"\n",
"print(letter_range_list.parseString(test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While this shows that we did in fact parse everything in the input, we still have to do some cleaning up, and also add the expansion of the `A-C` type ranges.\n",
"\n",
"First thing is to strip out the commas. They are important during the parsing process, but not really necessary for the parsed results. Pyparsing provides a class `Suppress` for this type of expression, that needs to be parsed, but should be suppressed from the results:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"letter_range_list = letter_range_item + pp.ZeroOrMore(pp.Suppress(',') + letter_range_item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since Pyparsing 2.4.2, you can write `ZeroOrMore` using ellipsis notation:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"letter_range_list = letter_range_item + (pp.Suppress(',') + letter_range_item)[...]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As it happens, this `expr + (Suppress(delimiter) + expr)[...]` pattern occurs so frequently, that pyparsing includes a helper method `delimitedList(expr, delim)` (with a default delimiter of `','`) that supports this pattern:\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"letter_range_list = pp.delimitedList(letter_range_item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this change, our output is now cleaner:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['A', '-', 'C', 'X', 'M', '-', 'P', 'Z']\n"
]
}
],
"source": [
"print(letter_range_list.parseString(test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last step is to expand the sub-ranges `'A-C'` and `'M-P'`. We could certainly walk this list of parsed items, looking for the `'-'` symbols:\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'A-C,X,M-P,Z'\n",
"['A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']\n"
]
}
],
"source": [
"def expand_range(from_, to_):\n",
" alphas = \"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"\n",
" from_loc = alphas.index(from_)\n",
" to_loc = alphas.index(to_)\n",
" return list(alphas[from_loc:to_loc+1])\n",
"\n",
"out = []\n",
"parsed_output = letter_range_list.parseString(test)\n",
"\n",
"letter_iter = iter(parsed_output)\n",
"letter = next(letter_iter, '')\n",
"while letter:\n",
" if letter == '-':\n",
" to_ = next(letter_iter)\n",
" out.extend(expand_range(last, to_)[1:])\n",
" else:\n",
" out.append(letter)\n",
" last = letter\n",
" letter = next(letter_iter, '')\n",
"\n",
"print(repr(test))\n",
"print(out)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But doesn't it feel like we are retracing steps that the parser must have already taken? When pyparsing ran our parser to scan the original input string, it had to follow a very similar process to recognize the `letter_range` expression. Why not have pyparsing run this expansion process at the same time that it has parsed out a `letter_range`?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can attach a callback to an expression; in pyparsing applications, this callback is called a *parse action*. Parse actions can serve various purposes, but the feature we will use in this example is to replace the parsed tokens with a new set of tokens. We'll still utilize the `expand_range` function defined above, but now calling it will be more straightforward:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'A-C,X,M-P,Z'\n",
"['A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']\n"
]
}
],
"source": [
"def expand_parsed_range(t):\n",
" return expand_range(t[0], t[2])\n",
"letter_range.addParseAction(expand_parsed_range)\n",
"\n",
"print(repr(test))\n",
"print(letter_range_list.parseString(test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## One more thing...\n",
"\n",
"While this accomplishes our goal, there is one pyparsing \"best practice\" that I'd like to incorporate into this example. Whenever we access parsed data using numeric indices like in\n",
"\n",
" return expand_range(t[0], t[2])\n",
"\n",
"we make it difficult to make changes in our grammar at some time in the future. For instance, if the grammar were to change and add an optional field in this expression, our index of 2 might have to be conditionalized depending on whether that field were present or not. It would be much better if we could refer to these fields by name.\n",
"\n",
"to:\n",
"\n",
" letter_range = letter('start') + '-' + letter('end')\n",
"\n",
"\n",
"With this change, we can now write our parse action using these results names:\n",
"\n",
" def expand_parsed_range(t):\n",
" return expand_range(t.start, t.end)\n",
"\n",
"And now if the definition of `letter_range` changes by adding other fields, our starting and ending fields will still be processed correctly, without having to change any existing code. Parsers that use results names are much more robust and maintainable over time.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Complete Parser\n",
"\n",
"Here is the full working parser example:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']\n"
]
}
],
"source": [
"import pyparsing as pp\n",
"\n",
"def expand_range(from_, to_):\n",
" alphas = \"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"\n",
" from_loc = alphas.index(from_)\n",
" to_loc = alphas.index(to_)\n",
" return list(alphas[from_loc:to_loc+1])\n",
"\n",
"\"\"\"\n",
"BNF:\n",
" letter ::= 'A'..'Z'\n",
" letter_range ::= letter '-' letter\n",
" letter_range_item ::= letter_range | letter\n",
" letter_range_list ::= letter_range_item [',' letter_range_item]...\n",
"\"\"\"\n",
"letter = pp.oneOf(list(pp.alphas))\n",
"letter_range = letter('start') + '-' + letter('end')\n",
"letter_range_item = letter_range | letter\n",
"letter_range_list = pp.delimitedList(letter_range_item)\n",
"\n",
"def expand_parsed_range(t):\n",
" return expand_range(t.start, t.end)\n",
"letter_range.addParseAction(expand_parsed_range)\n",
"\n",
"test = \"A-C,X,M-P,Z\"\n",
"print(letter_range_list.parseString(test))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Potential extensions\n",
"\n",
"Here are some extensions to this parser that you can try on your own:\n",
"\n",
" -- Change the syntax of a range from `\"A-C\"` to `\"A:C\"`; how much of your program did you have to change?\n",
"\n",
" -- Add support for lower-case letters and numeric digits\n",
" \n",
" -- Select a Unicode set from `pp.pyparsing_unicode` and handle ranges including non-ASCII characters \n",
" \n",
" -- Add a parse action to `letter_item_list` to return a sorted list, with no duplicates\n",
" \n",
" -- Add validation that `letter_range` is proper (do not allow degenerate ranges like `\"J-J\"`, out of order ranges `\"X-M\"`, or mixed ranges like `\"T-4\"`)\n",
" \n",
" -- Write another parser to handle ranges of integers instead of letters\n",
" \n",
" -- Write a series of tests, and run them using `letter_range_list.runTests`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## For More Information\n",
"\n",
"You can find more information on pyparsing by visiting the pyparsing GitHub repo, at https://github.com/pyparsing/pyparsing."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment