public
Last active

An IPython Notebook walk-through of how to make an interesting key function for comparing (sort/min/max) in Python. View it at the notebook viewer: http://nbviewer.ipython.org/6696813 Corrections welcome

  • Download Gist
trickysorting.ipynb
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
{
"metadata": {
"name": "Tricky sorting"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Using tricky key functions for sort/min/max in Python"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "At the moment I am rather enamoured with key functions in Python. The functions [`sorted`](http://docs.python.org/3/library/functions.html#sorted), `min` and `max` take an optional key argument that, if specified, should be a function that is applied to each item in the iterable. The result of this key function is what is used when comparing items in the list to order them.\n\nHere is an example simplified from my work. I have a list of 'sentences'. Each sentence is essentially a sentence of a forecast, describing a weather element or type, such as precipitation (precip), thunderstorms (TS), sky, fog (FG), large hail (H), gusty winds (w). Each sentence has a time range associated with it (TS may only be present in the late evening, for example). So what order should things be reported in?\n\n * In general, things should be reported in the order that they occur, so based on the start time. If the start times are equal, then compare the end times.\n * If the start and end times are equal, report them in the same order as a reference order (e.g. [sky, FG, precip, TS, w, H])\n* But also, make sure the TS are reported after the precip, if there is precip. (This is a bit of business logic: because the TS and precip are related phenomena and generally occur together, it could be jarring or absurd to those who know about such things to see TS mentioned before precip, even if the TS is being reported as starting earlier.)\n * Also also, some items like H only occur in the presence of other elements like TS. In this case their sentence will actually be something like _Thunderstorms possibly severe in the afternoon with large hail_ which emphasises their relationship to the TS. In this case we need to mention the H directly after the TS regardless of the time ranges.\n\nComplicated rules like this make it tempting to give up on using the built-in methods at all and just use a hand-crafted method which picks apart the elements and painstakingly pieces them back together according to requirements. However I would argue that is more error-prone, difficult to debug, hard to modify, and that with a couple of tricks we can keep using the built-in methods and still have a clue what is going on.\n\nI will also add, don't overlook the benefit of using these with min and max, where you may want to find the \"best\" result amongst a set according to cascading criteria. I use this frequently at work in combination with exhaustive search (aka [brute-force search](https://en.wikipedia.org/wiki/Brute-force_search) or \"generate and test\"). If the combinatorics are not going to explode on you too badly, I think it's a good way of knowing you have arrived at the \"best\" result.\n\n\nOK, so our data looks something like this:\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from collections import namedtuple\n\nclass Sentence(namedtuple('Sentence', 'type start end words')):\n def __repr__(self):\n return 'Sentence({0.type}, {0.start}-{0.end})'.format(self)\n\npriorityOrder = ['sky', 'FG', 'precip', 'TS', 'w', 'H']\n\nprecip = Sentence('precip', 12, 24, 'Isolated showers from midday, becoming widespread in the evening')\nwinds = Sentence('w', 12, 24, 'Gusty winds in the early afternoon')\nts = Sentence('TS', 9, 24, 'Scattered thunderstorms from the late morning')\nhail = Sentence('H', 15, 21, 'Thunderstorms possibly severe in the afternoon with large hail')\nsky = Sentence('sky', 0, 0, 'Partly cloudy')\n\nsentences = set([winds, ts, hail, precip, sky])\n\nexpected = [sky, precip, ts, hail, winds]",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 99
},
{
"cell_type": "markdown",
"metadata": {},
"source": "OK so let's have a crack at a really simple key function. We want to sort on start time, then end time."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def byTime(sentence):\n return (sentence.start, sentence.end)\n\n# Or more briefly:\nbyTime = lambda s: (s.start, s.end)\n\nfrom pprint import pprint\npprint(sorted(sentences, key=byTime))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[Sentence(sky, 0-0),\n Sentence(TS, 9-24),\n Sentence(precip, 12-24),\n Sentence(w, 12-24),\n Sentence(H, 15-21)]\n"
}
],
"prompt_number": 100
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Already we are using the first feature of Python sorting - tuple sorting. Python sorts tuples by comparing all first elements, then if they are equal, second elements, and so on. Hence our winds (w) sentence is coming before the H, because 12 < 15. It seems pretty obvious, but this is the basis for making quite more complicated key functions.\n\nHowever we haven't incorporated the requirement to sort by priority, if the time ranges are equal. It happens that 'w' is sorting after 'precip' in the above example but we haven't enforced it, which we can confirm directly:\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "byTime(precip) < byTime(winds)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 101,
"text": "False"
}
],
"prompt_number": 101
},
{
"cell_type": "markdown",
"metadata": {},
"source": "So let's add another element to our key function's return value to account for priority."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def byTimeAndPriority(sentence):\n priority = priorityOrder.index(sentence.type)\n return (sentence.start, sentence.end, priority)\n\npprint(sorted(sentences, key=byTimeAndPriority))\n\nbyTimeAndPriority(precip) < byTimeAndPriority(winds)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[Sentence(sky, 0-0),\n Sentence(TS, 9-24),\n Sentence(precip, 12-24),\n Sentence(w, 12-24),\n Sentence(H, 15-21)]\n"
},
{
"output_type": "pyout",
"prompt_number": 102,
"text": "True"
}
],
"prompt_number": 102
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Also to confirm that what is happening is what we expect, we can print out the value of calling the key function on each item in the iterable. This is super useful in debugging your sort and is a major benefit over [the old method](http://docs.python.org/3/howto/sorting.html#the-old-way-using-the-cmp-parameter) of influencing sort by writing a `cmp` method (that takes two items from an iterable, and returns -1/0/1 according to which should be ordered first). I can't praise this highly enough. My workmate recently rewrote a cmp function that was \"mostly\" right (read: I couldn't figure out how to fix it) into a key function that I have the highest confidence in - because of this exact reason."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for item in sorted(sentences, key=byTimeAndPriority):\n print byTimeAndPriority(item), \"<=\", item",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(0, 0, 0) <= Sentence(sky, 0-0)\n(9, 24, 3) <= Sentence(TS, 9-24)\n(12, 24, 2) <= Sentence(precip, 12-24)\n(12, 24, 4) <= Sentence(w, 12-24)\n(15, 21, 5) <= Sentence(H, 15-21)\n"
}
],
"prompt_number": 103
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now I have confidence in our sorting!\n\nTo return to the trickier requirements. The TS is being listed ahead of the precip. What can we do about that?\n\nWell, maybe a useful thing to do is try and come up with the results of a key function that would give us what we want, and work backwards.\n\n<pre>\n(0, 0, 0, ...), Sentence(sky, 0-0)\n(12, 24, 2, ...) Sentence(precip, 12-24)\n(?, ?, ?, ...) Sentence(TS, 9-24)\n(?, ?, ?, ...) Sentence(H, 15-21)\n(12, 24, 4...) Sentence(w, 12-24)\n</pre>\n\nSo the thing that jumps out here is that the TS and H need to have the same values as the precip, so that they are sorted next to each other. But in our key function we only have one element of the iterable at a time. If the TS Sentence is passed to the key function we wouldn't generally have access to information about other elements in the iterable. So what we can we do? Well making a closure can help us out..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def generateKeyFn(items):\n precipStart = None\n precipEnd = None\n precipPriority = priorityOrder.index('precip')\n precipSentences = [item for item in items if item.type == 'precip']\n if precipSentences:\n precipStart = min(item.start for item in precipSentences)\n precipEnd = max(item.end for item in precipSentences)\n \n mustFollowPrecip = ('TS', 'H')\n \n def keyFn(item):\n if precipStart and item.type in mustFollowPrecip:\n return (precipStart, precipEnd, precipPriority)\n priority = priorityOrder.index(item.type)\n return (item.start, item.end, priority)\n\n return keyFn\n\n\nbySpecialTime = generateKeyFn(sentences)\nfor item in sorted(sentences, key=bySpecialTime):\n print bySpecialTime(item), \"<=\", item",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(0, 0, 0) <= Sentence(sky, 0-0)\n(12, 24, 2) <= Sentence(precip, 12-24)\n(12, 24, 2) <= Sentence(H, 15-21)\n(12, 24, 2) <= Sentence(TS, 9-24)\n(12, 24, 4) <= Sentence(w, 12-24)\n"
}
],
"prompt_number": 104
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This is a good start! Our precip/TS/H are all together. But we need to add extra elements to our key function to sort them correctly."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def generateKeyFn(items):\n precipStart, precipEnd = None, None\n precipPriority = priorityOrder.index('precip')\n precipSentences = [item for item in items if item.type == 'precip']\n if precipSentences:\n precipStart = min(item.start for item in precipSentences)\n precipEnd = max(item.end for item in precipSentences)\n \n mustFollowPrecip = ('TS', 'H')\n \n def keyFn(item):\n start, end = item.start, item.end\n priority = priorityOrder.index(item.type)\n mustFollow = item.type in mustFollowPrecip\n if precipStart and mustFollow:\n start, end, priority = precipStart, precipEnd, precipPriority\n return (start, end, priority, mustFollow)\n\n return keyFn\n\n\nbySpecialTime = generateKeyFn(sentences)\nfor item in sorted(sentences, key=bySpecialTime):\n print bySpecialTime(item), \"<=\", item",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(0, 0, 0, False) <= Sentence(sky, 0-0)\n(12, 24, 2, False) <= Sentence(precip, 12-24)\n(12, 24, 2, True) <= Sentence(H, 15-21)\n(12, 24, 2, True) <= Sentence(TS, 9-24)\n(12, 24, 4, False) <= Sentence(w, 12-24)\n"
}
],
"prompt_number": 105
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Here we are relying on Boolean sorting - False < True. The value of this fourth field is irrelevant to the non-precip/TS/H sentences as these are already being sorted on the first three fields.\n\nNow TS is following precip, but we also need the H to follow the TS."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def generateKeyFn(items):\n precipStart, precipEnd = None, None\n precipPriority = priorityOrder.index('precip')\n precipSentences = [item for item in items if item.type == 'precip']\n if precipSentences:\n precipStart = min(item.start for item in precipSentences)\n precipEnd = max(item.end for item in precipSentences)\n \n typesMustFollowPrecip = ('TS', 'H')\n typesMustFollowTS = ('H',)\n \n def keyFn(item):\n start, end = item.start, item.end\n priority = priorityOrder.index(item.type)\n mustFollowTS = item.type in typesMustFollowTS\n mustFollowPrecip = item.type in typesMustFollowPrecip\n if precipStart and mustFollowPrecip:\n start, end, priority = precipStart, precipEnd, precipPriority\n return (start, end, priority, mustFollowPrecip, mustFollowTS)\n\n return keyFn\n\n\nbySpecialTime = generateKeyFn(sentences)\nfor item in sorted(sentences, key=bySpecialTime):\n print bySpecialTime(item), \"<=\", item\n\nsorted(sentences, key=bySpecialTime) == expected",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(0, 0, 0, False, False) <= Sentence(sky, 0-0)\n(12, 24, 2, False, False) <= Sentence(precip, 12-24)\n(12, 24, 2, True, False) <= Sentence(TS, 9-24)\n(12, 24, 2, True, True) <= Sentence(H, 15-21)\n(12, 24, 4, False, False) <= Sentence(w, 12-24)\n"
},
{
"output_type": "pyout",
"prompt_number": 106,
"text": "True"
}
],
"prompt_number": 106
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Success! The fifth field is forcing the H after the TS.\n\nNow this key function is not totally correct. There is one immediate error. If we have TS and H but no precip, we haven't done anything to ensure that the H will still follow the precip. Witness:\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "sentencesNoPrecip = set([winds, ts, hail, sky])\nsortKey = generateKeyFn(sentencesNoPrecip)\nfor item in sorted(sentencesNoPrecip, key=sortKey):\n print sortKey(item), \"<=\", item",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "(0, 0, 0, False, False) <= Sentence(sky, 0-0)\n(9, 24, 3, True, False) <= Sentence(TS, 9-24)\n(12, 24, 4, False, False) <= Sentence(w, 12-24)\n(15, 21, 5, True, True) <= Sentence(H, 15-21)\n"
}
],
"prompt_number": 107
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Note that we had to generate a fresh key function because our input data was different.\n\nCorrecting `generateKeyFn` is left as an exercise to the reader. :) Here are some test cases, though. (Written for [py.test](http://pytest.org), likely to work with nose although untested)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import py\n\n\ndef test_sortingTime():\n input = [\n Sentence('precip', 0, 12, 'rain in the morning'),\n Sentence('sky', 12, 24, 'sunny afternoon'),\n ]\n expected = ['precip', 'sky']\n sortKey = generateKeyFn(input)\n assert [s.type for s in sorted(input, key=sortKey)] == expected\n\n\ndef test_sortingPriority():\n input = [\n Sentence('sky', 0, 24, 'cloudy'),\n Sentence('precip', 0, 24, 'rain'),\n ]\n expected = ['precip', 'sky']\n sortKey = generateKeyFn(input)\n assert [s.type for s in sorted(input, key=sortKey)] == expected\n\n\ndef test_sortingTSAfterPrecip():\n input = [\n Sentence('TS', 0, 12, 'thunderstorms'),\n Sentence('precip', 12, 24, 'rain'),\n ]\n expected = ['precip', 'TS']\n sortKey = generateKeyFn(input)\n assert [s.type for s in sorted(input, key=sortKey)] == expected\n\n \ndef test_sortingHailAfterTSWithPrecip():\n input = [\n Sentence('precip', 12, 24, 'rain'),\n Sentence('TS', 0, 12, 'thunderstorms'),\n Sentence('H', 0, 6, 'hail'),\n ]\n expected = ['precip', 'TS', 'H']\n sortKey = generateKeyFn(input)\n assert [s.type for s in sorted(input, key=sortKey)] == expected\n\n\ndef test_sortingHailAfterTSNoPrecip():\n input = [\n Sentence('TS', 0, 12, 'thunderstorms'),\n Sentence('H', 0, 6, 'hail'),\n ]\n expected = ['TS', 'H']\n sortKey = generateKeyFn(input)\n assert [s.type for s in sorted(input, key=sortKey)] == expected\n\n",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 108
},
{
"cell_type": "markdown",
"metadata": {},
"source": "And as expected, this last test fails when I run the tests:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "$py.test test_sort.py -v\n=============================================== test session starts ===============================================\nplatform linux2 -- Python 2.7.5 -- pytest-2.3.5 -- /usr/bin/python\nplugins: xdist, cov\ncollected 5 items \n\ntest_sort.py:34: test_sortingTime PASSED\ntest_sort.py:44: test_sortingPriority PASSED\ntest_sort.py:54: test_sortingTSAfterPrecip PASSED\ntest_sort.py:64: test_sortingHailAfterTSWithPrecip PASSED\ntest_sort.py:75: test_sortingHailAfterTSNoPrecip FAILED\n\n==================================================== FAILURES =====================================================\n_________________________________________ test_sortingHailAfterTSNoPrecip _________________________________________\ntest_sort.py:82: in test_sortingHailAfterTSNoPrecip\n> assert [s.type for s in sorted(input, key=sortKey)] == expected\nE assert ['H', 'TS'] == ['TS', 'H']\nE At index 0 diff: 'H' != 'TS'\n======================================= 1 failed, 4 passed in 0.02 seconds =======================================",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 109
},
{
"cell_type": "markdown",
"metadata": {},
"source": "----\nWritten with thanks to [this post](http://www.garann.com/dev/2013/how-to-blog-about-code-and-give-zero-fucks/) for a reminder to blog every now and then.\n\nReference reading:\n\n * [Official sorting HOWTO](http://docs.python.org/3/howto/sorting.html)\n * Google Developers [Python sorting](https://developers.google.com/edu/python/sorting) (good diagram for explaining use of the key function)"
}
],
"metadata": {}
}
]
}

Please sign in to comment on this gist.

Something went wrong with that request. Please try again.