@psychemedia
Last active June 20, 2022 08:05
Example of generating text from pandas dataframe using a python rules engine ( durable_rules )
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dakar Rules\n",
"\n",
"An experiment in using a rule-based system to generate rally results fact statements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's some set-up for working with my scraped Dakar results data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"STAGE = 3\n",
"\n",
"MAX = 10\n",
"\n",
"setups = {'sunderland':{'v':'moto','b':3},\n",
"          'alo':{'v':'car','b':310},\n",
"          'sainz':{'v':'car','b':305},\n",
"          'attiyah':{'v':'car', 'b':300},\n",
"          'price':{'v':'moto','b':1}\n",
"         }\n",
"\n",
"def get_setup(n):\n",
"    return setups[n]['v'],setups[n]['b']\n",
"\n",
"VTYPE, REBASER = get_setup('price')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the database handler itself..."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import sqlite3\n",
"from sqlite_utils import Database\n",
"\n",
"dbname = 'dakar_2020.db'\n",
"\n",
"conn = sqlite3.connect(dbname)\n",
"db = Database(conn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rules engine I'm going to use is `durable_rules`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#https://github.com/jruizgit/rules/blob/master/docs/py/reference.md\n",
"#%pip install durable_rules\n",
"from durable.lang import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll also be using `pandas`..."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's grab a simple set of example rankings from the database, as a `pandas` dataframe..."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Year</th>\n",
" <th>Stage</th>\n",
" <th>Type</th>\n",
" <th>Pos</th>\n",
" <th>Bib</th>\n",
" <th>VehicleType</th>\n",
" <th>Crew</th>\n",
" <th>Brand</th>\n",
" <th>Time_raw</th>\n",
" <th>TimeInS</th>\n",
" <th>Gap_raw</th>\n",
" <th>GapInS</th>\n",
" <th>Penalty_raw</th>\n",
" <th>PenaltyInS</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2020</td>\n",
" <td>3</td>\n",
" <td>general</td>\n",
" <td>1</td>\n",
" <td>9</td>\n",
" <td>moto</td>\n",
" <td>R. BRABEC MONSTER ENERGY HONDA TEAM 2020</td>\n",
" <td>HONDA</td>\n",
" <td>10:39:04</td>\n",
" <td>38344</td>\n",
" <td>0:00:00</td>\n",
" <td>0.0</td>\n",
" <td>00:00:00</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2020</td>\n",
" <td>3</td>\n",
" <td>general</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>moto</td>\n",
" <td>K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020</td>\n",
" <td>HONDA</td>\n",
" <td>10:43:47</td>\n",
" <td>38627</td>\n",
" <td>0:04:43</td>\n",
" <td>283.0</td>\n",
" <td>00:00:00</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2020</td>\n",
" <td>3</td>\n",
" <td>general</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>moto</td>\n",
" <td>M. WALKNER RED BULL KTM FACTORY TEAM</td>\n",
" <td>KTM</td>\n",
" <td>10:45:06</td>\n",
" <td>38706</td>\n",
" <td>0:06:02</td>\n",
" <td>362.0</td>\n",
" <td>00:00:00</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Year Stage Type Pos Bib VehicleType \\\n",
"0 2020 3 general 1 9 moto \n",
"1 2020 3 general 2 7 moto \n",
"2 2020 3 general 3 2 moto \n",
"\n",
" Crew Brand Time_raw TimeInS \\\n",
"0 R. BRABEC MONSTER ENERGY HONDA TEAM 2020 HONDA 10:39:04 38344 \n",
"1 K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020 HONDA 10:43:47 38627 \n",
"2 M. WALKNER RED BULL KTM FACTORY TEAM KTM 10:45:06 38706 \n",
"\n",
" Gap_raw GapInS Penalty_raw PenaltyInS \n",
"0 0:00:00 0.0 00:00:00 0.0 \n",
"1 0:04:43 283.0 00:00:00 0.0 \n",
"2 0:06:02 362.0 00:00:00 0.0 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"q=f\"SELECT * FROM ranking WHERE VehicleType='{VTYPE}' AND Type='general' AND Stage={STAGE} AND Pos<={MAX}\"\n",
"tmpq = pd.read_sql(q, conn).fillna(0)\n",
"tmpq.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `inflect` package makes it easy to generate number words from numerics... and a whole host of other things..."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"#https://github.com/jazzband/inflect\n",
"import inflect\n",
"\n",
"p = inflect.engine()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following function is a simple handler for generating nice time strings..."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"#Based on: https://stackoverflow.com/a/24542445/454773\n",
"intervals = (\n",
"    ('weeks', 604800),  # 60 * 60 * 24 * 7\n",
"    ('days', 86400),    # 60 * 60 * 24\n",
"    ('hours', 3600),    # 60 * 60\n",
"    ('minutes', 60),\n",
"    ('seconds', 1),\n",
"    )\n",
"\n",
"def display_time(t, granularity=3,\n",
"                 sep=',', andword='and',\n",
"                 units = 'seconds', intify=True):\n",
"    \"\"\"Take a time in seconds and return a sensible\n",
"    natural language interpretation of it.\"\"\"\n",
"    def nl_join(l):\n",
"        #Join all but the last item with the separator, then add the final item after andword\n",
"        if len(l)>2:\n",
"            return f\"{(sep + ' ').join(l[:-1])} {andword} {l[-1]}\"\n",
"        elif len(l)==2:\n",
"            return f' {andword} '.join(l)\n",
"        #Guard against an empty list (for example, a time of zero seconds)\n",
"        return l[0] if l else ''\n",
"\n",
"    result = []\n",
"\n",
"    if intify:\n",
"        t=int(t)\n",
"\n",
"    #Need better handle for arbitrary time strings\n",
"    #Perhaps parse into a timedelta object\n",
"    # and then generate NL string from that?\n",
"    if units=='seconds':\n",
"        for name, count in intervals:\n",
"            value = t // count\n",
"            if value:\n",
"                t -= value * count\n",
"                if value == 1:\n",
"                    name = name.rstrip('s')\n",
"                result.append(\"{} {}\".format(value, name))\n",
"\n",
"    return nl_join(result[:granularity])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I suspect there's a way of doing things \"occasionally\" via the rules engine, but at times it may be easier to have rules that create statements \"occasionally\" as part of the rule code. This adds variety to generated text.\n",
"\n",
"The following functions help with that, returning strings probabilistically."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"def sometimes(t, p=0.5):\n",
"    \"\"\"Return the string passed to the function with probability p,\n",
"    otherwise return an empty string.\"\"\"\n",
"    #random.random() is uniform on [0, 1), so this test succeeds with probability p\n",
"    if random.random() < p:\n",
"        return t\n",
"    return ''\n",
"\n",
"def occasionally(t):\n",
"    \"\"\"Occasionally (one time in five) return a string passed to the function.\"\"\"\n",
"    return sometimes(t, p=0.2)\n",
"\n",
"def rarely(t):\n",
"    \"\"\"Rarely (one time in twenty) return a string passed to the function.\"\"\"\n",
"    return sometimes(t, p=0.05)\n",
"\n",
"def pickone_equally(l, prefix='', suffix=''):\n",
"    \"\"\"Return an item from a list,\n",
"    selected at random with equal probability.\"\"\"\n",
"    t = random.choice(l)\n",
"    if t:\n",
"        return f'{prefix}{t}{suffix}'\n",
"    return suffix\n",
"\n",
"def pickfirst_prob(l, p=0.5):\n",
"    \"\"\"Select the first item in a list with the specified probability,\n",
"    else select an item, with equal probability, from the rest of the list.\"\"\"\n",
"    if len(l)>1 and random.random() >= p:\n",
"        return random.choice(l[1:])\n",
"    return l[0]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a simple test ruleset for commenting on a simple results table.\n",
"\n",
"Rather than printing out statements in each rule, as the demos show, let's instead append generated text elements to an ordered list, and then render that at the end."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"run_control": {
"marked": false
}
},
"outputs": [],
"source": [
"from durable.lang import *\n",
"\n",
"txts = []\n",
"\n",
"with ruleset('test1'):\n",
"\n",
"    #Display something about the crew in first place\n",
"    @when_all(m.Pos == 1)\n",
"    def whos_in_first(c):\n",
"        \"\"\"Generate a sentence to report on the first placed vehicle.\"\"\"\n",
"        #We can add additional state, accessible from other rules\n",
"        #In this case, record the Crew and Brand for the first placed crew\n",
"        c.s.first_crew = c.m.Crew\n",
"        c.s.first_brand = c.m.Brand\n",
"\n",
"        #Python f-strings make it easy to generate text sentences that include data elements\n",
"        txts.append(f'{c.m.Crew} were in first in their {c.m.Brand} with a time of {c.m.Time_raw}.')\n",
"\n",
"    #This just checks whether we get multiple rule fires...\n",
"    @when_all(m.Pos == 1)\n",
"    def whos_in_first2(c):\n",
"        txts.append('we got another first...')\n",
"\n",
"    #We can be a bit more creative in the other results\n",
"    @when_all(m.Pos>1)\n",
"    def whos_where(c):\n",
"        \"\"\"Generate a sentence to describe the position of each other placed vehicle.\"\"\"\n",
"\n",
"        #Use the inflect package to natural language textify position numbers...\n",
"        nth = p.number_to_words(p.ordinal(c.m.Pos))\n",
"\n",
"        #Use various probabilistic text generators to make a comment for each other result\n",
"        first_opts = [c.s.first_crew, 'the stage winner']\n",
"        if c.m.Brand==c.s.first_brand:\n",
"            first_opts.append(f'the first placed {c.m.Brand}')\n",
"        t = pickone_equally([f'with a time of {c.m.Time_raw}',\n",
"                             f'{sometimes(f\"{display_time(c.m.GapInS)} behind {pickone_equally(first_opts)}\")}'],\n",
"                            prefix=', ')\n",
"\n",
"        #And add even more variation possibilities into the returned generated sentence\n",
"        txts.append(f'{c.m.Crew} were in {nth}{sometimes(\" position\")}{sometimes(f\" representing {c.m.Brand}\")}{t}.')\n",
"    "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rules handler doesn't seem to like the `numpy` typed numerical objects that the `pandas` dataframe provides, but if we cast the dataframe values to JSON and then back to a Python `dict`, everything seems to work fine."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"#This handles numpy types that ruleset json serialiser doesn't like\n",
"tmp = json.loads(tmpq.iloc[0].to_json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we post a row as an event, the event appears to be consumed by the first rule that matches it, so only a single rule can be fired from it.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.\n"
]
}
],
"source": [
"post('test1',tmp)\n",
"print(''.join(txts))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a function that can be applied to each row of a `pandas` dataframe that will run the contents of the row through the ruleset:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def rulesbyrow(row, ruleset):\n",
"    row = json.loads(json.dumps(row.to_dict()))\n",
"    post(ruleset,row)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Capture the text results generated from the ruleset into a list, and then display the results. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.\n",
"\n",
"K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020 were in second representing HONDA, with a time of 10:43:47.\n",
"\n",
"M. WALKNER RED BULL KTM FACTORY TEAM were in third position representing KTM, with a time of 10:45:06.\n",
"\n",
"J. BARREDA BORT MONSTER ENERGY HONDA TEAM 2020 were in fourth representing HONDA, with a time of 10:50:06.\n",
"\n",
"JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth position representing HONDA, with a time of 10:50:23.\n",
"\n",
"T. PRICE RED BULL KTM FACTORY TEAM were in sixth representing KTM.\n",
"\n",
"L. BENAVIDES RED BULL KTM FACTORY TEAM were in seventh, 14 minutes and 20 seconds behind the stage winner.\n",
"\n",
"P. QUINTANILLA ROCKSTAR ENERGY HUSQVARNA FACTORY RACING were in eighth position representing HUSQVARNA.\n",
"\n",
"S. SUNDERLAND RED BULL KTM FACTORY TEAM were in ninth representing KTM, with a time of 10:56:14.\n",
"\n",
"X. DE SOULTRAIT MONSTER ENERGY YAMAHA RALLY TEAM were in tenth representing YAMAHA, 19 minutes and 55 seconds behind R. BRABEC MONSTER ENERGY HONDA TEAM 2020.\n"
]
}
],
"source": [
"txts=[]\n",
"tmpq.apply(rulesbyrow, ruleset='test1', axis=1)\n",
"\n",
"print('\\n\\n'.join(txts))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can evaluate a whole set of events, passed as a list, using the `post_batch(RULESET,EVENTS)` function. It's easy enough to convert a `pandas` dataframe into a list of palatable `dict`s..."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def df_json(df):\n",
"    \"\"\"Convert rows in a pandas dataframe to a JSON string.\n",
"    Cast the JSON string back to a list of dicts\n",
"    that are palatable to the rules engine.\n",
"    \"\"\"\n",
"    return json.loads(df.to_json(orient='records'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, the `post_batch()` route doesn't look like it necessarily commits the rows to the ruleset in the provided row order? (Has the `dict` lost its ordering?)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.\n",
"\n",
"X. DE SOULTRAIT MONSTER ENERGY YAMAHA RALLY TEAM were in tenth position, with a time of 10:58:59.\n",
"\n",
"S. SUNDERLAND RED BULL KTM FACTORY TEAM were in ninth, with a time of 10:56:14.\n",
"\n",
"P. QUINTANILLA ROCKSTAR ENERGY HUSQVARNA FACTORY RACING were in eighth position representing HUSQVARNA, 15 minutes and 40 seconds behind R. BRABEC MONSTER ENERGY HONDA TEAM 2020.\n",
"\n",
"L. BENAVIDES RED BULL KTM FACTORY TEAM were in seventh, with a time of 10:53:24.\n",
"\n",
"T. PRICE RED BULL KTM FACTORY TEAM were in sixth position representing KTM, with a time of 10:51:02.\n",
"\n",
"JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth position, with a time of 10:50:23.\n",
"\n",
"J. BARREDA BORT MONSTER ENERGY HONDA TEAM 2020 were in fourth representing HONDA.\n",
"\n",
"M. WALKNER RED BULL KTM FACTORY TEAM were in third, with a time of 10:45:06.\n",
"\n",
"K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020 were in second, with a time of 10:43:47.\n"
]
}
],
"source": [
"txts=[]\n",
"\n",
"post_batch('test1', df_json(tmpq))\n",
"print('\\n\\n'.join(txts))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also assert the rows as `facts` rather than running them through the ruleset as `events`."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def factsbyrow(row, ruleset):\n",
"    row = json.loads(json.dumps(row.to_dict()))\n",
"    assert_fact(ruleset,row)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fact is retained even if it matches a rule, so it gets a chance to match other rules too..."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R. BRABEC MONSTER ENERGY HONDA TEAM 2020 were in first in their HONDA with a time of 10:39:04.\n",
"\n",
"we got another first...\n",
"\n",
"K. BENAVIDES MONSTER ENERGY HONDA TEAM 2020 were in second, with a time of 10:43:47.\n",
"\n",
"M. WALKNER RED BULL KTM FACTORY TEAM were in third representing KTM.\n",
"\n",
"J. BARREDA BORT MONSTER ENERGY HONDA TEAM 2020 were in fourth representing HONDA, with a time of 10:50:06.\n",
"\n",
"JI. CORNEJO FLORIMO MONSTER ENERGY HONDA TEAM 2020 were in fifth position, 11 minutes and 19 seconds behind the first placed HONDA.\n",
"\n",
"T. PRICE RED BULL KTM FACTORY TEAM were in sixth representing KTM, with a time of 10:51:02.\n",
"\n",
"L. BENAVIDES RED BULL KTM FACTORY TEAM were in seventh position, 14 minutes and 20 seconds behind the stage winner.\n",
"\n",
"P. QUINTANILLA ROCKSTAR ENERGY HUSQVARNA FACTORY RACING were in eighth position, 15 minutes and 40 seconds behind R. BRABEC MONSTER ENERGY HONDA TEAM 2020.\n",
"\n",
"S. SUNDERLAND RED BULL KTM FACTORY TEAM were in ninth position, 17 minutes and 10 seconds behind R. BRABEC MONSTER ENERGY HONDA TEAM 2020.\n",
"\n",
"X. DE SOULTRAIT MONSTER ENERGY YAMAHA RALLY TEAM were in tenth position representing YAMAHA, 19 minutes and 55 seconds behind R. BRABEC MONSTER ENERGY HONDA TEAM 2020.\n"
]
}
],
"source": [
"txts=[]\n",
"tmpq.apply(factsbyrow, ruleset='test1', axis=1);\n",
"print('\\n\\n'.join(txts))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, if we apply the same facts multiple times, I think we get an error and bork the ruleset..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
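
A comment in the `display_time` cell of the notebook wonders whether arbitrary time strings might be better handled by parsing them into a `timedelta` and generating the natural language string from that. Here's a minimal sketch of that idea, using only the standard library. The function names `parse_hms` and `nl_time` are made up for illustration and are not part of the notebook, and the parser assumes strings always have the `H:MM:SS` shape seen in the `Time_raw` and `Gap_raw` columns.

```python
from datetime import timedelta

def parse_hms(s):
    """Parse an 'H:MM:SS' string such as '0:04:43' into a timedelta.

    Hypothetical helper: assumes exactly three colon-separated integer parts.
    """
    hours, minutes, seconds = (int(part) for part in s.split(':'))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def nl_time(td, andword='and'):
    """Render a timedelta as a natural language phrase (hypothetical helper)."""
    total = int(td.total_seconds())
    parts = []
    # Peel off hours, then minutes, then seconds, skipping zero-valued units
    for name, size in (('hour', 3600), ('minute', 60), ('second', 1)):
        value, total = divmod(total, size)
        if value:
            parts.append(f"{value} {name}{'' if value == 1 else 's'}")
    if not parts:
        return '0 seconds'
    if len(parts) == 1:
        return parts[0]
    # Comma-separate everything except the last part, which follows andword
    return f"{', '.join(parts[:-1])} {andword} {parts[-1]}"

print(nl_time(parse_hms('0:04:43')))   # 4 minutes and 43 seconds
print(nl_time(parse_hms('10:39:04')))  # 10 hours, 39 minutes and 4 seconds
```

Working from the raw string rather than the precomputed seconds column would let a rule phrase either `Time_raw` or `Gap_raw` in the same way, at the cost of a stricter assumption about the input format.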