@psychemedia
Created August 6, 2019 13:35
Example of parsing quantities from sentences
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Simple Tools from Extracting Quantities from Strings\n\nSuppose we have a report and we want to find the sentences that are talking about numerical things....\n\n*Originally inspired by [When you get data in sentences: how to use a spreadsheet to extract numbers from phrases](https://onlinejournalismblog.com/2019/07/29/when-you-get-data-in-sentences-how-to-use-a-spreadsheet-to-extract-numbers-from-phrases/), Paul Bradshaw, Online Journalism blog, form which some of the example sentences (sic!) are taken.*"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "sentences = [\n '4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months',\n 'No quantities here',\n 'I measured it as 2 meters and 30 centimeters.',\n \"four years and six months' imprisonment with a licence extension of 2 years and 6 months\",\n 'it cost £250... bargain...',\n 'it weighs four hundred kilograms.',\n 'It weighs 400kg.',\n 'three million, two hundred & forty, you say?',\n 'it weighs four hundred and twenty kilograms.'\n \n]",
"execution_count": 152,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## `quantulum3`\n\n[`quantulum3`](https://github.com/nielstron/quantulum3) is a Python package *\"for information extraction of quantities from unstructured text\"*."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "#!pip3 install quantulum3\nfrom quantulum3 import parser",
"execution_count": 153,
"outputs": []
},
{
"metadata": {
"trusted": true,
"scrolled": false
},
"cell_type": "code",
"source": "for sent in sentences:\n print(sent)\n p = parser.parse(sent)\n if p:\n print('\\tSpoken:',parser.inline_parse_and_expand(sent))\n print('\\tNumeric elements:')\n for q in p:\n display(q)\n print('\\t\\t{} :: {}'.format(q.surface, q))\n print('\\n---------\\n')",
"execution_count": 154,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months\n\tSpoken: four years and six months’ imprisonment with a licence extension of two years and six months\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(4, \"Unit(name=\"year\", entity=Entity(\"time\"), uri=Year)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t4 years :: four years\n"
},
{
"data": {
"text/plain": "Quantity(6, \"Unit(name=\"month\", entity=Entity(\"time\"), uri=Month)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t6 months :: six months\n"
},
{
"data": {
"text/plain": "Quantity(2, \"Unit(name=\"year\", entity=Entity(\"time\"), uri=Year)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t2 years :: two years\n"
},
{
"data": {
"text/plain": "Quantity(6, \"Unit(name=\"month\", entity=Entity(\"time\"), uri=Month)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t6 months :: six months\n\n---------\n\nNo quantities here\n\n---------\n\nI measured it as 2 meters and 30 centimeters.\n\tSpoken: I measured it as two metres and thirty centimetres.\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(2, \"Unit(name=\"metre\", entity=Entity(\"length\"), uri=Metre)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t2 meters :: two metres\n"
},
{
"data": {
"text/plain": "Quantity(30, \"Unit(name=\"centimetre\", entity=Entity(\"length\"), uri=Centimetre)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t30 centimeters :: thirty centimetres\n\n---------\n\nfour years and six months' imprisonment with a licence extension of 2 years and 6 months\n\tSpoken: four years and six months imprisonment with a licence extension of two years and six months\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(4, \"Unit(name=\"year\", entity=Entity(\"time\"), uri=Year)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\tfour years :: four years\n"
},
{
"data": {
"text/plain": "Quantity(6, \"Unit(name=\"month\", entity=Entity(\"time\"), uri=Month)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\tsix months' :: six months\n"
},
{
"data": {
"text/plain": "Quantity(2, \"Unit(name=\"year\", entity=Entity(\"time\"), uri=Year)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t2 years :: two years\n"
},
{
"data": {
"text/plain": "Quantity(6, \"Unit(name=\"month\", entity=Entity(\"time\"), uri=Month)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t6 months :: six months\n\n---------\n\nit cost £250... bargain...\n\tSpoken: it cost two hundred and fifty pounds sterling, zero pence... bargain...\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(250, \"Unit(name=\"pound sterling\", entity=Entity(\"currency\"), uri=Pound_sterling)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t£250 :: two hundred and fifty pounds sterling, zero pence\n\n---------\n\nit weighs four hundred kilograms.\n\tSpoken: it weighs four hundred kilograms.\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(400, \"Unit(name=\"kilogram\", entity=Entity(\"mass\"), uri=Kilogram)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\tfour hundred kilograms :: four hundred kilograms\n\n---------\n\nIt weighs 400kg.\n\tSpoken: It weighs four hundred kilograms.\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(400, \"Unit(name=\"kilogram\", entity=Entity(\"mass\"), uri=Kilogram)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t400kg :: four hundred kilograms\n\n---------\n\nthree million, two hundred & forty, you say?\n\tSpoken: three million, two hundred & forty, you say?\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(3e+06, \"Unit(name=\"dimensionless\", entity=Entity(\"dimensionless\"), uri=Dimensionless_quantity)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\tthree million :: three million\n"
},
{
"data": {
"text/plain": "Quantity(200, \"Unit(name=\"dimensionless\", entity=Entity(\"dimensionless\"), uri=Dimensionless_quantity)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\ttwo hundred :: two hundred\n"
},
{
"data": {
"text/plain": "Quantity(40, \"Unit(name=\"dimensionless\", entity=Entity(\"dimensionless\"), uri=Dimensionless_quantity)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\tforty :: forty\n\n---------\n\nit weighs four hundred and twenty kilograms.\n\tSpoken: it weighs four hundred and twenty kilograms.\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(420, \"Unit(name=\"kilogram\", entity=Entity(\"mass\"), uri=Kilogram)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\tfour hundred and twenty kilograms :: four hundred and twenty kilograms\n\n---------\n\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Finding quantity statements in large texts\n\nIf we have a large blog of text, we might want to quickly skim it for quantity containing sentences, we can do something like the following..."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import spacy\nnlp = spacy.load('en_core_web_lg', disable = ['ner'])",
"execution_count": 155,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "text = '''\nOnce upon a time, there was a thing. The thing weighed forty kilogrammes and cost £250. \nIt was blue. It took forty five minutes to get it home. \nWhat a day that was. I didn't get back until 2.15pm. Then I had some cake for tea.\n'''",
"execution_count": 171,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "doc = nlp(text)\nfor sent in doc.sents:\n print(sent)",
"execution_count": 172,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "\nOnce upon a time, there was a thing.\nThe thing weighed forty kilogrammes and cost £250. \n\nIt was blue.\nIt took forty five minutes to get it home. \n\nWhat a day that was.\nI didn't get back until 2.15pm.\nThen I had some cake for tea.\n\n"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "for sent in doc.sents:\n sent = sent.text\n p = parser.parse(sent)\n if p:\n print('\\tSpoken:',parser.inline_parse_and_expand(sent))\n print('\\tNumeric elements:')\n for q in p:\n display(q)\n print('\\t\\t{} :: {}'.format(q.surface, q))\n print('\\n---------\\n')",
"execution_count": 173,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "\tSpoken: \nOnce upon one instance, there was a thing.\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(1, \"Unit(name=\"count\", entity=Entity(\"dimensionless\"), uri=Count_data)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\ta time :: one instance\n\n---------\n\n\tSpoken: The thing weighed forty kilograms and cost two hundred and fifty pounds sterling, zero pence. \n\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(40, \"Unit(name=\"kilogram\", entity=Entity(\"mass\"), uri=Kilogram)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\tforty kilogrammes :: forty kilograms\n"
},
{
"data": {
"text/plain": "Quantity(250, \"Unit(name=\"pound sterling\", entity=Entity(\"currency\"), uri=Pound_sterling)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t£250 :: two hundred and fifty pounds sterling, zero pence\n\n---------\n\n\n---------\n\n\tSpoken: It took forty-five minutes to get it home. \n\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(45, \"Unit(name=\"minute of arc\", entity=Entity(\"angle\"), uri=Minute_and_second_of_arc)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\tforty five minutes :: forty-five minutes\n\n---------\n\n\tSpoken: What one day that was.\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(1, \"Unit(name=\"day\", entity=Entity(\"time\"), uri=Day)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\ta day :: one day\n\n---------\n\n\tSpoken: I didn't get back until two point one five picometres.\n\tNumeric elements:\n"
},
{
"data": {
"text/plain": "Quantity(2.15, \"Unit(name=\"picometre\", entity=Entity(\"length\"), uri=Picometre)\")"
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": "\t\t2.15pm :: two point one five picometres\n\n---------\n\n\n---------\n\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Annotating a dataset\n\nCan we extract numbers from sentences in a CSV file? Yes we can..."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "url = 'https://raw.githubusercontent.com/BBC-Data-Unit/unduly-lenient-sentences/master/ULS+for+Sankey.csv'",
"execution_count": 174,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd\n\ndf = pd.read_csv(url)\ndf.head()",
"execution_count": 175,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Year</th>\n <th>Offence category REFINED</th>\n <th>Original sentence (refined)</th>\n <th>Crown Court</th>\n <th>Outcome of Decision</th>\n <th>Revised?</th>\n <th>People</th>\n <th>Top 7</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2015</td>\n <td>Drug offence</td>\n <td>3 years imprisonment</td>\n <td>Bristol</td>\n <td>Not referred</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2015</td>\n <td>Death or serious injury - unlawful driving</td>\n <td>6 years imprisonment - Disqualified driving - ...</td>\n <td>Portsmouth</td>\n <td>Not referred</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2015</td>\n <td>Sexual offence</td>\n <td>9 months imprisonment suspended for 2 years</td>\n <td>Nottingham</td>\n <td>Out of time</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2015</td>\n <td>Theft offence</td>\n <td>4 years and 10 months imprisonment - consecuti...</td>\n <td>St Albans</td>\n <td>Not referred</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2015</td>\n <td>Theft offence</td>\n <td>unknown</td>\n <td>unknown</td>\n <td>Not in scheme</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Year Offence category REFINED \\\n0 2015 Drug offence \n1 2015 Death or serious injury - unlawful driving \n2 2015 Sexual offence \n3 2015 Theft offence \n4 2015 Theft offence \n\n Original sentence (refined) Crown Court \\\n0 3 years imprisonment Bristol \n1 6 years imprisonment - Disqualified driving - ... Portsmouth \n2 9 months imprisonment suspended for 2 years Nottingham \n3 4 years and 10 months imprisonment - consecuti... St Albans \n4 unknown unknown \n\n Outcome of Decision Revised? People Top 7 \n0 Not referred No 1 Y \n1 Not referred No 1 Y \n2 Out of time No 1 Y \n3 Not referred No 1 Y \n4 Not in scheme No 1 Y "
},
"execution_count": 175,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "#get a row\ndf.iloc[1]",
"execution_count": 178,
"outputs": [
{
"data": {
"text/plain": "Year 2015\nOffence category REFINED Death or serious injury - unlawful driving\nOriginal sentence (refined) 6 years imprisonment - Disqualified driving - ...\nCrown Court Portsmouth\nOutcome of Decision Not referred\nRevised? No\nPeople 1\nTop 7 Y\nName: 1, dtype: object"
},
"execution_count": 178,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "#and a, erm. sentence...\ndf.iloc[1]['Original sentence (refined)']",
"execution_count": 179,
"outputs": [
{
"data": {
"text/plain": "'6 years imprisonment - Disqualified driving - 8 years'"
},
"execution_count": 179,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "parser.parse(df.iloc[1]['Original sentence (refined)'])",
"execution_count": 180,
"outputs": [
{
"data": {
"text/plain": "[Quantity(6, \"Unit(name=\"year\", entity=Entity(\"time\"), uri=Year)\"),\n Quantity(8, \"Unit(name=\"year\", entity=Entity(\"time\"), uri=Year)\")]"
},
"execution_count": 180,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "def amountify(txt):\n try:\n if txt:\n p = parser.parse(txt)\n x=[]\n for q in p:\n x.append( '{} {}'.format(q.value, q.unit.name))\n return '::'.join(x)\n return ''\n except:\n return",
"execution_count": 206,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df['amounts'] = df['Original sentence (refined)'].apply(amountify)",
"execution_count": 207,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "df.head()",
"execution_count": 208,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Year</th>\n <th>Offence category REFINED</th>\n <th>Original sentence (refined)</th>\n <th>Crown Court</th>\n <th>Outcome of Decision</th>\n <th>Revised?</th>\n <th>People</th>\n <th>Top 7</th>\n <th>amounts</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>2015</td>\n <td>Drug offence</td>\n <td>3 years imprisonment</td>\n <td>Bristol</td>\n <td>Not referred</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n <td>3.0 year</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2015</td>\n <td>Death or serious injury - unlawful driving</td>\n <td>6 years imprisonment - Disqualified driving - ...</td>\n <td>Portsmouth</td>\n <td>Not referred</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n <td>6.0 year::8.0 year</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2015</td>\n <td>Sexual offence</td>\n <td>9 months imprisonment suspended for 2 years</td>\n <td>Nottingham</td>\n <td>Out of time</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n <td>9.0 month::2.0 year</td>\n </tr>\n <tr>\n <th>3</th>\n <td>2015</td>\n <td>Theft offence</td>\n <td>4 years and 10 months imprisonment - consecuti...</td>\n <td>St Albans</td>\n <td>Not referred</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n <td>4.0 year::10.0 month</td>\n </tr>\n <tr>\n <th>4</th>\n <td>2015</td>\n <td>Theft offence</td>\n <td>unknown</td>\n <td>unknown</td>\n <td>Not in scheme</td>\n <td>No</td>\n <td>1</td>\n <td>Y</td>\n <td></td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " Year Offence category REFINED \\\n0 2015 Drug offence \n1 2015 Death or serious injury - unlawful driving \n2 2015 Sexual offence \n3 2015 Theft offence \n4 2015 Theft offence \n\n Original sentence (refined) Crown Court \\\n0 3 years imprisonment Bristol \n1 6 years imprisonment - Disqualified driving - ... Portsmouth \n2 9 months imprisonment suspended for 2 years Nottingham \n3 4 years and 10 months imprisonment - consecuti... St Albans \n4 unknown unknown \n\n Outcome of Decision Revised? People Top 7 amounts \n0 Not referred No 1 Y 3.0 year \n1 Not referred No 1 Y 6.0 year::8.0 year \n2 Out of time No 1 Y 9.0 month::2.0 year \n3 Not referred No 1 Y 4.0 year::10.0 month \n4 Not in scheme No 1 Y "
},
"execution_count": 208,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We could then do something to split mutliple amounts into mutliple rows or columns..."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.3",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "Example of parsing quantities from sentences",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
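The notebook's final markdown cell suggests splitting multiple amounts into multiple rows or columns, but the code cell that follows it is empty. Below is a minimal sketch of one way this might be done, assuming the `'value unit'` items joined with `::` that `amountify` produces. The names `split_amounts`, `rows` and `long_rows` are hypothetical and not part of the original notebook, and plain tuples stand in for the dataframe rows so that the example runs without pandas or `quantulum3`.

```python
from typing import List, Tuple


def split_amounts(amounts, sep: str = "::") -> List[Tuple[float, str]]:
    """Split a string such as '6.0 year::8.0 year' into (value, unit) pairs."""
    pairs: List[Tuple[float, str]] = []
    if not isinstance(amounts, str) or not amounts.strip():
        return pairs  # empty string, None or NaN: nothing to split
    for item in amounts.split(sep):
        # Each item looks like '<value> <unit name>'; unit names may contain spaces
        value, _, unit = item.strip().partition(" ")
        try:
            pairs.append((float(value), unit))
        except ValueError:
            continue  # skip anything that does not start with a number
    return pairs


# (court, amounts) pairs taken from the first rows of the annotated dataframe
rows = [
    ("Portsmouth", "6.0 year::8.0 year"),
    ("Nottingham", "9.0 month::2.0 year"),
    ("unknown", ""),
]

# One output row per extracted amount ("long" format)
long_rows = [
    (court, value, unit)
    for court, amounts in rows
    for value, unit in split_amounts(amounts)
]

for row in long_rows:
    print(row)
```

With pandas 0.25 or later, a similar result could probably be reached directly on the dataframe by splitting the `amounts` column with `df['amounts'].str.split('::')` and then calling `df.explode('amounts')`, which yields one row per amount.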