Examples of tagging and fuzzy matching items in text documents using Python
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Entity Recognition and Extraction Recipes\n",
"\n",
"A collection of code fragments for performing:\n",
"\n",
"- simple entity extraction from a text;\n",
"- partial and fuzzy string matching of specified entities in a text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple named entity recognition\n",
"\n",
"[`spaCy`](https://spacy.io/) is a natural language processing library for Python that includes a basic model capable of recognising (ish!) the names of people, places and organisations, as well as dates and financial amounts.\n",
"\n",
"According to the [`spaCy` entity recognition](https://spacy.io/docs/usage/entity-recognition) documentation, the built-in model recognises the following types of entity:\n",
"\n",
"- `PERSON`\tPeople, including fictional.\n",
"- `NORP`\tNationalities or religious or political groups.\n",
"- `FACILITY`\tBuildings, airports, highways, bridges, etc.\n",
"- `ORG`\tCompanies, agencies, institutions, etc.\n",
"- `GPE`\tCountries, cities, states. (That is, *Geo-Political Entities*)\n",
"- `LOC`\tNon-GPE locations, mountain ranges, bodies of water.\n",
"- `PRODUCT`\tObjects, vehicles, foods, etc. (Not services.)\n",
"- `EVENT`\tNamed hurricanes, battles, wars, sports events, etc.\n",
"- `WORK_OF_ART`\tTitles of books, songs, etc.\n",
"- `LANGUAGE`\tAny named language.\n",
"- `LAW`\tNamed documents made into laws.\n",
"\n",
"Quantities are also recognised:\n",
"\n",
"- `DATE`\tAbsolute or relative dates or periods.\n",
"- `TIME`\tTimes smaller than a day.\n",
"- `PERCENT`\tPercentage, including \"%\".\n",
"- `MONEY`\tMonetary values, including unit.\n",
"- `QUANTITY`\tMeasurements, as of weight or distance.\n",
"- `ORDINAL`\t\"first\", \"second\", etc.\n",
"- `CARDINAL`\tNumerals that do not fall under another type.\n",
"\n",
"Custom models can also be trained, but this requires annotated training documents; a minimal sketch of the training pattern is shown below."
]
},
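{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of what custom training looks like, using the spaCy 2.x training API rather than the 1.x API used elsewhere in this notebook; the `TRAIN_DATA` example and the `ORG` label are illustrative placeholders, not a real training set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#A minimal sketch of custom NER training with the spaCy 2.x API\n",
"#TRAIN_DATA and the ORG label are illustrative placeholders\n",
"import random\n",
"import spacy\n",
"\n",
"TRAIN_DATA = [(\"Nestlé announced 300 redundancies\", {\"entities\": [(0, 6, \"ORG\")]})]\n",
"\n",
"nlp = spacy.blank(\"en\")\n",
"ner = nlp.create_pipe(\"ner\")\n",
"nlp.add_pipe(ner)\n",
"ner.add_label(\"ORG\")\n",
"\n",
"optimizer = nlp.begin_training()\n",
"for i in range(20):\n",
"    random.shuffle(TRAIN_DATA)\n",
"    for text, annotations in TRAIN_DATA:\n",
"        nlp.update([text], [annotations], sgd=optimizer)"
]
},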
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip3 install spacy"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"#spaCy 1.x API; in spaCy 2.0+, use `import spacy; parser = spacy.load('en')` instead\n",
"from spacy.en import English\n",
"parser = English()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"example='''\n",
"That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in \n",
"York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; \n",
"acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase \n",
"over the same period in 2016; further notes 156 of these job losses will be in York, a city that in \n",
"the last six months has seen 2,000 job losses announced and has become the most inequitable city outside \n",
"of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of\n",
"triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with \n",
"the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and \n",
"production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future \n",
"relationship with the single market and customs union; and calls on the Government to intervene and work\n",
"with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent \n",
"further job losses across Nestlé.\n",
"'''"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#Code \"borrowed\" from somewhere?!\n",
"def entities(example, show=False):\n",
"    if show: print(example)\n",
"    parsedEx = parser(example)\n",
"\n",
"    print(\"-------------- entities only ---------------\")\n",
"    # if you just want the entities and nothing else, you can access the parsed example's \"ents\" property like this:\n",
"    ents = list(parsedEx.ents)\n",
"    tags={}\n",
"    for entity in ents:\n",
"        #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))\n",
"        term=' '.join(t.orth_ for t in entity)\n",
"        if term not in tags:\n",
"            tags[term]=[(entity.label, entity.label_)]\n",
"        else:\n",
"            tags[term].append((entity.label, entity.label_))\n",
"    print(tags)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------- entities only ---------------\n",
"{'House': [(380, 'ORG')], '300': [(393, 'CARDINAL')], 'Nestlé': [(380, 'ORG')], '\\n York , Fawdon': [(381, 'GPE')], 'Halifax': [(381, 'GPE')], 'Girvan': [(381, 'GPE')], 'the Blue Riband': [(380, 'ORG')], 'Poland': [(381, 'GPE')], '\\n': [(381, 'GPE'), (381, 'GPE')], 'the first three months of 2017': [(387, 'DATE')], '£ 21 billion': [(390, 'MONEY')], '0.4 per': [(390, 'MONEY')], 'the same period in 2016': [(387, 'DATE')], '156': [(393, 'CARDINAL')], 'York': [(381, 'GPE')], '\\n the': [(381, 'GPE')], 'six': [(393, 'CARDINAL')], '2,000': [(393, 'CARDINAL')], 'the South East': [(382, 'LOC')], '110': [(393, 'CARDINAL')], 'Fawdon': [(381, 'GPE')], 'Newcastle': [(380, 'ORG')], 'a month of': [(387, 'DATE')], 'Article 50': [(21153, 'LAW')], 'EU': [(380, 'ORG')], 'UK': [(381, 'GPE')], 'GMB': [(380, 'ORG')], 'Unite': [(381, 'GPE')]}\n"
]
}
],
"source": [
"entities(example)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------- entities only ---------------\n",
"{'Bob Smith': [(377, 'PERSON')]}\n"
]
}
],
"source": [
"q= \"Bob Smith was in the Houses of Parliament the other day\"\n",
"entities(q)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the way the models are trained typically relies on cues from the correct capitalisation of named entities."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------- entities only ---------------\n",
"{}\n"
]
}
],
"source": [
"entities(q.lower())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## polyglot\n",
"\n",
"A simplistic, and quite slow, tagger, supporting limited recognition of *Locations* (`I-LOC`), *Organizations* (`I-ORG`) and *Persons* (`I-PER`).\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#!pip3 install polyglot\n",
"\n",
"## On a Mac:\n",
"#!brew install icu4c\n",
"#I also found I needed: pip3 install pyicu pycld2 morfessor\n",
"## On Linux:\n",
"#apt-get install libicu-dev"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[polyglot_data] Downloading package embeddings2.en to\n",
"[polyglot_data] /Users/ajh59/polyglot_data...\n",
"[polyglot_data] Downloading package ner2.en to\n",
"[polyglot_data] /Users/ajh59/polyglot_data...\n"
]
}
],
"source": [
"!polyglot download embeddings2.en ner2.en"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[I-LOC(['York']),\n",
" I-LOC(['Fawdon']),\n",
" I-LOC(['Halifax']),\n",
" I-LOC(['Girvan']),\n",
" I-LOC(['Poland']),\n",
" I-PER(['Nestlé']),\n",
" I-LOC(['York']),\n",
" I-LOC(['Fawdon']),\n",
" I-LOC(['Newcastle']),\n",
" I-ORG(['EU']),\n",
" I-ORG(['EU']),\n",
" I-ORG(['Government']),\n",
" I-ORG(['GMB']),\n",
" I-LOC(['Nestlé'])]"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from polyglot.text import Text\n",
"\n",
"text = Text(example)\n",
"text.entities"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[I-PER(['Bob', 'Smith'])]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Text(q).entities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Partial Matching Specific Entities\n",
"\n",
"Sometimes we may have a list of entities that we wish to match in a text. For example, suppose we have a list of MPs' names, or a list of organisations or subject terms identified in a thesaurus, and we want to tag a set of documents with those entities if the entity exists in the document.\n",
"\n",
"To do this, we can search a text for strings that exactly match any of the specified terms, or where any of the specified terms match part of a longer string in the text.\n",
"\n",
"Naive implementations can take a significant time to find multiple strings within a text, but the *Aho-Corasick* algorithm will efficiently match a large set of key terms within a particular text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## The following recipe was hinted at via @pudo\n",
"\n",
"#!pip3 install pyahocorasick\n",
"#https://github.com/alephdata/aleph/blob/master/aleph/analyze/corasick_entity.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, construct an automaton that identifies the terms you want to detect in the target text."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"from ahocorasick import Automaton\n",
"\n",
"A=Automaton()\n",
"A.add_word(\"Europe\",('VOCAB','Europe'))\n",
"A.add_word(\"European Union\",('VOCAB','European Union'))\n",
"A.add_word(\"Boris Johnson\",('PERSON','Boris Johnson'))\n",
"A.add_word(\"Boris\",('PERSON','Boris Johnson'))\n",
"A.add_word(\"boris johnson\",('PERSON','Boris Johnson (LC)'))\n",
"\n",
"A.make_automaton()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(4, ('PERSON', 'Boris Johnson')) Boris\n",
"(12, ('PERSON', 'Boris Johnson')) Boris Johnson\n",
"(31, ('VOCAB', 'Europe')) Boris Johnson went off to Europe\n",
"(60, ('VOCAB', 'Europe')) Boris Johnson went off to Europe to complain about the Europe\n",
"(68, ('VOCAB', 'European Union')) Boris Johnson went off to Europe to complain about the European Union\n"
]
}
],
"source": [
"q2='Boris Johnson went off to Europe to complain about the European Union'\n",
"for item in A.iter(q2):\n",
"    print(item, q2[:item[0]+1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once again, case is important."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(12, ('PERSON', 'Boris Johnson (LC)')) boris johnson\n"
]
}
],
"source": [
"q2l = q2.lower()\n",
"for item in A.iter(q2l):\n",
"    print(item, q2l[:item[0]+1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can tweak the automaton payloads to capture the length of each match term, so that we can annotate the text with the location of each match more exactly:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"A=Automaton()\n",
"A.add_word(\"Europe\",(('VOCAB', len(\"Europe\")),'Europe'))\n",
"A.add_word(\"European Union\",(('VOCAB', len(\"European Union\")),'European Union'))\n",
"A.add_word(\"Boris Johnson\",(('PERSON', len(\"Boris Johnson\")),'Boris Johnson'))\n",
"A.add_word(\"Boris\",(('PERSON', len(\"Boris\")),'Boris Johnson'))\n",
"\n",
"A.make_automaton()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(4, (('PERSON', 5), 'Boris Johnson')) *Boris* Jo\n",
"(12, (('PERSON', 13), 'Boris Johnson')) *Boris Johnson* we\n",
"(31, (('VOCAB', 6), 'Europe')) to *Europe* to\n",
"(60, (('VOCAB', 6), 'Europe')) he *Europe*an \n",
"(68, (('VOCAB', 14), 'European Union')) he *European Union*\n"
]
}
],
"source": [
"for item in A.iter(q2):\n",
"    start=item[0]-item[1][0][1]+1\n",
"    end=item[0]+1\n",
"    print(item, '{}*{}*{}'.format(q2[start-3:start],q2[start:end],q2[end:end+3]))"
]
},
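{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the automaton returns overlapping hits: both *Boris* and *Boris Johnson* match at the start of the text. If we only want the longest match at each position, one possible policy is to sort the hits by start position and greedily keep the longest; the `longest_matches` helper below is a hypothetical sketch, not part of `pyahocorasick`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Hypothetical helper: keep only the longest non-overlapping match at each position\n",
"def longest_matches(text, automaton):\n",
"    #Unpack the ((label, length), canonical) payloads added above into (start, end) spans\n",
"    spans = []\n",
"    for end_index, ((label, length), canonical) in automaton.iter(text):\n",
"        end = end_index + 1\n",
"        spans.append((end - length, end, label, canonical))\n",
"    #Sort by start position, preferring longer spans, then keep non-overlapping spans greedily\n",
"    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))\n",
"    kept, last_end = [], 0\n",
"    for start, end, label, canonical in spans:\n",
"        if start >= last_end:\n",
"            kept.append((text[start:end], label, canonical))\n",
"            last_end = end\n",
"    return kept\n",
"\n",
"longest_matches(q2, A)"
]
},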
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fuzzy Matching\n",
"\n",
"Whilst the *Aho-Corasick* approach will return hits for strings in the text that partially match the exact key terms, sometimes we want to know whether there are terms in a text that *almost* match terms in a specific set of terms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Imagine a situation where we have managed to extract arbitrary named entities from a text, but they do not match strings in a specified list in an exact or partially exact way. Our next step might be to attempt to further match those entities in a *fuzzy* way with entities in a specified list."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `fuzzyset`\n",
"The Python [`fuzzyset`](https://github.com/axiak/fuzzyset) package will try to match a specified string to similar strings in a list of target strings, returning a single item from the specified target list that best matches the provided term.\n",
"\n",
"For example, if we extract the name *Boris Johnstone* from a text, we might then try to further match that string, in a fuzzy way, with a list of correctly spelled MP names.\n",
"\n",
"A confidence value expresses the degree of match to terms in the fuzzy match set list."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"([(0.8666666666666667, 'Boris Johnson')],\n",
" [(0.8333333333333334, 'Diane Abbott')],\n",
" [(0.23076923076923073, 'Diane Abbott')])"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import fuzzyset\n",
"\n",
"fz = fuzzyset.FuzzySet()\n",
"#Create a list of terms we would like to match against in a fuzzy way\n",
"for l in [\"Diane Abbott\", \"Boris Johnson\"]:\n",
"    fz.add(l)\n",
"\n",
"#Now see if our sample term fuzzy matches any of those specified terms\n",
"sample_term='Boris Johnstone'\n",
"fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `fuzzywuzzy`\n",
"If we want to try to find a fuzzy match for a term *within* a text, we can use the Python [`fuzzywuzzy`](https://github.com/seatgeek/fuzzywuzzy) library. Once again, we specify a list of target items we want to try to match against."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from fuzzywuzzy import process\n",
"from fuzzywuzzy import fuzz"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Houses of Parliament', 90), ('Diane Abbott', 90), ('Boris Johnson', 86)]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"terms=['Houses of Parliament', 'Diane Abbott', 'Boris Johnson']\n",
"\n",
"q= \"Diane Abbott, Theresa May and Boris Johnstone were in the Houses of Parliament the other day\"\n",
"process.extract(q,terms)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, we get match confidence levels for each term in the target match set, although we can limit the response to a maximum number of matches:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Houses of Parliament', 90), ('Boris Johnson', 85)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process.extract(q,terms,scorer=fuzz.partial_ratio, limit=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A range of fuzzy match scoring algorithms are supported:\n",
"\n",
"- `WRatio` - measure of the sequences' similarity between 0 and 100, using different algorithms\n",
"- `QRatio` - quick ratio comparison between two strings\n",
"- `UWRatio` - same as `WRatio`, but preserving unicode\n",
"- `UQRatio` - unicode quick ratio\n",
"- `ratio` - simple similarity ratio between two strings, as a number between 0 and 100\n",
"- `partial_ratio` - ratio of the most similar substring, as a number between 0 and 100\n",
"- `token_sort_ratio` - a measure of the sequences' similarity between 0 and 100, but sorting the tokens before comparing\n",
"- `partial_token_set_ratio` - token set comparison using the ratio of the most similar substring\n",
"- `partial_token_sort_ratio` - ratio of the most similar substring, as a number between 0 and 100, but sorting the tokens before comparing\n",
"\n",
"A quick comparison of a few of these scorers is sketched below."
]
},
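{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, word order defeats the simple `ratio` scorer but not `token_sort_ratio`, while `partial_ratio` should score a fully contained substring at 100 (a quick sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Compare a few scorers on word-order and substring examples\n",
"(fuzz.ratio('Boris Johnson', 'Johnson Boris'),\n",
" fuzz.token_sort_ratio('Boris Johnson', 'Johnson Boris'),\n",
" fuzz.partial_ratio('Boris Johnson went off to Europe', 'Boris Johnson'))"
]
},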
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More useful, perhaps, is to return just the items that match above a particular confidence level:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Houses of Parliament', 90), ('Diane Abbott', 90)]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process.extractBests(q,terms,score_cutoff=90)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, one problem with the `fuzzywuzzy` matcher is that it doesn't tell us where in the supplied text string the match occurred, or what string in the text was matched."
]
},
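{
"cell_type": "markdown",
"metadata": {},
"source": [
"One workaround is to score each term-length window of words in the text against the search term; the `locate_fuzzy_match` helper below is a hypothetical sketch, not part of `fuzzywuzzy`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Hypothetical helper: find where in a text a term fuzzily matches by scoring\n",
"#each term-length window of words against the term\n",
"def locate_fuzzy_match(text, term, threshold=80):\n",
"    words = text.split()\n",
"    n = len(term.split())\n",
"    best = None\n",
"    for i in range(len(words) - n + 1):\n",
"        candidate = ' '.join(words[i:i + n])\n",
"        score = fuzz.ratio(term, candidate)\n",
"        if score >= threshold and (best is None or score > best[1]):\n",
"            best = (candidate, score, i)\n",
"    return best\n",
"\n",
"#Should locate 'Boris Johnstone' as the best fuzzy match for 'Boris Johnson'\n",
"locate_fuzzy_match(q, 'Boris Johnson')"
]
},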
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `fuzzywuzzy` package can also be used to try to deduplicate a list of items, returning the longest item in the duplicate list. (It might be more useful if this were optionally the *first* item in the original list?)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"names=['Diane Abbott', 'Boris Johnson','Boris Johnstone','Diana Abbot', 'Boris Johnston','Joanna Lumley']"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Joanna Lumley', 'Boris Johnstone', 'Diane Abbott']"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"process.dedupe(names, threshold=80)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It might also be useful to see the candidate strings associated with each deduped item, treating the first item in the list as the canonical one:"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Diane Abbott', 100), ('Diana Abbot', 87)]\n",
"[('Boris Johnson', 100), ('Boris Johnstone', 93), ('Boris Johnston', 96)]\n"
]
}
],
"source": [
"import hashlib\n",
"\n",
"clusters={}\n",
"fuzzed=[]\n",
"for t in names:\n",
"    #Find the fuzzy matches for each name\n",
"    matches=process.extractBests(t,names,score_cutoff=85)\n",
"    #Generate a key based on the sorted members of the set\n",
"    keyvals=sorted(set([x[0] for x in matches]),key=lambda x:names.index(x),reverse=False)\n",
"    keytxt=''.join(keyvals)\n",
"    #hashlib.md5() requires bytes, so encode the key string\n",
"    key=hashlib.md5(keytxt.encode('utf-8')).hexdigest()\n",
"    if len(keyvals)>1 and key not in fuzzed:\n",
"        clusters[key]=sorted(set([x for x in matches]),key=lambda x:names.index(x[0]),reverse=False)\n",
"        fuzzed.append(key)\n",
"for cluster in clusters:\n",
"    print(clusters[cluster])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## OpenRefine Clustering\n",
"\n",
"As well as running as a browser-accessed application, [OpenRefine](http://openrefine.org/) also runs as a service that can be accessed from Python using the [refine-client.py](https://github.com/PaulMakepeace/refine-client-py) client library.\n",
"\n",
"In particular, we can use the OpenRefine service to cluster fuzzily matched items within a list of items."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#!pip install git+https://github.com/PaulMakepeace/refine-client-py.git\n",
"#NOTE - this requires a python 2 kernel"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"#Initialise the connection to the server using default or environment variable defined server settings\n",
"#REFINE_HOST = os.environ.get('OPENREFINE_HOST', os.environ.get('GOOGLE_REFINE_HOST', '127.0.0.1'))\n",
"#REFINE_PORT = os.environ.get('OPENREFINE_PORT', os.environ.get('GOOGLE_REFINE_PORT', '3333'))\n",
"from google.refine import refine, facet\n",
"server = refine.RefineServer()\n",
"orefine = refine.Refine(server)"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name\r\n",
"Diane Abbott\r\n",
"Boris Johnson\r\n",
"Boris Johnstone\r\n",
"Diana Abbot\r\n",
"Boris Johnston\r\n",
"Joanna Lumley\r\n",
"Boris Johnstone\r\n"
]
}
],
"source": [
"#Create an example CSV file to load into a test OpenRefine project\n",
"project_file = 'simpledemo.csv'\n",
"with open(project_file,'w') as f:\n",
"    for t in ['Name']+names+['Boris Johnstone']:\n",
"        f.write(t+ '\\n')\n",
"!cat {project_file}"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[u'Name']"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p=orefine.new_project(project_file=project_file)\n",
"p.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OpenRefine supports a range of clustering functions:\n",
"\n",
"```\n",
"- clusterer_type: binning; function: fingerprint|metaphone3|cologne-phonetic\n",
"- clusterer_type: binning; function: ngram-fingerprint; params: {'ngram-size': INT}\n",
"- clusterer_type: knn; function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}\n",
"```\n",
"\n",
"A `knn` example is sketched after the binning demo below."
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'count': 1, 'value': u'Diana Abbot'}, {'count': 1, 'value': u'Diane Abbott'}]\n",
"[{'count': 2, 'value': u'Boris Johnstone'}, {'count': 1, 'value': u'Boris Johnston'}]\n"
]
}
],
"source": [
"clusters=p.compute_clusters('Name',clusterer_type='binning',function='cologne-phonetic')\n",
"for cluster in clusters:\n",
"    print(cluster)"
]
},
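{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `knn` clustering call might look like the following sketch; this assumes the client passes the `params` dict through to OpenRefine as suggested by the function list above, and the parameter values are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch: knn clustering with a levenshtein distance function (illustrative parameter values)\n",
"knn_clusters=p.compute_clusters('Name',clusterer_type='knn',function='levenshtein',\n",
"                                params={'radius': 2, 'blocking-ngram-size': 6})\n",
"for cluster in knn_clusters:\n",
"    print(cluster)"
]
},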
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Topic Models\n",
"\n",
"Topic models are statistical models that attempt to identify the different \"topics\" that occur across a set of documents.\n",
"\n",
"Several Python libraries provide a simple interface for generating topic models from the text contained in multiple documents."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `gensim`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip3 install gensim"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"#https://github.com/sgsinclair/alta/blob/e5bc94f7898b3bcaf872069f164bc6534769925b/ipynb/TopicModelling.ipynb\n",
"from gensim import corpora, models\n",
"\n",
"def get_lda_from_lists_of_words(lists_of_words, **kwargs):\n",
"    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers\n",
"    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document\n",
"    tfidf = models.TfidfModel(corpus) # this models the significance of words using term frequency inverse document frequency\n",
"    corpus_tfidf = tfidf[corpus]\n",
"    kwargs[\"id2word\"] = dictionary # set the dictionary\n",
"    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling\n",
"\n",
"def print_top_terms(lda, num_terms=10):\n",
"    txt=[]\n",
"    num_terms=min([num_terms,lda.num_topics])\n",
"    for i in range(0, num_terms):\n",
"        terms = [term for term,val in lda.show_topic(i,num_terms)]\n",
"        txt.append(\"\\t - top {} terms for topic #{}: {}\".format(num_terms,i,' '.join(terms)))\n",
"    return '\\n'.join(txt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To start with, let's create a list of dummy documents and then generate word lists for each document."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"docs=['The banks still have a lot to answer for the financial crisis.',\n",
"      'This MP and that Member of Parliament were both active in the debate.',\n",
"      'The companies that work in finance need to be responsible.',\n",
"      'There is a responsibility incumbent on all participants for high quality debate in Parliament.',\n",
"      'Corporate finance is a big responsibility.']\n",
"\n",
"#Create lists of words from the text in each document\n",
"from nltk.tokenize import word_tokenize\n",
"docs = [ word_tokenize(doc.lower()) for doc in docs ]\n",
"\n",
"#Remove stop words from the wordlists\n",
"from nltk.corpus import stopwords\n",
"docs = [ [word for word in doc if word not in stopwords.words('english') ] for doc in docs ]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can generate the topic models from the list of word lists."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t - top 3 terms for topic #0: parliament debate active\n",
"\t - top 3 terms for topic #1: responsible work need\n",
"\t - top 3 terms for topic #2: corporate big responsibility\n"
]
}
],
"source": [
"topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)\n",
"print( print_top_terms(topicsLda))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model training is randomly initialised, so if we run it again we are likely to get a different result; a way of pinning the seed is sketched after the rerun below."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\t - top 3 terms for topic #0: finance corporate responsibility\n",
"\t - top 3 terms for topic #1: participants quality high\n",
"\t - top 3 terms for topic #2: member mp active\n"
]
}
],
"source": [
"topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)\n",
"print( print_top_terms(topicsLda))"
]
},
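{
"cell_type": "markdown",
"metadata": {},
"source": [
"`gensim`'s `LdaModel` accepts a `random_state` parameter, so one way to make runs repeatable is to pin the seed via our helper's keyword arguments (a minimal sketch; the seed value is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Pin the random seed so that repeated runs give the same topics\n",
"topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)],\n",
"                                        num_topics=3, passes=20, random_state=42)\n",
"print( print_top_terms(topicsLda))"
]
},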
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}