@AndradeEduardo
Last active November 4, 2017 06:28
Spacy NER Experiments
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"text = \"\"\"Gary Winston Lineker was an excellent football player.\n",
"GARY WINSTON LINEKER was a striker.\n",
"gary winston lineker was born in England.\n",
"gARY WiNsTon lInEker is married to Danielle Bux.\n",
"Gary W. Lineker, Kanny Sansom and Peter Shilton played together.\n",
"The defenders:\n",
" Gary Stevens\n",
" Kenny Sansom\n",
" Terry Butcher\n",
"The midfields were:\n",
" - Bryan Robson;\n",
" - Ray Wilkins;\n",
" - Chris Waddle.\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import spacy \n",
"nlp = spacy.load('en')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nlp_big = spacy.load('en_core_web_md')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"document = nlp(text)\n",
"document_big = nlp_big(text)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Gary Winston Lineker was an excellent football player.,\n",
" GARY WINSTON LINEKER was a striker.,\n",
" gary winston lineker was born in England.,\n",
" gARY WiNsTon lInEker is married to Danielle Bux.,\n",
" Gary W. Lineker, Kanny Sansom and Peter Shilton played together.,\n",
" The defenders:\n",
" Gary Stevens\n",
" Kenny Sansom\n",
" Terry Butcher\n",
" The midfields were:\n",
" - Bryan Robson;\n",
" - Ray Wilkins;\n",
" - Chris Waddle.]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(document.sents)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Gary Winston Lineker ', 'WINSTON LINEKER ', 'Danielle Bux', 'Gary W. Lineker', 'Kanny Sansom ', 'Peter Shilton ', 'Gary Stevens', 'Kenny Sansom', 'Terry Butcher']\n"
]
}
],
"source": [
"# Collect PERSON entities from the small model; text_with_ws keeps trailing whitespace, as e.string did\n",
"entities = [e.text_with_ws for e in document.ents if e.label_ == 'PERSON']\n",
"print(entities)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Gary Winston Lineker ', 'Danielle Bux', 'Gary W. Lineker', 'Kanny Sansom ', 'Peter Shilton ', 'Gary Stevens', 'Kenny Sansom', 'Terry Butcher', 'Ray Wilkins']\n"
]
}
],
"source": [
"# Collect PERSON entities from the medium model; text_with_ws keeps trailing whitespace, as e.string did\n",
"entities_big = [e.text_with_ws for e in document_big.ents if e.label_ == 'PERSON']\n",
"print(entities_big)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|Entity in text|Recognised by <br>'en_core_web_sm' (50MB)|Recognised by <br>'en_core_web_md' (1GB)|\n",
"|---------|---------------|----------------|\n",
"|Gary Winston Lineker|Gary Winston Lineker |Gary Winston Lineker |\n",
"|GARY WINSTON LINEKER|WINSTON LINEKER | |\n",
"|gary winston lineker| | |\n",
"|gARY WiNsTon lInEker| | |\n",
"|Danielle Bux|Danielle Bux |Danielle Bux |\n",
"|Gary W. Lineker|Gary W. Lineker |Gary W. Lineker |\n",
"|Kanny Sansom|Kanny Sansom |Kanny Sansom |\n",
"|Peter Shilton|Peter Shilton |Peter Shilton |\n",
"|Gary Stevens|Gary Stevens |Gary Stevens |\n",
"|Kenny Sansom|Kenny Sansom |Kenny Sansom |\n",
"|Terry Butcher|Terry Butcher |Terry Butcher |\n",
"|Bryan Robson| | |\n",
"|Ray Wilkins| |Ray Wilkins |\n",
"|Chris Waddle| | |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion\n",
"spaCy's English named entity recognition models can be used to perform NER, but the feature should be deployed with care. In the test above, neither model recognised the all-lowercase or mixed-case spellings of a name, the small model caught only part of the all-uppercase one, and most names in the hyphen-bulleted list were missed. Characteristics of the corpus, such as how names are capitalised and which characters serve as list bullets, should therefore be assessed during development and testing, before the functionality is released to a production environment."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
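
The conclusion's point about name casing and list bullets could be explored with a small preprocessing pass applied to the text before it reaches the model. The sketch below is not part of the notebook and uses no spaCy API: `strip_list_markup`, `normalise_case` and `preprocess` are hypothetical helper names, and the casing rule is a deliberately crude assumption that would also rewrite acronyms ("FIFA" becomes "Fifa") and names such as "McDonald". All-lowercase names are left untouched, because casing alone cannot tell whether "gary" is a name.

```python
import re

# Hypothetical helpers for illustration; not taken from the notebook or from spaCy.

# A leading "-", "*" or bullet character followed by whitespace.
BULLET_PREFIX = re.compile(r"^\s*[-*\u2022]\s+")
# A trailing ";" left over from list items such as "- Bryan Robson;".
LIST_TERMINATOR = re.compile(r";\s*$")


def strip_list_markup(line):
    """Remove a leading bullet marker and a trailing semicolon from one line."""
    line = BULLET_PREFIX.sub("", line.strip())
    return LIST_TERMINATOR.sub("", line)


def normalise_case(token):
    """Capitalise purely alphabetic tokens that are all upper or mixed case.

    Tokens that are already lower case or already capitalised are returned
    unchanged, as are tokens containing punctuation or digits.
    """
    irregular = token != token.lower() and token != token.capitalize()
    if len(token) > 1 and token.isalpha() and irregular:
        return token.capitalize()
    return token


def preprocess(text):
    """Apply both clean-up steps line by line and rejoin the result."""
    cleaned_lines = []
    for raw_line in text.splitlines():
        line = strip_list_markup(raw_line)
        cleaned_lines.append(" ".join(normalise_case(t) for t in line.split()))
    return "\n".join(cleaned_lines)


sample = (
    "gARY WiNsTon lInEker is married to Danielle Bux.\n"
    " - Bryan Robson;\n"
    "GARY WINSTON LINEKER was a striker."
)
print(preprocess(sample))
```

To try it against the models, one could compare `nlp(preprocess(text)).ents` with the entities found in the raw text. Whether this actually improves recognition has not been measured here and would need to be checked on the target corpus.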