@AndradeEduardo
Created November 4, 2017 05:32
NER_Spacy
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"text = \"\"\"Gary Winston Lineker was an excellent football player.\n",
"GARY WINSTON LINEKER was a striker.\n",
"gary winston lineker was born in England.\n",
"gARY WiNsTon lInEker is married to Danielle Bux.\n",
"Gary W. Lineker, Kanny Sansom and Peter Shilton played together.\n",
"The defensors:\n",
" Gary Stevens\n",
" Kenny Sansom\n",
" Terry Butcher\n",
"The midfields were:\n",
" - Bryan Robson;\n",
" - Ray Wilkins;\n",
" - Chris Waddle.\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
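},
"outputs": [],
"source": [
"import spacy\n",
"# 'en' was a shortcut link to en_core_web_sm; the full package name also loads on newer spaCy releases\n",
"nlp = spacy.load('en_core_web_sm')"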
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nlp_big = spacy.load('en_core_web_md')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"document = nlp(text)\n",
"document_big = nlp_big(text)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Gary Winston Lineker was an excellent football player.,\n",
" GARY WINSTON LINEKER was a striker.,\n",
" gary winston lineker was born in England.,\n",
" gARY WiNsTon lInEker is married to Danielle Bux.,\n",
" Gary W. Lineker, Kanny Sansom and Peter Shilton played together.,\n",
" The defensors:\n",
" Gary Stevens\n",
" Kenny Sansom\n",
" Terry Butcher\n",
" The midfields were:\n",
" - Bryan Robson;\n",
" - Ray Wilkins;\n",
" - Chris Waddle.]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(document.sents)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Gary Winston Lineker ', 'WINSTON LINEKER ', 'Danielle Bux', 'Gary W. Lineker', 'Kanny Sansom ', 'Peter Shilton ', 'Gary Stevens', 'Kenny Sansom', 'Terry Butcher']\n"
]
}
],
"source": [
"# text_with_ws keeps the trailing whitespace that the deprecated Span.string returned\n",
"entities = [e.text_with_ws for e in document.ents if e.label_ == 'PERSON']\n",
"print(entities)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Gary Winston Lineker ', 'Danielle Bux', 'Gary W. Lineker', 'Kanny Sansom ', 'Peter Shilton ', 'Gary Stevens', 'Kenny Sansom', 'Terry Butcher', 'Ray Wilkins']\n"
]
}
],
"source": [
"# text_with_ws keeps the trailing whitespace that the deprecated Span.string returned\n",
"entities_big = [e.text_with_ws for e in document_big.ents if e.label_ == 'PERSON']\n",
"print(entities_big)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|Entity in the text|Recognised by <br>'en_core_web_sm' (50MB)|Recognised by <br>'en_core_web_md' (1GB)|\n",
"|---------|---------------|----------------|\n",
"|Gary Winston Lineker|Gary Winston Lineker |Gary Winston Lineker |\n",
"|GARY WINSTON LINEKER|WINSTON LINEKER | |\n",
"|gary winston lineker| | |\n",
"|gARY WiNsTon lInEker| | |\n",
"|Danielle Bux|Danielle Bux |Danielle Bux |\n",
"|Gary W. Lineker|Gary W. Lineker |Gary W. Lineker |\n",
"|Kanny Sansom|Kanny Sansom |Kanny Sansom |\n",
"|Peter Shilton|Peter Shilton |Peter Shilton |\n",
"|Gary Stevens|Gary Stevens |Gary Stevens |\n",
"|Kenny Sansom|Kenny Sansom |Kenny Sansom |\n",
"|Terry Butcher|Terry Butcher |Terry Butcher |\n",
"|Bryan Robson| | |\n",
"|Ray Wilkins| |Ray Wilkins |\n",
"|Chris Waddle| | |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion\n",
"spaCy's English named entity recognition models can be used to perform NER. However, this feature should be implemented with care. Particular characteristics of the corpus have to be assessed before releasing such functionality to a production environment: for instance, the casing in which names are typed and the characters used as list bullets, both of which changed the results above, along with any other corpus-specific characteristics found during development and testing."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
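
The conclusion singles out name casing and list bullets as corpus characteristics to assess before relying on NER. The sketch below is one possible way to inspect and normalise those two features before the text reaches the model. It is not part of the notebook or of spaCy: the helper names `strip_list_markup` and `casing_profile` are hypothetical, and the bullet characters (`-`, `*`, `•`) and the trailing `;` are assumptions based on the sample text. Whether this preprocessing improves recognition would still have to be checked against each model.

```python
import re

# Hypothetical helpers, not part of spaCy.
# Assumption: list items start with '-', '*' or a bullet character and may end with ';'.
BULLET = re.compile(r"^\s*[-*\u2022]\s+")
TRAILING = re.compile(r"\s*;\s*$")


def strip_list_markup(text):
    """Remove leading list markers and trailing ';' from every line."""
    cleaned = []
    for line in text.splitlines():
        line = BULLET.sub("", line)
        line = TRAILING.sub("", line)
        cleaned.append(line.strip())
    return "\n".join(cleaned)


def casing_profile(text):
    """Count alphabetic words by casing style to spot unusual typing patterns."""
    counts = {"upper": 0, "lower": 0, "title": 0, "mixed": 0}
    for word in re.findall(r"[A-Za-z]+", text):
        if len(word) > 1 and word.isupper():
            counts["upper"] += 1
        elif word.islower():
            counts["lower"] += 1
        elif word.istitle():
            counts["title"] += 1
        else:
            counts["mixed"] += 1
    return counts


sample = "The midfields were:\n - Bryan Robson;\n - Ray Wilkins;\n - Chris Waddle."
print(strip_list_markup(sample))
print(casing_profile("GARY WINSTON LINEKER was a striker."))
```

A high share of `upper`, `lower` or `mixed` words in text that is expected to contain names suggests the corpus deviates from the casing on which the models recognised "Gary Winston Lineker" in the experiment above.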