AndradeEduardo/NER_Spacy.ipynb

## NER_Spacy.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "text = \"\"\"Gary Winston Lineker was an excellent football player.\n",
    "GARY WINSTON LINEKER was a striker.\n",
    "gary winston lineker was born in England.\n",
    "gARY WiNsTon lInEker is married to Danielle Bux.\n",
    "Gary W. Lineker, Kanny Sansom and Peter Shilton played together.\n",
    "The defenders:\n",
    "    Gary Stevens\n",
    "    Kenny Sansom\n",
    "    Terry Butcher\n",
    "The midfields were:\n",
    "    - Bryan Robson;\n",
    "    - Ray Wilkins;\n",
    "    - Chris Waddle.\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
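   },
   "outputs": [],
   "source": [
    "import spacy\n",
    "# The 'en' shortcut link was removed in spaCy 3, so load the small English model by its full name.\n",
    "nlp = spacy.load('en_core_web_sm')"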
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nlp_big = spacy.load('en_core_web_md')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "document = nlp(text)\n",
    "document_big = nlp_big(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Gary Winston Lineker was an excellent football player.,\n",
       " GARY WINSTON LINEKER was a striker.,\n",
       " gary winston lineker was born in England.,\n",
       " gARY WiNsTon lInEker is married to Danielle Bux.,\n",
       " Gary W. Lineker, Kanny Sansom and Peter Shilton played together.,\n",
       " The defenders:\n",
       "     Gary Stevens\n",
       "     Kenny Sansom\n",
       "     Terry Butcher\n",
       " The midfields were:\n",
       "     - Bryan Robson;\n",
       "     - Ray Wilkins;\n",
       "     - Chris Waddle.]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(document.sents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Gary Winston Lineker ', 'WINSTON LINEKER ', 'Danielle Bux', 'Gary W. Lineker', 'Kanny Sansom ', 'Peter Shilton ', 'Gary Stevens', 'Kenny Sansom', 'Terry Butcher']\n"
     ]
    }
   ],
   "source": [
    "# Span.string was removed in spaCy 3; text_with_ws returns the same text, trailing whitespace included.\n",
    "entities = [e.text_with_ws for e in document.ents if e.label_ == 'PERSON']\n",
    "print(entities)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Gary Winston Lineker ', 'Danielle Bux', 'Gary W. Lineker', 'Kanny Sansom ', 'Peter Shilton ', 'Gary Stevens', 'Kenny Sansom', 'Terry Butcher', 'Ray Wilkins']\n"
     ]
    }
   ],
   "source": [
    "# Span.string was removed in spaCy 3; text_with_ws returns the same text, trailing whitespace included.\n",
    "entities_big = [e.text_with_ws for e in document_big.ents if e.label_ == 'PERSON']\n",
    "print(entities_big)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|Entities in text|Recognised by <br>'en_core_web_sm' (50MB)|Recognised by <br>'en_core_web_md' (1GB)|\n",
    "|---------|---------------|----------------|\n",
    "|Gary Winston Lineker|Gary Winston Lineker |Gary Winston Lineker |\n",
    "|GARY WINSTON LINEKER|WINSTON LINEKER | |\n",
    "|gary winston lineker| | |\n",
    "|gARY WiNsTon lInEker| | |\n",
    "|Danielle Bux|Danielle Bux |Danielle Bux |\n",
    "|Gary W. Lineker|Gary W. Lineker |Gary W. Lineker |\n",
    "|Kanny Sansom|Kanny Sansom |Kanny Sansom |\n",
    "|Peter Shilton|Peter Shilton |Peter Shilton |\n",
    "|Gary Stevens|Gary Stevens |Gary Stevens |\n",
    "|Kenny Sansom|Kenny Sansom |Kenny Sansom |\n",
    "|Terry Butcher|Terry Butcher |Terry Butcher |\n",
    "|Bryan Robson| | |\n",
    "|Ray Wilkins| |Ray Wilkins |\n",
    "|Chris Waddle| | |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conclusion\n",
    "spaCy's English named entity recognition models can be used to perform NER, but the feature should be implemented with care. In the test above, neither model recognised the lowercase or mixed-case names, only the small model caught part of the all-caps name, and most names in the hyphen-bulleted list were missed. Characteristics of the corpus, such as how names are cased and which characters are used as list bullets, therefore have to be assessed during development and testing, before releasing such functionality to a production environment."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
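
The comparison table in the notebook was compiled by hand. As a rough sketch of how the same assessment could be scripted, the block below scores each model's printed `PERSON` entities against the names in the test text. The helper name `score` and the exact-match rule (case-sensitive, surrounding whitespace stripped) are assumptions made for illustration and are not part of the notebook; the entity lists are copied from the outputs of cells 6 and 7, so no spaCy model is needed to run it.

```python
# Names as they appear in the notebook's test text, one entry per mention.
EXPECTED = [
    "Gary Winston Lineker", "GARY WINSTON LINEKER", "gary winston lineker",
    "gARY WiNsTon lInEker", "Danielle Bux", "Gary W. Lineker", "Kanny Sansom",
    "Peter Shilton", "Gary Stevens", "Kenny Sansom", "Terry Butcher",
    "Bryan Robson", "Ray Wilkins", "Chris Waddle",
]

# PERSON entities printed by cell 6 (small model) and cell 7 (medium model).
FOUND_SM = [
    "Gary Winston Lineker ", "WINSTON LINEKER ", "Danielle Bux",
    "Gary W. Lineker", "Kanny Sansom ", "Peter Shilton ", "Gary Stevens",
    "Kenny Sansom", "Terry Butcher",
]
FOUND_MD = [
    "Gary Winston Lineker ", "Danielle Bux", "Gary W. Lineker",
    "Kanny Sansom ", "Peter Shilton ", "Gary Stevens", "Kenny Sansom",
    "Terry Butcher", "Ray Wilkins",
]


def score(expected, found):
    """Hypothetical helper: exact-match recall, missed names, and unmatched spans."""
    # Strip the trailing whitespace that the notebook's entity strings carry.
    found_clean = {name.strip() for name in found}
    hits = [name for name in expected if name in found_clean]
    missed = [name for name in expected if name not in found_clean]
    # Spans the model returned that match no expected name (e.g. partial matches).
    extra = sorted(found_clean - set(expected))
    return {"recall": len(hits) / len(expected), "missed": missed, "extra": extra}


for label, found in [("small", FOUND_SM), ("medium", FOUND_MD)]:
    result = score(EXPECTED, found)
    print(f"{label}: recall={result['recall']:.2f}")
    print(f"  missed: {result['missed']}")
    print(f"  unmatched spans: {result['extra']}")
```

Under this exact-match rule the small model matches 8 of the 14 names and the medium model 9 of 14; the partial span `WINSTON LINEKER` is reported as unmatched rather than counted as a hit. A looser rule (for example, case-insensitive or substring matching) would give different figures, so the choice of rule should follow what the production use case treats as a correct entity.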