wpm/Entity Highlighting in Context.ipynb

## Entity Highlighting in Context.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Entity Highlighting in Context"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The visualization tools in the [spaCy](https://spacy.io/) natural language toolkit can display entity annotations for an entire document.\n",
    "Here we highlight just those sentences in the document that contain the specified entities.\n",
    "\n",
    "(You will have to [install the large English language model](https://spacy.io/usage/models) separately.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import spacy\n",
    "from spacy import displacy\n",
    "from itertools import groupby\n",
    "\n",
    "nlp = spacy.load(\"en_core_web_lg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following function displays all the sentences in a parsed document containing the specified entity types. If no entity types are specified, all entities are highlighted. If a sentence does not contain any entities of interest, it is not displayed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
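    "def entities_in_context(doc, *entity_types):\n",
    "    \"\"\"Render each sentence of `doc` that contains an entity of the given types.\n",
    "\n",
    "    If no entity types are given, all entities are highlighted.\n",
    "    \"\"\"\n",
    "    def highlight_entity(entity_label):\n",
    "        if not entity_types:\n",
    "            return True\n",
    "        return entity_label in entity_types\n",
    "\n",
    "    # doc.ents is in document order, so entities from the same sentence are adjacent for groupby.\n",
    "    selected = [(entity.sent, entity) for entity in doc.ents if highlight_entity(entity.label_)]\n",
    "    for context, group in groupby(selected, key=lambda t: t[0]):\n",
    "        # In manual mode, displaCy expects character offsets relative to the rendered text,\n",
    "        # so both start and end are measured from the start of the sentence.\n",
    "        entities = [{\"start\": entity.start_char - context.start_char,\n",
    "                     \"end\": entity.end_char - context.start_char,\n",
    "                     \"label\": entity.label_} for _, entity in group]\n",
    "        context_document = {\"text\": str(context), \"ents\": entities, \"title\": None}\n",
    "        displacy.render(context_document, style=\"ent\", jupyter=True, manual=True)"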
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following document consists of three sentences, two of which contain dates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "text = \"\"\"Miles Davis was born on May 26, 1926 and died on September 28, 1991.\n",
    "    He was a world-renowned musician.\n",
    "    His album Kind of Blue was released on August 17, 1959.\"\"\"\n",
    "doc = nlp(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display only those sentences that contain DATE or PERSON entities. Note that the second sentence in the document contains neither, so it is not displayed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "entities_in_context(doc, \"DATE\", \"PERSON\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}