AlexMikhalev/inexact-search-aho-corasick.ipynb

## inexact-search-aho-corasick.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inexact Matching with Aho-Corasick\n",
    "\n",
    "The algorithm described in this notebook is based partially on ideas from the paper [Efficient Clinical Concept Extraction in Electronic Medical Records](https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14794/14029) (Guo, Kakrania, Baldwin, and Syeda-Mahmood, 2017). The paper describes their novel indexing method for concept extraction as follows:\n",
    "\n",
    "> To speed up the processing, we propose a novel indexing method that significantly reduces the search space while\n",
    "still maintaining the requisite flexibility in matching. First, each word in the vocabulary is represented by a unique prefix, the shortest sequence of letters that differentiates it from every other term. Next, an inverted index is created for the mapping from prefixes to report sentences. Starting from the representative prefix of each term in the vocabulary (or a set of prefixes in the case of a multi-word term), all relevant sentences can be easily retrieved as potential matches for the term, and post-filtering by longest common word sequence matching can be used to further refine the search results.\n",
    "\n",
    "The test case mentioned in the paper as an example of the functionality of this concept extraction technique is matching the concept for `Chest CT` from the phrase `a subsequent CT scan of the chest`.\n",
    "\n",
    "Method described in this notebook uses the words of the concept names from the concept dictionary as keys to an Aho-Corasick data structure, with the payload consisting of an inverted index of ((`concept_id`, `synonym_id`), `term_weight`) associated with the word as the payload. Results of annotation against text are then post-processed to retrieve high confidence annotations against these words, and merged to find concept annotations for multi-term phrases in the text.\n",
    "\n",
    "The technique is applicable to both short text such as queries, and longer text bodies, such as paragraphs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ahocorasick\n",
    "import numpy as np\n",
    "import operator\n",
    "import os\n",
    "import pandas as pd\n",
    "import re\n",
    "import spacy\n",
    "\n",
    "from spacy.lang.en import stop_words as sw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATA_DIR = \"../data\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing\n",
    "\n",
    "We will first create an inverted index and then load it into an Aho-Corasick automaton."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop_words = sw.STOP_WORDS\n",
    "def is_stop_word(w):\n",
    "    return w.lower() in stop_words\n",
    "\n",
    "non_alpha_re = re.compile(\"^[^a-zA-Z0-9].*$\")\n",
    "def is_non_alpha(w):\n",
    "    return re.match(non_alpha_re, w) is not None\n",
    "\n",
    "def is_all_caps(w):\n",
    "    return w.upper() == w\n",
    "\n",
    "\n",
    "assert(is_stop_word(\"is\"))\n",
    "assert(is_non_alpha(\"-\"))\n",
    "assert(is_all_caps(\"AIDS\"))\n",
    "assert(not is_all_caps(\"hearing\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 0 concepts\n",
      "Loaded 100000 concepts\n",
      "Loaded 200000 concepts\n",
      "Loaded 300000 concepts\n",
      "Loaded 400000 concepts\n",
      "Loaded 500000 concepts\n",
      "Loaded 581733 terms, COMPLETE\n",
      "# of concepts: 581733\n"
     ]
    }
   ],
   "source": [
    "def build_inexact_match_dictionary(file_path):\n",
    "    cid2cfn, term2props = {}, {}\n",
    "\n",
    "    # prepare the data. Since we are building a Aho-Corasick data structure keyed\n",
    "    # by individual words in the concept names, there is greater chance of collision\n",
    "    # between terms belonging to multiple concepts. In order to get around that, we\n",
    "    # preprocess the terms to an inverted index of terms to list of pairs of \n",
    "    # (concept_id:synonym_id, weight) pairs. Here concept_id (CID) is the ID the \n",
    "    # token corresponding to the synonym is mapped to, and synonym_id (SID) is the\n",
    "    # sequence number of the synonym. The weight is the reciprocal of the number of\n",
    "    # non-stopword terms in the synonym, so a match against a term in a short synonym \n",
    "    # is \"better\" than one against a term in a long synonym.\n",
    "    num_loaded = 0\n",
    "    fvert = open(os.path.join(DATA_DIR, \"vertices.txt\"), \"r\")\n",
    "    for line in fvert:\n",
    "        if num_loaded % 100000 == 0:\n",
    "            print(\"Loaded {:d} concepts\".format(num_loaded))\n",
    "        cid, syns = line.strip().split('\\t')\n",
    "        for sid, syn in enumerate(syns.split('|')):\n",
    "            if sid == 0:\n",
    "                cid2cfn[cid] = syn\n",
    "            terms = syn.split(' ')\n",
    "            matchable_terms = []\n",
    "            for term in syn.split(' '):\n",
    "                if is_stop_word(term) or is_non_alpha(term):\n",
    "                    continue\n",
    "                if not is_all_caps(term):\n",
    "                    term = term.lower()\n",
    "                matchable_terms.append(term)\n",
    "            key = \":\".join([cid, str(sid)])\n",
    "            weight = 1 / len(terms)\n",
    "            for term in matchable_terms:\n",
    "                if term not in term2props.keys():\n",
    "                    term2props[term] = [(key, weight)]\n",
    "                else:\n",
    "                    term2props[term].append((key, weight))\n",
    "        num_loaded += 1\n",
    "\n",
    "    print(\"Loaded {:d} terms, COMPLETE\".format(num_loaded))\n",
    "    fvert.close()\n",
    "    \n",
    "    # load up the Aho-Corasick automaton\n",
    "    A = ahocorasick.Automaton()\n",
    "\n",
    "    for term in term2props.keys():\n",
    "        props = term2props[term]\n",
    "        A.add_word(term, (term, props))\n",
    "\n",
    "    A.make_automaton()\n",
    "    \n",
    "    # return the cid2cfn dictionary and the automaton\n",
    "    return cid2cfn, A\n",
    "\n",
    "\n",
    "cid2cfn, A = build_inexact_match_dictionary(os.path.join(DATA_DIR, \"vertices.txt\"))\n",
    "print(\"# of concepts: {:d}\".format(len(cid2cfn)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Annotation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_matched_terms(query, A, cid2cfn, confidence_threshold=0.85, span_slop=5):\n",
    "    \n",
    "    # normalize input query to match terms in dictionary\n",
    "    query_terms = [(term if is_all_caps(term) else term.lower()) for term in query.split(' ')]\n",
    "    # construct set of matchable terms (minus stopwords and non-alpha words)\n",
    "    matchable_terms = set([term for term in query_terms if not(is_stop_word(term) or is_non_alpha(term))])\n",
    "    query = \" \".join(query_terms)\n",
    "    query_results = []\n",
    "    for end_index, (term, props) in A.iter(query):\n",
    "        if term in matchable_terms:\n",
    "            start_index = end_index - len(term) + 1\n",
    "            query_results.append((start_index, end_index, term, props))\n",
    "    \n",
    "    # at this stage, Aho-Corasick provides ((start_offset, end_offset, term, props=[(csid, weight)]))\n",
    "    # for each matched term in query\n",
    "    # Next we construct a matrix of weights for each term (row) for each matched cid:sid (col)\n",
    "    num_rows = len(query_results)\n",
    "    cols = []\n",
    "    for row in query_results:\n",
    "        _, _, term, props = row\n",
    "        cols.extend([p[0] for p in props])\n",
    "    cols = set(cols)\n",
    "    num_cols = len(cols)\n",
    "    W = np.zeros((num_rows, num_cols))\n",
    "    \n",
    "    # construct lookup indices\n",
    "    csid2col, col2csid, term2row, row2term = {}, {}, {}, {}\n",
    "\n",
    "    e_rows = enumerate([r[2] for r in query_results])\n",
    "    term2row = {x:i for i, x in e_rows}\n",
    "    row2term = {v:k for (k, v) in term2row.items()}\n",
    "\n",
    "    e_cols = enumerate(sorted(list(cols)))\n",
    "    csid2col = {x:i for i, x in e_cols}\n",
    "    col2csid = {v:k for (k, v) in csid2col.items()}\n",
    "\n",
    "    # populate weights matrix W\n",
    "    for row in query_results:\n",
    "        start_index, end_index, term, props = row\n",
    "        i = term2row[term]\n",
    "        for k, w in props:\n",
    "            j = csid2col[k]\n",
    "            W[i, j] = w\n",
    "            \n",
    "    # compute the confidence score for each cid:sid (CSID)\n",
    "    row_sum = np.sum(W, axis=0)\n",
    "    # candidate CSIDs are those whose confidence score is above threshold\n",
    "    candidate_weights, candidate_col_ids = [], []\n",
    "    for j in range(W.shape[1]):\n",
    "        if row_sum[j] > confidence_threshold:\n",
    "            candidate_col_ids.append(j)\n",
    "            candidate_weights.append(W[:, j])\n",
    "    C = np.array(candidate_weights).T\n",
    "    \n",
    "    # merge offsets for each candidate. In case there are multiple entries down the column,\n",
    "    # these signal that multiple terms have mapped to a single concept, and we need to \n",
    "    # merge the offsets accordingly. In addition, for non-contiguous matched spans, we \n",
    "    # need to decide if we should treat the span as continuous or pick the longest non-\n",
    "    # contiguous span.\n",
    "    offsets = [(s, e) for s, e, t, p in query_results]\n",
    "    terms = [t for s, e, t, p in query_results]\n",
    "\n",
    "    matched_concepts = []\n",
    "\n",
    "    for j in range(C.shape[1]):\n",
    "        is_full_match = True\n",
    "        csid = col2csid[candidate_col_ids[j]]\n",
    "        score = np.sum(C[:, j])\n",
    "        cs_row_ids = np.where(C[:, j] > 0)[0]\n",
    "\n",
    "        term = \" \".join([terms[i] for i in cs_row_ids])\n",
    "        cs_offsets = [offsets[i] for i in cs_row_ids]\n",
    "        if len(cs_offsets) > 1:\n",
    "            spans = []\n",
    "            start_offset, end_offset = None, None\n",
    "            for i in range(len(cs_offsets) - 1):\n",
    "                if start_offset is None:\n",
    "                    start_offset = cs_offsets[i][0]\n",
    "                if cs_offsets[i+1][0] - cs_offsets[i][1] <= span_slop:\n",
    "                    end_offset = cs_offsets[i+1][1]\n",
    "                else:\n",
    "                    end_offset = cs_offsets[i][1]\n",
    "                    spans.append((start_offset, end_offset))\n",
    "                    start_offset = None\n",
    "                    end_offset = None\n",
    "            if len(spans) > 1:\n",
    "                # select longest span\n",
    "                spans = sorted(spans, key=lambda x: x[1]-x[0], reverse=True)\n",
    "                start_offset = spans[0][0]\n",
    "                end_offset = spans[0][1]\n",
    "                is_full_match = False\n",
    "                # discount score based on subspan length\n",
    "                score *= len(spans[0]) / sum([len(span) for span in spans])\n",
    "        else:\n",
    "            start_offset = cs_offsets[0][0]\n",
    "            end_offset = cs_offsets[0][1]\n",
    "        \n",
    "        if start_offset is not None and end_offset is not None:\n",
    "            matched_concepts.append([term, start_offset, end_offset, csid, score, is_full_match])\n",
    "        \n",
    "    # we now need to remove spans which are completely subsumed in longer spans. This is \n",
    "    # to prevent mapping to a term such as \"lung cancer\" from being mapped separately to\n",
    "    # concepts for \"lung\", \"cancer\", and \"lung cancer\".\n",
    "    longest_matched_concepts, covered_spans = [], []\n",
    "\n",
    "    matched_concepts = sorted(matched_concepts, key=lambda x: x[2]-x[1], reverse=True)\n",
    "    for matched_concept in matched_concepts:\n",
    "        start_offset, end_offset = matched_concept[1], matched_concept[2]\n",
    "        is_subsumed = False\n",
    "        for cs_s, cs_e in covered_spans:\n",
    "            if start_offset >= cs_s and end_offset <= cs_e:\n",
    "                # remove completely subsumed spans from report\n",
    "                is_subsumed = True\n",
    "                break\n",
    "        if is_subsumed:\n",
    "            continue\n",
    "        longest_matched_concepts.append(matched_concept)\n",
    "        covered_spans.append((start_offset, end_offset))\n",
    "\n",
    "    # finally, we pull in the concept primary name for display purposes\n",
    "    matched_concepts_for_display = []\n",
    "    for matched_concept in longest_matched_concepts:\n",
    "        cid = matched_concept[3].split(':')[0]\n",
    "        cfn = cid2cfn[cid]\n",
    "        matched_concepts_for_display.append([\n",
    "            matched_concept[0], # term\n",
    "            matched_concept[1], # start_offset\n",
    "            matched_concept[2], # end offset\n",
    "            cid,                # concept id\n",
    "            cfn,                # concept primary name\n",
    "            matched_concept[4], # score\n",
    "            matched_concept[5]  # full match\n",
    "        ])\n",
    "    return matched_concepts_for_display"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples of Query Annotations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_queries = [\n",
    "    \"helicobacter pylori patient perception\",\n",
    "    \"lyme disease syphilis\",\n",
    "    \"nausea related to latuda\",\n",
    "    \"pheochromocytoma surgery mitral\",\n",
    "    \"talus osteochondral injury\",\n",
    "    \"tuberculosis pulmonary\",\n",
    "    \"a subsequent CT scan of the chest\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Baseline\n",
    "\n",
    "The following concepts were identified by an in-house system used to map search queries to concepts in our medical taxonomy.\n",
    "\n",
    "1. __helicobacter pylori patient perception__ : Helicobacter pylori (8121505), patient (8111861), perception (8117608).\n",
    "2. __lyme disease syphilis__: Lyme disease (2791575), syphilis (2792091).\n",
    "3. __nausea related to latuda__: nausea (4993818), Latuda (8815455).\n",
    "4. __pheochromocytoma surgery mitral__: pheochromocytoma (2791864), surgical procedure (5344477), mitral (9786428).\n",
    "5. __talus osteochondral injury__: talus (8002610), osteochondral plate (9787545), injury (8109859)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_columns = [\"term\", \"start_offset\", \"end_offset\", \"concept_id\", \"concept_primary_name\", \n",
    "                 \"match_confidence\", \"is_full_match\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>helicobacter pylori</td>\n",
       "      <td>0</td>\n",
       "      <td>18</td>\n",
       "      <td>8121505</td>\n",
       "      <td>Helicobacter pylori</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>perception</td>\n",
       "      <td>28</td>\n",
       "      <td>37</td>\n",
       "      <td>8117608</td>\n",
       "      <td>Perception</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>patient</td>\n",
       "      <td>20</td>\n",
       "      <td>26</td>\n",
       "      <td>8111861</td>\n",
       "      <td>patient</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  term  start_offset  end_offset concept_id  \\\n",
       "0  helicobacter pylori             0          18    8121505   \n",
       "1           perception            28          37    8117608   \n",
       "2              patient            20          26    8111861   \n",
       "\n",
       "  concept_primary_name  match_confidence  is_full_match  \n",
       "0  Helicobacter pylori               1.0           True  \n",
       "1           Perception               1.0           True  \n",
       "2              patient               1.0           True  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = find_matched_terms(sample_queries[0], A, cid2cfn)\n",
    "\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df.head(len(results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>lyme disease</td>\n",
       "      <td>0</td>\n",
       "      <td>11</td>\n",
       "      <td>2791575</td>\n",
       "      <td>Lyme disease</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>syphilis</td>\n",
       "      <td>13</td>\n",
       "      <td>20</td>\n",
       "      <td>2792091</td>\n",
       "      <td>syphilis</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           term  start_offset  end_offset concept_id concept_primary_name  \\\n",
       "0  lyme disease             0          11    2791575         Lyme disease   \n",
       "1      syphilis            13          20    2792091             syphilis   \n",
       "\n",
       "   match_confidence  is_full_match  \n",
       "0               1.0           True  \n",
       "1               1.0           True  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = find_matched_terms(sample_queries[1], A, cid2cfn)\n",
    "\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df.head(len(results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>related</td>\n",
       "      <td>7</td>\n",
       "      <td>13</td>\n",
       "      <td>9125985</td>\n",
       "      <td>Related</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>nausea</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>4993818</td>\n",
       "      <td>nausea</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>latuda</td>\n",
       "      <td>18</td>\n",
       "      <td>23</td>\n",
       "      <td>8815455</td>\n",
       "      <td>Latuda</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      term  start_offset  end_offset concept_id concept_primary_name  \\\n",
       "0  related             7          13    9125985              Related   \n",
       "1   nausea             0           5    4993818               nausea   \n",
       "2   latuda            18          23    8815455               Latuda   \n",
       "\n",
       "   match_confidence  is_full_match  \n",
       "0               1.0           True  \n",
       "1               1.0           True  \n",
       "2               1.0           True  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = find_matched_terms(sample_queries[2], A, cid2cfn)\n",
    "\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df.head(len(results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>pheochromocytoma</td>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "      <td>2791864</td>\n",
       "      <td>pheochromocytoma</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>surgery</td>\n",
       "      <td>17</td>\n",
       "      <td>23</td>\n",
       "      <td>5344477</td>\n",
       "      <td>surgical procedure</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>mitral</td>\n",
       "      <td>25</td>\n",
       "      <td>30</td>\n",
       "      <td>9786428</td>\n",
       "      <td>mitral</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               term  start_offset  end_offset concept_id concept_primary_name  \\\n",
       "0  pheochromocytoma             0          15    2791864     pheochromocytoma   \n",
       "1           surgery            17          23    5344477   surgical procedure   \n",
       "2            mitral            25          30    9786428               mitral   \n",
       "\n",
       "   match_confidence  is_full_match  \n",
       "0               1.0           True  \n",
       "1               1.0           True  \n",
       "2               1.0           True  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = find_matched_terms(sample_queries[3], A, cid2cfn)\n",
    "\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df.head(len(results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>osteochondral</td>\n",
       "      <td>6</td>\n",
       "      <td>18</td>\n",
       "      <td>9787545</td>\n",
       "      <td>osteochondral plate</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>injury</td>\n",
       "      <td>20</td>\n",
       "      <td>25</td>\n",
       "      <td>8109859</td>\n",
       "      <td>injury</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>talus</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>8002610</td>\n",
       "      <td>talus</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            term  start_offset  end_offset concept_id concept_primary_name  \\\n",
       "0  osteochondral             6          18    9787545  osteochondral plate   \n",
       "1         injury            20          25    8109859               injury   \n",
       "2          talus             0           4    8002610                talus   \n",
       "\n",
       "   match_confidence  is_full_match  \n",
       "0               1.0           True  \n",
       "1               1.0           True  \n",
       "2               1.0           True  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = find_matched_terms(sample_queries[4], A, cid2cfn)\n",
    "\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df.head(len(results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>tuberculosis pulmonary</td>\n",
       "      <td>0</td>\n",
       "      <td>21</td>\n",
       "      <td>8107493</td>\n",
       "      <td>pulmonary tuberculosis</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     term  start_offset  end_offset concept_id  \\\n",
       "0  tuberculosis pulmonary             0          21    8107493   \n",
       "\n",
       "     concept_primary_name  match_confidence  is_full_match  \n",
       "0  pulmonary tuberculosis               1.0           True  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = find_matched_terms(sample_queries[5], A, cid2cfn)\n",
    "\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df.head(len(results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CT scan chest</td>\n",
       "      <td>13</td>\n",
       "      <td>32</td>\n",
       "      <td>8109536</td>\n",
       "      <td>CT of chest</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            term  start_offset  end_offset concept_id concept_primary_name  \\\n",
       "0  CT scan chest            13          32    8109536          CT of chest   \n",
       "\n",
       "   match_confidence  is_full_match  \n",
       "0               1.0           True  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = find_matched_terms(sample_queries[6], A, cid2cfn, span_slop=10)\n",
    "\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df.head(len(results))"
   ]
  },
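  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples of text annotation\n",
    "\n",
    "The examples below annotate two longer passages about tuberculosis. Each passage is first tokenized with spaCy and the token texts are joined with single spaces, so the reported `start_offset` and `end_offset` values refer to that tokenized string rather than to the raw text."
   ]
  },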
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_texts = [\n",
    "    \"\"\"To halt the global tuberculosis epidemic, transmission must be stopped to prevent new infections \n",
    "       and new cases. Identification of individuals with tuberculosis and prompt initiation of effective \n",
    "       treatment to rapidly render them non-infectious is crucial to this task. However, in settings of \n",
    "       high tuberculosis burden, active case-finding is often not implemented, resulting in long delays \n",
    "       in diagnosis and treatment. A range of strategies to find cases and ensure prompt and correct \n",
    "       treatment have been shown to be effective in high tuberculosis-burden settings. The population\n",
    "       level effect of targeted active case-finding on reducing tuberculosis incidence has been shown by \n",
    "       studies and projected by mathematical modelling. The inclusion of targeted active case-finding in \n",
    "       a comprehensive epidemic-control strategy for tuberculosis should contribute substantially to a \n",
    "       decrease in tuberculosis incidence.\"\"\",\n",
    "    \"\"\"Infection with Mycobacterium tuberculosis remains a major cause of morbidity and mortality all \n",
    "       over the world. Since the effectiveness of the only available tuberculosis vaccine, Mycobacterium \n",
    "       bovis bacillus Calmette-Guérin (BCG), is suboptimal, there is a strong demand to develop new \n",
    "       tuberculosis vaccines. As tuberculosis is an airborne disease, the intranasal route of vaccination \n",
    "       might be preferable. Live influenza virus vaccines might be considered as potential vectors for \n",
    "       mucosal immunization against various viral or bacterial pathogens, including M. tuberculosis. \n",
    "       We generated several subtypes of attenuated recombinant influenza A viruses expressing the 6-kDa \n",
    "       early secretory antigenic target protein (ESAT-6) of M. tuberculosis from the NS1 reading frame. \n",
    "       We were able to demonstrate the potency of influenza virus NS vectors to induce an \n",
    "       M. tuberculosisspecific Th1 immune response in mice. Moreover, intranasal immunization of mice and \n",
    "       guinea pigs with such vectors induced protection against mycobacterial challenge, similar to that \n",
    "       induced by BCG vaccination.\"\"\"\n",
    "]"
   ]
  },
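  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The \"en\" shortcut was removed in spaCy 3; this assumes the en_core_web_sm model is installed.\n",
    "nlp = spacy.load(\"en_core_web_sm\")\n",
    "\n",
    "def prepare_text(text):\n",
    "    \"\"\"Tokenize text with spaCy and join the token texts with single spaces.\"\"\"\n",
    "    doc = nlp(text)\n",
    "    return \" \".join([token.text for token in doc])"
   ]
  },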
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>identification</td>\n",
       "      <td>123</td>\n",
       "      <td>136</td>\n",
       "      <td>8878533</td>\n",
       "      <td>Identification - mental defense mechanism</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>comprehensive</td>\n",
       "      <td>867</td>\n",
       "      <td>879</td>\n",
       "      <td>8119030</td>\n",
       "      <td>Comprehension</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>tuberculosis</td>\n",
       "      <td>984</td>\n",
       "      <td>995</td>\n",
       "      <td>2792173</td>\n",
       "      <td>tuberculosis</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>transmission</td>\n",
       "      <td>43</td>\n",
       "      <td>54</td>\n",
       "      <td>8110814</td>\n",
       "      <td>disease transmission</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>initiation</td>\n",
       "      <td>182</td>\n",
       "      <td>191</td>\n",
       "      <td>8903356</td>\n",
       "      <td>initiation</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>population</td>\n",
       "      <td>626</td>\n",
       "      <td>635</td>\n",
       "      <td>9724683</td>\n",
       "      <td>geographic population</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>infectious</td>\n",
       "      <td>254</td>\n",
       "      <td>263</td>\n",
       "      <td>9786858</td>\n",
       "      <td>infectious</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>contribute</td>\n",
       "      <td>933</td>\n",
       "      <td>942</td>\n",
       "      <td>9792999</td>\n",
       "      <td>contribution</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>treatment</td>\n",
       "      <td>539</td>\n",
       "      <td>547</td>\n",
       "      <td>5216597</td>\n",
       "      <td>therapeutic procedure</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>diagnosis</td>\n",
       "      <td>438</td>\n",
       "      <td>446</td>\n",
       "      <td>5304448</td>\n",
       "      <td>diagnosis</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>incidence</td>\n",
       "      <td>997</td>\n",
       "      <td>1005</td>\n",
       "      <td>9203317</td>\n",
       "      <td>incidence</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>strategy</td>\n",
       "      <td>900</td>\n",
       "      <td>907</td>\n",
       "      <td>8110336</td>\n",
       "      <td>strategy</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>epidemic</td>\n",
       "      <td>881</td>\n",
       "      <td>888</td>\n",
       "      <td>8113048</td>\n",
       "      <td>Epidemic</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>decrease</td>\n",
       "      <td>972</td>\n",
       "      <td>979</td>\n",
       "      <td>8864047</td>\n",
       "      <td>decrease</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>reducing</td>\n",
       "      <td>696</td>\n",
       "      <td>703</td>\n",
       "      <td>9787037</td>\n",
       "      <td>reduction</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>control</td>\n",
       "      <td>892</td>\n",
       "      <td>898</td>\n",
       "      <td>7987974</td>\n",
       "      <td>Control</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>studies case control</td>\n",
       "      <td>755</td>\n",
       "      <td>761</td>\n",
       "      <td>8110199</td>\n",
       "      <td>case control study</td>\n",
       "      <td>0.5</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>finding</td>\n",
       "      <td>845</td>\n",
       "      <td>851</td>\n",
       "      <td>8114924</td>\n",
       "      <td>physical finding</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>prevent</td>\n",
       "      <td>75</td>\n",
       "      <td>81</td>\n",
       "      <td>8120170</td>\n",
       "      <td>prevent</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>rapidly</td>\n",
       "      <td>228</td>\n",
       "      <td>234</td>\n",
       "      <td>9073025</td>\n",
       "      <td>Rapidly (qualifier value)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>crucial</td>\n",
       "      <td>268</td>\n",
       "      <td>274</td>\n",
       "      <td>9790737</td>\n",
       "      <td>critical</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>global</td>\n",
       "      <td>12</td>\n",
       "      <td>17</td>\n",
       "      <td>5354974</td>\n",
       "      <td>generalized</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>active</td>\n",
       "      <td>831</td>\n",
       "      <td>836</td>\n",
       "      <td>8208927</td>\n",
       "      <td>Active</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>ensure</td>\n",
       "      <td>504</td>\n",
       "      <td>509</td>\n",
       "      <td>9150977</td>\n",
       "      <td>Ensure</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>prompt</td>\n",
       "      <td>511</td>\n",
       "      <td>516</td>\n",
       "      <td>9757941</td>\n",
       "      <td>Prompt</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>burden</td>\n",
       "      <td>604</td>\n",
       "      <td>609</td>\n",
       "      <td>9792062</td>\n",
       "      <td>burden</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>level</td>\n",
       "      <td>646</td>\n",
       "      <td>650</td>\n",
       "      <td>8863573</td>\n",
       "      <td>level</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>range</td>\n",
       "      <td>466</td>\n",
       "      <td>470</td>\n",
       "      <td>9790498</td>\n",
       "      <td>range</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>high</td>\n",
       "      <td>584</td>\n",
       "      <td>587</td>\n",
       "      <td>9723548</td>\n",
       "      <td>High</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>case</td>\n",
       "      <td>838</td>\n",
       "      <td>841</td>\n",
       "      <td>9792560</td>\n",
       "      <td>case</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>new</td>\n",
       "      <td>111</td>\n",
       "      <td>113</td>\n",
       "      <td>9786819</td>\n",
       "      <td>novel</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    term  start_offset  end_offset concept_id  \\\n",
       "0         identification           123         136    8878533   \n",
       "1          comprehensive           867         879    8119030   \n",
       "2           tuberculosis           984         995    2792173   \n",
       "3           transmission            43          54    8110814   \n",
       "4             initiation           182         191    8903356   \n",
       "5             population           626         635    9724683   \n",
       "6             infectious           254         263    9786858   \n",
       "7             contribute           933         942    9792999   \n",
       "8              treatment           539         547    5216597   \n",
       "9              diagnosis           438         446    5304448   \n",
       "10             incidence           997        1005    9203317   \n",
       "11              strategy           900         907    8110336   \n",
       "12              epidemic           881         888    8113048   \n",
       "13              decrease           972         979    8864047   \n",
       "14              reducing           696         703    9787037   \n",
       "15               control           892         898    7987974   \n",
       "16  studies case control           755         761    8110199   \n",
       "17               finding           845         851    8114924   \n",
       "18               prevent            75          81    8120170   \n",
       "19               rapidly           228         234    9073025   \n",
       "20               crucial           268         274    9790737   \n",
       "21                global            12          17    5354974   \n",
       "22                active           831         836    8208927   \n",
       "23                ensure           504         509    9150977   \n",
       "24                prompt           511         516    9757941   \n",
       "25                burden           604         609    9792062   \n",
       "26                 level           646         650    8863573   \n",
       "27                 range           466         470    9790498   \n",
       "28                  high           584         587    9723548   \n",
       "29                  case           838         841    9792560   \n",
       "30                   new           111         113    9786819   \n",
       "\n",
       "                         concept_primary_name  match_confidence  is_full_match  \n",
       "0   Identification - mental defense mechanism               1.0           True  \n",
       "1                               Comprehension               1.0           True  \n",
       "2                                tuberculosis               1.0           True  \n",
       "3                        disease transmission               1.0           True  \n",
       "4                                  initiation               1.0           True  \n",
       "5                       geographic population               1.0           True  \n",
       "6                                  infectious               1.0           True  \n",
       "7                                contribution               1.0           True  \n",
       "8                       therapeutic procedure               1.0           True  \n",
       "9                                   diagnosis               1.0           True  \n",
       "10                                  incidence               1.0           True  \n",
       "11                                   strategy               1.0           True  \n",
       "12                                   Epidemic               1.0           True  \n",
       "13                                   decrease               1.0           True  \n",
       "14                                  reduction               1.0           True  \n",
       "15                                    Control               1.0           True  \n",
       "16                         case control study               0.5          False  \n",
       "17                           physical finding               1.0           True  \n",
       "18                                    prevent               1.0           True  \n",
       "19                  Rapidly (qualifier value)               1.0           True  \n",
       "20                                   critical               1.0           True  \n",
       "21                                generalized               1.0           True  \n",
       "22                                     Active               1.0           True  \n",
       "23                                     Ensure               1.0           True  \n",
       "24                                     Prompt               1.0           True  \n",
       "25                                     burden               1.0           True  \n",
       "26                                      level               1.0           True  \n",
       "27                                      range               1.0           True  \n",
       "28                                       High               1.0           True  \n",
       "29                                       case               1.0           True  \n",
       "30                                      novel               1.0           True  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results = find_matched_terms(prepare_text(sample_texts[0]), A, cid2cfn, span_slop=30)\n",
    "\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df.head(len(results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>term</th>\n",
       "      <th>start_offset</th>\n",
       "      <th>end_offset</th>\n",
       "      <th>concept_id</th>\n",
       "      <th>concept_primary_name</th>\n",
       "      <th>match_confidence</th>\n",
       "      <th>is_full_match</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>mycobacterium bovis</td>\n",
       "      <td>190</td>\n",
       "      <td>217</td>\n",
       "      <td>8121290</td>\n",
       "      <td>Mycobacterium bovis</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>live vaccine influenza virus intranasal</td>\n",
       "      <td>453</td>\n",
       "      <td>480</td>\n",
       "      <td>8904228</td>\n",
       "      <td>intranasal influenza live virus vaccine</td>\n",
       "      <td>0.5</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>infection influenza virus</td>\n",
       "      <td>900</td>\n",
       "      <td>914</td>\n",
       "      <td>2793084</td>\n",
       "      <td>influenza</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>immune response</td>\n",
       "      <td>977</td>\n",
       "      <td>991</td>\n",
       "      <td>8107682</td>\n",
       "      <td>immune response</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>M. tuberculosis</td>\n",
       "      <td>949</td>\n",
       "      <td>963</td>\n",
       "      <td>8121775</td>\n",
       "      <td>Mycobacterium tuberculosis</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>BCG vaccination</td>\n",
       "      <td>1178</td>\n",
       "      <td>1192</td>\n",
       "      <td>9725496</td>\n",
       "      <td>vaccination against tuberculosis</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>effectiveness</td>\n",
       "      <td>131</td>\n",
       "      <td>143</td>\n",
       "      <td>8866302</td>\n",
       "      <td>effectiveness</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>reading frame</td>\n",
       "      <td>832</td>\n",
       "      <td>844</td>\n",
       "      <td>9275032</td>\n",
       "      <td>Reading Frames</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>immunization</td>\n",
       "      <td>1025</td>\n",
       "      <td>1036</td>\n",
       "      <td>8107909</td>\n",
       "      <td>immunization</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>attenuated</td>\n",
       "      <td>675</td>\n",
       "      <td>684</td>\n",
       "      <td>8928394</td>\n",
       "      <td>Attenuated by (contextual qualifier) (qualifie...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>intranasal</td>\n",
       "      <td>1014</td>\n",
       "      <td>1023</td>\n",
       "      <td>9783132</td>\n",
       "      <td>intranasal</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>protection</td>\n",
       "      <td>1097</td>\n",
       "      <td>1106</td>\n",
       "      <td>9791409</td>\n",
       "      <td>protection</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>morbidity</td>\n",
       "      <td>67</td>\n",
       "      <td>75</td>\n",
       "      <td>8113345</td>\n",
       "      <td>Morbidity - disease rate</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>bacterial</td>\n",
       "      <td>1120</td>\n",
       "      <td>1128</td>\n",
       "      <td>8116544</td>\n",
       "      <td>bacterium</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>infection</td>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "      <td>8816100</td>\n",
       "      <td>infection</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>available</td>\n",
       "      <td>157</td>\n",
       "      <td>165</td>\n",
       "      <td>8825474</td>\n",
       "      <td>availability of</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>including</td>\n",
       "      <td>605</td>\n",
       "      <td>613</td>\n",
       "      <td>8954471</td>\n",
       "      <td>Including</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>mortality</td>\n",
       "      <td>81</td>\n",
       "      <td>89</td>\n",
       "      <td>9322149</td>\n",
       "      <td>mortality</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>secretory</td>\n",
       "      <td>754</td>\n",
       "      <td>762</td>\n",
       "      <td>9344807</td>\n",
       "      <td>secretory process</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>challenge</td>\n",
       "      <td>1130</td>\n",
       "      <td>1138</td>\n",
       "      <td>9792057</td>\n",
       "      <td>challenge</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>potential</td>\n",
       "      <td>506</td>\n",
       "      <td>514</td>\n",
       "      <td>9792568</td>\n",
       "      <td>potential</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>bacillus</td>\n",
       "      <td>219</td>\n",
       "      <td>226</td>\n",
       "      <td>8107869</td>\n",
       "      <td>Bacillus</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>airborne</td>\n",
       "      <td>367</td>\n",
       "      <td>374</td>\n",
       "      <td>9793999</td>\n",
       "      <td>airborne</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>disease</td>\n",
       "      <td>376</td>\n",
       "      <td>382</td>\n",
       "      <td>2795416</td>\n",
       "      <td>disease</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>mucosal</td>\n",
       "      <td>537</td>\n",
       "      <td>543</td>\n",
       "      <td>8001644</td>\n",
       "      <td>Mucous Membrane</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>protein</td>\n",
       "      <td>781</td>\n",
       "      <td>787</td>\n",
       "      <td>8106247</td>\n",
       "      <td>protein</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>viruses</td>\n",
       "      <td>710</td>\n",
       "      <td>716</td>\n",
       "      <td>8116064</td>\n",
       "      <td>virus</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>induced</td>\n",
       "      <td>1167</td>\n",
       "      <td>1173</td>\n",
       "      <td>8923304</td>\n",
       "      <td>induced</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>potency</td>\n",
       "      <td>889</td>\n",
       "      <td>895</td>\n",
       "      <td>9792597</td>\n",
       "      <td>potency</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>target</td>\n",
       "      <td>774</td>\n",
       "      <td>779</td>\n",
       "      <td>5352400</td>\n",
       "      <td>goal</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>strong</td>\n",
       "      <td>283</td>\n",
       "      <td>288</td>\n",
       "      <td>8863756</td>\n",
       "      <td>strong</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>demand</td>\n",
       "      <td>290</td>\n",
       "      <td>295</td>\n",
       "      <td>9793675</td>\n",
       "      <td>demand</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>viral</td>\n",
       "      <td>574</td>\n",
       "      <td>578</td>\n",
       "      <td>8116064</td>\n",
       "      <td>virus</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>route</td>\n",
       "      <td>401</td>\n",
       "      <td>405</td>\n",
       "      <td>8861830</td>\n",
       "      <td>route</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>major</td>\n",
       "      <td>52</td>\n",
       "      <td>56</td>\n",
       "      <td>8864136</td>\n",
       "      <td>major</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>early</td>\n",
       "      <td>748</td>\n",
       "      <td>752</td>\n",
       "      <td>9790417</td>\n",
       "      <td>early</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>cause</td>\n",
       "      <td>58</td>\n",
       "      <td>62</td>\n",
       "      <td>9790538</td>\n",
       "      <td>cause</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>able</td>\n",
       "      <td>865</td>\n",
       "      <td>868</td>\n",
       "      <td>9061928</td>\n",
       "      <td>Able</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>mice</td>\n",
       "      <td>1041</td>\n",
       "      <td>1044</td>\n",
       "      <td>9790284</td>\n",
       "      <td>mouse</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>new</td>\n",
       "      <td>308</td>\n",
       "      <td>310</td>\n",
       "      <td>9786819</td>\n",
       "      <td>novel</td>\n",
       "      <td>1.0</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                       term  start_offset  end_offset  \\\n",
       "0                       mycobacterium bovis           190         217   \n",
       "1   live vaccine influenza virus intranasal           453         480   \n",
       "2                 infection influenza virus           900         914   \n",
       "3                           immune response           977         991   \n",
       "4                           M. tuberculosis           949         963   \n",
       "5                           BCG vaccination          1178        1192   \n",
       "6                             effectiveness           131         143   \n",
       "7                             reading frame           832         844   \n",
       "8                              immunization          1025        1036   \n",
       "9                                attenuated           675         684   \n",
       "10                               intranasal          1014        1023   \n",
       "11                               protection          1097        1106   \n",
       "12                                morbidity            67          75   \n",
       "13                                bacterial          1120        1128   \n",
       "14                                infection             0           8   \n",
       "15                                available           157         165   \n",
       "16                                including           605         613   \n",
       "17                                mortality            81          89   \n",
       "18                                secretory           754         762   \n",
       "19                                challenge          1130        1138   \n",
       "20                                potential           506         514   \n",
       "21                                 bacillus           219         226   \n",
       "22                                 airborne           367         374   \n",
       "23                                  disease           376         382   \n",
       "24                                  mucosal           537         543   \n",
       "25                                  protein           781         787   \n",
       "26                                  viruses           710         716   \n",
       "27                                  induced          1167        1173   \n",
       "28                                  potency           889         895   \n",
       "29                                   target           774         779   \n",
       "30                                   strong           283         288   \n",
       "31                                   demand           290         295   \n",
       "32                                    viral           574         578   \n",
       "33                                    route           401         405   \n",
       "34                                    major            52          56   \n",
       "35                                    early           748         752   \n",
       "36                                    cause            58          62   \n",
       "37                                     able           865         868   \n",
       "38                                     mice          1041        1044   \n",
       "39                                      new           308         310   \n",
       "\n",
       "   concept_id                               concept_primary_name  \\\n",
       "0     8121290                                Mycobacterium bovis   \n",
       "1     8904228            intranasal influenza live virus vaccine   \n",
       "2     2793084                                          influenza   \n",
       "3     8107682                                    immune response   \n",
       "4     8121775                         Mycobacterium tuberculosis   \n",
       "5     9725496                   vaccination against tuberculosis   \n",
       "6     8866302                                      effectiveness   \n",
       "7     9275032                                     Reading Frames   \n",
       "8     8107909                                       immunization   \n",
       "9     8928394  Attenuated by (contextual qualifier) (qualifie...   \n",
       "10    9783132                                         intranasal   \n",
       "11    9791409                                         protection   \n",
       "12    8113345                           Morbidity - disease rate   \n",
       "13    8116544                                          bacterium   \n",
       "14    8816100                                          infection   \n",
       "15    8825474                                    availability of   \n",
       "16    8954471                                          Including   \n",
       "17    9322149                                          mortality   \n",
       "18    9344807                                  secretory process   \n",
       "19    9792057                                          challenge   \n",
       "20    9792568                                          potential   \n",
       "21    8107869                                           Bacillus   \n",
       "22    9793999                                           airborne   \n",
       "23    2795416                                            disease   \n",
       "24    8001644                                    Mucous Membrane   \n",
       "25    8106247                                            protein   \n",
       "26    8116064                                              virus   \n",
       "27    8923304                                            induced   \n",
       "28    9792597                                            potency   \n",
       "29    5352400                                               goal   \n",
       "30    8863756                                             strong   \n",
       "31    9793675                                             demand   \n",
       "32    8116064                                              virus   \n",
       "33    8861830                                              route   \n",
       "34    8864136                                              major   \n",
       "35    9790417                                              early   \n",
       "36    9790538                                              cause   \n",
       "37    9061928                                               Able   \n",
       "38    9790284                                              mouse   \n",
       "39    9786819                                              novel   \n",
       "\n",
       "    match_confidence  is_full_match  \n",
       "0                1.0           True  \n",
       "1                0.5          False  \n",
       "2                1.0           True  \n",
       "3                1.0           True  \n",
       "4                1.0           True  \n",
       "5                1.0           True  \n",
       "6                1.0           True  \n",
       "7                1.0           True  \n",
       "8                1.0           True  \n",
       "9                1.0           True  \n",
       "10               1.0           True  \n",
       "11               1.0           True  \n",
       "12               1.0           True  \n",
       "13               1.0           True  \n",
       "14               1.0           True  \n",
       "15               1.0           True  \n",
       "16               1.0           True  \n",
       "17               1.0           True  \n",
       "18               1.0           True  \n",
       "19               1.0           True  \n",
       "20               1.0           True  \n",
       "21               1.0           True  \n",
       "22               1.0           True  \n",
       "23               1.0           True  \n",
       "24               1.0           True  \n",
       "25               1.0           True  \n",
       "26               1.0           True  \n",
       "27               1.0           True  \n",
       "28               1.0           True  \n",
       "29               1.0           True  \n",
       "30               1.0           True  \n",
       "31               1.0           True  \n",
       "32               1.0           True  \n",
       "33               1.0           True  \n",
       "34               1.0           True  \n",
       "35               1.0           True  \n",
       "36               1.0           True  \n",
       "37               1.0           True  \n",
       "38               1.0           True  \n",
       "39               1.0           True  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
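    "# Annotate the second sample text (index 1), calling find_matched_terms with span_slop=30.\n",
    "results = find_matched_terms(prepare_text(sample_texts[1]), A, cid2cfn, span_slop=30)\n",
    "\n",
    "# Show every matched term with its offsets, concept, and match confidence.\n",
    "results_df = pd.DataFrame(results, columns=result_columns)\n",
    "results_df"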
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}