Code demonstrating building and querying an Aho-Corasick FSM for inexact search
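
As a rough illustration of what building and querying such an automaton involves, here is a minimal pure-Python sketch. It is not the notebook's implementation: the notebook relies on the ahocorasick package, whereas the helper names build_automaton and iter_matches, the concept IDs, and the weights below are all hypothetical and chosen only for this example.

```python
from collections import deque


def build_automaton(payloads):
    """Build a toy Aho-Corasick automaton from a {word: payload} dict."""
    goto = [{}]   # goto[state][char] -> next state (the trie)
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> dictionary words ending at this state
    for word in payloads:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def iter_matches(automaton, payloads, text):
    """Yield (end_index, (word, payload)) for each dictionary word found in text."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            yield i, (word, payloads[word])


# Hypothetical inverted index: word -> [(concept_id:synonym_id, weight)].
term2props = {
    "lung": [("C1:0", 1.0), ("C3:0", 0.5)],
    "cancer": [("C2:0", 1.0), ("C3:0", 0.5)],
}
automaton = build_automaton(term2props)

matches, scores = [], {}
for end_index, (word, props) in iter_matches(automaton, term2props, "lung cancer"):
    matches.append((end_index, word))
    for csid, weight in props:
        # Accumulate the weight contributed by each matched word.
        scores[csid] = scores.get(csid, 0.0) + weight

print(matches)
print(scores)
```

Summing the weights per concept_id:synonym_id key is the same idea as the confidence score computed in the notebook: the two-word synonym C3:0 only reaches a full score when both of its words are found in the query.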
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Inexact Matching with Aho-Corasick\n",
"\n",
"The algorithm described in this notebook is based partially on ideas from the paper [Efficient Clinical Concept Extraction in Electronic Medical Records](https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14794/14029) (Guo, Kakrania, Baldwin, and Syeda-Mahmood, 2017). The paper describes their novel indexing method for concept extraction as follows:\n",
"\n",
"> To speed up the processing, we propose a novel indexing method that significantly reduces the search space while\n",
"still maintaining the requisite flexibility in matching. First, each word in the vocabulary is represented by a unique prefix, the shortest sequence of letters that differentiates it from every other term. Next, an inverted index is created for the mapping from prefixes to report sentences. Starting from the representative prefix of each term in the vocabulary (or a set of prefixes in the case of a multi-word term), all relevant sentences can be easily retrieved as potential matches for the term, and post-filtering by longest common word sequence matching can be used to further refine the search results.\n",
"\n",
"The test case mentioned in the paper as an example of the functionality of this concept extraction technique is matching the concept for `Chest CT` from the phrase `a subsequent CT scan of the chest`.\n",
"\n",
"The method described in this notebook uses the individual words of the concept names in the concept dictionary as keys to an Aho-Corasick automaton. The payload for each word is its inverted index entry: a list of ((`concept_id`, `synonym_id`), `term_weight`) pairs. Matches against the text are then post-processed to keep the high-confidence ones, and merged to find concept annotations for multi-term phrases in the text.\n",
"\n",
"The technique is applicable to both short text such as queries, and longer text bodies, such as paragraphs."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import ahocorasick\n",
"import numpy as np\n",
"import operator\n",
"import os\n",
"import pandas as pd\n",
"import re\n",
"import spacy\n",
"\n",
"from spacy.lang.en import stop_words as sw"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"DATA_DIR = \"../data\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Indexing\n",
"\n",
"We will first create an inverted index and then load it into an Aho-Corasick automaton."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"stop_words = sw.STOP_WORDS\n",
"def is_stop_word(w):\n",
"    return w.lower() in stop_words\n",
"\n",
"non_alpha_re = re.compile(\"^[^a-zA-Z0-9].*$\")\n",
"def is_non_alpha(w):\n",
"    return re.match(non_alpha_re, w) is not None\n",
"\n",
"def is_all_caps(w):\n",
"    return w.upper() == w\n",
"\n",
"\n",
"assert(is_stop_word(\"is\"))\n",
"assert(is_non_alpha(\"-\"))\n",
"assert(is_all_caps(\"AIDS\"))\n",
"assert(not is_all_caps(\"hearing\"))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 0 concepts\n",
"Loaded 100000 concepts\n",
"Loaded 200000 concepts\n",
"Loaded 300000 concepts\n",
"Loaded 400000 concepts\n",
"Loaded 500000 concepts\n",
"Loaded 581733 terms, COMPLETE\n",
"# of concepts: 581733\n"
]
}
],
"source": [
"def build_inexact_match_dictionary(file_path):\n",
"    cid2cfn, term2props = {}, {}\n",
"\n",
"    # Prepare the data. Since we are building an Aho-Corasick data structure keyed\n",
"    # by individual words in the concept names, there is a greater chance of collision\n",
"    # between terms belonging to multiple concepts. To get around that, we\n",
"    # preprocess the terms into an inverted index mapping each term to a list of\n",
"    # (concept_id:synonym_id, weight) pairs. Here concept_id (CID) is the ID of the\n",
"    # concept the synonym belongs to, and synonym_id (SID) is the sequence number\n",
"    # of the synonym. The weight is the reciprocal of the number of space-separated\n",
"    # terms in the synonym (len(terms), stop words included), so a match against a\n",
"    # term in a short synonym is \"better\" than one against a term in a long synonym.\n",
"    num_loaded = 0\n",
"    fvert = open(file_path, \"r\")\n",
"    for line in fvert:\n",
"        if num_loaded % 100000 == 0:\n",
"            print(\"Loaded {:d} concepts\".format(num_loaded))\n",
"        cid, syns = line.strip().split('\\t')\n",
"        for sid, syn in enumerate(syns.split('|')):\n",
"            if sid == 0:\n",
"                cid2cfn[cid] = syn\n",
"            terms = syn.split(' ')\n",
"            matchable_terms = []\n",
"            for term in syn.split(' '):\n",
"                if is_stop_word(term) or is_non_alpha(term):\n",
"                    continue\n",
"                if not is_all_caps(term):\n",
"                    term = term.lower()\n",
"                matchable_terms.append(term)\n",
"            key = \":\".join([cid, str(sid)])\n",
"            weight = 1 / len(terms)\n",
"            for term in matchable_terms:\n",
"                if term not in term2props.keys():\n",
"                    term2props[term] = [(key, weight)]\n",
"                else:\n",
"                    term2props[term].append((key, weight))\n",
"        num_loaded += 1\n",
"\n",
"    print(\"Loaded {:d} terms, COMPLETE\".format(num_loaded))\n",
"    fvert.close()\n",
"\n",
"    # load up the Aho-Corasick automaton\n",
"    A = ahocorasick.Automaton()\n",
"\n",
"    for term in term2props.keys():\n",
"        props = term2props[term]\n",
"        A.add_word(term, (term, props))\n",
"\n",
"    A.make_automaton()\n",
"\n",
"    # return the cid2cfn dictionary and the automaton\n",
"    return cid2cfn, A\n",
"\n",
"\n",
"cid2cfn, A = build_inexact_match_dictionary(os.path.join(DATA_DIR, \"vertices.txt\"))\n",
"print(\"# of concepts: {:d}\".format(len(cid2cfn)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Annotation"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def find_matched_terms(query, A, cid2cfn, confidence_threshold=0.85, span_slop=5):\n",
"\n",
"    # normalize input query to match terms in dictionary\n",
"    query_terms = [(term if is_all_caps(term) else term.lower()) for term in query.split(' ')]\n",
"    # construct set of matchable terms (minus stopwords and non-alpha words)\n",
"    matchable_terms = set([term for term in query_terms if not(is_stop_word(term) or is_non_alpha(term))])\n",
"    query = \" \".join(query_terms)\n",
"    query_results = []\n",
"    for end_index, (term, props) in A.iter(query):\n",
"        if term in matchable_terms:\n",
"            start_index = end_index - len(term) + 1\n",
"            query_results.append((start_index, end_index, term, props))\n",
"\n",
"    # at this stage, Aho-Corasick provides ((start_offset, end_offset, term, props=[(csid, weight)]))\n",
"    # for each matched term in query\n",
"    # Next we construct a matrix of weights for each term (row) for each matched cid:sid (col)\n",
"    num_rows = len(query_results)\n",
"    cols = []\n",
"    for row in query_results:\n",
"        _, _, term, props = row\n",
"        cols.extend([p[0] for p in props])\n",
"    cols = set(cols)\n",
"    num_cols = len(cols)\n",
"    W = np.zeros((num_rows, num_cols))\n",
"\n",
"    # construct lookup indices\n",
"    csid2col, col2csid, term2row, row2term = {}, {}, {}, {}\n",
"\n",
"    e_rows = enumerate([r[2] for r in query_results])\n",
"    term2row = {x:i for i, x in e_rows}\n",
"    row2term = {v:k for (k, v) in term2row.items()}\n",
"\n",
"    e_cols = enumerate(sorted(list(cols)))\n",
"    csid2col = {x:i for i, x in e_cols}\n",
"    col2csid = {v:k for (k, v) in csid2col.items()}\n",
"\n",
"    # populate weights matrix W\n",
"    for row in query_results:\n",
"        start_index, end_index, term, props = row\n",
"        i = term2row[term]\n",
"        for k, w in props:\n",
"            j = csid2col[k]\n",
"            W[i, j] = w\n",
"\n",
"    # compute the confidence score for each cid:sid (CSID) by summing each column of W\n",
"    row_sum = np.sum(W, axis=0)\n",
"    # candidate CSIDs are those whose confidence score is above threshold\n",
"    candidate_weights, candidate_col_ids = [], []\n",
"    for j in range(W.shape[1]):\n",
"        if row_sum[j] > confidence_threshold:\n",
"            candidate_col_ids.append(j)\n",
"            candidate_weights.append(W[:, j])\n",
"    C = np.array(candidate_weights).T\n",
"\n",
"    # merge offsets for each candidate. In case there are multiple entries down the column,\n",
"    # these signal that multiple terms have mapped to a single concept, and we need to \n",
"    # merge the offsets accordingly. In addition, for non-contiguous matched spans, we \n",
"    # need to decide if we should treat the span as continuous or pick the longest non-\n",
"    # contiguous span.\n",
"    offsets = [(s, e) for s, e, t, p in query_results]\n",
"    terms = [t for s, e, t, p in query_results]\n",
"\n",
"    matched_concepts = []\n",
"\n",
"    for j in range(C.shape[1]):\n",
"        is_full_match = True\n",
"        csid = col2csid[candidate_col_ids[j]]\n",
"        score = np.sum(C[:, j])\n",
"        cs_row_ids = np.where(C[:, j] > 0)[0]\n",
"\n",
"        term = \" \".join([terms[i] for i in cs_row_ids])\n",
"        cs_offsets = [offsets[i] for i in cs_row_ids]\n",
"        if len(cs_offsets) > 1:\n",
"            spans = []\n",
"            start_offset, end_offset = None, None\n",
"            for i in range(len(cs_offsets) - 1):\n",
"                if start_offset is None:\n",
"                    start_offset = cs_offsets[i][0]\n",
"                if cs_offsets[i+1][0] - cs_offsets[i][1] <= span_slop:\n",
"                    end_offset = cs_offsets[i+1][1]\n",
"                else:\n",
"                    end_offset = cs_offsets[i][1]\n",
"                    spans.append((start_offset, end_offset))\n",
"                    start_offset = None\n",
"                    end_offset = None\n",
"            if len(spans) > 1:\n",
"                # select longest span\n",
"                spans = sorted(spans, key=lambda x: x[1]-x[0], reverse=True)\n",
"                start_offset = spans[0][0]\n",
"                end_offset = spans[0][1]\n",
"                is_full_match = False\n",
"                # discount the score for a partial match; each span is a (start, end)\n",
"                # tuple, so len(span) is always 2 and this factor equals 1 / len(spans)\n",
"                score *= len(spans[0]) / sum([len(span) for span in spans])\n",
"        else:\n",
"            start_offset = cs_offsets[0][0]\n",
"            end_offset = cs_offsets[0][1]\n",
"\n",
"        if start_offset is not None and end_offset is not None:\n",
"            matched_concepts.append([term, start_offset, end_offset, csid, score, is_full_match])\n",
"\n",
"    # we now need to remove spans which are completely subsumed in longer spans. This\n",
"    # prevents a term such as \"lung cancer\" from being mapped separately to the\n",
"    # concepts for \"lung\", \"cancer\", and \"lung cancer\".\n",
"    longest_matched_concepts, covered_spans = [], []\n",
"\n",
"    matched_concepts = sorted(matched_concepts, key=lambda x: x[2]-x[1], reverse=True)\n",
"    for matched_concept in matched_concepts:\n",
"        start_offset, end_offset = matched_concept[1], matched_concept[2]\n",
"        is_subsumed = False\n",
"        for cs_s, cs_e in covered_spans:\n",
"            if start_offset >= cs_s and end_offset <= cs_e:\n",
"                # remove completely subsumed spans from report\n",
"                is_subsumed = True\n",
"                break\n",
"        if is_subsumed:\n",
"            continue\n",
"        longest_matched_concepts.append(matched_concept)\n",
"        covered_spans.append((start_offset, end_offset))\n",
"\n",
"    # finally, we pull in the concept primary name for display purposes\n",
"    matched_concepts_for_display = []\n",
"    for matched_concept in longest_matched_concepts:\n",
"        cid = matched_concept[3].split(':')[0]\n",
"        cfn = cid2cfn[cid]\n",
"        matched_concepts_for_display.append([\n",
"            matched_concept[0], # term\n",
"            matched_concept[1], # start_offset\n",
"            matched_concept[2], # end offset\n",
"            cid, # concept id\n",
"            cfn, # concept primary name\n",
"            matched_concept[4], # score\n",
"            matched_concept[5] # full match\n",
"        ])\n",
"    return matched_concepts_for_display"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examples of Query Annotations"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"sample_queries = [\n",
" \"helicobacter pylori patient perception\",\n",
" \"lyme disease syphilis\",\n",
" \"nausea related to latuda\",\n",
" \"pheochromocytoma surgery mitral\",\n",
" \"talus osteochondral injury\",\n",
" \"tuberculosis pulmonary\",\n",
" \"a subsequent CT scan of the chest\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Baseline\n",
"\n",
"The following concepts were identified by an in-house system used to map search queries to concepts in our medical taxonomy.\n",
"\n",
"1. __helicobacter pylori patient perception__ : Helicobacter pylori (8121505), patient (8111861), perception (8117608).\n",
"2. __lyme disease syphilis__: Lyme disease (2791575), syphilis (2792091).\n",
"3. __nausea related to latuda__: nausea (4993818), Latuda (8815455).\n",
"4. __pheochromocytoma surgery mitral__: pheochromocytoma (2791864), surgical procedure (5344477), mitral (9786428).\n",
"5. __talus osteochondral injury__: talus (8002610), osteochondral plate (9787545), injury (8109859)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"result_columns = [\"term\", \"start_offset\", \"end_offset\", \"concept_id\", \"concept_primary_name\", \n",
" \"match_confidence\", \"is_full_match\"]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>helicobacter pylori</td>\n",
" <td>0</td>\n",
" <td>18</td>\n",
" <td>8121505</td>\n",
" <td>Helicobacter pylori</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>perception</td>\n",
" <td>28</td>\n",
" <td>37</td>\n",
" <td>8117608</td>\n",
" <td>Perception</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>patient</td>\n",
" <td>20</td>\n",
" <td>26</td>\n",
" <td>8111861</td>\n",
" <td>patient</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset concept_id \\\n",
"0 helicobacter pylori 0 18 8121505 \n",
"1 perception 28 37 8117608 \n",
"2 patient 20 26 8111861 \n",
"\n",
" concept_primary_name match_confidence is_full_match \n",
"0 Helicobacter pylori 1.0 True \n",
"1 Perception 1.0 True \n",
"2 patient 1.0 True "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(sample_queries[0], A, cid2cfn)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>lyme disease</td>\n",
" <td>0</td>\n",
" <td>11</td>\n",
" <td>2791575</td>\n",
" <td>Lyme disease</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>syphilis</td>\n",
" <td>13</td>\n",
" <td>20</td>\n",
" <td>2792091</td>\n",
" <td>syphilis</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset concept_id concept_primary_name \\\n",
"0 lyme disease 0 11 2791575 Lyme disease \n",
"1 syphilis 13 20 2792091 syphilis \n",
"\n",
" match_confidence is_full_match \n",
"0 1.0 True \n",
"1 1.0 True "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(sample_queries[1], A, cid2cfn)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>related</td>\n",
" <td>7</td>\n",
" <td>13</td>\n",
" <td>9125985</td>\n",
" <td>Related</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>nausea</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>4993818</td>\n",
" <td>nausea</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>latuda</td>\n",
" <td>18</td>\n",
" <td>23</td>\n",
" <td>8815455</td>\n",
" <td>Latuda</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset concept_id concept_primary_name \\\n",
"0 related 7 13 9125985 Related \n",
"1 nausea 0 5 4993818 nausea \n",
"2 latuda 18 23 8815455 Latuda \n",
"\n",
" match_confidence is_full_match \n",
"0 1.0 True \n",
"1 1.0 True \n",
"2 1.0 True "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(sample_queries[2], A, cid2cfn)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>pheochromocytoma</td>\n",
" <td>0</td>\n",
" <td>15</td>\n",
" <td>2791864</td>\n",
" <td>pheochromocytoma</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>surgery</td>\n",
" <td>17</td>\n",
" <td>23</td>\n",
" <td>5344477</td>\n",
" <td>surgical procedure</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>mitral</td>\n",
" <td>25</td>\n",
" <td>30</td>\n",
" <td>9786428</td>\n",
" <td>mitral</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset concept_id concept_primary_name \\\n",
"0 pheochromocytoma 0 15 2791864 pheochromocytoma \n",
"1 surgery 17 23 5344477 surgical procedure \n",
"2 mitral 25 30 9786428 mitral \n",
"\n",
" match_confidence is_full_match \n",
"0 1.0 True \n",
"1 1.0 True \n",
"2 1.0 True "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(sample_queries[3], A, cid2cfn)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>osteochondral</td>\n",
" <td>6</td>\n",
" <td>18</td>\n",
" <td>9787545</td>\n",
" <td>osteochondral plate</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>injury</td>\n",
" <td>20</td>\n",
" <td>25</td>\n",
" <td>8109859</td>\n",
" <td>injury</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>talus</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" <td>8002610</td>\n",
" <td>talus</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset concept_id concept_primary_name \\\n",
"0 osteochondral 6 18 9787545 osteochondral plate \n",
"1 injury 20 25 8109859 injury \n",
"2 talus 0 4 8002610 talus \n",
"\n",
" match_confidence is_full_match \n",
"0 1.0 True \n",
"1 1.0 True \n",
"2 1.0 True "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(sample_queries[4], A, cid2cfn)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>tuberculosis pulmonary</td>\n",
" <td>0</td>\n",
" <td>21</td>\n",
" <td>8107493</td>\n",
" <td>pulmonary tuberculosis</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset concept_id \\\n",
"0 tuberculosis pulmonary 0 21 8107493 \n",
"\n",
" concept_primary_name match_confidence is_full_match \n",
"0 pulmonary tuberculosis 1.0 True "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(sample_queries[5], A, cid2cfn)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CT scan chest</td>\n",
" <td>13</td>\n",
" <td>32</td>\n",
" <td>8109536</td>\n",
" <td>CT of chest</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset concept_id concept_primary_name \\\n",
"0 CT scan chest 13 32 8109536 CT of chest \n",
"\n",
" match_confidence is_full_match \n",
"0 1.0 True "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(sample_queries[6], A, cid2cfn, span_slop=10)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examples of Text Annotations"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"sample_texts = [\n",
" \"\"\"To halt the global tuberculosis epidemic, transmission must be stopped to prevent new infections \n",
" and new cases. Identification of individuals with tuberculosis and prompt initiation of effective \n",
" treatment to rapidly render them non-infectious is crucial to this task. However, in settings of \n",
" high tuberculosis burden, active case-finding is often not implemented, resulting in long delays \n",
" in diagnosis and treatment. A range of strategies to find cases and ensure prompt and correct \n",
" treatment have been shown to be effective in high tuberculosis-burden settings. The population\n",
" level effect of targeted active case-finding on reducing tuberculosis incidence has been shown by \n",
" studies and projected by mathematical modelling. The inclusion of targeted active case-finding in \n",
" a comprehensive epidemic-control strategy for tuberculosis should contribute substantially to a \n",
" decrease in tuberculosis incidence.\"\"\",\n",
" \"\"\"Infection with Mycobacterium tuberculosis remains a major cause of morbidity and mortality all \n",
" over the world. Since the effectiveness of the only available tuberculosis vaccine, Mycobacterium \n",
" bovis bacillus Calmette-Guérin (BCG), is suboptimal, there is a strong demand to develop new \n",
" tuberculosis vaccines. As tuberculosis is an airborne disease, the intranasal route of vaccination \n",
" might be preferable. Live influenza virus vaccines might be considered as potential vectors for \n",
" mucosal immunization against various viral or bacterial pathogens, including M. tuberculosis. \n",
" We generated several subtypes of attenuated recombinant influenza A viruses expressing the 6-kDa \n",
" early secretory antigenic target protein (ESAT-6) of M. tuberculosis from the NS1 reading frame. \n",
" We were able to demonstrate the potency of influenza virus NS vectors to induce an \n",
" M. tuberculosisspecific Th1 immune response in mice. Moreover, intranasal immunization of mice and \n",
" guinea pigs with such vectors induced protection against mycobacterial challenge, similar to that \n",
" induced by BCG vaccination.\"\"\"\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load(\"en\")\n",
"\n",
"def prepare_text(text):\n",
"    doc = nlp(text)\n",
"    return \" \".join([token.text for token in doc])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>identification</td>\n",
" <td>123</td>\n",
" <td>136</td>\n",
" <td>8878533</td>\n",
" <td>Identification - mental defense mechanism</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>comprehensive</td>\n",
" <td>867</td>\n",
" <td>879</td>\n",
" <td>8119030</td>\n",
" <td>Comprehension</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>tuberculosis</td>\n",
" <td>984</td>\n",
" <td>995</td>\n",
" <td>2792173</td>\n",
" <td>tuberculosis</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>transmission</td>\n",
" <td>43</td>\n",
" <td>54</td>\n",
" <td>8110814</td>\n",
" <td>disease transmission</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>initiation</td>\n",
" <td>182</td>\n",
" <td>191</td>\n",
" <td>8903356</td>\n",
" <td>initiation</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>population</td>\n",
" <td>626</td>\n",
" <td>635</td>\n",
" <td>9724683</td>\n",
" <td>geographic population</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>infectious</td>\n",
" <td>254</td>\n",
" <td>263</td>\n",
" <td>9786858</td>\n",
" <td>infectious</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>contribute</td>\n",
" <td>933</td>\n",
" <td>942</td>\n",
" <td>9792999</td>\n",
" <td>contribution</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>treatment</td>\n",
" <td>539</td>\n",
" <td>547</td>\n",
" <td>5216597</td>\n",
" <td>therapeutic procedure</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>diagnosis</td>\n",
" <td>438</td>\n",
" <td>446</td>\n",
" <td>5304448</td>\n",
" <td>diagnosis</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>incidence</td>\n",
" <td>997</td>\n",
" <td>1005</td>\n",
" <td>9203317</td>\n",
" <td>incidence</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>strategy</td>\n",
" <td>900</td>\n",
" <td>907</td>\n",
" <td>8110336</td>\n",
" <td>strategy</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>epidemic</td>\n",
" <td>881</td>\n",
" <td>888</td>\n",
" <td>8113048</td>\n",
" <td>Epidemic</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>decrease</td>\n",
" <td>972</td>\n",
" <td>979</td>\n",
" <td>8864047</td>\n",
" <td>decrease</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>reducing</td>\n",
" <td>696</td>\n",
" <td>703</td>\n",
" <td>9787037</td>\n",
" <td>reduction</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>control</td>\n",
" <td>892</td>\n",
" <td>898</td>\n",
" <td>7987974</td>\n",
" <td>Control</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>studies case control</td>\n",
" <td>755</td>\n",
" <td>761</td>\n",
" <td>8110199</td>\n",
" <td>case control study</td>\n",
" <td>0.5</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>finding</td>\n",
" <td>845</td>\n",
" <td>851</td>\n",
" <td>8114924</td>\n",
" <td>physical finding</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>prevent</td>\n",
" <td>75</td>\n",
" <td>81</td>\n",
" <td>8120170</td>\n",
" <td>prevent</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>rapidly</td>\n",
" <td>228</td>\n",
" <td>234</td>\n",
" <td>9073025</td>\n",
" <td>Rapidly (qualifier value)</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>crucial</td>\n",
" <td>268</td>\n",
" <td>274</td>\n",
" <td>9790737</td>\n",
" <td>critical</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>global</td>\n",
" <td>12</td>\n",
" <td>17</td>\n",
" <td>5354974</td>\n",
" <td>generalized</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>active</td>\n",
" <td>831</td>\n",
" <td>836</td>\n",
" <td>8208927</td>\n",
" <td>Active</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>ensure</td>\n",
" <td>504</td>\n",
" <td>509</td>\n",
" <td>9150977</td>\n",
" <td>Ensure</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>prompt</td>\n",
" <td>511</td>\n",
" <td>516</td>\n",
" <td>9757941</td>\n",
" <td>Prompt</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>burden</td>\n",
" <td>604</td>\n",
" <td>609</td>\n",
" <td>9792062</td>\n",
" <td>burden</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>level</td>\n",
" <td>646</td>\n",
" <td>650</td>\n",
" <td>8863573</td>\n",
" <td>level</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>range</td>\n",
" <td>466</td>\n",
" <td>470</td>\n",
" <td>9790498</td>\n",
" <td>range</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>high</td>\n",
" <td>584</td>\n",
" <td>587</td>\n",
" <td>9723548</td>\n",
" <td>High</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>case</td>\n",
" <td>838</td>\n",
" <td>841</td>\n",
" <td>9792560</td>\n",
" <td>case</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>new</td>\n",
" <td>111</td>\n",
" <td>113</td>\n",
" <td>9786819</td>\n",
" <td>novel</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset concept_id \\\n",
"0 identification 123 136 8878533 \n",
"1 comprehensive 867 879 8119030 \n",
"2 tuberculosis 984 995 2792173 \n",
"3 transmission 43 54 8110814 \n",
"4 initiation 182 191 8903356 \n",
"5 population 626 635 9724683 \n",
"6 infectious 254 263 9786858 \n",
"7 contribute 933 942 9792999 \n",
"8 treatment 539 547 5216597 \n",
"9 diagnosis 438 446 5304448 \n",
"10 incidence 997 1005 9203317 \n",
"11 strategy 900 907 8110336 \n",
"12 epidemic 881 888 8113048 \n",
"13 decrease 972 979 8864047 \n",
"14 reducing 696 703 9787037 \n",
"15 control 892 898 7987974 \n",
"16 studies case control 755 761 8110199 \n",
"17 finding 845 851 8114924 \n",
"18 prevent 75 81 8120170 \n",
"19 rapidly 228 234 9073025 \n",
"20 crucial 268 274 9790737 \n",
"21 global 12 17 5354974 \n",
"22 active 831 836 8208927 \n",
"23 ensure 504 509 9150977 \n",
"24 prompt 511 516 9757941 \n",
"25 burden 604 609 9792062 \n",
"26 level 646 650 8863573 \n",
"27 range 466 470 9790498 \n",
"28 high 584 587 9723548 \n",
"29 case 838 841 9792560 \n",
"30 new 111 113 9786819 \n",
"\n",
" concept_primary_name match_confidence is_full_match \n",
"0 Identification - mental defense mechanism 1.0 True \n",
"1 Comprehension 1.0 True \n",
"2 tuberculosis 1.0 True \n",
"3 disease transmission 1.0 True \n",
"4 initiation 1.0 True \n",
"5 geographic population 1.0 True \n",
"6 infectious 1.0 True \n",
"7 contribution 1.0 True \n",
"8 therapeutic procedure 1.0 True \n",
"9 diagnosis 1.0 True \n",
"10 incidence 1.0 True \n",
"11 strategy 1.0 True \n",
"12 Epidemic 1.0 True \n",
"13 decrease 1.0 True \n",
"14 reduction 1.0 True \n",
"15 Control 1.0 True \n",
"16 case control study 0.5 False \n",
"17 physical finding 1.0 True \n",
"18 prevent 1.0 True \n",
"19 Rapidly (qualifier value) 1.0 True \n",
"20 critical 1.0 True \n",
"21 generalized 1.0 True \n",
"22 Active 1.0 True \n",
"23 Ensure 1.0 True \n",
"24 Prompt 1.0 True \n",
"25 burden 1.0 True \n",
"26 level 1.0 True \n",
"27 range 1.0 True \n",
"28 High 1.0 True \n",
"29 case 1.0 True \n",
"30 novel 1.0 True "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(prepare_text(sample_texts[0]), A, cid2cfn, span_slop=30)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>term</th>\n",
" <th>start_offset</th>\n",
" <th>end_offset</th>\n",
" <th>concept_id</th>\n",
" <th>concept_primary_name</th>\n",
" <th>match_confidence</th>\n",
" <th>is_full_match</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>mycobacterium bovis</td>\n",
" <td>190</td>\n",
" <td>217</td>\n",
" <td>8121290</td>\n",
" <td>Mycobacterium bovis</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>live vaccine influenza virus intranasal</td>\n",
" <td>453</td>\n",
" <td>480</td>\n",
" <td>8904228</td>\n",
" <td>intranasal influenza live virus vaccine</td>\n",
" <td>0.5</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>infection influenza virus</td>\n",
" <td>900</td>\n",
" <td>914</td>\n",
" <td>2793084</td>\n",
" <td>influenza</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>immune response</td>\n",
" <td>977</td>\n",
" <td>991</td>\n",
" <td>8107682</td>\n",
" <td>immune response</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>M. tuberculosis</td>\n",
" <td>949</td>\n",
" <td>963</td>\n",
" <td>8121775</td>\n",
" <td>Mycobacterium tuberculosis</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>BCG vaccination</td>\n",
" <td>1178</td>\n",
" <td>1192</td>\n",
" <td>9725496</td>\n",
" <td>vaccination against tuberculosis</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>effectiveness</td>\n",
" <td>131</td>\n",
" <td>143</td>\n",
" <td>8866302</td>\n",
" <td>effectiveness</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>reading frame</td>\n",
" <td>832</td>\n",
" <td>844</td>\n",
" <td>9275032</td>\n",
" <td>Reading Frames</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>immunization</td>\n",
" <td>1025</td>\n",
" <td>1036</td>\n",
" <td>8107909</td>\n",
" <td>immunization</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>attenuated</td>\n",
" <td>675</td>\n",
" <td>684</td>\n",
" <td>8928394</td>\n",
" <td>Attenuated by (contextual qualifier) (qualifie...</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>intranasal</td>\n",
" <td>1014</td>\n",
" <td>1023</td>\n",
" <td>9783132</td>\n",
" <td>intranasal</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>protection</td>\n",
" <td>1097</td>\n",
" <td>1106</td>\n",
" <td>9791409</td>\n",
" <td>protection</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>morbidity</td>\n",
" <td>67</td>\n",
" <td>75</td>\n",
" <td>8113345</td>\n",
" <td>Morbidity - disease rate</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>bacterial</td>\n",
" <td>1120</td>\n",
" <td>1128</td>\n",
" <td>8116544</td>\n",
" <td>bacterium</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>infection</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>8816100</td>\n",
" <td>infection</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>available</td>\n",
" <td>157</td>\n",
" <td>165</td>\n",
" <td>8825474</td>\n",
" <td>availability of</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>including</td>\n",
" <td>605</td>\n",
" <td>613</td>\n",
" <td>8954471</td>\n",
" <td>Including</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>mortality</td>\n",
" <td>81</td>\n",
" <td>89</td>\n",
" <td>9322149</td>\n",
" <td>mortality</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>secretory</td>\n",
" <td>754</td>\n",
" <td>762</td>\n",
" <td>9344807</td>\n",
" <td>secretory process</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>challenge</td>\n",
" <td>1130</td>\n",
" <td>1138</td>\n",
" <td>9792057</td>\n",
" <td>challenge</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>potential</td>\n",
" <td>506</td>\n",
" <td>514</td>\n",
" <td>9792568</td>\n",
" <td>potential</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>bacillus</td>\n",
" <td>219</td>\n",
" <td>226</td>\n",
" <td>8107869</td>\n",
" <td>Bacillus</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>airborne</td>\n",
" <td>367</td>\n",
" <td>374</td>\n",
" <td>9793999</td>\n",
" <td>airborne</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>disease</td>\n",
" <td>376</td>\n",
" <td>382</td>\n",
" <td>2795416</td>\n",
" <td>disease</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>mucosal</td>\n",
" <td>537</td>\n",
" <td>543</td>\n",
" <td>8001644</td>\n",
" <td>Mucous Membrane</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>protein</td>\n",
" <td>781</td>\n",
" <td>787</td>\n",
" <td>8106247</td>\n",
" <td>protein</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>viruses</td>\n",
" <td>710</td>\n",
" <td>716</td>\n",
" <td>8116064</td>\n",
" <td>virus</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>induced</td>\n",
" <td>1167</td>\n",
" <td>1173</td>\n",
" <td>8923304</td>\n",
" <td>induced</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>potency</td>\n",
" <td>889</td>\n",
" <td>895</td>\n",
" <td>9792597</td>\n",
" <td>potency</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>target</td>\n",
" <td>774</td>\n",
" <td>779</td>\n",
" <td>5352400</td>\n",
" <td>goal</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>strong</td>\n",
" <td>283</td>\n",
" <td>288</td>\n",
" <td>8863756</td>\n",
" <td>strong</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>demand</td>\n",
" <td>290</td>\n",
" <td>295</td>\n",
" <td>9793675</td>\n",
" <td>demand</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>viral</td>\n",
" <td>574</td>\n",
" <td>578</td>\n",
" <td>8116064</td>\n",
" <td>virus</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>route</td>\n",
" <td>401</td>\n",
" <td>405</td>\n",
" <td>8861830</td>\n",
" <td>route</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>major</td>\n",
" <td>52</td>\n",
" <td>56</td>\n",
" <td>8864136</td>\n",
" <td>major</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>early</td>\n",
" <td>748</td>\n",
" <td>752</td>\n",
" <td>9790417</td>\n",
" <td>early</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>cause</td>\n",
" <td>58</td>\n",
" <td>62</td>\n",
" <td>9790538</td>\n",
" <td>cause</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>able</td>\n",
" <td>865</td>\n",
" <td>868</td>\n",
" <td>9061928</td>\n",
" <td>Able</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>mice</td>\n",
" <td>1041</td>\n",
" <td>1044</td>\n",
" <td>9790284</td>\n",
" <td>mouse</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>new</td>\n",
" <td>308</td>\n",
" <td>310</td>\n",
" <td>9786819</td>\n",
" <td>novel</td>\n",
" <td>1.0</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" term start_offset end_offset \\\n",
"0 mycobacterium bovis 190 217 \n",
"1 live vaccine influenza virus intranasal 453 480 \n",
"2 infection influenza virus 900 914 \n",
"3 immune response 977 991 \n",
"4 M. tuberculosis 949 963 \n",
"5 BCG vaccination 1178 1192 \n",
"6 effectiveness 131 143 \n",
"7 reading frame 832 844 \n",
"8 immunization 1025 1036 \n",
"9 attenuated 675 684 \n",
"10 intranasal 1014 1023 \n",
"11 protection 1097 1106 \n",
"12 morbidity 67 75 \n",
"13 bacterial 1120 1128 \n",
"14 infection 0 8 \n",
"15 available 157 165 \n",
"16 including 605 613 \n",
"17 mortality 81 89 \n",
"18 secretory 754 762 \n",
"19 challenge 1130 1138 \n",
"20 potential 506 514 \n",
"21 bacillus 219 226 \n",
"22 airborne 367 374 \n",
"23 disease 376 382 \n",
"24 mucosal 537 543 \n",
"25 protein 781 787 \n",
"26 viruses 710 716 \n",
"27 induced 1167 1173 \n",
"28 potency 889 895 \n",
"29 target 774 779 \n",
"30 strong 283 288 \n",
"31 demand 290 295 \n",
"32 viral 574 578 \n",
"33 route 401 405 \n",
"34 major 52 56 \n",
"35 early 748 752 \n",
"36 cause 58 62 \n",
"37 able 865 868 \n",
"38 mice 1041 1044 \n",
"39 new 308 310 \n",
"\n",
" concept_id concept_primary_name \\\n",
"0 8121290 Mycobacterium bovis \n",
"1 8904228 intranasal influenza live virus vaccine \n",
"2 2793084 influenza \n",
"3 8107682 immune response \n",
"4 8121775 Mycobacterium tuberculosis \n",
"5 9725496 vaccination against tuberculosis \n",
"6 8866302 effectiveness \n",
"7 9275032 Reading Frames \n",
"8 8107909 immunization \n",
"9 8928394 Attenuated by (contextual qualifier) (qualifie... \n",
"10 9783132 intranasal \n",
"11 9791409 protection \n",
"12 8113345 Morbidity - disease rate \n",
"13 8116544 bacterium \n",
"14 8816100 infection \n",
"15 8825474 availability of \n",
"16 8954471 Including \n",
"17 9322149 mortality \n",
"18 9344807 secretory process \n",
"19 9792057 challenge \n",
"20 9792568 potential \n",
"21 8107869 Bacillus \n",
"22 9793999 airborne \n",
"23 2795416 disease \n",
"24 8001644 Mucous Membrane \n",
"25 8106247 protein \n",
"26 8116064 virus \n",
"27 8923304 induced \n",
"28 9792597 potency \n",
"29 5352400 goal \n",
"30 8863756 strong \n",
"31 9793675 demand \n",
"32 8116064 virus \n",
"33 8861830 route \n",
"34 8864136 major \n",
"35 9790417 early \n",
"36 9790538 cause \n",
"37 9061928 Able \n",
"38 9790284 mouse \n",
"39 9786819 novel \n",
"\n",
" match_confidence is_full_match \n",
"0 1.0 True \n",
"1 0.5 False \n",
"2 1.0 True \n",
"3 1.0 True \n",
"4 1.0 True \n",
"5 1.0 True \n",
"6 1.0 True \n",
"7 1.0 True \n",
"8 1.0 True \n",
"9 1.0 True \n",
"10 1.0 True \n",
"11 1.0 True \n",
"12 1.0 True \n",
"13 1.0 True \n",
"14 1.0 True \n",
"15 1.0 True \n",
"16 1.0 True \n",
"17 1.0 True \n",
"18 1.0 True \n",
"19 1.0 True \n",
"20 1.0 True \n",
"21 1.0 True \n",
"22 1.0 True \n",
"23 1.0 True \n",
"24 1.0 True \n",
"25 1.0 True \n",
"26 1.0 True \n",
"27 1.0 True \n",
"28 1.0 True \n",
"29 1.0 True \n",
"30 1.0 True \n",
"31 1.0 True \n",
"32 1.0 True \n",
"33 1.0 True \n",
"34 1.0 True \n",
"35 1.0 True \n",
"36 1.0 True \n",
"37 1.0 True \n",
"38 1.0 True \n",
"39 1.0 True "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = find_matched_terms(prepare_text(sample_texts[1]), A, cid2cfn, span_slop=30)\n",
"\n",
"results_df = pd.DataFrame(results, columns=result_columns)\n",
"results_df.head(len(results))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}