Reactome Failed Query Analysis
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Reactome Failed Query Analysis",
"provenance": [],
"collapsed_sections": [
"yq3CVqLlhxJD"
],
"authorship_tag": "ABX9TyP7bJNfKJ96OaG5ADwsQDEY",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/PritiShaw/fb8abb610ec8267d21150544aa245257/reactome-failed-query-processing.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d3k8baTpmlLP",
"colab_type": "text"
},
"source": [
"Retrieve Information for Reactome Failed Searches\n",
"---\n",
"---\n",
"### Purpose\n",
"The notebook processes the failed query terms to get the PMIDs where the term was seen.\n",
"Using these PMIDs MESH terms and article metadata will be extracted and presented in a tab seperatted file.\n",
"### How to Run\n",
"All code cells needs to be run sequentially\n"
]
},
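{
"cell_type": "markdown",
"metadata": {
"id": "exampleDataFlow",
"colab_type": "text"
},
"source": [
"As a sketch of the data flow (the term and numbers below are invented for illustration): the input CSV holds `term,frequency` rows; `getPMID` turns each term into `PMID~term~count` lines in `pmid_list.txt`; the later steps join MeSH terms and metadata into `output.tsv`:\n",
"```\n",
"# Reactome.csv (input)\n",
"p53 signalling,42\n",
"# pmid_list.txt (intermediate)\n",
"12345678~p53 signalling~137\n",
"```"
]
},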
{
"cell_type": "markdown",
"metadata": {
"id": "GM1o-yjOdnW8",
"colab_type": "text"
},
"source": [
"### Setup"
]
},
{
"cell_type": "code",
"metadata": {
"id": "TbdkMu1ATOVr",
"colab_type": "code",
"colab": {}
},
"source": [
"!git clone https://github.com/PritiShaw/reactome-failed-queries code"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "QOynF0hfSu1P",
"colab_type": "code",
"colab": {}
},
"source": [
"%cd /content/code/src"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "RInn17_HJZnC",
"colab_type": "code",
"colab": {}
},
"source": [
"!pip install -r requirements.txt\n",
"!git clone https://github.com/PritiShaw/Reactome-Failed-Queries-Processing.git processor"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "tKc7g8fRbcii",
"colab_type": "code",
"colab": {}
},
"source": [
"import logging\n",
"logging.basicConfig()\n",
"logging.getLogger().setLevel(logging.WARNING)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "rGjJnEF2d8xP",
"colab_type": "text"
},
"source": [
"### Get PMID from failed query terms\n",
"`getPMID(terms)` \n",
"The method takes the list of terms and queries in Pubmed database to get the List of PMID containing the term. \n",
"`_extractListID(filecontent, term)` \n",
"Extracts list of PMIDs from the XML and save it in file for further processing."
]
},
{
"cell_type": "code",
"metadata": {
"id": "DIRxMACsIgTX",
"colab_type": "code",
"colab": {}
},
"source": [
"# getPMID.py\n",
"\n",
"import requests\n",
"import time\n",
"import json\n",
"import xml.etree.ElementTree as ET\n",
"\n",
"def _extractListID(filecontent, term):\n",
" tree = ET.fromstring(filecontent, ET.XMLParser(encoding='utf-8'))\n",
" ID = tree.findall('./IdList/Id')\n",
" count = tree.find('./Count').text\n",
"\n",
" with open(\"pmid_list.txt\", \"a\") as op_file:\n",
" for i in ID:\n",
" print(i.text + \"~\" + term + \"~\" + count, file=op_file)\n",
"\n",
"\"\"\"\n",
"Get PMID for the Query terms\n",
"Parameters:\n",
" terms: List of failed query terms\n",
" pmid_threshold: Limit of Pubmed articles to process, default is 20\n",
"\"\"\"\n",
"def getPMID(terms,pmid_threshold=20):\n",
" if pmid_threshold<1:\n",
" pmid_threshold=20\n",
" for term in terms:\n",
" term = term.strip().rpartition(\",\")[0]\n",
" flag = True\n",
" while flag:\n",
" try:\n",
" xml_content = requests.get(\n",
" \"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=\"+pmid_threshold+\"&term=hasabstract%20AND%20\"+term)\n",
" _extractListID(xml_content.text, term)\n",
" flag = False\n",
" except:\n",
" time.sleep(.5)\n"
],
"execution_count": null,
"outputs": []
},
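{
"cell_type": "markdown",
"metadata": {
"id": "exampleGetPMIDUsage",
"colab_type": "text"
},
"source": [
"A minimal usage sketch (the term is hypothetical; this performs live EUtils requests when run). Note the trailing `,frequency` is stripped by `rpartition` before querying:\n",
"```python\n",
"getPMID(['p53 signalling,42'], pmid_threshold=5)\n",
"# appends up to 5 lines to pmid_list.txt, e.g.\n",
"# 12345678~p53 signalling~137\n",
"```"
]
},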
{
"cell_type": "markdown",
"metadata": {
"id": "YtbRKt3Hcp4H",
"colab_type": "text"
},
"source": [
"### Extraction of MESH terms\n",
"\n",
"`getAbstracts()` \n",
"Reads the PMID list and gets the Abstract from Pubmed. The output file is sent to [Interactive Medical Text Indexer (MTI)](https://ii.nlm.nih.gov/Batch/index.shtml) for batch processing of these abstracts"
]
},
{
"cell_type": "code",
"metadata": {
"id": "tpsF7u8tIaQN",
"colab_type": "code",
"colab": {}
},
"source": [
"# getMESH.py\n",
"\n",
"import requests\n",
"import os\n",
"import time\n",
"\n",
"from urllib.request import urlopen\n",
"from xml.etree.ElementTree import parse\n",
"\n",
"\"\"\"\n",
"Get abstracts from PMID and generate input file for MESH Batch processing\n",
"\"\"\"\n",
"def getAbstracts():\n",
" with open(\"pmid_list.txt\") as file:\n",
" with open('abstract.txt', 'w') as o:\n",
" for inp in file:\n",
" pmid = inp.strip().split(\"~\")[0]\n",
" flag = True\n",
" while flag:\n",
" try:\n",
" var_url = urlopen(f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id={pmid}')\n",
" flag = False\n",
" except:\n",
" time.sleep(.5)\n",
" xmldoc = parse(var_url)\n",
" for item in xmldoc.iterfind('PubmedArticle'):\n",
" try:\n",
" abstract_text = item.findtext(\n",
" 'MedlineCitation/Article/Abstract/AbstractText')\n",
" article_title = item.findtext(\n",
" 'MedlineCitation/Article/ArticleTitle')\n",
" if abstract_text:\n",
" print('UI - ', pmid, file=o)\n",
" print(\n",
" 'TI - ', article_title.encode(\"ascii\", \"ignore\"), file=o)\n",
" print(\n",
" 'AB - ', abstract_text.encode(\"ascii\", \"ignore\"), file=o)\n",
" print(\"\\n\", file=o)\n",
" else:\n",
" print(\"Err: MESH: \", \"Undefined Abstract\")\n",
" except Exception as e:\n",
" print(\"Err: MESH: \", e)\n",
"\n",
"def getMESH():\n",
" getAbstracts()\n",
" os.system(\"bash handleMTI.sh >> mesh.txt\")\n",
" "
],
"execution_count": null,
"outputs": []
},
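{
"cell_type": "markdown",
"metadata": {
"id": "exampleAbstractFile",
"colab_type": "text"
},
"source": [
"For reference, `getAbstracts` writes `abstract.txt` in a MEDLINE-like field layout for the MTI batch processor; roughly (PMID and text invented):\n",
"```\n",
"UI -  12345678\n",
"TI -  A hypothetical article title\n",
"AB -  A hypothetical abstract about p53 signalling ...\n",
"```"
]
},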
{
"cell_type": "markdown",
"metadata": {
"id": "pnArpnW-fmNy",
"colab_type": "text"
},
"source": [
"### Extraction of Metadata from the PMIDs\n",
"\n",
"Following details are retrieved using EUtils, INDRA and OpenCitations using the PMID\n",
"`JOURNAL_TITLE`, `YEAR`, `PMCID`, `DOI`, `PMC_CITATION_COUNT`, `INDRA_STATEMENT_COUNT`, `OC_CITATION_COUNT`, `INDRA_QUERY_TERM_STATEMENT_COUNT`, `PMID_COUNT`\n",
"\n",
"`getIndraQueryTermStmtCount` \n",
"This method uses [Gilda](https://github.com/indralab/gilda) for grounding the failed query term"
]
},
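{
"cell_type": "markdown",
"metadata": {
"id": "exampleGildaResponse",
"colab_type": "text"
},
"source": [
"As used below, the grounding service returns a JSON list of scored matches; the code only needs `term.id` and `term.db` from the top match (the values in the comment are illustrative):\n",
"```python\n",
"import requests\n",
"resp = requests.post('http://grounding.indra.bio/ground', json={'text': 'p53'})\n",
"resp.json()[0]['term']  # e.g. {'db': 'HGNC', 'id': '11998', ...}\n",
"```"
]
},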
{
"cell_type": "code",
"metadata": {
"id": "jjXzZat5Im81",
"colab_type": "code",
"colab": {}
},
"source": [
"# getEUtilsInfo.py\n",
"\n",
"import os\n",
"import gzip\n",
"import time\n",
"import sys\n",
"import csv\n",
"import requests\n",
"import json\n",
"import indra.literature.pubmed_client as parser\n",
"import xml.etree.ElementTree as ET\n",
"from indra.sources import indra_db_rest\n",
"from indra.assemblers.html.assembler import HtmlAssembler\n",
"from urllib.parse import urljoin\n",
"from indra.statements.statements import stmts_to_json\n",
"\n",
"\n",
"\"\"\"\n",
"Gets Citation count for PMID\n",
"\n",
"Parameters\n",
"----------\n",
"pmid : string\n",
" PMID of the medical paper\n",
"\n",
"Returns\n",
"-------\n",
"string\n",
" Citation Count\n",
"\"\"\"\n",
"def citationCount(pmid):\n",
" flag = True\n",
" while flag:\n",
" try:\n",
" citationCount_url = requests.get(\n",
" \"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&linkname=pubmed_pmc_refs&id=\"+pmid)\n",
" flag = False\n",
" except Exception as e:\n",
" time.sleep(.5)\n",
" try:\n",
" fileContent = citationCount_url.text\n",
" tree = ET.fromstring(fileContent, ET.XMLParser(encoding='utf-8'))\n",
" ID = tree.findall('./LinkSet/LinkSetDb/Link')\n",
" return len(ID)\n",
" except:\n",
" return 0\n",
"\n",
"\n",
"\"\"\"\n",
"Get number of statments generated by INDRA from the query term\n",
"\n",
"Parameters\n",
"----------\n",
" txt : string\n",
" Query term to be processed\n",
" source_apis : [], optional\n",
" APIs to be searched from, default is all\n",
"\n",
"Returns\n",
"------\n",
" integer\n",
" Number of Indra Statements\n",
"\"\"\"\n",
"def getIndraQueryTermStmtCount(txt, source_apis=None):\n",
" grounding_service_url = 'http://grounding.indra.bio/'\n",
" resp = requests.post(urljoin(grounding_service_url,\n",
" 'ground'), json={'text': txt})\n",
" grounding_results = resp.json()\n",
" if len(grounding_results) > 0:\n",
" term_id = grounding_results[0]['term']['id']\n",
" term_db = grounding_results[0]['term']['db']\n",
" term = term_id + '@' + term_db\n",
" else:\n",
" return 0\n",
" stmts = indra_db_rest.get_statements(agents=[term]).statements\n",
" stmts_json = stmts_to_json(stmts)\n",
" valid_stmts = set()\n",
" if source_apis:\n",
" idx = 0\n",
" for stmt in stmts_json:\n",
" evidences = stmt.get(\"evidence\", [])\n",
" for ev in evidences:\n",
" if ev[\"source_api\"] in source_apis:\n",
" valid_stmts.add(stmts[idx])\n",
" idx += 1\n",
" return len(valid_stmts)\n",
" return len(stmts)\n",
"\n",
"\n",
"\"\"\"\n",
"Extracts information from XML\n",
"\n",
"Parameters\n",
"----------\n",
"fileContent : \n",
" XML Content for the journal\n",
"citationCount : \n",
" Citation count for the PMID\n",
"term:\n",
" Reactome query term\n",
"total_pmid :\n",
" Number of articles where the term is seen\n",
"\"\"\"\n",
"def extractFromXML(pmid, term, total_pmid):\n",
"\n",
" flag = True\n",
" while flag:\n",
" try:\n",
" xmlContent = requests.get(\n",
" \"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=\"+pmid)\n",
" flag = False\n",
" except Exception as e:\n",
" time.sleep(.5)\n",
"\n",
" fileContent = xmlContent.text\n",
" destFileName = \"eutils_output.tsv\"\n",
" if(os.path.isfile(destFileName)):\n",
" destCSV = open(destFileName, 'a')\n",
" else:\n",
" destCSV = open(destFileName, 'w')\n",
" print('\\t'.join([\"PMID\", \"TERM\", \"JOURNAL_TITLE\", \"YEAR\", \"PMCID\", \"DOI\", \"PMC_CITATION_COUNT\",\n",
" \"INDRA_STATEMENT_COUNT\", \"OC_CITATION_COUNT\", \"INDRA_QUERY_TERM_STATEMENT_COUNT\", \"PMID_COUNT\"]), file=destCSV)\n",
" writer = csv.writer(destCSV, delimiter='\\t', quoting=csv.QUOTE_MINIMAL)\n",
" tree = ET.fromstring(fileContent, ET.XMLParser(encoding='utf-8'))\n",
" pm_articles = tree.findall('./PubmedArticle')\n",
" citation_count = citationCount(pmid)\n",
" for art_ix, pm_article in enumerate(pm_articles):\n",
" medline_citation = pm_article.find('./MedlineCitation')\n",
" pubmed = pm_article.find('./PubmedData')\n",
" try:\n",
" history_pub_date = pubmed.find(\n",
" './History/PubMedPubDate[@PubStatus=\"pubmed\"]')\n",
" year = parser._find_elem_text(history_pub_date, 'Year')\n",
" PublicationTypeList = medline_citation.find(\n",
" './Article/PublicationTypeList')\n",
" pubType = parser._find_elem_text(\n",
" PublicationTypeList, 'PublicationType')\n",
" topics = []\n",
" for topic in medline_citation.findall('./MeshHeadingList/MeshHeading'):\n",
" topics.append(topic.find('DescriptorName').text)\n",
" topics_string = ' , '.join(topics)\n",
" except Exception as err:\n",
" print(\"Err: EUtils:\", err)\n",
" continue\n",
"\n",
" pub_year = None if (year is None) else int(year)\n",
" article_info = parser._get_article_info(\n",
" medline_citation, pm_article.find('PubmedData'))\n",
" journal_info = parser._get_journal_info(medline_citation, False)\n",
"\n",
" # Preparing results\n",
" title = journal_info[\"journal_abbrev\"] or \"\"\n",
" year = pub_year\n",
" DOI = article_info[\"doi\"] or \"\"\n",
" PMCID = article_info[\"pmcid\"] or \"\"\n",
" PMID = article_info[\"pmid\"] or \"\"\n",
" pmc_citation_count = citation_count\n",
" OC_CITATION_COUNT = 0\n",
" try:\n",
" if DOI != \"\":\n",
" output = requests.get(\n",
" \"https://opencitations.net/api/v1/metadata/\" + DOI).json()\n",
" if len(output) > 0:\n",
" OC_CITATION_COUNT = output[0][\"citation_count\"]\n",
" except:\n",
" pass\n",
" stmt = indra_db_rest.get_statements_for_paper(\n",
" [('pmid', PMID)]).statements\n",
" indra_stmt_count = len(stmt)\n",
" # storing in tsv file\n",
" writer.writerow([PMID, term, title, year, PMCID, DOI, pmc_citation_count,\n",
" indra_stmt_count, OC_CITATION_COUNT, getIndraQueryTermStmtCount(term),total_pmid])\n",
" # Closing file\n",
" destCSV.close()\n",
"\n",
"\n",
"\"\"\"\n",
"Generate a TSV containing meta details of PMID from EUtils\n",
"\"\"\"\n",
"def getEUtilsInfo():\n",
" with open(\"pmid_list.txt\") as file:\n",
" for line in file:\n",
" try:\n",
" line = line.strip().split(\"~\")\n",
" pmid = line[0]\n",
" term = line[1]\n",
" total_pmid = line[2]\n",
" extractFromXML(pmid, term, total_pmid)\n",
" except Exception as e:\n",
" print(\"Err: EUtils: \", e, line)"
],
"execution_count": null,
"outputs": []
},
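{
"cell_type": "markdown",
"metadata": {
"id": "exampleExtractUsage",
"colab_type": "text"
},
"source": [
"A usage sketch for a single `pmid_list.txt` line (values invented; this performs live EUtils, OpenCitations and INDRA requests when run):\n",
"```python\n",
"extractFromXML('12345678', 'p53 signalling', '137')\n",
"# appends one tab-separated row per article to eutils_output.tsv\n",
"```"
]
},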
{
"cell_type": "markdown",
"metadata": {
"id": "lfk0Dm2Min0S",
"colab_type": "text"
},
"source": [
"### Generate Output TSV\n",
"Merge output MESH terms from MTI Batch Processing and the metadata to generate a TSV output file."
]
},
{
"cell_type": "code",
"metadata": {
"id": "bswcvY0Ui1jY",
"colab_type": "code",
"colab": {}
},
"source": [
"# mergeOutputs.py\n",
"\n",
"import csv\n",
"import json\n",
"import sys\n",
"import os\n",
"\n",
"\"\"\"\n",
"Merge outputs from EUtils and MESH terms from \n",
"\n",
"Parameters\n",
"----------\n",
"path_eutils:\n",
" Path to TSV file containing metadata from EUtils\n",
"path_mesh:\n",
" Path to MESH terms extracted by Web API\n",
"path_output_dir:\n",
" Path to Output directory\n",
"\"\"\"\n",
"def mergeOutputs(path_eutils, path_mesh, path_output_dir):\n",
" details = {}\n",
"\n",
" with open(path_eutils) as f:\n",
" reader = csv.DictReader(f, delimiter='\\t')\n",
" for row in reader:\n",
" if row[\"PMID\"] not in details:\n",
" details[row[\"PMID\"]] = {}\n",
" details[row[\"PMID\"]][row[\"TERM\"]] = {\n",
" \"journal\": row[\"JOURNAL_TITLE\"],\n",
" \"year\": row[\"YEAR\"],\n",
" \"pmc\": row[\"PMCID\"],\n",
" \"doi\": row[\"DOI\"],\n",
" \"citation_count\": row[\"PMC_CITATION_COUNT\"],\n",
" \"indra_stmt_count\": row[\"INDRA_STATEMENT_COUNT\"],\n",
" \"oc_citation_count\": row[\"OC_CITATION_COUNT\"],\n",
" \"indra_query_term_stmt_count\": row[\"INDRA_QUERY_TERM_STATEMENT_COUNT\"],\n",
" \"pmid_count\": row[\"PMID_COUNT\"],\n",
" \"mesh\": []\n",
" }\n",
"\n",
" with open(path_mesh) as mesh:\n",
" for line in mesh:\n",
" inp = line.split(\"|\")\n",
" mesh_term = inp[1]\n",
" pmid = inp[0]\n",
" for term in details[pmid]:\n",
" details[pmid][term][\"mesh\"].append(mesh_term)\n",
"\n",
" with open(os.path.join(path_output_dir, \"output.tsv\"), 'w') as csvfile:\n",
" writer = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL, delimiter='\\t')\n",
" writer.writerow([\"QUERY_TERM\", \"PMID\", \"JOURNAL_TITLE\", \"YEAR\", \"PMCID\",\n",
" \"DOI\", \"PMC_CITATION_COUNT\", \"INDRA_STATEMENT_COUNT\", \"OC_CITATION_COUNT\", \"INDRA_QUERY_TERM_STATEMENT_COUNT\", \"MESH_TERMS\",\"PMID_COUNT\"])\n",
" for key in details:\n",
" for term in details[key]:\n",
" writer.writerow([term, key, details[key][term][\"journal\"], details[key][term][\"year\"], details[key][term][\"pmc\"], details[key][term]\n",
" [\"doi\"], details[key][term][\"citation_count\"], details[key][term][\n",
" \"indra_stmt_count\"], details[key][term][\"oc_citation_count\"],\n",
" details[key][term][\"indra_query_term_stmt_count\"], \"|\".join(details[key][term][\"mesh\"]), details[key][term][pmid_count]])"
],
"execution_count": null,
"outputs": []
},
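{
"cell_type": "markdown",
"metadata": {
"id": "exampleMergeUsage",
"colab_type": "text"
},
"source": [
"A usage sketch, assuming both intermediate files exist in the working directory (this matches the call made by the driver below). `mesh.txt` is expected to be pipe separated, `PMID|MESH_TERM|...`:\n",
"```python\n",
"mergeOutputs('eutils_output.tsv', 'mesh.txt', './processor')\n",
"# writes ./processor/output.tsv joining metadata and MeSH terms\n",
"```"
]
},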
{
"cell_type": "markdown",
"metadata": {
"id": "yq3CVqLlhxJD",
"colab_type": "text"
},
"source": [
"### Driver Function"
]
},
{
"cell_type": "code",
"metadata": {
"id": "EBkvP1eJYQG1",
"colab_type": "code",
"colab": {}
},
"source": [
"# script.py\n",
"\n",
"import requests\n",
"import multiprocessing\n",
"import os\n",
"import datetime\n",
" \n",
"from tqdm import tqdm\n",
" \n",
"def saveInHistory(terms):\n",
" with open(\"./processor/history\", \"a\") as out_file:\n",
" out_file.write('\\n'.join(terms)+'\\n' )\n",
" \n",
" \n",
"if __name__ == \"__main__\":\n",
" terms_request = requests.get(\n",
" \"https://gist.githubusercontent.com/PritiShaw/03ce10747835390ec8a755fed9ea813d/raw/cc72cb5479f09b574e03ed22c8d4e3147e09aa0c/Reactome.csv\")\n",
" inp_terms = terms_request.text.splitlines()\n",
" history = set()\n",
" with open(\"./processor/history\",\"r\") as history_file:\n",
" for line in history_file:\n",
" history.add(line.strip())\n",
" with open(\"./processor/history_rev\",\"r\") as history_file:\n",
" for line in history_file:\n",
" history.add(line.strip())\n",
" terms = [[]]\n",
" for term in inp_terms[1:]:\n",
" term_parts = term.split(\",\")\n",
" if len(term_parts)==2 and int(term_parts[1])>9 and term not in history:\n",
" terms[-1].append(term)\n",
" if len(terms[-1])==10:\n",
" terms.append([])\n",
"\n",
" for chunk in tqdm(terms):\n",
" print('\\t',datetime.datetime.utcnow())\n",
" getPMID(chunk)\n",
" process_mesh = multiprocessing.Process(target=getMESH)\n",
" process_meta = multiprocessing.Process(target=getEUtilsInfo)\n",
" \n",
" process_meta.start()\n",
" process_mesh.start()\n",
" process_meta.join()\n",
" process_mesh.join()\n",
" \n",
" mergeOutputs(\"eutils_output.csv\",\"mesh.txt\",\"./processor\")\n",
" history.update(chunk) \n",
" saveInHistory(chunk)\n",
" os.system(\"bash handleGit.sh\")\n",
" os.system(\"bash cleanup.sh\")"
],
"execution_count": null,
"outputs": []
}
]
}