Last active
June 8, 2016 15:55
A first attempt at exploring some contentmine command IPython magics for use in Jupyter notebooks.
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Contentmine IPython Magic\n", | |
"\n", | |
"*A doodle by Tony Hirst / @psychemedia*\n", | |
"\n", | |
"This is a first, very weak, attempt at putting together some `contentmine` IPython magics.\n", | |
"\n", | |
"The magics are based on the following conditions:\n", | |
"\n", | |
"- an IPython notebook running in a Docker container in privileged mode using Python 3.3+ and with a specified volume mounted in the container (in the example, I use `/notebooks`);\n", | |
"- the existence of a public Docker image `psychemedia/contentmine` containing the *contentmine* applications: `getpapers`, `norma`, `cmine`.\n", | |
"\n", | |
"There are two ideas at the heart of the demo:\n", | |
"\n", | |
"1. that we can run commands in Docker containers as commandline commands and get any results files back via a shared folder;\n", | |
"2. that we can run Docker containers from inside a container (for example, as a commandline command from a code cell in a Jupyter notebook running in a container).\n", | |
"\n", | |
"\n", | |
"As an example, this notebook was run in a container fired up from the following `docker-compose.yaml` file launched with the command `docker-compose up -d`:" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"````\n", | |
"notebook:\n", | |
"  image: jupyter/notebook\n", | |
"  ports:\n", | |
"    - \"8899:8888\"\n", | |
"  volumes:\n", | |
"    - ./notebooks:/notebooks\n", | |
"    - /var/run/docker.sock:/var/run/docker.sock\n", | |
"  privileged: true\n", | |
"````" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"State is passed between the command line Docker container and the notebook container by mounting a specified directory in the command line container on top of a specified directory in the notebook container. Files persist in the notebook container directory; the temporary command line container can write files to, and read files from, this directory and its subdirectories.\n", | |
"\n", | |
"----\n", | |
"\n", | |
"Install the magics:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Requirement already satisfied (use --upgrade to upgrade): docker-py in /usr/local/lib/python3.4/dist-packages\n", | |
"Requirement already satisfied (use --upgrade to upgrade): backports.ssl-match-hostname>=3.5 in /usr/local/lib/python3.4/dist-packages (from docker-py)\n", | |
"Requirement already satisfied (use --upgrade to upgrade): six>=1.4.0 in /usr/local/lib/python3.4/dist-packages (from docker-py)\n", | |
"Requirement already satisfied (use --upgrade to upgrade): websocket-client>=0.32.0 in /usr/local/lib/python3.4/dist-packages (from docker-py)\n", | |
"Requirement already satisfied (use --upgrade to upgrade): requests>=2.5.2 in /usr/local/lib/python3.4/dist-packages (from docker-py)\n", | |
"\u001b[33mYou are using pip version 8.0.2, however version 8.1.1 is available.\n", | |
"You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" | |
] | |
} | |
], | |
"source": [ | |
"from IPython.core.magic import Magics, magics_class, line_magic\n", | |
"from IPython.core.magic_arguments import (argument, magic_arguments,\n", | |
" parse_argstring)\n", | |
"import shutil\n", | |
"import shlex\n", | |
"import os\n", | |
"\n", | |
"!pip3 install docker-py\n", | |
"import docker\n", | |
"\n", | |
"#Should do this as part of init\n", | |
"if not shutil.which(\"docker\"):\n", | |
"    !apt-get update && apt-get install -y docker.io\n", | |
"\n", | |
" \n", | |
"@magics_class\n", | |
"class DockerMagics(Magics):\n", | |
"\n", | |
"    #def dockerMagicGetPath(container,mountdir):\n", | |
"    def dockerMagicGetPath(self,mountdir):\n", | |
"        \"\"\"Return the host path mounted at mountdir in this container, or '' if there is none.\"\"\"\n", | |
"        cli=docker.Client(base_url='unix://var/run/docker.sock')\n", | |
"        #if cli.containers(filters={'name':container}):\n", | |
"        #    containerData=cli.inspect_container(container)\n", | |
"        #By default, HOSTNAME inside a container is the short container id\n", | |
"        containers=cli.containers(filters={'id':os.environ['HOSTNAME']})\n", | |
"        if containers==[]:\n", | |
"            return ''\n", | |
"        c=[x['Source'] for x in containers[0]['Mounts'] if 'Destination' in x and x['Destination']==mountdir]\n", | |
"        #Return '' rather than raising IndexError if nothing is mounted at mountdir\n", | |
"        return c[0] if c else ''\n", | |
"\n", | |
"    #! docker run -v /Users/ajh59/tmp/notebookdockercli/notebooks/downloads:/contentmineself --tty --interactive psychemedia/contentmine getpapers -q rhinocerous -o /contentmineself/rhinocerous -x\n", | |
"\n", | |
"    #getpapers -q rhinocerous -o /contentmine/rhinocerous -x\n", | |
"    @line_magic\n", | |
"    def getpapers(self,line):\n", | |
"        \"\"\" Runs a contentmine command: /MOUNTDIR SEARCHTERM\n", | |
"            %getpapers /notebooks rhinocerous\n", | |
"        \"\"\"\n", | |
"        mount=self.dockerMagicGetPath(line.strip().split()[0])\n", | |
"        if mount=='':\n", | |
"            print('No container mounted there?')\n", | |
"            return\n", | |
"        Q=' '.join(line.strip().split()[1:])\n", | |
"        #Quote the search term so that multi-word queries survive the shell\n", | |
"        QD=shlex.quote(Q)\n", | |
"        DD='{}{}'.format(mount,'/contentmineMagic')\n", | |
"        ! docker run --rm -v {DD}:/tmp_contentmineMagic --tty --interactive psychemedia/contentmine getpapers -q {QD} -o /tmp_contentmineMagic/{QD} -x\n", | |
"\n", | |
"    #norma --project /contentmine/aardvark -i fulltext.xml -o scholarly.html --transform nlm2html\n", | |
"    @line_magic\n", | |
"    def norma(self,line):\n", | |
"        \"\"\"\n", | |
"            %norma /notebooks rhinocerous\n", | |
"        \"\"\"\n", | |
"        mount=self.dockerMagicGetPath(line.strip().split()[0])\n", | |
"        if mount=='':\n", | |
"            print('No container mounted there?')\n", | |
"            return\n", | |
"\n", | |
"        Q=' '.join(line.strip().split()[1:])\n", | |
"        QD=shlex.quote(Q)\n", | |
"        DD='{}{}'.format(mount,'/contentmineMagic')\n", | |
"        ! docker run --rm -v {DD}:/tmp_contentmineMagic --tty --interactive psychemedia/contentmine norma --project /tmp_contentmineMagic/{QD} -i fulltext.xml -o scholarly.html --transform nlm2html\n", | |
"\n", | |
"    #./contentmine cmine /contentmine/aardvark\n", | |
"    @line_magic\n", | |
"    def cmine(self,line):\n", | |
"        \"\"\"\n", | |
"            %cmine /notebooks rhinocerous\n", | |
"        \"\"\"\n", | |
"        mount=self.dockerMagicGetPath(line.strip().split()[0])\n", | |
"        if mount=='':\n", | |
"            print('No container mounted there?')\n", | |
"            return\n", | |
"\n", | |
"        Q=' '.join(line.strip().split()[1:])\n", | |
"        QD=shlex.quote(Q)\n", | |
"        DD='{}{}'.format(mount,'/contentmineMagic')\n", | |
"        ! docker run --rm -v {DD}:/tmp_contentmineMagic --tty --interactive psychemedia/contentmine cmine /tmp_contentmineMagic/{QD}" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"ip = get_ipython()\n", | |
"ip.register_magics(DockerMagics)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"----\n", | |
"\n", | |
"Now for a demo..." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Contentmine Magic.ipynb Untitled.ipynb\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!rm -r contentmineMagic/\n", | |
"!ls" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"\u001b[32minfo\u001b[39m: Searching using eupmc API\n", | |
"\u001b[32minfo\u001b[39m: Found 4 open access results\n", | |
"\u001b[2K\u001b[1GRetrieving results [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m] 100% (eta 0.0s)\n", | |
"\u001b[32minfo\u001b[39m: Done collecting results\n", | |
"\u001b[32minfo\u001b[39m: Saving result metadata\n", | |
"\u001b[32minfo\u001b[39m: Full EUPMC result metadata written to \u001b[34meupmc_results.json\u001b[39m\n", | |
"\u001b[32minfo\u001b[39m: Individual EUPMC result metadata records written\n", | |
"\u001b[32minfo\u001b[39m: Extracting fulltext HTML URL list (may not be available for all articles)\n", | |
"\u001b[32minfo\u001b[39m: Fulltext HTML URL list written to \u001b[34meupmc_fulltext_html_urls.txt\u001b[39m\n", | |
"\u001b[32minfo\u001b[39m: Got XML URLs for 4 out of 4 results\n", | |
"\u001b[32minfo\u001b[39m: Downloading fulltext XML files\n", | |
"\u001b[2K\u001b[1GDownloading files [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m-------------------] 25% (1/4) [0.0s elapsed, eta 0.0]\u001b[2K\u001b[1GDownloading files [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m-------------] 50% (2/4) [0.0s elapsed, eta 0.0]\u001b[2K\u001b[1GDownloading files [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m------] 75% (3/4) [0.0s elapsed, eta 0.0]\u001b[2K\u001b[1GDownloading files [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m] 100% (4/4) [0.1s elapsed, eta 0.0]\n", | |
"\u001b[32minfo\u001b[39m: All downloads succeeded!\n" | |
] | |
} | |
], | |
"source": [ | |
"%getpapers /notebooks rhinocerous" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"." | |
] | |
} | |
], | |
"source": [ | |
"%norma /notebooks rhinocerous" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"running: word([frequencies])[{xpath:@count>20}, {w.stopwords:pmcstop.txt stopwords.txt}]\n", | |
"WS: /tmp_contentmineMagic/rhinocerous 0 [main] DEBUG org.xmlcml.ami2.wordutil.WordSetWrapper - symbol expands to: /org/xmlcml/ami2/wordutil/pmcstop.txt\n", | |
"4 [main] DEBUG org.xmlcml.ami2.wordutil.WordSetWrapper - symbol expands to: /org/xmlcml/ami2/wordutil/stopwords.txt\n", | |
".filter: word([frequencies])[{xpath:@count>20}, {w.stopwords:pmcstop.txt stopwords.txt}]\n", | |
"frequenciesfrequencies5461 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
".5464 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"5467 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"5470 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"summary: word([frequencies])[{xpath:@count>20}, {w.stopwords:pmcstop.txt stopwords.txt}]\n", | |
"C: frequencies.running: sequence([dnaprimer])[]\n", | |
".filter: sequence([dnaprimer])[]\n", | |
"dnaprimerdnaprimer6546 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
".6549 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"6550 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"6552 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"summary: sequence([dnaprimer])[]\n", | |
"C: dnaprimer.running: gene([human])[]\n", | |
".filter: gene([human])[]\n", | |
"humanhuman8330 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
".8332 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"8334 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"8336 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"summary: gene([human])[]\n", | |
"C: human.running: species([genus])[]\n", | |
"SP: /tmp_contentmineMagic/rhinocerous.filter: species([genus])[]\n", | |
"genusgenus10699 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
".10701 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"10703 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"10706 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"summary: species([genus])[]\n", | |
"C: genus.running: species([binomial])[]\n", | |
"SP: /tmp_contentmineMagic/rhinocerous.filter: species([binomial])[]\n", | |
"binomialbinomial12458 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
".12460 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"12462 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"12464 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n", | |
"summary: species([binomial])[]\n", | |
"C: binomial.12535 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n", | |
"12540 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n", | |
"12545 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n", | |
"12549 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n", | |
"12553 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n" | |
] | |
} | |
], | |
"source": [ | |
"%cmine /notebooks rhinocerous" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"contentmineMagic Contentmine Magic.ipynb Untitled.ipynb\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!ls" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"rhinocerous\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!ls contentmineMagic/" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"commonest.dataTables.html sequence.dnaprimer.count.xml\r\n", | |
"count.dataTables.html\t sequence.dnaprimer.documents.xml\r\n", | |
"entries.dataTables.html sequence.dnaprimer.snippets.xml\r\n", | |
"eupmc_fulltext_html_urls.txt species.binomial.count.xml\r\n", | |
"eupmc_results.json\t species.binomial.documents.xml\r\n", | |
"full.dataTables.html\t species.binomial.snippets.xml\r\n", | |
"gene.human.count.xml\t species.genus.count.xml\r\n", | |
"gene.human.documents.xml species.genus.documents.xml\r\n", | |
"gene.human.snippets.xml species.genus.snippets.xml\r\n", | |
"PMC2213592\t\t word.frequencies.count.xml\r\n", | |
"PMC4698820\t\t word.frequencies.documents.xml\r\n", | |
"PMC4730296\t\t word.frequencies.snippets.xml\r\n", | |
"PMC4788244\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!ls contentmineMagic/rhinocerous/\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Where Next?\n", | |
"\n", | |
"Setting up the shared directories is a bit of a fudge - is there a better way?\n", | |
"\n", | |
"The magics need to be better defined, allowing for the passing of appropriate command line switches, e.g. in `getpapers`, via [`core.magic_arguments`](http://ipython.readthedocs.io/en/stable/api/generated/IPython.core.magic_arguments.html?), for example.\n", | |
"\n", | |
"Need to consider cell magics so we can write a pipeline along the lines of something like:\n", | |
"\n", | |
" %%contentmine /notebooks rhinocerous\n", | |
" getpapers\n", | |
" norma\n", | |
" cmine\n", | |
"\n", | |
"A proper install package needs putting together.\n", | |
"\n", | |
"The magics need generalising up to a generic `docker magic`, and then perhaps back down to magics for a particular application?\n", | |
"\n", | |
"More info: [Defining custom magics](http://ipython.readthedocs.io/en/stable/config/custommagics.html)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.4.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
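The three line magics in the notebook share one pattern: split the magic line into a mount directory and a search term, then run a contentmine tool in a throwaway container against a shared directory. A minimal sketch of that pattern, extended to the `%%contentmine` pipeline idea from the "Where Next?" cell, might look like the following. The names `parse_magic_line`, `tool_args`, `build_command` and `pipeline_commands` are hypothetical, the host path is passed in rather than looked up through the Docker API, the `--tty --interactive` flags are left out, and the commands are only built as argument lists, not run.

```python
import shlex

IMAGE = "psychemedia/contentmine"   # image named in the notebook
WORKDIR = "/tmp_contentmineMagic"   # mount point used by the notebook's magics


def parse_magic_line(line):
    """Split '/MOUNTDIR SEARCHTERM ...' into (mountdir, search term)."""
    parts = line.strip().split()
    if len(parts) < 2:
        raise ValueError("expected: /MOUNTDIR SEARCHTERM")
    return parts[0], " ".join(parts[1:])


def tool_args(tool, query):
    """Arguments for each contentmine tool, mirroring the notebook's magics."""
    project = "{}/{}".format(WORKDIR, query)
    if tool == "getpapers":
        return ["getpapers", "-q", query, "-o", project, "-x"]
    if tool == "norma":
        return ["norma", "--project", project, "-i", "fulltext.xml",
                "-o", "scholarly.html", "--transform", "nlm2html"]
    if tool == "cmine":
        return ["cmine", project]
    raise ValueError("unknown tool: {}".format(tool))


def build_command(host_dir, tool, query):
    """Build a `docker run` argument list (hypothetical helper, not executed)."""
    volume = "{}/contentmineMagic:{}".format(host_dir, WORKDIR)
    return ["docker", "run", "--rm", "-v", volume, IMAGE] + tool_args(tool, query)


def pipeline_commands(host_dir, query, cell):
    """One command per non-blank line of a cell body such as 'getpapers\\nnorma'."""
    return [build_command(host_dir, t.strip(), query)
            for t in cell.splitlines() if t.strip()]


if __name__ == "__main__":
    mountdir, query = parse_magic_line("/notebooks rhinocerous")
    # "/host/notebooks" stands in for the path dockerMagicGetPath would return
    for cmd in pipeline_commands("/host/notebooks", query, "getpapers\nnorma\ncmine"):
        print(" ".join(shlex.quote(a) for a in cmd))
```

Building argument lists rather than shell strings means each list could be handed to `subprocess.run` without further quoting; a real cell magic would still need registering with IPython, which this sketch does not attempt.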
#Start the notebook server linked to the contentmine container as follows: | |
#docker-compose up -d | |
notebook: | |
  image: jupyter/notebook | |
  ports: | |
    - "8899:8888" | |
  volumes_from: | |
    - contentmineshare | |
  volumes: | |
    - ./notebooks:/notebooks | |
#    - ./contentmine:/cmstore | |
    - /var/run/docker.sock:/var/run/docker.sock | |
  privileged: true | |
#  links: | |
#    - contentmine:contentmine | |
contentmineshare: | |
  image: psychemedia/contentmine | |
  volumes: | |
    - /contentmine |
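Because the notebook container talks to the host's Docker daemon through the mounted `/var/run/docker.sock`, any container it starts is a sibling rather than a child, so the `-v` paths it passes have to be host paths. The notebook resolves these from the `Mounts` entries that the Docker API reports for the notebook container. A minimal sketch of that lookup as a pure function follows; `host_path_for` is a hypothetical name, and the sample mount records are made up for illustration, shaped like the `Source`/`Destination` entries the notebook code reads.

```python
def host_path_for(mounts, mountdir):
    """Return the host-side source of the mount at mountdir, or '' if none."""
    for m in mounts:
        if m.get("Destination") == mountdir:
            return m.get("Source", "")
    return ""


if __name__ == "__main__":
    # Illustrative records only; real ones would come from the Docker API
    sample_mounts = [
        {"Source": "/home/user/project/notebooks", "Destination": "/notebooks"},
        {"Source": "/var/run/docker.sock", "Destination": "/var/run/docker.sock"},
    ]
    print(host_path_for(sample_mounts, "/notebooks"))
    print(repr(host_path_for(sample_mounts, "/missing")))
```

Returning an empty string for a missing mount matches the check the magics already make before printing "No container mounted there?".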