@psychemedia
Last active June 8, 2016 15:55
A first attempt at exploring some contentmine command IPython magics for use in Jupyter notebooks.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Contentmine IPython Magic\n",
"\n",
"*A doodle by Tony Hirst / @psychemedia*\n",
"\n",
"This is a first, very weak, attempt at putting together some `contentmine` IPython magics.\n",
"\n",
"The magics are based on the following conditions:\n",
"\n",
"- an IPython notebook running in Docker container in privileged mode using Python 3.3+ and with a specified volume mlunted in the container (in the example, I use `/notebooks`;\n",
"- the existence of a public Docker image `psychemedia/contentmine` containing the *contentmine* applications: `getpapers`, `norma`, `cmine`;\n",
"\n",
"There are two ideas at the heart of the demo:\n",
"\n",
"1. that we can run commands in Docker containers as commandline commands and get any results files back via a shared folder;\n",
"2. that we can run Docker containers from inside a container (for example, as a commandline command from a code cell in a Jupyter notebook running in a container).\n",
"\n",
"\n",
"As an example, this notebook was run in a container fired up from the following `docker-compose.yaml` file launched with the command `docker-compose up -d`:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"````\n",
"notebook:\n",
" image: jupyter/notebook\n",
" ports:\n",
" - \"8899:8888\"\n",
" volumes:\n",
" - ./notebooks:/notebooks\n",
" - /var/run/docker.sock:/var/run/docker.sock\n",
" privileged: true\n",
"````"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"State is passed between the command line Docker container and the notebook container by mounting a specified directory in the command line container on top of a specified directory in the notebook container. Files persist in the notebook container directory; the temporary command line container can writes files to, and read files from this directory and its subdirectories.\n",
"\n",
"----\n",
"\n",
"Install the magics:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied (use --upgrade to upgrade): docker-py in /usr/local/lib/python3.4/dist-packages\n",
"Requirement already satisfied (use --upgrade to upgrade): backports.ssl-match-hostname>=3.5 in /usr/local/lib/python3.4/dist-packages (from docker-py)\n",
"Requirement already satisfied (use --upgrade to upgrade): six>=1.4.0 in /usr/local/lib/python3.4/dist-packages (from docker-py)\n",
"Requirement already satisfied (use --upgrade to upgrade): websocket-client>=0.32.0 in /usr/local/lib/python3.4/dist-packages (from docker-py)\n",
"Requirement already satisfied (use --upgrade to upgrade): requests>=2.5.2 in /usr/local/lib/python3.4/dist-packages (from docker-py)\n",
"\u001b[33mYou are using pip version 8.0.2, however version 8.1.1 is available.\n",
"You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"from IPython.core.magic import Magics, magics_class, line_magic\n",
"from IPython.core.magic_arguments import (argument, magic_arguments,\n",
" parse_argstring)\n",
"import shutil\n",
"import shlex\n",
"import os\n",
"\n",
"!pip3 install docker-py\n",
"import docker\n",
"\n",
"#Should do this as part of init\n",
"if not shutil.which(\"docker\"):\n",
" !apt-get update && apt-get install -y docker.io\n",
"\n",
" \n",
"@magics_class\n",
"class DockerMagics(Magics):\n",
" \n",
" #def dockerMagicGetPath(container,mountdir):\n",
" def dockerMagicGetPath(self,mountdir):\n",
" cli =docker.Client(base_url='unix://var/run/docker.sock')\n",
" #if cli.containers(filters={'name':container}):\n",
" # containerData=cli.inspect_container(container)\n",
" containers=cli.containers(filters={'id':os.environ['HOSTNAME']})\n",
" if containers==[]:\n",
" return ''\n",
" else:\n",
" c=[x['Source'] for x in containers[0]['Mounts'] if 'Destination' in x and x['Destination']==mountdir ]\n",
" return c[0]\n",
"\n",
" #! docker run -v /Users/ajh59/tmp/notebookdockercli/notebooks/downloads:/contentmineself --tty --interactive psychemedia/contentmine getpapers -q rhinocerous -o /contentmineself/rhinocerous -x\n",
" return ''\n",
"\n",
" #getpapers -q rhinocerous -o /contentmine/rhinocerous -x\n",
" @line_magic\n",
" def getpapers(self,line):\n",
" \"\"\" Runs a contentmine command: /MOUNTDIR SEARCHTERM\n",
" %getpapers /notebooks rhinocerous\n",
" \"\"\"\n",
" mount=self.dockerMagicGetPath(line.strip().split()[0])\n",
" if mount=='':\n",
" print('No container mounted there?')\n",
" return\n",
" Q=' '.join(line.strip().split()[1:])\n",
" QD=shlex.quote(Q)\n",
" DD='{}{}'.format(mount,'/contentmineMagic')\n",
" ! docker run --rm -v {DD}:/tmp_contentmineMagic --tty --interactive psychemedia/contentmine getpapers -q {Q} -o /tmp_contentmineMagic/{QD} -x\n",
"\n",
" #norma --project /contentmine/aardvark -i fulltext.xml -o scholarly.html --transform nlm2html\n",
" @line_magic\n",
" def norma(self,line):\n",
" \"\"\"\n",
" %norma /notebooks rhinocerous\n",
" \"\"\"\n",
" mount=self.dockerMagicGetPath(line.strip().split()[0])\n",
" if mount=='':\n",
" print('No container mounted there?')\n",
" return\n",
"\n",
" Q=' '.join(line.strip().split()[1:])\n",
" QD=shlex.quote(Q)\n",
" DD='{}{}'.format(mount,'/contentmineMagic')\n",
" ! docker run --rm -v {DD}:/tmp_contentmineMagic --tty --interactive psychemedia/contentmine norma --project /tmp_contentmineMagic/{QD} -i fulltext.xml -o scholarly.html --transform nlm2html\n",
" \n",
" #./contentmine cmine /contentmine/aardvark\n",
" @line_magic\n",
" def cmine(self,line):\n",
" \"\"\"\n",
" %cmine /notebooks rhinocerous\n",
" \"\"\"\n",
" mount=self.dockerMagicGetPath(line.strip().split()[0])\n",
" if mount=='':\n",
" print('No container mounted there?')\n",
" return\n",
"\n",
" Q=' '.join(line.strip().split()[1:])\n",
" QD=shlex.quote(Q)\n",
" DD='{}{}'.format(mount,'/contentmineMagic')\n",
" ! docker run --rm -v {DD}:/tmp_contentmineMagic --tty --interactive psychemedia/contentmine cmine /tmp_contentmineMagic/{QD}"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ip = get_ipython()\n",
"ip.register_magics(DockerMagics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"\n",
"Now for a demo..."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Contentmine Magic.ipynb Untitled.ipynb\r\n"
]
}
],
"source": [
"!rm -r contentmineMagic/\n",
"!ls"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32minfo\u001b[39m: Searching using eupmc API\n",
"\u001b[32minfo\u001b[39m: Found 4 open access results\n",
"\u001b[2K\u001b[1GRetrieving results [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m] 100% (eta 0.0s)\n",
"\u001b[32minfo\u001b[39m: Done collecting results\n",
"\u001b[32minfo\u001b[39m: Saving result metadata\n",
"\u001b[32minfo\u001b[39m: Full EUPMC result metadata written to \u001b[34meupmc_results.json\u001b[39m\n",
"\u001b[32minfo\u001b[39m: Individual EUPMC result metadata records written\n",
"\u001b[32minfo\u001b[39m: Extracting fulltext HTML URL list (may not be available for all articles)\n",
"\u001b[32minfo\u001b[39m: Fulltext HTML URL list written to \u001b[34meupmc_fulltext_html_urls.txt\u001b[39m\n",
"\u001b[32minfo\u001b[39m: Got XML URLs for 4 out of 4 results\n",
"\u001b[32minfo\u001b[39m: Downloading fulltext XML files\n",
"\u001b[2K\u001b[1GDownloading files [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m-------------------] 25% (1/4) [0.0s elapsed, eta 0.0]\u001b[2K\u001b[1GDownloading files [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m-------------] 50% (2/4) [0.0s elapsed, eta 0.0]\u001b[2K\u001b[1GDownloading files [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m------] 75% (3/4) [0.0s elapsed, eta 0.0]\u001b[2K\u001b[1GDownloading files [\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m\u001b[32m=\u001b[39m] 100% (4/4) [0.1s elapsed, eta 0.0]\n",
"\u001b[32minfo\u001b[39m: All downloads succeeded!\n"
]
}
],
"source": [
"%getpapers /notebooks rhinocerous"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"."
]
}
],
"source": [
"%norma /notebooks rhinocerous"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"running: word([frequencies])[{xpath:@count>20}, {w.stopwords:pmcstop.txt stopwords.txt}]\n",
"WS: /tmp_contentmineMagic/rhinocerous 0 [main] DEBUG org.xmlcml.ami2.wordutil.WordSetWrapper - symbol expands to: /org/xmlcml/ami2/wordutil/pmcstop.txt\n",
"4 [main] DEBUG org.xmlcml.ami2.wordutil.WordSetWrapper - symbol expands to: /org/xmlcml/ami2/wordutil/stopwords.txt\n",
".filter: word([frequencies])[{xpath:@count>20}, {w.stopwords:pmcstop.txt stopwords.txt}]\n",
"frequenciesfrequencies5461 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
".5464 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"5467 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"5470 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"summary: word([frequencies])[{xpath:@count>20}, {w.stopwords:pmcstop.txt stopwords.txt}]\n",
"C: frequencies.running: sequence([dnaprimer])[]\n",
".filter: sequence([dnaprimer])[]\n",
"dnaprimerdnaprimer6546 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
".6549 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"6550 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"6552 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"summary: sequence([dnaprimer])[]\n",
"C: dnaprimer.running: gene([human])[]\n",
".filter: gene([human])[]\n",
"humanhuman8330 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
".8332 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"8334 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"8336 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"summary: gene([human])[]\n",
"C: human.running: species([genus])[]\n",
"SP: /tmp_contentmineMagic/rhinocerous.filter: species([genus])[]\n",
"genusgenus10699 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
".10701 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"10703 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"10706 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"summary: species([genus])[]\n",
"C: genus.running: species([binomial])[]\n",
"SP: /tmp_contentmineMagic/rhinocerous.filter: species([binomial])[]\n",
"binomialbinomial12458 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
".12460 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"12462 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"12464 [main] WARN org.xmlcml.cmine.util.CMineGlobber - might delete system files: IGNORED\n",
"summary: species([binomial])[]\n",
"C: binomial.12535 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n",
"12540 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n",
"12545 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n",
"12549 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n",
"12553 [main] WARN org.xmlcml.ami2.plugins.ResultsAnalysis - Null pluginOption\n"
]
}
],
"source": [
"%cmine /notebooks rhinocerous"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"contentmineMagic Contentmine Magic.ipynb Untitled.ipynb\r\n"
]
}
],
"source": [
"!ls"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"rhinocerous\r\n"
]
}
],
"source": [
"!ls contentmineMagic/"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"commonest.dataTables.html sequence.dnaprimer.count.xml\r\n",
"count.dataTables.html\t sequence.dnaprimer.documents.xml\r\n",
"entries.dataTables.html sequence.dnaprimer.snippets.xml\r\n",
"eupmc_fulltext_html_urls.txt species.binomial.count.xml\r\n",
"eupmc_results.json\t species.binomial.documents.xml\r\n",
"full.dataTables.html\t species.binomial.snippets.xml\r\n",
"gene.human.count.xml\t species.genus.count.xml\r\n",
"gene.human.documents.xml species.genus.documents.xml\r\n",
"gene.human.snippets.xml species.genus.snippets.xml\r\n",
"PMC2213592\t\t word.frequencies.count.xml\r\n",
"PMC4698820\t\t word.frequencies.documents.xml\r\n",
"PMC4730296\t\t word.frequencies.snippets.xml\r\n",
"PMC4788244\r\n"
]
}
],
"source": [
"!ls contentmineMagic/rhinocerous/\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where Next?\n",
"\n",
"Setting up the shared directories is a bit of a fudge - is there a better way?\n",
"\n",
"The magics need to be better defined, allowing for the passing of appropriate command line switches, e.g. in `getpapers`, via [`core.magic_arguments`](http://ipython.readthedocs.io/en/stable/api/generated/IPython.core.magic_arguments.html?), for example.\n",
"\n",
"Need to consider cell magics so we can write a pipeline along the lines of something like:\n",
"\n",
" %%contentmine /notebooks rhinocerous\n",
" getpapers\n",
" norma\n",
" cmine\n",
"\n",
"A proper install package needs putting together.\n",
"\n",
"The magics need generalising up to a generic `docker magic`, and then perhaps back down to magics for a particular application?\n",
"\n",
"More info: [Defining custom magics](http://ipython.readthedocs.io/en/stable/config/custommagics.html)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
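
The notebook's "Where Next?" cell floats a `%%contentmine` cell magic that runs `getpapers`, `norma` and `cmine` in turn. A minimal sketch of the command-building half of that idea follows, in plain Python so it can be tried without IPython or Docker. The `build_pipeline` function, the `STEPS` table and the `host_dir` parameter are hypothetical names, not part of the notebook; the per-step arguments are copied from the line magics, with `--tty --interactive` left out on the assumption that the commands would be run non-interactively.

```python
import shlex

# Per-step argument templates, copied from the line magics in the notebook.
# {project} is the project directory as seen inside the contentmine container.
STEPS = {
    "getpapers": ["getpapers", "-q", "{query}", "-o", "{project}", "-x"],
    "norma": ["norma", "--project", "{project}", "-i", "fulltext.xml",
              "-o", "scholarly.html", "--transform", "nlm2html"],
    "cmine": ["cmine", "{project}"],
}

IMAGE = "psychemedia/contentmine"
CONTAINER_DIR = "/tmp_contentmineMagic"


def build_pipeline(line, cell, host_dir):
    """Turn a '%%contentmine MOUNTDIR SEARCHTERM' cell into docker argv lists.

    host_dir is the host directory to share with the container (a
    hypothetical parameter; the notebook derives it via the Docker API).
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected: MOUNTDIR SEARCHTERM")
    query = " ".join(parts[1:])
    project = "{}/{}".format(CONTAINER_DIR, query)
    commands = []
    for step in cell.split():
        if step not in STEPS:
            raise ValueError("unknown step: {}".format(step))
        args = [a.format(query=query, project=project) for a in STEPS[step]]
        commands.append(["docker", "run", "--rm",
                         "-v", "{}:{}".format(host_dir, CONTAINER_DIR),
                         IMAGE] + args)
    return commands


if __name__ == "__main__":
    cell = "getpapers\nnorma\ncmine\n"
    for argv in build_pipeline("/notebooks rhinocerous", cell,
                               "/host/notebooks/contentmineMagic"):
        # shlex.quote makes each argument safe to paste into a shell
        print(" ".join(shlex.quote(a) for a in argv))
```

Each returned list could be handed to something like `subprocess.call` from inside a cell magic. Building argument lists rather than one command string sidesteps shell quoting for multi-word search terms.
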
#Start the notebook server linked to the contentmine container as follows:
#docker-compose up -d
notebook:
  image: jupyter/notebook
  ports:
    - "8899:8888"
  volumes_from:
    - contentmineshare
  volumes:
    - ./notebooks:/notebooks
#    - ./contentmine:/cmstore
    - /var/run/docker.sock:/var/run/docker.sock
  privileged: true
#  links:
#    - contentmine:contentmine
contentmineshare:
  image: psychemedia/contentmine
  volumes:
    - /contentmine
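
The `volumes` entries in the compose file are what the notebook's `dockerMagicGetPath` method looks up: a container started through the shared Docker socket needs the host-side path behind `/notebooks` in order to mount the same directory. The lookup itself can be sketched on plain data, as below. The `host_path_for` function is a hypothetical helper and the mount records are made-up examples, shaped like the `Source`/`Destination` entries the notebook reads from the container's `Mounts` list.

```python
def host_path_for(mounts, destination):
    """Return the host-side source of the mount at `destination`, or ''."""
    for mount in mounts:
        if mount.get("Destination") == destination:
            return mount.get("Source", "")
    return ""


# Hypothetical mount records; the host paths are invented for illustration.
example_mounts = [
    {"Source": "/home/user/demo/notebooks", "Destination": "/notebooks"},
    {"Source": "/var/run/docker.sock", "Destination": "/var/run/docker.sock"},
]

print(host_path_for(example_mounts, "/notebooks"))
print(repr(host_path_for(example_mounts, "/missing")))
```

Returning an empty string for an unknown mount point matches the check the magics already make before printing "No container mounted there?".
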