@dotsdl
Last active March 31, 2017 21:48
MDAnalysis performance improvements under new topology model
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"import time\n",
"import pandas as pd\n",
"import numpy as np\n",
"import gc\n",
"\n",
"import MDAnalysis as mda\n",
"from MDAnalysis.topology.GROParser import GROParser"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We will benchmark `AtomGroup`s, `ResidueGroup`s, and `SegmentGroup`s in attribute access and assignment of the current development branch of `MDAnalysis`, and our *issue-363* branch of `MDAnalysis`, which uses an entirely new topology system."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"These benchmarks were carried out on a Thinkpad X260 with Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz. We also used:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
"'1.12.0'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"np.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Our systems were vesicle systems using repeats of vesicles from the [vesicle library](https://github.com/Becksteinlab/vesicle_library) publicly hosted on github. We used three systems, with approximately 10 million, 3.5 million, and 1.5 million atoms."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"systems = {'10M' : 'systems/vesicles/10M/system.gro',\n",
" '3.5M' : 'systems/vesicles/3_5M/system.gro',\n",
" '1.5M' : 'systems/vesicles/1_5M/system.gro'}"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Building the topology"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"How long does the GRO parser take? In the new implementation, we don't do any mass guessing, so it's already lighter; we defer guess methods to the `Universe` after the `Topology` has been built and attached. The final result is also a `Topology` object instead of a list of `Atom` objects."
]
},
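{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"To illustrate the difference between the two data layouts (a minimal sketch with made-up attribute names, not the actual `Topology` internals): the old scheme stores one Python object per atom, while the new scheme stores one array per attribute.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# old layout (conceptually): a list of Atom objects, one per atom\n",
"class Atom(object):\n",
"    def __init__(self, name, resindex, mass):\n",
"        self.name, self.resindex, self.mass = name, resindex, mass\n",
"\n",
"atom_list = [Atom('CA', i // 3, 12.011) for i in range(9)]\n",
"\n",
"# new layout (conceptually): one contiguous array per attribute\n",
"names = np.array(['CA'] * 9)\n",
"resindices = np.repeat(np.arange(3), 3)\n",
"masses = np.full(9, 12.011)\n",
"```"
]
},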
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_GROParser(gro, parser):\n",
" \"\"\"Time how long it takes to parse a GRO topology with a given parser.\n",
" \n",
" :Arguments:\n",
" *gro*\n",
" path to the GRO file to parse\n",
" *parser*\n",
" the parser class to use\n",
" \n",
" :Returns:\n",
" *dt*\n",
" total time in seconds required to parse file\n",
" \n",
" \"\"\"\n",
" start = time.time()\n",
" parser(gro).parse()\n",
" dt = time.time() - start\n",
" gc.collect()\n",
" return dt"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_multiple_GROParser(grofiles, iterations, parser):\n",
" \"\"\"Get parse timings for multiple grofiles over multiple iterations.\n",
" \n",
" :Arguments:\n",
" *grofiles*\n",
" dictionary giving as values grofile paths\n",
" *iterations*\n",
" number of timings to do for each gro file\n",
" *parser*\n",
" GRO parser to use\n",
" \n",
" :Returns:\n",
" *data*\n",
" dataframe giving the timings for each run\n",
" \"\"\"\n",
" data = {\n",
" 'system': [],\n",
" 'time': []\n",
" }\n",
"\n",
" for gro in grofiles:\n",
" print \"gro file: {}\".format(gro)\n",
" for i in range(iterations):\n",
" print \"\\riteration: {}\".format(i),\n",
"\n",
" out = time_GROParser(grofiles[gro], parser)\n",
"\n",
" data['system'].append(gro)\n",
" data['time'].append(out)\n",
" print '\\n'\n",
"\n",
" return pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## the old GRO parser"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 1 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 1 \n",
"\n",
"gro file: 10M\n",
"iteration: 1 \n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df = time_multiple_GROParser(systems, iterations=2, parser=GROParser)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>system</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.5M</td>\n",
" <td>25.463573</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.5M</td>\n",
" <td>25.045070</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.5M</td>\n",
" <td>12.297968</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.5M</td>\n",
" <td>12.277537</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10M</td>\n",
" <td>70.155502</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>10M</td>\n",
" <td>69.792134</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" system time\n",
"0 3.5M 25.463573\n",
"1 3.5M 25.045070\n",
"2 1.5M 12.297968\n",
"3 1.5M 12.277537\n",
"4 10M 70.155502\n",
"5 10M 69.792134"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## the new GRO parser "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"\r",
"iteration: 0"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/alter/Library/mdanalysis/MDAnalysis/package/MDAnalysis/topology/guessers.py:56: UserWarning: Failed to guess the mass for the following atom types: G\n",
" \"\".format(', '.join(misses)))\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"iteration: 1 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 1 \n",
"\n",
"gro file: 10M\n",
"iteration: 1 \n",
"\n"
]
}
],
"source": [
"df = time_multiple_GROParser(systems, iterations=2, parser=GROParser)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>system</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.5M</td>\n",
" <td>16.211191</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.5M</td>\n",
" <td>15.910637</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.5M</td>\n",
" <td>7.958214</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.5M</td>\n",
" <td>7.935534</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10M</td>\n",
" <td>45.729256</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>10M</td>\n",
" <td>45.610785</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" system time\n",
"0 3.5M 16.211191\n",
"1 3.5M 15.910637\n",
"2 1.5M 7.958214\n",
"3 1.5M 7.935534\n",
"4 10M 45.729256\n",
"5 10M 45.610785"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Our new parser is about 1.5 times faster. This might not mean much, however, since we made different choices on what it should do."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Our old parser yields a data structure that is about 3.6 GB in memory, and our new parser gives one that is only 1.3 GB. The new `Topology` object is a lot smaller than a list of `Atom`s."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Creating AtomGroups"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Creating `AtomGroup`s in the old implementation requires indexing a list of `Atom` objects. This can get expensive for a large number of atoms."
]
},
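{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As a rough sketch of why this matters (plain Python/numpy, not the actual `AtomGroup` machinery): the old scheme has to pull each `Atom` object out of a Python list one at a time, while in the new scheme a group is essentially just an index array into the topology arrays.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"n_atoms = 1500000\n",
"atom_list = list(range(n_atoms))        # stand-in for a list of Atom objects\n",
"indices = np.arange(3, 574393, 7)\n",
"\n",
"# old scheme (conceptually): one Python-level lookup per selected atom\n",
"subset_old = [atom_list[i] for i in indices]\n",
"\n",
"# new scheme (conceptually): the group just holds the index array itself\n",
"subset_new = indices.copy()\n",
"```"
]
},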
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_AtomGroup_slice(universe, slice_):\n",
" \"\"\"Time how long it takes to slice an AtomGroup out of all atoms in the system.\n",
" \n",
" Parameters\n",
" ----------\n",
" universe\n",
" Universe whose atoms will be sliced\n",
" slice_\n",
" the slice to apply; can also be a fancy or boolean index\n",
" \n",
" :Returns:\n",
" df\n",
" total time in seconds required to create AtomGroup\n",
" \n",
" \"\"\"\n",
" start = time.time()\n",
" universe.atoms[slice_]\n",
" dt = time.time() - start\n",
" return dt"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_multiple_AtomGroup_slice(grofiles, iterations, slice_):\n",
" \"\"\"Get parse timings for multiple grofiles over multiple iterations.\n",
" \n",
" :Arguments:\n",
" grofiles\n",
" dictionary giving as values grofile paths\n",
" iterations\n",
" number of timings to do for each gro file\n",
" slice_\n",
" AtomGroup slicing to use; can be a fancy or boolean index\n",
" \n",
" :Returns:\n",
" data\n",
" dataframe giving the timings for each run\n",
" \"\"\"\n",
" data = {\n",
" 'system': [],\n",
" 'time': []\n",
" }\n",
"\n",
" for gro in grofiles:\n",
" print \"gro file: {}\".format(gro)\n",
" u = mda.Universe(grofiles[gro])\n",
" for i in range(iterations):\n",
" print \"\\riteration: {}\".format(i),\n",
"\n",
" out = time_AtomGroup_slice(u, slice_)\n",
"\n",
" data['system'].append(gro)\n",
" data['time'].append(out)\n",
" print '\\n'\n",
"\n",
" return pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## old implementation"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df = time_multiple_AtomGroup_slice(systems, iterations=10, slice_=np.arange(3, 574393, 7))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.017631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>0.016662</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.017449</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.017631\n",
"10M 0.016662\n",
"3.5M 0.017449"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## new implementation"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
}
],
"source": [
"df = time_multiple_AtomGroup_slice(systems, iterations=10, slice_=np.arange(3, 574393, 7))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.000441</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>0.000424</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.000420</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.000441\n",
"10M 0.000424\n",
"3.5M 0.000420"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"About 40 times faster! Note that since the indexing we chose indexed the same number of atoms for all system sizes, the time it took didn't scale with system size here. Other indexes/slices could be applied, but because this is a fancy index it should be the worst case scenario for speed in our new scheme at any rate."
]
},
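{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The fancy-index-versus-slice distinction is a plain numpy effect: a basic slice produces a view, while a fancy index has to copy the selected elements. A minimal standalone timing sketch (not part of the benchmark above):\n",
"\n",
"```python\n",
"import time\n",
"import numpy as np\n",
"\n",
"arr = np.arange(10000000)\n",
"idx = np.arange(3, 574393, 7)\n",
"\n",
"start = time.time()\n",
"view = arr[::7]                  # basic slice: returns a view, no copy\n",
"print('basic slice: %.6f s' % (time.time() - start))\n",
"\n",
"start = time.time()\n",
"copy = arr[idx]                  # fancy index: copies the selected elements\n",
"print('fancy index: %.6f s' % (time.time() - start))\n",
"```"
]
},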
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Getting attributes"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We often want to get attributes of an `AtomGroup`'s atoms. In the old scheme, this required iterating through the list of `Atom` objects, filling an array with their attribute's values. In our new scheme, the `AtomGroup`'s indices are used to slice the corresponding `TopologyAttr` array. Getting something like `resids` from an `AtomGroup` uses the `AtomGroup`'s indices to slice a translation table in `Topology` to get the corresponding residue indices, and then uses these to slice the `resids` array giving resids for each residue."
]
},
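{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A minimal plain-numpy sketch of this lookup (the array names here are illustrative, not the actual `Topology` attributes):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# translation table: one residue index per atom (9 atoms in 3 residues)\n",
"atom_resindex = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])\n",
"# per-residue attribute: one resid per residue\n",
"resids = np.array([101, 102, 103])\n",
"\n",
"# an AtomGroup is essentially an array of atom indices\n",
"ag_indices = np.array([1, 4, 5, 8])\n",
"\n",
"# atom indices -> residue indices -> resids, all as vectorised slicing\n",
"ag_resids = resids[atom_resindex[ag_indices]]\n",
"print(ag_resids)    # [101 102 103 103]\n",
"```"
]
},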
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_AtomGroup_attr(atomgroup, attribute):\n",
" \"\"\"Time how long it takes to get an attribute of an AtomGroup.\n",
" \n",
" Parameters\n",
" ----------\n",
" atomgroup\n",
" atomgroup to use\n",
" attribute\n",
" attribute to get\n",
" \n",
" :Returns:\n",
" df\n",
" total time in seconds required to get attribute\n",
" \n",
" \"\"\"\n",
" start = time.time()\n",
" getattr(atomgroup, attribute)\n",
" dt = time.time() - start\n",
" return dt"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_multiple_AtomGroup_attr(grofiles, iterations, attribute):\n",
" \"\"\"Get parse timings for multiple grofiles over multiple iterations.\n",
" \n",
" Arguments\n",
" ---------\n",
" grofiles\n",
" dictionary giving as values grofile paths\n",
" iterations\n",
" number of timings to do for each gro file\n",
" attribute\n",
" attribute to get\n",
" \n",
" Returns\n",
" -------\n",
" data\n",
" dataframe giving the timings for each run\n",
" \"\"\"\n",
" data = {\n",
" 'system': [],\n",
" 'time': []\n",
" }\n",
"\n",
" for gro in grofiles:\n",
" print \"gro file: {}\".format(gro)\n",
" u = mda.Universe(grofiles[gro])\n",
" ag = u.atoms\n",
" for i in range(iterations):\n",
" print \"\\riteration: {}\".format(i),\n",
"\n",
" out = time_AtomGroup_attr(ag, attribute)\n",
"\n",
" data['system'].append(gro)\n",
" data['time'].append(out)\n",
" print '\\n'\n",
"\n",
" return pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## old attribute access"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df = time_multiple_AtomGroup_attr(systems, iterations=10, attribute='resnames')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.241742</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>1.538812</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.597206</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.241742\n",
"10M 1.538812\n",
"3.5M 0.597206"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## new attribute access"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
}
],
"source": [
"df = time_multiple_AtomGroup_attr(systems, iterations=10, attribute='resnames')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.037288</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>0.244270</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.076125</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.037288\n",
"10M 0.244270\n",
"3.5M 0.076125"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"About a 6x-8x speedup for accessing attributes."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Setting atom attributes"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_AtomGroup_setattr(atomgroup, attribute, values):\n",
" \"\"\"Time how long it takes to set an attribute of an AtomGroup.\n",
" \n",
" Parameters\n",
" ----------\n",
" atomgroup\n",
" atomgroup to use\n",
" attribute\n",
" attribute to set\n",
" values\n",
" values to set with\n",
" \n",
" :Returns:\n",
" df\n",
" total time in seconds required to set attribute\n",
" \n",
" \"\"\"\n",
" start = time.time()\n",
" setattr(atomgroup, attribute, values)\n",
" dt = time.time() - start\n",
" return dt"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_multiple_AtomGroup_setatomids(grofiles, iterations):\n",
" \"\"\"Get timings for multiple grofiles over multiple iterations.\n",
" \n",
" Arguments\n",
" ---------\n",
" grofiles\n",
" dictionary giving as values grofile paths\n",
" iterations\n",
" number of timings to do for each gro file\n",
" \n",
" Returns\n",
" -------\n",
" data\n",
" dataframe giving the timings for each run\n",
" \"\"\"\n",
" data = {\n",
" 'system': [],\n",
" 'time': []\n",
" }\n",
"\n",
" for gro in grofiles:\n",
" print \"gro file: {}\".format(gro)\n",
" u = mda.Universe(grofiles[gro])\n",
" ag = u.atoms\n",
" for i in range(iterations):\n",
" print \"\\riteration: {}\".format(i),\n",
" \n",
" out = time_AtomGroup_setattr(ag, 'names', ag.names)\n",
"\n",
" data['system'].append(gro)\n",
" data['time'].append(out)\n",
" print '\\n'\n",
"\n",
" return pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## old implementation"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df = time_multiple_AtomGroup_setatomids(systems, iterations=10)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.751237</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>3.789588</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>1.318672</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.751237\n",
"10M 3.789588\n",
"3.5M 1.318672"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## new implementation"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
}
],
"source": [
"df = time_multiple_AtomGroup_setatomids(systems, iterations=10)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.017210</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>0.097012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.035642</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.017210\n",
"10M 0.097012\n",
"3.5M 0.035642"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The new scheme gives about a 40x speedup for setting!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@jandom

jandom commented Dec 22, 2015

Wow, this looks awesome! This benchmark largely covers the "read in one big system" case; how is this expected to perform when reading in many small systems?

@dotsdl

dotsdl commented Dec 28, 2015

@jandom sorry, I didn't see this until now. Since the Topology is a collection of numpy arrays instead of a list of Atom objects, it should perform better for many smaller systems too, and each one also has a smaller memory footprint, since we only store the attributes we need and avoid duplicating data. We already see a decent speedup on parsing a GRO file with this new scheme, but we also omitted guessing from it, so perhaps it's not an entirely fair comparison.

Does that kinda answer your question?

@orbeckst

orbeckst commented Jul 5, 2016

The notebook says that the benchmark systems are not available, but we recently put them on figshare (as also mentioned in the updated README for the vesicle_library):

A set of large vesicle systems, ranging in size from 1.75 M to 10 M particles, is made available under doi:10.6084/m9.figshare.3406708.

@orbeckst

@dotsdl please fix the notebook, as it holds up MDAnalysis/MDAnalysis.github.io#41 (see also MDAnalysis/MDAnalysis.github.io#41 (comment)):

  • fix availability of vesicle library
  • remove stupid json warnings
