@dotsdl
Last active March 31, 2017 21:48
MDAnalysis performance improvements under new topology model
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"import time\n",
"import pandas as pd\n",
"import numpy as np\n",
"import gc\n",
"\n",
"import MDAnalysis as mda\n",
"from MDAnalysis.topology.GROParser import GROParser"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We will benchmark `AtomGroup`s, `ResidueGroup`s, and `SegmentGroup`s in attribute access and assignment of the current development branch of `MDAnalysis`, and our *issue-363* branch of `MDAnalysis`, which uses an entirely new topology system."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"These benchmarks were carried out on a Thinkpad X260 with Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz. We also used:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/plain": [
"'1.12.0'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"np.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Our systems were vesicle systems using repeats of vesicles from the [vesicle library](https://github.com/Becksteinlab/vesicle_library) publicly hosted on github. We used three systems, with approximately 10 million, 3.5 million, and 1.5 million atoms."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"systems = {'10M' : 'systems/vesicles/10M/system.gro',\n",
" '3.5M' : 'systems/vesicles/3_5M/system.gro',\n",
" '1.5M' : 'systems/vesicles/1_5M/system.gro'}"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Building the topology"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"How long does the GRO parser take? In the new implementation, we don't do any mass guessing, so it's already lighter; we defer guess methods to the `Universe` after the `Topology` has been built and attached. The final result is also a `Topology` object instead of a list of `Atom` objects."
]
},
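{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"To illustrate the difference between the two data layouts (a minimal sketch with made-up attribute names, not the actual `Topology` internals): the old scheme stores one Python object per atom, while the new scheme stores one array per attribute.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# old layout (conceptually): a list of Atom objects, one per atom\n",
"class Atom(object):\n",
"    def __init__(self, name, resindex, mass):\n",
"        self.name, self.resindex, self.mass = name, resindex, mass\n",
"\n",
"atom_list = [Atom('CA', i // 3, 12.011) for i in range(9)]\n",
"\n",
"# new layout (conceptually): one contiguous array per attribute\n",
"names = np.array(['CA'] * 9)\n",
"resindices = np.repeat(np.arange(3), 3)\n",
"masses = np.full(9, 12.011)\n",
"```"
]
},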
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_GROParser(gro, parser):\n",
" \"\"\"Time how long it takes to parse a GRO topology with a given parser.\n",
" \n",
" :Arguments:\n",
" *gro*\n",
" path to the GRO file to parse\n",
" *parser*\n",
" the parser class to use\n",
" \n",
" :Returns:\n",
" *dt*\n",
" total time in seconds required to parse file\n",
" \n",
" \"\"\"\n",
" start = time.time()\n",
" parser(gro).parse()\n",
" dt = time.time() - start\n",
" gc.collect()\n",
" return dt"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_multiple_GROParser(grofiles, iterations, parser):\n",
" \"\"\"Get parse timings for multiple grofiles over multiple iterations.\n",
" \n",
" :Arguments:\n",
" *grofiles*\n",
" dictionary giving as values grofile paths\n",
" *iterations*\n",
" number of timings to do for each gro file\n",
" *parser*\n",
" GRO parser to use\n",
" \n",
" :Returns:\n",
" *data*\n",
" dataframe giving the timings for each run\n",
" \"\"\"\n",
" data = {\n",
" 'system': [],\n",
" 'time': []\n",
" }\n",
"\n",
" for gro in grofiles:\n",
" print \"gro file: {}\".format(gro)\n",
" for i in range(iterations):\n",
" print \"\\riteration: {}\".format(i),\n",
"\n",
" out = time_GROParser(grofiles[gro], parser)\n",
"\n",
" data['system'].append(gro)\n",
" data['time'].append(out)\n",
" print '\\n'\n",
"\n",
" return pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## the old GRO parser"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 1 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 1 \n",
"\n",
"gro file: 10M\n",
"iteration: 1 \n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df = time_multiple_GROParser(systems, iterations=2, parser=GROParser)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>system</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.5M</td>\n",
" <td>25.463573</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.5M</td>\n",
" <td>25.045070</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.5M</td>\n",
" <td>12.297968</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.5M</td>\n",
" <td>12.277537</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10M</td>\n",
" <td>70.155502</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>10M</td>\n",
" <td>69.792134</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" system time\n",
"0 3.5M 25.463573\n",
"1 3.5M 25.045070\n",
"2 1.5M 12.297968\n",
"3 1.5M 12.277537\n",
"4 10M 70.155502\n",
"5 10M 69.792134"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## the new GRO parser "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"\r",
"iteration: 0"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/alter/Library/mdanalysis/MDAnalysis/package/MDAnalysis/topology/guessers.py:56: UserWarning: Failed to guess the mass for the following atom types: G\n",
" \"\".format(', '.join(misses)))\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"iteration: 1 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 1 \n",
"\n",
"gro file: 10M\n",
"iteration: 1 \n",
"\n"
]
}
],
"source": [
"df = time_multiple_GROParser(systems, iterations=2, parser=GROParser)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>system</th>\n",
" <th>time</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.5M</td>\n",
" <td>16.211191</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.5M</td>\n",
" <td>15.910637</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.5M</td>\n",
" <td>7.958214</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.5M</td>\n",
" <td>7.935534</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10M</td>\n",
" <td>45.729256</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>10M</td>\n",
" <td>45.610785</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" system time\n",
"0 3.5M 16.211191\n",
"1 3.5M 15.910637\n",
"2 1.5M 7.958214\n",
"3 1.5M 7.935534\n",
"4 10M 45.729256\n",
"5 10M 45.610785"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Our new parser is about 1.5 times faster. This might not mean much, however, since we made different choices on what it should do."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Our old parser yields a data structure that is about 3.6 GB in memory, and our new parser gives one that is only 1.3 GB. The new `Topology` object is a lot smaller than a list of `Atom`s."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Creating AtomGroups"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Creating `AtomGroup`s in the old implementation requires indexing a list of `Atom` objects. This can get expensive for a large number of atoms."
]
},
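{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"As a rough sketch of why this matters (plain Python/numpy, not the actual `AtomGroup` machinery): the old scheme has to pull each `Atom` object out of a Python list one at a time, while in the new scheme a group is essentially just an index array into the topology arrays.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"n_atoms = 1500000\n",
"atom_list = list(range(n_atoms))        # stand-in for a list of Atom objects\n",
"indices = np.arange(3, 574393, 7)\n",
"\n",
"# old scheme (conceptually): one Python-level lookup per selected atom\n",
"subset_old = [atom_list[i] for i in indices]\n",
"\n",
"# new scheme (conceptually): the group just holds the index array itself\n",
"subset_new = indices.copy()\n",
"```"
]
},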
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_AtomGroup_slice(universe, slice_):\n",
" \"\"\"Time how long it takes to slice an AtomGroup out of all atoms in the system.\n",
" \n",
" Parameters\n",
" ----------\n",
" universe\n",
" Universe whose atoms will be sliced\n",
" slice_\n",
" the slice to apply; can also be a fancy or boolean index\n",
" \n",
" :Returns:\n",
" df\n",
" total time in seconds required to create AtomGroup\n",
" \n",
" \"\"\"\n",
" start = time.time()\n",
" universe.atoms[slice_]\n",
" dt = time.time() - start\n",
" return dt"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_multiple_AtomGroup_slice(grofiles, iterations, slice_):\n",
" \"\"\"Get parse timings for multiple grofiles over multiple iterations.\n",
" \n",
" :Arguments:\n",
" grofiles\n",
" dictionary giving as values grofile paths\n",
" iterations\n",
" number of timings to do for each gro file\n",
" slice_\n",
" AtomGroup slicing to use; can be a fancy or boolean index\n",
" \n",
" :Returns:\n",
" data\n",
" dataframe giving the timings for each run\n",
" \"\"\"\n",
" data = {\n",
" 'system': [],\n",
" 'time': []\n",
" }\n",
"\n",
" for gro in grofiles:\n",
" print \"gro file: {}\".format(gro)\n",
" u = mda.Universe(grofiles[gro])\n",
" for i in range(iterations):\n",
" print \"\\riteration: {}\".format(i),\n",
"\n",
" out = time_AtomGroup_slice(u, slice_)\n",
"\n",
" data['system'].append(gro)\n",
" data['time'].append(out)\n",
" print '\\n'\n",
"\n",
" return pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## old implementation"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df = time_multiple_AtomGroup_slice(systems, iterations=10, slice_=np.arange(3, 574393, 7))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.017631</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>0.016662</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.017449</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.017631\n",
"10M 0.016662\n",
"3.5M 0.017449"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## new implementation"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
}
],
"source": [
"df = time_multiple_AtomGroup_slice(systems, iterations=10, slice_=np.arange(3, 574393, 7))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.000441</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>0.000424</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.000420</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.000441\n",
"10M 0.000424\n",
"3.5M 0.000420"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"About 40 times faster! Note that since the indexing we chose indexed the same number of atoms for all system sizes, the time it took didn't scale with system size here. Other indexes/slices could be applied, but because this is a fancy index it should be the worst case scenario for speed in our new scheme at any rate."
]
},
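{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The fancy-index-versus-slice distinction is a plain numpy effect: a basic slice produces a view, while a fancy index has to copy the selected elements. A minimal standalone timing sketch (not part of the benchmark above):\n",
"\n",
"```python\n",
"import time\n",
"import numpy as np\n",
"\n",
"arr = np.arange(10000000)\n",
"idx = np.arange(3, 574393, 7)\n",
"\n",
"start = time.time()\n",
"view = arr[::7]                  # basic slice: returns a view, no copy\n",
"print('basic slice: %.6f s' % (time.time() - start))\n",
"\n",
"start = time.time()\n",
"copy = arr[idx]                  # fancy index: copies the selected elements\n",
"print('fancy index: %.6f s' % (time.time() - start))\n",
"```"
]
},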
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Getting attributes"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We often want to get attributes of an `AtomGroup`'s atoms. In the old scheme, this required iterating through the list of `Atom` objects, filling an array with their attribute's values. In our new scheme, the `AtomGroup`'s indices are used to slice the corresponding `TopologyAttr` array. Getting something like `resids` from an `AtomGroup` uses the `AtomGroup`'s indices to slice a translation table in `Topology` to get the corresponding residue indices, and then uses these to slice the `resids` array giving resids for each residue."
]
},
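{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A minimal plain-numpy sketch of this lookup (the array names here are illustrative, not the actual `Topology` attributes):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# translation table: one residue index per atom (9 atoms in 3 residues)\n",
"atom_resindex = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])\n",
"# per-residue attribute: one resid per residue\n",
"resids = np.array([101, 102, 103])\n",
"\n",
"# an AtomGroup is essentially an array of atom indices\n",
"ag_indices = np.array([1, 4, 5, 8])\n",
"\n",
"# atom indices -> residue indices -> resids, all as vectorised slicing\n",
"ag_resids = resids[atom_resindex[ag_indices]]\n",
"print(ag_resids)    # [101 102 103 103]\n",
"```"
]
},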
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_AtomGroup_attr(atomgroup, attribute):\n",
" \"\"\"Time how long it takes to get an attribute of an AtomGroup.\n",
" \n",
" Parameters\n",
" ----------\n",
" atomgroup\n",
" atomgroup to use\n",
" attribute\n",
" attribute to get\n",
" \n",
" :Returns:\n",
" df\n",
" total time in seconds required to get attribute\n",
" \n",
" \"\"\"\n",
" start = time.time()\n",
" getattr(atomgroup, attribute)\n",
" dt = time.time() - start\n",
" return dt"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_multiple_AtomGroup_attr(grofiles, iterations, attribute):\n",
" \"\"\"Get parse timings for multiple grofiles over multiple iterations.\n",
" \n",
" Arguments\n",
" ---------\n",
" grofiles\n",
" dictionary giving as values grofile paths\n",
" iterations\n",
" number of timings to do for each gro file\n",
" attribute\n",
" attribute to get\n",
" \n",
" Returns\n",
" -------\n",
" data\n",
" dataframe giving the timings for each run\n",
" \"\"\"\n",
" data = {\n",
" 'system': [],\n",
" 'time': []\n",
" }\n",
"\n",
" for gro in grofiles:\n",
" print \"gro file: {}\".format(gro)\n",
" u = mda.Universe(grofiles[gro])\n",
" ag = u.atoms\n",
" for i in range(iterations):\n",
" print \"\\riteration: {}\".format(i),\n",
"\n",
" out = time_AtomGroup_attr(ag, attribute)\n",
"\n",
" data['system'].append(gro)\n",
" data['time'].append(out)\n",
" print '\\n'\n",
"\n",
" return pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## old attribute access"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df = time_multiple_AtomGroup_attr(systems, iterations=10, attribute='resnames')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.241742</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>1.538812</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.597206</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.241742\n",
"10M 1.538812\n",
"3.5M 0.597206"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## new attribute access"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
}
],
"source": [
"df = time_multiple_AtomGroup_attr(systems, iterations=10, attribute='resnames')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.037288</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>0.244270</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.076125</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.037288\n",
"10M 0.244270\n",
"3.5M 0.076125"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"About a 6x-8x speedup for accessing attributes."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Setting atom attributes"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_AtomGroup_setattr(atomgroup, attribute, values):\n",
" \"\"\"Time how long it takes to set an attribute of an AtomGroup.\n",
" \n",
" Parameters\n",
" ----------\n",
" atomgroup\n",
" atomgroup to use\n",
" attribute\n",
" attribute to set\n",
" values\n",
" values to set with\n",
" \n",
" :Returns:\n",
" df\n",
" total time in seconds required to set attribute\n",
" \n",
" \"\"\"\n",
" start = time.time()\n",
" setattr(atomgroup, attribute, values)\n",
" dt = time.time() - start\n",
" return dt"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"def time_multiple_AtomGroup_setatomids(grofiles, iterations):\n",
" \"\"\"Get timings for multiple grofiles over multiple iterations.\n",
" \n",
" Arguments\n",
" ---------\n",
" grofiles\n",
" dictionary giving as values grofile paths\n",
" iterations\n",
" number of timings to do for each gro file\n",
" \n",
" Returns\n",
" -------\n",
" data\n",
" dataframe giving the timings for each run\n",
" \"\"\"\n",
" data = {\n",
" 'system': [],\n",
" 'time': []\n",
" }\n",
"\n",
" for gro in grofiles:\n",
" print \"gro file: {}\".format(gro)\n",
" u = mda.Universe(grofiles[gro])\n",
" ag = u.atoms\n",
" for i in range(iterations):\n",
" print \"\\riteration: {}\".format(i),\n",
" \n",
" out = time_AtomGroup_setattr(ag, 'names', ag.names)\n",
"\n",
" data['system'].append(gro)\n",
" data['time'].append(out)\n",
" print '\\n'\n",
"\n",
" return pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## old implementation"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df = time_multiple_AtomGroup_setatomids(systems, iterations=10)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.751237</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>3.789588</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>1.318672</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.751237\n",
"10M 3.789588\n",
"3.5M 1.318672"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": []
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## new implementation"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gro file: 3.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 1.5M\n",
"iteration: 9 \n",
"\n",
"gro file: 10M\n",
"iteration: 9 \n",
"\n"
]
}
],
"source": [
"df = time_multiple_AtomGroup_setatomids(systems, iterations=10)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1.5M</th>\n",
" <td>0.017210</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10M</th>\n",
" <td>0.097012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3.5M</th>\n",
" <td>0.035642</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time\n",
"system \n",
"1.5M 0.017210\n",
"10M 0.097012\n",
"3.5M 0.035642"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby('system').mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The new scheme gives about a 40x speedup for setting!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@jandom

jandom commented Dec 22, 2015

Wow, this looks awesome! This benchmark largely covers the "read in one big system" case; how is this expected to perform when reading in many small systems?

@dotsdl

dotsdl commented Dec 28, 2015

@jandom sorry, I didn't see this until now. Since the Topology is a collection of numpy arrays instead of a list of Atom objects, it should perform better for many smaller systems too, and each one also has a smaller memory footprint, since we only store the attributes we need and avoid duplicating data. We already see a decent speedup on parsing a GRO file with this new scheme, but we also omitted guessing from it, so perhaps it's not an entirely fair comparison.

Does that kinda answer your question?

@orbeckst

orbeckst commented Jul 5, 2016

The notebook says that the benchmark systems are not available, but we recently put them on figshare (as also mentioned in the updated README for the vesicle_library):

A set of large vesicle systems, ranging in size from 1.75 M to 10 M particles, is made available under doi:10.6084/m9.figshare.3406708.

@orbeckst

@dotsdl please fix the notebook, as it holds up MDAnalysis/MDAnalysis.github.io#41 (see also MDAnalysis/MDAnalysis.github.io#41 (comment)):

  • fix availability of vesicle library
  • remove stupid json warnings
