@HarryRybacki-zz
Created June 29, 2013 02:40
Benchmarking MongoDB queries in an IPython notebook.
{
"metadata": {
"name": "Sprint - Benchmarking"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Query benchmarking a MongoDB instance within an IPython notebook."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this notebook, I will demonstrate rudamentry benchmarking of a MongoDB populated with mock citation data. Please, feel free to make comments or contact me if you have any recommendations!"
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "First we need to import string for access to ascii characters, random for helping automate things, and pprint for looking at complex dictionary objects."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import string\nimport random\nimport pprint",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "Next, we need to create generator functions for creating strings, full names, titles, and journals. This information will be used to populate our fake articles."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def generate_string(min_size=3, max_size=10):\n \"\"\"generates a string of random upper/lower case letters default length between 3 and 10 chars\"\"\"\n return ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase) \n for x in range(random.randint(min_size, max_size)))",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": "def generate_fullname():\n \"\"\"generates a dict representing a name consisting of a given and family name\"\"\"\n return {\"given\": generate_string() + ' ' + random.choice(string.ascii_uppercase),\n \"family\": generate_string()\n }",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": "def generate_title_or_journal(words=None):\n \"\"\"generates a title or journal based on random words and ending with punctuation\"\"\"\n if words is None:\n num_of_words = random.randint(5, 12)\n return ' '.join(generate_string() for x in range(num_of_words)) + random.choice(['.', '!', '?'])\n \n ",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
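As a quick sanity check, the string generator above can be exercised deterministically by seeding `random`. This is a standalone sketch that restates the notebook's `generate_string` so it runs outside the notebook:

```python
import random
import string

def generate_string(min_size=3, max_size=10):
    """Restated from the notebook: a random mixed-case ASCII string."""
    return ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase)
                   for x in range(random.randint(min_size, max_size)))

# Seeding makes the "random" output repeatable across runs,
# which is handy when a benchmark needs a stable dataset.
random.seed(42)
sample = generate_string()
assert 3 <= len(sample) <= 10   # length falls in the default bounds
assert sample.isalpha()         # letters only, no digits or punctuation
```

Seeding is optional for benchmarking itself, but it makes runs comparable when you regenerate the dataset.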
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "Now, lets get an idea of what kind of output we can expect our random generators to produce."
},
{
"cell_type": "code",
"collapsed": false,
"input": "generate_string()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 5,
"text": "'ExnxxO'"
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": "generate_title_or_journal()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 6,
"text": "'nXLTmxx ewknz mPTkwwVY QCCX wIFyVbo uUbfPnqtD kkVsVVXD iSPiyBHgn!'"
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": "generate_fullname()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 7,
"text": "{'family': 'ewsiTpt', 'given': 'TLefUicJ M'}"
}
],
"prompt_number": 7
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "To limit the universe, lets create an author pool with 500 authors and 100 journals for our articles."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# populate a list with 500 fake authors\nauthor_pool = []\nfor x in range(500):\n author_pool.append(generate_fullname())",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": "# populate a list with 100 fake journals\njournal_pool = []\nfor x in range(100):\n journal_pool.append(generate_title_or_journal())",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "Now that we have our generators and pools set up, we can begin to create fake Scholarly standard CSLs. To do this we need a generator for producing a dict representation of a citation and a generator to produce the Scholarly standard CSL."
},
{
"cell_type": "code",
"collapsed": false,
"input": "def generate_citation():\n \"\"\"generates a fake citation in Citation Style Language format\"\"\"\n # pick between 2 and 5 authors randomly from pool\n num_of_authors = random.randint(2, 5)\n authors = []\n for x in range(num_of_authors):\n authors.append(random.choice(author_pool))\n \n # package and return the citation\n return {\"title\": generate_title_or_journal(),\n \"authors\": authors,\n \"container-title\": random.choice(journal_pool),\n \"date\": random.randint(1955, 2013)\n }",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Scholarly standard [CSL](https://github.com/citation-style-language/schema/blob/master/csl-data.json) consist of a primary citation, a list of citations which it references, and meta-data for tracking and metrics."
},
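That three-part structure can be checked mechanically. A minimal sketch (`is_valid_csl` is a hypothetical helper, not part of the notebook; the key names, including the `meta-data` spelling, follow the notebook's `generate_csl`):

```python
def is_valid_csl(doc):
    """Checks the top-level shape of a Scholarly standard CSL document:
    a primary citation, a list of referenced citations, and meta-data."""
    return (isinstance(doc.get("citation"), dict)
            and isinstance(doc.get("references"), list)
            and isinstance(doc.get("meta-data"), dict))

example = {"citation": {"title": "T.", "authors": [],
                        "container-title": "J.", "date": 2000},
           "references": [],
           "meta-data": {}}
assert is_valid_csl(example)
assert not is_valid_csl({"citation": {}})  # missing references and meta-data
```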
{
"cell_type": "code",
"collapsed": false,
"input": "def generate_csl():\n \"\"\"generates a fake Scholarly standard CSL\"\"\"\n # pick between 2 and 5 references\n num_of_references = random.randint(2, 5)\n references = []\n for x in range(num_of_references):\n references.append(generate_citation())\n \n # package and return CSL\n return {\"citation\": generate_citation(),\n \"references\": references,\n \"meta-data\": {}\n }",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "Now, lets test our CSL generator."
},
{
"cell_type": "code",
"collapsed": false,
"input": "pprint.pprint(generate_csl())",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "{'citation': {'authors': [{'family': 'HJrDn', 'given': 'fsBHqfKEjn I'},\n {'family': 'sMXUNbysMs', 'given': 'mAbCf T'},\n {'family': 'pzt', 'given': 'aRcUPpVMc Y'},\n {'family': 'lAyAMah', 'given': 'NemCpF U'},\n {'family': 'sjaxpYD', 'given': 'FHnslrxmLS F'}],\n 'container-title': 'IyPZyDVaNg TGrFJnpIhn hbwf RmFOPbj jBxOgWpwcH ROZo IaBEgX.',\n 'date': 1996,\n 'title': 'IGDgHNhtgj yojhNrZ vLTR jfVmB IINh.'},\n 'meta-data': {},\n 'references': [{'authors': [{'family': 'zDNPRssZ', 'given': 'NEx Q'},\n {'family': 'FtTW', 'given': 'xCSCCPtNdP V'},\n {'family': 'jdqHFnfQK', 'given': 'BHhiw K'},\n {'family': 'sVWmKmEN', 'given': 'ERz Z'},\n {'family': 'TZSp', 'given': 'yyEI A'}],\n 'container-title': 'YllZ OoonIIDxey oqEqGDJKH dLTV EDnglty wdK Fdec raFPHRuKk rADrXsA vMcKM rrC DvEUXx?',\n 'date': 1958,\n 'title': 'YBa suMfU QmcjXMJ plNcC qscEJHas Msgr fDlq uPc tAYSP?'},\n {'authors': [{'family': 'tIDS', 'given': 'sBqAJO C'},\n {'family': 'XAAbAzk', 'given': 'nGR G'},\n {'family': 'TWHFv', 'given': 'PmMHSeue Z'},\n {'family': 'FAMll', 'given': 'YCzVtyLaua P'},\n {'family': 'OGs', 'given': 'TcYjVf A'}],\n 'container-title': 'ltqVKNOm UVudMWX kYtAVZpgGv WHsz okAw mmjxcvU adTIcW wjmn NfukbsKla!',\n 'date': 1967,\n 'title': 'XNAzF yynbt gOLMpEYds UIvk IkPSXO rUJjAYxz YPQsOmGHz QHf XvmOBpR sEx QPbVHTuCq ulXscfk?'},\n {'authors': [{'family': 'TzndvzQ', 'given': 'ZkKdj E'},\n {'family': 'iSHabLvm', 'given': 'LmnGjximf L'},\n {'family': 'ENpoY', 'given': 'XJRv J'},\n {'family': 'sqgy', 'given': 'CQr M'}],\n 'container-title': 'OASbcEcS FgaVRHgWzf oDujV PwHe ulVyHYGgEV zccikpstAN YVkm DVde GSLinN ojxcRoQ rGSp IfbpFs.',\n 'date': 1971,\n 'title': 'AipYA FNBw qIumehV VVU nKvp MnpSmoYZvm QYqlaGeD eNLpPCjA!'},\n {'authors': [{'family': 'kdjLNKODAb',\n 'given': 'qOhjJGZv S'},\n {'family': 'dmqVg', 'given': 'vuLYMMMR L'}],\n 'container-title': 'aXN RHPQUVuscP JrOAXxTRW BptUDPXR XkdZYYdGc LoAZmL SNw kMXYoBDKA IPWBQRcUO AKK nNRe?',\n 'date': 2002,\n 'title': 'nuy VYI eRBQYkzOBI 
qAMVCsu NjHW AVXbF eci VDP xHFQRLT VZb?'},\n {'authors': [{'family': 'mPd', 'given': 'GPgOto J'},\n {'family': 'oYcLg', 'given': 'mfcA F'},\n {'family': 'xzaXF', 'given': 'BdmDFN F'}],\n 'container-title': 'vNy etDnU tiN owgjx dCiKH NAvMscrIl NnUmxq daO.',\n 'date': 1988,\n 'title': 'XVOqdhCEE puwVr grfbfxmmk yRxpzrwIP rsOntjGVc pMXMDvwPj!'}]}\n"
}
],
"prompt_number": 12
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "Continuing, let's connect to a MongoDB instance and see how long it takes to insert datasets of various sizes."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pymongo import MongoClient\n# setup a connection to the local db\nclient = MongoClient('localhost', 27018)\n# create a new db for the benchmarking\ndb = client.benchmark_test\n# create a new collection to store the articles\narticles = db.articles",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": "def addcsls(niter):\n \"\"\"inserts n number of articles into the collection\"\"\"\n for x in range(niter):\n articles.insert(generate_csl())",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": "runtimes = ''",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": "%%capture runtimes\nfor niter in [10, 100, 1000, 10000, 100000]:\n %timeit addcsls(niter)\n articles.remove()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
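Outside IPython, %timeit's behavior can be approximated with the standard-library timeit module. A sketch using a trivial stand-in workload (no_op below is a placeholder, not the notebook's addcsls, which needs a live MongoDB connection):

```python
import timeit

def no_op():
    # placeholder workload; substitute addcsls(niter) in a live session
    sum(range(100))

# repeat=3 mirrors %timeit's "best of 3"; number controls loops per trial
trials = timeit.repeat(no_op, repeat=3, number=1000)
best = min(trials)  # per-trial total for 1000 calls, in seconds
assert best > 0
assert len(trials) == 3
```

Taking the minimum of several trials, as %timeit does, filters out slowdowns caused by other processes on the machine.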
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "Although stdout isn't pretty, %timeit gives us a good idea of how long each batch took to insert. I'm running these tests on a Thinkpad X230 with an intel i5 and msata drive. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "runtimes.stdout",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 17,
"text": "u'100 loops, best of 3: 8.38 ms per loop\\n10 loops, best of 3: 118 ms per loop\\n1 loops, best of 3: 1.15 s per loop\\n1 loops, best of 3: 11.9 s per loop\\n1 loops, best of 3: 96.6 s per loop\\n'"
}
],
"prompt_number": 17
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "Next, we'll insert a sizeable chunk of fake citations and start running benchmarks on various queries. Note: When running the 32-bit version of MongoDB databases are limited to 2gb. I found that limit to be hit around ~500,000 fake articles. If you're running the 64-bit version, feel free to add more than what I have below."
},
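Rough arithmetic on that limit: if ~500,000 fake articles fill the 32-bit 2 GB cap, each document averages a little over 4 KB on disk, including per-document storage overhead.

```python
# 32-bit MongoDB data files are capped at roughly 2 GB
limit_bytes = 2 * 1024 ** 3
docs_at_limit = 500000  # approximate count observed above
bytes_per_doc = limit_bytes / docs_at_limit
assert 4000 < bytes_per_doc < 4600  # roughly 4.3 KB per fake article
```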
{
"cell_type": "code",
"collapsed": false,
"input": "for x in range(450000):\n articles.insert(generate_csl())\narticles.count()",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 18,
"text": "450000"
}
],
"prompt_number": 18
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "Down to the queries."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Notes: \n\n* MongoDB uses 'dot notation' for iterating through dicts. \n\nE.g. citati on.authors.family will look for:\n\n {\"citation\": {\"authors\": { \"family\" : \"...here...\" } } }\n\n* cell magics e.g.: **%%timeit** *must* be at the top of the cell or they will not function. Yes, even above comments.\n\n* **%%time** vs **%timeit** -- I chose to use %%timeit because it reports the average time of some number of iterations. Although you can manually specify how many iterations via the -n parameter, it's not always necessary. The IPython guys were smart enough to make this scale automatically. Inexpensive actions such as the query below may loop 100,000 times but expensive actions i.g. creating 100,000 articles and inserting them into the DB may only run 3 times."
},
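The dot-notation lookup can be illustrated in plain Python. This is a simplified sketch (resolve_path is a hypothetical helper, not MongoDB's implementation; the list branch mimics how MongoDB descends into arrays of sub-documents):

```python
def resolve_path(doc, path):
    """Resolve a MongoDB-style dotted path against nested dicts/lists,
    returning a list of all matching leaf values."""
    current = [doc]
    for part in path.split('.'):
        nxt = []
        for node in current:
            # descend into arrays the way MongoDB does
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and part in item:
                    nxt.append(item[part])
        current = nxt
    # flatten one trailing level of lists
    flat = []
    for v in current:
        flat.extend(v if isinstance(v, list) else [v])
    return flat

doc = {"citation": {"authors": [{"family": "Doe"}, {"family": "Roe"}]}}
assert resolve_path(doc, "citation.authors.family") == ["Doe", "Roe"]
```

A query like {'citation.authors.family': 'Doe'} matches when 'Doe' appears anywhere among those resolved leaf values.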
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": "First, lets run some author based queries."
},
{
"cell_type": "code",
"collapsed": false,
"input": "%%timeit\n# time a query for one author\none_author = articles.find({'citation.authors.family': author_pool[23]['family']})",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "100000 loops, best of 3: 7.68 us per loop\n"
}
],
"prompt_number": 19
},
{
"cell_type": "code",
"collapsed": false,
"input": "%%timeit\n# time a query for two authors\ntwo_authors = articles.find({\n 'citation.authors.family': author_pool[23]['family'], \n 'citation.authors.family': author_pool[51]['family']\n })",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "100000 loops, best of 3: 7.86 us per loop\n"
}
],
"prompt_number": 20
},
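A subtlety worth calling out: Python dict literals silently keep only the last value for a repeated key, so listing 'citation.authors.family' twice in one query dict actually queries for a single author. MongoDB's $all operator is the way to require several values on the same field:

```python
# Repeating a key in a dict literal keeps only the last binding
query = {'citation.authors.family': 'Smith',
         'citation.authors.family': 'Jones'}
assert query == {'citation.authors.family': 'Jones'}  # 'Smith' was dropped

# The correct formulation requires both family names via $all
query_all = {'citation.authors.family': {'$all': ['Smith', 'Jones']}}
assert len(query_all['citation.authors.family']['$all']) == 2
```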
{
"cell_type": "code",
"collapsed": false,
"input": "%%timeit\n# now, lets query two authors from the primary citation and then a third within the references\nthree_authors = articles.find({\n 'citation.authors.family': author_pool[23]['family'], \n 'citation.authors.family': author_pool[51]['family'],\n 'references.authors.family': author_pool[3]['family']\n })",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "100000 loops, best of 3: 7.86 us per loop\n"
}
],
"prompt_number": 21
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": "Next, lets the search times of title vs ObjectID."
},
{
"cell_type": "code",
"collapsed": false,
"input": "# grab the ObjectID and title from a random article\nsample_article = articles.find_one()\nuid = sample_article['_id']\ntitle = sample_article['citation']['title']",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": "# time a query based on ObjectID\n%timeit articles.find({'_id': uid})",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "100000 loops, best of 3: 7.88 us per loop\n"
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": "# time a query vased on title\n%timeit articles.find({'citation.title': title})",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "100000 loops, best of 3: 7.79 us per loop\n"
}
],
"prompt_number": 24
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": "Finally, let's execute some ranged queries.\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "%%timeit\n# query for all articles writtein in 2000\narticle_query_one = articles.find({'citation.date': 2000 } )",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "100000 loops, best of 3: 7.61 us per loop\n"
}
],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": "%%timeit\n# query for all articles written after 2000\narticle_query_two = articles.find({'citation.date': { \"$gt\": 2000} })",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "100000 loops, best of 3: 7.93 us per loop\n"
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": "%%timeit\n# query for all articles written after 1960 and before 2011\narticle_query_three = articles.find({'citation.date': { '$gte': 1960, '$lte': 2010 }})",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "100000 loops, best of 3: 7.77 us per loop\n"
}
],
"prompt_number": 27
},
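The $gte/$lte pair behaves like an inclusive Python range check. A pure-Python sketch of the same filter over a small in-memory list (articles_mem and in_range are illustrative names, not part of the notebook):

```python
# a handful of documents shaped like the notebook's CSLs
articles_mem = [{"citation": {"date": y}} for y in (1955, 1960, 1985, 2010, 2013)]

def in_range(doc, lo, hi):
    # mirrors {'citation.date': {'$gte': lo, '$lte': hi}} -- both bounds inclusive
    return lo <= doc["citation"]["date"] <= hi

matched = [d["citation"]["date"] for d in articles_mem if in_range(d, 1960, 2010)]
assert matched == [1960, 1985, 2010]  # 1955 and 2013 fall outside the range
```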
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": "After thoughts:"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We were initially concerned with response times for database queries. But, as the results show, even with roughly half of a million articles querying deep within individual citations was rather quick. Not a single query had an average time of over 10 micro seconds. Not too bad."
}
],
"metadata": {}
}
]
}