@RMDK
Created May 8, 2014 04:26
{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "heading",
"source": "Getting Warmed Up:",
"level": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions, corrections, and feedback on these notebooks.\n\n<a target=\"_parent\" href=\"http://rmdk.ca\">Visit my webpage for more</a>. \n\nEmail me: <a target=\"_parent\" href=\"http://rmdk.ca/contact/\">ryan@rmdk.ca</a>\n\n\nI'd love it if you shared this post."
},
{
"metadata": {},
"cell_type": "heading",
"source": "Learning NumPy Basics",
"level": 3
},
{
"metadata": {},
"cell_type": "code",
"input": "import numpy as np\nnp.version.full_version",
"prompt_number": 66,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 66,
"metadata": {},
"text": "'1.8.0'"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "a = np.array([0,1,2,3,4, 5])\na",
"prompt_number": 48,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 48,
"metadata": {},
"text": "array([0, 1, 2, 3, 4, 5])"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Check the number of dimensions and the shape of the array."
},
{
"metadata": {},
"cell_type": "code",
"input": "print a.ndim\nprint a.shape",
"prompt_number": 49,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1\n(6,)\n"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Transform array into 2D matrix."
},
{
"metadata": {},
"cell_type": "code",
"input": "b = a.reshape((3,2)) # rows, columns\nb",
"prompt_number": 50,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 50,
"metadata": {},
"text": "array([[0, 1],\n [2, 3],\n [4, 5]])"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Note that `reshape` returns a view of the original data whenever it can, to avoid copying. `a` and `b` therefore share memory, and any change to one is reflected in the other."
},
{
"metadata": {},
"cell_type": "code",
"input": "b[0, 0] = 100\nb",
"prompt_number": 51,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 51,
"metadata": {},
"text": "array([[100, 1],\n [ 2, 3],\n [ 4, 5]])"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "a",
"prompt_number": 52,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 52,
"metadata": {},
"text": "array([100, 1, 2, 3, 4, 5])"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Whenever you need a true copy, use **.copy()**."
},
{
"metadata": {},
"cell_type": "code",
"input": "c = a.reshape((3,2)).copy()",
"prompt_number": 53,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
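{
"metadata": {},
"cell_type": "markdown",
"source": "One way to check whether two arrays share data is the `.base` attribute: a view points back at the array that owns the memory, while a copy owns its own. The cell below is a minimal sketch. The names `owner`, `view` and `copy` are made up for illustration so that they do not disturb `a`, `b` and `c` above."
},
{
"metadata": {},
"cell_type": "code",
"input": "import numpy as np\n\nowner = np.arange(6)          # fresh array that owns its data\nview = owner.reshape((3, 2))  # reshape returns a view here\ncopy = view.copy()            # independent copy\n\nprint(view.base is owner)     # True: the view borrows owner's memory\nprint(copy.base is None)      # True: the copy owns its memory\n\ncopy[0, 0] = 100              # changes only the copy\nprint(owner[0])               # still 0",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},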
{
"metadata": {},
"cell_type": "markdown",
"source": "Another feature of NumPy arrays is that operations are propagated to the individual elements. This is in contrast to normal Python lists, where `*` repeats the list."
},
{
"metadata": {},
"cell_type": "code",
"input": "print a * 2\nprint [1,2,3,4] * 2",
"prompt_number": 84,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[200 2 4 6 8 10]\n[1, 2, 3, 4, 1, 2, 3, 4]\n"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Indexing",
"level": 3
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Arrays can be accessed in several ways.\n\n- In addition to normal list indexing, we can use arrays themselves as indices."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Index a with a vector 2,3,4\na[np.array([2,3,4])]",
"prompt_number": 55,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 55,
"metadata": {},
"text": "array([2, 3, 4])"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Since conditions are propagated to the individual elements, we can access our data in interesting ways."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Return boolean mask\nprint a > 4\nmask = a > 4\n\nprint a[a>4] == a[mask]\n\n# Return the masked data, or everything but the mask (~ inverts a boolean mask)\nprint a[mask]\nprint a[~mask]",
"prompt_number": 56,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[ True False False False False True]\n[ True True]\n[100 5]\n[1 2 3 4]\n"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "You could also do things like trim outliers."
},
{
"metadata": {},
"cell_type": "code",
"input": "a[a>3] = 3\na",
"prompt_number": 57,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 57,
"metadata": {},
"text": "array([3, 1, 2, 3, 3, 3])"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "It turns out that this is pretty popular, so there is a predefined function for it.\n\n**.clip()** takes two arguments and clips the values at both ends of the interval."
},
{
"metadata": {},
"cell_type": "code",
"input": "a.clip(1,3)",
"prompt_number": 58,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 58,
"metadata": {},
"text": "array([3, 1, 2, 3, 3, 3])"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
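{
"metadata": {},
"cell_type": "markdown",
"source": "The call above leaves the values unchanged because every element of `a` already lies between 1 and 3. A small sketch with a made-up array `d` (not part of the original data) shows both ends of the interval being clipped:"
},
{
"metadata": {},
"cell_type": "code",
"input": "import numpy as np\n\nd = np.array([-5, 0, 2, 7, 10])  # hypothetical data with outliers on both sides\nprint(d.clip(1, 3))              # values below 1 become 1, values above 3 become 3\nprint(d)                         # clip returns a new array; d itself is unchanged",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},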
{
"metadata": {},
"cell_type": "heading",
"source": "Dealing with missing values",
"level": 3
},
{
"metadata": {},
"cell_type": "markdown",
"source": "One of the most common things we run into as data scientists is missing data. How we deal with that missing data is integral to the outcome and robustness of the analysis. NumPy represents missing values in floating-point arrays with the special value NaN (\"not a number\"), available as `np.nan`."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Pretend this was read from a text file\nc = np.array([1,2, np.nan, 3, 5]) \nprint c\nprint np.isnan(c)",
"prompt_number": 29,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[ 1. 2. nan 3. 5.]\n[False False True False False]\n"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "c[~np.isnan(c)] # Non-missing data",
"prompt_number": 32,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 32,
"metadata": {},
"text": "array([ 1., 2., 3., 5.])"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Filtering out NaNs is needed even in the first stages of exploratory analysis, because a single NaN propagates through aggregates such as the mean."
},
{
"metadata": {},
"cell_type": "code",
"input": "print np.mean(c)\nprint np.mean(c[~np.isnan(c)])",
"prompt_number": 36,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "nan\n2.75\n"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
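{
"metadata": {},
"cell_type": "markdown",
"source": "NumPy also ships NaN-aware aggregates such as `np.nanmean` (added in NumPy 1.8), which skip missing values without an explicit mask. A minimal sketch, reusing the same values as `c` under the made-up name `vals`:"
},
{
"metadata": {},
"cell_type": "code",
"input": "import numpy as np\n\nvals = np.array([1, 2, np.nan, 3, 5])\nprint(np.nanmean(vals))                # mean over the non-NaN entries\nprint(np.mean(vals[~np.isnan(vals)]))  # same result with an explicit mask",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},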
{
"metadata": {},
"cell_type": "heading",
"source": "Let's compare runtime between NumPy and regular Python lists",
"level": 3
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We are using NumPy for a reason, right? Here we simply calculate the sum of squares of the integers from 0 to 999 and report how long 10 000 repetitions take."
},
{
"metadata": {},
"cell_type": "code",
"input": "from timeit import timeit\n\nnormal_python = timeit('sum(x*x for x in xrange(1000))',\n number=10000)\n\n# The dot product of a vector with itself equals its sum of squares\nnumpy_python = timeit('x.dot(x)',\n setup='import numpy as np; x = np.arange(1000)',\n number=10000)\n\nprint(\"Normal Python: {} seconds\".format(normal_python))\nprint(\"Numpy: {} seconds\".format(numpy_python))",
"prompt_number": 74,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Normal Python: 0.676285982132 seconds\nNumpy: 0.0202009677887 seconds\n"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "However, we have to be careful, because simply using NumPy does not guarantee efficiency."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "If we don't take advantage of NumPy's optimized code, we can end up slower than plain Python. We should always look for the optimized, or vectorized, versions, which operate on the entire matrix or array at once rather than looping over the elements in Python, which is what happens when Python's built-in `sum` is applied to a NumPy array."
},
{
"metadata": {},
"cell_type": "code",
"input": "dumb_numpy = timeit('sum(x*x)',\n setup='import numpy as np; x = np.arange(1000)',\n number=10000)\nprint(\"Dumb numpy: {} seconds\".format(dumb_numpy))",
"prompt_number": 76,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Dumb numpy: 3.61384701729 seconds\n"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
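{
"metadata": {},
"cell_type": "markdown",
"source": "The slow part above is Python's built-in `sum`, which walks the array one element at a time. Swapping it for `np.sum` keeps the whole computation inside NumPy. The sketch below only checks that the three approaches agree. It makes no timing claim, since timings depend on the machine and the NumPy version."
},
{
"metadata": {},
"cell_type": "code",
"input": "import numpy as np\n\nx = np.arange(1000)\n\ndot_result = x.dot(x)          # dot product of x with itself\nnp_sum_result = np.sum(x * x)  # vectorized multiply, vectorized sum\npy_result = sum(v * v for v in range(1000))  # pure-Python reference\n\nprint(dot_result == np_sum_result == py_result)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},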
{
"metadata": {},
"cell_type": "markdown",
"source": "For the sake of speed, we sacrifice some of the flexibility of Python lists. In NumPy we can only store a single data type in an array, whereas a list can hold pretty much anything.\n\nKeep in mind that sometimes a list could be better suited to your problem than a NumPy array."
},
{
"metadata": {},
"cell_type": "code",
"input": "a.dtype",
"prompt_number": 77,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 77,
"metadata": {},
"text": "dtype('int64')"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "When we try to use different data types in the same array, NumPy will try to coerce them into a common format. For example, if we combine strings and integers, NumPy will convert the numeric values into strings."
},
{
"metadata": {},
"cell_type": "code",
"input": "np.array([1,'string'])",
"prompt_number": 78,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 78,
"metadata": {},
"text": "array(['1', 'string'], \n dtype='|S6')"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
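{
"metadata": {},
"cell_type": "markdown",
"source": "The same coercion happens between numeric types, and `dtype=object` is an escape hatch when mixed Python objects really must live in one array, at the cost of much of NumPy's speed. A small sketch with made-up values (the name `mixed` is only for illustration):"
},
{
"metadata": {},
"cell_type": "code",
"input": "import numpy as np\n\nprint(np.array([1, 2.5]).dtype)  # int and float are upcast to a common float dtype\n\nmixed = np.array([1, 'string'], dtype=object)\nprint(type(mixed[0]))            # elements keep their original Python types\nprint(type(mixed[1]))",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},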
{
"metadata": {},
"cell_type": "code",
"input": "from IPython.core.display import HTML\n\n\ndef css_styling():\n styles = open(\"/users/ryankelly/desktop/custom_notebook.css\", \"r\").read()\n return HTML(styles)\ncss_styling()",
"prompt_number": 2,
"outputs": [
{
"text": "<IPython.core.display.HTML at 0x105c7e850>",
"html": "\n<style>\nbody {\n font-family: Century Gothic, sans;\n\n}\n\n\ndiv.text_cell_render h1 { /* Main titles bigger, centered */\nfont-size: 2.2em;\nline-height:1.4em;\ntext-align:center;\n}\n\n/*Input and output cells formatting*/\ndiv.prompt.input_prompt, div.prompt.output_prompt {\n visibility: hidden;\n /*font-family: Consolas;*/\n color: #575748;\n /*background-color: #CCCCCC;*/\n border: 0px;\n width: 6.5em;\n float:left;\n}\n\n\ndiv.output_subarea.output_text.output_stream.output_stdout,div.output_subarea.output_text {\n margin-left: 1.5em;\n padding-top: 1em;\n padding-bottom: 0.5em;\n margin-top: 8px; /*This is for getting the box-shadow property of the parent to display properly;*/\n}\n\ndiv.cell { /* Tunes the space between cells */\nmargin-top:1em;\nmargin-bottom:1em;\nwidth:100%;\nmargin-right:auto;\noverflow-x:hidden;\n}\n\ndiv.text_cell_render{\n overflow-x:hidden;\n \n}\n\n\ndiv.input{\nmargin-right:1%;\n}\n\n</style>\n \n\n\n\n",
"output_type": "pyout",
"metadata": {},
"prompt_number": 2
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "def social():\n code = \"\"\"\n <a style='float:left; margin-right:5px;' href=\"https://twitter.com/share\" class=\"twitter-share-button\" data-text=\"Check this out\" data-via=\"Ryanmdk\">Tweet</a>\n<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>\n <a style='float:left; margin-right:5px;' href=\"https://twitter.com/Ryanmdk\" class=\"twitter-follow-button\" data-show-count=\"false\">Follow @Ryanmdk</a>\n<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>\n <a style='float:left; margin-right:5px;'target='_parent' href=\"http://www.reddit.com/submit\" onclick=\"window.location = 'http://www.reddit.com/submit?url=' + encodeURIComponent(window.location); return false\"> <img src=\"http://www.reddit.com/static/spreddit7.gif\" alt=\"submit to reddit\" border=\"0\" /> </a>\n<script src=\"//platform.linkedin.com/in.js\" type=\"text/javascript\">\n lang: en_US\n</script>\n<script type=\"IN/Share\"></script>\n\"\"\"\n return HTML(code)",
"prompt_number": 3,
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"gist_id": "3a13c1e21be788efa5dc",
"name": "",
"signature": "sha256:6fb1b703afb8fd6422fb85fec7cc45e56fcee9034325935fe216ab406be40237"
},
"nbformat": 3
}