{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# <center>Introduction to Spark MLlib with Python</center>\n## <center>Data Types</center>\n### <center>July 20,2016</center>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<img src = \"https://ibm.box.com/shared/static/wfbduwkbx22nx3i2psbp9g27s2p9s86v.png\", width=\"500\" align = 'center'>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## <b>Welcome to the first lab in the course, Introduction to Spark MLlib with Python.</b>\n### <b>Spark has many libraries, namely under MLlib (Machine Learning Library)! Spark allows for quick and easy scalability of practical machine learning!</b>\n\nIn this lab exercise, you will learn about the basic Data Types that are used in Spark MLlib. This lab will help you develop the building blocks required to continue developing knowledge in machine learning with Spark."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Some Notebook Commands\n#### In case you haven't dealt with a Jupyter Notebook before, here are some quick, useful commands that may be handy to get started.\n<ul>\n <li>Run a cell: CTRL + ENTER</li>\n <li>Create a cell above a cell: a</li>\n <li>Create a cell below a cell: b</li>\n <li>Change a cell to Markdown: m</li>\n \n <li>Change a cell to code: y</li>\n</ul>\n\n<b> If you are interested in more keyboard shortcuts, go to Help -> Keyboard Shortcuts </b>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Dense and Sparse Vectors"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Import the following libraries: <br>\n<ul>\n <li> numpy as np </li>\n <li> scipy.sparse as sps </li>\n <li> Vectors from pyspark.mllib.linalg </li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": "import numpy as np\nimport scipy.sparse as sps\nfrom pyspark.mllib.linalg import Vector, Vectors"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First, we will be dealing with <b>Dense Vectors</b>. There are 2 types of <b>dense vectors</b> that we can create.<br>\nThe dense vectors will be modeled having the values: <b>8.0, 312.0, -9.0, 1.3</b>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The first <b>dense vector</b> we will create is as easy as creating a <b>numpy array</b>. <br>\nUsing the np.array function, create a <b>dense vector</b> called <b>dense_vector1</b> <br> <br>\nNote: numpy's array function takes an array as input"
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": "dense_vector1 = np.array([8.0, 312.0, -9.0, 1.3])"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print <b>dense_vector1</b> and its <b>type</b>"
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(array([ 8. , 312. , -9. , 1.3]), numpy.ndarray)"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "dense_vector1, type(dense_vector1)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The second <b>dense vector</b> is easier than the first, and is made by creating an <b>array</b>. <br>\nCreate a <b>dense vector</b> called <b>dense_vector2</b>"
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": "dense_vector2 = [8.0, 312.0, -9.0, 1.3]"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print <b>dense_vector2</b> and its <b>type</b>"
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "([8.0, 312.0, -9.0, 1.3], list)"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "dense_vector2, type(dense_vector2)\n"
},
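{
"cell_type": "markdown",
"metadata": {},
"source": "For reference, MLlib also ships its own dense vector type: the <b>Vectors.dense</b> factory (from the <b>Vectors</b> class we imported above) returns a <b>DenseVector</b>. A minimal sketch with the same values (the name <b>dense_vector3</b> is just for illustration):"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# MLlib's native dense vector, built from the same values as above\ndense_vector3 = Vectors.dense([8.0, 312.0, -9.0, 1.3])\ndense_vector3, type(dense_vector3)"
},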
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, we will be dealing with <b>sparse vectors</b>. There are 2 types of <b>sparse vectors</b> we can create. <br>\nThe sparse vectors we will be creating will follow these values: <b> 7.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 6.5 </b>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First, create a <b>sparse vector</b> called <b>sparse_vector1</b> using Vector's <b>sparse</b> function. <br>\nInputs to Vector.sparse: <br>\n<ul>\n <li>1st: Size of the sparse vector</li>\n <li>2nd: Indicies of array</li>\n <li>3rd: Values placed where the indices are</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": "sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])"
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "3"
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sv1.size"
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": "# Use a single-column SciPy csc_matrix as a sparse vector.\nsv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape = (3, 1))"
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "<3x1 sparse matrix of type '<class 'numpy.float64'>'\n\twith 2 stored elements in Compressed Sparse Column format>"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sv2"
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "SparseVector(4, {1: 1.0, 3: 5.5})"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "Vectors.sparse(4, [(1, 1.0), (3, 5.5)])"
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "array([7. , 2. , 1. , 6.5])"
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "Vectors.sparse(10, [(0, 7.0), (3, 2.0), (5, 1.0), (9, 6.5)]).values"
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": "sparse_vector1 = Vectors.sparse(10, [(0, 7.0), (3, 2.0), (5, 1.0), (9, 6.5)])"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print <b>sparse_vector1</b> and its <b>type</b>"
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(SparseVector(10, {0: 7.0, 3: 2.0, 5: 1.0, 9: 6.5}),\n pyspark.mllib.linalg.SparseVector)"
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sparse_vector1, type(sparse_vector1)"
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(SparseVector(10, {0: 7.0, 3: 2.0, 5: 1.0, 9: 6.5}),\n pyspark.mllib.linalg.SparseVector)"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sparse_vector_11 = Vectors.sparse(10, [(0, 7.0), (3, 2.0), (5, 1.0), (9, 6.5)])\nsparse_vector_11, type(sparse_vector_11)"
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "10"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sparse_vector_11.size"
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "10"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sparse_vector1.size"
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(SparseVector(10, {0: 7.0, 3: 2.0, 5: 1.0, 9: 6.5}),\n pyspark.mllib.linalg.SparseVector)"
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sparse_vector_12 = Vectors.sparse(10, [0, 3, 5, 9], [7. , 2. , 1. , 6.5])\nsparse_vector_12, type(sparse_vector_12)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next we will create a <b>sparse vector</b> called <i>sparse_vector2</i> using a single-column SciPy <b>csc_matrix</b> <br> <br>\nThe inputs to sps.csc_matrix are: <br>\n<ul>\n <li>1st: A tuple consisting of the three inputs:</li>\n <ul>\n <li>1st: Data Values (in a numpy array) (values placed at the specified indices)</li>\n <li>2nd: Indicies of the array (in a numpy array) (where the values will be placed)</li>\n <li>3rd: Index pointer of the array (in a numpy array)</li>\n </ul>\n <li>2nd: Shape of the array (#rows, #columns) Use 10 rows and 1 column</li>\n <ul>\n <li>shape = (\\_,\\_)</li>\n </ul>\n</ul> <br>\nNote: You may get a deprecation warning. Please Ignore it."
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": "sparse_vector2 = sps.csc_matrix((np.array([7. , 2. , 1. , 6.5]), np.array([0, 3, 5, 9]), np.array([0, 1])), shape = (10, 1))"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print <b>sparse_vector2</b> and its <b>type</b>"
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(<10x1 sparse matrix of type '<class 'numpy.float64'>'\n \twith 1 stored elements in Compressed Sparse Column format>,\n scipy.sparse.csc.csc_matrix)"
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sparse_vector2, type(sparse_vector2)"
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(10, 1)"
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sparse_vector2.shape"
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "array([0], dtype=int32)"
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "sparse_vector2.indices #dot([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Labeled Points"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "So the next data type will be Labeled points. Remember that this data type is mainly used for classification algorithms in supervised learning.<br>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Start by importing the following libraries: <br>\n<ul>\n <li>SparseVector from pyspark.mllib.linalg</li>\n <li>LabeledPoint from pyspark.mllib.regression</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": "from pyspark.mllib.linalg import SparseVector \nfrom pyspark.mllib.regression import LabeledPoint "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Remember that with a lableled point, we can create binary or multiclass classification. In this lab, we will deal with binary classification for ease. <br> <br>\nThe <b>LabeledPoint</b> function takes in 2 inputs:\n<ul>\n <li>1st: Label of the Point. In this case (for binary classification), we will be using <font color=\"green\">1.0</font> for <font color=\"green\">positive</font> and <font color=\"red\">0.0</font> for <font color=\"red\">negative</font></li>\n <li>2nd: Vector of features for the point (We will input a Dense or Sparse Vector using any of the methods defined in the <b>Dense and Sparse Vectors</b> section of this lab.</b>\n</ul>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Using the LabelPoint class, create a <b>dense</b> feature vector with a <b>positive</b> label called <b>pos_class</b> with the values: <b>5.0, 2.0, 1.0, 9.0</b>"
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": "pos_class = LabeledPoint(1.0, [5.0, 2.0, 1.0, 9.0])"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print <b>pos_class</b> and its <b>type</b>"
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(LabeledPoint(1.0, [5.0,2.0,1.0,9.0]), pyspark.mllib.regression.LabeledPoint)"
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "pos_class, type(pos_class)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next we will create a <b>sparse</b> feature vector with a <b>negative</b> label called <b>neg_class</b> with the values: <b>1.0, 0.0, 0.0, 4.0, 0.0, 2.0</b>"
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": "neg_class = LabeledPoint(0.0, Vectors.sparse(6, [0, 3, 5], [1. , 4. , 2.]))"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print <b>neg_class</b> and its <b>type</b>"
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(LabeledPoint(0.0, (6,[0,3,5],[1.0,4.0,2.0])),\n pyspark.mllib.regression.LabeledPoint)"
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "neg_class, type(neg_class)"
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "SparseVector(6, {0: 1.0, 3: 4.0, 5: 2.0})"
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "neg_class.features"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "---\n## Matrix Data Types\nIn this next section, we will be dealing creating the following matrices:\n<ul>\n <li>Local Matrix</li>\n <li>Row Matrix</li>\n <li>Indexed Row Matrix</li>\n <li>Coordinate Matrix</li>\n <li>Block Matrix</li>\n</ul> <br> <br>\n\nThroughout this section, we will be modelling the following matricies: <br>\n<center>For a Dense Matrix:</center> <br> <img src=\"http://imgur.com/O8a4ZS0.png\", align=\"center\"> </img> <br>\n<center>For a Sparse Matrix:</center> <br> <img src=\"http://imgur.com/k0XwOfA.png\", align=\"center\"> </img>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Local Matrix"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Import the following Library:\n<ul>\n <li>pyspark.mllib.linalg as laMat</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": "import pyspark.mllib.linalg as laMat "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a dense local matrix called <b>dense_LM</b> <br>\nThe inputs into the <b>laMat.Matrices.dense</b> function are:\n<ul>\n <li>1st: Number of Rows</li>\n <li>2nd: Number of Columns</li>\n <li>3rd: Values in an array format (Read as Column-Major)</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": "dense_LM = laMat.Matrices.dense(3, 2, [1., 2., 3., 4., 5., 6.])"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print <b>dense_LM</b> and its <b>type</b>"
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(DenseMatrix(3, 2, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], False),\n pyspark.mllib.linalg.DenseMatrix)"
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "dense_LM, type(dense_LM)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next we will do the same thing with a sparse matrix, calling the output <b>sparse_LM</b>\nThe inputs into the <b>laMat.Matrices.sparse</b> function are:\n<ul>\n <li>1st: Number of Rows</li>\n <li>2nd: Number of Columns</li>\n <li>3rd: Column Pointers (in a list)</li>\n <li>4th: Row Indices (in a list)</li>\n <li>5th: Values of the Matrix (in a list)</li>\n</ul> <br>\n<b>Note</b>: Remember that this is <b>column-major</b> so all arrays should be read as columns first (top down, left to right)"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
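{
"cell_type": "markdown",
"metadata": {},
"source": "If you get stuck, here is one possible sketch. Because the entries of the <b>Sparse Matrix</b> figure are not reproduced in this text, the values below are illustrative only; substitute the ones from the figure:"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Illustrative 3x4 sparse local matrix (swap in the values from the figure).\n# Column j's stored values sit between colPtrs[j] and colPtrs[j+1].\nsparse_LM = laMat.Matrices.sparse(3, 4, [0, 1, 2, 3, 4], [0, 2, 1, 0], [1.0, 4.0, 5.0, 2.0])\nsparse_LM, type(sparse_LM)"
},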
{
"cell_type": "markdown",
"metadata": {},
"source": "Print <b>sparse_LM</b> and its <b>type</b>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Make sure the output of <b>sparse_LM</b> matches the original matrix."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## RowMatrix"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "A RowMatrix is a Row-oriented distributed matrix that doesn't have meaningful row indices."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Import the following library:\n<ul>\n <li>RowMatrix from pyspark.mllib.linalg.distributed</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now, let's create a RDD of vectors called <b>rowVecs</b>, using the SparkContext's parallelize function on the <b>Dense Matrix</b>.<br>\nThe input into <b>sc.parallelize</b> is:\n<ul>\n <li>A list (The list we will be creating will be a list of the row values (each row is a list))</li>\n</ul> <br>\n<b>Note</b>: And RDD is a fault-tolerated collection of elements that can be operated on in parallel. <br>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next, create a variable called <b>rowMat</b> by using the <b>RowMatrix</b> function and passing in the RDD."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we will retrieve the <font color=\"green\">row numbers</font> (save it as <font color=\"green\">m</font>) and <font color=\"blue\">column numbers</font> (save it as <font color=\"blue\">n</font>) from the RowMatrix.\n<ul>\n <li>To get the number of rows, use <i>numRows()</i> on rowMat</li>\n <li>To get the number of columns, use <i>numCols()</i> on rowMat</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print out <b>m</b> and <b>n</b>. The results should be:\n<ul>\n <li>Number of Rows: 3</li>\n <li>Number of Columns: 4</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
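{
"cell_type": "markdown",
"metadata": {},
"source": "If you get stuck, here is one possible solution sketch for this subsection. It assumes the lab environment provides a SparkContext named <b>sc</b>; since the <b>Dense Matrix</b> figure is not reproduced in this text, the rows below are reconstructed from the sub-matrix blocks listed later in the <b>BlockMatrix</b> section."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Possible solution (assumes a SparkContext `sc`; matrix values reconstructed\n# from the BlockMatrix section of this lab)\nfrom pyspark.mllib.linalg.distributed import RowMatrix\n\n# Each inner list is one row of the Dense Matrix\nrowVecs = sc.parallelize([[1.0, 6.0, 3.0, 0.0],\n                          [3.0, 2.0, 5.0, 1.0],\n                          [9.0, 4.0, 0.0, 3.0]])\nrowMat = RowMatrix(rowVecs)\nm = rowMat.numRows()\nn = rowMat.numCols()\nprint(m, n)  # expected: 3 4"
},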
{
"cell_type": "markdown",
"metadata": {},
"source": "## IndexedRowMatrix"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Since we just created a RowMatrix, which had no meaningful row indicies, let's create an <b>IndexedRowMatrix</b> which has meaningful row indices!"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Import the following Library:\n<ul>\n <li> IndexedRow, IndexedRowMatrix from pyspark.mllib.linalg.distributed</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now, create a RDD called <b>indRows</b> by using the SparkContext's parallelize function on the <b>Dense Matrix</b>. <br>\nThere are two different inputs you can use to create the RDD:\n<ul>\n <li>Method 1: A list containing multiple IndexedRow inputs</li>\n <ul>\n <li>Input into IndexedRow:</li>\n <ul>\n <li>1. Index for the given row (row number)</li>\n <li>2. row in the matrix for the given index</li>\n </ul>\n <li>ex. sc.parallelize([IndexedRow(0,[1, 2, 3]), ...])</li>\n </ul> <br>\n <li>Method 2: A list containing multiple tuples</li>\n <ul>\n <li>Values in the tuple:</li>\n <ul>\n <li>1. Index for the given row (row number) (type:long)</li>\n <li>2. List containing the values in the row for the given index (type:vector)</li>\n </ul>\n <li>ex. sc.parallelize([(0, [1, 2, 3]), ...])</li>\n </ul>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now, create the <b>IndexedRowMatrix</b> called <b>indRowMat</b> by using the IndexedRowMatrix function and passing in the <b>indRows</b> RDD"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we will retrieve the <font color=\"green\">row numbers</font> (save it as <font color=\"green\">m2</font>) and <font color=\"blue\">column numbers</font> (save it as <font color=\"blue\">n2</font>) from the IndexedRowMatrix.\n<ul>\n <li>To get the number of rows, use <i>numRows()</i> on indRowMat</li>\n <li>To get the number of columns, use <i>numCols()</i> on indRowMat</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print out <b>m2</b> and <b>n2</b>. The results should be:\n<ul>\n <li>Number of Rows: 3</li>\n <li>Number of Columns: 4</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": ""
},
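{
"cell_type": "markdown",
"metadata": {},
"source": "Again, a possible solution sketch under the same assumptions (a live <b>sc</b> and the reconstructed 3x4 matrix):"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Possible solution sketch (assumes `sc` and the same 3x4 matrix as above)\nfrom pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix\n\n# Method 1: wrap each row in an IndexedRow(rowIndex, rowValues)\nindRows = sc.parallelize([IndexedRow(0, [1.0, 6.0, 3.0, 0.0]),\n                          IndexedRow(1, [3.0, 2.0, 5.0, 1.0]),\n                          IndexedRow(2, [9.0, 4.0, 0.0, 3.0])])\nindRowMat = IndexedRowMatrix(indRows)\nm2 = indRowMat.numRows()\nn2 = indRowMat.numCols()\nprint(m2, n2)  # expected: 3 4"
},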
{
"cell_type": "markdown",
"metadata": {},
"source": "## CoordinateMatrix"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now it's time to create a different type of matrix, whos use should be when both the dimensions of the matrix is very large, and the data in the matrix is sparse. <br>\n<b>Note</b>: In this case, we will be using the small, sparse matrix above, just to get the idea of how to initialize a CoordinateMatrix"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Import the following libraries:\n<ul>\n <li>CoordinateMatrix, MatrixEntry from pyspark.mllib.linalg.distributed</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now, create a RDD called <b>coordRows</b> by using the SparkContext's parallelize function on the <b>Sparse Matrix</b>. There are two different inputs you can use to create the RDD:\n<ul>\n <li>Method 1: A list containing multiple MatrixEntry inputs</li>\n <ul>\n <li>Input into MatrixEntry:</li>\n <ul>\n <li>1. Row index of the matrix (row number) (type: long)</li>\n <li>2. Column index of the matrix (column number) (type: long)</li>\n <li>3. Value at the (Row Index, Column Index) entry of the matrix (type: float)</li>\n </ul>\n <li>ex. sc.parallelize([MatrixEntry(0, 0, 1,), ...])</li>\n </ul> <br>\n <li>Method 2: A list containing multiple tuples</li>\n <ul>\n <li>Values in the tuple:</li>\n <ul>\n <li>1. Row index of the matrix (row number) (type: long)</li>\n <li>2. Column index of the matrix (column number) (type: long)</li>\n <li>3. Value at the (Row Index, Column Index) entry of the matrix (type: float)</li>\n </ul>\n <li>ex. sc.parallelize([(0, 0, 1), ...])</li>\n </ul>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now, create the <b>CoordinateMatrix</b> called <b>coordMat</b> by using the CoordinateMatrix function and passing in the <b>coordRows</b> RDD"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we will retrieve the <font color=\"green\">row numbers</font> (save it as <font color=\"green\">m3</font>) and <font color=\"blue\">column numbers</font> (save it as <font color=\"blue\">n3</font>) from the CoordinateMatrix.\n<ul>\n <li>To get the number of rows, use <i>numRows()</i> on coordMat</li>\n <li>To get the number of columns, use <i>numCols()</i> on coordMat</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print out <b>m3</b> and <b>n3</b>. The results should be:\n<ul>\n <li>Number of Rows: 3</li>\n <li>Number of Columns: 4</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now, we can get the <b>entries</b> of coordMat by calling the entries method on it. Store this in a variable called coordEnt."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Check out the <i>type</i> of coordEnt."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It should be a <b>PipelinedRDD</b> type, which has many methods that are associated with it. One of them is <b>first()</b>, which will get the first element in the RDD. <br> <br>\n\nRun coordEnt.first()"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
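{
"cell_type": "markdown",
"metadata": {},
"source": "A possible solution sketch for this subsection. The entries below are illustrative stand-ins for the <b>Sparse Matrix</b> figure (which is not reproduced in this text); they are chosen so the matrix still comes out 3x4:"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Possible solution sketch (assumes `sc`; entries are illustrative -- substitute\n# the values from the Sparse Matrix figure)\nfrom pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry\n\ncoordRows = sc.parallelize([MatrixEntry(0, 0, 1.0),\n                            MatrixEntry(0, 3, 2.0),\n                            MatrixEntry(1, 2, 5.0),\n                            MatrixEntry(2, 1, 4.0)])\ncoordMat = CoordinateMatrix(coordRows)\nm3 = coordMat.numRows()\nn3 = coordMat.numCols()\nprint(m3, n3)  # expected: 3 4\n\ncoordEnt = coordMat.entries\nprint(type(coordEnt))   # a (Pipelined)RDD of MatrixEntry objects\nprint(coordEnt.first()) # e.g. MatrixEntry(0, 0, 1.0)"
},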
{
"cell_type": "markdown",
"metadata": {},
"source": "## BlockMatrix"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "A BlockMatrix is essentially a matrix consisting of elements which are partitions of the matrix that is being created."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Import the following libraries:\n<ul>\n <li>Matrices from pyspark.mllib.linalg</li>\n <li>BlockMatrix from pyspark.mllib.linalg.distributed</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now create a <b>RDD</b> of <b>sub-matrix blocks</b>. <br>\nThis will be done using SparkContext's parallelize function. <br>\n\nThe input into <b>sc.parallelize</b> requires a <b>list of tuples</b>. The tuples are the sub-matrices, which consist of two inputs:\n<ul>\n <li>1st: A tuple containing the row index and column index (row, column), denoting where the sub-matrix will start</li>\n <li>2nd: The sub-matrix, which will come from <b>Matrices.dense</b>. The sub-matrix requires 3 inputs:</li>\n <ul>\n <li>1st: Number of rows</li>\n <li>2nd: Number of columns</li>\n <li>3rd: A list containing the elements of the sub-matrix. These values are read into the sub-matrix column-major fashion</li>\n </ul>\n</ul> <br>\n(ex. ((51, 2), Matrices.dense(2, 2, [61.0, 43.0, 1.0, 74.0])) would be one row (one tuple))."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The matrix we will be modelling is <b>Dense Matrix</b> from above. Create the following sub-matrices:\n<ul>\n <li>Row: 0, Column: 0, Values: 1.0, 3.0, 6.0, 2.0, with 2 Rows and 2 Columns </li>\n <li>Row: 2, Column: 0, Values: 9.0, 4.0, with 1 Row and 2 Columns</li>\n <li>Row: 0, Column: 2, Values: 3.0, 5.0, 0.0, 0.0, 1.0, 3.0, with 3 Rows and 2 Columns</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now that we have the RDD, it's time to create the BlockMatrix called <b>blockMat</b> using the BlockMatrix class. The <b>BlockMatrix</b> class requires 3 inputs:\n<ul>\n <li>1st: The RDD of sub-matricies</li>\n <li>2nd: The rows per block. Keep this value at 1</li>\n <li>3rd: The columns per block. Keep this value at 1</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we will retrieve the <font color=\"green\">row numbers</font> (save it as <font color=\"green\">m4</font>) and <font color=\"blue\">column numbers</font> (save it as <font color=\"blue\">n4</font>) from the BlockMatrix.\n<ul>\n <li>To get the number of rows, use <i>numRows()</i> on blockMat</li>\n <li>To get the number of columns, use <i>numCols()</i> on blockMat</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Print out <b>m4</b> and <b>n4</b>. The results should be:\n<ul>\n <li>Number of Rows: 3</li>\n <li>Number of Columns: 4</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now, we need to check if our matrix is correct. We can do this by first converting <b>blockMat</b> into a LocalMatrix, by using the <b>.toLocalMatrix()</b> function on our matrix. Store the result into a variable called <b>locBMat</b>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now print out <b>locBMat</b> and its <b>type</b>. The result should model the original <b>Dense Matrix</b> and the type should be a DenseMatrix."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
},
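{
"cell_type": "markdown",
"metadata": {},
"source": "A possible solution sketch for this subsection (assumes <b>sc</b>). The block values come straight from the instructions above, so <b>toLocalMatrix()</b> should reproduce the <b>Dense Matrix</b>:"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Possible solution sketch (assumes `sc`; block values taken from the instructions)\nfrom pyspark.mllib.linalg import Matrices\nfrom pyspark.mllib.linalg.distributed import BlockMatrix\n\nblocks = sc.parallelize([((0, 0), Matrices.dense(2, 2, [1.0, 3.0, 6.0, 2.0])),\n                         ((2, 0), Matrices.dense(1, 2, [9.0, 4.0])),\n                         ((0, 2), Matrices.dense(3, 2, [3.0, 5.0, 0.0, 0.0, 1.0, 3.0]))])\nblockMat = BlockMatrix(blocks, 1, 1)\nm4 = blockMat.numRows()\nn4 = blockMat.numCols()\nprint(m4, n4)  # expected: 3 4\n\nlocBMat = blockMat.toLocalMatrix()\nprint(locBMat, type(locBMat))  # should match the Dense Matrix, as a DenseMatrix"
},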
{
"cell_type": "markdown",
"metadata": {},
"source": "## Bonus - Matrix Conversions"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In this bonus section, we will talk about a relationship between the different types of matrices. You can convert between these matrices that we discussed with the following functions. <br>\n<ul>\n <li>.toRowMatrix() converts the matrix to a RowMatrix</li>\n <li>.toIndexedRowMatrix() converts the matrix to an IndexedRowMatrix</li>\n <li>.toCoordinateMatrix() converts the matrix to a CoordinateMatrix</li>\n <li>.toBlockMatrix() converts the matrix to a BlockMatrix</li>\n</ul>"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### IndexedRowMatrix Conversions"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The following conversions are supported for an IndexedRowMatrix:\n<ul>\n <li>IndexedRowMatrix -> RowMatrix</li>\n <li>IndexedRowMatrix -> CoordinateMatrix</li>\n <li>IndexedRowMatrix -> BlockMatrix</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": ""
},
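{
"cell_type": "markdown",
"metadata": {},
"source": "A sketch of these conversions, reusing <b>indRowMat</b> from above (the result variable names are just for illustration):"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# IndexedRowMatrix conversions (reuses indRowMat from the IndexedRowMatrix section)\nrowMat2 = indRowMat.toRowMatrix()\ncoordMat2 = indRowMat.toCoordinateMatrix()\nblockMat2 = indRowMat.toBlockMatrix()\nprint(type(rowMat2), type(coordMat2), type(blockMat2))"
},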
{
"cell_type": "markdown",
"metadata": {},
"source": "### CoordinateMatrix Conversions"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The following conversions are supported for an CoordinateMatrix:\n<ul>\n <li>CoordinateMatrix -> RowMatrix</li>\n <li>CoordinateMatrix -> IndexedRowMatrix</li>\n <li>CoordinateMatrix -> BlockMatrix</li>\n</ul>"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": ""
},
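{
"cell_type": "markdown",
"metadata": {},
"source": "A sketch of these conversions, reusing <b>coordMat</b> from above:"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# CoordinateMatrix conversions (reuses coordMat from the CoordinateMatrix section)\nrowMat3 = coordMat.toRowMatrix()\nindRowMat3 = coordMat.toIndexedRowMatrix()\nblockMat3 = coordMat.toBlockMatrix()\nprint(type(rowMat3), type(indRowMat3), type(blockMat3))"
},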
{
"cell_type": "markdown",
"metadata": {},
"source": "### BlockMatrix Conversions"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The following conversions are supported for an BlockMatrix:\n<ul>\n <li>BlockMatrix -> LocalMatrix (Can display the Matrix)</li>\n <li>BlockMatrix -> IndexedRowMatrix</li>\n <li>BlockMatrix -> CoordinateMatrix</li>\n</ul>"
},
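{
"cell_type": "markdown",
"metadata": {},
"source": "A sketch of these conversions, reusing <b>blockMat</b> from above (try your own version in the empty cell below):"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# BlockMatrix conversions (reuses blockMat from the BlockMatrix section)\nlocMat = blockMat.toLocalMatrix()  # a local DenseMatrix, so it can be printed directly\nindRowMat4 = blockMat.toIndexedRowMatrix()\ncoordMat4 = blockMat.toCoordinateMatrix()\nprint(locMat)\nprint(type(indRowMat4), type(coordMat4))"
},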
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": ""
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 with Spark",
"language": "python3",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}