mrbkdad/sparkml_mllib_01.ipynb

## sparkml_mllib_01.ipynb
{
 "cells": [
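  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear algebra operations in Spark\n",
    "- Breeze, jblas: linear algebra libraries for local (single-machine) computation\n",
    "- Linear algebra operations in a distributed environment are implemented by Spark itself\n",
    "- org.apache.spark.mllib.linalg: local vectors and local matrices"
   ]
  },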
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating local vectors\n",
    "- DenseVector, SparseVector\n",
    "- Creation: use the dense and sparse methods of the Vectors object\n",
    "    - dense: pass every element value (inline arguments or an array)\n",
    "    - sparse: vector size, index array, value array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import org.apache.spark.mllib.linalg.{Vectors, Vector}\n",
    "val dv1 = Vectors.dense(5.0,6.0,7.0,8.0)\n",
    "val dv2 = Vectors.dense(Array(5.0,6.0,7.0,8.0))\n",
    "val sv = Vectors.sparse(4,Array(0,1,3),Array(5.0,6.0,8.0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "class org.apache.spark.mllib.linalg.DenseVector"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dv2.getClass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "class org.apache.spark.mllib.linalg.SparseVector"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sv.getClass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dv2(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dv1.size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Array(5.0, 6.0, 7.0, 8.0)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dv2.toArray"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Array(5.0, 6.0, 0.0, 8.0)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sv.toArray"
   ]
  },
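  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear algebra operations on local vectors\n",
    "- The Breeze library can be used (Spark itself uses Breeze internally)\n",
    "- Spark vectors must first be converted to Breeze vector classes\n",
    "- Watch out for name clashes when importing the Breeze vector classes"
   ]
  },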
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}\n",
    "import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def toBreezeV(v:Vector) :BV[Double] = v match {\n",
    "    case dv:DenseVector => new BDV(dv.values)\n",
    "    case sv:SparseVector => new BSV(sv.indices, sv.values, sv.size)\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DenseVector(10.0, 12.0, 14.0, 16.0)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "toBreezeV(dv1) + toBreezeV(dv2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "174.0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "toBreezeV(dv1).dot(toBreezeV(dv2))"
   ]
  },
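  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some vector quantities can be computed without converting to Breeze. The cell below is a minimal sketch, assuming a Spark version in which `Vectors.norm` and `Vectors.sqdist` are available (both belong to `org.apache.spark.mllib.linalg.Vectors`). The names `a`, `b`, `l2` and `dist` are illustrative and not part of the original notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import org.apache.spark.mllib.linalg.Vectors\n",
    "\n",
    "// Illustrative vectors with the same values as dv1 and sv above\n",
    "val a = Vectors.dense(5.0, 6.0, 7.0, 8.0)\n",
    "val b = Vectors.sparse(4, Array(0, 1, 3), Array(5.0, 6.0, 8.0))\n",
    "\n",
    "// L2 (Euclidean) norm of the dense vector: sqrt(25 + 36 + 49 + 64)\n",
    "val l2 = Vectors.norm(a, 2.0)\n",
    "\n",
    "// Squared Euclidean distance between a dense and a sparse vector\n",
    "// squared differences: 0 + 0 + 49 + 0 = 49.0\n",
    "val dist = Vectors.sqdist(a, b)"
   ]
  },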
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating local matrices\n",
    "- Use the dense or sparse methods of the Matrices object\n",
    "    - dense: number of rows, number of columns, data array (filled column by column, starting from the leftmost column)\n",
    "    - eye(n): identity matrix, speye(n): sparse identity matrix\n",
    "    - ones(n,m): matrix of ones, zeros(n,m): matrix of zeros\n",
    "    - diag(Vector): diagonal matrix\n",
    "    - rand, randn: random matrix; takes the number of rows and columns and a java.util.Random object\n",
    "    - sprand, sprandn: sparse random matrix\n",
    "    - sparse: number of rows and columns, column pointers, row indices, values (CSC format)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix, Matrix, Matrices}\n",
    "import breeze.linalg.{DenseMatrix => BDM, CSCMatrix => BSM, Matrix => BM}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.0  0.0  1.0\n",
       "0.0  3.0  4.0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val dm = Matrices.dense(2,3,Array(5.0,0.0,0.0,3.0,1.0,4.0))\n",
    "dm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0  0.0  0.0\n",
       "0.0  1.0  0.0\n",
       "0.0  0.0  1.0"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Matrices.eye(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3 x 3 CSCMatrix\n",
       "(0,0) 1.0\n",
       "(1,1) 1.0\n",
       "(2,2) 1.0"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Matrices.speye(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0  1.0  1.0  1.0\n",
       "1.0  1.0  1.0  1.0\n",
       "1.0  1.0  1.0  1.0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Matrices.ones(3,4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.0  0.0  0.0  0.0\n",
       "0.0  6.0  0.0  0.0\n",
       "0.0  0.0  7.0  0.0\n",
       "0.0  0.0  0.0  8.0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Matrices.diag(dv1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.24195757362830206  0.48338029429776297  0.7295640132346459  0.9111713334577178\n",
       "0.5338953228076072   0.40732475317015204  0.5599033214128653  0.7227464735139351\n",
       "0.26061800259634715  0.11762896089043273  0.3574893375146967  0.7836246467511585"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Matrices.rand(3,4,new java.util.Random())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-0.10091008750260587  1.1003430904035745   -0.16868358279894863  -0.7017513880445515\n",
       "-0.2858952660461997   0.5718196598121303   -2.716554079811411    -0.9012402538189418\n",
       "-1.8202707815402122   -0.6656064061997368  0.4440977317479941    -0.604365905693581"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Matrices.randn(3,4,new java.util.Random())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2 x 3 CSCMatrix\n",
       "(0,0) 5.0\n",
       "(1,1) 3.0\n",
       "(0,1) 1.0\n",
       "(1,2) 4.0"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val sm = Matrices.sparse(2,3,Array(0,1,3,4),Array(0,1,0,1),Array(5.0,3.0,1.0,4.0))\n",
    "sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4 x 5 CSCMatrix\n",
       "(0,0) 10.0\n",
       "(1,1) 16.0\n",
       "(2,2) 11.0\n",
       "(3,2) 11.0\n",
       "(0,3) 12.0\n",
       "(1,3) 13.0\n",
       "(3,4) 13.0"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val sm2 = Matrices.sparse(4,5,Array(0,1,2,4,6,7),Array(0,1,2,3,0,1,3),Array(10.0,16.0,11.0,11.0,12.0,13.0,13.0))\n",
    "sm2"
   ]
  },
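  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the CSC arguments of `Matrices.sparse` easier to read, the following plain-Scala sketch expands column pointers, row indices and values into (row, column, value) triples. `cscToTriples` is a hypothetical helper written for this note; it is not a Spark or Breeze API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "// Hypothetical helper: expand CSC arrays into (row, col, value) triples.\n",
    "// The entries of column j are stored at positions colPtrs(j) until colPtrs(j + 1)\n",
    "// of both rowIndices and values.\n",
    "def cscToTriples(colPtrs: Array[Int],\n",
    "                 rowIndices: Array[Int],\n",
    "                 values: Array[Double]): Seq[(Int, Int, Double)] =\n",
    "  for {\n",
    "    j <- 0 until colPtrs.length - 1\n",
    "    k <- colPtrs(j) until colPtrs(j + 1)\n",
    "  } yield (rowIndices(k), j, values(k))\n",
    "\n",
    "// Same arrays as the 2 x 3 matrix sm above:\n",
    "// yields (0,0,5.0), (1,1,3.0), (0,1,1.0), (1,2,4.0)\n",
    "cscToTriples(Array(0, 1, 3, 4), Array(0, 1, 0, 1), Array(5.0, 3.0, 1.0, 4.0))"
   ]
  },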
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "class org.apache.spark.mllib.linalg.SparseMatrix"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sm2.getClass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sm2.isInstanceOf[Matrix]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10.0  0.0   0.0   12.0  0.0\n",
       "0.0   16.0  0.0   13.0  0.0\n",
       "0.0   0.0   11.0  0.0   0.0\n",
       "0.0   0.0   11.0  0.0   13.0"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sm2.asInstanceOf[SparseMatrix].toDense"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear algebra operations on local matrices\n",
    "- (row number, column number): indices start at 0\n",
    "- transpose: returns the transposed matrix\n",
    "- Convert to Breeze matrix objects to perform matrix operations\n",
    "- The conversion uses the following members of Matrix\n",
    "    - numRows: number of rows\n",
    "    - numCols: number of columns\n",
    "    - values: the matrix values as an Array; available after casting to DenseMatrix/SparseMatrix\n",
    "    - (dm.asInstanceOf[DenseMatrix].values, sm.asInstanceOf[SparseMatrix].values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.0"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dm(1,1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.0  0.0\n",
       "0.0  3.0\n",
       "1.0  4.0"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dm.transpose"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix, Matrix, Matrices}\n",
    "import breeze.linalg.{DenseMatrix => BDM,CSCMatrix => BSM,Matrix => BM}\n",
    "// Convert a Spark local matrix to the corresponding Breeze matrix\n",
    "def toBreezeM(m:Matrix):BM[Double] = m match {\n",
    "    case dm:DenseMatrix => new BDM(dm.numRows, dm.numCols, dm.values)\n",
    "    // Breeze CSCMatrix takes (data, rows, cols, colPtrs, rowIndices)\n",
    "    case sm:SparseMatrix => new BSM(sm.values, sm.numRows, sm.numCols, sm.colPtrs, sm.rowIndices)\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2,3)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(dm.numRows,dm.numCols)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Array(5.0, 0.0, 0.0, 3.0, 1.0, 4.0)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dm.asInstanceOf[DenseMatrix].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Array(5.0, 3.0, 1.0, 4.0)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sm.asInstanceOf[SparseMatrix].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10.0  0.0  2.0\n",
       "0.0   6.0  8.0"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "toBreezeM(dm) + toBreezeM(dm)"
   ]
  },
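  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Matrix products do not always require Breeze. The cell below is a hedged sketch, assuming a Spark 2.x runtime in which the local `Matrix` trait offers `multiply` for both a `Vector` and a `DenseMatrix`. The names `m`, `mv`, `mt` and `mm` are illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Vectors}\n",
    "\n",
    "// 2 x 3 matrix with the same values as dm above (given column by column)\n",
    "val m = Matrices.dense(2, 3, Array(5.0, 0.0, 0.0, 3.0, 1.0, 4.0))\n",
    "\n",
    "// Matrix-vector product: (2 x 3) * (3) gives a DenseVector of size 2, here [6.0, 7.0]\n",
    "val mv = m.multiply(Vectors.dense(1.0, 1.0, 1.0))\n",
    "\n",
    "// Matrix-matrix product: (2 x 3) * (3 x 2) gives a 2 x 2 DenseMatrix\n",
    "// The transpose of a DenseMatrix is again a DenseMatrix, so the cast is safe here\n",
    "val mt = m.transpose.asInstanceOf[DenseMatrix]\n",
    "val mm = m.multiply(mt)"
   ]
  },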
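  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Distributed matrices\n",
    "- Applying machine learning algorithms to big data requires distributed matrices\n",
    "- A distributed matrix is stored across the cluster and can hold a very large number of rows and columns\n",
    "- Row and column indices of a distributed matrix are of type Long\n",
    "- org.apache.spark.mllib.linalg.distributed.{RowMatrix, IndexedRowMatrix, BlockMatrix, CoordinateMatrix}\n",
    "- RowMatrix\n",
    "    - An RDD in which each row is stored as a Vector object\n",
    "    - Accessed through the rows attribute\n",
    "    - numRows, numCols\n",
    "    - multiply: multiplies the RowMatrix by a local matrix\n",
    "    - toRowMatrix: converts another distributed matrix to a RowMatrix\n",
    "    - There is no method for converting a RowMatrix to another distributed matrix type\n",
    "- IndexedRowMatrix\n",
    "    - An RDD of IndexedRow objects\n",
    "    - IndexedRow: stores the vector of row elements together with the row's index in the matrix\n",
    "- CoordinateMatrix\n",
    "    - An RDD of MatrixEntry objects\n",
    "    - MatrixEntry: stores a single element value together with its position (i,j) in the matrix\n",
    "    - Used for storing sparse matrices\n",
    "    - Using it for a dense matrix causes memory problems\n",
    "- BlockMatrix\n",
    "    - Supports addition and multiplication between distributed matrices\n",
    "    - Stored as an RDD of ((i,j), Matrix) tuples\n",
    "    - The matrix is split into multiple local matrix blocks\n",
    "    - All blocks have the same size, except that the last blocks may be smaller\n",
    "    - The validate method checks that the blocks have consistent sizes; the last blocks are not checked\n",
    "- Linear algebra operations on distributed matrices\n",
    "    - Matrix addition and multiplication are provided only by BlockMatrix\n",
    "    - Transposition is provided only by CoordinateMatrix and BlockMatrix, through the transpose method\n",
    "    - Other operations have to be implemented separately"
   ]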
  },
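  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The distributed matrix types can be sketched as follows. This assumes that `sc` is the SparkContext supplied by the notebook kernel and uses small made-up data. All value names (`rowMat`, `idxMat`, `coordMat`, `blockMat`, `added`, `product`) are illustrative and not part of the original notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import org.apache.spark.mllib.linalg.Vectors\n",
    "import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix, MatrixEntry, RowMatrix}\n",
    "\n",
    "// Assumption: `sc` is the SparkContext provided by the kernel\n",
    "// RowMatrix: an RDD of row vectors without row indices\n",
    "val rowMat = new RowMatrix(sc.parallelize(Seq(\n",
    "  Vectors.dense(5.0, 0.0, 1.0),\n",
    "  Vectors.dense(0.0, 3.0, 4.0))))\n",
    "\n",
    "// IndexedRowMatrix: each row carries its Long index\n",
    "val idxMat = new IndexedRowMatrix(sc.parallelize(Seq(\n",
    "  IndexedRow(0L, Vectors.dense(5.0, 0.0, 1.0)),\n",
    "  IndexedRow(1L, Vectors.dense(0.0, 3.0, 4.0)))))\n",
    "// Dropping the indices converts it to a RowMatrix\n",
    "val fromIdx = idxMat.toRowMatrix()\n",
    "\n",
    "// CoordinateMatrix: one MatrixEntry(i, j, value) per non-zero element\n",
    "val coordMat = new CoordinateMatrix(sc.parallelize(Seq(\n",
    "  MatrixEntry(0L, 0L, 5.0), MatrixEntry(0L, 2L, 1.0),\n",
    "  MatrixEntry(1L, 1L, 3.0), MatrixEntry(1L, 2L, 4.0))))\n",
    "\n",
    "// BlockMatrix: built here from the CoordinateMatrix with 2 x 2 blocks\n",
    "val blockMat = coordMat.toBlockMatrix(2, 2).cache()\n",
    "blockMat.validate()\n",
    "\n",
    "// Addition, transposition and multiplication between distributed matrices\n",
    "val added = blockMat.add(blockMat)\n",
    "val product = blockMat.multiply(blockMat.transpose)\n",
    "\n",
    "// Collect the small 2 x 2 result to the driver for inspection\n",
    "product.toLocalMatrix()"
   ]
  },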
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Apache Toree - Scala",
   "language": "scala",
   "name": "apache_toree_scala"
  },
  "language_info": {
   "file_extension": ".scala",
   "name": "scala",
   "version": "2.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}