learning spark ML chapter 2
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import breeze.linalg.{DenseMatrix => BDM,CSCMatrix => BSM,Matrix => BM}\n",
"import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix, Matrix, Matrices}\n",
"import org.apache.spark.mllib.linalg.distributed.{RowMatrix, CoordinateMatrix, BlockMatrix, DistributedMatrix, MatrixEntry}\n",
"\n",
"def printMat(mat:BM[Double]) = {\n",
" print(\" \")\n",
" for(j <- 0 to mat.cols-1) print(\"%-10d\".format(j));\n",
" println\n",
" for(i <- 0 to mat.rows-1) { print(\"%-6d\".format(i)); for(j <- 0 to mat.cols-1) print(\" %+9.3f\".format(mat(i, j))); println }\n",
"}\n",
"def toBreezeM(m:Matrix):BM[Double] = m match {\n",
" case dm:DenseMatrix => new BDM(dm.numRows, dm.numCols, dm.values)\n",
" case sm:SparseMatrix => new BSM(sm.values, sm.numCols, sm.numRows, sm.colPtrs, sm.rowIndices)\n",
"}\n",
"def toBreezeD(dm:DistributedMatrix):BM[Double] = dm match {\n",
" case rm:RowMatrix => {\n",
" val m = rm.numRows().toInt\n",
" val n = rm.numCols().toInt\n",
" val mat = BDM.zeros[Double](m, n)\n",
" var i = 0\n",
" rm.rows.collect().foreach { vector =>\n",
" for(j <- 0 to vector.size-1)\n",
" {\n",
" mat(i, j) = vector(j)\n",
" }\n",
" i += 1\n",
" }\n",
" mat\n",
" }\n",
" case cm:CoordinateMatrix => {\n",
" val m = cm.numRows().toInt\n",
" val n = cm.numCols().toInt\n",
" val mat = BDM.zeros[Double](m, n)\n",
" cm.entries.collect().foreach { case MatrixEntry(i, j, value) =>\n",
" mat(i.toInt, j.toInt) = value\n",
" }\n",
" mat\n",
" }\n",
" case bm:BlockMatrix => {\n",
" val localMat = bm.toLocalMatrix()\n",
" new BDM[Double](localMat.numRows, localMat.numCols, localMat.toArray)\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 선형 회귀\n",
"- 선형회귀 동작 방식\n",
"- 샘플 데이터셋 적용\n",
"- 데이터 분석 및 준비 과정\n",
"- 모델 성능 평가\n",
"- bias & variance의 상충관계, 교차 검증, 일반화의 개념"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 선형회귀\n",
"- 독립변수 셋을 사용해 목표변수를 예측하고 이들간의 관계를 계량화\n",
"- 독립변수와 목표변수 사이에 선형관계가 있다고 가정\n",
"- 단순 선형 회귀(simple linear regression), 다중 선형 회귀(multiple linear regression)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### simple linear regression\n",
"- 보스톤 주택 데이터셋 : UC 어바인\n",
"- 보스톤 교외에 위치한 자가 거주 주택의 평균 가격과 이 가격을 예측하는 데 사용 할 수 있는 특징 변수 13개로 구성\n",
"- 거주인당 평균 방 개수만을 사용해 선형 회귀 모델 실습\n",
"\n",
"- 모델(가설함수): h(x) = w0 + w1x\n",
"- 방법 : 모델에 적합한 가중치 추정(w0, w1), cost function 최소화하는 가장 적절한 가중치\n",
"- cost function : mean squared error, C(w0,w1) = 1/2m * sum(h(xi) - yi)^2 = 1/2m * sum(w0+w1xi - yi)^2\n",
"- cost function 값이 최저인 지점"
]
},
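{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative values, not the book's code): the MSE cost\n",
"// C(w0, w1) = 1/(2m) * sum_i (w0 + w1*x_i - y_i)^2 from the cell above,\n",
"// evaluated on a tiny made-up dataset.\n",
"def mseCost(w0: Double, w1: Double, xs: Array[Double], ys: Array[Double]): Double = {\n",
"  val m = xs.length\n",
"  val squaredErrors = xs.zip(ys).map { case (x, y) => math.pow(w0 + w1 * x - y, 2) }\n",
"  squaredErrors.sum / (2 * m)\n",
"}\n",
"// illustrative values: rooms per dwelling vs. price\n",
"val xs = Array(5.0, 6.0, 7.0)\n",
"val ys = Array(15.0, 21.0, 27.0)\n",
"mseCost(0.0, 3.0, xs, ys)"
]
},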
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### multiple linear regression\n",
"- 더많은 독립 변수(차원) 활용\n",
"- 주택 데이터셋의 모든 독립변수(13개) 활용\n",
"- 비용 함수를 그래프화 하기 어려움(불가능)\n",
"- 모델(가설함수) : h(x) = w0 + w1x + w2x + ... + wnxn = W_t * X\n",
"- W_t : [w0, w1, w2, ... , wn] (weight vector traspose)\n",
"- X : [1, x1, x2, ... , xn]\n",
"- cost function : C(w) = 1/2m * sum(W_t * X(i) - y(i))^2"
]
},
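{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative values, not the book's code): the hypothesis\n",
"// h(x) = W_t * X as a dot product, with X(0) fixed to 1 so that w0 acts as the bias term.\n",
"// Breeze is already imported in the first cell of this notebook.\n",
"import breeze.linalg.DenseVector\n",
"val w = DenseVector(22.5, 3.1, -0.5)   // [w0, w1, w2]\n",
"val x = DenseVector(1.0, 6.2, 18.0)    // [1, x1, x2]\n",
"w dot x                                // h(x) = W_t * X"
]
},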
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 최저점 찾기\n",
"- 정규 방정식(normal equation)\n",
" - w = (X_t * X)^-1 * X_t * y\n",
" - 많은 계산량 필요\n",
"- 경사 하강법(gradient-descent)\n",
" - cost function의 편미분(partial derivative) 계산\n",
" - weight 수정 및 반복, 허용치(tolerance value) 이용 수렴(converged) 판단"
]
},
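{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch of both approaches above on toy data (illustrative, not the book's code).\n",
"// BDM (Breeze DenseMatrix) is already aliased in the first cell of this notebook.\n",
"import breeze.linalg.{DenseVector => BDV, inv}\n",
"val X = BDM((1.0, 5.0), (1.0, 6.0), (1.0, 7.0))   // first column is the bias term\n",
"val y = BDV(15.0, 21.0, 27.0)\n",
"// normal equation: w = (X_t * X)^-1 * X_t * y\n",
"val wNormal = inv(X.t * X) * (X.t * y)\n",
"// one batch gradient-descent step: w := w - alpha/m * X_t * (X*w - y)\n",
"val alpha = 0.01\n",
"val m = X.rows.toDouble\n",
"var w = BDV.zeros[Double](X.cols)\n",
"w = w - (X.t * (X * w - y)) * (alpha / m)"
]
},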
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 데이터 분석 및 준비\n",
"- 보스톤 주택 데이터셋 : https://goo.gl/MFsmFW"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RDD : resilient distributed dataset\n",
" - immutable : read-only\n",
" - resilent : 스파크 내부의 장애 복구 메커니즘이 RDD의 복원성을 부여\n",
" - distributed\n",
" - 분산 컬렉션의 성질과 장애 내성을 추상화하고 직관적인 방식으로 대규모 데이터셋에 병렬 연산을 수행할 수 있도록 지원\n",
" - 데이터의 분산 처리에 필요한 여러가지 요소들을 추상화하여 엔지니어가 데이터의 계산과 처리에 집중할 수 있도록 설계\n",
" - 데이터의 변환 메커니즘으로 변환방식을 기술하여 저장함으로서 목적을 달성한다.\n",
"- RDD 연산자\n",
" - Transformation, Action 연산자로 나뉨\n",
"- Transformation(변환 연산자)\n",
" - lazy evaluation\n",
" - map, filter, distinct, flatMap\n",
"- Action(행동 연산자)\n",
" - 실제 transformation이 실행\n",
" - first, top, count, collect, foreach, take\n",
" - sample, takeSample : 복원샘플과 비복원 샘플 지원"
]
},
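{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative, not from the book): transformations only describe\n",
"// the computation and are evaluated lazily; an action such as count or collect\n",
"// triggers the actual execution.\n",
"val nums = sc.parallelize(1 to 10, 2)                  // RDD with 2 partitions\n",
"val evensDoubled = nums.filter(_ % 2 == 0).map(_ * 2)  // lazy transformations\n",
"evensDoubled.count()                                   // action: runs the job\n",
"evensDoubled.collect()                                 // action: Array(4, 8, 12, 16, 20)"
]
},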
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. 데이터 준비\n",
"1. 파일 데이터 읽기\n",
"2. 벡터화"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import org.apache.spark.mllib.linalg.Vectors"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"../datas/housing.data MapPartitionsRDD[1] at textFile at <console>:24"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 1. 보스톤 주택 데이터 RDD 생성, 파티션 수 6개\n",
"val housingLines = sc.textFile(\"../datas/housing.data\",6)\n",
"housingLines"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"88.97620, 0.00, 18.100, 0, 0.6710, 6.9680, 91.90, 1.4165, 24, 666.0, 20.20, 396.90, 17.21, 10.40\n",
"73.53410, 0.00, 18.100, 0, 0.6790, 5.9570, 100.00, 1.8026, 24, 666.0, 20.20, 16.45, 20.62, 8.80\n",
"67.92080, 0.00, 18.100, 0, 0.6930, 5.6830, 100.00, 1.4254, 24, 666.0, 20.20, 384.97, 22.98, 5.00\n"
]
}
],
"source": [
"// - 샘플 출력\n",
"housingLines.top(3).foreach(println)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"506"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// - 전체 데이터 수\n",
"housingLines.count"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" 0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30, 396.90, 4.98, 24.00\""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// - 샘플 데이터 Dense 벡터 생성\n",
"val housing1 = housingLines.first\n",
"housing1"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Vectors.dense(housing1.split(\",\").map(_.trim().toDouble))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Vectors.dense(for(d <- housing1.split(\",\")) yield d.trim().toDouble)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MapPartitionsRDD[3] at map at <console>:26"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 2. Dense Vector RDD로 변환\n",
"val housingVals = housingLines.map(x => Vectors.dense(x.split(\",\").map(_.trim().toDouble)))\n",
"housingVals"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]\n",
"[0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6]\n"
]
}
],
"source": [
"// - 샘플 출력\n",
"housingVals.take(2).foreach(println)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. 분포 분석\n",
"1. RowMatrix 변환 하여 couputeColumnSummaryStatistics 이용\n",
"2. Statistics.colStats 이용\n",
"3. 생성된 Statistics 객체를 이용하여 분석\n",
" - min, max, mean, variance, normL1, normL2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"org.apache.spark.mllib.linalg.distributed.RowMatrix@15732b4a"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// RowMatrix\n",
"import org.apache.spark.mllib.linalg.distributed.RowMatrix\n",
"val housingMat = new RowMatrix(housingVals)\n",
"housingMat"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@35dec45b"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val housingStats = housingMat.computeColumnSummaryStatistics()\n",
"housingStats"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3.6135235573122526,11.363636363636367,11.13677865612648,0.0691699604743083,0.5546950592885376,6.284634387351778,68.57490118577074,3.7950426877470362,9.549407114624508,408.2371541501976,18.45553359683794,356.67403162055336,12.653063241106718,22.532806324110666]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingStats.mean"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"// Statistics\n",
"import org.apache.spark.mllib.stat.Statistics\n",
"val housingStats = Statistics.colStats(housingVals)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3.6135235573122526,11.363636363636365,11.13677865612648,0.0691699604743083,0.5546950592885376,6.284634387351778,68.57490118577074,3.7950426877470362,9.549407114624508,408.2371541501976,18.455533596837945,356.67403162055336,12.653063241106718,22.532806324110666]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingStats.mean"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0, 0 : -0.388 \n",
"0, 1 : 0.360 \n",
"0, 2 : -0.484 \n",
"0, 3 : 0.175 \n",
"0, 4 : -0.427 \n",
"0, 5 : 0.695 \n",
"0, 6 : -0.377 \n",
"0, 7 : 0.250 \n",
"0, 8 : -0.382 \n",
"0, 9 : -0.469 \n",
"0, 10 : -0.508 \n",
"0, 11 : 0.333 \n",
"0, 12 : -0.738 \n",
"0, 13 : 1.000 \n"
]
}
],
"source": [
"// correlatin coefficient(상관 계수 계산)\n",
"val housingCorr = Statistics.corr(housingVals)\n",
"for(i <- 0 until housingCorr.numRows) printf(\"0, %s : %9.3f \\n\",i,housingCorr(i,housingCorr.numRows-1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. cosine similarity(코사인 유사도)\n",
"- 두 벡터간의 방향성을 분석, 두 벡터간의 각도\n",
"- 두 벡터간의 유사도와 관련된 분야에 활용 : 상품이나 뉴스 추천, Word2Vec\n",
"- RowMatrix.columnSimilarities 이용( > v 2.0 )\n",
"- upper-triangular matrix(상삼각 행렬) 형태의 distributed CoordinateMatrix 생성\n",
"- Breeze DenseMatrix 변환하여 데이터 확인"
]
},
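{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative, not from the book): cosine similarity computed\n",
"// directly from its definition, dot(a, b) / (||a|| * ||b||), which is what\n",
"// columnSimilarities estimates for every pair of columns.\n",
"def cosineSim(a: Array[Double], b: Array[Double]): Double = {\n",
"  val dot = a.zip(b).map { case (x, y) => x * y }.sum\n",
"  val normA = math.sqrt(a.map(v => v * v).sum)\n",
"  val normB = math.sqrt(b.map(v => v * v).sum)\n",
"  dot / (normA * normB)\n",
"}\n",
"cosineSim(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0))   // 1.0: same direction"
]
},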
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@4b11d1fb"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val housingColSims = housingMat.columnSimilarities\n",
"housingColSims"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1 2 3 4 5 6 7 8 9 10 11 12 13 \n",
"0 +0.000 +0.004 +0.527 +0.052 +0.459 +0.363 +0.482 +0.169 +0.675 +0.563 +0.416 +0.288 +0.544 +0.224\n",
"1 +0.000 +0.000 +0.122 +0.078 +0.334 +0.467 +0.211 +0.673 +0.135 +0.297 +0.394 +0.464 +0.200 +0.528\n",
"2 +0.000 +0.000 +0.000 +0.256 +0.915 +0.824 +0.916 +0.565 +0.840 +0.931 +0.869 +0.779 +0.897 +0.693\n",
"3 +0.000 +0.000 +0.000 +0.000 +0.275 +0.271 +0.275 +0.184 +0.190 +0.230 +0.248 +0.266 +0.204 +0.307\n",
"4 +0.000 +0.000 +0.000 +0.000 +0.000 +0.966 +0.962 +0.780 +0.808 +0.957 +0.977 +0.929 +0.912 +0.873\n",
"5 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.909 +0.880 +0.719 +0.906 +0.982 +0.966 +0.832 +0.949\n",
"6 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.672 +0.801 +0.929 +0.930 +0.871 +0.918 +0.803\n",
"7 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.485 +0.710 +0.856 +0.882 +0.644 +0.856\n",
"8 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.917 +0.771 +0.642 +0.806 +0.588\n",
"9 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.939 +0.854 +0.907 +0.789\n",
"10 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.957 +0.887 +0.897\n",
"11 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.799 +0.928\n",
"12 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.670\n",
"13 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000\n"
]
}
],
"source": [
"val housingColSimsBDM = toBreezeD(housingColSims)\n",
"printMat(housingColSimsBDM)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingColSimsBDM(0,0)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0038331826140674792"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingColSimsBDM(0,1)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0, 0 : 0.224 \n",
"0, 1 : 0.528 \n",
"0, 2 : 0.693 \n",
"0, 3 : 0.307 \n",
"0, 4 : 0.873 \n",
"0, 5 : 0.949 \n",
"0, 6 : 0.803 \n",
"0, 7 : 0.856 \n",
"0, 8 : 0.588 \n",
"0, 9 : 0.789 \n",
"0, 10 : 0.897 \n",
"0, 11 : 0.928 \n",
"0, 12 : 0.670 \n",
"0, 13 : 0.000 \n"
]
}
],
"source": [
"// 집값과 각 항목간의 유사도\n",
"for(i <- 0 until housingColSimsBDM.rows) printf(\"0, %s : %9.3f \\n\",i,housingColSimsBDM(i,housingColSimsBDM.rows-1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. covariance matrix(공분산 행렬)\n",
"- 두 벡터간의 유사도 분석\n",
"- 선형 연관성을 모델링 하는 통계 기법\n",
"- RowMatrix.computeCovariance 이용( > v 2.0 )\n",
"- sysmmetric matrix(상삼각 행렬) 형태의 distributed CoordinateMatrix 생성\n",
"- Statistics.corr 이용(위 예제 참고)"
]
},
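{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative, not from the book): sample covariance between two\n",
"// columns from its definition, cov(x, y) = sum_i (x_i - mean_x)*(y_i - mean_y) / (n-1);\n",
"// computeCovariance fills a matrix with this value for every pair of columns.\n",
"def sampleCov(x: Array[Double], y: Array[Double]): Double = {\n",
"  val n = x.length\n",
"  val mx = x.sum / n\n",
"  val my = y.sum / n\n",
"  x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum / (n - 1)\n",
"}\n",
"sampleCov(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 7.0))   // positive: they move together"
]
},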
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"class org.apache.spark.mllib.linalg.DenseMatrix"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val housingCovar = housingMat.computeCovariance()\n",
"housingCovar.getClass"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1 2 3 4 5 6 7 8 9 10 11 12 13 \n",
"0 +73.987 -40.216 +23.992 -0.122 +0.420 -1.325 +85.405 -6.877 +46.848 +844.822 +5.399 -302.382 +27.986 -30.719\n",
"1 -40.216 +543.937 -85.413 -0.253 -1.396 +5.113 -373.902 +32.629 -63.349 -1236.454 -19.777 +373.721 -68.783 +77.315\n",
"2 +23.992 -85.413 +47.064 +0.110 +0.607 -1.888 +124.514 -10.228 +35.550 +833.360 +5.692 -223.580 +29.580 -30.521\n",
"3 -0.122 -0.253 +0.110 +0.065 +0.003 +0.016 +0.619 -0.053 -0.016 -1.523 -0.067 +1.131 -0.098 +0.409\n",
"4 +0.420 -1.396 +0.607 +0.003 +0.013 -0.025 +2.386 -0.188 +0.617 +13.046 +0.047 -4.021 +0.489 -0.455\n",
"5 -1.325 +5.113 -1.888 +0.016 -0.025 +0.494 -4.752 +0.304 -1.284 -34.583 -0.541 +8.215 -3.080 +4.493\n",
"6 +85.405 -373.902 +124.514 +0.619 +2.386 -4.752 +792.358 -44.329 +111.771 +2402.690 +15.937 -702.940 +121.078 -97.589\n",
"7 -6.877 +32.629 -10.228 -0.053 -0.188 +0.304 -44.329 +4.434 -9.068 -189.665 -1.060 +56.040 -7.473 +4.840\n",
"8 +46.848 -63.349 +35.550 -0.016 +0.617 -1.284 +111.771 -9.068 +75.816 +1335.757 +8.761 -353.276 +30.385 -30.561\n",
"9 +844.822 -1236.454 +833.360 -1.523 +13.046 -34.583 +2402.690 -189.665 +1335.757 +28404.759 +168.153 -6797.911 +654.715 -726.256\n",
"10 +5.399 -19.777 +5.692 -0.067 +0.047 -0.541 +15.937 -1.060 +8.761 +168.153 +4.687 -35.060 +5.783 -10.111\n",
"11 -302.382 +373.721 -223.580 +1.131 -4.021 +8.215 -702.940 +56.040 -353.276 -6797.911 -35.060 +8334.752 -238.668 +279.990\n",
"12 +27.986 -68.783 +29.580 -0.098 +0.489 -3.080 +121.078 -7.473 +30.385 +654.715 +5.783 -238.668 +50.995 -48.448\n",
"13 -30.719 +77.315 -30.521 +0.409 -0.455 +4.493 -97.589 +4.840 -30.561 -726.256 -10.111 +279.990 -48.448 +84.587\n"
]
}
],
"source": [
"printMat(toBreezeM(housingCovar))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. linear regress을 위한 데이터 준비\n",
"- 레이블 포인트로 변환\n",
" - LabeledPoint 구조체로 변환 : 목표값, 특징 변수 벡터로 구성, 대부분의 머신러닝 알고리즘에서 사용\n",
"- 데이터 분할 : train, test 데이터 셋 분할\n",
" - RDD 함수 이용 : randomSplit\n",
"- 데이터 표준화 : 스케일링 및 평균 정규화\n",
" - 스케일링(feature scaling) : 데이터 범위를 비슷한 크리로 조정\n",
" - 평균 정규화(mean normalization) : 평균이 0인 분포로 변환, 정규 분포임을 가장하고 주로 사용\n",
" - StandardScaler : 스케일링과 표준 정규화를 함께 처리"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(24.0,[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 레이블 포인트\n",
"import org.apache.spark.mllib.regression.LabeledPoint\n",
"\n",
"val housingData = housingVals.map(x => {\n",
" val line = x.toArray\n",
" LabeledPoint(line.last,Vectors.dense(line.slice(0,line.length-1)))\n",
"})\n",
"housingData.first"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"417"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 데이터 분할\n",
"val sets = housingData.randomSplit(Array(0.8,0.2))\n",
"val housingTrain = sets(0)\n",
"val housingTest = sets(1)\n",
"housingTrain.count"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0063 - 88.9762 : 88.9699\n",
"0.0000 - 100.0000 : 100.0000\n",
"0.4600 - 27.7400 : 27.2800\n",
"0.0000 - 1.0000 : 1.0000\n",
"0.3850 - 0.8710 : 0.4860\n",
"3.5610 - 8.7800 : 5.2190\n",
"2.9000 - 100.0000 : 97.1000\n",
"1.1296 - 12.1265 : 10.9969\n",
"1.0000 - 24.0000 : 23.0000\n",
"187.0000 - 711.0000 : 524.0000\n",
"12.6000 - 22.0000 : 9.4000\n",
"0.3200 - 396.9000 : 396.5800\n",
"1.7300 - 37.9700 : 36.2400\n",
"5.0000 - 50.0000 : 45.0000\n"
]
}
],
"source": [
"// 데이터 분포 체크\n",
"for(mm <- housingStats.min.toArray.zip(housingStats.max.toArray)){\n",
" printf(\"%1.4f - %1.4f : %1.4f\\n\",mm._1,mm._2,mm._2-mm._1)\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"org.apache.spark.mllib.feature.StandardScalerModel@26c52f77"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 스케일러 생성(스케일링, 정규화)\n",
"import org.apache.spark.mllib.feature.StandardScaler\n",
"val scaler = new StandardScaler(true, true).fit(housingTrain.map(_.features))\n",
"scaler"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"417"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 스케일 적용(스케일링, 정규화)\n",
"val trainScaled = housingTrain.map(x => LabeledPoint(x.label,scaler.transform(x.features)))\n",
"val testScaled = housingTest.map(x => LabeledPoint(x.label,scaler.transform(x.features)))\n",
"trainScaled.count"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(24.0,[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingTrain.first"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(24.0,[-0.4693925254631978,0.2687770211872631,-1.2747871495548844,-0.2730623188731474,-0.14385350282962484,0.43199631109148545,-0.10805893821126049,0.14558225189264024,-0.96975264275946,-0.653237741365371,-1.4218179119546304,0.4480888765307575,-1.0721133879367546])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainScaled.first"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. 선형 회귀 모델 학습\n",
"- org.aprache.spark.mllib.regression 패키지 LinearRegressionModel 클래스로 구현\n",
" - 학습이 완료된 선형 회귀 모델의 매개변수 저장\n",
" - predict 메서드를 이용하여 값을 예측\n",
"- LinearRegressionWithSGD 이용하여 LinsearRegressionModel 생성\n",
" - train 메서드 : bias를 제외한 weight 학습, LinearRegressionWithSGD.train(데이터, 반복횟수, 학습율)\n",
" - LinearRegressionWithSGD 객체 생성후, bias 학습 세팅, 반복횟수 설정, 데이터 캐싱, run 메스드(훈련)\n",
"- 값 예측\n",
" - predict"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import org.apache.spark.mllib.regression.LinearRegressionWithSGD\n",
"val model1 = LinearRegressionWithSGD.train(trainScaled,200, 1.0)\n",
"model1"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"class org.apache.spark.mllib.regression.LinearRegressionModel"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model1.getClass"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-0.37785930098621784,0.7281842617508575,-0.11745209635664786,0.8246981641151655,-1.760815522334877,3.1055999305728936,-0.485204611122377,-2.9388833381201045,1.7360597129016102,-1.1979267166377676,-2.0326627329778537,1.003143321052567,-3.287004588197274]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model1.weights"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model1.intercept"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val lrm = new LinearRegressionWithSGD()\n",
"lrm.setIntercept(true)\n",
"lrm.optimizer.setNumIterations(200)\n",
"lrm.optimizer.setStepSize(1.0)\n",
"trainScaled.cache()\n",
"val model2 = lrm.run(trainScaled)\n",
"model2"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"class org.apache.spark.mllib.regression.LinearRegressionModel"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2.getClass"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-0.2923374177734875,0.5941692181822921,-0.24760684105702974,0.8463715671785175,-1.4687758707821237,3.2028460786202753,-0.5190199406486725,-2.6760815549376082,1.309178867916586,-0.8367071826015235,-1.9757876070323401,1.0289607840181993,-3.2566385434466443]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2.weights"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"22.45707434052759"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2.intercept"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MapPartitionsRDD[699] at map at <console>:51"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 테스트 데이터 예측\n",
"val testPredicts = testScaled.map(x => (model2.predict(x.features), x.label))\n",
"testPredicts"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"89"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val testPredictsArr = testPredicts.collect()\n",
"testPredictsArr.length"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(22.785998999685642,22.9)\n",
"(19.1619715954319,18.2)\n",
"(19.396878618741173,19.9)\n",
"(12.521252819041827,13.6)\n",
"(13.23501944140063,13.9)\n"
]
}
],
"source": [
"testPredictsArr.slice(0,5).foreach(println)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.608326430430329"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// RMSE\n",
"val rvals = testPredictsArr.map{case(p,y) => math.pow(p-y,2)}\n",
"math.sqrt(rvals.sum / rvals.length)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7. 모델 평가 및 해석\n",
"- RegressionMetrics 클래스를 이용하여 여러가지 평가 가능\n",
" - rootMeanSquareError\n",
" - meanSquareError\n",
" - meanAbsoluteError\n",
" - r2 : coefficient of determination(결정계수, R^2), 0~1 값, 설명 변량의 비율, 1에 가까울 수록 좋음\n",
" - explainedVariance : r2 와 유사\n",
" - 결정계수를 실무에서 자주 사용하지만 상관성이 적은 feature라도 추가를 하면 값이 커지는 경향이 있다.\n",
"- 매개 변수 해석\n",
" - 모델의 weight를 이용하여 각 변수가 예측 값에 미치는 영향력을 해석\n",
" - 값이 클수록 영향력이 크다는 것을 의미"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.608326430430328"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 평가 : RMSE, MSE, MAE, R^2(coefficient of determination)\n",
"import org.apache.spark.mllib.evaluation.RegressionMetrics\n",
"val validMetrics = new RegressionMetrics(testPredicts)\n",
"validMetrics.rootMeanSquaredError"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"31.45332535026338"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"validMetrics.meanSquaredError"
]
},
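{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch: the remaining metrics listed above, read from the same\n",
"// validMetrics object created two cells earlier (not executed in the original notebook).\n",
"val mae = validMetrics.meanAbsoluteError\n",
"val r2 = validMetrics.r2\n",
"val expVar = validMetrics.explainedVariance\n",
"(mae, r2, expVar)"
]
},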
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(0.24760684105702974,2)\n",
"(0.2923374177734875,0)\n",
"(0.5190199406486725,6)\n",
"(0.5941692181822921,1)\n",
"(0.8367071826015235,9)\n",
"(0.8463715671785175,3)\n",
"(1.0289607840181993,11)\n",
"(1.309178867916586,8)\n",
"(1.4687758707821237,4)\n",
"(1.9757876070323401,10)\n",
"(2.6760815549376082,7)\n",
"(3.2028460786202753,5)\n",
"(3.2566385434466443,12)\n"
]
}
],
"source": [
"// 해석\n",
"model2.weights.toArray.map(_.abs).zipWithIndex.sortBy(_._1).foreach(println)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 매개변수 해석\n",
"- 가장 영향력이 큰 컬럼은 12번째 LSTAT(저소득층 인구 비율), 두번째로는 7번째 DIS(보스턴 각 고용센터까지의 거리합)\n",
"- 영향력이 적은 컬럼은 6번째, 2번째 컬럼이며 모델에서 제거하여도 별 영향이 없으며 오히려 성늘이 더 향상 될 수도 있다.\n",
" - 6번 : AGE, 1940년 이전 지어진 자가 거주 건물 비율\n",
" - 2번 : INDUS, 마을당 비-소매 비즈니스 토지의 에이커 비율"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 모델의 저장 및 불러오기\n",
"- 저장 : save\n",
"- 불러오기 : load\n",
"- Parqut 파일 포맷"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Name: org.apache.hadoop.mapred.FileAlreadyExistsException\n",
"Message: Output directory file:/opt/mynotebook/models/linearRegressionModel/metadata already exists\n",
"StackTrace: at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)\n",
" at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:283)\n",
" at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n",
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n",
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n",
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n",
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n",
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n",
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)\n",
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)\n",
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)\n",
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n",
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n",
" at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)\n",
" at org.apache.spark.mllib.regression.impl.GLMRegressionModel$SaveLoadV1_0$.save(GLMRegressionModel.scala:56)\n",
" at org.apache.spark.mllib.regression.LinearRegressionModel.save(LinearRegression.scala:52)"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2.save(sc,\"../models/linearRegressionModel\")"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-0.901459854136914,0.9509812890773397,-0.10314713489439738,0.8805042807920195,-2.0668983757289556,2.4936373935100935,-0.06818383702924491,-3.2553295187452846,1.7217190130008164,-0.777583539625745,-2.126496701079123,0.9829657849466839,-4.016312120372918]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import org.apache.spark.mllib.regression.LinearRegressionModel\n",
"val loadModel = LinearRegressionModel.load(sc,\"../models/linearRegressionModel\")\n",
"loadModel.weights"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Apache Toree - Scala",
"language": "scala",
"name": "apache_toree_scala"
},
"language_info": {
"file_extension": ".scala",
"name": "scala",
"version": "2.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}