Created
July 19, 2018 02:02
-
-
Save mrbkdad/37faa8966b07ca96cbd10279035c77d1 to your computer and use it in GitHub Desktop.
learning spark ML chapter 2
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import breeze.linalg.{DenseMatrix => BDM,CSCMatrix => BSM,Matrix => BM}\n", | |
"import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix, Matrix, Matrices}\n", | |
"import org.apache.spark.mllib.linalg.distributed.{RowMatrix, CoordinateMatrix, BlockMatrix, DistributedMatrix, MatrixEntry}\n", | |
"\n", | |
"def printMat(mat:BM[Double]) = {\n", | |
" print(\" \")\n", | |
" for(j <- 0 to mat.cols-1) print(\"%-10d\".format(j));\n", | |
" println\n", | |
" for(i <- 0 to mat.rows-1) { print(\"%-6d\".format(i)); for(j <- 0 to mat.cols-1) print(\" %+9.3f\".format(mat(i, j))); println }\n", | |
"}\n", | |
"def toBreezeM(m:Matrix):BM[Double] = m match {\n", | |
" case dm:DenseMatrix => new BDM(dm.numRows, dm.numCols, dm.values)\n", | |
" case sm:SparseMatrix => new BSM(sm.values, sm.numCols, sm.numRows, sm.colPtrs, sm.rowIndices)\n", | |
"}\n", | |
"def toBreezeD(dm:DistributedMatrix):BM[Double] = dm match {\n", | |
" case rm:RowMatrix => {\n", | |
" val m = rm.numRows().toInt\n", | |
" val n = rm.numCols().toInt\n", | |
" val mat = BDM.zeros[Double](m, n)\n", | |
" var i = 0\n", | |
" rm.rows.collect().foreach { vector =>\n", | |
" for(j <- 0 to vector.size-1)\n", | |
" {\n", | |
" mat(i, j) = vector(j)\n", | |
" }\n", | |
" i += 1\n", | |
" }\n", | |
" mat\n", | |
" }\n", | |
" case cm:CoordinateMatrix => {\n", | |
" val m = cm.numRows().toInt\n", | |
" val n = cm.numCols().toInt\n", | |
" val mat = BDM.zeros[Double](m, n)\n", | |
" cm.entries.collect().foreach { case MatrixEntry(i, j, value) =>\n", | |
" mat(i.toInt, j.toInt) = value\n", | |
" }\n", | |
" mat\n", | |
" }\n", | |
" case bm:BlockMatrix => {\n", | |
" val localMat = bm.toLocalMatrix()\n", | |
" new BDM[Double](localMat.numRows, localMat.numCols, localMat.toArray)\n", | |
" }\n", | |
"}" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 선형 회귀\n", | |
"- 선형회귀 동작 방식\n", | |
"- 샘플 데이터셋 적용\n", | |
"- 데이터 분석 및 준비 과정\n", | |
"- 모델 성능 평가\n", | |
"- bias & variance의 상충관계, 교차 검증, 일반화의 개념" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 선형회귀\n", | |
"- 독립변수 셋을 사용해 목표변수를 예측하고 이들간의 관계를 계량화\n", | |
"- 독립변수와 목표변수 사이에 선형관계가 있다고 가정\n", | |
"- 단순 선형 회귀(simple linear regression), 다중 선형 회귀(multiple linear regression)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### simple linear regression\n", | |
"- 보스톤 주택 데이터셋 : UC 어바인\n", | |
"- 보스톤 교외에 위치한 자가 거주 주택의 평균 가격과 이 가격을 예측하는 데 사용 할 수 있는 특징 변수 13개로 구성\n", | |
"- 거주인당 평균 방 개수만을 사용해 선형 회귀 모델 실습\n", | |
"\n", | |
"- 모델(가설함수): h(x) = w0 + w1x\n", | |
"- 방법 : 모델에 적합한 가중치 추정(w0, w1), cost function 최소화하는 가장 적절한 가중치\n", | |
"- cost function : mean squared error, C(w0,w1) = 1/2m * sum(h(xi) - yi)^2 = 1/2m * sum(w0+w1xi - yi)^2\n", | |
"- cost function 값이 최저인 지점" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### multiple linear regression\n", | |
"- 더많은 독립 변수(차원) 활용\n", | |
"- 주택 데이터셋의 모든 독립변수(13개) 활용\n", | |
"- 비용 함수를 그래프화 하기 어려움(불가능)\n", | |
"- 모델(가설함수) : h(x) = w0 + w1x + w2x + ... + wnxn = W_t * X\n", | |
"- W_t : [w0, w1, w2, ... , wn] (weight vector traspose)\n", | |
"- X : [1, x1, x2, ... , xn]\n", | |
"- cost function : C(w) = 1/2m * sum(W_t * X(i) - y(i))^2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 최저점 찾기\n", | |
"- 정규 방정식(normal equation)\n", | |
" - w = (X_t * X)^-1 * X_t * y\n", | |
" - 많은 계산량 필요\n", | |
"- 경사 하강법(gradient-descent)\n", | |
" - cost function의 편미분(partial derivative) 계산\n", | |
" - weight 수정 및 반복, 허용치(tolerance value) 이용 수렴(converged) 판단" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 데이터 분석 및 준비\n", | |
"- 보스톤 주택 데이터셋 : https://goo.gl/MFsmFW" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### RDD : resilient distributed dataset\n", | |
" - immutable : read-only\n", | |
" - resilent : 스파크 내부의 장애 복구 메커니즘이 RDD의 복원성을 부여\n", | |
" - distributed\n", | |
" - 분산 컬렉션의 성질과 장애 내성을 추상화하고 직관적인 방식으로 대규모 데이터셋에 병렬 연산을 수행할 수 있도록 지원\n", | |
" - 데이터의 분산 처리에 필요한 여러가지 요소들을 추상화하여 엔지니어가 데이터의 계산과 처리에 집중할 수 있도록 설계\n", | |
" - 데이터의 변환 메커니즘으로 변환방식을 기술하여 저장함으로서 목적을 달성한다.\n", | |
"- RDD 연산자\n", | |
" - Transformation, Action 연산자로 나뉨\n", | |
"- Transformation(변환 연산자)\n", | |
" - lazy evaluation\n", | |
" - map, filter, distinct, flatMap\n", | |
"- Action(행동 연산자)\n", | |
" - 실제 transformation이 실행\n", | |
" - first, top, count, collect, foreach, take\n", | |
" - sample, takeSample : 복원샘플과 비복원 샘플 지원" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 1. 데이터 준비\n", | |
"1. 파일 데이터 읽기\n", | |
"2. 벡터화" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import org.apache.spark.mllib.linalg.Vectors" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"../datas/housing.data MapPartitionsRDD[1] at textFile at <console>:24" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// 1. 보스톤 주택 데이터 RDD 생성, 파티션 수 6개\n", | |
"val housingLines = sc.textFile(\"../datas/housing.data\",6)\n", | |
"housingLines" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"88.97620, 0.00, 18.100, 0, 0.6710, 6.9680, 91.90, 1.4165, 24, 666.0, 20.20, 396.90, 17.21, 10.40\n", | |
"73.53410, 0.00, 18.100, 0, 0.6790, 5.9570, 100.00, 1.8026, 24, 666.0, 20.20, 16.45, 20.62, 8.80\n", | |
"67.92080, 0.00, 18.100, 0, 0.6930, 5.6830, 100.00, 1.4254, 24, 666.0, 20.20, 384.97, 22.98, 5.00\n" | |
] | |
} | |
], | |
"source": [ | |
"// - 샘플 출력\n", | |
"housingLines.top(3).foreach(println)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"506" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// - 전체 데이터 수\n", | |
"housingLines.count" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"\" 0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30, 396.90, 4.98, 24.00\"" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// - 샘플 데이터 Dense 벡터 생성\n", | |
"val housing1 = housingLines.first\n", | |
"housing1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"Vectors.dense(housing1.split(\",\").map(_.trim().toDouble))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"Vectors.dense(for(d <- housing1.split(\",\")) yield d.trim().toDouble)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"MapPartitionsRDD[3] at map at <console>:26" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// 2. Dense Vector RDD로 변환\n", | |
"val housingVals = housingLines.map(x => Vectors.dense(x.split(\",\").map(_.trim().toDouble)))\n", | |
"housingVals" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]\n", | |
"[0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6]\n" | |
] | |
} | |
], | |
"source": [ | |
"// - 샘플 출력\n", | |
"housingVals.take(2).foreach(println)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 2. 분포 분석\n", | |
"1. RowMatrix 변환 하여 couputeColumnSummaryStatistics 이용\n", | |
"2. Statistics.colStats 이용\n", | |
"3. 생성된 Statistics 객체를 이용하여 분석\n", | |
" - min, max, mean, variance, normL1, normL2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"org.apache.spark.mllib.linalg.distributed.RowMatrix@15732b4a" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// RowMatrix\n", | |
"import org.apache.spark.mllib.linalg.distributed.RowMatrix\n", | |
"val housingMat = new RowMatrix(housingVals)\n", | |
"housingMat" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@35dec45b" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"val housingStats = housingMat.computeColumnSummaryStatistics()\n", | |
"housingStats" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[3.6135235573122526,11.363636363636367,11.13677865612648,0.0691699604743083,0.5546950592885376,6.284634387351778,68.57490118577074,3.7950426877470362,9.549407114624508,408.2371541501976,18.45553359683794,356.67403162055336,12.653063241106718,22.532806324110666]" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"housingStats.mean" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"// Statistics\n", | |
"import org.apache.spark.mllib.stat.Statistics\n", | |
"val housingStats = Statistics.colStats(housingVals)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[3.6135235573122526,11.363636363636365,11.13677865612648,0.0691699604743083,0.5546950592885376,6.284634387351778,68.57490118577074,3.7950426877470362,9.549407114624508,408.2371541501976,18.455533596837945,356.67403162055336,12.653063241106718,22.532806324110666]" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"housingStats.mean" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0, 0 : -0.388 \n", | |
"0, 1 : 0.360 \n", | |
"0, 2 : -0.484 \n", | |
"0, 3 : 0.175 \n", | |
"0, 4 : -0.427 \n", | |
"0, 5 : 0.695 \n", | |
"0, 6 : -0.377 \n", | |
"0, 7 : 0.250 \n", | |
"0, 8 : -0.382 \n", | |
"0, 9 : -0.469 \n", | |
"0, 10 : -0.508 \n", | |
"0, 11 : 0.333 \n", | |
"0, 12 : -0.738 \n", | |
"0, 13 : 1.000 \n" | |
] | |
} | |
], | |
"source": [ | |
"// correlatin coefficient(상관 계수 계산)\n", | |
"val housingCorr = Statistics.corr(housingVals)\n", | |
"for(i <- 0 until housingCorr.numRows) printf(\"0, %s : %9.3f \\n\",i,housingCorr(i,housingCorr.numRows-1))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 3. cosine similarity(코사인 유사도)\n", | |
"- 두 벡터간의 방향성을 분석, 두 벡터간의 각도\n", | |
"- 두 벡터간의 유사도와 관련된 분야에 활용 : 상품이나 뉴스 추천, Word2Vec\n", | |
"- RowMatrix.columnSimilarities 이용( > v 2.0 )\n", | |
"- upper-triangular matrix(상삼각 행렬) 형태의 distributed CoordinateMatrix 생성\n", | |
"- Breeze DenseMatrix 변환하여 데이터 확인" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@4b11d1fb" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"val housingColSims = housingMat.columnSimilarities\n", | |
"housingColSims" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0 1 2 3 4 5 6 7 8 9 10 11 12 13 \n", | |
"0 +0.000 +0.004 +0.527 +0.052 +0.459 +0.363 +0.482 +0.169 +0.675 +0.563 +0.416 +0.288 +0.544 +0.224\n", | |
"1 +0.000 +0.000 +0.122 +0.078 +0.334 +0.467 +0.211 +0.673 +0.135 +0.297 +0.394 +0.464 +0.200 +0.528\n", | |
"2 +0.000 +0.000 +0.000 +0.256 +0.915 +0.824 +0.916 +0.565 +0.840 +0.931 +0.869 +0.779 +0.897 +0.693\n", | |
"3 +0.000 +0.000 +0.000 +0.000 +0.275 +0.271 +0.275 +0.184 +0.190 +0.230 +0.248 +0.266 +0.204 +0.307\n", | |
"4 +0.000 +0.000 +0.000 +0.000 +0.000 +0.966 +0.962 +0.780 +0.808 +0.957 +0.977 +0.929 +0.912 +0.873\n", | |
"5 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.909 +0.880 +0.719 +0.906 +0.982 +0.966 +0.832 +0.949\n", | |
"6 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.672 +0.801 +0.929 +0.930 +0.871 +0.918 +0.803\n", | |
"7 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.485 +0.710 +0.856 +0.882 +0.644 +0.856\n", | |
"8 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.917 +0.771 +0.642 +0.806 +0.588\n", | |
"9 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.939 +0.854 +0.907 +0.789\n", | |
"10 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.957 +0.887 +0.897\n", | |
"11 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.799 +0.928\n", | |
"12 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.670\n", | |
"13 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000\n" | |
] | |
} | |
], | |
"source": [ | |
"val housingColSimsBDM = toBreezeD(housingColSims)\n", | |
"printMat(housingColSimsBDM)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.0" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"housingColSimsBDM(0,0)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.0038331826140674792" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"housingColSimsBDM(0,1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0, 0 : 0.224 \n", | |
"0, 1 : 0.528 \n", | |
"0, 2 : 0.693 \n", | |
"0, 3 : 0.307 \n", | |
"0, 4 : 0.873 \n", | |
"0, 5 : 0.949 \n", | |
"0, 6 : 0.803 \n", | |
"0, 7 : 0.856 \n", | |
"0, 8 : 0.588 \n", | |
"0, 9 : 0.789 \n", | |
"0, 10 : 0.897 \n", | |
"0, 11 : 0.928 \n", | |
"0, 12 : 0.670 \n", | |
"0, 13 : 0.000 \n" | |
] | |
} | |
], | |
"source": [ | |
"// 집값과 각 항목간의 유사도\n", | |
"for(i <- 0 until housingColSimsBDM.rows) printf(\"0, %s : %9.3f \\n\",i,housingColSimsBDM(i,housingColSimsBDM.rows-1))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 4. covariance matrix(공분산 행렬)\n", | |
"- 두 벡터간의 유사도 분석\n", | |
"- 선형 연관성을 모델링 하는 통계 기법\n", | |
"- RowMatrix.computeCovariance 이용( > v 2.0 )\n", | |
"- sysmmetric matrix(상삼각 행렬) 형태의 distributed CoordinateMatrix 생성\n", | |
"- Statistics.corr 이용(위 예제 참고)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"class org.apache.spark.mllib.linalg.DenseMatrix" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"val housingCovar = housingMat.computeCovariance()\n", | |
"housingCovar.getClass" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" 0 1 2 3 4 5 6 7 8 9 10 11 12 13 \n", | |
"0 +73.987 -40.216 +23.992 -0.122 +0.420 -1.325 +85.405 -6.877 +46.848 +844.822 +5.399 -302.382 +27.986 -30.719\n", | |
"1 -40.216 +543.937 -85.413 -0.253 -1.396 +5.113 -373.902 +32.629 -63.349 -1236.454 -19.777 +373.721 -68.783 +77.315\n", | |
"2 +23.992 -85.413 +47.064 +0.110 +0.607 -1.888 +124.514 -10.228 +35.550 +833.360 +5.692 -223.580 +29.580 -30.521\n", | |
"3 -0.122 -0.253 +0.110 +0.065 +0.003 +0.016 +0.619 -0.053 -0.016 -1.523 -0.067 +1.131 -0.098 +0.409\n", | |
"4 +0.420 -1.396 +0.607 +0.003 +0.013 -0.025 +2.386 -0.188 +0.617 +13.046 +0.047 -4.021 +0.489 -0.455\n", | |
"5 -1.325 +5.113 -1.888 +0.016 -0.025 +0.494 -4.752 +0.304 -1.284 -34.583 -0.541 +8.215 -3.080 +4.493\n", | |
"6 +85.405 -373.902 +124.514 +0.619 +2.386 -4.752 +792.358 -44.329 +111.771 +2402.690 +15.937 -702.940 +121.078 -97.589\n", | |
"7 -6.877 +32.629 -10.228 -0.053 -0.188 +0.304 -44.329 +4.434 -9.068 -189.665 -1.060 +56.040 -7.473 +4.840\n", | |
"8 +46.848 -63.349 +35.550 -0.016 +0.617 -1.284 +111.771 -9.068 +75.816 +1335.757 +8.761 -353.276 +30.385 -30.561\n", | |
"9 +844.822 -1236.454 +833.360 -1.523 +13.046 -34.583 +2402.690 -189.665 +1335.757 +28404.759 +168.153 -6797.911 +654.715 -726.256\n", | |
"10 +5.399 -19.777 +5.692 -0.067 +0.047 -0.541 +15.937 -1.060 +8.761 +168.153 +4.687 -35.060 +5.783 -10.111\n", | |
"11 -302.382 +373.721 -223.580 +1.131 -4.021 +8.215 -702.940 +56.040 -353.276 -6797.911 -35.060 +8334.752 -238.668 +279.990\n", | |
"12 +27.986 -68.783 +29.580 -0.098 +0.489 -3.080 +121.078 -7.473 +30.385 +654.715 +5.783 -238.668 +50.995 -48.448\n", | |
"13 -30.719 +77.315 -30.521 +0.409 -0.455 +4.493 -97.589 +4.840 -30.561 -726.256 -10.111 +279.990 -48.448 +84.587\n" | |
] | |
} | |
], | |
"source": [ | |
"printMat(toBreezeM(housingCovar))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 5. linear regress을 위한 데이터 준비\n", | |
"- 레이블 포인트로 변환\n", | |
" - LabeledPoint 구조체로 변환 : 목표값, 특징 변수 벡터로 구성, 대부분의 머신러닝 알고리즘에서 사용\n", | |
"- 데이터 분할 : train, test 데이터 셋 분할\n", | |
" - RDD 함수 이용 : randomSplit\n", | |
"- 데이터 표준화 : 스케일링 및 평균 정규화\n", | |
" - 스케일링(feature scaling) : 데이터 범위를 비슷한 크리로 조정\n", | |
" - 평균 정규화(mean normalization) : 평균이 0인 분포로 변환, 정규 분포임을 가장하고 주로 사용\n", | |
" - StandardScaler : 스케일링과 표준 정규화를 함께 처리" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(24.0,[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98])" | |
] | |
}, | |
"execution_count": 24, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// 레이블 포인트\n", | |
"import org.apache.spark.mllib.regression.LabeledPoint\n", | |
"\n", | |
"val housingData = housingVals.map(x => {\n", | |
" val line = x.toArray\n", | |
" LabeledPoint(line.last,Vectors.dense(line.slice(0,line.length-1)))\n", | |
"})\n", | |
"housingData.first" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"417" | |
] | |
}, | |
"execution_count": 25, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// 데이터 분할\n", | |
"val sets = housingData.randomSplit(Array(0.8,0.2))\n", | |
"val housingTrain = sets(0)\n", | |
"val housingTest = sets(1)\n", | |
"housingTrain.count" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0.0063 - 88.9762 : 88.9699\n", | |
"0.0000 - 100.0000 : 100.0000\n", | |
"0.4600 - 27.7400 : 27.2800\n", | |
"0.0000 - 1.0000 : 1.0000\n", | |
"0.3850 - 0.8710 : 0.4860\n", | |
"3.5610 - 8.7800 : 5.2190\n", | |
"2.9000 - 100.0000 : 97.1000\n", | |
"1.1296 - 12.1265 : 10.9969\n", | |
"1.0000 - 24.0000 : 23.0000\n", | |
"187.0000 - 711.0000 : 524.0000\n", | |
"12.6000 - 22.0000 : 9.4000\n", | |
"0.3200 - 396.9000 : 396.5800\n", | |
"1.7300 - 37.9700 : 36.2400\n", | |
"5.0000 - 50.0000 : 45.0000\n" | |
] | |
} | |
], | |
"source": [ | |
"// 데이터 분포 체크\n", | |
"for(mm <- housingStats.min.toArray.zip(housingStats.max.toArray)){\n", | |
" printf(\"%1.4f - %1.4f : %1.4f\\n\",mm._1,mm._2,mm._2-mm._1)\n", | |
"}" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"org.apache.spark.mllib.feature.StandardScalerModel@26c52f77" | |
] | |
}, | |
"execution_count": 27, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// 스케일러 생성(스케일링, 정규화)\n", | |
"import org.apache.spark.mllib.feature.StandardScaler\n", | |
"val scaler = new StandardScaler(true, true).fit(housingTrain.map(_.features))\n", | |
"scaler" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"417" | |
] | |
}, | |
"execution_count": 28, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// 스케일 적용(스케일링, 정규화)\n", | |
"val trainScaled = housingTrain.map(x => LabeledPoint(x.label,scaler.transform(x.features)))\n", | |
"val testScaled = housingTest.map(x => LabeledPoint(x.label,scaler.transform(x.features)))\n", | |
"trainScaled.count" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(24.0,[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98])" | |
] | |
}, | |
"execution_count": 29, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"housingTrain.first" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(24.0,[-0.4693925254631978,0.2687770211872631,-1.2747871495548844,-0.2730623188731474,-0.14385350282962484,0.43199631109148545,-0.10805893821126049,0.14558225189264024,-0.96975264275946,-0.653237741365371,-1.4218179119546304,0.4480888765307575,-1.0721133879367546])" | |
] | |
}, | |
"execution_count": 30, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"trainScaled.first" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 6. 선형 회귀 모델 학습\n", | |
"- org.aprache.spark.mllib.regression 패키지 LinearRegressionModel 클래스로 구현\n", | |
" - 학습이 완료된 선형 회귀 모델의 매개변수 저장\n", | |
" - predict 메서드를 이용하여 값을 예측\n", | |
"- LinearRegressionWithSGD 이용하여 LinsearRegressionModel 생성\n", | |
" - train 메서드 : bias를 제외한 weight 학습, LinearRegressionWithSGD.train(데이터, 반복횟수, 학습율)\n", | |
" - LinearRegressionWithSGD 객체 생성후, bias 학습 세팅, 반복횟수 설정, 데이터 캐싱, run 메스드(훈련)\n", | |
"- 값 예측\n", | |
" - predict" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"13" | |
] | |
}, | |
"execution_count": 31, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import org.apache.spark.mllib.regression.LinearRegressionWithSGD\n", | |
"val model1 = LinearRegressionWithSGD.train(trainScaled,200, 1.0)\n", | |
"model1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"class org.apache.spark.mllib.regression.LinearRegressionModel" | |
] | |
}, | |
"execution_count": 32, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"model1.getClass" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[-0.37785930098621784,0.7281842617508575,-0.11745209635664786,0.8246981641151655,-1.760815522334877,3.1055999305728936,-0.485204611122377,-2.9388833381201045,1.7360597129016102,-1.1979267166377676,-2.0326627329778537,1.003143321052567,-3.287004588197274]" | |
] | |
}, | |
"execution_count": 33, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"model1.weights" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.0" | |
] | |
}, | |
"execution_count": 34, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"model1.intercept" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"13" | |
] | |
}, | |
"execution_count": 35, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"val lrm = new LinearRegressionWithSGD()\n", | |
"lrm.setIntercept(true)\n", | |
"lrm.optimizer.setNumIterations(200)\n", | |
"lrm.optimizer.setStepSize(1.0)\n", | |
"trainScaled.cache()\n", | |
"val model2 = lrm.run(trainScaled)\n", | |
"model2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"class org.apache.spark.mllib.regression.LinearRegressionModel" | |
] | |
}, | |
"execution_count": 36, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"model2.getClass" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[-0.2923374177734875,0.5941692181822921,-0.24760684105702974,0.8463715671785175,-1.4687758707821237,3.2028460786202753,-0.5190199406486725,-2.6760815549376082,1.309178867916586,-0.8367071826015235,-1.9757876070323401,1.0289607840181993,-3.2566385434466443]" | |
] | |
}, | |
"execution_count": 37, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"model2.weights" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"22.45707434052759" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"model2.intercept" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"MapPartitionsRDD[699] at map at <console>:51" | |
] | |
}, | |
"execution_count": 39, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// 테스트 데이터 예측\n", | |
"val testPredicts = testScaled.map(x => (model2.predict(x.features), x.label))\n", | |
"testPredicts" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"89" | |
] | |
}, | |
"execution_count": 40, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"val testPredictsArr = testPredicts.collect()\n", | |
"testPredictsArr.length" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(22.785998999685642,22.9)\n", | |
"(19.1619715954319,18.2)\n", | |
"(19.396878618741173,19.9)\n", | |
"(12.521252819041827,13.6)\n", | |
"(13.23501944140063,13.9)\n" | |
] | |
} | |
], | |
"source": [ | |
"testPredictsArr.slice(0,5).foreach(println)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"5.608326430430329" | |
] | |
}, | |
"execution_count": 42, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// RMSE\n", | |
"val rvals = testPredictsArr.map{case(p,y) => math.pow(p-y,2)}\n", | |
"math.sqrt(rvals.sum / rvals.length)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 7. 모델 평가 및 해석\n", | |
"- RegressionMetrics 클래스를 이용하여 여러가지 평가 가능\n", | |
" - rootMeanSquareError\n", | |
" - meanSquareError\n", | |
" - meanAbsoluteError\n", | |
" - r2 : coefficient of determination(결정계수, R^2), 0~1 값, 설명 변량의 비율, 1에 가까울 수록 좋음\n", | |
" - explainedVariance : r2 와 유사\n", | |
" - 결정계수를 실무에서 자주 사용하지만 상관성이 적은 feature라도 추가를 하면 값이 커지는 경향이 있다.\n", | |
"- 매개 변수 해석\n", | |
" - 모델의 weight를 이용하여 각 변수가 예측 값에 미치는 영향력을 해석\n", | |
" - 값이 클수록 영향력이 크다는 것을 의미" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"5.608326430430328" | |
] | |
}, | |
"execution_count": 43, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"// 평가 : RMSE, MSE, MAE, R^2(coefficient of determination)\n", | |
"import org.apache.spark.mllib.evaluation.RegressionMetrics\n", | |
"val validMetrics = new RegressionMetrics(testPredicts)\n", | |
"validMetrics.rootMeanSquaredError" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"31.45332535026338" | |
] | |
}, | |
"execution_count": 44, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"validMetrics.meanSquaredError" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(0.24760684105702974,2)\n", | |
"(0.2923374177734875,0)\n", | |
"(0.5190199406486725,6)\n", | |
"(0.5941692181822921,1)\n", | |
"(0.8367071826015235,9)\n", | |
"(0.8463715671785175,3)\n", | |
"(1.0289607840181993,11)\n", | |
"(1.309178867916586,8)\n", | |
"(1.4687758707821237,4)\n", | |
"(1.9757876070323401,10)\n", | |
"(2.6760815549376082,7)\n", | |
"(3.2028460786202753,5)\n", | |
"(3.2566385434466443,12)\n" | |
] | |
} | |
], | |
"source": [ | |
"// 해석\n", | |
"model2.weights.toArray.map(_.abs).zipWithIndex.sortBy(_._1).foreach(println)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 매개변수 해석\n", | |
"- 가장 영향력이 큰 컬럼은 12번째 LSTAT(저소득층 인구 비율), 두번째로는 7번째 DIS(보스턴 각 고용센터까지의 거리합)\n", | |
"- 영향력이 적은 컬럼은 6번째, 2번째 컬럼이며 모델에서 제거하여도 별 영향이 없으며 오히려 성늘이 더 향상 될 수도 있다.\n", | |
" - 6번 : AGE, 1940년 이전 지어진 자가 거주 건물 비율\n", | |
" - 2번 : INDUS, 마을당 비-소매 비즈니스 토지의 에이커 비율" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### 모델의 저장 및 불러오기\n", | |
"- 저장 : save\n", | |
"- 불러오기 : load\n", | |
"- Parqut 파일 포맷" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Name: org.apache.hadoop.mapred.FileAlreadyExistsException\n", | |
"Message: Output directory file:/opt/mynotebook/models/linearRegressionModel/metadata already exists\n", | |
"StackTrace: at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)\n", | |
" at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:283)\n", | |
" at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)\n", | |
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n", | |
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n", | |
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)\n", | |
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n", | |
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n", | |
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)\n", | |
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n", | |
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n", | |
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n", | |
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)\n", | |
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)\n", | |
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)\n", | |
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)\n", | |
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n", | |
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n", | |
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n", | |
" at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)\n", | |
" at org.apache.spark.mllib.regression.impl.GLMRegressionModel$SaveLoadV1_0$.save(GLMRegressionModel.scala:56)\n", | |
" at org.apache.spark.mllib.regression.LinearRegressionModel.save(LinearRegression.scala:52)" | |
] | |
}, | |
"execution_count": 46, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"model2.save(sc,\"../models/linearRegressionModel\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"[-0.901459854136914,0.9509812890773397,-0.10314713489439738,0.8805042807920195,-2.0668983757289556,2.4936373935100935,-0.06818383702924491,-3.2553295187452846,1.7217190130008164,-0.777583539625745,-2.126496701079123,0.9829657849466839,-4.016312120372918]" | |
] | |
}, | |
"execution_count": 47, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import org.apache.spark.mllib.regression.LinearRegressionModel\n", | |
"val loadModel = LinearRegressionModel.load(sc,\"../models/linearRegressionModel\")\n", | |
"loadModel.weights" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Apache Toree - Scala", | |
"language": "scala", | |
"name": "apache_toree_scala" | |
}, | |
"language_info": { | |
"file_extension": ".scala", | |
"name": "scala", | |
"version": "2.11.8" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment