learning spark ML chapter 2
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import breeze.linalg.{DenseMatrix => BDM,CSCMatrix => BSM,Matrix => BM}\n",
"import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix, Matrix, Matrices}\n",
"import org.apache.spark.mllib.linalg.distributed.{RowMatrix, CoordinateMatrix, BlockMatrix, DistributedMatrix, MatrixEntry}\n",
"\n",
"def printMat(mat:BM[Double]) = {\n",
" print(\" \")\n",
" for(j <- 0 to mat.cols-1) print(\"%-10d\".format(j));\n",
" println\n",
" for(i <- 0 to mat.rows-1) { print(\"%-6d\".format(i)); for(j <- 0 to mat.cols-1) print(\" %+9.3f\".format(mat(i, j))); println }\n",
"}\n",
"def toBreezeM(m:Matrix):BM[Double] = m match {\n",
" case dm:DenseMatrix => new BDM(dm.numRows, dm.numCols, dm.values)\n",
" case sm:SparseMatrix => new BSM(sm.values, sm.numCols, sm.numRows, sm.colPtrs, sm.rowIndices)\n",
"}\n",
"def toBreezeD(dm:DistributedMatrix):BM[Double] = dm match {\n",
" case rm:RowMatrix => {\n",
" val m = rm.numRows().toInt\n",
" val n = rm.numCols().toInt\n",
" val mat = BDM.zeros[Double](m, n)\n",
" var i = 0\n",
" rm.rows.collect().foreach { vector =>\n",
" for(j <- 0 to vector.size-1)\n",
" {\n",
" mat(i, j) = vector(j)\n",
" }\n",
" i += 1\n",
" }\n",
" mat\n",
" }\n",
" case cm:CoordinateMatrix => {\n",
" val m = cm.numRows().toInt\n",
" val n = cm.numCols().toInt\n",
" val mat = BDM.zeros[Double](m, n)\n",
" cm.entries.collect().foreach { case MatrixEntry(i, j, value) =>\n",
" mat(i.toInt, j.toInt) = value\n",
" }\n",
" mat\n",
" }\n",
" case bm:BlockMatrix => {\n",
" val localMat = bm.toLocalMatrix()\n",
" new BDM[Double](localMat.numRows, localMat.numCols, localMat.toArray)\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 선형 회귀\n",
"- 선형회귀 동작 방식\n",
"- 샘플 데이터셋 적용\n",
"- 데이터 분석 및 준비 과정\n",
"- 모델 성능 평가\n",
"- bias & variance의 상충관계, 교차 검증, 일반화의 개념"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 선형회귀\n",
"- 독립변수 셋을 사용해 목표변수를 예측하고 이들간의 관계를 계량화\n",
"- 독립변수와 목표변수 사이에 선형관계가 있다고 가정\n",
"- 단순 선형 회귀(simple linear regression), 다중 선형 회귀(multiple linear regression)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### simple linear regression\n",
"- 보스톤 주택 데이터셋 : UC 어바인\n",
"- 보스톤 교외에 위치한 자가 거주 주택의 평균 가격과 이 가격을 예측하는 데 사용 할 수 있는 특징 변수 13개로 구성\n",
"- 거주인당 평균 방 개수만을 사용해 선형 회귀 모델 실습\n",
"\n",
"- 모델(가설함수): h(x) = w0 + w1x\n",
"- 방법 : 모델에 적합한 가중치 추정(w0, w1), cost function 최소화하는 가장 적절한 가중치\n",
"- cost function : mean squared error, C(w0,w1) = 1/2m * sum(h(xi) - yi)^2 = 1/2m * sum(w0+w1xi - yi)^2\n",
"- cost function 값이 최저인 지점"
]
},
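{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative values, not the book's code): the MSE cost\n",
"// C(w0, w1) = 1/(2m) * sum_i (w0 + w1*x_i - y_i)^2 from the cell above,\n",
"// evaluated on a tiny made-up dataset.\n",
"def mseCost(w0: Double, w1: Double, xs: Array[Double], ys: Array[Double]): Double = {\n",
"  val m = xs.length\n",
"  val squaredErrors = xs.zip(ys).map { case (x, y) => math.pow(w0 + w1 * x - y, 2) }\n",
"  squaredErrors.sum / (2 * m)\n",
"}\n",
"// illustrative values: rooms per dwelling vs. price\n",
"val xs = Array(5.0, 6.0, 7.0)\n",
"val ys = Array(15.0, 21.0, 27.0)\n",
"mseCost(0.0, 3.0, xs, ys)"
]
},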
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### multiple linear regression\n",
"- 더많은 독립 변수(차원) 활용\n",
"- 주택 데이터셋의 모든 독립변수(13개) 활용\n",
"- 비용 함수를 그래프화 하기 어려움(불가능)\n",
"- 모델(가설함수) : h(x) = w0 + w1x + w2x + ... + wnxn = W_t * X\n",
"- W_t : [w0, w1, w2, ... , wn] (weight vector traspose)\n",
"- X : [1, x1, x2, ... , xn]\n",
"- cost function : C(w) = 1/2m * sum(W_t * X(i) - y(i))^2"
]
},
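{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative values, not the book's code): the hypothesis\n",
"// h(x) = W_t * X as a dot product, with X(0) fixed to 1 so that w0 acts as the bias term.\n",
"// Breeze is already imported in the first cell of this notebook.\n",
"import breeze.linalg.DenseVector\n",
"val w = DenseVector(22.5, 3.1, -0.5)   // [w0, w1, w2]\n",
"val x = DenseVector(1.0, 6.2, 18.0)    // [1, x1, x2]\n",
"w dot x                                // h(x) = W_t * X"
]
},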
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 최저점 찾기\n",
"- 정규 방정식(normal equation)\n",
" - w = (X_t * X)^-1 * X_t * y\n",
" - 많은 계산량 필요\n",
"- 경사 하강법(gradient-descent)\n",
" - cost function의 편미분(partial derivative) 계산\n",
" - weight 수정 및 반복, 허용치(tolerance value) 이용 수렴(converged) 판단"
]
},
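{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch of both approaches above on toy data (illustrative, not the book's code).\n",
"// BDM (Breeze DenseMatrix) is already aliased in the first cell of this notebook.\n",
"import breeze.linalg.{DenseVector => BDV, inv}\n",
"val X = BDM((1.0, 5.0), (1.0, 6.0), (1.0, 7.0))   // first column is the bias term\n",
"val y = BDV(15.0, 21.0, 27.0)\n",
"// normal equation: w = (X_t * X)^-1 * X_t * y\n",
"val wNormal = inv(X.t * X) * (X.t * y)\n",
"// one batch gradient-descent step: w := w - alpha/m * X_t * (X*w - y)\n",
"val alpha = 0.01\n",
"val m = X.rows.toDouble\n",
"var w = BDV.zeros[Double](X.cols)\n",
"w = w - (X.t * (X * w - y)) * (alpha / m)"
]
},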
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 데이터 분석 및 준비\n",
"- 보스톤 주택 데이터셋 : https://goo.gl/MFsmFW"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RDD : resilient distributed dataset\n",
" - immutable : read-only\n",
" - resilent : 스파크 내부의 장애 복구 메커니즘이 RDD의 복원성을 부여\n",
" - distributed\n",
" - 분산 컬렉션의 성질과 장애 내성을 추상화하고 직관적인 방식으로 대규모 데이터셋에 병렬 연산을 수행할 수 있도록 지원\n",
" - 데이터의 분산 처리에 필요한 여러가지 요소들을 추상화하여 엔지니어가 데이터의 계산과 처리에 집중할 수 있도록 설계\n",
" - 데이터의 변환 메커니즘으로 변환방식을 기술하여 저장함으로서 목적을 달성한다.\n",
"- RDD 연산자\n",
" - Transformation, Action 연산자로 나뉨\n",
"- Transformation(변환 연산자)\n",
" - lazy evaluation\n",
" - map, filter, distinct, flatMap\n",
"- Action(행동 연산자)\n",
" - 실제 transformation이 실행\n",
" - first, top, count, collect, foreach, take\n",
" - sample, takeSample : 복원샘플과 비복원 샘플 지원"
]
},
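{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative, not from the book): transformations only describe\n",
"// the computation and are evaluated lazily; an action such as count or collect\n",
"// triggers the actual execution.\n",
"val nums = sc.parallelize(1 to 10, 2)                  // RDD with 2 partitions\n",
"val evensDoubled = nums.filter(_ % 2 == 0).map(_ * 2)  // lazy transformations\n",
"evensDoubled.count()                                   // action: runs the job\n",
"evensDoubled.collect()                                 // action: Array(4, 8, 12, 16, 20)"
]
},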
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. 데이터 준비\n",
"1. 파일 데이터 읽기\n",
"2. 벡터화"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import org.apache.spark.mllib.linalg.Vectors"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"../datas/housing.data MapPartitionsRDD[1] at textFile at <console>:24"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 1. 보스톤 주택 데이터 RDD 생성, 파티션 수 6개\n",
"val housingLines = sc.textFile(\"../datas/housing.data\",6)\n",
"housingLines"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"88.97620, 0.00, 18.100, 0, 0.6710, 6.9680, 91.90, 1.4165, 24, 666.0, 20.20, 396.90, 17.21, 10.40\n",
"73.53410, 0.00, 18.100, 0, 0.6790, 5.9570, 100.00, 1.8026, 24, 666.0, 20.20, 16.45, 20.62, 8.80\n",
"67.92080, 0.00, 18.100, 0, 0.6930, 5.6830, 100.00, 1.4254, 24, 666.0, 20.20, 384.97, 22.98, 5.00\n"
]
}
],
"source": [
"// - 샘플 출력\n",
"housingLines.top(3).foreach(println)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"506"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// - 전체 데이터 수\n",
"housingLines.count"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" 0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30, 396.90, 4.98, 24.00\""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// - 샘플 데이터 Dense 벡터 생성\n",
"val housing1 = housingLines.first\n",
"housing1"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Vectors.dense(housing1.split(\",\").map(_.trim().toDouble))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Vectors.dense(for(d <- housing1.split(\",\")) yield d.trim().toDouble)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MapPartitionsRDD[3] at map at <console>:26"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 2. Dense Vector RDD로 변환\n",
"val housingVals = housingLines.map(x => Vectors.dense(x.split(\",\").map(_.trim().toDouble)))\n",
"housingVals"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0]\n",
"[0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6]\n"
]
}
],
"source": [
"// - 샘플 출력\n",
"housingVals.take(2).foreach(println)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. 분포 분석\n",
"1. RowMatrix 변환 하여 couputeColumnSummaryStatistics 이용\n",
"2. Statistics.colStats 이용\n",
"3. 생성된 Statistics 객체를 이용하여 분석\n",
" - min, max, mean, variance, normL1, normL2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"org.apache.spark.mllib.linalg.distributed.RowMatrix@15732b4a"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// RowMatrix\n",
"import org.apache.spark.mllib.linalg.distributed.RowMatrix\n",
"val housingMat = new RowMatrix(housingVals)\n",
"housingMat"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@35dec45b"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val housingStats = housingMat.computeColumnSummaryStatistics()\n",
"housingStats"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3.6135235573122526,11.363636363636367,11.13677865612648,0.0691699604743083,0.5546950592885376,6.284634387351778,68.57490118577074,3.7950426877470362,9.549407114624508,408.2371541501976,18.45553359683794,356.67403162055336,12.653063241106718,22.532806324110666]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingStats.mean"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"// Statistics\n",
"import org.apache.spark.mllib.stat.Statistics\n",
"val housingStats = Statistics.colStats(housingVals)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3.6135235573122526,11.363636363636365,11.13677865612648,0.0691699604743083,0.5546950592885376,6.284634387351778,68.57490118577074,3.7950426877470362,9.549407114624508,408.2371541501976,18.455533596837945,356.67403162055336,12.653063241106718,22.532806324110666]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingStats.mean"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0, 0 : -0.388 \n",
"0, 1 : 0.360 \n",
"0, 2 : -0.484 \n",
"0, 3 : 0.175 \n",
"0, 4 : -0.427 \n",
"0, 5 : 0.695 \n",
"0, 6 : -0.377 \n",
"0, 7 : 0.250 \n",
"0, 8 : -0.382 \n",
"0, 9 : -0.469 \n",
"0, 10 : -0.508 \n",
"0, 11 : 0.333 \n",
"0, 12 : -0.738 \n",
"0, 13 : 1.000 \n"
]
}
],
"source": [
"// correlatin coefficient(상관 계수 계산)\n",
"val housingCorr = Statistics.corr(housingVals)\n",
"for(i <- 0 until housingCorr.numRows) printf(\"0, %s : %9.3f \\n\",i,housingCorr(i,housingCorr.numRows-1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. cosine similarity(코사인 유사도)\n",
"- 두 벡터간의 방향성을 분석, 두 벡터간의 각도\n",
"- 두 벡터간의 유사도와 관련된 분야에 활용 : 상품이나 뉴스 추천, Word2Vec\n",
"- RowMatrix.columnSimilarities 이용( > v 2.0 )\n",
"- upper-triangular matrix(상삼각 행렬) 형태의 distributed CoordinateMatrix 생성\n",
"- Breeze DenseMatrix 변환하여 데이터 확인"
]
},
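{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative, not from the book): cosine similarity computed\n",
"// directly from its definition, dot(a, b) / (||a|| * ||b||), which is what\n",
"// columnSimilarities estimates for every pair of columns.\n",
"def cosineSim(a: Array[Double], b: Array[Double]): Double = {\n",
"  val dot = a.zip(b).map { case (x, y) => x * y }.sum\n",
"  val normA = math.sqrt(a.map(v => v * v).sum)\n",
"  val normB = math.sqrt(b.map(v => v * v).sum)\n",
"  dot / (normA * normB)\n",
"}\n",
"cosineSim(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0))   // 1.0: same direction"
]
},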
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@4b11d1fb"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val housingColSims = housingMat.columnSimilarities\n",
"housingColSims"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1 2 3 4 5 6 7 8 9 10 11 12 13 \n",
"0 +0.000 +0.004 +0.527 +0.052 +0.459 +0.363 +0.482 +0.169 +0.675 +0.563 +0.416 +0.288 +0.544 +0.224\n",
"1 +0.000 +0.000 +0.122 +0.078 +0.334 +0.467 +0.211 +0.673 +0.135 +0.297 +0.394 +0.464 +0.200 +0.528\n",
"2 +0.000 +0.000 +0.000 +0.256 +0.915 +0.824 +0.916 +0.565 +0.840 +0.931 +0.869 +0.779 +0.897 +0.693\n",
"3 +0.000 +0.000 +0.000 +0.000 +0.275 +0.271 +0.275 +0.184 +0.190 +0.230 +0.248 +0.266 +0.204 +0.307\n",
"4 +0.000 +0.000 +0.000 +0.000 +0.000 +0.966 +0.962 +0.780 +0.808 +0.957 +0.977 +0.929 +0.912 +0.873\n",
"5 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.909 +0.880 +0.719 +0.906 +0.982 +0.966 +0.832 +0.949\n",
"6 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.672 +0.801 +0.929 +0.930 +0.871 +0.918 +0.803\n",
"7 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.485 +0.710 +0.856 +0.882 +0.644 +0.856\n",
"8 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.917 +0.771 +0.642 +0.806 +0.588\n",
"9 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.939 +0.854 +0.907 +0.789\n",
"10 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.957 +0.887 +0.897\n",
"11 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.799 +0.928\n",
"12 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.670\n",
"13 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000 +0.000\n"
]
}
],
"source": [
"val housingColSimsBDM = toBreezeD(housingColSims)\n",
"printMat(housingColSimsBDM)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingColSimsBDM(0,0)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0038331826140674792"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingColSimsBDM(0,1)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0, 0 : 0.224 \n",
"0, 1 : 0.528 \n",
"0, 2 : 0.693 \n",
"0, 3 : 0.307 \n",
"0, 4 : 0.873 \n",
"0, 5 : 0.949 \n",
"0, 6 : 0.803 \n",
"0, 7 : 0.856 \n",
"0, 8 : 0.588 \n",
"0, 9 : 0.789 \n",
"0, 10 : 0.897 \n",
"0, 11 : 0.928 \n",
"0, 12 : 0.670 \n",
"0, 13 : 0.000 \n"
]
}
],
"source": [
"// 집값과 각 항목간의 유사도\n",
"for(i <- 0 until housingColSimsBDM.rows) printf(\"0, %s : %9.3f \\n\",i,housingColSimsBDM(i,housingColSimsBDM.rows-1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. covariance matrix(공분산 행렬)\n",
"- 두 벡터간의 유사도 분석\n",
"- 선형 연관성을 모델링 하는 통계 기법\n",
"- RowMatrix.computeCovariance 이용( > v 2.0 )\n",
"- sysmmetric matrix(상삼각 행렬) 형태의 distributed CoordinateMatrix 생성\n",
"- Statistics.corr 이용(위 예제 참고)"
]
},
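{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch (illustrative, not from the book): sample covariance between two\n",
"// columns from its definition, cov(x, y) = sum_i (x_i - mean_x)*(y_i - mean_y) / (n-1);\n",
"// computeCovariance fills a matrix with this value for every pair of columns.\n",
"def sampleCov(x: Array[Double], y: Array[Double]): Double = {\n",
"  val n = x.length\n",
"  val mx = x.sum / n\n",
"  val my = y.sum / n\n",
"  x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum / (n - 1)\n",
"}\n",
"sampleCov(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 7.0))   // positive: they move together"
]
},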
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"class org.apache.spark.mllib.linalg.DenseMatrix"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val housingCovar = housingMat.computeCovariance()\n",
"housingCovar.getClass"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1 2 3 4 5 6 7 8 9 10 11 12 13 \n",
"0 +73.987 -40.216 +23.992 -0.122 +0.420 -1.325 +85.405 -6.877 +46.848 +844.822 +5.399 -302.382 +27.986 -30.719\n",
"1 -40.216 +543.937 -85.413 -0.253 -1.396 +5.113 -373.902 +32.629 -63.349 -1236.454 -19.777 +373.721 -68.783 +77.315\n",
"2 +23.992 -85.413 +47.064 +0.110 +0.607 -1.888 +124.514 -10.228 +35.550 +833.360 +5.692 -223.580 +29.580 -30.521\n",
"3 -0.122 -0.253 +0.110 +0.065 +0.003 +0.016 +0.619 -0.053 -0.016 -1.523 -0.067 +1.131 -0.098 +0.409\n",
"4 +0.420 -1.396 +0.607 +0.003 +0.013 -0.025 +2.386 -0.188 +0.617 +13.046 +0.047 -4.021 +0.489 -0.455\n",
"5 -1.325 +5.113 -1.888 +0.016 -0.025 +0.494 -4.752 +0.304 -1.284 -34.583 -0.541 +8.215 -3.080 +4.493\n",
"6 +85.405 -373.902 +124.514 +0.619 +2.386 -4.752 +792.358 -44.329 +111.771 +2402.690 +15.937 -702.940 +121.078 -97.589\n",
"7 -6.877 +32.629 -10.228 -0.053 -0.188 +0.304 -44.329 +4.434 -9.068 -189.665 -1.060 +56.040 -7.473 +4.840\n",
"8 +46.848 -63.349 +35.550 -0.016 +0.617 -1.284 +111.771 -9.068 +75.816 +1335.757 +8.761 -353.276 +30.385 -30.561\n",
"9 +844.822 -1236.454 +833.360 -1.523 +13.046 -34.583 +2402.690 -189.665 +1335.757 +28404.759 +168.153 -6797.911 +654.715 -726.256\n",
"10 +5.399 -19.777 +5.692 -0.067 +0.047 -0.541 +15.937 -1.060 +8.761 +168.153 +4.687 -35.060 +5.783 -10.111\n",
"11 -302.382 +373.721 -223.580 +1.131 -4.021 +8.215 -702.940 +56.040 -353.276 -6797.911 -35.060 +8334.752 -238.668 +279.990\n",
"12 +27.986 -68.783 +29.580 -0.098 +0.489 -3.080 +121.078 -7.473 +30.385 +654.715 +5.783 -238.668 +50.995 -48.448\n",
"13 -30.719 +77.315 -30.521 +0.409 -0.455 +4.493 -97.589 +4.840 -30.561 -726.256 -10.111 +279.990 -48.448 +84.587\n"
]
}
],
"source": [
"printMat(toBreezeM(housingCovar))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. linear regress을 위한 데이터 준비\n",
"- 레이블 포인트로 변환\n",
" - LabeledPoint 구조체로 변환 : 목표값, 특징 변수 벡터로 구성, 대부분의 머신러닝 알고리즘에서 사용\n",
"- 데이터 분할 : train, test 데이터 셋 분할\n",
" - RDD 함수 이용 : randomSplit\n",
"- 데이터 표준화 : 스케일링 및 평균 정규화\n",
" - 스케일링(feature scaling) : 데이터 범위를 비슷한 크리로 조정\n",
" - 평균 정규화(mean normalization) : 평균이 0인 분포로 변환, 정규 분포임을 가장하고 주로 사용\n",
" - StandardScaler : 스케일링과 표준 정규화를 함께 처리"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(24.0,[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 레이블 포인트\n",
"import org.apache.spark.mllib.regression.LabeledPoint\n",
"\n",
"val housingData = housingVals.map(x => {\n",
" val line = x.toArray\n",
" LabeledPoint(line.last,Vectors.dense(line.slice(0,line.length-1)))\n",
"})\n",
"housingData.first"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"417"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 데이터 분할\n",
"val sets = housingData.randomSplit(Array(0.8,0.2))\n",
"val housingTrain = sets(0)\n",
"val housingTest = sets(1)\n",
"housingTrain.count"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0063 - 88.9762 : 88.9699\n",
"0.0000 - 100.0000 : 100.0000\n",
"0.4600 - 27.7400 : 27.2800\n",
"0.0000 - 1.0000 : 1.0000\n",
"0.3850 - 0.8710 : 0.4860\n",
"3.5610 - 8.7800 : 5.2190\n",
"2.9000 - 100.0000 : 97.1000\n",
"1.1296 - 12.1265 : 10.9969\n",
"1.0000 - 24.0000 : 23.0000\n",
"187.0000 - 711.0000 : 524.0000\n",
"12.6000 - 22.0000 : 9.4000\n",
"0.3200 - 396.9000 : 396.5800\n",
"1.7300 - 37.9700 : 36.2400\n",
"5.0000 - 50.0000 : 45.0000\n"
]
}
],
"source": [
"// 데이터 분포 체크\n",
"for(mm <- housingStats.min.toArray.zip(housingStats.max.toArray)){\n",
" printf(\"%1.4f - %1.4f : %1.4f\\n\",mm._1,mm._2,mm._2-mm._1)\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"org.apache.spark.mllib.feature.StandardScalerModel@26c52f77"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 스케일러 생성(스케일링, 정규화)\n",
"import org.apache.spark.mllib.feature.StandardScaler\n",
"val scaler = new StandardScaler(true, true).fit(housingTrain.map(_.features))\n",
"scaler"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"417"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 스케일 적용(스케일링, 정규화)\n",
"val trainScaled = housingTrain.map(x => LabeledPoint(x.label,scaler.transform(x.features)))\n",
"val testScaled = housingTest.map(x => LabeledPoint(x.label,scaler.transform(x.features)))\n",
"trainScaled.count"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(24.0,[0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"housingTrain.first"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(24.0,[-0.4693925254631978,0.2687770211872631,-1.2747871495548844,-0.2730623188731474,-0.14385350282962484,0.43199631109148545,-0.10805893821126049,0.14558225189264024,-0.96975264275946,-0.653237741365371,-1.4218179119546304,0.4480888765307575,-1.0721133879367546])"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainScaled.first"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. 선형 회귀 모델 학습\n",
"- org.aprache.spark.mllib.regression 패키지 LinearRegressionModel 클래스로 구현\n",
" - 학습이 완료된 선형 회귀 모델의 매개변수 저장\n",
" - predict 메서드를 이용하여 값을 예측\n",
"- LinearRegressionWithSGD 이용하여 LinsearRegressionModel 생성\n",
" - train 메서드 : bias를 제외한 weight 학습, LinearRegressionWithSGD.train(데이터, 반복횟수, 학습율)\n",
" - LinearRegressionWithSGD 객체 생성후, bias 학습 세팅, 반복횟수 설정, 데이터 캐싱, run 메스드(훈련)\n",
"- 값 예측\n",
" - predict"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import org.apache.spark.mllib.regression.LinearRegressionWithSGD\n",
"val model1 = LinearRegressionWithSGD.train(trainScaled,200, 1.0)\n",
"model1"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"class org.apache.spark.mllib.regression.LinearRegressionModel"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model1.getClass"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-0.37785930098621784,0.7281842617508575,-0.11745209635664786,0.8246981641151655,-1.760815522334877,3.1055999305728936,-0.485204611122377,-2.9388833381201045,1.7360597129016102,-1.1979267166377676,-2.0326627329778537,1.003143321052567,-3.287004588197274]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model1.weights"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model1.intercept"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val lrm = new LinearRegressionWithSGD()\n",
"lrm.setIntercept(true)\n",
"lrm.optimizer.setNumIterations(200)\n",
"lrm.optimizer.setStepSize(1.0)\n",
"trainScaled.cache()\n",
"val model2 = lrm.run(trainScaled)\n",
"model2"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"class org.apache.spark.mllib.regression.LinearRegressionModel"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2.getClass"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-0.2923374177734875,0.5941692181822921,-0.24760684105702974,0.8463715671785175,-1.4687758707821237,3.2028460786202753,-0.5190199406486725,-2.6760815549376082,1.309178867916586,-0.8367071826015235,-1.9757876070323401,1.0289607840181993,-3.2566385434466443]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2.weights"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"22.45707434052759"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2.intercept"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MapPartitionsRDD[699] at map at <console>:51"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 테스트 데이터 예측\n",
"val testPredicts = testScaled.map(x => (model2.predict(x.features), x.label))\n",
"testPredicts"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"89"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val testPredictsArr = testPredicts.collect()\n",
"testPredictsArr.length"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(22.785998999685642,22.9)\n",
"(19.1619715954319,18.2)\n",
"(19.396878618741173,19.9)\n",
"(12.521252819041827,13.6)\n",
"(13.23501944140063,13.9)\n"
]
}
],
"source": [
"testPredictsArr.slice(0,5).foreach(println)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.608326430430329"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// RMSE\n",
"val rvals = testPredictsArr.map{case(p,y) => math.pow(p-y,2)}\n",
"math.sqrt(rvals.sum / rvals.length)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7. 모델 평가 및 해석\n",
"- RegressionMetrics 클래스를 이용하여 여러가지 평가 가능\n",
" - rootMeanSquareError\n",
" - meanSquareError\n",
" - meanAbsoluteError\n",
" - r2 : coefficient of determination(결정계수, R^2), 0~1 값, 설명 변량의 비율, 1에 가까울 수록 좋음\n",
" - explainedVariance : r2 와 유사\n",
" - 결정계수를 실무에서 자주 사용하지만 상관성이 적은 feature라도 추가를 하면 값이 커지는 경향이 있다.\n",
"- 매개 변수 해석\n",
" - 모델의 weight를 이용하여 각 변수가 예측 값에 미치는 영향력을 해석\n",
" - 값이 클수록 영향력이 크다는 것을 의미"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.608326430430328"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"// 평가 : RMSE, MSE, MAE, R^2(coefficient of determination)\n",
"import org.apache.spark.mllib.evaluation.RegressionMetrics\n",
"val validMetrics = new RegressionMetrics(testPredicts)\n",
"validMetrics.rootMeanSquaredError"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"31.45332535026338"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"validMetrics.meanSquaredError"
]
},
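{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Minimal sketch: the remaining metrics listed above, read from the same\n",
"// validMetrics object created two cells earlier (not executed in the original notebook).\n",
"val mae = validMetrics.meanAbsoluteError\n",
"val r2 = validMetrics.r2\n",
"val expVar = validMetrics.explainedVariance\n",
"(mae, r2, expVar)"
]
},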
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(0.24760684105702974,2)\n",
"(0.2923374177734875,0)\n",
"(0.5190199406486725,6)\n",
"(0.5941692181822921,1)\n",
"(0.8367071826015235,9)\n",
"(0.8463715671785175,3)\n",
"(1.0289607840181993,11)\n",
"(1.309178867916586,8)\n",
"(1.4687758707821237,4)\n",
"(1.9757876070323401,10)\n",
"(2.6760815549376082,7)\n",
"(3.2028460786202753,5)\n",
"(3.2566385434466443,12)\n"
]
}
],
"source": [
"// 해석\n",
"model2.weights.toArray.map(_.abs).zipWithIndex.sortBy(_._1).foreach(println)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 매개변수 해석\n",
"- 가장 영향력이 큰 컬럼은 12번째 LSTAT(저소득층 인구 비율), 두번째로는 7번째 DIS(보스턴 각 고용센터까지의 거리합)\n",
"- 영향력이 적은 컬럼은 6번째, 2번째 컬럼이며 모델에서 제거하여도 별 영향이 없으며 오히려 성늘이 더 향상 될 수도 있다.\n",
" - 6번 : AGE, 1940년 이전 지어진 자가 거주 건물 비율\n",
" - 2번 : INDUS, 마을당 비-소매 비즈니스 토지의 에이커 비율"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 모델의 저장 및 불러오기\n",
"- 저장 : save\n",
"- 불러오기 : load\n",
"- Parqut 파일 포맷"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Name: org.apache.hadoop.mapred.FileAlreadyExistsException\n",
"Message: Output directory file:/opt/mynotebook/models/linearRegressionModel/metadata already exists\n",
"StackTrace: at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)\n",
" at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:283)\n",
" at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n",
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n",
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n",
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n",
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)\n",
" at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n",
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n",
" at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)\n",
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)\n",
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)\n",
" at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n",
" at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n",
" at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)\n",
" at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)\n",
" at org.apache.spark.mllib.regression.impl.GLMRegressionModel$SaveLoadV1_0$.save(GLMRegressionModel.scala:56)\n",
" at org.apache.spark.mllib.regression.LinearRegressionModel.save(LinearRegression.scala:52)"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2.save(sc,\"../models/linearRegressionModel\")"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[-0.901459854136914,0.9509812890773397,-0.10314713489439738,0.8805042807920195,-2.0668983757289556,2.4936373935100935,-0.06818383702924491,-3.2553295187452846,1.7217190130008164,-0.777583539625745,-2.126496701079123,0.9829657849466839,-4.016312120372918]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import org.apache.spark.mllib.regression.LinearRegressionModel\n",
"val loadModel = LinearRegressionModel.load(sc,\"../models/linearRegressionModel\")\n",
"loadModel.weights"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Apache Toree - Scala",
"language": "scala",
"name": "apache_toree_scala"
},
"language_info": {
"file_extension": ".scala",
"name": "scala",
"version": "2.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}