{
"nbformat_minor": 1,
"cells": [
{
"source": "This is the third assignment for the Coursera course \"Advanced Machine Learning and Signal Processing\"\n\nJust execute all cells one after the other and you are done - just note that in the last one you must update your email address (the one you've used for coursera) and obtain a submission token, you get this from the programming assignment directly on coursera.\n\nPlease fill in the sections labelled with \"###YOUR_CODE_GOES_HERE###\"\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 1,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Waiting for a Spark session to start...\nSpark Initialization Done! ApplicationId = app-20190827013535-0000\nKERNEL_ID = 24705b45-228e-4d74-92ac-b13741cfec76\n--2019-08-27 01:35:38-- https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet\nResolving github.com (github.com)... 140.82.114.3\nConnecting to github.com (github.com)|140.82.114.3|:443... connected.\nHTTP request sent, awaiting response... 302 Found\nLocation: https://raw.githubusercontent.com/IBM/coursera/master/coursera_ml/a2.parquet [following]\n--2019-08-27 01:35:38-- https://raw.githubusercontent.com/IBM/coursera/master/coursera_ml/a2.parquet\nResolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.8.133\nConnecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.8.133|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 59032 (58K) [application/octet-stream]\nSaving to: 'a2.parquet'\n\na2.parquet 100%[===================>] 57.65K --.-KB/s in 0.004s \n\n2019-08-27 01:35:39 (15.1 MB/s) - 'a2.parquet' saved [59032/59032]\n\n"
}
],
"source": "!wget https://github.com/IBM/coursera/raw/master/coursera_ml/a2.parquet"
},
{
"source": "Now it\u2019s time to have a look at the recorded sensor data. You should see data similar to the one exemplified below\u2026.\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 2,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "+-----+-----------+-------------------+-------------------+-------------------+\n|CLASS| SENSORID| X| Y| Z|\n+-----+-----------+-------------------+-------------------+-------------------+\n| 0| 26| 380.66434005495194| -139.3470983812975|-247.93697521077704|\n| 0| 29| 104.74324299209692| -32.27421440203938|-25.105013725863852|\n| 0| 8589934658| 118.11469236129976| 45.916682927433534| -87.97203782706572|\n| 0|34359738398| 246.55394030642543|-0.6122810693132044|-398.18662513951506|\n| 0|17179869241|-190.32584900181487| 234.7849657520335|-206.34483804019288|\n| 0|25769803830| 178.62396382387422| -47.07529438881511| 84.38310769821979|\n| 0|25769803831| 85.03128805189493|-4.3024316644854546|-1.1841857567516714|\n| 0|34359738411| 26.786262674736566| -46.33193951911338| 20.880756008396055|\n| 0| 8589934592|-16.203752396859194| 51.080957032176954| -96.80526656416971|\n| 0|25769803852| 47.2048142440404| -78.2950899652916| 181.99604091494786|\n| 0|34359738369| 15.608872398939273| -79.90322809181754| 69.62150711098005|\n| 0| 19|-4.8281721129789315| -67.38050508399905| 221.24876396496404|\n| 0| 54| -98.40725712852762|-19.989364074314732| -302.695196085276|\n| 0|17179869313| 22.835845394816594| 17.1633660118843| 32.877914832011385|\n| 0|34359738454| 84.20178070080324| -32.81572075916947| -48.63517643958031|\n| 0| 0| 56.54732521345129| -7.980106018032676| 95.05162719436447|\n| 0|17179869201| -57.6008655247749| 5.135393798773895| 236.99158698947267|\n| 0|17179869308| -65.59264738389012| -48.92660057215126| -61.58970715383383|\n| 0|25769803790| 34.82337351291005| 9.483542084393937| 197.6066372962772|\n| 0|25769803825| 39.80573823439121|-0.7955236412785212| -79.66652640650325|\n+-----+-----------+-------------------+-------------------+-------------------+\nonly showing top 20 rows\n\n"
}
],
"source": "df=spark.read.load('a2.parquet')\n\ndf.createOrReplaceTempView(\"df\")\nspark.sql(\"SELECT * from df\").show()\n"
},
{
"source": "Let\u2019s check if we have balanced classes \u2013 this means that we have roughly the same number of examples for each class we want to predict. This is important for classification but also helpful for clustering",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 3,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "+------------+-----+\n|count(class)|class|\n+------------+-----+\n| 1416| 1|\n| 1626| 0|\n+------------+-----+\n\n"
}
],
"source": "spark.sql(\"SELECT count(class), class from df group by class\").show()"
},
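{
"source": "As a quick cross-check (not part of the original assignment), the same class counts can be obtained with the DataFrame API instead of SQL \u2013 a minimal sketch, assuming the df DataFrame loaded above:\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "# Equivalent of the SQL count above, expressed with the DataFrame API (sketch)\ndf.groupBy('CLASS').count().show()"
},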
{
"source": "Let's create a VectorAssembler which consumes columns X, Y and Z and produces a column \u201cfeatures\u201d\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 4,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from pyspark.ml.feature import VectorAssembler\nvectorAssembler = VectorAssembler(inputCols=[\"X\",\"Y\",\"Z\"],\n outputCol=\"features\")"
},
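{
"source": "To preview what the VectorAssembler produces before wiring it into a pipeline, you can apply it directly \u2013 a minimal sketch, assuming the df and vectorAssembler objects defined above:\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "# Apply the assembler on its own to inspect the generated 'features' vector column (sketch)\nvectorAssembler.transform(df).select('features').show(5, truncate=False)"
},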
{
"source": "Please insatiate a clustering algorithm from the SparkML package and assign it to the clust variable. Here we don\u2019t need to take care of the \u201cCLASS\u201d column since we are in unsupervised learning mode \u2013 so let\u2019s pretend to not even have the \u201cCLASS\u201d column for now \u2013 but it will become very handy later in assessing the clustering performance. PLEASE NOTE \u2013 IN REAL-WORLD SCENARIOS THERE IS NO CLASS COLUMN \u2013 THEREFORE YOU CAN\u2019T ASSESS CLASSIFICATION PERFORMANCE USING THIS COLUMN \n\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 5,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from pyspark.ml.clustering import KMeans\n\nclust = KMeans().setK(13).setSeed(1)"
},
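{
"source": "Besides k and the seed, the SparkML KMeans estimator exposes further Params you may want to experiment with. The variable clust_tuned and the values below are illustrative only, not part of the original solution \u2013 a minimal sketch:\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "# Illustrative alternative configuration (hypothetical variable, values not from the original run)\nclust_tuned = KMeans().setK(13).setSeed(1).setMaxIter(40).setInitMode('k-means||')"
},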
{
"source": "Let\u2019s train...\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 6,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from pyspark.ml import Pipeline\npipeline = Pipeline(stages=[vectorAssembler, clust])\nmodel = pipeline.fit(df)"
},
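{
"source": "If you want to inspect the fitted model, the trained KMeansModel is the second stage of the PipelineModel, and clusterCenters() returns one center per cluster \u2013 a minimal sketch, assuming the model fitted above:\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "# The fitted KMeansModel is the last pipeline stage; print its cluster centers (sketch)\nkmeans_model = model.stages[1]\nfor center in kmeans_model.clusterCenters():\n    print(center)"
},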
{
"source": "...and evaluate...",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 7,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "+-----+-----------+-------------------+-------------------+-------------------+--------------------+----------+\n|CLASS| SENSORID| X| Y| Z| features|prediction|\n+-----+-----------+-------------------+-------------------+-------------------+--------------------+----------+\n| 0| 26| 380.66434005495194| -139.3470983812975|-247.93697521077704|[380.664340054951...| 12|\n| 0| 29| 104.74324299209692| -32.27421440203938|-25.105013725863852|[104.743242992096...| 11|\n| 0| 8589934658| 118.11469236129976| 45.916682927433534| -87.97203782706572|[118.114692361299...| 11|\n| 0|34359738398| 246.55394030642543|-0.6122810693132044|-398.18662513951506|[246.553940306425...| 12|\n| 0|17179869241|-190.32584900181487| 234.7849657520335|-206.34483804019288|[-190.32584900181...| 5|\n| 0|25769803830| 178.62396382387422| -47.07529438881511| 84.38310769821979|[178.623963823874...| 2|\n| 0|25769803831| 85.03128805189493|-4.3024316644854546|-1.1841857567516714|[85.0312880518949...| 11|\n| 0|34359738411| 26.786262674736566| -46.33193951911338| 20.880756008396055|[26.7862626747365...| 0|\n| 0| 8589934592|-16.203752396859194| 51.080957032176954| -96.80526656416971|[-16.203752396859...| 1|\n| 0|25769803852| 47.2048142440404| -78.2950899652916| 181.99604091494786|[47.2048142440404...| 7|\n| 0|34359738369| 15.608872398939273| -79.90322809181754| 69.62150711098005|[15.6088723989392...| 2|\n| 0| 19|-4.8281721129789315| -67.38050508399905| 221.24876396496404|[-4.8281721129789...| 7|\n| 0| 54| -98.40725712852762|-19.989364074314732| -302.695196085276|[-98.407257128527...| 8|\n| 0|17179869313| 22.835845394816594| 17.1633660118843| 32.877914832011385|[22.8358453948165...| 0|\n| 0|34359738454| 84.20178070080324| -32.81572075916947| -48.63517643958031|[84.2017807008032...| 11|\n| 0| 0| 56.54732521345129| -7.980106018032676| 95.05162719436447|[56.5473252134512...| 2|\n| 0|17179869201| -57.6008655247749| 5.135393798773895| 236.99158698947267|[-57.600865524774...| 10|\n| 0|17179869308| -65.59264738389012| -48.92660057215126| -61.58970715383383|[-65.592647383890...| 0|\n| 0|25769803790| 34.82337351291005| 9.483542084393937| 197.6066372962772|[34.8233735129100...| 7|\n| 0|25769803825| 39.80573823439121|-0.7955236412785212| -79.66652640650325|[39.8057382343912...| 11|\n+-----+-----------+-------------------+-------------------+-------------------+--------------------+----------+\nonly showing top 20 rows\n\n"
}
],
"source": "prediction = model.transform(df)\nprediction.show()"
},
{
"execution_count": 8,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 8,
"metadata": {},
"data": {
"text/plain": "0.9329388560157791"
},
"output_type": "execute_result"
}
],
"source": "prediction.createOrReplaceTempView('prediction')\nspark.sql('''\nselect max(correct)/max(total) as accuracy from (\n\n select sum(correct) as correct, count(correct) as total from (\n select case when class != prediction then 1 else 0 end as correct from prediction \n ) \n \n union\n \n select sum(correct) as correct, count(correct) as total from (\n select case when class = prediction then 1 else 0 end as correct from prediction \n ) \n)\n''').rdd.map(lambda row: row.accuracy).collect()[0]"
},
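{
"source": "The SQL above computes max(matches, mismatches) / total, i.e. the larger of the two agreement counts divided by the row count. The same number can be reproduced with plain DataFrame operations \u2013 a minimal sketch, assuming the prediction DataFrame from above:\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from pyspark.sql import functions as F\n\n# Reproduce the SQL accuracy: the larger of (matches, mismatches) over the total row count (sketch)\ntotal = prediction.count()\nmatches = prediction.filter(F.col('CLASS') == F.col('prediction')).count()\nprint(max(matches, total - matches) / total)"
},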
{
"source": "If you reached at least 55% of accuracy you are fine to submit your predictions to the grader. Otherwise please experiment with parameters setting to your clustering algorithm, use a different algorithm or just re-record your data and try to obtain. In case you are stuck, please use the Coursera Discussion Forum. Please note again \u2013 in a real-world scenario there is no way in doing this \u2013 since there is no class label in your data. Please have a look at this further reading on clustering performance evaluation https://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_and_assessment\n",
"cell_type": "markdown",
"metadata": {}
},
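{
"source": "For a label-free evaluation of the kind the linked article discusses, Spark 2.3+ ships a ClusteringEvaluator that computes the Silhouette score from the features and prediction columns alone \u2013 a minimal sketch, assuming the prediction DataFrame from above and a Spark version that includes this class:\n",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "from pyspark.ml.evaluation import ClusteringEvaluator\n\n# Silhouette needs no class labels \u2013 it only uses 'features' and 'prediction' (requires Spark >= 2.3)\nevaluator = ClusteringEvaluator(featuresCol='features', predictionCol='prediction')\nprint(evaluator.evaluate(prediction))"
},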
{
"execution_count": 9,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "--2019-08-27 01:36:46-- https://raw.githubusercontent.com/IBM/coursera/master/rklib.py\nResolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.8.133\nConnecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.8.133|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 2540 (2.5K) [text/plain]\nSaving to: 'rklib.py'\n\nrklib.py 100%[===================>] 2.48K --.-KB/s in 0s \n\n2019-08-27 01:36:46 (52.8 MB/s) - 'rklib.py' saved [2540/2540]\n\n"
}
],
"source": "!rm -f rklib.py\n!wget https://raw.githubusercontent.com/IBM/coursera/master/rklib.py"
},
{
"execution_count": 10,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "!rm -Rf a2_m3.json"
},
{
"execution_count": 11,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "prediction= prediction.repartition(1)\nprediction.write.json('a2_m3.json')"
},
{
"execution_count": 12,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "import zipfile\n\ndef zipdir(path, ziph):\n for root, dirs, files in os.walk(path):\n for file in files:\n ziph.write(os.path.join(root, file))\n\nzipf = zipfile.ZipFile('a2_m3.json.zip', 'w', zipfile.ZIP_DEFLATED)\nzipdir('a2_m3.json', zipf)\nzipf.close()"
},
{
"execution_count": 13,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "!base64 a2_m3.json.zip > a2_m3.json.zip.base64"
},
{
"execution_count": 14,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Submission successful, please check on the coursera grader page for the status\n-------------------------\n{\"elements\":[{\"itemId\":\"Cu6KW\",\"id\":\"f_F-qCtuEei_fRLwaVDk3g~Cu6KW~NomKo8hrEem6LhKVyhmtlA\",\"courseId\":\"f_F-qCtuEei_fRLwaVDk3g\"}],\"paging\":{},\"linked\":{}}\n-------------------------\n"
}
],
"source": "from rklib import submit\nkey = \"pPfm62VXEeiJOBL0dhxPkA\"\npart = \"EOTMs\"\nemail = \"rezapci@msn.com\"\nsecret = \"iJ7TF8lyfH9bcyRo\"\n\n\nwith open('a2_m3.json.zip.base64', 'r') as myfile:\n data=myfile.read()\nsubmit(email, secret, key, part, [part], data)"
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": ""
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 with Spark",
"name": "python36",
"language": "python3"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"version": "3.6.8",
"name": "python",
"file_extension": ".py",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"version": 3,
"name": "ipython"
}
}
},
"nbformat": 4
}