@tvganesh
Created October 12, 2018 11:36
Big Data-1: Move into the big league:Graduate from Python to Pyspark
{"cells":[{"cell_type":"markdown","source":["This post draws parallels between Python and Pyspark. If you are already adept in Python, the move \ninto Pyspark is relatively easy. Here I try to show how similar many of the key constructs are. It is important\nto note that code in Python and code in Pyspark execute very differently. In Python each statement is evaluated\nimmediately. Pyspark, on the other hand, has transformations (operations performed on dataframes etc.) and actions.\nPyspark does lazy evaluation, i.e. the transformations on a dataframe do not happen immediately. Spark\nqueues the transformations and evaluates them only when an action such as show(), count() or collect() is performed.\nRegardless, it is worth noting the essential similarities in coding these 2 languages, namely Python and its\n'big brother' Pyspark, when handling anything from small datasets to very large ones"],"metadata":{}},{"cell_type":"markdown","source":["## 1a. Read CSV file - Python\nThis is a common operation that will be required whether one uses Python or Pyspark. 
To see how to load a CSV file to Databricks, see this video [Upload Flat File to Databricks Table](https://www.youtube.com/watch?v=H5LxjaJgpSk)"],"metadata":{}},{"cell_type":"code","source":["import pandas as pd\nimport os\nos.getcwd()\n#Read CSV file\ntendulkar = pd.read_csv(\"/dbfs/FileStore/tables/tendulkar.csv\", header='infer')\n#Check the shape of the dataframe\ntendulkar.shape\n\n"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">24</span><span class=\"ansired\">]: </span>(347, 12)\n</div>"]}}],"execution_count":3},{"cell_type":"markdown","source":["##1b. Read CSV file - Pyspark - Option 1"],"metadata":{}},{"cell_type":"code","source":["from pyspark import SparkContext, SparkConf\nfrom pyspark.sql import SQLContext \nsql=SQLContext.getOrCreate(SparkContext.getOrCreate())\ntendulkar1= (sql\n .read.format(\"com.databricks.spark.csv\")\n .options(delimiter=',', header='true', inferschema='true')\n .load(\"/FileStore/tables/tendulkar.csv\"))\ntendulkar1.count()"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">25</span><span class=\"ansired\">]: </span>347\n</div>"]}}],"execution_count":5},{"cell_type":"markdown","source":["##1c. 
Read CSV file - Pyspark - Option 2"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql import SparkSession\nspark = SparkSession.builder.appName('Read CSV').getOrCreate()\ntendulkar2 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')\ntendulkar2.count()"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">26</span><span class=\"ansired\">]: </span>347\n</div>"]}}],"execution_count":7},{"cell_type":"markdown","source":["##2a. Data frame Shape -Python\nDetermine the shape of the dataframe"],"metadata":{}},{"cell_type":"code","source":["tendulkar.shape"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">27</span><span class=\"ansired\">]: </span>(347, 12)\n</div>"]}}],"execution_count":9},{"cell_type":"markdown","source":["##2b. 
Data frame Shape -Pyspark\nWhen computing the shape of a dataframe we have to compute the count of rows and the length of the columns separately"],"metadata":{}},{"cell_type":"code","source":["tendulkar1.count()\nlen(tendulkar1.columns)\ndef dfShape(df):\n return(df.count(),len(df.columns))\n\ndfShape(tendulkar1)"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">28</span><span class=\"ansired\">]: </span>(347, 12)\n</div>"]}}],"execution_count":11},{"cell_type":"markdown","source":["##3a. Data frame columns - Python"],"metadata":{}},{"cell_type":"code","source":["tendulkar.columns"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">29</span><span class=\"ansired\">]: </span>\nIndex([u&apos;Runs&apos;, u&apos;Mins&apos;, u&apos;BF&apos;, u&apos;4s&apos;, u&apos;6s&apos;, u&apos;SR&apos;, u&apos;Pos&apos;, u&apos;Dismissal&apos;,\n u&apos;Inns&apos;, u&apos;Opposition&apos;, u&apos;Ground&apos;, u&apos;Start Date&apos;],\n dtype=&apos;object&apos;)\n</div>"]}}],"execution_count":13},{"cell_type":"markdown","source":["## 3b. 
Data frame columns - Pyspark"],"metadata":{}},{"cell_type":"code","source":["tendulkar1.columns"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">30</span><span class=\"ansired\">]: </span>\n[&apos;Runs&apos;,\n &apos;Mins&apos;,\n &apos;BF&apos;,\n &apos;4s&apos;,\n &apos;6s&apos;,\n &apos;SR&apos;,\n &apos;Pos&apos;,\n &apos;Dismissal&apos;,\n &apos;Inns&apos;,\n &apos;Opposition&apos;,\n &apos;Ground&apos;,\n &apos;Start Date&apos;]\n</div>"]}}],"execution_count":15},{"cell_type":"markdown","source":["## 4a. Rename columns - Python"],"metadata":{}},{"cell_type":"code","source":["tendulkar.columns=['Runs','Minutes','BallsFaced','Fours','Sixes','StrikeRate','Position','Dismissal','Innings','Opposition','Ground','StartDate']\ntendulkar.columns"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">31</span><span class=\"ansired\">]: </span>\nIndex([u&apos;Runs&apos;, u&apos;Minutes&apos;, u&apos;BallsFaced&apos;, u&apos;Fours&apos;, u&apos;Sixes&apos;, u&apos;StrikeRate&apos;,\n u&apos;Position&apos;, u&apos;Dismissal&apos;, u&apos;Innings&apos;, u&apos;Opposition&apos;, u&apos;Ground&apos;,\n u&apos;StartDate&apos;],\n 
dtype=&apos;object&apos;)\n</div>"]}}],"execution_count":17},{"cell_type":"markdown","source":["##4b. Rename columns - Pyspark"],"metadata":{}},{"cell_type":"code","source":["# Runs and Dismissal already have the desired names, so only the rest need renaming\ntendulkar1 = tendulkar1.withColumnRenamed(\"Mins\", \"Minutes\")\\\n .withColumnRenamed(\"BF\", \"BallsFaced\")\\\n .withColumnRenamed(\"4s\", \"Fours\")\\\n .withColumnRenamed(\"6s\", \"Sixes\")\\\n .withColumnRenamed(\"SR\", \"StrikeRate\")\n"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"></div>"]}}],"execution_count":19},{"cell_type":"markdown","source":["##5a. Dtypes - Python"],"metadata":{}},{"cell_type":"code","source":["tendulkar.dtypes"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">33</span><span class=\"ansired\">]: </span>\nRuns object\nMinutes object\nBallsFaced object\nFours object\nSixes object\nStrikeRate object\nPosition object\nDismissal object\nInnings object\nOpposition object\nGround object\nStartDate object\ndtype: object\n</div>"]}}],"execution_count":21},{"cell_type":"markdown","source":["##5b. 
Dtypes -Pyspark"],"metadata":{}},{"cell_type":"code","source":["tendulkar1.dtypes"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">34</span><span class=\"ansired\">]: </span>\n[(&apos;Runs&apos;, &apos;string&apos;),\n (&apos;Minutes&apos;, &apos;string&apos;),\n (&apos;BallsFaced&apos;, &apos;string&apos;),\n (&apos;Fours&apos;, &apos;string&apos;),\n (&apos;Sixes&apos;, &apos;string&apos;),\n (&apos;StrikeRate&apos;, &apos;string&apos;),\n (&apos;Pos&apos;, &apos;string&apos;),\n (&apos;Dismissal&apos;, &apos;string&apos;),\n (&apos;Inns&apos;, &apos;string&apos;),\n (&apos;Opposition&apos;, &apos;string&apos;),\n (&apos;Ground&apos;, &apos;string&apos;),\n (&apos;Start Date&apos;, &apos;string&apos;)]\n</div>"]}}],"execution_count":23},{"cell_type":"markdown","source":["## 5c. 
printSchema"],"metadata":{}},{"cell_type":"code","source":["tendulkar1.printSchema()"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\">root\n-- Runs: string (nullable = true)\n-- Minutes: string (nullable = true)\n-- BallsFaced: string (nullable = true)\n-- Fours: string (nullable = true)\n-- Sixes: string (nullable = true)\n-- StrikeRate: string (nullable = true)\n-- Pos: string (nullable = true)\n-- Dismissal: string (nullable = true)\n-- Inns: string (nullable = true)\n-- Opposition: string (nullable = true)\n-- Ground: string (nullable = true)\n-- Start Date: string (nullable = true)\n\n</div>"]}}],"execution_count":25},{"cell_type":"markdown","source":["##6a. 
Select columns - Python"],"metadata":{}},{"cell_type":"code","source":["import pandas as pd\nimport os\nos.getcwd()\n#Read CSV file\ntendulkar = pd.read_csv(\"/dbfs/FileStore/tables/tendulkar.csv\", header='infer')\ndf=tendulkar[['Runs','Mins','BF']]\ndf.head(10)"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">36</span><span class=\"ansired\">]: </span>\n Runs Mins BF\n0 15 28 24\n1 DNB - -\n2 59 254 172\n3 8 24 16\n4 41 124 90\n5 35 74 51\n6 57 193 134\n7 0 1 1\n8 24 50 44\n9 88 324 266\n</div>"]}}],"execution_count":27},{"cell_type":"markdown","source":["##6b. Select columns - Pyspark"],"metadata":{}},{"cell_type":"code","source":["from pyspark import SparkContext, SparkConf\nfrom pyspark.sql import SQLContext \nsql=SQLContext.getOrCreate(SparkContext.getOrCreate())\ntendulkar1= (sql\n .read.format(\"com.databricks.spark.csv\")\n .options(delimiter=',', header='true', inferschema='true')\n .load(\"/FileStore/tables/tendulkar.csv\"))\ndf1=tendulkar1.select('Runs','Mins','BF')\ndf1.show(10)"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\">+----+----+---+\nRuns|Mins| BF|\n+----+----+---+\n 15| 28| 24|\n DNB| -| -|\n 59| 254|172|\n 8| 24| 16|\n 41| 124| 90|\n 35| 74| 51|\n 57| 193|134|\n 0| 1| 1|\n 24| 
50| 44|\n 88| 324|266|\n+----+----+---+\nonly showing top 10 rows\n\n</div>"]}}],"execution_count":29},{"cell_type":"markdown","source":["##7a. Filter rows by criteria - Python"],"metadata":{}},{"cell_type":"code","source":["b = tendulkar['Runs'] >50\ndf = tendulkar[b]\ndf.head(10)"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">38</span><span class=\"ansired\">]: </span>\n Runs Mins BF 4s 6s SR Pos Dismissal Inns Opposition Ground \\\n0 15 28 24 2 0 62.5 6 bowled 2 v Pakistan Karachi \n1 DNB - - - - - - - 4 v Pakistan Karachi \n2 59 254 172 4 0 34.3 6 lbw 1 v Pakistan Faisalabad \n3 8 24 16 1 0 50 6 run out 3 v Pakistan Faisalabad \n4 41 124 90 5 0 45.55 7 bowled 1 v Pakistan Lahore \n5 35 74 51 5 0 68.62 6 lbw 1 v Pakistan Sialkot \n6 57 193 134 6 0 42.53 6 caught 3 v Pakistan Sialkot \n7 0 1 1 0 0 0 6 caught 2 v New Zealand Christchurch \n8 24 50 44 3 0 54.54 6 caught 3 v New Zealand Christchurch \n9 88 324 266 5 0 33.08 6 caught 1 v New Zealand Napier \n\n Start Date \n0 15-Nov-89 \n1 15-Nov-89 \n2 23-Nov-89 \n3 23-Nov-89 \n4 1-Dec-89 \n5 9-Dec-89 \n6 9-Dec-89 \n7 2-Feb-90 \n8 2-Feb-90 \n9 9-Feb-90 \n</div>"]}}],"execution_count":31},{"cell_type":"markdown","source":["##7b Filtering by criteria - Pyspark"],"metadata":{}},{"cell_type":"code","source":["df1=tendulkar1.filter(tendulkar1['Runs']>50)\ndf1.show(10)"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n 
font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\">+----+----+---+---+---+-----+---+---------+----+--------------+------------+----------+\nRuns|Mins| BF| 4s| 6s| SR|Pos|Dismissal|Inns| Opposition| Ground|Start Date|\n+----+----+---+---+---+-----+---+---------+----+--------------+------------+----------+\n 59| 254|172| 4| 0| 34.3| 6| lbw| 1| v Pakistan| Faisalabad| 23-Nov-89|\n 57| 193|134| 6| 0|42.53| 6| caught| 3| v Pakistan| Sialkot| 9-Dec-89|\n 88| 324|266| 5| 0|33.08| 6| caught| 1| v New Zealand| Napier| 9-Feb-90|\n 68| 216|136| 8| 0| 50| 6| caught| 2| v England| Manchester| 9-Aug-90|\n 114| 228|161| 16| 0| 70.8| 4| caught| 2| v Australia| Perth| 1-Feb-92|\n 111| 373|270| 19| 0|41.11| 4| caught| 2|v South Africa|Johannesburg| 26-Nov-92|\n 73| 272|208| 8| 1|35.09| 5| caught| 2|v South Africa| Cape Town| 2-Jan-93|\n 165| 361|296| 24| 1|55.74| 4| caught| 1| v England| Chennai| 11-Feb-93|\n 78| 285|213| 10| 0|36.61| 4| lbw| 2| v England| Mumbai| 19-Feb-93|\n 62| 128|114| 7| 0|54.38| 4| caught| 1| v Zimbabwe| Delhi| 13-Mar-93|\n+----+----+---+---+---+-----+---+---------+----+--------------+------------+----------+\nonly showing top 10 rows\n\n</div>"]}}],"execution_count":33},{"cell_type":"markdown","source":["##8a. 
Display unique contents of a column - Python"],"metadata":{}},{"cell_type":"code","source":["tendulkar = pd.read_csv(\"/dbfs/FileStore/tables/tendulkar.csv\", header='infer')\ntendulkar['Runs'].unique()"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">40</span><span class=\"ansired\">]: </span>\narray([&apos;15&apos;, &apos;DNB&apos;, &apos;59&apos;, &apos;8&apos;, &apos;41&apos;, &apos;35&apos;, &apos;57&apos;, &apos;0&apos;, &apos;24&apos;, &apos;88&apos;, &apos;5&apos;,\n &apos;10&apos;, &apos;27&apos;, &apos;68&apos;, &apos;119*&apos;, &apos;21&apos;, &apos;11&apos;, &apos;16&apos;, &apos;7&apos;, &apos;40&apos;, &apos;148*&apos;, &apos;6&apos;,\n &apos;17&apos;, &apos;114&apos;, &apos;111&apos;, &apos;1&apos;, &apos;73&apos;, &apos;50&apos;, &apos;9*&apos;, &apos;165&apos;, &apos;78&apos;, &apos;62&apos;,\n &apos;TDNB&apos;, &apos;28&apos;, &apos;104*&apos;, &apos;71&apos;, &apos;142&apos;, &apos;96&apos;, &apos;43&apos;, &apos;11*&apos;, &apos;34&apos;, &apos;85&apos;,\n &apos;179&apos;, &apos;54&apos;, &apos;4&apos;, &apos;0*&apos;, &apos;52*&apos;, &apos;2&apos;, &apos;122&apos;, &apos;31&apos;, &apos;177&apos;, &apos;74&apos;, &apos;42&apos;,\n &apos;18&apos;, &apos;61&apos;, &apos;36&apos;, &apos;169&apos;, &apos;9&apos;, &apos;15*&apos;, &apos;92&apos;, &apos;83&apos;, &apos;143&apos;, &apos;139&apos;, &apos;23&apos;,\n &apos;148&apos;, &apos;13&apos;, &apos;155*&apos;, &apos;79&apos;, &apos;47&apos;, &apos;113&apos;, &apos;67&apos;, &apos;136&apos;, &apos;29&apos;, &apos;53&apos;,\n &apos;124*&apos;, &apos;126*&apos;, &apos;44*&apos;, 
&apos;217&apos;, &apos;116&apos;, &apos;52&apos;, &apos;45&apos;, &apos;97&apos;, &apos;20&apos;, &apos;39&apos;,\n &apos;201*&apos;, &apos;76&apos;, &apos;65&apos;, &apos;126&apos;, &apos;36*&apos;, &apos;69&apos;, &apos;155&apos;, &apos;22*&apos;, &apos;103&apos;, &apos;26&apos;,\n &apos;90&apos;, &apos;176&apos;, &apos;117&apos;, &apos;86&apos;, &apos;12&apos;, &apos;193&apos;, &apos;16*&apos;, &apos;51&apos;, &apos;32&apos;, &apos;55&apos;,\n &apos;37&apos;, &apos;44&apos;, &apos;241*&apos;, &apos;60*&apos;, &apos;194*&apos;, &apos;3&apos;, &apos;32*&apos;, &apos;248*&apos;, &apos;94&apos;, &apos;22&apos;,\n &apos;109&apos;, &apos;19&apos;, &apos;14&apos;, &apos;28*&apos;, &apos;63&apos;, &apos;64&apos;, &apos;101&apos;, &apos;122*&apos;, &apos;91&apos;, &apos;82&apos;,\n &apos;56*&apos;, &apos;154*&apos;, &apos;153&apos;, &apos;49&apos;, &apos;10*&apos;, &apos;103*&apos;, &apos;160&apos;, &apos;100*&apos;, &apos;105*&apos;,\n &apos;100&apos;, &apos;106&apos;, &apos;84&apos;, &apos;203&apos;, &apos;98&apos;, &apos;38&apos;, &apos;214&apos;, &apos;53*&apos;, &apos;111*&apos;, &apos;146&apos;,\n &apos;14*&apos;, &apos;56&apos;, &apos;80&apos;, &apos;25&apos;, &apos;81&apos;, &apos;13*&apos;], dtype=object)\n</div>"]}}],"execution_count":35},{"cell_type":"markdown","source":["##8b. 
Display unique contents of a column - Pyspark"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.functions import *\nsqlContext=SQLContext.getOrCreate(SparkContext.getOrCreate())\ntendulkar1= (sqlContext\n .read.format(\"com.databricks.spark.csv\")\n .options(delimiter=',', header='true', inferschema='true')\n .load(\"/FileStore/tables/tendulkar.csv\"))\ntendulkar1.select('Runs').rdd.distinct().collect()"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">41</span><span class=\"ansired\">]: </span>\n[Row(Runs=u&apos;24&apos;),\n Row(Runs=u&apos;148*&apos;),\n Row(Runs=u&apos;100*&apos;),\n Row(Runs=u&apos;0&apos;),\n Row(Runs=u&apos;16&apos;),\n Row(Runs=u&apos;80&apos;),\n Row(Runs=u&apos;146&apos;),\n Row(Runs=u&apos;19&apos;),\n Row(Runs=u&apos;96&apos;),\n Row(Runs=u&apos;16*&apos;),\n Row(Runs=u&apos;76&apos;),\n Row(Runs=u&apos;51&apos;),\n Row(Runs=u&apos;88&apos;),\n Row(Runs=u&apos;9&apos;),\n Row(Runs=u&apos;35&apos;),\n Row(Runs=u&apos;64&apos;),\n Row(Runs=u&apos;41&apos;),\n Row(Runs=u&apos;11&apos;),\n Row(Runs=u&apos;25&apos;),\n Row(Runs=u&apos;59&apos;),\n Row(Runs=u&apos;1&apos;),\n Row(Runs=u&apos;49&apos;),\n Row(Runs=u&apos;91&apos;),\n Row(Runs=u&apos;81&apos;),\n Row(Runs=u&apos;122*&apos;),\n Row(Runs=u&apos;32*&apos;),\n Row(Runs=u&apos;18&apos;),\n Row(Runs=u&apos;60*&apos;),\n Row(Runs=u&apos;126*&apos;),\n Row(Runs=u&apos;DNB&apos;),\n Row(Runs=u&apos;53*&apos;),\n Row(Runs=u&apos;116&apos;),\n Row(Runs=u&apos;34&apos;),\n Row(Runs=u&apos;65&apos;),\n Row(Runs=u&apos;10&apos;),\n Row(Runs=u&apos;26&apos;),\n 
Row(Runs=u&apos;13*&apos;),\n Row(Runs=u&apos;36*&apos;),\n Row(Runs=u&apos;165&apos;),\n Row(Runs=u&apos;TDNB&apos;),\n Row(Runs=u&apos;139&apos;),\n Row(Runs=u&apos;68&apos;),\n Row(Runs=u&apos;27&apos;),\n Row(Runs=u&apos;217&apos;),\n Row(Runs=u&apos;86&apos;),\n Row(Runs=u&apos;6&apos;),\n Row(Runs=u&apos;203&apos;),\n Row(Runs=u&apos;47&apos;),\n Row(Runs=u&apos;98&apos;),\n Row(Runs=u&apos;74&apos;),\n Row(Runs=u&apos;53&apos;),\n Row(Runs=u&apos;111&apos;),\n Row(Runs=u&apos;105*&apos;),\n Row(Runs=u&apos;179&apos;),\n Row(Runs=u&apos;154*&apos;),\n Row(Runs=u&apos;13&apos;),\n Row(Runs=u&apos;28*&apos;),\n Row(Runs=u&apos;50&apos;),\n Row(Runs=u&apos;248*&apos;),\n Row(Runs=u&apos;7&apos;),\n Row(Runs=u&apos;14*&apos;),\n Row(Runs=u&apos;214&apos;),\n Row(Runs=u&apos;32&apos;),\n Row(Runs=u&apos;44&apos;),\n Row(Runs=u&apos;28&apos;),\n Row(Runs=u&apos;56*&apos;),\n Row(Runs=u&apos;52*&apos;),\n Row(Runs=u&apos;73&apos;),\n Row(Runs=u&apos;52&apos;),\n Row(Runs=u&apos;103*&apos;),\n Row(Runs=u&apos;126&apos;),\n Row(Runs=u&apos;67&apos;),\n Row(Runs=u&apos;12&apos;),\n Row(Runs=u&apos;20&apos;),\n Row(Runs=u&apos;201*&apos;),\n Row(Runs=u&apos;177&apos;),\n Row(Runs=u&apos;119*&apos;),\n Row(Runs=u&apos;103&apos;),\n Row(Runs=u&apos;4&apos;),\n Row(Runs=u&apos;10*&apos;),\n Row(Runs=u&apos;92&apos;),\n Row(Runs=u&apos;55&apos;),\n Row(Runs=u&apos;84&apos;),\n Row(Runs=u&apos;142&apos;),\n Row(Runs=u&apos;31&apos;),\n Row(Runs=u&apos;45&apos;),\n Row(Runs=u&apos;44*&apos;),\n Row(Runs=u&apos;111*&apos;),\n Row(Runs=u&apos;29&apos;),\n Row(Runs=u&apos;0*&apos;),\n Row(Runs=u&apos;113&apos;),\n Row(Runs=u&apos;39&apos;),\n Row(Runs=u&apos;241*&apos;),\n Row(Runs=u&apos;15&apos;),\n Row(Runs=u&apos;21&apos;),\n Row(Runs=u&apos;69&apos;),\n Row(Runs=u&apos;117&apos;),\n Row(Runs=u&apos;193&apos;),\n Row(Runs=u&apos;176&apos;),\n Row(Runs=u&apos;5&apos;),\n Row(Runs=u&apos;100&apos;),\n Row(Runs=u&apos;15*&apos;),\n Row(Runs=u&apos;160&apos;),\n 
Row(Runs=u&apos;153&apos;),\n Row(Runs=u&apos;9*&apos;),\n Row(Runs=u&apos;79&apos;),\n Row(Runs=u&apos;54&apos;),\n Row(Runs=u&apos;85&apos;),\n Row(Runs=u&apos;143&apos;),\n Row(Runs=u&apos;124*&apos;),\n Row(Runs=u&apos;61&apos;),\n Row(Runs=u&apos;42&apos;),\n Row(Runs=u&apos;136&apos;),\n Row(Runs=u&apos;71&apos;),\n Row(Runs=u&apos;38&apos;),\n Row(Runs=u&apos;14&apos;),\n Row(Runs=u&apos;22&apos;),\n Row(Runs=u&apos;82&apos;),\n Row(Runs=u&apos;2&apos;),\n Row(Runs=u&apos;148&apos;),\n Row(Runs=u&apos;11*&apos;),\n Row(Runs=u&apos;101&apos;),\n Row(Runs=u&apos;94&apos;),\n Row(Runs=u&apos;78&apos;),\n Row(Runs=u&apos;57&apos;),\n Row(Runs=u&apos;90&apos;),\n Row(Runs=u&apos;37&apos;),\n Row(Runs=u&apos;62&apos;),\n Row(Runs=u&apos;43&apos;),\n Row(Runs=u&apos;109&apos;),\n Row(Runs=u&apos;104*&apos;),\n Row(Runs=u&apos;169&apos;),\n Row(Runs=u&apos;155*&apos;),\n Row(Runs=u&apos;194*&apos;),\n Row(Runs=u&apos;17&apos;),\n Row(Runs=u&apos;23&apos;),\n Row(Runs=u&apos;3&apos;),\n Row(Runs=u&apos;83&apos;),\n Row(Runs=u&apos;106&apos;),\n Row(Runs=u&apos;97&apos;),\n Row(Runs=u&apos;155&apos;),\n Row(Runs=u&apos;22*&apos;),\n Row(Runs=u&apos;56&apos;),\n Row(Runs=u&apos;114&apos;),\n Row(Runs=u&apos;122&apos;),\n Row(Runs=u&apos;36&apos;),\n Row(Runs=u&apos;63&apos;),\n Row(Runs=u&apos;40&apos;),\n Row(Runs=u&apos;8&apos;)]\n</div>"]}}],"execution_count":37},{"cell_type":"markdown","source":["##9a. 
Aggregate mean, max, min - Python"],"metadata":{}},{"cell_type":"code","source":["\n\nimport pandas as pd\nimport os\nos.getcwd()\ntendulkar = pd.read_csv(\"/dbfs/FileStore/tables/tendulkar.csv\", header='infer')\ntendulkar.shape\n# Remove rows which have DNB\na=tendulkar.Runs !=\"DNB\"\ntendulkar=tendulkar[a]\ntendulkar.shape\n\n# Remove rows which have TDNB\nb=tendulkar.Runs !=\"TDNB\"\ntendulkar=tendulkar[b]\ntendulkar.shape\n\n# Remove rows where BF is '-'\nc= tendulkar.BF != \"-\"\ntendulkar=tendulkar[c]\n# Remove the '*' character from Runs\ntendulkar.Runs= tendulkar.Runs.str.replace(r\"[*]\",\"\")\n\n#tendulkar.shape\ntype(tendulkar['Runs'][0])\ntendulkar['Runs']=pd.to_numeric(tendulkar['Runs'])\ntendulkar['BF']=pd.to_numeric(tendulkar['BF'])\n\n# Group by ground and compute mean, min and max\ndf=tendulkar[['Runs','BF','Ground']].groupby('Ground').agg(['mean','min','max'])\ndf.head(10)"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">42</span><span class=\"ansired\">]: </span>\n Runs BF \n mean min max mean min max\nGround \nAdelaide 32.600000 0 153 58.400000 5 205\nAhmedabad 40.125000 4 217 80.062500 3 344\nAuckland 5.000000 5 5 13.000000 13 13\nBangalore 54.312500 0 214 98.687500 0 363\nBirmingham 46.750000 1 122 71.500000 8 177\nBloemfontein 85.000000 15 155 109.500000 35 184\nBridgetown 26.000000 0 92 45.000000 2 147\nBrisbane 7.666667 0 16 23.333333 3 42\nBulawayo 55.000000 36 74 88.500000 49 128\nCape Town 69.857143 9 169 155.285714 28 314\n</div>"]}}],"execution_count":39},{"cell_type":"markdown","source":["##9b. 
Aggregate mean,min max - Pyspark"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.functions import *\ntendulkar1= (sqlContext\n .read.format(\"com.databricks.spark.csv\")\n .options(delimiter=',', header='true', inferschema='true')\n .load(\"/FileStore/tables/tendulkar.csv\"))\n# Filter rows which have DNB\ntendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'DNB')\nprint(dfShape(tendulkar1))\n\n# Filter rows which have TDNB\ntendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'TDNB')\nprint(dfShape(tendulkar1))\n\n# Replace * with \"\"\ntendulkar1 = tendulkar1.withColumn('Runs', regexp_replace('Runs', '[*]', ''))\ntendulkar1.select('Runs').rdd.distinct().collect()\n\nfrom pyspark.sql import functions as F\n\n#Group by ground and compute mean, min and max\ndf=tendulkar1[['Runs','BF','Ground']].groupby(tendulkar1['Ground']).agg(F.mean(tendulkar1['Runs']),F.min(tendulkar1['Runs']),F.max(tendulkar1['Runs']))\ndf.show()"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\">(330, 12)\n(329, 12)\n+-------------+-----------------+---------+---------+\n Ground| avg(Runs)|min(Runs)|max(Runs)|\n+-------------+-----------------+---------+---------+\n Bangalore| 54.3125| 0| 96|\n Adelaide| 32.6| 0| 61|\nColombo (PSS)| 37.2| 14| 71|\n Christchurch| 12.0| 0| 24|\n Auckland| 5.0| 5| 5|\n Chennai| 60.625| 0| 81|\n Centurion| 73.5| 111| 36|\n Brisbane|7.666666666666667| 0| 7|\n Birmingham| 46.75| 1| 40|\n Ahmedabad| 40.125| 100| 8|\nColombo (RPS)| 143.0| 143| 143|\n Chittagong| 57.8| 101| 36|\n Cape Town|69.85714285714286| 14| 9|\n Bridgetown| 26.0| 0| 92|\n Bulawayo| 55.0| 36| 74|\n Delhi|39.94736842105263| 0| 76|\n 
Chandigarh| 11.0| 11| 11|\n Bloemfontein| 85.0| 15| 155|\nColombo (SSC)|77.55555555555556| 104| 8|\n Cuttack| 2.0| 2| 2|\n+-------------+-----------------+---------+---------+\nonly showing top 20 rows\n\n</div>"]}}],"execution_count":41},{"cell_type":"markdown","source":["##10 Using SQL with Pyspark"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql import SparkSession\nspark = SparkSession.builder.appName('Read CSV').getOrCreate()\ntendulkar1.createOrReplaceTempView(\"tendulkarTbl\")\ndf = spark.sql(\"SELECT * FROM tendulkarTbl\")\ndf.show(3)\ndf1 = spark.sql(\"SELECT * FROM tendulkarTbl where Ground='Karachi'\")\ndf1.show()"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\">+----+----+---+---+---+----+---+---------+----+----------+----------+----------+\nRuns|Mins| BF| 4s| 6s| SR|Pos|Dismissal|Inns|Opposition| Ground|Start Date|\n+----+----+---+---+---+----+---+---------+----+----------+----------+----------+\n 15| 28| 24| 2| 0|62.5| 6| bowled| 2|v Pakistan| Karachi| 15-Nov-89|\n 59| 254|172| 4| 0|34.3| 6| lbw| 1|v Pakistan|Faisalabad| 23-Nov-89|\n 8| 24| 16| 1| 0| 50| 6| run out| 3|v Pakistan|Faisalabad| 23-Nov-89|\n+----+----+---+---+---+----+---+---------+----+----------+----------+----------+\nonly showing top 3 rows\n\n+----+----+---+---+---+-----+---+---------+----+----------+-------+----------+\nRuns|Mins| BF| 4s| 6s| SR|Pos|Dismissal|Inns|Opposition| Ground|Start Date|\n+----+----+---+---+---+-----+---+---------+----+----------+-------+----------+\n 15| 28| 24| 2| 0| 62.5| 6| bowled| 2|v Pakistan|Karachi| 15-Nov-89|\n 23| 49| 29| 5| 0|79.31| 4| bowled| 2|v Pakistan|Karachi| 29-Jan-06|\n 
 26| 74| 47| 5| 0|55.31| 4| bowled| 4|v Pakistan|Karachi| 29-Jan-06|\n+----+----+---+---+---+-----+---+---------+----+----------+-------+----------+\n\n</div>"]}}],"execution_count":43},{"cell_type":"markdown","source":["##11a. Apply lambda - Python"],"metadata":{}},{"cell_type":"code","source":["import pandas as pd\nimport numpy as np\nimport os\nos.getcwd()\ntendulkar = pd.read_csv(\"/dbfs/FileStore/tables/tendulkar.csv\", header='infer')\n# convert_objects() is deprecated in pandas; pd.to_numeric with errors='coerce' gives the same result\ntendulkar['4s']=pd.to_numeric(tendulkar['4s'], errors='coerce')\na=tendulkar['4s'].apply(lambda x: x*4)\na"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"><span class=\"ansired\">Out[</span><span class=\"ansired\">45</span><span class=\"ansired\">]: </span>\n0 8.0\n1 NaN\n2 16.0\n3 4.0\n4 20.0\n5 20.0\n6 24.0\n7 0.0\n8 12.0\n9 20.0\n10 4.0\n11 NaN\n12 4.0\n13 16.0\n14 32.0\n15 68.0\n16 12.0\n17 NaN\n18 12.0\n19 4.0\n20 0.0\n21 20.0\n22 56.0\n23 4.0\n24 8.0\n25 64.0\n26 4.0\n27 0.0\n28 4.0\n29 76.0\n ... \n317 20.0\n318 32.0\n319 0.0\n320 32.0\n321 16.0\n322 32.0\n323 36.0\n324 12.0\n325 4.0\n326 12.0\n327 8.0\n328 8.0\n329 12.0\n330 20.0\n331 8.0\n332 NaN\n333 4.0\n334 8.0\n335 52.0\n336 4.0\n337 0.0\n338 28.0\n339 0.0\n340 4.0\n341 20.0\n342 8.0\n343 20.0\n344 0.0\n345 8.0\n346 48.0\nName: 4s, dtype: float64\n</div>"]}}],"execution_count":45},{"cell_type":"markdown","source":["##11b. 
Apply lambda - Pyspark"],"metadata":{}},{"cell_type":"code","source":["from pyspark.sql.functions import *\nfrom pyspark.sql.types import IntegerType\ntendulkar1= (sqlContext\n .read.format(\"com.databricks.spark.csv\")\n .options(delimiter=',', header='true', inferschema='true')\n .load(\"/FileStore/tables/tendulkar.csv\"))\n\ntendulkar1 = tendulkar1.withColumn(\"4s\", tendulkar1[\"4s\"].cast(IntegerType()))\n\ntendulkar2= tendulkar1.where(col(\"4s\").isNotNull())\n# UDF that multiplies each value in the 4s column by 4\nfours = udf(lambda x: 4*x, IntegerType())\nusers = tendulkar2.withColumn(\"4s\",fours(tendulkar1[\"4s\"]))\nusers.show(5)"],"metadata":{},"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\">+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+\nRuns|Mins| BF| 4s| 6s| SR|Pos|Dismissal|Inns|Opposition| Ground|Start Date|\n+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+\n 15| 28| 24| 8| 0| 62.5| 6| bowled| 2|v Pakistan| Karachi| 15-Nov-89|\n 59| 254|172| 16| 0| 34.3| 6| lbw| 1|v Pakistan|Faisalabad| 23-Nov-89|\n 8| 24| 16| 4| 0| 50| 6| run out| 3|v Pakistan|Faisalabad| 23-Nov-89|\n 41| 124| 90| 20| 0|45.55| 7| bowled| 1|v Pakistan| Lahore| 1-Dec-89|\n 35| 74| 51| 20| 0|68.62| 6| lbw| 1|v Pakistan| Sialkot| 9-Dec-89|\n+----+----+---+---+---+-----+---+---------+----+----------+----------+----------+\nonly showing top 5 rows\n\n</div>"]}}],"execution_count":47},{"cell_type":"markdown","source":["##12. Conclusion\nThe above post shows some of the most important equivalent constructs in Python and Pyspark. 
I will be adding more constructs soon.\nWatch this space."],"metadata":{}}],"metadata":{"name":"Big Data-1: Move into the big league:Graduate from Python to Pyspark","notebookId":696915129665749},"nbformat":4,"nbformat_minor":0}