You, too, can learn PySpark right on your Linux (or other Docker-compatible) desktop in a few short, easy steps!
Before we begin, make sure you have Docker installed. Here's the official documentation: https://docs.docker.com/get-docker/
Got that done? Good! Let's get started.
First, let's get our ~/.bashrc set up so that we can use a single command instead of remembering complicated Docker flags. Nobody likes those.
echo 'alias pyspark-jupyter="docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook"' >> ~/.bashrc
source !$ # !$ expands to the last argument of the previous command -- here, ~/.bashrc
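One optional tweak: the --rm flag deletes the container (and anything saved only inside it) when it exits. If you'd like the notebooks you create to persist, you can mount your current directory into the image's work folder -- /home/jovyan/work is the notebook image's standard working directory -- so an alias like this should do it:
alias pyspark-jupyter='docker run -it --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook'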
Great, now let's run the command --
pyspark-jupyter
This might take a moment on the first run, but don't panic. This is going to run a Jupyter notebook server on your machine and make it available in your web browser. Pay attention to your terminal -- it will eventually print a link you can click to open the notebook. It probably looks like this:
Or copy and paste one of these URLs:
http://baae0d65a0c5:8888/?token=4854138a9a2fe7b2a7eb7be5e93c2c9a57da42d5f8547cd0
or http://127.0.0.1:8888/?token=4854138a9a2fe7b2a7eb7be5e93c2c9a57da42d5f8547cd0
(If the first link's container hostname doesn't resolve in your browser, use the 127.0.0.1 one.) Get in? Good! Click New > Python 3 notebook. You should be greeted with a series of cells. And now... let's get started with a little bit of PySpark!
First, let's build our SparkSession, and a SparkContext too. Copy these into a cell, and then execute the cell --
from pyspark.context import SparkContext
from pyspark.sql import DataFrame, Row, SparkSession

spark_context = SparkContext.getOrCreate()  # low-level entry point; we'll use it for parallelize() later
spark_session = SparkSession.builder.getOrCreate()  # the DataFrame / SQL entry point
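If you'd like a sanity check that everything is wired up, something like this should work (the exact version string depends on the image you pulled):
print(spark_session.version)   # the Spark version bundled with the image
spark_session.range(5).show()  # a tiny built-in DataFrame -- if this prints a table, the session works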
I'm going to add some common functions I generally wind up using. You can find more that you might like to try in the PySpark API documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
from pyspark.sql.functions import col, collect_set, coalesce, concat, expr, struct, when, desc
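As a quick, purely illustrative taste of a couple of those -- the 'bracket' column name and the age cutoff here are made up:
spark_session.createDataFrame([Row(age=17), Row(age=23)]).select(
    col('age'),
    when(col('age') >= 18, 'adult').otherwise('minor').alias('bracket')  # when() works like a SQL CASE expression
).show()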
I'm not going to cover schema creation in this gist unless someone eyeballs it and requests it, but I will cover creating DataFrames and doing some... stuff with them. Let's try!
First, a sample DataFrame:
my_dataframe = spark_context.parallelize([
Row(name='James', age=23, phone='4419950090', service_group='a'), # Let's get some rows going to test with
Row(name='Alyssa', age=17, phone='4419950091', service_group='a'),
Row(name='Albert', age=41, phone='4419950092', service_group='b'),
Row(name='Danny', age=55, phone='4419950092', service_group='b')
]).toDF() # .toDF() converts the RDD we made with parallelize() into a DataFrame.
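If you want to confirm what got built, printSchema() is handy:
my_dataframe.printSchema()  # Spark inferred these types from the Row values; field order can vary by Spark version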
Let's start slow...
my_dataframe.select(
'name',
'age',
'phone'
).show()
You ought to see output like this --
+------+---+----------+
| name|age| phone|
+------+---+----------+
| James| 23|4419950090|
|Alyssa| 17|4419950091|
|Albert| 41|4419950092|
| Danny| 55|4419950092|
+------+---+----------+
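While we're here, a small aside: filter() and orderBy() pair nicely with the col() and desc() helpers we imported earlier. A purely illustrative one-liner (the age cutoff is made up):
my_dataframe.filter(col('age') >= 18).orderBy(desc('age')).show()  # adults only, oldest first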
Now let's do some fancier... stuff. I'm going to use a .groupBy(). Any time you use a groupBy(), you must aggregate in some way afterward. Here's an example where we gather our users into arrays based on their service groups --
my_dataframe_array_aggregated = my_dataframe.withColumn(
'user_data', # Add a new column to our dataframe called user_data made with struct()!
struct( # struct() creates an accessible object out of the chosen columns
'name',
'age',
'phone'
)
).groupBy(
'service_group' # Now let's group our users by service group,
).agg(
collect_set('user_data').alias('users') # And add them to an array that's assigned to their service group.
)
my_dataframe_array_aggregated.show(truncate=False) # The truncate=False setting makes it so we can see our whole output.
Run it! You should get output that looks like this --
+-------------+---------------------------------------------------+
|service_group|users |
+-------------+---------------------------------------------------+
|b |[{Albert, 41, 4419950092}, {Danny, 55, 4419950092}]|
|a |[{James, 23, 4419950090}, {Alyssa, 17, 4419950091}]|
+-------------+---------------------------------------------------+
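collect_set() is just one way to satisfy the aggregate-after-groupBy rule; any aggregator works. A quick sketch with expr() (same columns as above, output not shown):
my_dataframe.groupBy(
    'service_group'
).agg(
    expr('count(*) as user_count'),  # how many users landed in each group
    expr('avg(age) as avg_age')      # and their average age
).show()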
Now that array-aggregated frame is data ready for saving as JSON to your favorite document / object store!
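For instance, a quick write-out sketch -- the path here is just a guess at somewhere writable inside the container (/home/jovyan is the notebook image's home directory):
my_dataframe_array_aggregated.write.mode('overwrite').json('/home/jovyan/users_by_group')  # one JSON file per partition
Each Spark partition becomes its own part-file inside that directory, which is normal.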
Those are the basics. You should absolutely play around and make up your own sample data!
You can also create RDDs (and DataFrames) from Python dictionaries with spark_context.parallelize() (though inferring a schema from dicts this way is deprecated in favor of Rows), so don't feel like you're stuck manually constructing rows. Try something like this to make a DataFrame:
spark_context.parallelize(
[
{'name': 'James', 'age':23, 'phone': '4419950090', 'service_group': 'a'},
{'name': 'Alyssa', 'age':17, 'phone': '4419950091', 'service_group': 'a'},
{'name': 'Albert', 'age':41, 'phone': '4419950092', 'service_group': 'b'},
{'name': 'Danny', 'age':55, 'phone': '4419950092', 'service_group': 'b'},
]
).toDF()
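If you'd rather skip the RDD hop (and the deprecation warning), spark_session.createDataFrame() accepts the same list of dicts directly -- whether it warns depends on your Spark version:
spark_session.createDataFrame([
    {'name': 'James', 'age': 23, 'phone': '4419950090', 'service_group': 'a'},
    {'name': 'Alyssa', 'age': 17, 'phone': '4419950091', 'service_group': 'a'},
]).show()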
👇 Feel free to ask any questions you might have below! 👇