You, too, can learn PySpark right on your Linux (or other Docker-compatible) desktop in a few short, easy steps!
Before we begin, make sure you have Docker installed. Here's the official documentation: https://docs.docker.com/get-docker/
Got that done? Good! Let's get started.
First, let's get our ~/.bashrc set up so that we can use a single command instead of remembering complicated Docker flags. Nobody likes those.
echo 'alias pyspark-jupyter="docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook"' >> ~/.bashrc
source !$ # !$ expands to the last argument of the previous command -- here, ~/.bashrc
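One optional tweak: the --rm flag deletes the container (and anything saved only inside it) when it exits. If you'd like the notebooks you create to persist, you can mount your current directory into the image's work folder -- /home/jovyan/work is the notebook image's standard working directory -- so an alias like this should do it:
alias pyspark-jupyter='docker run -it --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook'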
Great, now let's run the command --
pyspark-jupyter
This might take a moment on the first run, but don't panic. This is going to run a Jupyter notebook server on your machine and make it available in your web browser. Pay attention to your terminal -- it will eventually print a link you can click to open the notebook. It probably looks like this:
Or copy and paste one of these URLs:
http://baae0d65a0c5:8888/?token=4854138a9a2fe7b2a7eb7be5e93c2c9a57da42d5f8547cd0
or http://127.0.0.1:8888/?token=4854138a9a2fe7b2a7eb7be5e93c2c9a57da42d5f8547cd0
(If the first link's container hostname doesn't resolve in your browser, use the 127.0.0.1 one.) Get in? Good! Click New > Python 3 notebook. You should be greeted with a series of cells. And now... let's get started with a little bit of PySpark!
First, let's build our SparkSession, and a SparkContext too. Copy these into a cell, and then execute the cell --
from pyspark.context import SparkContext
from pyspark.sql import DataFrame, Row, SparkSession

spark_context = SparkContext.getOrCreate()  # low-level entry point; we'll use it for parallelize() later
spark_session = SparkSession.builder.getOrCreate()  # the DataFrame / SQL entry point
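If you'd like a sanity check that everything is wired up, something like this should work (the exact version string depends on the image you pulled):
print(spark_session.version)   # the Spark version bundled with the image
spark_session.range(5).show()  # a tiny built-in DataFrame -- if this prints a table, the session works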
I'm going to add some common functions I generally wind up using. You can find more that you might like to try in the PySpark API documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
from pyspark.sql.functions import col, collect_set, coalesce, concat, expr, struct, when, desc
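As a quick, purely illustrative taste of a couple of those -- the 'bracket' column name and the age cutoff here are made up:
spark_session.createDataFrame([Row(age=17), Row(age=23)]).select(
    col('age'),
    when(col('age') >= 18, 'adult').otherwise('minor').alias('bracket')  # when() works like a SQL CASE expression
).show()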
I'm not going to cover schema creation in this gist unless someone eyeballs it and requests it, but I will cover creating DataFrames and doing some... stuff with them. Let's try!
First, a sample DataFrame:
my_dataframe = spark_context.parallelize([
Row(name='James', age=23, phone='4419950090', service_group='a'), # Let's get some rows going to test with
Row(name='Alyssa', age=17, phone='4419950091', service_group='a'),
Row(name='Albert', age=41, phone='4419950092', service_group='b'),
Row(name='Danny', age=55, phone='4419950092', service_group='b')
]).toDF() # .toDF() converts the RDD we made with parallelize() into a DataFrame.
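If you want to confirm what got built, printSchema() is handy:
my_dataframe.printSchema()  # Spark inferred these types from the Row values; field order can vary by Spark version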
Let's start slow...
my_dataframe.select(
'name',
'age',
'phone'
).show()
You ought to see output like this --
+------+---+----------+
| name|age| phone|
+------+---+----------+
| James| 23|4419950090|
|Alyssa| 17|4419950091|
|Albert| 41|4419950092|
| Danny| 55|4419950092|
+------+---+----------+
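While we're here, a small aside: filter() and orderBy() pair nicely with the col() and desc() helpers we imported earlier. A purely illustrative one-liner (the age cutoff is made up):
my_dataframe.filter(col('age') >= 18).orderBy(desc('age')).show()  # adults only, oldest first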
Now let's do some fancier... stuff. I'm going to use a .groupBy(). Any time you use a groupBy(), you must aggregate in some way afterward. Here's an example where we gather our users into arrays based on their service groups --
my_dataframe_array_aggregated = my_dataframe.withColumn(
'user_data', # Add a new column to our dataframe called user_data made with struct()!
struct( # struct() creates an accessible object out of the chosen columns
'name',
'age',
'phone'
)
).groupBy(
'service_group' # Now let's group our users by service group,
).agg(
collect_set('user_data').alias('users') # And add them to an array that's assigned to their service group.
)
my_dataframe_array_aggregated.show(truncate=False) # The truncate=False setting makes it so we can see our whole output.
Run it! You should get output that looks like this --
+-------------+---------------------------------------------------+
|service_group|users |
+-------------+---------------------------------------------------+
|b |[{Albert, 41, 4419950092}, {Danny, 55, 4419950092}]|
|a |[{James, 23, 4419950090}, {Alyssa, 17, 4419950091}]|
+-------------+---------------------------------------------------+
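collect_set() is just one way to satisfy the aggregate-after-groupBy rule; any aggregator works. A quick sketch with expr() (same columns as above, output not shown):
my_dataframe.groupBy(
    'service_group'
).agg(
    expr('count(*) as user_count'),  # how many users landed in each group
    expr('avg(age) as avg_age')      # and their average age
).show()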
Now that array-aggregated frame is data ready for saving as JSON to your favorite document / object store!
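For instance, a quick write-out sketch -- the path here is just a guess at somewhere writable inside the container (/home/jovyan is the notebook image's home directory):
my_dataframe_array_aggregated.write.mode('overwrite').json('/home/jovyan/users_by_group')  # one JSON file per partition
Each Spark partition becomes its own part-file inside that directory, which is normal.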
Those are the basics. You should absolutely play around and make up your own sample data!
You can also create RDDs (and DataFrames) from Python dictionaries with spark_context.parallelize() (though inferring a schema from dicts this way is deprecated in favor of Rows), so don't feel like you're stuck manually constructing rows. Try something like this to make a DataFrame:
spark_context.parallelize(
[
{'name': 'James', 'age':23, 'phone': '4419950090', 'service_group': 'a'},
{'name': 'Alyssa', 'age':17, 'phone': '4419950091', 'service_group': 'a'},
{'name': 'Albert', 'age':41, 'phone': '4419950092', 'service_group': 'b'},
{'name': 'Danny', 'age':55, 'phone': '4419950092', 'service_group': 'b'},
]
).toDF()
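If you'd rather skip the RDD hop (and the deprecation warning), spark_session.createDataFrame() accepts the same list of dicts directly -- whether it warns depends on your Spark version:
spark_session.createDataFrame([
    {'name': 'James', 'age': 23, 'phone': '4419950090', 'service_group': 'a'},
    {'name': 'Alyssa', 'age': 17, 'phone': '4419950091', 'service_group': 'a'},
]).show()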
👇 Feel free to ask any questions you might have below! 👇