Udemy Apache Spark Course Notes

Udemy - "Taming Big Data with Apache Spark 3 and Python - Hands On!" Course Notes

Introduction to Spark

  • According to Apache, Spark is "a fast and general engine for large-scale data processing"
  • Since it runs on a cluster, it is very scalable
  • It has a built in cluster manager, but it can also run on top of a Hadoop cluster, which would then use YARN
  • According to Apache, Spark can "run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk"
    • This may be hyperbolic; the instructor has seen speed improvements of 3-4x
  • DAG (directed acyclic graph) engine optimizes workflows
  • Very popular in industry
  • Can code in Python, Java, or Scala
  • Built around one main concept: the Resilient Distributed Dataset (RDD)
  • Components of Spark
    • Spark Core
      • Spark Streaming
      • Spark SQL
      • MLLib
      • GraphX
  • Why Python?
    • No need to compile, manage dependencies, etc.
    • Less coding overhead
    • Many people already know Python
  • But...
    • Scala is probably a more popular choice with Spark
    • Spark is built in Scala, so coding in Scala is "native" to Spark
    • New features, libraries tend to be Scala first
  • Python code in Spark looks a lot like Scala code

The Resilient Distributed Dataset (RDD)

Spark Context

  • Created by your driver program
  • Is responsible for making RDDs resilient and distributed
  • Creates RDDs
  • The Spark shell creates a 'sc' object for you
  • Can create RDDs from:
    • Text files
    • JDBC
    • Cassandra
    • HBase
    • Elasticsearch
    • JSON, CSV, sequence files, object files, various compressed formats

Transforming RDDs

  • Most common transformations:
    • map: allows you to take a set of data and transform it to another set of data using a function that operates on the data
      • Has a 1-to-1 relationship between input elements and output elements (i.e. the new RDD has the same number of elements as the original)
    • flatMap: similar to map, but it can produce zero, one, or many output elements for every input element
    • filter: can filter out information you don't need
      • Takes a function that returns a boolean; only elements for which it returns true are kept
    • distinct: gets the unique values in an RDD
    • sample: pulls a random sample from an RDD
    • union, intersection, subtract, cartesian: operations that manipulate multiple RDDs
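
The element-level behavior of several of these transformations can be sketched in plain Python, with no cluster needed. This is only a model of the semantics on local lists, not Spark itself: the `readings` and `other` data are made up, and each comment names the PySpark RDD method the line stands in for. Spark does not guarantee output order, so the sorting here is just for readability.

```python
# Plain-Python model of some RDD transformations (not Spark itself).
readings = [3, 8, 3, 15, 8, 42]   # hypothetical data standing in for an RDD
other = [8, 42, 99]               # a second hypothetical "RDD"

# filter: keep elements for which the function returns True
# (PySpark: rdd.filter(lambda x: x > 5))
large = [x for x in readings if x > 5]

# distinct: unique values (PySpark: rdd.distinct())
unique = sorted(set(readings))

# union: concatenates and keeps duplicates (PySpark: rdd.union(other_rdd))
combined = readings + other

# intersection: de-duplicated common values (PySpark: rdd.intersection(other_rdd))
common = sorted(set(readings) & set(other))

# subtract: elements of the first that are absent from the second
# (PySpark: rdd.subtract(other_rdd))
other_set = set(other)
remaining = [x for x in readings if x not in other_set]

# cartesian: every pairing of elements (PySpark: rdd.cartesian(other_rdd))
pairs = [(a, b) for a in readings for b in other]
```

Note that cartesian grows multiplicatively (6 x 3 = 18 pairs here), which is why it gets expensive quickly on real datasets.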

RDD Actions

  • Most common actions:
    • collect: returns all values in an RDD to the driver program (e.g. so you can print them)
    • count: counts the number of items in an RDD
    • countByValue: counts how many times each unique value appears
    • take: returns the first few values from an RDD
    • top: similar to take, but it returns the largest values from an RDD
    • reduce: lets you write a function that combines all the values in an RDD into a single result
  • Nothing actually happens in your driver program until an action is called (Lazy Evaluation)
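
As a rough local illustration of what these actions return, here is a plain-Python model. It is not Spark code; the `values` list is invented, and the comments name the PySpark action each line imitates. The second half uses a generator to mimic lazy evaluation: the work only happens when something consumes the result.

```python
from collections import Counter
from functools import reduce
import heapq

values = [5, 3, 9, 3, 7]          # hypothetical contents of an RDD

count = len(values)                            # rdd.count()
by_value = Counter(values)                     # rdd.countByValue()
first_two = values[:2]                         # rdd.take(2)
top_two = heapq.nlargest(2, values)            # rdd.top(2)
total = reduce(lambda x, y: x + y, values)     # rdd.reduce(lambda x, y: x + y)

# Lazy evaluation modeled with a generator: nothing runs until it is consumed.
calls = []

def double(x):
    calls.append(x)        # record that the function actually ran
    return 2 * x

pipeline = (double(x) for x in values)   # like a transformation: no work yet
ran_before_action = len(calls)           # double() has not been called so far
result = list(pipeline)                  # like collect(): forces evaluation
```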

Key/Value Data

  • RDDs can hold key/value pairs
  • Simple one-line function to map pairs of data into the RDD
    • For example, to map all the keys with a value of 1 (for the purpose of counting by key perhaps) you could use the following statement:
      • keyValuePairs = rdd.map(lambda x: (x, 1))
    • It's ok to have lists as values as well
  • Key/Value Functions:
    • reduceByKey(): combines values with the same key using some function
      • rdd.reduceByKey(lambda x, y: x + y) adds up all values by key
    • groupByKey(): groups values into a list by key
    • sortByKey(): sorts RDD by key values
    • keys(), values(): creates an RDD of just the keys or just the values
  • You can do SQL-style joins on two key/value RDDs
    • join, rightOuterJoin, leftOuterJoin, cogroup, subtractByKey
  • With key/value data, use mapValues() or flatMapValues() if your transformation doesn't affect the keys
    • It's more efficient because it allows Spark to maintain the partitioning from your original RDD

Map vs Flatmap

  • map will transform each element of an RDD into exactly one new element in a new RDD, i.e. it has a 1-to-1 relationship between input and output elements
    • Example: capitalizing every character in a line of text
  • flatMap will transform each element of an RDD into zero, one, or many elements
    • Example: breaking up a line of text into individual words
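
The two examples above can be mimicked in plain Python to make the difference in element counts concrete. This is a local sketch, not Spark; the `lines` data is made up and the comments give the PySpark equivalents.

```python
lines = ["the quick brown fox", "", "jumps over"]   # hypothetical input lines

# map: exactly one output element per input element
# (PySpark: lines_rdd.map(lambda line: line.upper()))
upper = [line.upper() for line in lines]

# flatMap: each input may yield zero, one, or many output elements
# (PySpark: lines_rdd.flatMap(lambda line: line.split()))
words = [word for line in lines for word in line.split()]
```

Three lines go in and three come out of the map step, while the flatMap step turns the same three lines into six words, with the empty line contributing none.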

Broadcast Variables

  • Spark allows you to send information one time to all the executor nodes in your cluster and keep it there
    • This would be useful for lookup tables or anything that needs to be referenced within your script, especially if the script is massive
  • This is called broadcasting & is as simple as sc.broadcast()
    • You then use .value to get the object back

Breadth-First Search

  • A way to traverse through nodes in a graph, searching all neighbor nodes emanating from a root node and working outward
  • An accumulator allows many executors to increment a shared variable across a cluster, basically a counter that is maintained and synchronized across all the nodes in your cluster
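
For reference, a single-machine breadth-first search looks like the sketch below. The graph and the function name `bfs_distance` are invented for illustration. The local `nodes_visited` counter only stands in for the role an accumulator would play; on a cluster an ordinary variable would not be shared between executors, which is why Spark provides accumulators (created in PySpark with `sc.accumulator(0)`).

```python
from collections import deque

# Hypothetical undirected graph as an adjacency list
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def bfs_distance(graph, start, target):
    """Return (hops from start to target, nodes visited); hops is None if unreachable."""
    visited = {start}
    queue = deque([(start, 0)])
    nodes_visited = 0              # stands in for a Spark accumulator
    while queue:
        node, dist = queue.popleft()
        nodes_visited += 1
        if node == target:
            return dist, nodes_visited
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None, nodes_visited

hops, seen = bfs_distance(graph, "A", "E")
```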

Collaborative Filtering

  • Example: find similarities between movies based on user ratings
    • Find every pair of movies that were watched by the same person
    • Measure the similarity of their ratings across all users who watched both
    • Sort by movie, then by similarity strength
  • In this example, you would query the final RDD of movie similarities multiple times
    • Any time you will perform more than one action on an RDD, you should cache it
    • Otherwise, Spark might re-evaluate the entire RDD all over again
    • Use .cache() or .persist() to do this
      • Difference: .persist() optionally lets you cache it to disk instead of memory, just in case a node fails
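
The three steps of the movie-similarity example can be sketched locally as follows. This is plain Python on a tiny invented ratings list, not the course's Spark script, and cosine similarity is an assumption here since the notes do not name the metric. In Spark, the final `similarities` result is the RDD you would call .cache() on before querying it repeatedly.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical (user, movie, rating) triples
ratings = [
    ("u1", "A", 5.0), ("u1", "B", 4.0), ("u1", "C", 1.0),
    ("u2", "A", 4.0), ("u2", "B", 5.0),
    ("u3", "A", 2.0), ("u3", "B", 2.0), ("u3", "C", 5.0),
]

# Step 1: group ratings by user
by_user = defaultdict(list)
for user, movie, rating in ratings:
    by_user[user].append((movie, rating))

# Step 2: for each user, emit every pair of movies that user rated
pair_ratings = defaultdict(list)
for movies in by_user.values():
    for (m1, r1), (m2, r2) in combinations(sorted(movies), 2):
        pair_ratings[(m1, m2)].append((r1, r2))

# Step 3: cosine similarity of the two rating vectors for each movie pair
def cosine(pairs):
    dot = sum(x * y for x, y in pairs)
    norm_x = sqrt(sum(x * x for x, _ in pairs))
    norm_y = sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Keep the number of co-raters too, as a rough measure of confidence
similarities = {pair: (cosine(p), len(p)) for pair, p in pair_ratings.items()}

# Step 4: sort by first movie, then by similarity strength (strongest first)
ranked = sorted(similarities.items(), key=lambda kv: (kv[0][0], -kv[1][0]))
```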

Elastic MapReduce

  • Amazon service
  • Quick and easy way to rent time on a cluster
  • Sets up a default Spark configuration for you on top of Hadoop's YARN cluster manager
  • Spark also has a built-in standalone cluster manager and scripts to set up its own EC2-based cluster
    • But the AWS console is even easier
  • Spark on EMR isn't really expensive, but it's not free
    • Unlike MapReduce with MRJob, Spark uses m3.xlarge instances
    • You have to remember to shut down your clusters when you're done, or you'll be charged for that time
    • You'll also want to make sure things run locally on a subset of your data first
      • Can use RDD operations such as top or sample to create smaller datasets to test on
  • You still need to know your data to run Spark jobs efficiently, it's not magic
    • Spark does not automatically distribute your data in the best way on its own
  • Use .partitionBy() on an RDD before running a large operation that benefits from partitioning
    • cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup() preserve your partitioning in their result
      • With any large RDDs, you should check to see if you have partitioned the RDD before calling any of these methods
  • Choosing a partition size
    • Too few partitions won't take full advantage of your cluster
    • Too many partitions result in too much overhead from shuffling data
    • You should use at least as many partitions as you have cores, or executors that fit within your available memory
    • partitionBy(100) is usually a reasonable place to start for large operations
  • If you use an empty SparkConf when using EMR, it will use all the defaults as set up by AWS. You can override these defaults with command line arguments
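
To see why partitioning by key helps the operations listed above, here is a small plain-Python model of hash partitioning. It is a sketch, not Spark's implementation: the `partition_by` helper and the `events` data are hypothetical, and it uses Python's built-in `hash` on integer keys, whereas PySpark's `rdd.partitionBy(numPartitions)` uses its own portable hash function by default.

```python
from collections import defaultdict

def partition_by(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing its key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Hypothetical (user_id, event) pairs; small integer keys hash to themselves in CPython
events = [(1, "click"), (2, "view"), (5, "click"), (1, "buy"), (6, "view")]
parts = partition_by(events, 4)
```

Every record with the same key lands in the same partition, so a later per-key operation such as reduceByKey() or join() can work within a partition instead of moving data between executors again.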

  • If you're running on your own cluster (i.e. not on AWS), logs are shown on the web UI
  • In YARN though, the logs are distributed - you need to collect them after the fact using yarn logs -applicationId <app ID>
  • While your driver script runs, it will log errors like executors failing to issue heartbeats
    • This generally means you are asking too much of each executor
    • You may need more executors, i.e. more machines on your cluster
    • Each executor may need more memory
    • Or use partitionBy() to demand less work from individual executors by using smaller partitions

Managing Dependencies

  • Executors aren't necessarily on the same box as your driver script
  • Use broadcast variables to share data outside of RDDs
  • Need a Python package that's not pre-loaded on EMR?
    • Set up a step in EMR to run pip for what you need on each worker machine
    • Or use --py-files with spark-submit to add individual libraries that are on the master
    • Try to avoid using obscure packages you don't need in the first place - time is money on an EMR cluster (and your own cluster for that matter)


Spark SQL

  • Extends RDDs to a "DataFrame" object
  • DataFrames:
    • Contain Row objects
    • Can run SQL queries
    • Have a schema (leading to more efficient storage)
    • Read & write to JSON, Hive, Parquet
    • Communicate with JDBC/ODBC, Tableau
  • In Spark 2.0, a DataFrame is really a Dataset of Row objects
  • Datasets can wrap known, typed data too. But this is mostly transparent in Python, since Python is dynamically typed (more important in Java & Scala)
  • It doesn't matter as much with Python, but the Spark 2.0 way is to use Datasets instead of DataFrames when you can

User Defined Functions (UDFs)

  • You can extend the SQL syntax yourself to do specialized operations

  • Example of a UDF to square a number:

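    from pyspark.sql.types import IntegerType

    # Register a Python lambda as the SQL function "square", returning an integer
    hiveCtx.registerFunction("square", lambda x: x*x, IntegerType())

    # Pass the column name unquoted; in quotes it would be treated as a string literal
    df = hiveCtx.sql("SELECT square(someNumericField) FROM tableName")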
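
The idea of registering a Python function so SQL can call it can be tried without a cluster using `sqlite3` from the Python standard library. This is an analogy only, not Spark: the table contents are invented, and `create_function` is SQLite's mechanism rather than Spark's. In more recent PySpark versions the corresponding registration is, as far as I know, `spark.udf.register("square", lambda x: x * x, IntegerType())` on a SparkSession.

```python
import sqlite3

# In-memory database with a small hypothetical table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableName (someNumericField INTEGER)")
conn.executemany("INSERT INTO tableName VALUES (?)", [(2,), (3,), (4,)])

# Register a Python function under the SQL name "square" (taking 1 argument)
conn.create_function("square", 1, lambda x: x * x)

# The SQL query can now call the Python function on a column
rows = conn.execute(
    "SELECT square(someNumericField) FROM tableName ORDER BY someNumericField"
).fetchall()
squares = [row[0] for row in rows]
conn.close()
```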

MLLib: Spark's Machine Learning Library


  • Feature extraction
    • Term Frequency/Inverse Document Frequency useful for search
  • Basic statistics
    • Chi-squared test, Pearson or Spearman correlation, min, max, mean, variance
  • Linear regression, logistic regression
  • Support Vector Machines
  • Naive Bayes classifier
  • Decision trees
  • K-means clustering
  • Principal component analysis, singular value decomposition
  • Recommendations using Alternating Least Squares (winner of Netflix prize)

Special MLLib Data Types

  • Vector (dense or sparse)
  • LabeledPoint
  • Rating

MLLib Considerations

  • ALS is very sensitive to the parameters chosen. It takes more work to find optimal parameters for a dataset than to run the recommendations
    • Can use "train/test" to evaluate various permutations of parameters
      • But this could lead to overfitting
  • No way to be sure the algorithm is working properly internally
    • Don't always want to just put your faith in a black box
    • Complicated isn't always better
  • Never blindly trust results when analyzing big data
    • Small problems in algorithms become big ones
    • Very often the quality of your input data is the real issue

Spark Streaming

  • Analyzes continual streams of data
    • Common example: processing log data from a website or server
  • Data is aggregated and analyzed at some given interval
  • Can take data fed to a port, Amazon Kinesis, HDFS, Kafka, Flume, and others
  • "Checkpointing" stores state to disk periodically for fault tolerance
  • A "DStream" object breaks up the stream into distinct RDDs
    • Not technically "real-time" - it uses these DStream microbatches
      • A common batch increment is 1 second
  • RDDs only contain one little chunk of incoming data
  • "Windowed operations" can combine results from multiple batches over some sliding time window
    • window(), reduceByWindow(), reduceByKeyAndWindow()
  • updateStateByKey()
    • Lets you maintain a state across many batches as time goes on
    • For example, running counts of some event
    • See example in Spark SDK
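
The difference between per-batch results, windowed results, and long-lived state can be modeled locally as below. This is a plain-Python sketch and not the Spark Streaming API: the `batches` of HTTP status codes and the window length of 2 are assumptions chosen for illustration.

```python
from collections import Counter, deque

# Hypothetical micro-batches: each inner list holds the status codes seen in one interval
batches = [
    [200, 200, 404],
    [200, 500],
    [404, 404, 200],
    [200],
]

window_length = 2                      # combine the last 2 batches (assumed value)
window = deque(maxlen=window_length)   # older batches fall out automatically
running = Counter()                    # state kept across all batches
windowed_history = []

for batch in batches:
    batch_counts = Counter(batch)      # per-batch work on one small chunk of data
    # Windowed operation: sum the counts over the batches currently in the window
    window.append(batch_counts)
    windowed = sum(window, Counter())
    windowed_history.append(dict(windowed))
    # updateStateByKey-style: fold this batch into the running counts
    running.update(batch_counts)
```

After the last batch, the window only reflects the two most recent intervals, while `running` still remembers the single 500 error from the second batch.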

Structured Streaming

  • Instead of using DStreams and discrete RDDs, structured streaming models the stream as a DataFrame that just keeps expanding
  • Streaming code ends up looking a lot like non-streaming code
  • DataFrames provide interoperability with MLLib


GraphX

  • This deals with graphs in the computer science/network sense, e.g. a graph of a social network
  • Currently (as of the time of the recording) it is Scala only
  • It is only really useful for specific applications
    • It can measure things like "connectedness", degree distribution, average path length, triangle counts - high level measures of a graph
    • It can also join graphs together and transform graphs quickly
  • It introduces a couple of new datatypes under the hood: VertexRDD & EdgeRDD
  • Otherwise, GraphX code looks like any other Spark code for the most part
