Udemy Apache Spark Course Notes

Udemy - "Taming Big Data with Apache Spark 3 and Python - Hands On!" Course Notes

Introduction to Spark

  • According to Apache, Spark is "a fast and general engine for large-scale data processing"
  • Since it runs on a cluster, it is very scalable
  • It has a built-in cluster manager, but it can also run on top of a Hadoop cluster, in which case it uses YARN
  • According to Apache, Spark can "run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk"
    • This may be hyperbolic; the instructor has seen speed improvements of 3-4x
  • DAG (directed acyclic graph) engine optimizes workflows
  • Very popular in industry
  • Can code in Python, Java, or Scala
  • Built around one main concept: the Resilient Distributed Dataset (RDD)
  • Components of Spark
    • Spark Core
      • Spark Streaming
      • Spark SQL
      • MLLib
      • GraphX
  • Why Python?
    • No need to compile, manage dependencies, etc.
    • Less coding overhead
    • Many people already know Python
  • But...
    • Scala is probably a more popular choice with Spark
    • Spark is built in Scala, so coding in Scala is "native" to Spark
    • New features, libraries tend to be Scala first
  • Python code in Spark looks a lot like Scala code

The Resilient Distributed Dataset (RDD)

Spark Context

  • Created by your driver program
  • Is responsible for making RDDs resilient and distributed
  • Creates RDDs
  • The Spark shell creates a 'sc' object for you
  • Can create RDDs from:
    • Text files
    • JDBC
    • Cassandra
    • HBase
    • Elasticsearch
    • JSON, CSV, sequence files, object files, various compressed formats
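  • A minimal sketch of creating your own SparkContext and an RDD from a text file (the file name here is just a placeholder):

    from pyspark import SparkConf, SparkContext

    # outside the Spark shell you create the sc object yourself
    conf = SparkConf().setMaster("local").setAppName("MyApp")
    sc = SparkContext(conf=conf)

    lines = sc.textFile("lines.txt")  # RDD with one string element per line of the file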

Transforming RDDs

  • Most common transformations:
    • map: allows you to take a set of data and transform it to another set of data using a function that operates on the data
      • Has a 1-to-1 relationship between input elements and output elements (i.e. you'll get the same number of elements before and after the map operation)
    • flatmap: similar to map, but it can produce multiple (or zero) output elements for every input element
    • filter: can filter out information you don't need
      • Takes a function and returns a boolean
    • distinct: gets the unique values in an RDD
    • sample: pulls a random sample from an RDD
    • union, intersection, subtract, cartesian: operations that manipulate multiple RDDs
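  • A minimal sketch of these transformations (assumes an existing SparkContext named sc, e.g. from the Spark shell):

    nums = sc.parallelize([1, 2, 3, 4, 4])

    squares = nums.map(lambda x: x * x)          # 1-to-1: same number of elements out as in
    evens = nums.filter(lambda x: x % 2 == 0)    # keeps elements where the function returns True
    unique = nums.distinct()                     # unique values only
    plusMinus = nums.flatMap(lambda x: [x, -x])  # can emit zero or more elements per input element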

RDD Actions

  • Most common actions:
    • collect: print out all values in an RDD
    • count: counts the number of items in an RDD
    • countByValue: returns a count of how many times each unique value appears in the RDD
    • take: lets you sample a few values from an RDD
    • top: similar to take, but it returns the largest values in an RDD
    • reduce: lets you write a function that combines all the values in an RDD into a single result (reduceByKey is the per-key version)
  • Nothing actually happens in your driver program until an action is called (Lazy Evaluation)
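  • A minimal sketch of these actions (assumes nums = sc.parallelize([1, 2, 3, 4, 4]) from the sketch above):

    nums.collect()       # returns every element to the driver: [1, 2, 3, 4, 4]
    nums.count()         # 5
    nums.countByValue()  # {1: 1, 2: 1, 3: 1, 4: 2}
    nums.take(2)         # [1, 2]
    nums.top(2)          # [4, 4]
    nums.reduce(lambda x, y: x + y)  # 14 - nothing above runs until one of these actions is called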

Key/Value Data

  • RDDs can hold key/value pairs
  • Simple one-line function to map pairs of data into the RDD
    • For example, to map all the keys with a value of 1 (for the purpose of counting by key perhaps) you could use the following statement:
      • keyValuePairs = rdd.map(lambda x: (x, 1))
    • It's ok to have lists as values as well
  • Key/Value Functions:
    • reduceByKey(): combines values with the same key using some function
      • rdd.reduceByKey(lambda x, y: x + y) adds up all values by key
    • groupByKey(): groups values into a list by key
    • sortByKey(): sorts RDD by key values
    • keys(), values(): creates an RDD of just the keys or just the values
  • You can do SQL-style joins on two key/value RDDs
    • join, rightOuterJoin, leftOuterJoin, cogroup, subtractByKey
  • With key/value data, use mapValues() or flatMapValues() if your transformation doesn't affect the keys
    • It's more efficient because it allows Spark to maintain the partitioning from your original RDD
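  • A minimal sketch of a per-key average using these functions (the (age, numFriends) pairs are made up):

    pairs = sc.parallelize([(33, 100), (33, 300), (40, 250)])

    # mapValues keeps the keys (and the partitioning) intact
    totalsByKey = pairs.mapValues(lambda v: (v, 1)) \
                       .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    averagesByKey = totalsByKey.mapValues(lambda t: t[0] / t[1])
    averagesByKey.collect()  # [(33, 200.0), (40, 250.0)]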

Map vs Flatmap

  • map will transform each element of an RDD into one new element in a new RDD, i.e. it has a 1-to-1 relationship between input and output elements
    • Example: capitalizing every character in a line of text
  • flatmap can transform each element of an RDD into zero, one, or many elements in the new RDD
    • Example: breaking up a line of text into individual words
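  • A small sketch showing the difference (assumes an existing SparkContext sc):

    lines = sc.parallelize(["the quick brown fox", "jumped over"])

    upper = lines.map(lambda line: line.upper())      # 2 lines in, 2 lines out
    words = lines.flatMap(lambda line: line.split())  # 2 lines in, 6 words out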

Broadcast Variables

  • Spark allows you to send information one time to all the executor nodes in your cluster and keep it there
    • This would be useful for lookup tables or anything that needs to be referenced within your script, especially if that object is large
  • This is called broadcasting & is as simple as sc.broadcast()
    • You then use .value to get the object back
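  • A minimal broadcast sketch (the movie-name lookup table here is made up):

    movieNames = {1: "Toy Story", 2: "GoldenEye"}
    nameDict = sc.broadcast(movieNames)

    ratings = sc.parallelize([(1, 4.0), (2, 3.5)])
    named = ratings.map(lambda r: (nameDict.value[r[0]], r[1]))  # read the broadcast with .value
    named.collect()  # [('Toy Story', 4.0), ('GoldenEye', 3.5)]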

Breadth-First Search

  • A way to traverse through nodes in a graph, searching all neighbor nodes emanating from a root node and working outward
  • An accumulator allows many executors to increment a shared variable across a cluster, basically a counter that is maintained and synchronized across all the nodes in your cluster
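  • A minimal accumulator sketch (not the full BFS example; the target value is made up):

    hitCounter = sc.accumulator(0)
    targetID = 14

    def visit(nodeID):
        if nodeID == targetID:
            hitCounter.add(1)  # incremented on the executors, readable on the driver

    sc.parallelize(range(100)).foreach(visit)
    print(hitCounter.value)  # 1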

Collaborative Filtering

  • Example: find similarities between movies based on user ratings
    • Find every pair of movies that were watched by the same person
    • Measure the similarity of their ratings across all users who watched both
    • Sort by movie, then by similarity strength
  • In this example, you would query the final RDD of movie similarities multiple times
    • Any time you will perform more than one action on an RDD, you should cache it
    • Otherwise, Spark might re-evaluate the entire RDD all over again
    • Use .cache() or .persist() to do this
      • Difference: .persist() optionally lets you cache it to disk instead of memory, just in case a node fails
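  • A minimal caching sketch (the RDD here is a stand-in for the movie-similarity RDD):

    from pyspark import StorageLevel  # only needed for the persist() alternative below

    results = sc.parallelize(range(1000)).map(lambda x: x * x).cache()

    results.count()    # first action: evaluates the RDD and caches it in memory
    results.take(10)   # second action: served from the cache, not recomputed
    # results.persist(StorageLevel.MEMORY_AND_DISK) would also allow spilling to disk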

Elastic MapReduce

  • Amazon service
  • Quick and easy way to rent time on a cluster
  • Sets up a default Spark configuration for you on top of Hadoop's YARN cluster manager
  • Spark also has a built-in standalone cluster manager and scripts to set up its own EC2-based cluster
    • But the AWS console is even easier
  • Spark on EMR isn't really expensive, but it's not free
    • Unlike MapReduce with MRJob, Spark uses m3.xlarge instances
    • You have to remember to shut down your clusters when you're done, or you'll be charged for that time
    • You'll also want to make sure things run locally on a subset of your data first
      • Can use RDD operations such as top or sample to create smaller datasets to test on
  • You still need to know your data to run Spark jobs efficiently, it's not magic
    • Spark does not distribute automatically on its own
  • Use .partitionBy() on an RDD before running a large operation that benefits from partitioning (see the sketch at the end of this list)
    • cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup() preserve your partitioning in their result
      • With any large RDDs, you should check to see if you have partitioned the RDD before calling any of these methods
  • Choosing a partition size
    • Too few partitions won't take full advantage of your cluster
    • Too many partitions results in too much overhead from shuffling data
    • You should use at least as many partitions as you have cores, or executors that fit within your available memory
    • partitionBy(100) is usually a reasonable place to start for large operations
  • If you use an empty SparkConf when using EMR, it will use all the defaults as set up by AWS. You can override these defaults with command line arguments
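  • A minimal partitionBy() sketch for the points above (the ratings data is made up):

    ratingsByUser = sc.parallelize([(1, ("A", 5.0)), (1, ("B", 3.0)), (2, ("A", 4.0))])

    partitioned = ratingsByUser.partitionBy(100)  # 100 partitions is a reasonable starting point
    joined = partitioned.join(partitioned)        # join() preserves the partitioning in its result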

Troubleshooting Cluster Jobs

Logs

  • If you're running on your own cluster (i.e. not on AWS), logs are shown on the web UI
  • In YARN though, the logs are distributed - you need to collect them after the fact using yarn logs -applicationId <app ID>
  • While your driver script runs, it will log errors like executors failing to issue heartbeats
    • This generally means you are asking too much of each executor
    • You may need more executors, i.e. more machines on your cluster
    • Each executor may need more memory
    • Or use partitionBy() to demand less work from individual executors by using smaller partitions

Managing Dependencies

  • Executors aren't necessarily on the same box as your driver script
  • Use broadcast variables to share data outside of RDDs
  • Need a Python package that's not pre-loaded on EMR?
    • Set up a step in EMR to run pip for what you need on each worker machine
    • Or use --py-files with spark-submit to add individual libraries that are on the master
    • Try to avoid using obscure packages you don't need in the first place - time is money on an EMR cluster (and your own cluster for that matter)

SparkSQL

  • Extends RDDs to a "DataFrame" object
  • DataFrames:
    • Contain Row objects
    • Can run SQL queries
    • Have a schema (leading to more efficient storage)
    • Read & write to JSON, Hive, Parquet
    • Communicate with JDBC/ODBC, Tableau
  • In Spark 2.0, a DataFrame is really a DataSet of Row objects
  • DataSets can wrap known, typed data too. But this is mostly transparent in Python, since Python is untyped (more important in Java & Scala)
  • It doesn't matter as much with Python, but the Spark 2.0 way is to use DataSets instead of DataFrames when you can
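  • A minimal SparkSQL sketch (the people.json file and its fields are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

    people = spark.read.json("people.json")  # DataFrame with a schema inferred from the JSON
    people.createOrReplaceTempView("people")

    adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()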

User Defined Functions (UDFs)

  • You can extend the SQL syntax yourself to do specialized operations

  • Example of a UDF to square a number:

    from pyspark.sql.types import IntegerType

    hiveCtx.registerFunction("square", lambda x: x * x, IntegerType())

    df = hiveCtx.sql("SELECT square(someNumericField) FROM tableName")

MLLib: Spark's Machine Learning Library

Capabilities

  • Feature extraction
    • Term Frequency/Inverse Document Frequency useful for search
  • Basic statistics
    • Chi-squared test, Pearson or Spearman correlation, min, max, mean, variance
  • Linear regression, logistic regression
  • Support Vector Machines
  • Naive Bayes classifier
  • Decision trees
  • K-means clustering
  • Principal component analysis, singular value decomposition
  • Recommendations using Alternating Least Squares (a matrix factorization approach popularized by the Netflix Prize)

Special MLLib Data Types

  • Vector (dense or sparse)
  • LabeledPoint
  • Rating
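  • A minimal sketch using the Rating type with MLLib's ALS recommender (assumes an existing SparkContext sc; the ratings are made up):

    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize([Rating(1, 101, 5.0), Rating(1, 102, 3.0), Rating(2, 101, 4.0)])
    model = ALS.train(ratings, rank=10, iterations=5)  # rank & iterations are the sensitive parameters
    model.recommendProducts(1, 2)                      # top 2 product recommendations for user 1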

MLLib Considerations

  • ALS is very sensitive to the parameters chosen. It takes more work to find optimal parameters for a dataset than to run the recommendations
    • Can use "train/test" to evaluate various permutations of parameters
      • But this could lead to overfitting
  • No way to be sure the algorithm is working properly internally
    • Don't always want to just put your faith in a black box
    • Complicated isn't always better
  • Never blindly trust results when analyzing big data
    • Small problems in algorithms become big ones
    • Very often the quality of your input data is the real issue

Spark Streaming

  • Analyzes continual streams of data
    • Common example: processing log data from a website or server
  • Data is aggregated and analyzed at some given interval
  • Can take data fed to a port, Amazon Kinesis, HDFS, Kafka, Flume, and others
  • "Checkpointing" stores state to disk periodically for fault tolerance
  • A "Dstream" object breaks up the stream into distinct RDDs
    • Not technically "real-time" - it uses these Dstream microbatches
      • A common batch increment is 1 second
  • RDDs only contain one little chunk of incoming data
  • "Windowed operations" cancombine results from mlutiple batches over some sliding time window
    • window(), reduceByWindow(), reduceByKeyAndWindow()
  • updateStateByKey()
    • Lets you maintain a state across many batches as time goes on
    • For example, running counts of some event
    • See stateful_network_wordcount.py example in Spark SDK
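  • A minimal DStream sketch: word counts from a socket in 1-second batches (assumes an existing SparkContext sc and something writing text to localhost:9999, e.g. nc -lk 9999):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 1)  # 1-second batch interval
    lines = ssc.socketTextStream("localhost", 9999)

    counts = lines.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()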

Structured Streaming

  • Instead of using DStreams and discrete RDDs, structured streaming models the stream as a DataFrame that just keeps expanding
  • Streaming code ends up looking a lot like non-streaming code
  • DataFrames provide interoperability with MLLib
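  • The same word count as a minimal Structured Streaming sketch (assumes an existing SparkSession named spark and text arriving on localhost:9999):

    from pyspark.sql.functions import explode, split

    lines = spark.readStream.format("socket") \
        .option("host", "localhost").option("port", 9999).load()

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()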

GraphX

  • This deals with graphs in the computer science/network sense, e.g. a graph of a social network
  • Currently (as of the time of the recording) it is Scala only
  • It is only really useful for specific applications
    • It can measure things like "connectedness", degree distribution, average path length, triangle counts - high level measures of a graph
    • It can also join graphs together and transform graphs quickly
  • It introduces a couple of new datatypes under the hood: VertexRDD & EdgeRDD
  • Otherwise, GraphX code looks like any other Spark code for the most part