
@pydemo
Created May 8, 2024 14:11

What are some interview questions with answers on strategies for optimizing complex Spark SQL queries, including code examples?

Each question below is followed by an answer and a short code example.
How would you use the EXPLAIN command to analyze a Spark SQL query? The EXPLAIN command is used to view the execution plan of a SQL query, helping identify performance bottlenecks. Code: spark.sql("SELECT * FROM table").explain()
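A minimal PySpark sketch of this (the `users` table and its contents are assumptions made for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explain-demo").getOrCreate()

# Hypothetical table registered just for this example.
spark.range(1000).withColumnRenamed("id", "user_id").createOrReplaceTempView("users")

query = spark.sql("SELECT user_id FROM users WHERE user_id > 500")

# explain() prints only the physical plan; passing True also prints the
# parsed, analyzed, and optimized logical plans.
query.explain(True)

# Spark 3.0+ also accepts a mode string for a more readable layout.
query.explain(mode="formatted")
```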
Can you optimize a query by changing its physical plan without altering its logic? Yes, by using hints or by altering the configuration to change join strategies or serialization formats. Code: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") to disable broadcast joins.
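As a sketch (the DataFrame and table names here are assumptions), the same effect can be achieved per query with join hints instead of a global setting:

```python
# Hypothetical DataFrames; small_df is assumed to fit in executor memory.
# A DataFrame-level hint asks the planner to broadcast one side of the join.
joined = large_df.join(small_df.hint("broadcast"), "key")

# The equivalent SQL hint; Spark 3.0+ also understands MERGE, SHUFFLE_HASH,
# and SHUFFLE_REPLICATE_NL hints for other join strategies.
spark.sql("""
    SELECT /*+ BROADCAST(s) */ *
    FROM large_table l JOIN small_table s ON l.key = s.key
""")
```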
What is predicate pushdown and how does it optimize Spark SQL queries? Predicate pushdown applies filter predicates at the data source (for example, against Parquet row-group statistics), so rows that cannot match are never read into memory. This cuts I/O and speeds up queries. Code: spark.read.parquet("/data").filter("age > 30").show()
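A short sketch, assuming a Parquet dataset at a hypothetical path, showing how to confirm the pushdown in the plan:

```python
# Filter on a Parquet source: the predicate is handed to the reader, so
# row groups whose min/max statistics exclude age > 30 are skipped.
df = spark.read.parquet("/data/users").filter("age > 30")

# The scan node of the physical plan should list the predicate under
# "PushedFilters", confirming it was applied at the source.
df.explain()
```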
How can broadcasting a DataFrame help in query optimization? Broadcasting sends a small DataFrame to every worker node, so the large side of the join does not need to be shuffled, which greatly reduces join cost. Code: largeDF.join(broadcast(smallDF), "key")
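A PySpark version of the same idea (`orders` and `countries` are hypothetical DataFrames; `countries` is assumed to be a small lookup table):

```python
from pyspark.sql.functions import broadcast

# broadcast() ships the small lookup table to every executor, so the
# large side of the join is never shuffled.
result = orders.join(broadcast(countries), "country_code")

# The physical plan should show a BroadcastHashJoin instead of a
# SortMergeJoin if the hint was honored.
result.explain()
```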
What role does caching play in optimizing Spark SQL queries? Caching stores frequently accessed data in memory, reducing the need to read from disk repeatedly. Useful for iterative algorithms and queries. Code: df.cache()
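A small sketch, assuming a hypothetical events dataset that several queries reuse:

```python
df = spark.read.parquet("/data/events")

df.cache()    # DataFrames default to the MEMORY_AND_DISK storage level
df.count()    # caching is lazy; an action materializes it

# These reuse the cached data instead of re-reading Parquet.
df.filter("event_type = 'click'").count()
df.groupBy("event_type").count().show()

df.unpersist()  # free the memory once the data is no longer needed
```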
How can repartitioning a DataFrame affect performance? Repartitioning redistributes data across partitions; partitioning by a join or aggregation key up front co-locates matching rows, so later wide transformations shuffle less, at the cost of the one shuffle the repartition itself performs. Code: df.repartition(100, $"key")
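A sketch under the assumption that df will next be joined or aggregated on customer_id (names, paths, and partition counts are illustrative):

```python
# One explicit shuffle now, so matching rows are co-located for later
# joins/aggregations on the same key.
df = df.repartition(200, "customer_id")

# coalesce() lowers the partition count without a full shuffle -- handy
# before writing a modest number of output files.
df.coalesce(20).write.mode("overwrite").parquet("/data/out")
```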
How does tuning the number of shuffle partitions affect query performance? Adjusting the number of shuffle partitions can help manage the data processed and shuffled by each task, enhancing parallelism and reducing bottlenecks. Code: spark.conf.set("spark.sql.shuffle.partitions", "200")
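For example (the values are illustrative, not recommendations):

```python
# Default is 200; a rough starting point is total shuffle input size
# divided by a target of roughly 128 MB per partition.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# On Spark 3.x, Adaptive Query Execution can coalesce small shuffle
# partitions at runtime, which often matters more than the static value.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```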
What is the advantage of using DataFrame API over RDD for SQL tasks? DataFrame API is optimized with Catalyst optimizer and Tungsten execution engine, providing better memory management and optimized execution plans than RDDs. Code: df.select("name").where($"age" > 25)
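A side-by-side sketch in PySpark (the toy data is an assumption) of the same query expressed both ways:

```python
people = spark.createDataFrame(
    [("Alice", 31), ("Bob", 22), ("Cara", 45)], ["name", "age"]
)

# DataFrame version: declarative, so Catalyst can prune columns and
# push the filter down.
people.select("name").where(people.age > 25).show()

# RDD version: the lambdas are opaque to Spark, so no such optimization
# is possible.
people.rdd.filter(lambda row: row.age > 25).map(lambda row: row.name).collect()
```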
How would you handle skewed data during joins in Spark? One approach is key salting: append a random salt (0 to N-1) to the join key on the skewed side, replicate the other side once per salt value, and join on the salted key so a hot key's rows are spread over N tasks (see the sketch below). On Spark 3.x, Adaptive Query Execution can also split skewed partitions automatically via spark.sql.adaptive.skewJoin.enabled.
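A minimal salting sketch in PySpark, assuming a large `facts` DataFrame skewed on "key" and a smaller `dims` DataFrame (the names and the salt count are illustrative):

```python
from pyspark.sql import functions as F

N = 10  # number of salt buckets; tune to the degree of skew

# Skewed (large) side: append a random salt 0..N-1 to the key.
facts_salted = (
    facts.withColumn("salt", (F.rand() * N).cast("int"))
         .withColumn("salted_key", F.concat_ws("_", "key", "salt"))
)

# Other side: replicate each row once per salt value so every salted
# key on the large side has a matching partner.
salts = spark.range(N).withColumnRenamed("id", "salt")
dims_salted = (
    dims.crossJoin(salts)
        .withColumn("salted_key", F.concat_ws("_", "key", "salt"))
)

joined = facts_salted.join(dims_salted, "salted_key")
```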
When and why should you use window functions in Spark SQL? Window functions compute values over a range of rows related to the current row (running totals, moving averages, rankings) while keeping every input row, unlike groupBy, which collapses rows. Code: windowSpec = Window.partitionBy("dept").orderBy("salary"); df.withColumn("rank", rank().over(windowSpec))
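A fuller PySpark sketch (the employees data is an assumption) showing a rank and a running total over the same partitioning:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

employees = spark.createDataFrame(
    [("eng", "Alice", 100), ("eng", "Bob", 90), ("hr", "Cara", 80)],
    ["dept", "name", "salary"],
)

# Rank within each department, highest salary first.
by_salary = Window.partitionBy("dept").orderBy(F.col("salary").desc())
ranked = employees.withColumn("rank", F.rank().over(by_salary))

# Running total per department in ascending salary order.
running = (
    Window.partitionBy("dept").orderBy("salary")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
ranked.withColumn("running_total", F.sum("salary").over(running)).show()
```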