Skip to content

Instantly share code, notes, and snippets.

View 30lm32's full-sized avatar
😶‍🌫️
Out sick

30lm32

😶‍🌫️
Out sick
  • Micrososft
View GitHub Profile
@30lm32
30lm32 / spark_tips_and_tricks.md
Created April 12, 2020 17:59 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress to use 1 byte unsigned integers, thus decreasing the size of saved DataFrame by a factor of 8.
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (i.e. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
@30lm32
30lm32 / docker-compose.yml
Created August 7, 2018 21:41 — forked from lucasjellema/docker-compose.yml
Docker Compose file for Apache Kafka, the Confluent Platform (4.1.0) - with Kafka Connect, Kafka Manager, Schema Registry and KSQL (1.0) - assuming a Docker Host accessible at 192.168.188.102
version: '2'
services:
zookeeper:
image: "confluentinc/cp-zookeeper:4.1.0"
hostname: zookeeper
ports:
- "2181:2181"
environment:
ZOOKEEPER_CLIENT_PORT: 2181