@cordje
cordje / spark_tips_and_tricks.md
Created October 9, 2020 11:58 — forked from dusenberrymw/spark_tips_and_tricks.md
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8 (relative to 8-byte integers).
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
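The sizing arithmetic behind the two partitioning tips above can be sketched in plain Java. This is a minimal illustration, not part of the original notes: the class name `PartitionSizing`, both helper methods, and the sample row counts and row sizes are hypothetical, and the ~128MB target is the empirical figure quoted above rather than a Spark constant.

```java
public class PartitionSizing {
    // Target partition size from the tip above (~128MB); an empirical value, not a Spark setting.
    static final long TARGET_PARTITION_BYTES = 128L * 1024 * 1024;

    /** Smallest partition count that keeps each partition at or below the target size. */
    static int partitionsFor(long estimatedBytes) {
        if (estimatedBytes <= 0) {
            return 1;
        }
        // Ceiling division, so a partial final partition still gets counted.
        long n = (estimatedBytes + TARGET_PARTITION_BYTES - 1) / TARGET_PARTITION_BYTES;
        return (int) Math.min(n, Integer.MAX_VALUE);
    }

    /** Rough size after an expanding op such as flatMap: output rows times bytes per output row. */
    static long estimateExpandedBytes(long inputRows, double rowsOutPerRowIn, long bytesPerOutputRow) {
        return (long) Math.ceil(inputRows * rowsOutPerRowIn) * bytesPerOutputRow;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 1M input rows, each expanding to 10 rows of ~4 KB.
        long bytes = estimateExpandedBytes(1_000_000L, 10.0, 4_096L);
        System.out.println("Suggested partitions: " + partitionsFor(bytes));
    }
}
```

The resulting count is the kind of value one would pass to `repartition(n)` on the flatMap output. Since the advice is to err on the higher side, rounding the result up further is reasonable.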
cordje / java-8-ami.md
Created October 14, 2016 13:16 — forked from rtfpessoa/java-8-ami.md
[Guide] Install Oracle Java (JDK) 8 on Amazon EC2 Ami
cordje / DamerauLevenshteinDistanceWithThreshold.java
Created January 28, 2016 13:32
Damerau Levenshtein Distance With Threshold
public class DamerauLevenshteinDistanceWithThreshold {
    public static int distance(String source, String target, int threshold) {
        // This code was ported to Java from http://stackoverflow.com/questions/9453731/how-to-calculate-distance-similarity-measure-of-given-2-strings/9454016#9454016
        int length1 = source.length();
        int length2 = target.length();
        // Trivial case: the difference in string lengths already exceeds the threshold
        if (Math.abs(length1 - length2) > threshold) { return Integer.MAX_VALUE; }