@joshlk
Created November 20, 2018 17:45
Minimal working example of a pySpark memory leak
from pyspark import SparkContext
from pyspark.sql import SQLContext
import numpy as np

sc = SparkContext()
sqlContext = SQLContext(sc)

# Create a dummy pySpark DataFrame with 1e5 rows and 16 partitions
df = sqlContext.range(0, int(1e5), numPartitions=16)

def toy_example(rdd):
    # Read in the pySpark DataFrame partition
    data = list(rdd)

    # Generate random data using NumPy
    rand_data = np.random.random(int(1e7))

    # Apply the `int` function to each element of `rand_data`
    for i in range(len(rand_data)):
        e = rand_data[i]
        int(e)

    # Return a single `0` value
    return [[0]]

# Execute the above function on each partition (16 partitions)
result = df.rdd.mapPartitions(toy_example)
result = result.collect()
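One commonly suggested workaround (not part of the original gist, and only a sketch) is to disable Python worker reuse via the spark.python.worker.reuse configuration, so that each task runs in a fresh Python worker process whose memory is released when the task finishes. This trades extra worker startup cost for a bounded memory footprint. The sketch below assumes the same `toy_example` function defined above:

# Sketch of a possible mitigation: start Spark with Python worker reuse
# disabled so leaked memory is reclaimed when each task's worker exits.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().set("spark.python.worker.reuse", "false")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Same dummy DataFrame and `toy_example` function as in the gist above
df = sqlContext.range(0, int(1e5), numPartitions=16)
result = df.rdd.mapPartitions(toy_example).collect()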
joshlk (author) commented Jan 3, 2019

@vikasgandham

Were you able to resolve this? I suspect I have a similar issue when running a machine
learning model inside Spark.
