Created November 20, 2018 17:45
Minimal working example of pySpark memory leak
from pyspark import SparkContext
from pyspark.sql import SQLContext
import numpy as np

sc = SparkContext()
sqlContext = SQLContext(sc)

# Create dummy pySpark DataFrame with 1e5 rows and 16 partitions
df = sqlContext.range(0, int(1e5), numPartitions=16)

def toy_example(rdd):
    # Read in pySpark DataFrame partition
    data = list(rdd)

    # Generate random data using Numpy
    rand_data = np.random.random(int(1e7))

    # Apply the `int` function to each element of `rand_data`
    for i in range(len(rand_data)):
        e = rand_data[i]
        int(e)

    # Return a single `0` value
    return [[0]]

# Execute the above function on each partition (16 partitions)
result = df.rdd.mapPartitions(toy_example)
result = result.collect()
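To check the leak empirically, one could log the worker's peak resident set size before and after the work a partition does, e.g. by adding similar calls inside `toy_example`. Below is a minimal standalone sketch using only the standard library's `resource` module (Unix-only); the `rss_kb` helper and the allocation size are illustrative, not part of the gist. Note that `ru_maxrss` is reported in KiB on Linux but in bytes on macOS.

```python
import resource

def rss_kb():
    # Peak resident set size of the current process (KiB on Linux, bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = rss_kb()
data = [0.0] * 10_000_000   # allocate ~80 MB of pointers to mimic a large rand_data
after = rss_kb()
del data  # freeing the list does not lower ru_maxrss, since it is a peak value
print(before, after)
```

Comparing the two readings across successive partitions would show whether the Python worker's memory footprint keeps growing after each call.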
Were you able to resolve this? I suspect I have a similar issue when running a machine learning model inside Spark.