@colbyford
Created April 30, 2019 14:50
Convert Spark DataFrame to Numpy Array for AutoML or Scikit-Learn
## PySpark Part
from pyspark.ml import PipelineModel
from pyspark.sql.functions import col

# Read the raw CSV into a Spark DataFrame.
dataset = spark.read.format("csv") \
    .options(header=True, inferSchema=True) \
    .load("/mnt/myfile.csv")

# Apply a previously fitted pipeline, which is expected to add the "features" vector column.
pipeline = PipelineModel.load("/mnt/pipeline/")
dataset = pipeline.transform(dataset)

# Split into train/test sets using the "data_split" column, keeping only the label and features.
train = dataset.where(col("data_split") == "train").select(col("label"), col("features"))
test = dataset.where(col("data_split") == "test").select(col("label"), col("features"))
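## For reference, a hypothetical sketch of how the PipelineModel loaded above
## might originally have been built and saved from the raw DataFrame
## ("col1"/"col2" and raw_dataset are placeholders, not from the original gist):
# from pyspark.ml import Pipeline
# from pyspark.ml.feature import VectorAssembler
# assembler = VectorAssembler(inputCols=["col1", "col2"], outputCol="features")
# Pipeline(stages=[assembler]).fit(raw_dataset).write().overwrite().save("/mnt/pipeline/")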
## Numpy Part
import numpy as np

## Training Data
# Collect to pandas, then unpack each Spark ML vector into a numpy row.
pdtrain = train.toPandas()
trainseries = pdtrain['features'].apply(lambda x: np.array(x.toArray())).values.reshape(-1, 1)
X_train = np.apply_along_axis(lambda x: x[0], 1, trainseries)
y_train = pdtrain['label'].values.reshape(-1, 1).ravel()

## Test Data
pdtest = test.toPandas()
testseries = pdtest['features'].apply(lambda x: np.array(x.toArray())).values.reshape(-1, 1)
X_test = np.apply_along_axis(lambda x: x[0], 1, testseries)
y_test = pdtest['label'].values.reshape(-1, 1).ravel()
print(y_test)
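
## Scikit-Learn Part (illustrative)
## A minimal sketch of feeding the converted arrays into scikit-learn;
## LogisticRegression is an assumed example estimator, not part of the original gist.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))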