bfraiche / add_rf.py
Last active April 2, 2019 17:43
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(labelCol="label", featuresCol="features")
bfraiche / best_hp.py
Last active April 2, 2019 22:18
print('numTrees - ', bestModel.getNumTrees)
print('maxDepth - ', bestModel.getOrDefault('maxDepth'))
bfraiche / build_cv.py
Created April 2, 2019 17:41
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=3)
bfraiche / build_grid.py
Last active April 2, 2019 22:08
from pyspark.ml.tuning import ParamGridBuilder
import numpy as np
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [int(x) for x in np.linspace(start=10, stop=50, num=3)]) \
    .addGrid(rf.maxDepth, [int(x) for x in np.linspace(start=5, stop=25, num=3)]) \
    .build()
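To see what this grid expands to, here is a minimal dependency-free sketch. The helper `linspace_ints` is a hypothetical stand-in for the `np.linspace` list comprehensions above (it is not part of the gist), and the fit count assumes the 3-fold `CrossValidator` from build_cv.py.

```python
from itertools import product


def linspace_ints(start, stop, num):
    """Hypothetical stand-in for [int(x) for x in np.linspace(start, stop, num)]."""
    if num == 1:
        return [int(start)]
    step = (stop - start) / (num - 1)
    return [int(start + i * step) for i in range(num)]


num_trees_values = linspace_ints(10, 50, 3)
max_depth_values = linspace_ints(5, 25, 3)

# ParamGridBuilder.build() yields one param map per combination of the grids
combinations = list(product(num_trees_values, max_depth_values))

num_folds = 3  # assumed to match numFolds in build_cv.py
fits_during_search = len(combinations) * num_folds

print(num_trees_values)
print(max_depth_values)
print(len(combinations), fits_during_search)
```

With these settings the search covers 9 parameter combinations and 27 fits across folds; CrossValidator then refits the best combination once more on the full training set.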
bfraiche / build_pl.py
Created April 2, 2019 17:41
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, rf])
bfraiche / evaluate.py
Created April 2, 2019 17:42
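import matplotlib.pyplot as plt
from pyspark.ml.evaluation import RegressionEvaluator

# RMSE on the held-out predictions (predictions is built in test_pred.py)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
# NOTE: model is not defined in these snippets; it is presumably the fitted model (cvModel in test_pred.py)
rfPred = model.transform(df)
rfResult = rfPred.toPandas()  # collect to pandas for plotting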
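For reference, `metricName="rmse"` asks the evaluator for the root mean squared error between the label and prediction columns. A minimal pure-Python sketch of that formula follows; the function name `rmse_of` and the sample numbers are made up for illustration and do not come from the gist or the housing data.

```python
import math


def rmse_of(labels, predictions):
    """Root mean squared error: sqrt(mean((label - prediction) ** 2))."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, purely illustrative
example_labels = [3.0, 5.0, 7.0]
example_predictions = [2.0, 5.0, 9.0]
example_rmse = rmse_of(example_labels, example_predictions)
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, which is worth keeping in mind when comparing models on skewed targets such as house prices.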
bfraiche / get_df.py
Created April 2, 2019 17:42
df = mc.sql("SELECT * FROM kings_county_housing")
bfraiche / importance.py
Last active April 2, 2019 22:17
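import matplotlib.pyplot as plt

# Pull the fitted random forest out of the best pipeline (stage 0 is the assembler, stage 1 is rf)
bestPipeline = cvModel.bestModel
bestModel = bestPipeline.stages[1]
# featureImportances is a Spark ML Vector; convert it to a NumPy array for plotting
importances = bestModel.featureImportances.toArray()
x_values = list(range(len(importances)))
# feature_list is assumed to hold the feature names in the assembler's input order
plt.bar(x_values, importances, orientation='vertical')
plt.xticks(x_values, feature_list, rotation=40)
plt.ylabel('Importance')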
bfraiche / split_data.py
Created April 2, 2019 17:42
(trainingData, testData) = df.randomSplit([0.8, 0.2])
bfraiche / test_pred.py
Created April 2, 2019 17:42
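# cvModel is assumed to be the fitted CrossValidatorModel, e.g. cvModel = crossval.fit(trainingData)
predictions = cvModel.transform(testData)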