BF bfraiche

  • Washington, DC
@bfraiche
bfraiche / bayes_w_r_and_sparklyr.R
Last active April 22, 2020 01:32
This gist contains the complete code for my blogpost: 'Bayesian Machine Learning and NLP with R and sparklyr'
mc$defaultLibrary <- "sparklyr"
library(sparklyr)
library(tidyverse)
speeches <- magpie::sql(mc, "SELECT * FROM presidential_speeches WHERE president")
partitions <- speeches %>%
  ft_tokenizer(input_col = 'speech_text', output_col = 'words') %>%
  ft_stop_words_remover(input_col = 'words', output_col = 'clean_words') %>%
@bfraiche
bfraiche / random_forest_with_python_and_spark_ml.py
Created April 2, 2019 22:30
This gist contains the complete code for my blogpost: 'Random Forest with Python and Spark ML'
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import matplotlib.pyplot as plt
import numpy as np
# Pull in the data
df = mc.sql("SELECT * FROM kings_county_housing")
@bfraiche
bfraiche / vec_asmbl.py
Created April 2, 2019 17:43
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
from pyspark.ml.feature import VectorAssembler
feature_list = []
for col in df.columns:
    if col == 'label':
        continue
    else:
        feature_list.append(col)
assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
@bfraiche
bfraiche / train_model.py
Created April 2, 2019 17:43
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
cvModel = crossval.fit(trainingData)
@bfraiche
bfraiche / test_pred.py
Created April 2, 2019 17:42
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
predictions = cvModel.transform(testData)
@bfraiche
bfraiche / split_data.py
Created April 2, 2019 17:42
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
(trainingData, testData) = df.randomSplit([0.8, 0.2])
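A note on the split above: `randomSplit` assigns each row to a split by random sampling, so the 80/20 proportions are approximate, and it accepts an optional `seed` argument for reproducibility. The following is a minimal plain-Python sketch of that idea, not Spark's implementation; the helper name `random_split` and the sample data are hypothetical.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split by an independent random draw.

    Illustrative only: weights are normalized, and split sizes are
    approximate rather than exact.
    """
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds of each split on the interval [0, 1)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    last = len(bounds) - 1
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any floating-point remainder
            if draw < bound or i == last:
                splits[i].append(row)
                break
    return splits

# Hypothetical data: 1000 row ids split roughly 80/20 with a fixed seed
train, test = random_split(list(range(1000)), [0.8, 0.2], seed=42)
```

With a fixed seed the same rows land in the same split on every run, which makes model comparisons repeatable.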
@bfraiche
bfraiche / importance.py
Last active April 2, 2019 22:17
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
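import matplotlib.pyplot as plt
# The best pipeline holds the VectorAssembler at stage 0 and the fitted random forest at stage 1
bestPipeline = cvModel.bestModel
bestModel = bestPipeline.stages[1]
# featureImportances is a Spark ML Vector; convert it to a NumPy array for plotting
importances = bestModel.featureImportances.toArray()
x_values = list(range(len(importances)))
plt.bar(x_values, importances, orientation = 'vertical')
plt.xticks(x_values, feature_list, rotation=40)
plt.ylabel('Importance')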
@bfraiche
bfraiche / get_df.py
Created April 2, 2019 17:42
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
df = mc.sql("SELECT * FROM kings_county_housing")
@bfraiche
bfraiche / evaluate.py
Created April 2, 2019 17:42
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
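import matplotlib.pyplot as plt
from pyspark.ml.evaluation import RegressionEvaluator
# Score the held-out predictions with root mean squared error
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
# `model` is not defined in these snippets; the fitted cvModel from train_model.py is assumed here
rfPred = cvModel.transform(df)
rfResult = rfPred.toPandas()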
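For reference, the `rmse` metric requested above is the square root of the mean squared difference between the label and the prediction. A minimal plain-Python sketch of that formula follows; the function name `rmse_from_lists` and the sample values are hypothetical and only illustrate the arithmetic, not how Spark computes it internally.

```python
import math

def rmse_from_lists(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one observation is required")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, for illustration only
example_labels = [3.0, 5.0, 7.0]
example_predictions = [2.0, 5.0, 9.0]
example_rmse = rmse_from_lists(example_labels, example_predictions)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones, and the result is in the same units as the label.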
@bfraiche
bfraiche / build_pl.py
Created April 2, 2019 17:41
This gist contains code snippets for my blogpost: 'Random Forest with Python and Spark ML'
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, rf])