@srnghn
srnghn / ANOVA_Spark_2.0.py
Created October 20, 2016 23:16
ANOVA Test for Spark 2.0 using PySpark. The function returns 5 values: degrees of freedom between (numerator), degrees of freedom within (denominator), F-value, eta squared and omega squared.
from pyspark.sql.functions import *
# Implementation of the ANOVA function: calculates the degrees of freedom, F-value, eta squared and omega squared.
# Expects 'categoryData' to have two columns: the first is the categorical independent variable and the second is the scale dependent variable.
# Assumes an active SparkSession is available as 'spark'.
def getAnovaStats(categoryData):
    cat_val = categoryData.toDF("cat", "value")
    cat_val.createOrReplaceTempView("df")
    newdf = spark.sql("select A.cat, A.value, cast((A.value * A.value) as double) as valueSq, ((A.value - B.avg) * (A.value - B.avg)) as diffSq from df A join (select cat, avg(value) as avg from df group by cat) B on A.cat = B.cat")
    grouped = newdf.groupBy("cat")
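The preview above is cut off before the statistics are computed. As a rough illustration of the five values named in the description, here is a minimal pure-Python sketch using the standard one-way ANOVA formulas. It does not use Spark, the name `anova_stats` is hypothetical, and the formulas (in particular the omega squared definition) are assumed from common textbook usage rather than taken from the truncated gist.

```python
from collections import defaultdict


def anova_stats(rows):
    """One-way ANOVA over an iterable of (category, value) pairs.

    Hypothetical helper, not the gist's code. Returns
    (dfb, dfw, f_value, eta_sq, omega_sq).
    """
    # Collect the values belonging to each category.
    groups = defaultdict(list)
    for cat, value in rows:
        groups[cat].append(float(value))

    n_total = sum(len(v) for v in groups.values())
    k = len(groups)
    grand_mean = sum(sum(v) for v in groups.values()) / n_total

    # Between-group and within-group sums of squares.
    ssb = 0.0
    ssw = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        ssb += len(values) * (mean - grand_mean) ** 2
        ssw += sum((x - mean) ** 2 for x in values)

    dfb = k - 1          # degrees of freedom between (numerator)
    dfw = n_total - k    # degrees of freedom within (denominator)
    msb = ssb / dfb
    msw = ssw / dfw
    sst = ssb + ssw

    f_value = msb / msw
    eta_sq = ssb / sst
    # Assumed definition: (SSB - dfb * MSW) / (SST + MSW).
    omega_sq = (ssb - dfb * msw) / (sst + msw)
    return dfb, dfw, f_value, eta_sq, omega_sq


sample = [("a", 1), ("a", 2), ("a", 3),
          ("b", 2), ("b", 3), ("b", 4),
          ("c", 5), ("c", 6), ("c", 7)]
result = anova_stats(sample)
```

The per-category squared differences here play the same role as the `diffSq` column built by the SQL query above; in Spark they would be summed per group rather than in a Python loop.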
srnghn / Pearsons_R_Correlation_Spark_2.0.scala
Created October 5, 2016 00:22
Pearson's R Correlation for Spark 2.0. Created after getting inconsistent results with Statistics.corr. The two scale columns to be evaluated are selected from a DataFrame, converted to a Dataset[ScaleTuple] (a class defined in this code) and passed to the correlation function.
// Create a class, ScaleTuple, to pass to the Pearson's R function so that columns can be referred to by specific names.
final case class ScaleTuple(var1: Double, var2: Double)
// Column names to use when converting to ScaleTuple
val colnames = Seq("var1", "var2")
/**
* Implementation of Pearson's R function: calculates r, the measurement of linear dependence between two variables
* Uses Dataset's 'agg' function
**/
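The preview ends before the function body. To illustrate what an aggregation-based Pearson's r can look like, here is a minimal pure-Python sketch that accumulates the same kind of sums a Dataset `agg` call could produce (count, sum of x, sum of y, sum of xy, sums of squares) and combines them with the standard computational formula. The name `pearson_r` is hypothetical, and whether the gist uses exactly these aggregates is an assumption.

```python
import math


def pearson_r(pairs):
    """Pearson's r for an iterable of (var1, var2) pairs.

    Hypothetical helper, not the gist's code. Uses single-pass sums:
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    """
    n = 0
    sum_x = sum_y = sum_xy = sum_xx = sum_yy = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xy += x * y
        sum_xx += x * x
        sum_yy += y * y

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    # A zero denominator means one variable is constant, so r is undefined.
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return numerator / denominator


r_example = pearson_r([(1, 1), (2, 3), (3, 2)])
```

Because every term is a simple sum, each one maps onto a distributed aggregate, which is why this form suits a single `agg` pass. It can lose precision when values are large relative to their variance, so a two-pass or centered computation may be preferable for such data.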
srnghn / ANOVA_Spark_2.0.scala
Last active December 6, 2019 07:25
ANOVA Test for Spark 2.0 (using RelationalGroupedDataset instead of Iterable[RDD[Double]]). The categorical and scale columns to be evaluated are to be selected from a DataFrame, converted to class type Dataset[CatTuple] (defined in this code) and passed to the ANOVA function. The returned object is of class ANOVAStats (also defined here) and co…
/**
* Create a class, CatTuple, to pass to the ANOVA function so that columns can be referred to by specific names.
* Create a class, ANOVAStats, that will be returned from the ANOVA function so that its outputs can be selected and referred to by name.
**/
final case class CatTuple(cat: String, value: Double)
final case class ANOVAStats(dfb: Long, dfw: Double, F_value: Double, etaSq: Double, omegaSq: Double)
// Column names to use when converting to CatTuple
val colnames = Seq("cat", "value")