@vsimko
Created May 10, 2017 13:25
library(SparkR)
library(sparklyr)
library(dplyr)
# connect to a local Spark instance, pinning the Spark and Hadoop versions
sc <- spark_connect(master = "local", version = "2.0.2", hadoop_version = "2.7")
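# assuming comma-separated input in which the literal string "null" marks missing values
# memory = FALSE registers the table without caching it in Spark memory
tab1 <- spark_read_csv(
  sc, name = "bmw", path = "~/SOC.csv",
  memory = FALSE, infer_schema = TRUE, null_value = "null")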
# copy a 1% subsample into R's memory as df1
df1 <- tab1 %>% sdf_sample(fraction = 0.01) %>% collect()
# use corrplot to investigate correlations among columns
library(corrplot)
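# drop VIN: cor() needs numeric columns and VIN is character (chr)
df1$VIN <- NULL
# pairwise.complete.obs uses, for each pair of columns, only rows where both are non-missing
# method = "spearman" is also possible
M <- cor(df1, use = "pairwise.complete.obs", method = "pearson")
# order = "FPC" arranges the variables by the first principal component
corrplot(M, order = "FPC")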
# take a look at https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
# or this https://github.com/taiyun/corrplot
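
The script above drops `VIN` by name and mentions Spearman as an alternative to Pearson. The following is a minimal sketch of both points using only base R, so it runs without Spark. The toy data frame and its column names (`soc`, `temp`) are hypothetical stand-ins for the collected sample and are not taken from the original data.

```r
# Toy stand-in for the collected sample; column names are hypothetical.
df <- data.frame(
  VIN  = c("A1", "B2", "C3", "D4", "E5"),
  soc  = c(10, 20, 30, 40, 50),
  temp = c(1, 4, 9, 16, 25),
  stringsAsFactors = FALSE
)

# Keep every numeric column instead of removing character columns one by one.
num_cols <- vapply(df, is.numeric, logical(1))
df_num <- df[, num_cols, drop = FALSE]

# Pearson measures linear association, Spearman measures monotonic association.
M_pearson  <- cor(df_num, use = "pairwise.complete.obs", method = "pearson")
M_spearman <- cor(df_num, use = "pairwise.complete.obs", method = "spearman")

# Spearman is Pearson computed on the ranks of each column
# (this toy data has no missing values, so ranks can be taken directly).
M_ranks <- cor(as.data.frame(lapply(df_num, rank)), method = "pearson")
```

Selecting by type rather than by name keeps `cor()` working if the input gains further non-numeric columns. In this toy data `temp` grows monotonically but not linearly with `soc`, so the Spearman coefficient is 1 while the Pearson coefficient is below 1.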