Skip to content

Instantly share code, notes, and snippets.

@mwacc
Created January 14, 2014 15:40
Show Gist options
  • Save mwacc/8420310 to your computer and use it in GitHub Desktop.
Save mwacc/8420310 to your computer and use it in GitHub Desktop.
linear regression and automatic formula building and best result pick up
library(ggplot2)
# read dataset from local file
abalone <- read.csv("/Users/kostya/Downloads/abalone.data.csv", header=F)
# set names for dataframe columns
colnames(abalone) <- c('Sex', 'Length', 'Diameter', 'Height', 'WholeWeight', 'ShuckedWeight',
'VisceraWeight', 'ShellWeight', 'Rings')
# plot histogram
hist(abalone$Rings, freq=F)
# depicture all charts on one plot
qplot(Diameter, Rings, data=abalone, geom=c("point", "smooth"), method="lm", color=Sex, se=F)
# plot each sex on different plot
ggplot(abalone, aes(VisceraWeight, Rings)) +
geom_jitter(alpha=0.25) +
geom_smooth(method=lm, se=FALSE) +
facet_grid(. ~ Sex)
summary(lm(Rings~Length+I(Diameter^2)+log(WholeWeight)+log(ShellWeight)+log(ShuckedWeight)
+Height+VisceraWeight, data=subset(abalone, Sex %in% 'I')) )
summary(lm(Rings~Length+I(Diameter^2)+log(WholeWeight)+log(ShellWeight)+ShuckedWeight
+Height+VisceraWeight, data=subset(abalone, Sex %in% 'M')) )
summary(lm(Rings~Length+I(Diameter^2)+WholeWeight+ShellWeight+ShuckedWeight
+Height+VisceraWeight, data=subset(abalone, Sex %in% 'F')) )
##################################
# for Infrant:
# Rings= 8.5398 - 7.6755*Length + 8.7707*Diameter^2 + 1.4837*log(WholeWeight) + 2.0745*log((ShellWeight) +
# -2.3415*log(ShuckedWeight) + 27.8275*Height + 5.9972*VisceraWeight
##########################################################################################################
# Let's build the best regression to describe data
##########################################################################################################
# get the names of the columns
props <- names(abalone[,-length(names(abalone))])
props <- props[! props %in% 'Rings']
props <- mapply(c,
#props,
c('Sex'),
lapply(props[! props %in% 'Sex'], function(x) paste('I(', x, '^2)') ),
lapply(props[! props %in% 'Sex'], function(x) paste('log(', x, ')') ),
SIMPLIFY = TRUE)
n <- length(props)
# construct all possible combinations
id <- unlist(
lapply(1:n,
function(i) combn(1:n,i,simplify=F)
)
,recursive=F)
# and paste them to formula
Formulas <- sapply(id, function(i)
paste("Rings~",paste(props[i],collapse="+"))
)
ptm <- proc.time()
# evaluate all formulas
lm.m <- lapply(Formulas, function(f)
return(tryCatch(
summary( lm(as.formula(f), data=abalone)),
error=function(e) NULL
))
)
proc.time() - ptm
# user system elapsed
# 26470.45 5.10 26484.1
# pick up the formula based on the best prediction
bestLM <- lm.m[[1]]
bestARS <- bestLM$adj.r.squared
for(i in 2:length(lm.m)) {
if( !is.null(lm.m[[i]]) && lm.m[[i]]$adj.r.squared > bestARS ) {
bestLM <- lm.m[[i]]
bestARS <- bestLM$adj.r.squared
}
}
bestLM
> bestLM
Call:
lm(formula = as.formula(f), data = abalone)
Residuals:
Min 1Q Median 3Q Max
-9.4223 -1.2949 -0.3214 0.8767 14.6160
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.90010 1.14005 10.438 < 2e-16 ***
SexI -0.60927 0.10145 -6.006 2.07e-09 ***
SexM 0.01463 0.08144 0.180 0.8574
Length -8.37453 1.92501 -4.350 1.39e-05 ***
Diameter 4.41220 2.24399 1.966 0.0493 *
Height 6.33314 1.54240 4.106 4.10e-05 ***
log(WholeWeight) 3.03359 0.74462 4.074 4.71e-05 ***
WholeWeight 7.05492 1.03490 6.817 1.06e-11 ***
log(ShuckedWeight) -3.09818 0.46808 -6.619 4.08e-11 ***
ShuckedWeight -12.05390 1.33481 -9.030 < 2e-16 ***
VisceraWeight -9.29066 1.27568 -7.283 3.89e-13 ***
log(ShellWeight) 2.15886 0.44562 4.845 1.31e-06 ***
ShellWeight 4.34356 1.79616 2.418 0.0156 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.142 on 4164 degrees of freedom
Multiple R-squared: 0.5598, Adjusted R-squared: 0.5585
F-statistic: 441.3 on 12 and 4164 DF, p-value: < 2.2e-16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment