@ajp619
Created April 19, 2014 05:00
R poly()
### poly
How does `poly()` work?
```{r}
a <- 1:10
# Let's start easy
p <- poly(a, 3, raw=TRUE)
p
# This is easy to reproduce
data.frame('1'=a, '2'=a^2, '3'=a^3, check.names=FALSE)
# So what about:
p <- poly(a, 3, raw=FALSE) # raw=FALSE is the default option
# Can I reproduce this?
# First let's define a couple of functions to make this easier
# Vector length (Euclidean norm), like Octave's norm() function
o.norm <- function(v){return(sqrt(sum(v*v)))}
# Normalize: center, then scale to unit length
v.normalize <- function(v){
  v <- v - mean(v)
  v <- v / sd(v)       # note: the next line rescales to unit length anyway
  v <- v / o.norm(v)
  return(v)
}
a1 <- v.normalize(a)
# If I got it right, the next line of code should produce: > [1] TRUE
all(round(p[ ,1], 4) == round(a1, 4))
# What about the higher degrees?
a2 <- v.normalize(a^2)
all(round(p[ ,2], 4) == round(a2, 4))
# That's not right
# Let's see what they look like:
# black: poly's second column; blue: the naively normalized a^2
plot(p[,2], pch=19)
points(a2, pch=19, col='blue')
lines(p[,2])
lines(a2, col='blue')
# I don't know how to make sense of this
```
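The mismatch appears to be because `poly()` does more than normalize each power. My reading of `?poly` and the `stats::poly` source (this is my interpretation, not something established above) is that each degree is also made orthogonal to all lower degrees, including the constant, essentially a Gram-Schmidt/QR step. Under that reading, `p[,2]` should be whatever is left of `a^2` after projecting out the mean and the linear component, which is easy to check:
```{r}
# Sketch: reproduce p[,2] as the unit-normalized residual of a^2 after
# regressing out an intercept and the degree-1 column. If the reading
# above is right, this is the Gram-Schmidt step poly() performs
# internally via a QR decomposition of the centered powers of a.
r2 <- residuals(lm(I(a^2) ~ p[, 1]))
a2.orth <- r2 / o.norm(r2)
all(round(a2.orth, 4) == round(p[, 2], 4))   # should be TRUE
```
If that check returns TRUE, the curve in the plot isn't made-up data: it's the part of `a^2` that remains after its mean and linear trend are removed.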
I think what we're doing is creating a higher-order polynomial by adding features (columns) to the data set and then fitting a linear model to the expanded data set, so the raw=TRUE output makes sense to me.
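As a quick sanity check of that reading, here is a sketch with a made-up response `y` (my own addition, not part of the original notes): fitting against the raw `poly()` columns is exactly the same model as fitting against hand-built power terms.
```{r}
set.seed(1)
y <- 1 + 2*a - 0.5*a^2 + rnorm(length(a))   # simulated response, for illustration only
fit.poly <- lm(y ~ poly(a, 3, raw=TRUE))
fit.hand <- lm(y ~ a + I(a^2) + I(a^3))
all.equal(fitted(fit.poly), fitted(fit.hand))   # same column space, so same fitted values
```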
Reading through ?poly and other resources, the raw=FALSE option creates orthogonal polynomials. The point is to reduce multicollinearity:
From Wikipedia (http://en.wikipedia.org/wiki/Multicollinearity):

> Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data themselves; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
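To connect the quote back to `poly()`, here is a quick check (my own addition): the raw power columns are strongly correlated with one another, while the default orthogonal columns are uncorrelated by construction.
```{r}
# Raw powers of a are nearly collinear with one another:
round(cor(poly(a, 3, raw=TRUE)), 4)
# The default (orthogonal) columns are orthonormal, so
# t(Z) %*% Z is (numerically) the 3x3 identity matrix:
round(crossprod(poly(a, 3)), 4)
```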
But looking at the graph, it still feels like we're just making up data points. Is that really what's happening?