eggie5/ISL Exercises.md

## ISL Exercises.md

      
    References:
https://rpubs.com/ppaquay/65557
https://raw.githubusercontent.com/asadoughi/stat-learning/master/ch2/answers
2.4 Exercises

Conceptual


For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
a. The sample size n is extremely large, and the number of predictors p is small.

Better. A flexible/complex model with a low-dimensional feature set and a lot of data would potentially perform well compared to an inflexible/simple model. Because of the small feature space and the large number of examples, you get the benefits of the flexible model without the overfitting that would occur with a large feature space and a small sample size.

b. The number of predictors p is extremely large and the number of observations n is small.

Worse. In this case we have a large feature set and a small observation set. This is dangerous territory for a flexible model: it will fit the noise in the many features, and with so few observations that noise will not average out.

c. The relationship between the predictors and response is highly non-linear:

Flexible is better: in this case we have a non-linear decision boundary. Typically an inflexible model would not be able to fit this, whereas a flexible model is much more likely to learn the non-linear relationship.

d. The variance of the error terms, i.e. $\sigma^2 = Var(e)$, is extremely high.

Worse. A flexible method would fit to the noise in the error terms and increase variance.
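Parts (c) and (d) can be checked with a small simulation. The sketch below is only an illustration under assumed settings that are not part of the exercise: a least-squares line stands in for the inflexible method, 1-nearest-neighbour regression stands in for the flexible one, and the true functions, noise levels, sample sizes and seed are all hypothetical choices. Note that the scenario for (d) also uses a linear truth, which favours the line further.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (the inflexible method)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return lambda x: b0 + b1 * x

def fit_1nn(xs, ys):
    """1-nearest-neighbour regression (a very flexible method)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def compare(f, sigma, seed, n_train=200, n_test=1000):
    """Return (test MSE of the line, test MSE of 1-NN) on one simulated data set."""
    rng = random.Random(seed)

    def sample(n):
        xs = [rng.uniform(-1, 1) for _ in range(n)]
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        return xs, ys

    x_tr, y_tr = sample(n_train)
    x_te, y_te = sample(n_test)

    def mse(model):
        return sum((y - model(x)) ** 2 for x, y in zip(x_te, y_te)) / n_test

    return mse(fit_line(x_tr, y_tr)), mse(fit_1nn(x_tr, y_tr))

# (c) highly non-linear truth, little noise: the flexible method should do better
line_c, knn_c = compare(lambda x: math.sin(math.pi * x), sigma=0.1, seed=0)
# (d) very noisy errors (hypothetical linear truth): the flexible method should do worse
line_d, knn_d = compare(lambda x: 2 * x, sigma=2.0, seed=0)

print(f"(c) line: {line_c:.3f}  1-NN: {knn_c:.3f}")
print(f"(d) line: {line_d:.3f}  1-NN: {knn_d:.3f}")
```

In (c) the line cannot follow the sine curve, so its test error is dominated by bias. In (d) 1-NN copies the noise of a single neighbour into every prediction, so its test error is dominated by variance.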


Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
a. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Regression, inference. n=500, p=3

b. We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

Classification, prediction, n=20, p=13

c. We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

Regression, prediction, n=52(weeks), p=3


Bias-variance
a. Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
[see pic in notebook]
b. Explain why each of the five curves has the shape displayed in part (a).

Bias: (the error from approximating the true f with a simpler model) starts high because a simple model cannot capture the true relationship, and falls as the model becomes more flexible and can follow f more closely.
Variance: Initially the variance is low as the inflexible model over-generalizes. As the model becomes more flexible the variance goes up, because the fit depends more and more on the particular training observations.
Training Error: Starts high initially, then decreases steadily and approaches 0 as the model becomes more flexible.
Test Error: Starts high because the inflexible model is too general. It falls as the model learns the real structure in the features, then rises again as the model starts to learn the noise and becomes less general: the U shape.
Bayes (irreducible) Error: A horizontal line, since it is the variance of the error term and does not depend on the flexibility of the model. The test error curve always lies above it.
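
The shapes of the curves follow from the standard decomposition of the expected test MSE at a point $x_0$. This is a sketch in the usual notation, where $\hat{f}$ is the fitted model, $y_0$ is a new response at $x_0$, and $\epsilon$ is the error term:

$$
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)
$$

The last term is the irreducible error, so the test error is the sum of a falling curve (squared bias), a rising curve (variance) and a constant, which gives the U shape.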


You will now think of some real-life applications for statistical learning:


What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

The advantages of a flexible model are that it can potentially model non-linear relationships and that it decreases the bias. The disadvantage is the risk of overfitting (increased variance) if there are too many features or not enough observations. Less flexible models require less data and are less prone to overfitting.
A flexible approach is preferred over an inflexible one when the relationship is non-linear and there is a lot of data. A less flexible approach is desirable because the model is simpler, easier to interpret, and needs fewer precautions to avoid overfitting.


Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

A parametric approach reduces the problem of estimating f down to one of estimating a set of parameters because it assumes a form for f.
A non-parametric approach does not assume a particular form for f and so requires a very large sample to estimate f accurately.
The advantages of a parametric approach to regression or classification are that modeling f is simplified to estimating a few parameters, and that fewer observations are required compared to a non-parametric approach.
The disadvantages of a parametric approach to regression or classification are a potentially inaccurate estimate of f if the assumed form of f is wrong, and overfitting of the observations if more flexible models are used.


The table below provides a training data set containing 6 observations, 3 predictors, and 1 qualitative response variable. Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.


3.7 Exercises

Conceptual


Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

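Null: There is one null hypothesis per predictor, each stating that the corresponding advertising budget (TV, radio or newspaper) has no relationship with sales when the other two are held fixed. More precisely, $H_0^{(1)}: \beta_1 = 0$, $H_0^{(2)}: \beta_2 = 0$ and $H_0^{(3)}: \beta_3 = 0$. The p-values for TV and radio are very low, so we reject the null for both: TV and radio spending are related to sales. The p-value for newspaper is high, so we cannot reject $H_0^{(3)}$. This means there is no evidence that newspaper contributes to sales once TV and radio are accounted for.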


Carefully explain the differences between the KNN classifier and KNN regression methods.

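Both methods start the same way: for a query point, identify the K nearest training points by some distance measure (e.g. Euclidean). They differ in how the neighbors are combined. The KNN classifier is used for a qualitative response and predicts the class by a majority vote of the K nearest points (equivalently, the class with the highest estimated probability among them). KNN regression is used for a quantitative response and predicts the average of the responses of the K nearest points.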
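
The difference can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function names and the toy data are hypothetical, and ties in the vote are broken arbitrarily by `Counter`.

```python
import math
from collections import Counter

def k_nearest(train_x, x0, k):
    """Indices of the k training points closest to x0 (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x0))
    return order[:k]

def knn_classify(train_x, labels, x0, k):
    """Classification: majority vote among the k nearest labels."""
    idx = k_nearest(train_x, x0, k)
    return Counter(labels[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_x, values, x0, k):
    """Regression: average of the k nearest responses."""
    idx = k_nearest(train_x, x0, k)
    return sum(values[i] for i in idx) / k

# Hypothetical toy data: the same predictors with a qualitative and a quantitative response.
points = [(0, 0), (1, 0), (0, 1), (3, 3), (4, 3)]
labels = ["red", "red", "green", "green", "green"]
values = [1.0, 2.0, 3.0, 10.0, 11.0]

x0 = (0.2, 0.1)
predicted_class = knn_classify(points, labels, x0, k=3)
predicted_value = knn_regress(points, values, x0, k=3)
print(predicted_class, predicted_value)
```

The neighbor search is shared. Only the last step changes: a vote for the classifier, a mean for the regression.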


Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get β0 = 50, β1 = 20, β2 = 0.07, β3 = 35, β4 = 0.01, β5 = −10. Which answer is correct, and why?
a)

For a fixed value of IQ and GPA, males earn more on average than females.
For a fixed value of IQ and GPA, females earn more on average than males.
For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.
For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

b)
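c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

False. To determine whether the interaction effect is real, we must test the hypothesis $H_0: \beta_4 = 0$ and look up the p-value associated with the t or F statistic to draw a conclusion. The size of a coefficient depends on the scale of the variables (GPA × IQ takes large values), so a small coefficient on its own says little about the evidence.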
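
For part (a), the fitted model can be evaluated directly. The sketch below plugs in the coefficients stated in the question and compares the predicted salary of a female and a male with the same GPA and IQ. The function names are hypothetical, and the GPA and IQ values are arbitrary examples.

```python
def predicted_salary(gpa, iq, female):
    """Predicted starting salary (thousands of dollars) from the fitted model in the question."""
    b0, b1, b2, b3, b4, b5 = 50, 20, 0.07, 35, 0.01, -10
    return (b0 + b1 * gpa + b2 * iq + b3 * female
            + b4 * gpa * iq + b5 * gpa * female)

def female_minus_male(gpa, iq):
    """Gap between female and male predictions at the same GPA and IQ."""
    return predicted_salary(gpa, iq, 1) - predicted_salary(gpa, iq, 0)

for gpa in (3.0, 3.5, 4.0):
    print(gpa, round(female_minus_male(gpa, iq=110), 6))
```

Every term that does not involve gender cancels, so the gap is 35 − 10 × GPA regardless of IQ. It is positive below a GPA of 3.5 and negative above it, which points to the third statement: males earn more on average provided that the GPA is high enough.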


I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$.
a. Suppose that the true relationship between X and Y is linear, i.e. $Y = \beta_0 + \beta_1 X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

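The cubic regression will have a training RSS that is lower than (or at worst equal to) that of the linear regression. The linear model is a special case of the cubic model (with β2 = β3 = 0), so least squares on the cubic model can always match the linear fit, and it will usually do slightly better by fitting some of the noise, even though the true relationship is linear.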

b. Answer (a) using test rather than training RSS.

We would expect the test RSS for the linear regression to be lower. The true relationship is linear, so the extra terms of the cubic model only fit noise in the training data, and a higher-order polynomial will generally overfit more than a lower-order one. On any single test set the ordering is not guaranteed, but the cubic test RSS will likely be larger.

c. Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

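The training RSS for the cubic model will be lower. The cubic model is more flexible and contains the linear model as a special case, so it fits the training data at least as closely no matter what the true relationship is.

d.  Answer (c) using test rather than training RSS.

It's hard to tell, because we don't know how far the relationship is from linear. If it is strongly non-linear, it would be reasonable to say that the test RSS for the cubic model will be lower. If it is close to linear, the linear model may still have the lower test RSS.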
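
The training-RSS claims in (a) and (c) can be checked numerically. The sketch below simulates data from a hypothetical linear truth (the coefficients, noise level and seed are assumptions, not part of the exercise), fits both models by solving the normal equations with a small hand-written solver, and compares training and test RSS.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def poly_fit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via the normal equations."""
    k = degree + 1
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)

def rss(coefs, xs, ys):
    """Residual sum of squares of a fitted polynomial on (xs, ys)."""
    return sum((y - sum(c * x ** p for p, c in enumerate(coefs))) ** 2
               for x, y in zip(xs, ys))

def simulate(n, rng):
    """Hypothetical linear truth: Y = 1 + 2X + noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(1)
x_tr, y_tr = simulate(100, rng)
x_te, y_te = simulate(100, rng)

linear = poly_fit(x_tr, y_tr, 1)
cubic = poly_fit(x_tr, y_tr, 3)
train_rss_linear, train_rss_cubic = rss(linear, x_tr, y_tr), rss(cubic, x_tr, y_tr)
test_rss_linear, test_rss_cubic = rss(linear, x_te, y_te), rss(cubic, x_te, y_te)

print("train:", train_rss_linear, train_rss_cubic)
print("test: ", test_rss_linear, test_rss_cubic)
```

The training RSS of the cubic fit never exceeds that of the linear fit, because the models are nested. The test RSS has no such guarantee, and its ordering can change from one simulated data set to the next.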


Consider the fitted values that result from performing linear regression without an intercept. In this setting, the ith fitted value takes the form