@mccreigh
Created October 17, 2011 00:23
Cluster Analysis + CART homework problems

Cluster Analysis & PCA

Begin from the single-field analysis of Kaplan SST in the tropical Pacific for Jun-Sep, 1976-2010. Plot the first 3 spatial patterns, then plot the first 3 modes (spatial pattern * time series) along with the time series of each mode (preferably normalized). Now perform a k-means clustering on the same data using k=3. Plot the spatial pattern of the clustering, then plot the individual time series of the clusters in separate panels.

To simplify the above (including the single-field analysis problem from the PCA homework), these tasks can be performed with the code found here: https://github.com/mccreigh/EOF. In particular, see the file kaplan.pca.r in the R/ directory. This may require installing several (highly recommended) R packages, including 2 of my own, which are in progress and on which I invite collaboration.

Reflect on the differences between the analyses. How are these related to the objectives of each? What is the objective of PCA? What is the objective of cluster analysis? How do the time series of each mode compare to those of each cluster? What are the benefits and drawbacks of each approach to decomposing the space-time series?
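
To make the contrast concrete, here is a minimal sketch of both decompositions on a synthetic space-time matrix (rows are time steps, columns are grid points). It is not the Kaplan SST analysis and does not use the code in the EOF repository; the data, the dimensions, and all object names are invented for illustration, and only base R functions (`prcomp`, `kmeans`) are used.

```r
# Synthetic stand-in for an anomaly field: 35 "years" x 60 "grid points".
# All of this data is invented; it is NOT the Kaplan SST data set.
set.seed(1)
n.time  <- 35
n.space <- 60
signal  <- sin(seq(0, 4 * pi, length.out = n.time))   # one shared time signal
pattern <- c(rep(1, 20), rep(0, 20), rep(-1, 20))     # three spatial regimes
field   <- outer(signal, pattern) +
           matrix(rnorm(n.time * n.space, sd = 0.3), n.time, n.space)

# PCA: orthogonal spatial patterns ranked by the variance they explain.
pca      <- prcomp(field, center = TRUE, scale. = FALSE)
patterns <- pca$rotation[, 1:3]       # spatial patterns (n.space x 3)
pcs      <- scale(pca$x[, 1:3])       # normalized time series (n.time x 3)
var.frac <- pca$sdev^2 / sum(pca$sdev^2)

# k-means: group grid points whose time series look alike.
# Each grid point is one observation, so cluster the transposed field.
km          <- kmeans(t(field), centers = 3, nstart = 25)
cluster.map <- km$cluster             # cluster label for each grid point
cluster.ts  <- t(km$centers)          # mean time series per cluster (n.time x 3)

# Plots are only drawn in an interactive session.
if (interactive()) {
  op <- par(mfrow = c(2, 3))
  for (i in 1:3) plot(pcs[, i], type = "l", main = paste("PC", i))
  for (i in 1:3) plot(cluster.ts[, i], type = "l", main = paste("Cluster", i))
  par(op)
}
```

Note that the PCA time series belong to patterns defined over every grid point, whereas each cluster time series is an average over the grid points assigned to that cluster only.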

Classification and regression trees

We will analyze the mtcars data set in R. If you don't want to use R, the data are supplied below as CSV with a header. The documentation is also given.

Our overall goal is to understand what factors determine the gas mileage (mpg) of these cars.

With that in mind:

  1. Perform a k-means cluster analysis on the data. Make sure to run kmeans several times for each choice of the number of clusters. How do you justify the number of clusters? Use the kink.wss function to examine W (the within-cluster sum of squares) as a function of the number of clusters. Run this a few times to see how it changes with each run of kmeans.

  2. Extra credit: try pam() from the "cluster" package in R. Invoke the print() and plot() methods on the result. What are the medoids under pam? Notice the diagnostic plots.

  3. Extra credit: for a hierarchical clustering, try diana(), also in the cluster package, and plot() the result. More diagnostic plots.

  4. Of course, if you really want to understand a particular variable, clustering is not the right tool because all variables are considered at once. Use a binary regression tree to understand which variables control the mpg of the car. In R the package is rpart. To get a bit finer detail, set control=rpart.control(minsplit=2) in rpart(), or the equivalent in whatever language you are using. In R, plot(result) followed by text(result) gives a graphical binary regression tree. What are the primary variables controlling gas mileage? Remove qsec (since it is a function of other things) from the data and repeat. Finally, perform a "leave-one-out" cross-validation on the mpg of each car model. Give RMSE and at least one other metric for evaluating the accuracy of the model.
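
For item 1, one way to look at W against the number of clusters uses only base R. This is a sketch, not the kink.wss function mentioned above, which is not reproduced here. Standardizing the columns with `scale()`, the range `k = 1:6`, and the object names are my assumptions, not part of the assignment.

```r
# Total within-cluster sum of squares (W) versus number of clusters for mtcars.
# Assumption: columns are standardized so that disp and hp do not dominate.
x  <- scale(mtcars)
ks <- 1:6

# One kmeans run per k from a single random start; results vary run to run.
wss.once <- function(x, ks) {
  sapply(ks, function(k) kmeans(x, centers = k, nstart = 1)$tot.withinss)
}

set.seed(42)
runs <- replicate(5, wss.once(x, ks))   # length(ks) x 5 matrix, one column per run

# Keeping the best of many random starts per k gives a more stable curve.
wss.best <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 50)$tot.withinss
})

# Plots are only drawn in an interactive session.
if (interactive()) {
  matplot(ks, runs, type = "b", pch = 1, lty = 1, col = "grey",
          xlab = "number of clusters k", ylab = "W (total within-cluster SS)")
  lines(ks, wss.best, type = "b", lwd = 2)   # look for the kink (elbow)
}
```

The spread among the grey single-start curves shows how much a single kmeans run depends on its random initial centers; the `nstart` argument repeats the random starts internally and keeps the best result.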

       "model","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
       "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
       "Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
       "Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
       "Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
       "Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
       "Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
       "Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
       "Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
       "Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
       "Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
       "Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
       "Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
       "Merc 450SL",17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
       "Merc 450SLC",15.2,8,275.8,180,3.07,3.78,18,0,0,3,3
       "Cadillac Fleetwood",10.4,8,472,205,2.93,5.25,17.98,0,0,3,4
       "Lincoln Continental",10.4,8,460,215,3,5.424,17.82,0,0,3,4
       "Chrysler Imperial",14.7,8,440,230,3.23,5.345,17.42,0,0,3,4
       "Fiat 128",32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
       "Honda Civic",30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
       "Toyota Corolla",33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
       "Toyota Corona",21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
       "Dodge Challenger",15.5,8,318,150,2.76,3.52,16.87,0,0,3,2
       "AMC Javelin",15.2,8,304,150,3.15,3.435,17.3,0,0,3,2
       "Camaro Z28",13.3,8,350,245,3.73,3.84,15.41,0,0,3,4
       "Pontiac Firebird",19.2,8,400,175,3.08,3.845,17.05,0,0,3,2
       "Fiat X1-9",27.3,4,79,66,4.08,1.935,18.9,1,1,4,1
       "Porsche 914-2",26,4,120.3,91,4.43,2.14,16.7,0,1,5,2
       "Lotus Europa",30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
       "Ford Pantera L",15.8,8,351,264,4.22,3.17,14.5,0,1,5,4
       "Ferrari Dino",19.7,6,145,175,3.62,2.77,15.5,0,1,5,6
       "Maserati Bora",15,8,301,335,3.54,3.57,14.6,0,1,5,8
       "Volvo 142E",21.4,4,121,109,4.11,2.78,18.6,1,1,4,2
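
For item 4, a regression tree on these data might be fit as follows. This is a sketch using the rpart package and the built-in `mtcars` data frame, which holds the same values as the CSV above with the car model as row names; if you read the CSV instead, the "model" column should not be used as a predictor. The object names and the helper `used()` are illustrative.

```r
library(rpart)

# Regression tree for mpg on all other variables.
# minsplit = 2 lets rpart attempt a split in any node holding at least 2 cars.
fit.all <- rpart(mpg ~ ., data = mtcars, method = "anova",
                 control = rpart.control(minsplit = 2))

# Same tree with qsec removed from the data.
no.qsec <- mtcars[, names(mtcars) != "qsec"]
fit.noq <- rpart(mpg ~ ., data = no.qsec, method = "anova",
                 control = rpart.control(minsplit = 2))

# Hypothetical helper: variables actually used for splits,
# in the order they first appear in the tree.
used <- function(fit) {
  v <- as.character(fit$frame$var)
  unique(v[v != "<leaf>"])
}
used(fit.all)
used(fit.noq)

# Plots are only drawn in an interactive session.
if (interactive()) {
  op <- par(mfrow = c(1, 2), xpd = NA)
  plot(fit.all); text(fit.all, cex = 0.7)
  plot(fit.noq); text(fit.noq, cex = 0.7)
  par(op)
}
```

Calling `printcp()` on either fit lists the complexity parameter table, which helps judge how much each additional split reduces the error.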
    
       ?mtcars
       mtcars                package:datasets                 R Documentation
       Motor Trend Car Road Tests
       Description:
            The data was extracted from the 1974 _Motor Trend_ US magazine,
            and comprises fuel consumption and 10 aspects of automobile design
            and performance for 32 automobiles (1973-74 models).
       Usage:
            mtcars
       Format:
            A data frame with 32 observations on 11 variables.
              [, 1]  mpg   Miles/(US) gallon                       
              [, 2]  cyl   Number of cylinders                     
              [, 3]  disp  Displacement (cu.in.)                   
              [, 4]  hp    Gross horsepower                        
              [, 5]  drat  Rear axle ratio                         
              [, 6]  wt    Weight (lb/1000)                        
              [, 7]  qsec  1/4 mile time                           
              [, 8]  vs    V/S                                     
              [, 9]  am    Transmission (0 = automatic, 1 = manual)
              [,10]  gear  Number of forward gears                 
              [,11]  carb  Number of carburetors                        
       Source:
            Henderson and Velleman (1981), Building multiple regression models
            interactively.  _Biometrics_, *37*, 391-411.
       Examples:
            require(graphics)
            pairs(mtcars, main = "mtcars data")
            coplot(mpg ~ disp | as.factor(cyl), data = mtcars,
                   panel = panel.smooth, rows = 1)
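
Returning to the last part of item 4, the leave-one-out cross-validation can be sketched as below: the tree is refit 32 times, each time without one car, and that car's mpg is then predicted. This assumes the rpart package and the same `minsplit = 2` setting as above; the function name `loo.predict` is hypothetical, and the choice of MAE and a cross-validated R-squared as the additional metrics is mine.

```r
library(rpart)

# Leave-one-out cross-validation: refit the tree once per car, each time
# predicting the mpg of the single car that was held out.
loo.predict <- function(data) {
  sapply(seq_len(nrow(data)), function(i) {
    fit <- rpart(mpg ~ ., data = data[-i, ], method = "anova",
                 control = rpart.control(minsplit = 2))
    predict(fit, newdata = data[i, , drop = FALSE])
  })
}

pred <- loo.predict(mtcars)
err  <- pred - mtcars$mpg

rmse <- sqrt(mean(err^2))     # root-mean-square error, in mpg
mae  <- mean(abs(err))        # mean absolute error, in mpg
# Cross-validated R-squared: 1 minus prediction SS over total SS of mpg.
r2   <- 1 - sum(err^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)

round(c(RMSE = rmse, MAE = mae, R2 = r2), 2)
```

To repeat the evaluation without qsec, pass `mtcars[, names(mtcars) != "qsec"]` to `loo.predict()` instead.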
    