sandys/computing_data_analysis_notes.md

## computing_data_analysis_notes.md

      
* using logical indexes
    * ```in1 <- c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)``` and then ```x[in1]``` keeps only the elements of ```x``` where ```in1``` is ```TRUE```
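A minimal sketch of logical indexing. The notes never define `x`, so the vector below is a made-up example, and the condition-based index is an illustration rather than something from the notes.

```r
# x is a hypothetical example vector; the notes do not define it.
x <- c(10, 20, 30, 40, 50, 60, 70)
in1 <- c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)

# Keep only the positions where in1 is TRUE.
kept <- x[in1]
print(kept)

# A logical index is usually built from a condition rather than typed by hand.
big <- x[x > 35]
print(big)
```

The index vector should have the same length as `x`; a shorter one is silently recycled.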

* the single ```[``` operator returns an object of the same class as the original, except on a matrix, where selecting a single row or column drops the result to a plain vector. To keep a matrix, pass the ```drop=FALSE``` argument
    * ```x[1,,drop=FALSE] ``` 
* ```attributes(y) ``` to list all attributes of a data structure
    * ``` attr(y, "class")``` to print out one particular attribute
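The effect of `drop=FALSE` and of the attribute functions can be checked on a small matrix. This is only a sketch; `m`, `row_vec` and `row_mat` are illustrative names.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

row_vec <- m[1, ]                 # dimensions dropped: plain integer vector
row_mat <- m[1, , drop = FALSE]   # still a 1 x 3 matrix

print(is.matrix(row_vec))
print(dim(row_mat))

# The dim attribute is what makes a vector a matrix.
print(attributes(m))
print(attr(row_mat, "dim"))
```

Keeping the matrix shape matters when later code calls `nrow()`, `ncol()` or matrix multiplication on the result.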
* lists also work as hashes: ```x <- list(foo = 1:4, bar=0.6); x$foo```
    * multiple elements cannot be extracted using ```[[]] ``` or ```$``` referencing; use single ```[]``` for that, which returns a list
    * to reach an element nested inside a list element, pass a vector to ```[[]]```, e.g. ``` x[[c(1,2)]]``` is the second element of the first element (```foo```), which gives 2
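The list rules above, tried on the same `x` as in the notes (the names `both` and `nested` are only for illustration):

```r
x <- list(foo = 1:4, bar = 0.6)

print(x$foo)        # one element by name
print(x[["bar"]])   # same idea with [[ ]]

# Several elements at once need single brackets, which return a list.
both <- x[c("foo", "bar")]

# A vector inside [[ ]] indexes recursively: element 1, then its element 2.
nested <- x[[c(1, 2)]]
print(nested)
```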
* Reading files
    * usually, I do ```a <- read.table("specdata/110.csv", comment.char="", nrows=10, header=TRUE,sep=",") ```  
    * [Help page](http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html) for ```read.table ```  
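A sketch of the same `read.table` call on a throwaway file, since `specdata/110.csv` is not available here. The column names and values are invented. Reusing the classes from a small peek via `colClasses` is a common follow-up described on the help page, not something these notes state.

```r
# Hypothetical stand-in for "specdata/110.csv": a small CSV in a temp file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Date,sulfate,nitrate,ID",
             "2001-01-01,NA,NA,110",
             "2001-01-02,3.2,0.5,110",
             "2001-01-03,2.9,0.4,110"), tmp)

# Peek at a few rows to learn the column classes ...
peek <- read.table(tmp, header = TRUE, sep = ",", comment.char = "", nrows = 2)
classes <- sapply(peek, class)

# ... then pass them as colClasses so the full read skips type guessing.
a <- read.table(tmp, header = TRUE, sep = ",", comment.char = "",
                colClasses = classes)
str(a)
unlink(tmp)
```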
* Large datasets 
    * ```version``` command gives useful output
  <pre>
platform       x86_64-pc-linux-gnu          
arch           x86_64                       
os             linux-gnu                    
system         x86_64, linux-gnu            
status                                      
major          2                            
minor          14.1                         
year           2011                         
month          12                           
day            22                           
svn rev        57956                        
language       R                            
version.string R version 2.14.1 (2011-12-22)
  </pre> 

    *  numeric data are stored as 64-bit doubles, i.e. 8 bytes per value
    *  1,500,000 rows of 120 numeric columns take up **1500000 x 120 x 8 bytes / 2^20 ≈ 1373 MB** (about 1.34 GB)
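The memory estimate is plain arithmetic and can be done in R itself. The `object.size()` check on a small vector is an added sanity test, not part of the notes.

```r
rows <- 1500000
cols <- 120
bytes_per_numeric <- 8          # one double = 64 bits

total_bytes <- rows * cols * bytes_per_numeric
total_mb <- total_bytes / 2^20
total_gb <- total_bytes / 2^30
print(round(total_mb, 2))
print(round(total_gb, 2))

# Sanity check on a small vector: object.size() reports the 8 bytes per
# value plus a small fixed header.
small <- numeric(1000)
print(object.size(small))
```

This counts the raw data only; reading the file usually needs noticeably more memory than the finished data frame.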
* dput without file name is a good way of seeing the real (underlying) data structure of R data
e.g. ```y <- data.frame(a=1, b=2, c="a") ```  and then ```dput(y) ```
    
<pre>
  structure(list(a = 1, b = 2, c = structure(1L, .Label = "a", class = "factor")), .Names = c("a", 
"b", "c"), row.names = c(NA, -1L), class = "data.frame")
 </pre>
    * alternatively, you can use ``` str(y)```
    * ``` summary(y)``` can also be used
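A short sketch comparing the three views of the same data frame. Note that the `dput` output shown above comes from R 2.14.1; from R 4.0.0 on, `data.frame()` no longer converts strings to factors by default, so column `c` would show up as a character vector instead. The round trip through `dget()` is an added illustration.

```r
y <- data.frame(a = 1, b = 2, c = "a")

dput(y)      # R code that would recreate y
str(y)       # compact, human-oriented view of the same structure
summary(y)   # per-column summaries

# dput() output is valid R, so it round-trips through dget().
tmp <- tempfile()
dput(y, file = tmp)
y2 <- dget(tmp)
unlink(tmp)
print(all.equal(y, y2))
```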
* to read from a url do
    * ``` con <- url("http://www.google.com", "r")```
    * ``` y <- readLines(con)```
    * ``` head(y)``` to see the first few lines of the page
    * ``` close(con)``` when done
* functions
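    * ```?sd ``` - shows the help page for a function
    * ```args(sd)``` gives info about arguments
    * ```formals(sd) ``` returns the formal parameters (with their defaults) as a list
    * argument evaluation is lazy: an argument is evaluated only when the function body actually uses it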
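Lazy evaluation is easiest to see with an argument that is never used. `f` and `g` below are hypothetical functions written for this illustration.

```r
# f is a hypothetical function: it never touches its second argument.
f <- function(a, b) {
  a^2
}

# b is never evaluated, so leaving it out (or passing an error) is harmless.
r1 <- f(3)
r2 <- f(3, stop("never evaluated"))

# g uses b, so the missing argument only fails once b is actually needed.
g <- function(a, b) {
  a + b
}
r3 <- tryCatch(g(3), error = function(e) "error: b is missing")

print(names(formals(f)))
print(args(f))
```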
* namespaces and libraries
    * ```search()``` gives the search path: the list of attached packages and environments
    * ``` library(lattice)``` attaches the *lattice* package at position 2 of the search path, just after the global environment
    * R uses lexical scoping which is particularly useful for statistical computations
        * in lexical scoping, free variables are looked up in the environment where the function was **defined**. Dynamic scoping looks them up in the environment from which the function was **called**
    * a function plus its environment is called a *closure*. Calling ```environment(f)``` returns the environment. ```parent.env(environment(f))``` goes one step above
         * if a function is defined inside another function, then the environment is unnamed and prints as a memory address like ```<environment: 0x24cc520>```
         * ```ls(environment(f))``` gives symbols listed inside an environment
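A small sketch of closures and lexical scoping. `make_power`, `cube`, `h` and `caller` are made-up names for the example.

```r
# make_power is a hypothetical function factory.
make_power <- function(n) {
  function(x) x^n          # n is a free variable, found where this was defined
}
cube <- make_power(3)
n <- 100                   # a later, outer n does not affect cube()
print(cube(2))             # 8

print(ls(environment(cube)))        # symbols captured in the closure's environment
print(get("n", environment(cube)))

# Lexical vs dynamic: h() sees the z visible where h was defined,
# not the z that is local to its caller.
z <- 1
h <- function() z
caller <- function() {
  z <- 2
  h()
}
print(caller())            # 1
```

Under dynamic scoping `caller()` would have returned 2, because `h` would look for `z` in the function that called it.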

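# Graphing
The call ```xyplot(weight ~ Time | Diet, data=BodyWeight)``` draws a set of 3 panels showing the relationship between weight and time, one panel per diet (```xyplot``` comes from *lattice*; the ```BodyWeight``` data set ships with *nlme*).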