raov5/Data Preparation in R.ipynb

## Data Preparation in R.ipynb
{
    "nbformat_minor": 1,
    "cells": [
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#@author: Venky Rao raove@us.ibm.com\n#@last edited: 22 Aug 2017\n#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff",
            "execution_count": 90
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "# Data preparation in R"
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Creating a Data Frame"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#let us begin by creating a data frame of data for us to analyze\nmanager <- c(1, 2, 3, 4, 5)\ndate <- c(\"10/24/08\", \"10/28/08\", \"10/1/08\", \"10/12/08\", \"5/1/09\")\ncountry <- c(\"US\", \"US\", \"UK\", \"UK\", \"UK\")\ngender <- c(\"M\", \"F\", \"F\", \"M\", \"F\")\nage <- c(32, 45, 25, 39, 99)\nq1 <- c(5, 3, 3, 3, 2)\nq2 <- c(4, 5, 5, 3, 2)\nq3 <- c(5, 2, 5, 4, 1)\nq4 <- c(5, 5, 5, NA, 2)\nq5 <- c(5, 5, 2, NA, 1)\nleadership <- data.frame(manager, date, country, gender, age, q1, q2, q3, q4, q5, stringsAsFactors = F)\n#if stringsAsFactors = T, then dates would be treated as factors",
            "execution_count": 4
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th></tr></thead>\n<tbody>\n\t<tr><td>1       </td><td>10/24/08</td><td>US      </td><td>M       </td><td>32      </td><td>5       </td><td>4       </td><td>5       </td><td> 5      </td><td> 5      </td></tr>\n\t<tr><td>2       </td><td>10/28/08</td><td>US      </td><td>F       </td><td>45      </td><td>3       </td><td>5       </td><td>2       </td><td> 5      </td><td> 5      </td></tr>\n\t<tr><td>3       </td><td>10/1/08 </td><td>UK      </td><td>F       </td><td>25      </td><td>3       </td><td>5       </td><td>5       </td><td> 5      </td><td> 2      </td></tr>\n\t<tr><td>4       </td><td>10/12/08</td><td>UK      </td><td>M       </td><td>39      </td><td>3       </td><td>3       </td><td>4       </td><td>NA      </td><td>NA      </td></tr>\n\t<tr><td>5       </td><td>5/1/09  </td><td>UK      </td><td>F       </td><td>99      </td><td>2       </td><td>2       </td><td>1       </td><td> 2      </td><td> 1      </td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|llllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5\\\\\n\\hline\n\t 1        & 10/24/08 & US       & M        & 32       & 5        & 4        & 5        &  5       &  5      \\\\\n\t 2        & 10/28/08 & US       & F        & 45       & 3        & 5        & 2        &  5       &  5      \\\\\n\t 3        & 10/1/08  & UK       & F        & 25       & 3        & 5        & 5        &  5       &  2      \\\\\n\t 4        & 10/12/08 & UK       & M        & 39       & 3        & 3        & 4        & NA       & NA      \\\\\n\t 5        & 5/1/09   & UK       & F        & 99       & 2        & 2        & 1        &  2       &  1      \\\\\n\\end{tabular}\n",
                        "text/plain": "  manager date     country gender age q1 q2 q3 q4 q5\n1 1       10/24/08 US      M      32  5  4  5   5  5\n2 2       10/28/08 US      F      45  3  5  2   5  5\n3 3       10/1/08  UK      F      25  3  5  5   5  2\n4 4       10/12/08 UK      M      39  3  3  4  NA NA\n5 5       5/1/09   UK      F      99  2  2  1   2  1"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#display the data frame that we created\nleadership",
            "execution_count": 5
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Creating new variables"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th></tr></thead>\n<tbody>\n\t<tr><td>2</td><td>3</td></tr>\n\t<tr><td>2</td><td>4</td></tr>\n\t<tr><td>6</td><td>2</td></tr>\n\t<tr><td>4</td><td>8</td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|ll}\n x1 & x2\\\\\n\\hline\n\t 2 & 3\\\\\n\t 2 & 4\\\\\n\t 6 & 2\\\\\n\t 4 & 8\\\\\n\\end{tabular}\n",
                        "text/plain": "  x1 x2\n1 2  3 \n2 2  4 \n3 6  2 \n4 4  8 "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#data preparation often involves creating new variables in a data frame\n#let's create a new data frame to explore this further\nmydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))\nmydata",
            "execution_count": 6
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#let's say we want to create two new variables in the data frame:\n#sumx which is the sum of x1 and x2; and\n#meanx which is the mean of x1 and x2\n#here are 3 ways of doing this:",
            "execution_count": 7
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2  </td><td>3  </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2  </td><td>4  </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6  </td><td>2  </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4  </td><td>8  </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2   & 3   &  5  & 2.5\\\\\n\t 2   & 4   &  6  & 3.0\\\\\n\t 6   & 2   &  8  & 4.0\\\\\n\t 4   & 8   & 12  & 6.0\\\\\n\\end{tabular}\n",
                        "text/plain": "  x1 x2 sumx meanx\n1 2  3   5   2.5  \n2 2  4   6   3.0  \n3 6  2   8   4.0  \n4 4  8  12   6.0  "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#option 1\nmydata$sumx <- mydata$x1 + mydata$x2\nmydata$meanx <- (mydata$x1 + mydata$x2) / 2\nmydata",
            "execution_count": 8
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2  </td><td>3  </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2  </td><td>4  </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6  </td><td>2  </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4  </td><td>8  </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2   & 3   &  5  & 2.5\\\\\n\t 2   & 4   &  6  & 3.0\\\\\n\t 6   & 2   &  8  & 4.0\\\\\n\t 4   & 8   & 12  & 6.0\\\\\n\\end{tabular}\n",
                        "text/plain": "  x1 x2 sumx meanx\n1 2  3   5   2.5  \n2 2  4   6   3.0  \n3 6  2   8   4.0  \n4 4  8  12   6.0  "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#option 2\nattach(mydata)\nmydata$sumx <- x1 + x2\nmydata$meanx <- (x1 + x2) / 2\ndetach(mydata)\nmydata",
            "execution_count": 9
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2  </td><td>3  </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2  </td><td>4  </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6  </td><td>2  </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4  </td><td>8  </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2   & 3   &  5  & 2.5\\\\\n\t 2   & 4   &  6  & 3.0\\\\\n\t 6   & 2   &  8  & 4.0\\\\\n\t 4   & 8   & 12  & 6.0\\\\\n\\end{tabular}\n",
                        "text/plain": "  x1 x2 sumx meanx\n1 2  3   5   2.5  \n2 2  4   6   3.0  \n3 6  2   8   4.0  \n4 4  8  12   6.0  "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#option 3\nmydata <- transform(mydata, sumx = x1 + x2, meanx = (x1 + x2) / 2)\nmydata",
            "execution_count": 10
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Recoding variables"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#recoding involves creating new values of a variable conditional on the existing values of\n#the same and / or other variables.  for example: you may want to\n#1. change a continuous variable into a set of categories\n#2. replace miscoded values with correct values\n#3. create a pass / fail variable based on a set of cutoff scores\n#to recode data, you use one of R's logical operators (i.e. expressions that return TRUE or FALSE)",
            "execution_count": 11
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|llllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5\\\\\n\\hline\n\t 1        & 10/24/08 & US       & M        & 32       & 5        & 4        & 5        &  5       &  5      \\\\\n\t 2        & 10/28/08 & US       & F        & 45       & 3        & 5        & 2        &  5       &  5      \\\\\n\t 3        & 10/1/08  & UK       & F        & 25       & 3        & 5        & 5        &  5       &  2      \\\\\n\t 4        & 10/12/08 & UK       & M        & 39       & 3        & 3        & 4        & NA       & NA      \\\\\n\t 5        & 5/1/09   & UK       & F        & NA       & 2        & 2        & 1        &  2       &  1      \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th></tr></thead>\n<tbody>\n\t<tr><td>1       </td><td>10/24/08</td><td>US      </td><td>M       </td><td>32      </td><td>5       </td><td>4       </td><td>5       </td><td> 5      </td><td> 5      </td></tr>\n\t<tr><td>2       </td><td>10/28/08</td><td>US      </td><td>F       </td><td>45      </td><td>3       </td><td>5       </td><td>2       </td><td> 5      </td><td> 5      </td></tr>\n\t<tr><td>3       </td><td>10/1/08 </td><td>UK      </td><td>F       </td><td>25      </td><td>3       </td><td>5       </td><td>5       </td><td> 5      </td><td> 2      </td></tr>\n\t<tr><td>4       </td><td>10/12/08</td><td>UK      </td><td>M       </td><td>39      </td><td>3       </td><td>3       </td><td>4       </td><td>NA      </td><td>NA      </td></tr>\n\t<tr><td>5       </td><td>5/1/09  </td><td>UK      </td><td>F       </td><td>NA      </td><td>2       </td><td>2       </td><td>1       </td><td> 2      </td><td> 1      </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  manager date     country gender age q1 q2 q3 q4 q5\n1 1       10/24/08 US      M      32  5  4  5   5  5\n2 2       10/28/08 US      F      45  3  5  2   5  5\n3 3       10/1/08  UK      F      25  3  5  5   5  2\n4 4       10/12/08 UK      M      39  3  3  4  NA NA\n5 5       5/1/09   UK      F      NA  2  2  1   2  1"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#example: recode ages of managers in the \"leadership\" data frame from continuous to categorical\n#step 1: recode age 99 to indicate missing value since 99 is obviously incorrect\nleadership$age[leadership$age == 99] <- NA\nleadership",
            "execution_count": 14
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5 & agecat\\\\\n\\hline\n\t 1        & 10/24/08 & US       & M        & 32       & 5        & 4        & 5        &  5       &  5       & Young   \\\\\n\t 2        & 10/28/08 & US       & F        & 45       & 3        & 5        & 2        &  5       &  5       & Young   \\\\\n\t 3        & 10/1/08  & UK       & F        & 25       & 3        & 5        & 5        &  5       &  2       & Young   \\\\\n\t 4        & 10/12/08 & UK       & M        & 39       & 3        & 3        & 4        & NA       & NA       & Young   \\\\\n\t 5        & 5/1/09   & UK       & F        & NA       & 2        & 2        & 1        &  2       &  1       & NA      \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1       </td><td>10/24/08</td><td>US      </td><td>M       </td><td>32      </td><td>5       </td><td>4       </td><td>5       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>2       </td><td>10/28/08</td><td>US      </td><td>F       </td><td>45      </td><td>3       </td><td>5       </td><td>2       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>3       </td><td>10/1/08 </td><td>UK      </td><td>F       </td><td>25      </td><td>3       </td><td>5       </td><td>5       </td><td> 5      </td><td> 2      </td><td>Young   </td></tr>\n\t<tr><td>4       </td><td>10/12/08</td><td>UK      </td><td>M       </td><td>39      </td><td>3       </td><td>3       </td><td>4       </td><td>NA      </td><td>NA      </td><td>Young   </td></tr>\n\t<tr><td>5       </td><td>5/1/09  </td><td>UK      </td><td>F       </td><td>NA      </td><td>2       </td><td>2       </td><td>1       </td><td> 2      </td><td> 1      </td><td>NA      </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  manager date     country gender age q1 q2 q3 q4 q5 agecat\n1 1       10/24/08 US      M      32  5  4  5   5  5 Young \n2 2       10/28/08 US      F      45  3  5  2   5  5 Young \n3 3       10/1/08  UK      F      25  3  5  5   5  2 Young \n4 4       10/12/08 UK      M      39  3  3  4  NA NA Young \n5 5       5/1/09   UK      F      NA  2  2  1   2  1 NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {
                "scrolled": true
            },
            "source": "#step 2: create the categories\nleadership <- within(leadership, {\n    agecat <- NA\n    agecat[age > 75] <- \"Elder\"\n    agecat[age >= 55 & age <= 75] <- \"Middle Aged\"\n    agecat[age < 55] <- \"Young\"\n})\nleadership",
            "execution_count": 15
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Renaming variables"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & testDate & country & gender & age & q1 & q2 & q3 & q4 & q5 & agecat\\\\\n\\hline\n\t 1        & 10/24/08 & US       & M        & 32       & 5        & 4        & 5        &  5       &  5       & Young   \\\\\n\t 2        & 10/28/08 & US       & F        & 45       & 3        & 5        & 2        &  5       &  5       & Young   \\\\\n\t 3        & 10/1/08  & UK       & F        & 25       & 3        & 5        & 5        &  5       &  2       & Young   \\\\\n\t 4        & 10/12/08 & UK       & M        & 39       & 3        & 3        & 4        & NA       & NA       & Young   \\\\\n\t 5        & 5/1/09   & UK       & F        & NA       & 2        & 2        & 1        &  2       &  1       & NA      \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1       </td><td>10/24/08</td><td>US      </td><td>M       </td><td>32      </td><td>5       </td><td>4       </td><td>5       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>2       </td><td>10/28/08</td><td>US      </td><td>F       </td><td>45      </td><td>3       </td><td>5       </td><td>2       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>3       </td><td>10/1/08 </td><td>UK      </td><td>F       </td><td>25      </td><td>3       </td><td>5       </td><td>5       </td><td> 5      </td><td> 2      </td><td>Young   </td></tr>\n\t<tr><td>4       </td><td>10/12/08</td><td>UK      </td><td>M       </td><td>39      </td><td>3       </td><td>3       </td><td>4       </td><td>NA      </td><td>NA      </td><td>Young   </td></tr>\n\t<tr><td>5       </td><td>5/1/09  </td><td>UK      </td><td>F       </td><td>NA      </td><td>2       </td><td>2       </td><td>1       </td><td> 2      </td><td> 1      </td><td>NA      </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  manager testDate country gender age q1 q2 q3 q4 q5 agecat\n1 1       10/24/08 US      M      32  5  4  5   5  5 Young \n2 2       10/28/08 US      F      45  3  5  2   5  5 Young \n3 3       10/1/08  UK      F      25  3  5  5   5  2 Young \n4 4       10/12/08 UK      M      39  3  3  4  NA NA Young \n5 5       5/1/09   UK      F      NA  2  2  1   2  1 NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#here's one way:\nnames(leadership)[2] <- \"testDate\"\nleadership",
            "execution_count": 17
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1        & 10/24/08 & US       & M        & 32       & 5        & 4        & 5        &  5       &  5       & Young   \\\\\n\t 2        & 10/28/08 & US       & F        & 45       & 3        & 5        & 2        &  5       &  5       & Young   \\\\\n\t 3        & 10/1/08  & UK       & F        & 25       & 3        & 5        & 5        &  5       &  2       & Young   \\\\\n\t 4        & 10/12/08 & UK       & M        & 39       & 3        & 3        & 4        & NA       & NA       & Young   \\\\\n\t 5        & 5/1/09   & UK       & F        & NA       & 2        & 2        & 1        &  2       &  1       & NA      \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1       </td><td>10/24/08</td><td>US      </td><td>M       </td><td>32      </td><td>5       </td><td>4       </td><td>5       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>2       </td><td>10/28/08</td><td>US      </td><td>F       </td><td>45      </td><td>3       </td><td>5       </td><td>2       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>3       </td><td>10/1/08 </td><td>UK      </td><td>F       </td><td>25      </td><td>3       </td><td>5       </td><td>5       </td><td> 5      </td><td> 2      </td><td>Young   </td></tr>\n\t<tr><td>4       </td><td>10/12/08</td><td>UK      </td><td>M       </td><td>39      </td><td>3       </td><td>3       </td><td>4       </td><td>NA      </td><td>NA      </td><td>Young   </td></tr>\n\t<tr><td>5       </td><td>5/1/09  </td><td>UK      </td><td>F       </td><td>NA      </td><td>2       </td><td>2       </td><td>1       </td><td> 2      </td><td> 1      </td><td>NA      </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  manager testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1       10/24/08 US      M      32  5     4     5      5     5    Young \n2 2       10/28/08 US      F      45  3     5     2      5     5    Young \n3 3       10/1/08  UK      F      25  3     5     5      5     2    Young \n4 4       10/12/08 UK      M      39  3     3     4     NA    NA    Young \n5 5       5/1/09   UK      F      NA  2     2     1      2     1    NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another example:\nnames(leadership)[6:10] <- c(\"item1\", \"item2\", \"item3\", \"item4\", \"item5\")\nleadership",
            "execution_count": 18
        },
        {
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stderr",
                    "text": "Installing package into \u2018/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs\u2019\n(as \u2018lib\u2019 is unspecified)\n"
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another way:\n#using the rename function in the \"plyr\" package\ninstall.packages(\"plyr\")",
            "execution_count": 21
        },
        {
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stderr",
                    "text": "\nAttaching package: \u2018plyr\u2019\n\nThe following objects are masked from \u2018package:SparkR\u2019:\n\n    arrange, count, desc, join, mutate, rename, summarize, take\n\n"
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#load libraries\nlibrary(plyr)",
            "execution_count": 22
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1        & 10/24/08 & US       & M        & 32       & 5        & 4        & 5        &  5       &  5       & Young   \\\\\n\t 2        & 10/28/08 & US       & F        & 45       & 3        & 5        & 2        &  5       &  5       & Young   \\\\\n\t 3        & 10/1/08  & UK       & F        & 25       & 3        & 5        & 5        &  5       &  2       & Young   \\\\\n\t 4        & 10/12/08 & UK       & M        & 39       & 3        & 3        & 4        & NA       & NA       & Young   \\\\\n\t 5        & 5/1/09   & UK       & F        & NA       & 2        & 2        & 1        &  2       &  1       & NA      \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1       </td><td>10/24/08</td><td>US      </td><td>M       </td><td>32      </td><td>5       </td><td>4       </td><td>5       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>2       </td><td>10/28/08</td><td>US      </td><td>F       </td><td>45      </td><td>3       </td><td>5       </td><td>2       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>3       </td><td>10/1/08 </td><td>UK      </td><td>F       </td><td>25      </td><td>3       </td><td>5       </td><td>5       </td><td> 5      </td><td> 2      </td><td>Young   </td></tr>\n\t<tr><td>4       </td><td>10/12/08</td><td>UK      </td><td>M       </td><td>39      </td><td>3       </td><td>3       </td><td>4       </td><td>NA      </td><td>NA      </td><td>Young   </td></tr>\n\t<tr><td>5       </td><td>5/1/09  </td><td>UK      </td><td>F       </td><td>NA      </td><td>2       </td><td>2       </td><td>1       </td><td> 2      </td><td> 1      </td><td>NA      </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1         10/24/08 US      M      32  5     4     5      5     5    Young \n2 2         10/28/08 US      F      45  3     5     2      5     5    Young \n3 3         10/1/08  UK      F      25  3     5     5      5     2    Young \n4 4         10/12/08 UK      M      39  3     3     4     NA    NA    Young \n5 5         5/1/09   UK      F      NA  2     2     1      2     1    NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#use the rename() function\nleadership <- rename(leadership, c(manager = \"managerID\"))\nleadership",
            "execution_count": 23
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Missing values"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{enumerate*}\n\\item FALSE\n\\item FALSE\n\\item FALSE\n\\item TRUE\n\\end{enumerate*}\n",
                        "text/html": "<ol class=list-inline>\n\t<li>FALSE</li>\n\t<li>FALSE</li>\n\t<li>FALSE</li>\n\t<li>TRUE</li>\n</ol>\n",
                        "text/plain": "[1] FALSE FALSE FALSE  TRUE",
                        "text/markdown": "1. FALSE\n2. FALSE\n3. FALSE\n4. TRUE\n\n\n"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#identifying missing values using the is.na() function\n#example:\ny <- c(1, 2, 3, NA)\nis.na(y)",
            "execution_count": 24
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE &  TRUE &  TRUE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE &  TRUE & FALSE & FALSE & FALSE & FALSE & FALSE &  TRUE\\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td><td> TRUE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "     managerID testDate country gender age   item1 item2 item3 item4 item5\n[1,] FALSE     FALSE    FALSE   FALSE  FALSE FALSE FALSE FALSE FALSE FALSE\n[2,] FALSE     FALSE    FALSE   FALSE  FALSE FALSE FALSE FALSE FALSE FALSE\n[3,] FALSE     FALSE    FALSE   FALSE  FALSE FALSE FALSE FALSE FALSE FALSE\n[4,] FALSE     FALSE    FALSE   FALSE  FALSE FALSE FALSE FALSE  TRUE  TRUE\n[5,] FALSE     FALSE    FALSE   FALSE   TRUE FALSE FALSE FALSE FALSE FALSE\n     agecat\n[1,] FALSE \n[2,] FALSE \n[3,] FALSE \n[4,] FALSE \n[5,]  TRUE ",
                        "text/markdown": "1. FALSE\n2. FALSE\n3. FALSE\n4. FALSE\n5. FALSE\n6. FALSE\n7. FALSE\n8. FALSE\n9. FALSE\n10. FALSE\n11. FALSE\n12. FALSE\n13. FALSE\n14. FALSE\n15. FALSE\n16. FALSE\n17. FALSE\n18. FALSE\n19. FALSE\n20. FALSE\n21. FALSE\n22. FALSE\n23. FALSE\n24. FALSE\n25. TRUE\n26. FALSE\n27. FALSE\n28. FALSE\n29. FALSE\n30. FALSE\n31. FALSE\n32. FALSE\n33. FALSE\n34. FALSE\n35. FALSE\n36. FALSE\n37. FALSE\n38. FALSE\n39. FALSE\n40. FALSE\n41. FALSE\n42. FALSE\n43. FALSE\n44. TRUE\n45. FALSE\n46. FALSE\n47. FALSE\n48. FALSE\n49. TRUE\n50. FALSE\n51. FALSE\n52. FALSE\n53. FALSE\n54. FALSE\n55. TRUE\n\n\n"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another example\nis.na(leadership)",
            "execution_count": 25
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### Excluding missing values from analyses"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/plain": "[1] NA"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#sometimes you need to eliminate missing vaues from your analysis.  for example:\nx <- c(1, 2, NA, 3)\ny <- x[1] + x[2] + x[3] + x[4] #will result in NA\ny",
            "execution_count": 29
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/plain": "[1] NA"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "z <- sum(x) #will result in NA\nz",
            "execution_count": 30
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "6",
                        "text/html": "6",
                        "text/plain": "[1] 6",
                        "text/markdown": "6"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#one way to fix this is to use the na.rm = T option, as follows:\nz <- sum(x, na.rm = T)\nz",
            "execution_count": 31
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1        & 10/24/08 & US       & M        & 32       & 5        & 4        & 5        &  5       &  5       & Young   \\\\\n\t 2        & 10/28/08 & US       & F        & 45       & 3        & 5        & 2        &  5       &  5       & Young   \\\\\n\t 3        & 10/1/08  & UK       & F        & 25       & 3        & 5        & 5        &  5       &  2       & Young   \\\\\n\t 4        & 10/12/08 & UK       & M        & 39       & 3        & 3        & 4        & NA       & NA       & Young   \\\\\n\t 5        & 5/1/09   & UK       & F        & NA       & 2        & 2        & 1        &  2       &  1       & NA      \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1       </td><td>10/24/08</td><td>US      </td><td>M       </td><td>32      </td><td>5       </td><td>4       </td><td>5       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>2       </td><td>10/28/08</td><td>US      </td><td>F       </td><td>45      </td><td>3       </td><td>5       </td><td>2       </td><td> 5      </td><td> 5      </td><td>Young   </td></tr>\n\t<tr><td>3       </td><td>10/1/08 </td><td>UK      </td><td>F       </td><td>25      </td><td>3       </td><td>5       </td><td>5       </td><td> 5      </td><td> 2      </td><td>Young   </td></tr>\n\t<tr><td>4       </td><td>10/12/08</td><td>UK      </td><td>M       </td><td>39      </td><td>3       </td><td>3       </td><td>4       </td><td>NA      </td><td>NA      </td><td>Young   </td></tr>\n\t<tr><td>5       </td><td>5/1/09  </td><td>UK      </td><td>F       </td><td>NA      </td><td>2       </td><td>2       </td><td>1       </td><td> 2      </td><td> 1      </td><td>NA      </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1         10/24/08 US      M      32  5     4     5      5     5    Young \n2 2         10/28/08 US      F      45  3     5     2      5     5    Young \n3 3         10/1/08  UK      F      25  3     5     5      5     2    Young \n4 4         10/12/08 UK      M      39  3     3     4     NA    NA    Young \n5 5         5/1/09   UK      F      NA  2     2     1      2     1    NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another way to remove missing data is by using the na.omit() function\nleadership #without removing missing data",
            "execution_count": 32
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1        & 10/24/08 & US       & M        & 32       & 5        & 4        & 5        & 5        & 5        & Young   \\\\\n\t 2        & 10/28/08 & US       & F        & 45       & 3        & 5        & 2        & 5        & 5        & Young   \\\\\n\t 3        & 10/1/08  & UK       & F        & 25       & 3        & 5        & 5        & 5        & 2        & Young   \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1       </td><td>10/24/08</td><td>US      </td><td>M       </td><td>32      </td><td>5       </td><td>4       </td><td>5       </td><td>5       </td><td>5       </td><td>Young   </td></tr>\n\t<tr><td>2       </td><td>10/28/08</td><td>US      </td><td>F       </td><td>45      </td><td>3       </td><td>5       </td><td>2       </td><td>5       </td><td>5       </td><td>Young   </td></tr>\n\t<tr><td>3       </td><td>10/1/08 </td><td>UK      </td><td>F       </td><td>25      </td><td>3       </td><td>5       </td><td>5       </td><td>5       </td><td>2       </td><td>Young   </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1         10/24/08 US      M      32  5     4     5     5     5     Young \n2 2         10/28/08 US      F      45  3     5     2     5     5     Young \n3 3         10/1/08  UK      F      25  3     5     5     5     2     Young "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "newdata <- na.omit(leadership)\nnewdata",
            "execution_count": 33
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Date values"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          &  5         &  5         & Young     \\\\\n\t 2          & 2008-10-28 & US         & F          & 45         & 3          & 5          & 2          &  5         &  5         & Young     \\\\\n\t 3          & 2008-10-01 & UK         & F          & 25         & 3          & 5          & 5          &  5         &  2         & Young     \\\\\n\t 4          & 2008-10-12 & UK         & M          & 39         & 3          & 3          & 4          & NA         & NA         & Young     \\\\\n\t 5          & 2009-05-01 & UK         & F          & NA         & 2          & 2          & 1          &  2         &  1         & NA        \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><td>2         </td><td>2008-10-28</td><td>US        </td><td>F         </td><td>45        </td><td>3         </td><td>5         </td><td>2         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><td>3         </td><td>2008-10-01</td><td>UK        </td><td>F         </td><td>25        </td><td>3         </td><td>5         </td><td>5         </td><td> 5        </td><td> 2        </td><td>Young     </td></tr>\n\t<tr><td>4         </td><td>2008-10-12</td><td>UK        </td><td>M         </td><td>39        </td><td>3         </td><td>3         </td><td>4         </td><td>NA        </td><td>NA        </td><td>Young     </td></tr>\n\t<tr><td>5         </td><td>2009-05-01</td><td>UK        </td><td>F         </td><td>NA        </td><td>2         </td><td>2         </td><td>1         </td><td> 2        </td><td> 1        </td><td>NA        </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 item4 item5 agecat\n1 1         2008-10-24 US      M      32  5     4     5      5     5    Young \n2 2         2008-10-28 US      F      45  3     5     2      5     5    Young \n3 3         2008-10-01 UK      F      25  3     5     5      5     2    Young \n4 4         2008-10-12 UK      M      39  3     3     4     NA    NA    Young \n5 5         2009-05-01 UK      F      NA  2     2     1      2     1    NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#convert \"testDate\" into a date type by using the as.Date() function\n#default format in R is yyyy-mm-dd\nmyformat <- \"%m/%d/%y\"\nleadership$testDate <- as.Date(leadership$testDate, myformat)\nleadership",
            "execution_count": 38
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/plain": "Time difference of 2535 days"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#when R stores dates internally, they're represented as number of days since 1 Jan 1970 with negative values for earlier dates\n#so you can perform arithmetic operations on them\nstartDate <- as.Date(\"2004-02-13\")\nendDate <- as.Date(\"2011-01-22\")\ndays <- endDate - startDate\ndays",
            "execution_count": 39
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/plain": "Time difference of 116.4286 weeks"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#you can use the difftime() function to calculate a time interval\n#for example if you want to calculate how old someone born on 29 May 2015 is, you could do it as follows:\ntoday <- Sys.Date()\ndob <- as.Date(\"2015-05-29\")\ndifftime(today, dob, units = \"weeks\")",
            "execution_count": 40
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#you can convert date variables into character variables using the as.character() function\n#this conversion allows you to apply a range of character functions to the data values\n#(subsetting, replacement, concatenation, etc)",
            "execution_count": 41
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Type conversions"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "TRUE",
                        "text/html": "TRUE",
                        "text/plain": "[1] TRUE",
                        "text/markdown": "TRUE"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#example of converting one data type into another\na <- c(1, 2, 3)\nis.numeric(a)",
            "execution_count": 42
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "TRUE",
                        "text/html": "TRUE",
                        "text/plain": "[1] TRUE",
                        "text/markdown": "TRUE"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "is.vector(a)",
            "execution_count": 43
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{enumerate*}\n\\item '1'\n\\item '2'\n\\item '3'\n\\end{enumerate*}\n",
                        "text/html": "<ol class=list-inline>\n\t<li>'1'</li>\n\t<li>'2'</li>\n\t<li>'3'</li>\n</ol>\n",
                        "text/plain": "[1] \"1\" \"2\" \"3\"",
                        "text/markdown": "1. '1'\n2. '2'\n3. '3'\n\n\n"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "b <- as.character(a)\nb",
            "execution_count": 47
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "FALSE",
                        "text/html": "FALSE",
                        "text/plain": "[1] FALSE",
                        "text/markdown": "FALSE"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "is.numeric(b)",
            "execution_count": 48
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "TRUE",
                        "text/html": "TRUE",
                        "text/plain": "[1] TRUE",
                        "text/markdown": "TRUE"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "is.vector(b)",
            "execution_count": 49
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "TRUE",
                        "text/html": "TRUE",
                        "text/plain": "[1] TRUE",
                        "text/markdown": "TRUE"
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "is.character(b)",
            "execution_count": 50
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Sorting data"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n  & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t3 & 3          & 2008-10-01 & UK         & F          & 25         & 3          & 5          & 5          &  5         &  2         & Young     \\\\\n\t1 & 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          &  5         &  5         & Young     \\\\\n\t4 & 4          & 2008-10-12 & UK         & M          & 39         & 3          & 3          & 4          & NA         & NA         & Young     \\\\\n\t2 & 2          & 2008-10-28 & US         & F          & 45         & 3          & 5          & 2          &  5         &  5         & Young     \\\\\n\t5 & 5          & 2009-05-01 & UK         & F          & NA         & 2          & 2          & 1          &  2         &  1         & NA        \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>3</th><td>3         </td><td>2008-10-01</td><td>UK        </td><td>F         </td><td>25        </td><td>3         </td><td>5         </td><td>5         </td><td> 5        </td><td> 2        </td><td>Young     </td></tr>\n\t<tr><th scope=row>1</th><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><th scope=row>4</th><td>4         </td><td>2008-10-12</td><td>UK        </td><td>M         </td><td>39        </td><td>3         </td><td>3         </td><td>4         </td><td>NA        </td><td>NA        </td><td>Young     </td></tr>\n\t<tr><th scope=row>2</th><td>2         </td><td>2008-10-28</td><td>US        </td><td>F         </td><td>45        </td><td>3         </td><td>5         </td><td>2         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><th scope=row>5</th><td>5         </td><td>2009-05-01</td><td>UK        </td><td>F         </td><td>NA        </td><td>2         </td><td>2         </td><td>1         </td><td> 2        </td><td> 1        </td><td>NA        </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 item4 item5 agecat\n3 3         2008-10-01 UK      F      25  3     5     5      5     2    Young \n1 1         2008-10-24 US      M      32  5     4     5      5     5    Young \n4 4         2008-10-12 UK      M      39  3     3     4     NA    NA    Young \n2 2         2008-10-28 US      F      45  3     5     2      5     5    Young \n5 5         2009-05-01 UK      F      NA  2     2     1      2     1    NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#the order() function lets you sort a data frame in R.  The default sorting order is ascending\n#prepend the sorting variable with a minus sign to indicate descending order\n#example:\nnewdata <- leadership[order(leadership$age),] #creates a new dataset sorted from youngest manager to oldest manager\nnewdata",
            "execution_count": 51
        },
        {
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stderr",
                    "text": "The following objects are masked _by_ .GlobalEnv:\n\n    age, country, gender\n\n"
                },
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n  & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t3 & 3          & 2008-10-01 & UK         & F          & 25         & 3          & 5          & 5          &  5         &  2         & Young     \\\\\n\t2 & 2          & 2008-10-28 & US         & F          & 45         & 3          & 5          & 2          &  5         &  5         & Young     \\\\\n\t5 & 5          & 2009-05-01 & UK         & F          & NA         & 2          & 2          & 1          &  2         &  1         & NA        \\\\\n\t1 & 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          &  5         &  5         & Young     \\\\\n\t4 & 4          & 2008-10-12 & UK         & M          & 39         & 3          & 3          & 4          & NA         & NA         & Young     \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>3</th><td>3         </td><td>2008-10-01</td><td>UK        </td><td>F         </td><td>25        </td><td>3         </td><td>5         </td><td>5         </td><td> 5        </td><td> 2        </td><td>Young     </td></tr>\n\t<tr><th scope=row>2</th><td>2         </td><td>2008-10-28</td><td>US        </td><td>F         </td><td>45        </td><td>3         </td><td>5         </td><td>2         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><th scope=row>5</th><td>5         </td><td>2009-05-01</td><td>UK        </td><td>F         </td><td>NA        </td><td>2         </td><td>2         </td><td>1         </td><td> 2        </td><td> 1        </td><td>NA        </td></tr>\n\t<tr><th scope=row>1</th><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><th scope=row>4</th><td>4         </td><td>2008-10-12</td><td>UK        </td><td>M         </td><td>39        </td><td>3         </td><td>3         </td><td>4         </td><td>NA        </td><td>NA        </td><td>Young     </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 item4 item5 agecat\n3 3         2008-10-01 UK      F      25  3     5     5      5     2    Young \n2 2         2008-10-28 US      F      45  3     5     2      5     5    Young \n5 5         2009-05-01 UK      F      NA  2     2     1      2     1    NA    \n1 1         2008-10-24 US      M      32  5     4     5      5     5    Young \n4 4         2008-10-12 UK      M      39  3     3     4     NA    NA    Young "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another example\nattach(leadership)\nnewdata <- leadership[order(gender, age),] #creates a new dataset sorting rows into female, followed by male\n                                           #and youngest to oldest within each gender\ndetach(leadership)\nnewdata",
            "execution_count": 52
        },
        {
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stderr",
                    "text": "The following objects are masked _by_ .GlobalEnv:\n\n    age, country, gender\n\n"
                },
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n  & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t5 & 5          & 2009-05-01 & UK         & F          & NA         & 2          & 2          & 1          &  2         &  1         & NA        \\\\\n\t2 & 2          & 2008-10-28 & US         & F          & 45         & 3          & 5          & 2          &  5         &  5         & Young     \\\\\n\t3 & 3          & 2008-10-01 & UK         & F          & 25         & 3          & 5          & 5          &  5         &  2         & Young     \\\\\n\t4 & 4          & 2008-10-12 & UK         & M          & 39         & 3          & 3          & 4          & NA         & NA         & Young     \\\\\n\t1 & 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          &  5         &  5         & Young     \\\\\n\\end{tabular}\n",
                        "text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>5</th><td>5         </td><td>2009-05-01</td><td>UK        </td><td>F         </td><td>NA        </td><td>2         </td><td>2         </td><td>1         </td><td> 2        </td><td> 1        </td><td>NA        </td></tr>\n\t<tr><th scope=row>2</th><td>2         </td><td>2008-10-28</td><td>US        </td><td>F         </td><td>45        </td><td>3         </td><td>5         </td><td>2         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><th scope=row>3</th><td>3         </td><td>2008-10-01</td><td>UK        </td><td>F         </td><td>25        </td><td>3         </td><td>5         </td><td>5         </td><td> 5        </td><td> 2        </td><td>Young     </td></tr>\n\t<tr><th scope=row>4</th><td>4         </td><td>2008-10-12</td><td>UK        </td><td>M         </td><td>39        </td><td>3         </td><td>3         </td><td>4         </td><td>NA        </td><td>NA        </td><td>Young     </td></tr>\n\t<tr><th scope=row>1</th><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n</tbody>\n</table>\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 item4 item5 agecat\n5 5         2009-05-01 UK      F      NA  2     2     1      2     1    NA    \n2 2         2008-10-28 US      F      45  3     5     2      5     5    Young \n3 3         2008-10-01 UK      F      25  3     5     5      5     2    Young \n4 4         2008-10-12 UK      M      39  3     3     4     NA    NA    Young \n1 1         2008-10-24 US      M      32  5     4     5      5     5    Young "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another example:\nattach(leadership)\nnewdata <- leadership[order(gender, -age),] #creates a new dataset sorting rows into female, followed by male\n                                            #and oldest to youngest within each gender\ndetach(leadership)\nnewdata",
            "execution_count": 53
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Merging datasets"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#if your data exists in different locations, you will need to combine it",
            "execution_count": 55
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### Adding columns to a data frame"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#adding columns to a data frame:\n#to merge two data frames horizontally, you use the merge() function\n#syntax: total <- merge(dataFrameA, dataFrameB, by = \"ID\") #by = the common key variable",
            "execution_count": 54
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### Horizontal concatenation"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#another way to merge two matrices or data frames where you don't need to specify a common key,\n#is by using the cbind() function.  syntax:\n#total <- cbind(a, b)\n#cbind() concatenates objects a and b.  each object must have the same number of rows and be sorted in the same order",
            "execution_count": 56
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### Adding rows to a data frame"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#use the rbind() function to add rows to a data frame.  syntax:\n#total <- rbind(a, b)\n#a and b must have the same number of variables but they don't need to be in the same order\n#if a has more variables than b, then either 1) delete the extra variables or 2) create the same variable in b and assign them to NAs",
            "execution_count": 57
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Subsetting datasets"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#R has powerful indexing features to select and exclude variables, observations or both",
            "execution_count": 58
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### Selecting (keeping) variables"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>2 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>5 </td><td> 5</td><td> 2</td></tr>\n\t<tr><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n\t<tr><td>2 </td><td>2 </td><td>1 </td><td> 2</td><td> 1</td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllll}\n item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t 5  & 4  & 5  &  5 &  5\\\\\n\t 3  & 5  & 2  &  5 &  5\\\\\n\t 3  & 5  & 5  &  5 &  2\\\\\n\t 3  & 3  & 4  & NA & NA\\\\\n\t 2  & 2  & 1  &  2 &  1\\\\\n\\end{tabular}\n",
                        "text/plain": "  item1 item2 item3 item4 item5\n1 5     4     5      5     5   \n2 3     5     2      5     5   \n3 3     5     5      5     2   \n4 3     3     4     NA    NA   \n5 2     2     1      2     1   "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#selecting using [row indices, column indices].  example:\nnewdata <- leadership[, c(6:10)] #selects all rows and columns 6 through 10 of the leadership data frame\nnewdata",
            "execution_count": 60
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>2 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>5 </td><td> 5</td><td> 2</td></tr>\n\t<tr><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n\t<tr><td>2 </td><td>2 </td><td>1 </td><td> 2</td><td> 1</td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllll}\n item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t 5  & 4  & 5  &  5 &  5\\\\\n\t 3  & 5  & 2  &  5 &  5\\\\\n\t 3  & 5  & 5  &  5 &  2\\\\\n\t 3  & 3  & 4  & NA & NA\\\\\n\t 2  & 2  & 1  &  2 &  1\\\\\n\\end{tabular}\n",
                        "text/plain": "  item1 item2 item3 item4 item5\n1 5     4     5      5     5   \n2 3     5     2      5     5   \n3 3     5     5      5     2   \n4 3     3     4     NA    NA   \n5 2     2     1      2     1   "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another example\nmyvars <- c(\"item1\", \"item2\", \"item3\", \"item4\", \"item5\")\nnewdata <- leadership[myvars]\nnewdata",
            "execution_count": 62
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### Excluding (dropping) variables"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {
                "collapsed": true
            },
            "source": "#there are many reasons to exclude variables.  for example, if a variable has many missing values,\n#you may want to drop it from further analysis",
            "execution_count": 63
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td>Young     </td></tr>\n\t<tr><td>2         </td><td>2008-10-28</td><td>US        </td><td>F         </td><td>45        </td><td>3         </td><td>5         </td><td>2         </td><td>Young     </td></tr>\n\t<tr><td>3         </td><td>2008-10-01</td><td>UK        </td><td>F         </td><td>25        </td><td>3         </td><td>5         </td><td>5         </td><td>Young     </td></tr>\n\t<tr><td>4         </td><td>2008-10-12</td><td>UK        </td><td>M         </td><td>39        </td><td>3         </td><td>3         </td><td>4         </td><td>Young     </td></tr>\n\t<tr><td>5         </td><td>2009-05-01</td><td>UK        </td><td>F         </td><td>NA        </td><td>2         </td><td>2         </td><td>1         </td><td>NA        </td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & agecat\\\\\n\\hline\n\t 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          & Young     \\\\\n\t 2          & 2008-10-28 & US         & F          & 45         & 3          & 5          & 2          & Young     \\\\\n\t 3          & 2008-10-01 & UK         & F          & 25         & 3          & 5          & 5          & Young     \\\\\n\t 4          & 2008-10-12 & UK         & M          & 39         & 3          & 3          & 4          & Young     \\\\\n\t 5          & 2009-05-01 & UK         & F          & NA         & 2          & 2          & 1          & NA        \\\\\n\\end{tabular}\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 agecat\n1 1         2008-10-24 US      M      32  5     4     5     Young \n2 2         2008-10-28 US      F      45  3     5     2     Young \n3 3         2008-10-01 UK      F      25  3     5     5     Young \n4 4         2008-10-12 UK      M      39  3     3     4     Young \n5 5         2009-05-01 UK      F      NA  2     2     1     NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#here is an example: lets say you want to drop item4 and item5 from the leadership data frame.\n#here is one way you can do this:\nmyvars <- names(leadership) %in% c(\"item4\", \"item5\") #names(leadership) produces a character vector with names of all variables\n                           #names(leadership) %in% c(\"item4\", \"item5\") returns a logical vector with TRUE for each element in\n                           #names(leadership) that matches either \"item4\" or \"item5\"\nnewdata <- leadership[!myvars] #the ! operator reverses the logical values\nnewdata",
            "execution_count": 64
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td>Young     </td></tr>\n\t<tr><td>2         </td><td>2008-10-28</td><td>US        </td><td>F         </td><td>45        </td><td>3         </td><td>5         </td><td>2         </td><td>Young     </td></tr>\n\t<tr><td>3         </td><td>2008-10-01</td><td>UK        </td><td>F         </td><td>25        </td><td>3         </td><td>5         </td><td>5         </td><td>Young     </td></tr>\n\t<tr><td>4         </td><td>2008-10-12</td><td>UK        </td><td>M         </td><td>39        </td><td>3         </td><td>3         </td><td>4         </td><td>Young     </td></tr>\n\t<tr><td>5         </td><td>2009-05-01</td><td>UK        </td><td>F         </td><td>NA        </td><td>2         </td><td>2         </td><td>1         </td><td>NA        </td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & agecat\\\\\n\\hline\n\t 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          & Young     \\\\\n\t 2          & 2008-10-28 & US         & F          & 45         & 3          & 5          & 2          & Young     \\\\\n\t 3          & 2008-10-01 & UK         & F          & 25         & 3          & 5          & 5          & Young     \\\\\n\t 4          & 2008-10-12 & UK         & M          & 39         & 3          & 3          & 4          & Young     \\\\\n\t 5          & 2009-05-01 & UK         & F          & NA         & 2          & 2          & 1          & NA        \\\\\n\\end{tabular}\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 agecat\n1 1         2008-10-24 US      M      32  5     4     5     Young \n2 2         2008-10-28 US      F      45  3     5     2     Young \n3 3         2008-10-01 UK      F      25  3     5     5     Young \n4 4         2008-10-12 UK      M      39  3     3     4     Young \n5 5         2009-05-01 UK      F      NA  2     2     1     NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another way of dropping variables\n#since item4 is column 9 and item 5 is column 10, you can also do:\nnewdata <- leadership[c(-9, -10)]\nnewdata",
            "execution_count": 65
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### Selecting observations"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td>5         </td><td>5         </td><td>Young     </td></tr>\n\t<tr><td>2         </td><td>2008-10-28</td><td>US        </td><td>F         </td><td>45        </td><td>3         </td><td>5         </td><td>2         </td><td>5         </td><td>5         </td><td>Young     </td></tr>\n\t<tr><td>3         </td><td>2008-10-01</td><td>UK        </td><td>F         </td><td>25        </td><td>3         </td><td>5         </td><td>5         </td><td>5         </td><td>2         </td><td>Young     </td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          & 5          & 5          & Young     \\\\\n\t 2          & 2008-10-28 & US         & F          & 45         & 3          & 5          & 2          & 5          & 5          & Young     \\\\\n\t 3          & 2008-10-01 & UK         & F          & 25         & 3          & 5          & 5          & 5          & 2          & Young     \\\\\n\\end{tabular}\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 item4 item5 agecat\n1 1         2008-10-24 US      M      32  5     4     5     5     5     Young \n2 2         2008-10-28 US      F      45  3     5     2     5     5     Young \n3 3         2008-10-01 UK      F      25  3     5     5     5     2     Young "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#selecting or excluding observations is a key aspect of data preparation\n#here are some examples:\nnewdata <- leadership[1:3,] #selects the first three rows of data\nnewdata",
            "execution_count": 66
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><th scope=row>4</th><td>4         </td><td>2008-10-12</td><td>UK        </td><td>M         </td><td>39        </td><td>3         </td><td>3         </td><td>4         </td><td>NA        </td><td>NA        </td><td>Young     </td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n  & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t1 & 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          &  5         &  5         & Young     \\\\\n\t4 & 4          & 2008-10-12 & UK         & M          & 39         & 3          & 3          & 4          & NA         & NA         & Young     \\\\\n\\end{tabular}\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 item4 item5 agecat\n1 1         2008-10-24 US      M      32  5     4     5      5     5    Young \n4 4         2008-10-12 UK      M      39  3     3     4     NA    NA    Young "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "newdata <- leadership[leadership$gender == \"M\" & #selects all men aged > 30\n                     leadership$age > 30,]\nnewdata",
            "execution_count": 67
        },
        {
            "outputs": [
                {
                    "output_type": "stream",
                    "name": "stderr",
                    "text": "The following objects are masked _by_ .GlobalEnv:\n\n    age, country, gender\n\n"
                },
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>1         </td><td>2008-10-24</td><td>US        </td><td>M         </td><td>32        </td><td>5         </td><td>4         </td><td>5         </td><td> 5        </td><td> 5        </td><td>Young     </td></tr>\n\t<tr><th scope=row>4</th><td>4         </td><td>2008-10-12</td><td>UK        </td><td>M         </td><td>39        </td><td>3         </td><td>3         </td><td>4         </td><td>NA        </td><td>NA        </td><td>Young     </td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n  & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t1 & 1          & 2008-10-24 & US         & M          & 32         & 5          & 4          & 5          &  5         &  5         & Young     \\\\\n\t4 & 4          & 2008-10-12 & UK         & M          & 39         & 3          & 3          & 4          & NA         & NA         & Young     \\\\\n\\end{tabular}\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 item4 item5 agecat\n1 1         2008-10-24 US      M      32  5     4     5      5     5    Young \n4 4         2008-10-12 UK      M      39  3     3     4     NA    NA    Young "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another way of selecting men aged > 30 by attaching the leadership data frame\nattach(leadership)\nnewdata <- leadership[gender == \"M\" & age > 30,]\ndetach(leadership)\nnewdata",
            "execution_count": 68
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>5</th><td>5         </td><td>2009-05-01</td><td>UK        </td><td>F         </td><td>NA        </td><td>2         </td><td>2         </td><td>1         </td><td>2         </td><td>1         </td><td>NA        </td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllllllllll}\n  & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t5 & 5          & 2009-05-01 & UK         & F          & NA         & 2          & 2          & 1          & 2          & 1          & NA        \\\\\n\\end{tabular}\n",
                        "text/plain": "  managerID testDate   country gender age item1 item2 item3 item4 item5 agecat\n5 5         2009-05-01 UK      F      NA  2     2     1     2     1     NA    "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#limit analysis to observations collected between 01/01/2009 and 12/31/2009\nnewdata <- leadership[leadership$testDate >= \"2009-01-01\" & leadership$testDate <= \"2009-12-31\",]\nnewdata",
            "execution_count": 69
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### The subset() function"
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th></th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>2</th><td>3 </td><td>5 </td><td>2 </td><td> 5</td></tr>\n\t<tr><th scope=row>4</th><td>3 </td><td>3 </td><td>4 </td><td>NA</td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|llll}\n  & item1 & item2 & item3 & item4\\\\\n\\hline\n\t2 & 3  & 5  & 2  &  5\\\\\n\t4 & 3  & 3  & 4  & NA\\\\\n\\end{tabular}\n",
                        "text/plain": "  item1 item2 item3 item4\n2 3     5     2      5   \n4 3     3     4     NA   "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#the subset() function is a really powerful function to select or exclude variables and observations\n#examples:\n#this statement selects all observations where the age is either 35 or over OR under 24 and returns 4 columns (items 1 through 4)\nnewdata <- subset(leadership, age >= 35 | age < 24, select = c(item1, item2, item3, item4))\nnewdata",
            "execution_count": 71
        },
        {
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {
                        "text/html": "<table>\n<thead><tr><th></th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>M </td><td>32</td><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><th scope=row>4</th><td>M </td><td>39</td><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n</tbody>\n</table>\n",
                        "text/latex": "\\begin{tabular}{r|lllllll}\n  & gender & age & item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t1 & M  & 32 & 5  & 4  & 5  &  5 &  5\\\\\n\t4 & M  & 39 & 3  & 3  & 4  & NA & NA\\\\\n\\end{tabular}\n",
                        "text/plain": "  gender age item1 item2 item3 item4 item5\n1 M      32  5     4     5      5     5   \n4 M      39  3     3     4     NA    NA   "
                    }
                }
            ],
            "cell_type": "code",
            "metadata": {},
            "source": "#another example\n#this statement selects all men > 25 years old and all columns from gender to item 5 (and everything in between)\nnewdata <- subset(leadership, gender == \"M\" & age > 25, select = gender:item5)\nnewdata",
            "execution_count": 73
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "### Random samples"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {},
            "source": "#sampling from larger datasets is a common practice in data mining and machine learning\n#for example, you may want to create a predictive model from one sample and validate its effectiveness on the other\n#the sample() function lets you take a random sample (with or without replacement) from size n of a dataset\n#R has extensive facilities for sampling (see the \"sampling\" package)",
            "execution_count": 81
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "## Using SQL statements to manipulate data frames"
        },
        {
            "outputs": [],
            "cell_type": "code",
            "metadata": {},
            "source": "#using the \"sqldf\" package, you can use SQL statements within R\n#experienced SQl users will find the sqldf package a useful addition to the R data preparation toolkit",
            "execution_count": 89
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "R with Spark 2.0",
            "name": "r-spark20",
            "language": "R"
        },
        "language_info": {
            "mimetype": "text/x-r-source",
            "codemirror_mode": "r",
            "pygments_lexer": "r",
            "version": "3.3.2",
            "name": "R",
            "file_extension": ".r"
        }
    },
    "nbformat": 4
}