Created
August 24, 2017 13:23
-
-
Save raov5/c1544f3f3f2819005cf2b98a6c32eb87 to your computer and use it in GitHub Desktop.
This notebook covers data preparation in R, including: * manipulating dates and missing values * understanding data type conversions * creating and recoding variables * sorting, merging and subsetting datasets * selecting and dropping variables
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat_minor": 1, | |
"cells": [ | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#@author: Venky Rao raove@us.ibm.com\n#@last edited: 22 Aug 2017\n#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff", | |
"execution_count": 90 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Data preparation in R" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Creating a Data Frame" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#let us begin by creating a data frame of data for us to analyze\nmanager <- c(1, 2, 3, 4, 5)\ndate <- c(\"10/24/08\", \"10/28/08\", \"10/1/08\", \"10/12/08\", \"5/1/09\")\ncountry <- c(\"US\", \"US\", \"UK\", \"UK\", \"UK\")\ngender <- c(\"M\", \"F\", \"F\", \"M\", \"F\")\nage <- c(32, 45, 25, 39, 99)\nq1 <- c(5, 3, 3, 3, 2)\nq2 <- c(4, 5, 5, 3, 2)\nq3 <- c(5, 2, 5, 4, 1)\nq4 <- c(5, 5, 5, NA, 2)\nq5 <- c(5, 5, 2, NA, 1)\nleadership <- data.frame(manager, date, country, gender, age, q1, q2, q3, q4, q5, stringsAsFactors = F)\n#if stringsAsFactors = T, then dates would be treated as factors", | |
"execution_count": 4 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>99 </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|llllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA \\\\\n\t 5 & 5/1/09 & UK & F & 99 & 2 & 2 & 1 & 2 & 1 \\\\\n\\end{tabular}\n", | |
"text/plain": " manager date country gender age q1 q2 q3 q4 q5\n1 1 10/24/08 US M 32 5 4 5 5 5\n2 2 10/28/08 US F 45 3 5 2 5 5\n3 3 10/1/08 UK F 25 3 5 5 5 2\n4 4 10/12/08 UK M 39 3 3 4 NA NA\n5 5 5/1/09 UK F 99 2 2 1 2 1" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#display the data frame that we created\nleadership", | |
"execution_count": 5 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Creating new variables" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th></tr></thead>\n<tbody>\n\t<tr><td>2</td><td>3</td></tr>\n\t<tr><td>2</td><td>4</td></tr>\n\t<tr><td>6</td><td>2</td></tr>\n\t<tr><td>4</td><td>8</td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|ll}\n x1 & x2\\\\\n\\hline\n\t 2 & 3\\\\\n\t 2 & 4\\\\\n\t 6 & 2\\\\\n\t 4 & 8\\\\\n\\end{tabular}\n", | |
"text/plain": " x1 x2\n1 2 3 \n2 2 4 \n3 6 2 \n4 4 8 " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#data preparation often involves creating new variables in a data frame\n#let's create a new data frame to explore this further\nmydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))\nmydata", | |
"execution_count": 6 | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#let's say we want to create two new variables in the data frame:\n#sumx which is the sum of x1 and x2; and\n#meanx which is the mean of x1 and x2\n#here are 3 ways of doing this:", | |
"execution_count": 7 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2 </td><td>3 </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2 </td><td>4 </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6 </td><td>2 </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4 </td><td>8 </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2 & 3 & 5 & 2.5\\\\\n\t 2 & 4 & 6 & 3.0\\\\\n\t 6 & 2 & 8 & 4.0\\\\\n\t 4 & 8 & 12 & 6.0\\\\\n\\end{tabular}\n", | |
"text/plain": " x1 x2 sumx meanx\n1 2 3 5 2.5 \n2 2 4 6 3.0 \n3 6 2 8 4.0 \n4 4 8 12 6.0 " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#option 1\nmydata$sumx <- mydata$x1 + mydata$x2\nmydata$meanx <- (mydata$x1 + mydata$x2) / 2\nmydata", | |
"execution_count": 8 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2 </td><td>3 </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2 </td><td>4 </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6 </td><td>2 </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4 </td><td>8 </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2 & 3 & 5 & 2.5\\\\\n\t 2 & 4 & 6 & 3.0\\\\\n\t 6 & 2 & 8 & 4.0\\\\\n\t 4 & 8 & 12 & 6.0\\\\\n\\end{tabular}\n", | |
"text/plain": " x1 x2 sumx meanx\n1 2 3 5 2.5 \n2 2 4 6 3.0 \n3 6 2 8 4.0 \n4 4 8 12 6.0 " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#option 2\nattach(mydata)\nmydata$sumx <- x1 + x2\nmydata$meanx <- (x1 + x2) / 2\ndetach(mydata)\nmydata", | |
"execution_count": 9 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2 </td><td>3 </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2 </td><td>4 </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6 </td><td>2 </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4 </td><td>8 </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2 & 3 & 5 & 2.5\\\\\n\t 2 & 4 & 6 & 3.0\\\\\n\t 6 & 2 & 8 & 4.0\\\\\n\t 4 & 8 & 12 & 6.0\\\\\n\\end{tabular}\n", | |
"text/plain": " x1 x2 sumx meanx\n1 2 3 5 2.5 \n2 2 4 6 3.0 \n3 6 2 8 4.0 \n4 4 8 12 6.0 " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#option 3\nmydata <- transform(mydata, sumx = x1 + x2, meanx = (x1 + x2) / 2)\nmydata", | |
"execution_count": 10 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Recoding variables" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#recoding involves creating new values of a variable conditional on the existing values of\n#the same and / or other variables. for example: you may want to\n#1. change a continuous variable into a set of categories\n#2. replace miscoded values with correct values\n#3. create a pass / fail variable based on a set of cutoff scores\n#to recode data, you use one of R's logical operators (i.e. expressions that return TRUE or FALSE)", | |
"execution_count": 11 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|llllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " manager date country gender age q1 q2 q3 q4 q5\n1 1 10/24/08 US M 32 5 4 5 5 5\n2 2 10/28/08 US F 45 3 5 2 5 5\n3 3 10/1/08 UK F 25 3 5 5 5 2\n4 4 10/12/08 UK M 39 3 3 4 NA NA\n5 5 5/1/09 UK F NA 2 2 1 2 1" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#example: recode ages of managers in the \"leadership\" data frame from continuous to categorical\n#step 1: recode age 99 to indicate missing value since 99 is obviously incorrect\nleadership$age[leadership$age == 99] <- NA\nleadership", | |
"execution_count": 14 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " manager date country gender age q1 q2 q3 q4 q5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": { | |
"scrolled": true | |
}, | |
"source": "#step 2: create the categories\nleadership <- within(leadership, {\n agecat <- NA\n agecat[age > 75] <- \"Elder\"\n agecat[age >= 55 & age <= 75] <- \"Middle Aged\"\n agecat[age < 55] <- \"Young\"\n})\nleadership", | |
"execution_count": 15 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Renaming variables" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & testDate & country & gender & age & q1 & q2 & q3 & q4 & q5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " manager testDate country gender age q1 q2 q3 q4 q5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#here's one way:\nnames(leadership)[2] <- \"testDate\"\nleadership", | |
"execution_count": 17 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " manager testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another example:\nnames(leadership)[6:10] <- c(\"item1\", \"item2\", \"item3\", \"item4\", \"item5\")\nleadership", | |
"execution_count": 18 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stderr", | |
"text": "Installing package into \u2018/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs\u2019\n(as \u2018lib\u2019 is unspecified)\n" | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another way:\n#using the rename function in the \"plyr\" package\ninstall.packages(\"plyr\")", | |
"execution_count": 21 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stderr", | |
"text": "\nAttaching package: \u2018plyr\u2019\n\nThe following objects are masked from \u2018package:SparkR\u2019:\n\n arrange, count, desc, join, mutate, rename, summarize, take\n\n" | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#load libraries\nlibrary(plyr)", | |
"execution_count": 22 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#use the rename() function\nleadership <- rename(leadership, c(manager = \"managerID\"))\nleadership", | |
"execution_count": 23 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Missing values" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{enumerate*}\n\\item FALSE\n\\item FALSE\n\\item FALSE\n\\item TRUE\n\\end{enumerate*}\n", | |
"text/html": "<ol class=list-inline>\n\t<li>FALSE</li>\n\t<li>FALSE</li>\n\t<li>FALSE</li>\n\t<li>TRUE</li>\n</ol>\n", | |
"text/plain": "[1] FALSE FALSE FALSE TRUE", | |
"text/markdown": "1. FALSE\n2. FALSE\n3. FALSE\n4. TRUE\n\n\n" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#identifying missing values using the is.na() function\n#example:\ny <- c(1, 2, 3, NA)\nis.na(y)", | |
"execution_count": 24 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & TRUE & TRUE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & TRUE & FALSE & FALSE & FALSE & FALSE & FALSE & TRUE\\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td><td> TRUE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5\n[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE\n agecat\n[1,] FALSE \n[2,] FALSE \n[3,] FALSE \n[4,] FALSE \n[5,] TRUE ", | |
"text/markdown": "1. FALSE\n2. FALSE\n3. FALSE\n4. FALSE\n5. FALSE\n6. FALSE\n7. FALSE\n8. FALSE\n9. FALSE\n10. FALSE\n11. FALSE\n12. FALSE\n13. FALSE\n14. FALSE\n15. FALSE\n16. FALSE\n17. FALSE\n18. FALSE\n19. FALSE\n20. FALSE\n21. FALSE\n22. FALSE\n23. FALSE\n24. FALSE\n25. TRUE\n26. FALSE\n27. FALSE\n28. FALSE\n29. FALSE\n30. FALSE\n31. FALSE\n32. FALSE\n33. FALSE\n34. FALSE\n35. FALSE\n36. FALSE\n37. FALSE\n38. FALSE\n39. FALSE\n40. FALSE\n41. FALSE\n42. FALSE\n43. FALSE\n44. TRUE\n45. FALSE\n46. FALSE\n47. FALSE\n48. FALSE\n49. TRUE\n50. FALSE\n51. FALSE\n52. FALSE\n53. FALSE\n54. FALSE\n55. TRUE\n\n\n" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another example\nis.na(leadership)", | |
"execution_count": 25 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Excluding missing values from analyses" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/plain": "[1] NA" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#sometimes you need to eliminate missing vaues from your analysis. for example:\nx <- c(1, 2, NA, 3)\ny <- x[1] + x[2] + x[3] + x[4] #will result in NA\ny", | |
"execution_count": 29 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/plain": "[1] NA" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "z <- sum(x) #will result in NA\nz", | |
"execution_count": 30 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "6", | |
"text/html": "6", | |
"text/plain": "[1] 6", | |
"text/markdown": "6" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#one way to fix this is to use the na.rm = T option, as follows:\nz <- sum(x, na.rm = T)\nz", | |
"execution_count": 31 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another way to remove missing data is by using the na.omit() function\nleadership #without removing missing data", | |
"execution_count": 32 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td>5 </td><td>2 </td><td>Young </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "newdata <- na.omit(leadership)\nnewdata", | |
"execution_count": 33 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Date values" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young \n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#convert \"testDate\" into a date type by using the as.Date() function\n#default format in R is yyyy-mm-dd\nmyformat <- \"%m/%d/%y\"\nleadership$testDate <- as.Date(leadership$testDate, myformat)\nleadership", | |
"execution_count": 38 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/plain": "Time difference of 2535 days" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#when R stores dates internally, they're represented as number of days since 1 Jan 1970 with negative values for earlier dates\n#so you can perform arithmetic operations on them\nstartDate <- as.Date(\"2004-02-13\")\nendDate <- as.Date(\"2011-01-22\")\ndays <- endDate - startDate\ndays", | |
"execution_count": 39 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/plain": "Time difference of 116.4286 weeks" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#you can use the difftime() function to calculate a time interval\n#for example if you want to calculate how old someone born on 29 May 2015 is, you could do it as follows:\ntoday <- Sys.Date()\ndob <- as.Date(\"2015-05-29\")\ndifftime(today, dob, units = \"weeks\")", | |
"execution_count": 40 | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#you can convert date variables into character variables using the as.character() function\n#this conversion allows you to apply a range of character functions to the data values\n#(subsetting, replacement, concatenation, etc)", | |
"execution_count": 41 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Type conversions" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "TRUE", | |
"text/html": "TRUE", | |
"text/plain": "[1] TRUE", | |
"text/markdown": "TRUE" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#example of converting one data type into another\na <- c(1, 2, 3)\nis.numeric(a)", | |
"execution_count": 42 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "TRUE", | |
"text/html": "TRUE", | |
"text/plain": "[1] TRUE", | |
"text/markdown": "TRUE" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "is.vector(a)", | |
"execution_count": 43 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{enumerate*}\n\\item '1'\n\\item '2'\n\\item '3'\n\\end{enumerate*}\n", | |
"text/html": "<ol class=list-inline>\n\t<li>'1'</li>\n\t<li>'2'</li>\n\t<li>'3'</li>\n</ol>\n", | |
"text/plain": "[1] \"1\" \"2\" \"3\"", | |
"text/markdown": "1. '1'\n2. '2'\n3. '3'\n\n\n" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "b <- as.character(a)\nb", | |
"execution_count": 47 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "FALSE", | |
"text/html": "FALSE", | |
"text/plain": "[1] FALSE", | |
"text/markdown": "FALSE" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "is.numeric(b)", | |
"execution_count": 48 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "TRUE", | |
"text/html": "TRUE", | |
"text/plain": "[1] TRUE", | |
"text/markdown": "TRUE" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "is.vector(b)", | |
"execution_count": 49 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "TRUE", | |
"text/html": "TRUE", | |
"text/plain": "[1] TRUE", | |
"text/markdown": "TRUE" | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "is.character(b)", | |
"execution_count": 50 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Sorting data" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t3 & 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t2 & 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t5 & 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>3</th><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><th scope=row>2</th><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>5</th><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young \n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#the order() function lets you sort a data frame in R. The default sorting order is ascending\n#prepend the sorting variable with a minus sign to indicate descending order\n#example:\nnewdata <- leadership[order(leadership$age),] #creates a new dataset sorted from youngest manager to oldest manager\nnewdata", | |
"execution_count": 51 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stderr", | |
"text": "The following objects are masked _by_ .GlobalEnv:\n\n age, country, gender\n\n" | |
}, | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t3 & 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t2 & 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t5 & 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>3</th><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><th scope=row>2</th><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>5</th><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA \n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another example\nattach(leadership)\nnewdata <- leadership[order(gender, age),] #creates a new dataset sorting rows into female, followed by male\n #and youngest to oldest within each gender\ndetach(leadership)\nnewdata", | |
"execution_count": 52 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stderr", | |
"text": "The following objects are masked _by_ .GlobalEnv:\n\n age, country, gender\n\n" | |
}, | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t5 & 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\t2 & 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t3 & 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\\end{tabular}\n", | |
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>5</th><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n\t<tr><th scope=row>2</th><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>3</th><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n</tbody>\n</table>\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young \n1 1 2008-10-24 US M 32 5 4 5 5 5 Young " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another example:\nattach(leadership)\nnewdata <- leadership[order(gender, -age),] #creates a new dataset sorting rows into female, followed by male\n #and oldest to youngest within each gender\ndetach(leadership)\nnewdata", | |
"execution_count": 53 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Merging datasets" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#if your data exists in different locations, you will need to combine it", | |
"execution_count": 55 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Adding columns to a data frame" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#adding columns to a data frame:\n#to merge two data frames horizontally, you use the merge() function\n#syntax: total <- merge(dataFrameA, dataFrameB, by = \"ID\") #by = the common key variable", | |
"execution_count": 54 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Horizontal concatenation" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#another way to merge two matrices or data frames where you don't need to specify a common key,\n#is by using the cbind() function. syntax:\n#total <- cbind(a, b)\n#cbind() concatenates objects a and b. each object must have the same number of rows and be sorted in the same order", | |
"execution_count": 56 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Adding rows to a data frame" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#use the rbind() function to add rows to a data frame. syntax:\n#total <- rbind(a, b)\n#a and b must have the same number of variables but they don't need to be in the same order\n#if a has more variables than b, then either 1) delete the extra variables or 2) create the same variable in b and assign them to NAs", | |
"execution_count": 57 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Subsetting datasets" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#R has powerful indexing features to select and exclude variables, observations or both", | |
"execution_count": 58 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Selecting (keeping) variables" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>2 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>5 </td><td> 5</td><td> 2</td></tr>\n\t<tr><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n\t<tr><td>2 </td><td>2 </td><td>1 </td><td> 2</td><td> 1</td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllll}\n item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t 5 & 4 & 5 & 5 & 5\\\\\n\t 3 & 5 & 2 & 5 & 5\\\\\n\t 3 & 5 & 5 & 5 & 2\\\\\n\t 3 & 3 & 4 & NA & NA\\\\\n\t 2 & 2 & 1 & 2 & 1\\\\\n\\end{tabular}\n", | |
"text/plain": " item1 item2 item3 item4 item5\n1 5 4 5 5 5 \n2 3 5 2 5 5 \n3 3 5 5 5 2 \n4 3 3 4 NA NA \n5 2 2 1 2 1 " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#selecting using [row indices, column indices]. example:\nnewdata <- leadership[, c(6:10)] #selects all rows and columns 6 through 10 of the leadership data frame\nnewdata", | |
"execution_count": 60 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>2 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>5 </td><td> 5</td><td> 2</td></tr>\n\t<tr><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n\t<tr><td>2 </td><td>2 </td><td>1 </td><td> 2</td><td> 1</td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllll}\n item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t 5 & 4 & 5 & 5 & 5\\\\\n\t 3 & 5 & 2 & 5 & 5\\\\\n\t 3 & 5 & 5 & 5 & 2\\\\\n\t 3 & 3 & 4 & NA & NA\\\\\n\t 2 & 2 & 1 & 2 & 1\\\\\n\\end{tabular}\n", | |
"text/plain": " item1 item2 item3 item4 item5\n1 5 4 5 5 5 \n2 3 5 2 5 5 \n3 3 5 5 5 2 \n4 3 3 4 NA NA \n5 2 2 1 2 1 " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another example\nmyvars <- c(\"item1\", \"item2\", \"item3\", \"item4\", \"item5\")\nnewdata <- leadership[myvars]\nnewdata", | |
"execution_count": 62 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Excluding (dropping) variables" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": "#there are many reasons to exclude variables. for example, if a variable has many missing values,\n#you may want to drop it from further analysis", | |
"execution_count": 63 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & agecat\\\\\n\\hline\n\t 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & Young \\\\\n\t 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & Young \\\\\n\t 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & Young \\\\\n\t 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & Young \\\\\n\t 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 agecat\n1 1 2008-10-24 US M 32 5 4 5 Young \n2 2 2008-10-28 US F 45 3 5 2 Young \n3 3 2008-10-01 UK F 25 3 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 Young \n5 5 2009-05-01 UK F NA 2 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#here is an example: lets say you want to drop item4 and item5 from the leadership data frame.\n#here is one way you can do this:\nmyvars <- names(leadership) %in% c(\"item4\", \"item5\") #names(leadership) produces a character vector with names of all variables\n #names(leadership) %in% c(\"item4\", \"item5\") returns a logical vector with TRUE for each element in\n #names(leadership) that matches either \"item4\" or \"item5\"\nnewdata <- leadership[!myvars] #the ! operator reverses the logical values\nnewdata", | |
"execution_count": 64 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & agecat\\\\\n\\hline\n\t 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & Young \\\\\n\t 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & Young \\\\\n\t 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & Young \\\\\n\t 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & Young \\\\\n\t 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 agecat\n1 1 2008-10-24 US M 32 5 4 5 Young \n2 2 2008-10-28 US F 45 3 5 2 Young \n3 3 2008-10-01 UK F 25 3 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 Young \n5 5 2009-05-01 UK F NA 2 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another way of dropping variables\n#since item4 is column 9 and item 5 is column 10, you can also do:\nnewdata <- leadership[c(-9, -10)]\nnewdata", | |
"execution_count": 65 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Selecting observations" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td>5 </td><td>2 </td><td>Young </td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\\end{tabular}\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#selecting or excluding observations is a key aspect of data preparation\n#here are some examples:\nnewdata <- leadership[1:3,] #selects the first three rows of data\nnewdata", | |
"execution_count": 66 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\\end{tabular}\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "newdata <- leadership[leadership$gender == \"M\" & #selects all men aged > 30\n leadership$age > 30,]\nnewdata", | |
"execution_count": 67 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stderr", | |
"text": "The following objects are masked _by_ .GlobalEnv:\n\n age, country, gender\n\n" | |
}, | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\\end{tabular}\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another way of selecting men aged > 30 by attaching the leadership data frame\nattach(leadership)\nnewdata <- leadership[gender == \"M\" & age > 30,]\ndetach(leadership)\nnewdata", | |
"execution_count": 68 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>5</th><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td>2 </td><td>1 </td><td>NA </td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t5 & 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n", | |
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#limit analysis to observations collected between 01/01/2009 and 12/31/2009\nnewdata <- leadership[leadership$testDate >= \"2009-01-01\" & leadership$testDate <= \"2009-12-31\",]\nnewdata", | |
"execution_count": 69 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### The subset() function" | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th></th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>2</th><td>3 </td><td>5 </td><td>2 </td><td> 5</td></tr>\n\t<tr><th scope=row>4</th><td>3 </td><td>3 </td><td>4 </td><td>NA</td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|llll}\n & item1 & item2 & item3 & item4\\\\\n\\hline\n\t2 & 3 & 5 & 2 & 5\\\\\n\t4 & 3 & 3 & 4 & NA\\\\\n\\end{tabular}\n", | |
"text/plain": " item1 item2 item3 item4\n2 3 5 2 5 \n4 3 3 4 NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#the subset() function is a really powerful function to select or exclude variables and observations\n#examples:\n#this statement selects all observations where the age is either 35 or over OR under 24 and returns 4 columns (items 1 through 4)\nnewdata <- subset(leadership, age >= 35 | age < 24, select = c(item1, item2, item3, item4))\nnewdata", | |
"execution_count": 71 | |
}, | |
{ | |
"outputs": [ | |
{ | |
"output_type": "display_data", | |
"metadata": {}, | |
"data": { | |
"text/html": "<table>\n<thead><tr><th></th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>M </td><td>32</td><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><th scope=row>4</th><td>M </td><td>39</td><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n</tbody>\n</table>\n", | |
"text/latex": "\\begin{tabular}{r|lllllll}\n & gender & age & item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t1 & M & 32 & 5 & 4 & 5 & 5 & 5\\\\\n\t4 & M & 39 & 3 & 3 & 4 & NA & NA\\\\\n\\end{tabular}\n", | |
"text/plain": " gender age item1 item2 item3 item4 item5\n1 M 32 5 4 5 5 5 \n4 M 39 3 3 4 NA NA " | |
} | |
} | |
], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#another example\n#this statement selects all men > 25 years old and all columns from gender to item 5 (and everything in between)\nnewdata <- subset(leadership, gender == \"M\" & age > 25, select = gender:item5)\nnewdata", | |
"execution_count": 73 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Random samples" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#sampling from larger datasets is a common practice in data mining and machine learning\n#for example, you may want to create a predictive model from one sample and validate its effectiveness on the other\n#the sample() function lets you take a random sample (with or without replacement) from size n of a dataset\n#R has extensive facilities for sampling (see the \"sampling\" package)", | |
"execution_count": 81 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Using SQL statements to manipulate data frames" | |
}, | |
{ | |
"outputs": [], | |
"cell_type": "code", | |
"metadata": {}, | |
"source": "#using the \"sqldf\" package, you can use SQL statements within R\n#experienced SQl users will find the sqldf package a useful addition to the R data preparation toolkit", | |
"execution_count": 89 | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "R with Spark 2.0", | |
"name": "r-spark20", | |
"language": "R" | |
}, | |
"language_info": { | |
"mimetype": "text/x-r-source", | |
"codemirror_mode": "r", | |
"pygments_lexer": "r", | |
"version": "3.3.2", | |
"name": "R", | |
"file_extension": ".r" | |
} | |
}, | |
"nbformat": 4 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment