Skip to content

Instantly share code, notes, and snippets.

@raov5
Created August 24, 2017 13:23
Show Gist options
  • Save raov5/c1544f3f3f2819005cf2b98a6c32eb87 to your computer and use it in GitHub Desktop.
Save raov5/c1544f3f3f2819005cf2b98a6c32eb87 to your computer and use it in GitHub Desktop.
This notebook covers data preparation in R, including: * manipulating dates and missing values * understanding data type conversions * creating and recoding variables * sorting, merging and subsetting datasets * selecting and dropping variables
Display the source blob
Display the rendered blob
Raw
{
"nbformat_minor": 1,
"cells": [
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#@author: Venky Rao raove@us.ibm.com\n#@last edited: 22 Aug 2017\n#@source: materials, data and examples adapted from R in Action 2nd Edition by Dr. Robert Kabacoff",
"execution_count": 90
},
{
"cell_type": "markdown",
"metadata": {},
"source": "# Data preparation in R"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Creating a Data Frame"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#let us begin by creating a data frame of data for us to analyze\nmanager <- c(1, 2, 3, 4, 5)\ndate <- c(\"10/24/08\", \"10/28/08\", \"10/1/08\", \"10/12/08\", \"5/1/09\")\ncountry <- c(\"US\", \"US\", \"UK\", \"UK\", \"UK\")\ngender <- c(\"M\", \"F\", \"F\", \"M\", \"F\")\nage <- c(32, 45, 25, 39, 99)\nq1 <- c(5, 3, 3, 3, 2)\nq2 <- c(4, 5, 5, 3, 2)\nq3 <- c(5, 2, 5, 4, 1)\nq4 <- c(5, 5, 5, NA, 2)\nq5 <- c(5, 5, 2, NA, 1)\nleadership <- data.frame(manager, date, country, gender, age, q1, q2, q3, q4, q5, stringsAsFactors = F)\n#if stringsAsFactors = T, then dates would be treated as factors",
"execution_count": 4
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>99 </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|llllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA \\\\\n\t 5 & 5/1/09 & UK & F & 99 & 2 & 2 & 1 & 2 & 1 \\\\\n\\end{tabular}\n",
"text/plain": " manager date country gender age q1 q2 q3 q4 q5\n1 1 10/24/08 US M 32 5 4 5 5 5\n2 2 10/28/08 US F 45 3 5 2 5 5\n3 3 10/1/08 UK F 25 3 5 5 5 2\n4 4 10/12/08 UK M 39 3 3 4 NA NA\n5 5 5/1/09 UK F 99 2 2 1 2 1"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#display the data frame that we created\nleadership",
"execution_count": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Creating new variables"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th></tr></thead>\n<tbody>\n\t<tr><td>2</td><td>3</td></tr>\n\t<tr><td>2</td><td>4</td></tr>\n\t<tr><td>6</td><td>2</td></tr>\n\t<tr><td>4</td><td>8</td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|ll}\n x1 & x2\\\\\n\\hline\n\t 2 & 3\\\\\n\t 2 & 4\\\\\n\t 6 & 2\\\\\n\t 4 & 8\\\\\n\\end{tabular}\n",
"text/plain": " x1 x2\n1 2 3 \n2 2 4 \n3 6 2 \n4 4 8 "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#data preparation often involves creating new variables in a data frame\n#let's create a new data frame to explore this further\nmydata <- data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8))\nmydata",
"execution_count": 6
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#let's say we want to create two new variables in the data frame:\n#sumx which is the sum of x1 and x2; and\n#meanx which is the mean of x1 and x2\n#here are 3 ways of doing this:",
"execution_count": 7
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2 </td><td>3 </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2 </td><td>4 </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6 </td><td>2 </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4 </td><td>8 </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2 & 3 & 5 & 2.5\\\\\n\t 2 & 4 & 6 & 3.0\\\\\n\t 6 & 2 & 8 & 4.0\\\\\n\t 4 & 8 & 12 & 6.0\\\\\n\\end{tabular}\n",
"text/plain": " x1 x2 sumx meanx\n1 2 3 5 2.5 \n2 2 4 6 3.0 \n3 6 2 8 4.0 \n4 4 8 12 6.0 "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#option 1\nmydata$sumx <- mydata$x1 + mydata$x2\nmydata$meanx <- (mydata$x1 + mydata$x2) / 2\nmydata",
"execution_count": 8
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2 </td><td>3 </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2 </td><td>4 </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6 </td><td>2 </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4 </td><td>8 </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2 & 3 & 5 & 2.5\\\\\n\t 2 & 4 & 6 & 3.0\\\\\n\t 6 & 2 & 8 & 4.0\\\\\n\t 4 & 8 & 12 & 6.0\\\\\n\\end{tabular}\n",
"text/plain": " x1 x2 sumx meanx\n1 2 3 5 2.5 \n2 2 4 6 3.0 \n3 6 2 8 4.0 \n4 4 8 12 6.0 "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#option 2\nattach(mydata)\nmydata$sumx <- x1 + x2\nmydata$meanx <- (x1 + x2) / 2\ndetach(mydata)\nmydata",
"execution_count": 9
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>x1</th><th scope=col>x2</th><th scope=col>sumx</th><th scope=col>meanx</th></tr></thead>\n<tbody>\n\t<tr><td>2 </td><td>3 </td><td> 5 </td><td>2.5</td></tr>\n\t<tr><td>2 </td><td>4 </td><td> 6 </td><td>3.0</td></tr>\n\t<tr><td>6 </td><td>2 </td><td> 8 </td><td>4.0</td></tr>\n\t<tr><td>4 </td><td>8 </td><td>12 </td><td>6.0</td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|llll}\n x1 & x2 & sumx & meanx\\\\\n\\hline\n\t 2 & 3 & 5 & 2.5\\\\\n\t 2 & 4 & 6 & 3.0\\\\\n\t 6 & 2 & 8 & 4.0\\\\\n\t 4 & 8 & 12 & 6.0\\\\\n\\end{tabular}\n",
"text/plain": " x1 x2 sumx meanx\n1 2 3 5 2.5 \n2 2 4 6 3.0 \n3 6 2 8 4.0 \n4 4 8 12 6.0 "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#option 3\nmydata <- transform(mydata, sumx = x1 + x2, meanx = (x1 + x2) / 2)\nmydata",
"execution_count": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Recoding variables"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#recoding involves creating new values of a variable conditional on the existing values of\n#the same and / or other variables. for example: you may want to\n#1. change a continuous variable into a set of categories\n#2. replace miscoded values with correct values\n#3. create a pass / fail variable based on a set of cutoff scores\n#to recode data, you use one of R's logical operators (i.e. expressions that return TRUE or FALSE)",
"execution_count": 11
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|llllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td></tr>\n</tbody>\n</table>\n",
"text/plain": " manager date country gender age q1 q2 q3 q4 q5\n1 1 10/24/08 US M 32 5 4 5 5 5\n2 2 10/28/08 US F 45 3 5 2 5 5\n3 3 10/1/08 UK F 25 3 5 5 5 2\n4 4 10/12/08 UK M 39 3 3 4 NA NA\n5 5 5/1/09 UK F NA 2 2 1 2 1"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#example: recode ages of managers in the \"leadership\" data frame from continuous to categorical\n#step 1: recode age 99 to indicate missing value since 99 is obviously incorrect\nleadership$age[leadership$age == 99] <- NA\nleadership",
"execution_count": 14
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & date & country & gender & age & q1 & q2 & q3 & q4 & q5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>date</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/plain": " manager date country gender age q1 q2 q3 q4 q5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {
"scrolled": true
},
"source": "#step 2: create the categories\nleadership <- within(leadership, {\n agecat <- NA\n agecat[age > 75] <- \"Elder\"\n agecat[age >= 55 & age <= 75] <- \"Middle Aged\"\n agecat[age < 55] <- \"Young\"\n})\nleadership",
"execution_count": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Renaming variables"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & testDate & country & gender & age & q1 & q2 & q3 & q4 & q5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>q1</th><th scope=col>q2</th><th scope=col>q3</th><th scope=col>q4</th><th scope=col>q5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/plain": " manager testDate country gender age q1 q2 q3 q4 q5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#here's one way:\nnames(leadership)[2] <- \"testDate\"\nleadership",
"execution_count": 17
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n manager & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>manager</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/plain": " manager testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another example:\nnames(leadership)[6:10] <- c(\"item1\", \"item2\", \"item3\", \"item4\", \"item5\")\nleadership",
"execution_count": 18
},
{
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "Installing package into \u2018/gpfs/global_fs01/sym_shared/YPProdSpark/user/s17c-9f3318fc11f06c-d37a4b9405b6/R/libs\u2019\n(as \u2018lib\u2019 is unspecified)\n"
}
],
"cell_type": "code",
"metadata": {},
"source": "#another way:\n#using the rename function in the \"plyr\" package\ninstall.packages(\"plyr\")",
"execution_count": 21
},
{
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "\nAttaching package: \u2018plyr\u2019\n\nThe following objects are masked from \u2018package:SparkR\u2019:\n\n arrange, count, desc, join, mutate, rename, summarize, take\n\n"
}
],
"cell_type": "code",
"metadata": {},
"source": "#load libraries\nlibrary(plyr)",
"execution_count": 22
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#use the rename() function\nleadership <- rename(leadership, c(manager = \"managerID\"))\nleadership",
"execution_count": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Missing values"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{enumerate*}\n\\item FALSE\n\\item FALSE\n\\item FALSE\n\\item TRUE\n\\end{enumerate*}\n",
"text/html": "<ol class=list-inline>\n\t<li>FALSE</li>\n\t<li>FALSE</li>\n\t<li>FALSE</li>\n\t<li>TRUE</li>\n</ol>\n",
"text/plain": "[1] FALSE FALSE FALSE TRUE",
"text/markdown": "1. FALSE\n2. FALSE\n3. FALSE\n4. TRUE\n\n\n"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#identifying missing values using the is.na() function\n#example:\ny <- c(1, 2, 3, NA)\nis.na(y)",
"execution_count": 24
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & FALSE & TRUE & TRUE & FALSE\\\\\n\t FALSE & FALSE & FALSE & FALSE & TRUE & FALSE & FALSE & FALSE & FALSE & FALSE & TRUE\\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td><td> TRUE</td><td>FALSE</td></tr>\n\t<tr><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td>FALSE</td><td> TRUE</td></tr>\n</tbody>\n</table>\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5\n[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE\n agecat\n[1,] FALSE \n[2,] FALSE \n[3,] FALSE \n[4,] FALSE \n[5,] TRUE ",
"text/markdown": "1. FALSE\n2. FALSE\n3. FALSE\n4. FALSE\n5. FALSE\n6. FALSE\n7. FALSE\n8. FALSE\n9. FALSE\n10. FALSE\n11. FALSE\n12. FALSE\n13. FALSE\n14. FALSE\n15. FALSE\n16. FALSE\n17. FALSE\n18. FALSE\n19. FALSE\n20. FALSE\n21. FALSE\n22. FALSE\n23. FALSE\n24. FALSE\n25. TRUE\n26. FALSE\n27. FALSE\n28. FALSE\n29. FALSE\n30. FALSE\n31. FALSE\n32. FALSE\n33. FALSE\n34. FALSE\n35. FALSE\n36. FALSE\n37. FALSE\n38. FALSE\n39. FALSE\n40. FALSE\n41. FALSE\n42. FALSE\n43. FALSE\n44. TRUE\n45. FALSE\n46. FALSE\n47. FALSE\n48. FALSE\n49. TRUE\n50. FALSE\n51. FALSE\n52. FALSE\n53. FALSE\n54. FALSE\n55. TRUE\n\n\n"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another example\nis.na(leadership)",
"execution_count": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Excluding missing values from analyses"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/plain": "[1] NA"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#sometimes you need to eliminate missing vaues from your analysis. for example:\nx <- c(1, 2, NA, 3)\ny <- x[1] + x[2] + x[3] + x[4] #will result in NA\ny",
"execution_count": 29
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/plain": "[1] NA"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "z <- sum(x) #will result in NA\nz",
"execution_count": 30
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "6",
"text/html": "6",
"text/plain": "[1] 6",
"text/markdown": "6"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#one way to fix this is to use the na.rm = T option, as follows:\nz <- sum(x, na.rm = T)\nz",
"execution_count": 31
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 10/12/08 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 5/1/09 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>10/12/08</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>5/1/09 </td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young \n4 4 10/12/08 UK M 39 3 3 4 NA NA Young \n5 5 5/1/09 UK F NA 2 2 1 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another way to remove missing data is by using the na.omit() function\nleadership #without removing missing data",
"execution_count": 32
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 10/24/08 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 10/28/08 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 10/1/08 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>10/24/08</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>10/28/08</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>10/1/08 </td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td>5 </td><td>2 </td><td>Young </td></tr>\n</tbody>\n</table>\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 10/24/08 US M 32 5 4 5 5 5 Young \n2 2 10/28/08 US F 45 3 5 2 5 5 Young \n3 3 10/1/08 UK F 25 3 5 5 5 2 Young "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "newdata <- na.omit(leadership)\nnewdata",
"execution_count": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Date values"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young \n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#convert \"testDate\" into a date type by using the as.Date() function\n#default format in R is yyyy-mm-dd\nmyformat <- \"%m/%d/%y\"\nleadership$testDate <- as.Date(leadership$testDate, myformat)\nleadership",
"execution_count": 38
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/plain": "Time difference of 2535 days"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#when R stores dates internally, they're represented as number of days since 1 Jan 1970 with negative values for earlier dates\n#so you can perform arithmetic operations on them\nstartDate <- as.Date(\"2004-02-13\")\nendDate <- as.Date(\"2011-01-22\")\ndays <- endDate - startDate\ndays",
"execution_count": 39
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/plain": "Time difference of 116.4286 weeks"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#you can use the difftime() function to calculate a time interval\n#for example if you want to calculate how old someone born on 29 May 2015 is, you could do it as follows:\ntoday <- Sys.Date()\ndob <- as.Date(\"2015-05-29\")\ndifftime(today, dob, units = \"weeks\")",
"execution_count": 40
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#you can convert date variables into character variables using the as.character() function\n#this conversion allows you to apply a range of character functions to the data values\n#(subsetting, replacement, concatenation, etc)",
"execution_count": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Type conversions"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "TRUE",
"text/html": "TRUE",
"text/plain": "[1] TRUE",
"text/markdown": "TRUE"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#example of converting one data type into another\na <- c(1, 2, 3)\nis.numeric(a)",
"execution_count": 42
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "TRUE",
"text/html": "TRUE",
"text/plain": "[1] TRUE",
"text/markdown": "TRUE"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "is.vector(a)",
"execution_count": 43
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{enumerate*}\n\\item '1'\n\\item '2'\n\\item '3'\n\\end{enumerate*}\n",
"text/html": "<ol class=list-inline>\n\t<li>'1'</li>\n\t<li>'2'</li>\n\t<li>'3'</li>\n</ol>\n",
"text/plain": "[1] \"1\" \"2\" \"3\"",
"text/markdown": "1. '1'\n2. '2'\n3. '3'\n\n\n"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "b <- as.character(a)\nb",
"execution_count": 47
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "FALSE",
"text/html": "FALSE",
"text/plain": "[1] FALSE",
"text/markdown": "FALSE"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "is.numeric(b)",
"execution_count": 48
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "TRUE",
"text/html": "TRUE",
"text/plain": "[1] TRUE",
"text/markdown": "TRUE"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "is.vector(b)",
"execution_count": 49
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "TRUE",
"text/html": "TRUE",
"text/plain": "[1] TRUE",
"text/markdown": "TRUE"
}
}
],
"cell_type": "code",
"metadata": {},
"source": "is.character(b)",
"execution_count": 50
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Sorting data"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t3 & 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t2 & 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t5 & 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>3</th><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><th scope=row>2</th><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>5</th><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young \n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#the order() function lets you sort a data frame in R. The default sorting order is ascending\n#prepend the sorting variable with a minus sign to indicate descending order\n#example:\nnewdata <- leadership[order(leadership$age),] #creates a new dataset sorted from youngest manager to oldest manager\nnewdata",
"execution_count": 51
},
{
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "The following objects are masked _by_ .GlobalEnv:\n\n age, country, gender\n\n"
},
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t3 & 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t2 & 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t5 & 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>3</th><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><th scope=row>2</th><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>5</th><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n</tbody>\n</table>\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA \n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another example\nattach(leadership)\nnewdata <- leadership[order(gender, age),] #creates a new dataset sorting rows into female, followed by male\n #and youngest to oldest within each gender\ndetach(leadership)\nnewdata",
"execution_count": 52
},
{
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "The following objects are masked _by_ .GlobalEnv:\n\n age, country, gender\n\n"
},
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t5 & 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\t2 & 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t3 & 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\\end{tabular}\n",
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>5</th><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td> 2 </td><td> 1 </td><td>NA </td></tr>\n\t<tr><th scope=row>2</th><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>3</th><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td> 5 </td><td> 2 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n</tbody>\n</table>\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young \n1 1 2008-10-24 US M 32 5 4 5 5 5 Young "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another example:\nattach(leadership)\nnewdata <- leadership[order(gender, -age),] #creates a new dataset sorting rows into female, followed by male\n #and oldest to youngest within each gender\ndetach(leadership)\nnewdata",
"execution_count": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Merging datasets"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#if your data exists in different locations, you will need to combine it",
"execution_count": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Adding columns to a data frame"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#adding columns to a data frame:\n#to merge two data frames horizontally, you use the merge() function\n#syntax: total <- merge(dataFrameA, dataFrameB, by = \"ID\") #by = the common key variable",
"execution_count": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Horizontal concatenation"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#another way to merge two matrices or data frames where you don't need to specify a common key,\n#is by using the cbind() function. syntax:\n#total <- cbind(a, b)\n#cbind() concatenates objects a and b. each object must have the same number of rows and be sorted in the same order",
"execution_count": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Adding rows to a data frame"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#use the rbind() function to add rows to a data frame. syntax:\n#total <- rbind(a, b)\n#a and b must have the same number of variables but they don't need to be in the same order\n#if a has more variables than b, then either 1) delete the extra variables or 2) create the same variable in b and assign them to NAs",
"execution_count": 57
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Subsetting datasets"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#R has powerful indexing features to select and exclude variables, observations or both",
"execution_count": 58
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Selecting (keeping) variables"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>2 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>5 </td><td> 5</td><td> 2</td></tr>\n\t<tr><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n\t<tr><td>2 </td><td>2 </td><td>1 </td><td> 2</td><td> 1</td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllll}\n item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t 5 & 4 & 5 & 5 & 5\\\\\n\t 3 & 5 & 2 & 5 & 5\\\\\n\t 3 & 5 & 5 & 5 & 2\\\\\n\t 3 & 3 & 4 & NA & NA\\\\\n\t 2 & 2 & 1 & 2 & 1\\\\\n\\end{tabular}\n",
"text/plain": " item1 item2 item3 item4 item5\n1 5 4 5 5 5 \n2 3 5 2 5 5 \n3 3 5 5 5 2 \n4 3 3 4 NA NA \n5 2 2 1 2 1 "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#selecting using [row indices, column indices]. example:\nnewdata <- leadership[, c(6:10)] #selects all rows and columns 6 through 10 of the leadership data frame\nnewdata",
"execution_count": 60
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>2 </td><td> 5</td><td> 5</td></tr>\n\t<tr><td>3 </td><td>5 </td><td>5 </td><td> 5</td><td> 2</td></tr>\n\t<tr><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n\t<tr><td>2 </td><td>2 </td><td>1 </td><td> 2</td><td> 1</td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllll}\n item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t 5 & 4 & 5 & 5 & 5\\\\\n\t 3 & 5 & 2 & 5 & 5\\\\\n\t 3 & 5 & 5 & 5 & 2\\\\\n\t 3 & 3 & 4 & NA & NA\\\\\n\t 2 & 2 & 1 & 2 & 1\\\\\n\\end{tabular}\n",
"text/plain": " item1 item2 item3 item4 item5\n1 5 4 5 5 5 \n2 3 5 2 5 5 \n3 3 5 5 5 2 \n4 3 3 4 NA NA \n5 2 2 1 2 1 "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another example\nmyvars <- c(\"item1\", \"item2\", \"item3\", \"item4\", \"item5\")\nnewdata <- leadership[myvars]\nnewdata",
"execution_count": 62
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Excluding (dropping) variables"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {
"collapsed": true
},
"source": "#there are many reasons to exclude variables. for example, if a variable has many missing values,\n#you may want to drop it from further analysis",
"execution_count": 63
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & agecat\\\\\n\\hline\n\t 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & Young \\\\\n\t 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & Young \\\\\n\t 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & Young \\\\\n\t 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & Young \\\\\n\t 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 agecat\n1 1 2008-10-24 US M 32 5 4 5 Young \n2 2 2008-10-28 US F 45 3 5 2 Young \n3 3 2008-10-01 UK F 25 3 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 Young \n5 5 2009-05-01 UK F NA 2 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#here is an example: lets say you want to drop item4 and item5 from the leadership data frame.\n#here is one way you can do this:\nmyvars <- names(leadership) %in% c(\"item4\", \"item5\") #names(leadership) produces a character vector with names of all variables\n #names(leadership) %in% c(\"item4\", \"item5\") returns a logical vector with TRUE for each element in\n #names(leadership) that matches either \"item4\" or \"item5\"\nnewdata <- leadership[!myvars] #the ! operator reverses the logical values\nnewdata",
"execution_count": 64
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>Young </td></tr>\n\t<tr><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & agecat\\\\\n\\hline\n\t 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & Young \\\\\n\t 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & Young \\\\\n\t 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & Young \\\\\n\t 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & Young \\\\\n\t 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 agecat\n1 1 2008-10-24 US M 32 5 4 5 Young \n2 2 2008-10-28 US F 45 3 5 2 Young \n3 3 2008-10-01 UK F 25 3 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 Young \n5 5 2009-05-01 UK F NA 2 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another way of dropping variables\n#since item4 is column 9 and item 5 is column 10, you can also do:\nnewdata <- leadership[c(-9, -10)]\nnewdata",
"execution_count": 65
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Selecting observations"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>2 </td><td>2008-10-28</td><td>US </td><td>F </td><td>45 </td><td>3 </td><td>5 </td><td>2 </td><td>5 </td><td>5 </td><td>Young </td></tr>\n\t<tr><td>3 </td><td>2008-10-01</td><td>UK </td><td>F </td><td>25 </td><td>3 </td><td>5 </td><td>5 </td><td>5 </td><td>2 </td><td>Young </td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllllllllll}\n managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t 2 & 2008-10-28 & US & F & 45 & 3 & 5 & 2 & 5 & 5 & Young \\\\\n\t 3 & 2008-10-01 & UK & F & 25 & 3 & 5 & 5 & 5 & 2 & Young \\\\\n\\end{tabular}\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n2 2 2008-10-28 US F 45 3 5 2 5 5 Young \n3 3 2008-10-01 UK F 25 3 5 5 5 2 Young "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#selecting or excluding observations is a key aspect of data preparation\n#here are some examples:\nnewdata <- leadership[1:3,] #selects the first three rows of data\nnewdata",
"execution_count": 66
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\\end{tabular}\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "newdata <- leadership[leadership$gender == \"M\" & #selects all men aged > 30\n leadership$age > 30,]\nnewdata",
"execution_count": 67
},
{
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "The following objects are masked _by_ .GlobalEnv:\n\n age, country, gender\n\n"
},
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>1 </td><td>2008-10-24</td><td>US </td><td>M </td><td>32 </td><td>5 </td><td>4 </td><td>5 </td><td> 5 </td><td> 5 </td><td>Young </td></tr>\n\t<tr><th scope=row>4</th><td>4 </td><td>2008-10-12</td><td>UK </td><td>M </td><td>39 </td><td>3 </td><td>3 </td><td>4 </td><td>NA </td><td>NA </td><td>Young </td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t1 & 1 & 2008-10-24 & US & M & 32 & 5 & 4 & 5 & 5 & 5 & Young \\\\\n\t4 & 4 & 2008-10-12 & UK & M & 39 & 3 & 3 & 4 & NA & NA & Young \\\\\n\\end{tabular}\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n1 1 2008-10-24 US M 32 5 4 5 5 5 Young \n4 4 2008-10-12 UK M 39 3 3 4 NA NA Young "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another way of selecting men aged > 30 by attaching the leadership data frame\nattach(leadership)\nnewdata <- leadership[gender == \"M\" & age > 30,]\ndetach(leadership)\nnewdata",
"execution_count": 68
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th></th><th scope=col>managerID</th><th scope=col>testDate</th><th scope=col>country</th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th><th scope=col>agecat</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>5</th><td>5 </td><td>2009-05-01</td><td>UK </td><td>F </td><td>NA </td><td>2 </td><td>2 </td><td>1 </td><td>2 </td><td>1 </td><td>NA </td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllllllllll}\n & managerID & testDate & country & gender & age & item1 & item2 & item3 & item4 & item5 & agecat\\\\\n\\hline\n\t5 & 5 & 2009-05-01 & UK & F & NA & 2 & 2 & 1 & 2 & 1 & NA \\\\\n\\end{tabular}\n",
"text/plain": " managerID testDate country gender age item1 item2 item3 item4 item5 agecat\n5 5 2009-05-01 UK F NA 2 2 1 2 1 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#limit analysis to observations collected between 01/01/2009 and 12/31/2009\nnewdata <- leadership[leadership$testDate >= \"2009-01-01\" & leadership$testDate <= \"2009-12-31\",]\nnewdata",
"execution_count": 69
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### The subset() function"
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th></th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>2</th><td>3 </td><td>5 </td><td>2 </td><td> 5</td></tr>\n\t<tr><th scope=row>4</th><td>3 </td><td>3 </td><td>4 </td><td>NA</td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|llll}\n & item1 & item2 & item3 & item4\\\\\n\\hline\n\t2 & 3 & 5 & 2 & 5\\\\\n\t4 & 3 & 3 & 4 & NA\\\\\n\\end{tabular}\n",
"text/plain": " item1 item2 item3 item4\n2 3 5 2 5 \n4 3 3 4 NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#the subset() function is a really powerful function to select or exclude variables and observations\n#examples:\n#this statement selects all observations where the age is either 35 or over OR under 24 and returns 4 columns (items 1 through 4)\nnewdata <- subset(leadership, age >= 35 | age < 24, select = c(item1, item2, item3, item4))\nnewdata",
"execution_count": 71
},
{
"outputs": [
{
"output_type": "display_data",
"metadata": {},
"data": {
"text/html": "<table>\n<thead><tr><th></th><th scope=col>gender</th><th scope=col>age</th><th scope=col>item1</th><th scope=col>item2</th><th scope=col>item3</th><th scope=col>item4</th><th scope=col>item5</th></tr></thead>\n<tbody>\n\t<tr><th scope=row>1</th><td>M </td><td>32</td><td>5 </td><td>4 </td><td>5 </td><td> 5</td><td> 5</td></tr>\n\t<tr><th scope=row>4</th><td>M </td><td>39</td><td>3 </td><td>3 </td><td>4 </td><td>NA</td><td>NA</td></tr>\n</tbody>\n</table>\n",
"text/latex": "\\begin{tabular}{r|lllllll}\n & gender & age & item1 & item2 & item3 & item4 & item5\\\\\n\\hline\n\t1 & M & 32 & 5 & 4 & 5 & 5 & 5\\\\\n\t4 & M & 39 & 3 & 3 & 4 & NA & NA\\\\\n\\end{tabular}\n",
"text/plain": " gender age item1 item2 item3 item4 item5\n1 M 32 5 4 5 5 5 \n4 M 39 3 3 4 NA NA "
}
}
],
"cell_type": "code",
"metadata": {},
"source": "#another example\n#this statement selects all men > 25 years old and all columns from gender to item 5 (and everything in between)\nnewdata <- subset(leadership, gender == \"M\" & age > 25, select = gender:item5)\nnewdata",
"execution_count": 73
},
{
"cell_type": "markdown",
"metadata": {},
"source": "### Random samples"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {},
"source": "#sampling from larger datasets is a common practice in data mining and machine learning\n#for example, you may want to create a predictive model from one sample and validate its effectiveness on the other\n#the sample() function lets you take a random sample (with or without replacement) from size n of a dataset\n#R has extensive facilities for sampling (see the \"sampling\" package)",
"execution_count": 81
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Using SQL statements to manipulate data frames"
},
{
"outputs": [],
"cell_type": "code",
"metadata": {},
"source": "#using the \"sqldf\" package, you can use SQL statements within R\n#experienced SQl users will find the sqldf package a useful addition to the R data preparation toolkit",
"execution_count": 89
}
],
"metadata": {
"kernelspec": {
"display_name": "R with Spark 2.0",
"name": "r-spark20",
"language": "R"
},
"language_info": {
"mimetype": "text/x-r-source",
"codemirror_mode": "r",
"pygments_lexer": "r",
"version": "3.3.2",
"name": "R",
"file_extension": ".r"
}
},
"nbformat": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment