@sabineri
Created February 13, 2020 17:51
Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://cognitiveclass.ai/wp-content/uploads/2017/11/cc-logo-square.png\" width=\"150\">\n",
"\n",
"\n",
"\n",
"\n",
"<h1 align=\"center\">VECTORS and FACTORS in R</h1> \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Welcome!\n",
"\n",
"By the end of this notebook, you will have learned about **vectors and factors**, two very important data types in R."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"<ul>\n",
" <li><a href=\"#About-the-Dataset\">About the Dataset</a></li>\n",
" <li><a href=\"#Vectors\">Vectors</a></li>\n",
" <li><a href=\"#Vector-Operations\">Vector Operations</a></li>\n",
" <li><a href=\"#Subsetting-Vectors\">Subsetting Vectors</a></li>\n",
" <li><a href=\"#Factors\">Factors</a></li>\n",
"</ul>\n",
"<p></p>\n",
"Estimated Time Needed: <strong>25 min</strong>\n",
"\n",
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref0\"></a>\n",
"<h2 align=center>About the Dataset</h2>\n",
"\n",
"You have received many movie recomendations from your friends and compiled all of the recommendations into a table, with information about each movie. \n",
"\n",
"This table has one row for each movie and several columns.\n",
"\n",
"- **name** - The name of the movie\n",
"- **year** - The year the movie was released\n",
"- **length_min** - The lenght of the movie in minutes\n",
"- **genre** - The genre of the movie\n",
"- **average_rating** - Average rating on Imdb\n",
"- **cost_millions** - The movie's production cost in millions\n",
"- **sequences** - The amount of sequences\n",
"- **foreign** - Indicative of whether the movie is foreign (1) or domestic (0)\n",
"- **age_restriction** - The age restriction for the movie\n",
"<br>\n",
"<br>\n",
"\n",
"Here's what the data looks like:\n",
"\n",
"<img src = \"https://ibm.box.com/shared/static/6kr8sg0n6pc40zd1xn6hjhtvy3k7cmeq.png\" width = 90% align=\"left\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 20px\">\n",
"**Remember**: To run the grey code cells in this exercise, click on the code cell, and then press Shift + Enter.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref1\"></a>\n",
"<center><h2>Vectors</h2></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Vectors** are strings of numbers, characters or logical data (one-dimension array). In other words, a vector is a simple tool to store your grouped data.\n",
"\n",
"In R, you create a vector with the combine function **c()**. You place the vector elements separated by a comma between the brackets. Vectors will be very useful in the future as they allow you to apply operations on a series of data easily.\n",
"\n",
"Note that the items in a vector must be of the same class, for example all should be either number, character, or logical."
]
},
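{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration of the same-class rule: if you mix types inside **`c()`**, R does not stop with an error but converts every element to one common type. The names `mixed` and `num_and_lgl` below are hypothetical and are not part of the movie dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical vectors, used only to illustrate type coercion\n",
"mixed <- c(1985, \"Akira\", TRUE)   # one character element turns every element into character\n",
"class(mixed)                      # \"character\"\n",
"\n",
"num_and_lgl <- c(1985, TRUE)      # logical values become numbers (TRUE -> 1)\n",
"class(num_and_lgl)                # \"numeric\"\n",
"num_and_lgl                       # 1985 1"
]
},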
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Numeric, Character and Logical Vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say we have four movie release dates (1985, 1999, 2015, 1964) and we want to assign them to a single variable, `release_year`. This means we'll need to create a vector using **`c()`**.\n",
"\n",
"Using numbers, this becomes a **numeric vector**."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"release_year <- c(1985, 1999, 2015, 1964)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"\t<li>2015</li>\n",
"\t<li>1964</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 1999\n",
"\\item 2015\n",
"\\item 1964\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 1999\n",
"3. 2015\n",
"4. 1964\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 1999 2015 1964"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we use quotation marks? Then this becomes a **character vector**."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'Toy Story'</li>\n",
"\t<li>'Akira'</li>\n",
"\t<li>'The Breakfast Club'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'Toy Story'\n",
"\\item 'Akira'\n",
"\\item 'The Breakfast Club'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'Toy Story'\n",
"2. 'Akira'\n",
"3. 'The Breakfast Club'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"Toy Story\" \"Akira\" \"The Breakfast Club\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Create genre vector and assign values to it \n",
"titles <- c(\"Toy Story\", \"Akira\", \"The Breakfast Club\")\n",
"titles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are also **logical vectors**, which consist of TRUE's and FALSE's. They're particular important when you want to check its contents"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>FALSE</li>\n",
"\t<li>TRUE</li>\n",
"\t<li>FALSE</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item FALSE\n",
"\\item TRUE\n",
"\\item FALSE\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. FALSE\n",
"2. TRUE\n",
"3. FALSE\n",
"\n",
"\n"
],
"text/plain": [
"[1] FALSE TRUE FALSE"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"titles == \"Akira\" # which item in `titles` is equal to \"Akira\"?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr></hr>\n",
"<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 20px\">\n",
"<h4> [Tip] TRUE and FALSE in R </h4> \n",
"\n",
"Did you know? R only recognizes `TRUE`, `FALSE`, `T` and `F` as special values for true and false. That means all other spellings, including *True* and *true*, are not interpreted by R as logical values.\n",
"\n",
"<p></p>\n",
"</div>\n",
"\n",
"<hr></hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref2\"></a>\n",
"<center><h2>Vector Operations</h2></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding more elements to a vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add more elements to a vector with the same **`c()`** function you use the create vectors:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"\t<li>2015</li>\n",
"\t<li>1964</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 1999\n",
"\\item 2015\n",
"\\item 1964\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 1999\n",
"3. 2015\n",
"4. 1964\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 1999 2015 1964"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year <- c(1985, 1999, 2015, 1964)\n",
"release_year"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"\t<li>2015</li>\n",
"\t<li>1964</li>\n",
"\t<li>2016</li>\n",
"\t<li>2017</li>\n",
"\t<li>2018</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 1999\n",
"\\item 2015\n",
"\\item 1964\n",
"\\item 2016\n",
"\\item 2017\n",
"\\item 2018\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 1999\n",
"3. 2015\n",
"4. 1964\n",
"5. 2016\n",
"6. 2017\n",
"7. 2018\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 1999 2015 1964 2016 2017 2018"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year <- c(release_year, 2016:2018)\n",
"release_year"
]
},
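{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of an alternative, assuming you want to insert values somewhere other than the end: base R's **`append()`** takes an `after` argument that gives the position after which the new values go. The vector `example_years` is a hypothetical name, chosen so that the `release_year` variable used in this notebook is left untouched."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical vector, separate from release_year\n",
"example_years <- c(1985, 1999, 2015)\n",
"\n",
"# Insert 1990 after the first element\n",
"example_years <- append(example_years, 1990, after = 1)\n",
"example_years   # 1985 1990 1999 2015"
]
},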
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Length of a vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do we check how many items there are in a vector? We can use the **length()** function:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"\t<li>2015</li>\n",
"\t<li>1964</li>\n",
"\t<li>2016</li>\n",
"\t<li>2017</li>\n",
"\t<li>2018</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 1999\n",
"\\item 2015\n",
"\\item 1964\n",
"\\item 2016\n",
"\\item 2017\n",
"\\item 2018\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 1999\n",
"3. 2015\n",
"4. 1964\n",
"5. 2016\n",
"6. 2017\n",
"7. 2018\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 1999 2015 1964 2016 2017 2018"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"7"
],
"text/latex": [
"7"
],
"text/markdown": [
"7"
],
"text/plain": [
"[1] 7"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year\n",
"length(release_year)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Head and Tail of a vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also retrieve just the **first few items** using the **head()** function:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"\t<li>2015</li>\n",
"\t<li>1964</li>\n",
"\t<li>2016</li>\n",
"\t<li>2017</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 1999\n",
"\\item 2015\n",
"\\item 1964\n",
"\\item 2016\n",
"\\item 2017\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 1999\n",
"3. 2015\n",
"4. 1964\n",
"5. 2016\n",
"6. 2017\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 1999 2015 1964 2016 2017"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"head(release_year) #first six items"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 1999\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 1999\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 1999"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"head(release_year, n = 2) #first n items"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 1999\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 1999\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 1999"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"head(release_year, 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also retrieve just the **last few items** using the **tail()** function:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1999</li>\n",
"\t<li>2015</li>\n",
"\t<li>1964</li>\n",
"\t<li>2016</li>\n",
"\t<li>2017</li>\n",
"\t<li>2018</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1999\n",
"\\item 2015\n",
"\\item 1964\n",
"\\item 2016\n",
"\\item 2017\n",
"\\item 2018\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1999\n",
"2. 2015\n",
"3. 1964\n",
"4. 2016\n",
"5. 2017\n",
"6. 2018\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1999 2015 1964 2016 2017 2018"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tail(release_year) #last six items"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>2017</li>\n",
"\t<li>2018</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 2017\n",
"\\item 2018\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 2017\n",
"2. 2018\n",
"\n",
"\n"
],
"text/plain": [
"[1] 2017 2018"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tail(release_year, 2) #last two items"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sorting a vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also sort a vector:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1964</li>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"\t<li>2015</li>\n",
"\t<li>2016</li>\n",
"\t<li>2017</li>\n",
"\t<li>2018</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1964\n",
"\\item 1985\n",
"\\item 1999\n",
"\\item 2015\n",
"\\item 2016\n",
"\\item 2017\n",
"\\item 2018\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1964\n",
"2. 1985\n",
"3. 1999\n",
"4. 2015\n",
"5. 2016\n",
"6. 2017\n",
"7. 2018\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1964 1985 1999 2015 2016 2017 2018"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sort(release_year)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also **sort in decreasing order**:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>2018</li>\n",
"\t<li>2017</li>\n",
"\t<li>2016</li>\n",
"\t<li>2015</li>\n",
"\t<li>1999</li>\n",
"\t<li>1985</li>\n",
"\t<li>1964</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 2018\n",
"\\item 2017\n",
"\\item 2016\n",
"\\item 2015\n",
"\\item 1999\n",
"\\item 1985\n",
"\\item 1964\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 2018\n",
"2. 2017\n",
"3. 2016\n",
"4. 2015\n",
"5. 1999\n",
"6. 1985\n",
"7. 1964\n",
"\n",
"\n"
],
"text/plain": [
"[1] 2018 2017 2016 2015 1999 1985 1964"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sort(release_year, decreasing = TRUE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But if you just want the minimum and maximum values of a vector, you can use the **`min()`** and **`max()`** functions"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"1964"
],
"text/latex": [
"1964"
],
"text/markdown": [
"1964"
],
"text/plain": [
"[1] 1964"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"2018"
],
"text/latex": [
"2018"
],
"text/markdown": [
"2018"
],
"text/plain": [
"[1] 2018"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"min(release_year)\n",
"max(release_year)"
]
},
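{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a short, hedged companion to `min()` and `max()`: **`range()`** returns both values at once, and **`which.min()`** / **`which.max()`** return the positions of the smallest and largest values. The vector `example_years` is a hypothetical copy of the first four release years, so `release_year` itself is not changed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical vector, separate from release_year\n",
"example_years <- c(1985, 1999, 2015, 1964)\n",
"\n",
"range(example_years)       # 1964 2015 (minimum, then maximum)\n",
"which.min(example_years)   # 4: position of the smallest value\n",
"which.max(example_years)   # 3: position of the largest value"
]
},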
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Average of Numbers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to check the average cost of movies produced in 2014, what would you do? Of course, one way is to add all the numbers together, then divide by the number of movies:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"8.4"
],
"text/latex": [
"8.4"
],
"text/markdown": [
"8.4"
],
"text/plain": [
"[1] 8.4"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cost_2014 <- c(8.6, 8.5, 8.1)\n",
"\n",
"# sum results in the sum of all elements in the vector\n",
"avg_cost_2014 <- sum(cost_2014)/3\n",
"avg_cost_2014"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You also can use the <b>mean</b> function to find the average of the numeric values in a vector:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"8.4"
],
"text/latex": [
"8.4"
],
"text/markdown": [
"8.4"
],
"text/plain": [
"[1] 8.4"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"mean_cost_2014 <- mean(cost_2014)\n",
"mean_cost_2014"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Giving Names to Values in a Vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose you want to remember which year corresponds to which movie.\n",
"\n",
"With vectors, you can give names to the elements of a vector using the **names() ** function:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>The Breakfast Club</dt>\n",
"\t\t<dd>1985</dd>\n",
"\t<dt>American Beauty</dt>\n",
"\t\t<dd>1999</dd>\n",
"\t<dt>Black Swan</dt>\n",
"\t\t<dd>2010</dd>\n",
"\t<dt>Chicago</dt>\n",
"\t\t<dd>2002</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[The Breakfast Club] 1985\n",
"\\item[American Beauty] 1999\n",
"\\item[Black Swan] 2010\n",
"\\item[Chicago] 2002\n",
"\\end{description*}\n"
],
"text/markdown": [
"The Breakfast Club\n",
": 1985American Beauty\n",
": 1999Black Swan\n",
": 2010Chicago\n",
": 2002\n",
"\n"
],
"text/plain": [
"The Breakfast Club American Beauty Black Swan Chicago \n",
" 1985 1999 2010 2002 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#Creating a year vector\n",
"release_year <- c(1985, 1999, 2010, 2002)\n",
"\n",
"#Assigning names\n",
"names(release_year) <- c(\"The Breakfast Club\", \"American Beauty\", \"Black Swan\", \"Chicago\")\n",
"\n",
"release_year"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, you can retrieve the values based on the names:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>American Beauty</dt>\n",
"\t\t<dd>1999</dd>\n",
"\t<dt>Chicago</dt>\n",
"\t\t<dd>2002</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[American Beauty] 1999\n",
"\\item[Chicago] 2002\n",
"\\end{description*}\n"
],
"text/markdown": [
"American Beauty\n",
": 1999Chicago\n",
": 2002\n",
"\n"
],
"text/plain": [
"American Beauty Chicago \n",
" 1999 2002 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year[c(\"American Beauty\", \"Chicago\")]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the values of the vector are still the years. We can see this in action by adding a number to the first item:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<strong>The Breakfast Club:</strong> 2085"
],
"text/latex": [
"\\textbf{The Breakfast Club:} 2085"
],
"text/markdown": [
"**The Breakfast Club:** 2085"
],
"text/plain": [
"The Breakfast Club \n",
" 2085 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year[1] + 100 #adding 100 to the first item changes the year"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And you can retrieve the names of the vector using **`names()`**"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'The Breakfast Club'</li>\n",
"\t<li>'American Beauty'</li>\n",
"\t<li>'Black Swan'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'The Breakfast Club'\n",
"\\item 'American Beauty'\n",
"\\item 'Black Swan'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'The Breakfast Club'\n",
"2. 'American Beauty'\n",
"3. 'Black Swan'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"The Breakfast Club\" \"American Beauty\" \"Black Swan\" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"names(release_year)[1:3]"
]
},
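{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of two related idioms, under the assumption that you sometimes want the names set at creation time or dropped again: names can be given directly inside `c()`, double square brackets return a single value without its name, and **`unname()`** strips all names. The variable `example_year` is hypothetical and leaves `release_year` unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical named vector built in one step\n",
"example_year <- c(\"The Breakfast Club\" = 1985, \"Chicago\" = 2002)\n",
"\n",
"example_year[[\"Chicago\"]]   # 2002, returned without the name attached\n",
"unname(example_year)        # 1985 2002, all names removed"
]
},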
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summarizing Vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also use the **\"summary\"** function for simple descriptive statistics: minimum, first quartile, mean, third quartile, maximum:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
" Min. 1st Qu. Median Mean 3rd Qu. Max. \n",
" 8.10 8.30 8.50 8.40 8.55 8.60 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"summary(cost_2014)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Logical Operations on Vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A vector can also be comprised of **`TRUE`** and **`FALSE`**, which are special **logical values** in R. These boolean values are used used to indicate whether a condition is true or false. \n",
"\n",
"Let's check whether a movie year of 1997 is older than (**greater in value than**) 2000."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"FALSE"
],
"text/latex": [
"FALSE"
],
"text/markdown": [
"FALSE"
],
"text/plain": [
"[1] FALSE"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movie_year <- 1997\n",
"movie_year > 2000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also make a logical comparison across multiple items in a vector. Which movie release years here are \"greater\" than 2014?"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>FALSE</li>\n",
"\t<li>FALSE</li>\n",
"\t<li>TRUE</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item FALSE\n",
"\\item FALSE\n",
"\\item TRUE\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. FALSE\n",
"2. FALSE\n",
"3. TRUE\n",
"\n",
"\n"
],
"text/plain": [
"[1] FALSE FALSE TRUE"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movies_years <- c(1998, 2010, 2016)\n",
"movies_years > 2014"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check for **equivalence**, using **`==`**. Let's check which movie year is equal to 2015."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>FALSE</li>\n",
"\t<li>FALSE</li>\n",
"\t<li>FALSE</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item FALSE\n",
"\\item FALSE\n",
"\\item FALSE\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. FALSE\n",
"2. FALSE\n",
"3. FALSE\n",
"\n",
"\n"
],
"text/plain": [
"[1] FALSE FALSE FALSE"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movies_years == 2015 # is equal to 2015?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to check which ones are **not equal** to 2015, you can use **`!=`**"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>TRUE</li>\n",
"\t<li>TRUE</li>\n",
"\t<li>TRUE</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item TRUE\n",
"\\item TRUE\n",
"\\item TRUE\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. TRUE\n",
"2. TRUE\n",
"3. TRUE\n",
"\n",
"\n"
],
"text/plain": [
"[1] TRUE TRUE TRUE"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movies_years != 2015"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr></hr>\n",
"<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 20px\">\n",
"<h4> [Tip] Logical Operators in R </h4>\n",
"<p></p>\n",
"You can do a variety of logical operations in R including: \n",
"<li> Checking equivalence: **1 == 2** </li>\n",
"<li> Checking non-equivalence: **TRUE != FALSE** </li>\n",
"<li> Greater than: **100 > 1** </li>\n",
"<li> Greater than or equal to: **100 >= 1** </li>\n",
"<li> Less than: **1 < 2** </li>\n",
"<li> Less than or equal to: **1 <= 2** </li>\n",
"</div>\n",
"<hr></hr>"
]
},
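{
"cell_type": "markdown",
"metadata": {},
"source": [
"The comparisons above can also be combined element by element with **`&`** (and) and **`|`** (or). The sketch below is only an illustration: `example_years` and `in_window` are hypothetical names, and the cut-off years are arbitrary. Because `TRUE` counts as 1 in arithmetic, `sum()` gives the number of matches, and `which()` gives their positions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical vector with the same values as movies_years\n",
"example_years <- c(1998, 2010, 2016)\n",
"\n",
"# Both conditions must hold for an element to be TRUE\n",
"in_window <- example_years > 2000 & example_years < 2015\n",
"in_window                                     # FALSE TRUE FALSE\n",
"\n",
"# Either condition may hold\n",
"example_years < 2000 | example_years > 2015   # TRUE FALSE TRUE\n",
"\n",
"sum(example_years > 2000)                     # 2: number of TRUE values\n",
"which(example_years > 2000)                   # 2 3: positions of the TRUE values"
]
},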
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref3\"></a>\n",
"<center><h2>Subsetting Vectors</h2><center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if you wanted to retrieve the second year from the following **vector of movie years**?"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>1999</li>\n",
"\t<li>2002</li>\n",
"\t<li>2010</li>\n",
"\t<li>2012</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 1999\n",
"\\item 2002\n",
"\\item 2010\n",
"\\item 2012\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 1999\n",
"3. 2002\n",
"4. 2010\n",
"5. 2012\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 1999 2002 2010 2012"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movie_years <- c(1985, 1999, 2002, 2010, 2012)\n",
"movie_years"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To retrieve the **second year**, you can use square brackets **`[]`**:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"1999"
],
"text/latex": [
"1999"
],
"text/markdown": [
"1999"
],
"text/plain": [
"[1] 1999"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movie_years[2] #second item"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To retrieve the **third year**, you can use:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"2002"
],
"text/latex": [
"2002"
],
"text/markdown": [
"2002"
],
"text/plain": [
"[1] 2002"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movie_years[3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And if you want to retrieve **multiple items**, you can pass in a vector:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>1985</li>\n",
"\t<li>2002</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 1985\n",
"\\item 2002\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 1985\n",
"2. 2002\n",
"\n",
"\n"
],
"text/plain": [
"[1] 1985 2002"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movie_years[c(1,3)] #first and third items"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Retrieving a vector without some of its items**\n",
"\n",
"To retrieve a vector without an item, you can use negative indexing. For example, the following returns a vector slice **without the first item**."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'Jumanji'</li>\n",
"\t<li>'City of God'</li>\n",
"\t<li>'Toy Story'</li>\n",
"\t<li>'Casino'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'Jumanji'\n",
"\\item 'City of God'\n",
"\\item 'Toy Story'\n",
"\\item 'Casino'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'Jumanji'\n",
"2. 'City of God'\n",
"3. 'Toy Story'\n",
"4. 'Casino'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"Jumanji\" \"City of God\" \"Toy Story\" \"Casino\" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"titles <- c(\"Black Swan\", \"Jumanji\", \"City of God\", \"Toy Story\", \"Casino\")\n",
"titles[-1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can save the new vector using a variable:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'Jumanji'</li>\n",
"\t<li>'City of God'</li>\n",
"\t<li>'Toy Story'</li>\n",
"\t<li>'Casino'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'Jumanji'\n",
"\\item 'City of God'\n",
"\\item 'Toy Story'\n",
"\\item 'Casino'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'Jumanji'\n",
"2. 'City of God'\n",
"3. 'Toy Story'\n",
"4. 'Casino'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"Jumanji\" \"City of God\" \"Toy Story\" \"Casino\" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"new_titles <- titles[-1] #removes \"Black Swan\", the first item\n",
"new_titles"
]
},
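{
"cell_type": "markdown",
"metadata": {},
"source": [
"Negative indexing also works with several positions at once. This is a hedged sketch using the hypothetical names `example_titles` and `trimmed`, so the `titles` and `new_titles` vectors above stay as they are."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical copy of the titles vector\n",
"example_titles <- c(\"Black Swan\", \"Jumanji\", \"City of God\", \"Toy Story\", \"Casino\")\n",
"\n",
"# Drop the first and fifth items in one step\n",
"trimmed <- example_titles[-c(1, 5)]\n",
"trimmed   # \"Jumanji\" \"City of God\" \"Toy Story\"\n",
"\n",
"# Positive and negative indices cannot be mixed:\n",
"# example_titles[c(-1, 2)] stops with an error"
]
},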
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Missing Values (NA)**\n",
"\n",
"Sometimes values in a vector are missing and you have to show them using NA, which is a special value in R for \"Not Available\". For example, if you don't know the age restriction for some movies, you can use NA."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>14</li>\n",
"\t<li>12</li>\n",
"\t<li>10</li>\n",
"\t<li>&lt;NA&gt;</li>\n",
"\t<li>18</li>\n",
"\t<li>&lt;NA&gt;</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 14\n",
"\\item 12\n",
"\\item 10\n",
"\\item <NA>\n",
"\\item 18\n",
"\\item <NA>\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 14\n",
"2. 12\n",
"3. 10\n",
"4. &lt;NA&gt;\n",
"5. 18\n",
"6. &lt;NA&gt;\n",
"\n",
"\n"
],
"text/plain": [
"[1] 14 12 10 NA 18 NA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"age_restric <- c(14, 12, 10, NA, 18, NA)\n",
"age_restric"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 20px\">\n",
"<h4> [Tip] Checking NA in R </h4>\n",
"<p></p>\n",
"You can check if a value is NA by using the **is.na()** function, which returns TRUE or FALSE. \n",
"<li> Check if NA: **is.na(NA)** </li>\n",
"<li> Check if not NA: **!is.na(2)** </li>\n",
"</div>\n"
]
},
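{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the tip above, **is.na()** can also be applied to a whole vector at once. The code re-creates `age_restric` so it can run on its own; `known_ages` is a hypothetical name introduced only for this illustration.\n",
"\n",
"```r\n",
"age_restric <- c(14, 12, 10, NA, 18, NA)\n",
"\n",
"is.na(age_restric)                               # one TRUE/FALSE per element\n",
"sum(is.na(age_restric))                          # number of missing values\n",
"known_ages <- age_restric[!is.na(age_restric)]   # keep only the non-missing values\n",
"mean(age_restric, na.rm = TRUE)                  # na.rm = TRUE skips NA when averaging\n",
"```\n",
"\n",
"Without `na.rm = TRUE`, `mean(age_restric)` returns NA, because arithmetic that involves a missing value is itself missing."
]
},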
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subsetting vectors based on a logical condition"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we want to know which movies were released after the year 2000? We can apply a logical comparison to every item in a vector at once:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>The Breakfast Club</dt>\n",
"\t\t<dd>FALSE</dd>\n",
"\t<dt>American Beauty</dt>\n",
"\t\t<dd>FALSE</dd>\n",
"\t<dt>Black Swan</dt>\n",
"\t\t<dd>TRUE</dd>\n",
"\t<dt>Chicago</dt>\n",
"\t\t<dd>TRUE</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[The Breakfast Club] FALSE\n",
"\\item[American Beauty] FALSE\n",
"\\item[Black Swan] TRUE\n",
"\\item[Chicago] TRUE\n",
"\\end{description*}\n"
],
"text/markdown": [
"The Breakfast Club\n",
": FALSEAmerican Beauty\n",
": FALSEBlack Swan\n",
": TRUEChicago\n",
": TRUE\n",
"\n"
],
"text/plain": [
"The Breakfast Club American Beauty Black Swan Chicago \n",
" FALSE FALSE TRUE TRUE "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year > 2000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To retrieve the actual release years after 2000, subset the vector by placing the logical vector within **square brackets \"[]\"**:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>Black Swan</dt>\n",
"\t\t<dd>2010</dd>\n",
"\t<dt>Chicago</dt>\n",
"\t\t<dd>2002</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[Black Swan] 2010\n",
"\\item[Chicago] 2002\n",
"\\end{description*}\n"
],
"text/markdown": [
"Black Swan\n",
": 2010Chicago\n",
": 2002\n",
"\n"
],
"text/plain": [
"Black Swan    Chicago \n",
"      2010       2002 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year[release_year > 2000] #returns a vector of the elements for which the condition is TRUE"
]
},
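{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to get the movie titles rather than the years is to apply the same kind of logical vector to `names(release_year)`. The sketch below re-creates `release_year` with the values used in this notebook so that it runs on its own; `after_2000` is a hypothetical helper name.\n",
"\n",
"```r\n",
"release_year <- c(1985, 1999, 2010, 2002)\n",
"names(release_year) <- c(\"The Breakfast Club\", \"American Beauty\", \"Black Swan\", \"Chicago\")\n",
"\n",
"after_2000 <- release_year > 2000   # logical vector, one entry per movie\n",
"names(release_year)[after_2000]     # titles of the movies released after 2000\n",
"which(after_2000)                   # positions of the TRUE entries\n",
"sum(after_2000)                     # TRUE counts as 1, so this counts the matches\n",
"```\n",
"\n",
"Storing the condition in its own variable makes it easy to reuse the same filter on several vectors of the same length."
]
},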
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you may notice, subsetting a vector with a logical condition returns the items for which the condition is TRUE. For example, `release_year[release_year > 2000]` can be read as: _\"From the vector `release_year`, return only the values for which `release_year > 2000` is TRUE\"_.\n",
"\n",
"You can even manually write out TRUE or T for the values you want to subset:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>The Breakfast Club</dt>\n",
"\t\t<dd>1985</dd>\n",
"\t<dt>American Beauty</dt>\n",
"\t\t<dd>1999</dd>\n",
"\t<dt>Black Swan</dt>\n",
"\t\t<dd>2010</dd>\n",
"\t<dt>Chicago</dt>\n",
"\t\t<dd>2002</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[The Breakfast Club] 1985\n",
"\\item[American Beauty] 1999\n",
"\\item[Black Swan] 2010\n",
"\\item[Chicago] 2002\n",
"\\end{description*}\n"
],
"text/markdown": [
"The Breakfast Club\n",
": 1985American Beauty\n",
": 1999Black Swan\n",
": 2010Chicago\n",
": 2002\n",
"\n"
],
"text/plain": [
"The Breakfast Club American Beauty Black Swan Chicago \n",
" 1985 1999 2010 2002 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<strong>The Breakfast Club:</strong> 1985"
],
"text/latex": [
"\\textbf{The Breakfast Club:} 1985"
],
"text/markdown": [
"**The Breakfast Club:** 1985"
],
"text/plain": [
"The Breakfast Club \n",
" 1985 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"release_year\n",
"release_year[c(T, F, F, F)] #returns the values that are TRUE"
]
},
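{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two details of logical subsetting are worth a short, hedged sketch. `T` and `F` are ordinary variables that can be overwritten, whereas `TRUE` and `FALSE` are reserved words, so the long forms are the safer habit. Also, a logical index that is shorter than the vector is recycled. The code re-creates `release_year` so that it runs on its own.\n",
"\n",
"```r\n",
"release_year <- c(1985, 1999, 2010, 2002)\n",
"names(release_year) <- c(\"The Breakfast Club\", \"American Beauty\", \"Black Swan\", \"Chicago\")\n",
"\n",
"release_year[c(TRUE, FALSE, FALSE, FALSE)]   # same result as c(T, F, F, F)\n",
"release_year[c(TRUE, FALSE)]                 # index is recycled: keeps elements 1 and 3\n",
"```\n"
]
},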
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref4\"></a>\n",
"<center><h2>Factors</h2></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Factors are variables in R which take on a limited number of different values; such variables are often referred to as **categorical variables**. The difference between a categorical variable and a continuous variable is that a categorical variable can only take one of a limited number of categories. A continuous variable, on the other hand, can take any of an infinite number of values. For example, the height of a tree is a continuous variable, but the titles of books would be a categorical variable.\n",
"\n",
"One of the most important uses of factors is in statistical modeling; since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with a _**vector**_ of genres:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'Comedy'</li>\n",
"\t<li>'Animation'</li>\n",
"\t<li>'Crime'</li>\n",
"\t<li>'Comedy'</li>\n",
"\t<li>'Animation'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'Comedy'\n",
"\\item 'Animation'\n",
"\\item 'Crime'\n",
"\\item 'Comedy'\n",
"\\item 'Animation'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'Comedy'\n",
"2. 'Animation'\n",
"3. 'Crime'\n",
"4. 'Comedy'\n",
"5. 'Animation'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"Comedy\" \"Animation\" \"Crime\" \"Comedy\" \"Animation\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"genre_vector <- c(\"Comedy\", \"Animation\", \"Crime\", \"Comedy\", \"Animation\")\n",
"genre_vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you may have noticed, you can group the items above into three categories of genres: _Animation_, _Comedy_ and _Crime_. In R terms, we call these categories **\"factor levels\"**.\n",
"\n",
"The function **factor()**, or its shorthand **as.factor()** used below, converts a vector into a factor and creates a factor level for each unique element."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'Animation'</li>\n",
"\t<li>'Comedy'</li>\n",
"\t<li>'Crime'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'Animation'\n",
"\\item 'Comedy'\n",
"\\item 'Crime'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'Animation'\n",
"2. 'Comedy'\n",
"3. 'Crime'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"Animation\" \"Comedy\" \"Crime\" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"genre_factor <- as.factor(genre_vector)\n",
"levels(genre_factor)"
]
},
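{
"cell_type": "markdown",
"metadata": {},
"source": [
"Internally, a factor stores one integer code per element plus the vector of level labels. The following is a minimal sketch of how to inspect both, re-creating `genre_vector` so that it runs on its own:\n",
"\n",
"```r\n",
"genre_vector <- c(\"Comedy\", \"Animation\", \"Crime\", \"Comedy\", \"Animation\")\n",
"genre_factor <- factor(genre_vector)   # gives the same result as as.factor() here\n",
"\n",
"nlevels(genre_factor)        # number of distinct levels\n",
"as.integer(genre_factor)     # integer codes, one per element, pointing into levels()\n",
"as.character(genre_factor)   # back to the original character vector\n",
"```\n",
"\n",
"The levels are sorted when the factor is created, so _Animation_ gets code 1, _Comedy_ code 2 and _Crime_ code 3, regardless of the order in which they appear in the data."
]
},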
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summarizing Factors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you have a large vector, it becomes difficult to identify which levels are most common (e.g., \"How many 'Comedy' movies are there?\").\n",
"\n",
"To answer this, we can use **summary()**, which produces a **frequency table**, as a named vector."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>Animation</dt>\n",
"\t\t<dd>2</dd>\n",
"\t<dt>Comedy</dt>\n",
"\t\t<dd>2</dd>\n",
"\t<dt>Crime</dt>\n",
"\t\t<dd>1</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[Animation] 2\n",
"\\item[Comedy] 2\n",
"\\item[Crime] 1\n",
"\\end{description*}\n"
],
"text/markdown": [
"Animation\n",
": 2Comedy\n",
": 2Crime\n",
": 1\n",
"\n"
],
"text/plain": [
"Animation Comedy Crime \n",
" 2 2 1 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"summary(genre_factor)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And recall that you can sort the values of the table using **sort()**."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>Crime</dt>\n",
"\t\t<dd>1</dd>\n",
"\t<dt>Animation</dt>\n",
"\t\t<dd>2</dd>\n",
"\t<dt>Comedy</dt>\n",
"\t\t<dd>2</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[Crime] 1\n",
"\\item[Animation] 2\n",
"\\item[Comedy] 2\n",
"\\end{description*}\n"
],
"text/markdown": [
"Crime\n",
": 1Animation\n",
": 2Comedy\n",
": 2\n",
"\n"
],
"text/plain": [
" Crime Animation Comedy \n",
" 1 2 2 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sort(summary(genre_factor)) #sorts values by ascending order"
]
},
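{
"cell_type": "markdown",
"metadata": {},
"source": [
"A few related calls can help when the question is \"which level is most common?\". This is a sketch rather than the only way to do it; it re-creates `genre_factor` so that it runs on its own.\n",
"\n",
"```r\n",
"genre_factor <- factor(c(\"Comedy\", \"Animation\", \"Crime\", \"Comedy\", \"Animation\"))\n",
"\n",
"table(genre_factor)                              # same counts as summary(), returned as a table\n",
"sort(summary(genre_factor), decreasing = TRUE)   # most common genres first\n",
"names(which.max(summary(genre_factor)))          # first level that has the highest count\n",
"```\n",
"\n",
"Note that `which.max()` reports only the first maximum, so ties (here _Animation_ and _Comedy_) are not all listed."
]
},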
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ordered factors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two types of categorical variables: a **nominal categorical variable** and an **ordinal categorical variable**.\n",
"\n",
"A **nominal variable** is a categorical variable whose categories are names without an implied order. This means that it is impossible to say that 'one is better or larger than the other'. For example, consider **movie genre** with the categories _Comedy_, _Animation_ and _Crime_. Here, there is no implicit order of low-to-high or high-to-low between the categories. \n",
"\n",
"In contrast, **ordinal variables** do have a natural ordering. Consider, for example, **movie length** with the categories _Very Short_, _Short_, _Medium_, _Long_ and _Very Long_. Here it is clear that _Medium_ stands above _Short_, and _Long_ stands above _Medium_."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>'Very Short'</li>\n",
"\t<li>'Short'</li>\n",
"\t<li>'Medium'</li>\n",
"\t<li>'Short'</li>\n",
"\t<li>'Long'</li>\n",
"\t<li>'Very Short'</li>\n",
"\t<li>'Very Long'</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 'Very Short'\n",
"\\item 'Short'\n",
"\\item 'Medium'\n",
"\\item 'Short'\n",
"\\item 'Long'\n",
"\\item 'Very Short'\n",
"\\item 'Very Long'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 'Very Short'\n",
"2. 'Short'\n",
"3. 'Medium'\n",
"4. 'Short'\n",
"5. 'Long'\n",
"6. 'Very Short'\n",
"7. 'Very Long'\n",
"\n",
"\n"
],
"text/plain": [
"[1] \"Very Short\" \"Short\" \"Medium\" \"Short\" \"Long\" \n",
"[6] \"Very Short\" \"Very Long\" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movie_length <- c(\"Very Short\", \"Short\", \"Medium\",\"Short\", \"Long\",\n",
" \"Very Short\", \"Very Long\")\n",
"movie_length"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__`movie_length`__ should be converted to an ordinal factor since its categories have a natural ordering. By default, the function <b>factor()</b> transforms `movie_length` into an unordered factor. \n",
"\n",
"To create an **ordered factor**, you have to add two additional arguments: `ordered` and `levels`. \n",
"- `ordered`: When set to `TRUE` in `factor()`, you indicate that the factor is ordered. \n",
"- `levels`: In this argument in `factor()`, you give the values of the factor in the correct order."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>Very Short</li>\n",
"\t<li>Short</li>\n",
"\t<li>Medium</li>\n",
"\t<li>Short</li>\n",
"\t<li>Long</li>\n",
"\t<li>Very Short</li>\n",
"\t<li>Very Long</li>\n",
"</ol>\n",
"\n",
"<details>\n",
"\t<summary style=display:list-item;cursor:pointer>\n",
"\t\t<strong>Levels</strong>:\n",
"\t</summary>\n",
"\t<ol class=list-inline>\n",
"\t\t<li>'Very Short'</li>\n",
"\t\t<li>'Short'</li>\n",
"\t\t<li>'Medium'</li>\n",
"\t\t<li>'Long'</li>\n",
"\t\t<li>'Very Long'</li>\n",
"\t</ol>\n",
"</details>"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item Very Short\n",
"\\item Short\n",
"\\item Medium\n",
"\\item Short\n",
"\\item Long\n",
"\\item Very Short\n",
"\\item Very Long\n",
"\\end{enumerate*}\n",
"\n",
"\\emph{Levels}: \\begin{enumerate*}\n",
"\\item 'Very Short'\n",
"\\item 'Short'\n",
"\\item 'Medium'\n",
"\\item 'Long'\n",
"\\item 'Very Long'\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. Very Short\n",
"2. Short\n",
"3. Medium\n",
"4. Short\n",
"5. Long\n",
"6. Very Short\n",
"7. Very Long\n",
"\n",
"\n",
"\n",
"**Levels**: 1. 'Very Short'\n",
"2. 'Short'\n",
"3. 'Medium'\n",
"4. 'Long'\n",
"5. 'Very Long'\n",
"\n",
"\n"
],
"text/plain": [
"[1] Very Short Short Medium Short Long Very Short Very Long \n",
"Levels: Very Short < Short < Medium < Long < Very Long"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"movie_length_ordered <- factor(movie_length, ordered = TRUE , \n",
" levels = c(\"Very Short\" , \"Short\" , \"Medium\", \n",
" \"Long\",\"Very Long\"))\n",
"movie_length_ordered"
]
},
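{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the levels are ordered, comparison operators such as `<` and `>` work on the factor; on an unordered factor they return NA with a warning. A minimal sketch, re-creating `movie_length_ordered` so that it runs on its own:\n",
"\n",
"```r\n",
"movie_length <- c(\"Very Short\", \"Short\", \"Medium\", \"Short\", \"Long\",\n",
"                  \"Very Short\", \"Very Long\")\n",
"movie_length_ordered <- factor(movie_length, ordered = TRUE,\n",
"                               levels = c(\"Very Short\", \"Short\", \"Medium\",\n",
"                                          \"Long\", \"Very Long\"))\n",
"\n",
"movie_length_ordered[1] < movie_length_ordered[3]      # is \"Very Short\" below \"Medium\"?\n",
"movie_length_ordered[movie_length_ordered > \"Short\"]   # keep only lengths above \"Short\"\n",
"max(movie_length_ordered)                              # the highest level present in the data\n",
"```\n"
]
},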
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's look at the summary of the ordered factor, <b>movie_length_ordered</b>:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>Very Short</dt>\n",
"\t\t<dd>2</dd>\n",
"\t<dt>Short</dt>\n",
"\t\t<dd>2</dd>\n",
"\t<dt>Medium</dt>\n",
"\t\t<dd>1</dd>\n",
"\t<dt>Long</dt>\n",
"\t\t<dd>1</dd>\n",
"\t<dt>Very Long</dt>\n",
"\t\t<dd>1</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[Very Short] 2\n",
"\\item[Short] 2\n",
"\\item[Medium] 1\n",
"\\item[Long] 1\n",
"\\item[Very Long] 1\n",
"\\end{description*}\n"
],
"text/markdown": [
"Very Short\n",
": 2Short\n",
": 2Medium\n",
": 1Long\n",
": 1Very Long\n",
": 1\n",
"\n"
],
"text/plain": [
"Very Short Short Medium Long Very Long \n",
" 2 2 1 1 1 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"summary(movie_length_ordered)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Scaling R with big data\n",
"\n",
"As you learn more about R, if you are interested in exploring platforms that can help you run analyses at scale, you might want to sign up for a free account on [IBM Watson Studio](http://cocl.us/dsx_rp0101en), which allows you to run analyses in R with two Spark executors for free."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"\n",
"### About the Author: \n",
"Hi! It's [Helly Patel](https://ca.linkedin.com/in/helly-patel-90344750), the author of this notebook. I hope you found R easy to learn! There's lots more to learn about R but you're well on your way. Feel free to connect with me if you have any questions.\n",
"\n",
"\n",
"<hr>\n",
"\n",
"Copyright &copy; [IBM Cognitive Class](https://cognitiveclass.ai). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "conda-env-r-r"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.5.1"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}