## Missing Value
Selection: 5


  |
  |                                                                      |   0%

| Missing values play an important role in statistics and data analysis. Often,
| missing values must not be ignored, but rather they should be carefully
| studied to see if there's an underlying pattern or cause for their
| missingness.

...


  |
  |====                                                                  |   5%
| In R, NA is used to represent any value that is 'not available' or 'missing'
| (in the statistical sense). In this lesson, we'll explore missing values
| further.

...


  |
  |=======                                                               |  10%
| Any operation involving NA generally yields NA as the result. To illustrate,
| let's create a vector c(44, NA, 5, NA) and assign it to a variable x.

> x <- c(44, NA, 5, NA)

| Keep working like that and you'll get there!


  |
  |===========                                                           |  15%
| Now, let's multiply x by 3.

> y <-x*3

| Not quite! Try again. Or, type info() for more options.

| Try x * 3.

> x*3
[1] 132  NA  15  NA

| Perseverance, that's the answer.


  |
  |==============                                                        |  20%
| Notice that the elements of the resulting vector that correspond with the NA
| values in x are also NA.

...
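
As a small sketch of the same idea beyond multiplication, the snippet below reuses the lesson's `x` and shows that summary functions such as `sum()` and `mean()` also return NA unless missing values are dropped with `na.rm = TRUE`. The choice of functions is illustrative and not part of the lesson.

```r
x <- c(44, NA, 5, NA)

# Arithmetic is element-wise, so each NA in x stays NA in the result.
x * 3
x + 1

# Summary functions return NA as well, unless missing values are dropped.
sum(x)                 # NA
sum(x, na.rm = TRUE)   # 49
mean(x, na.rm = TRUE)  # 24.5
```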


  |
  |==================                                                    |  25%
| To make things a little more interesting, let's create a vector containing
| 1000 draws from a standard normal distribution with y <- rnorm(1000).

> y <- rnorm(1000)

| You are really on a roll!


  |
  |=====================                                                 |  30%
| Next, let's create a vector containing 1000 NAs with z <- rep(NA, 1000).

> z <- rep(NA, 1000)

| You nailed it! Good job!


  |
  |=========================                                             |  35%
| Finally, let's select 100 elements at random from these 2000 values
| (combining y and z) such that we don't know how many NAs we'll wind up with
| or what positions they'll occupy in our final vector -- my_data <-
| sample(c(y, z), 100).

> my_data <-
+ sample(c(y, z), 100)

| That's the answer I was looking for.


  |
  |============================                                          |  40%
| Let's first ask the question of where our NAs are located in our data. The
| is.na() function tells us whether each element of a vector is NA. Call
| is.na() on my_data and assign the result to my_na.

>
> is.na()
Error in is.na() : 0 arguments passed to 'is.na' which requires 1
> my_na <- is.na()
Error in is.na() : 0 arguments passed to 'is.na' which requires 1
> my_na <- is.na(my_data)

| You are really on a roll!


  |
  |================================                                      |  45%
| Now, print my_na to see what you came up with.

> my_na
  [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
 [13]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
 [25] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
 [49] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
 [61]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
 [73] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
 [85]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [97]  TRUE  TRUE FALSE FALSE

| Nice work!


  |
  |===================================                                   |  50%
| Everywhere you see a TRUE, you know the corresponding element of my_data is
| NA. Likewise, everywhere you see a FALSE, you know the corresponding element
| of my_data is one of our random draws from the standard normal distribution.

...
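
One way to put the logical vector to work, sketched here on a short made-up vector `v` rather than the lesson's random `my_data`: `which()` turns the TRUEs into positions, and negating the vector selects the values that are present.

```r
# A small stand-in vector; the values are invented for illustration.
v <- c(NA, 1.5, NA, -0.2, 3)
v_na <- is.na(v)

which(v_na)  # positions of the missing values: 1 3
v[!v_na]     # the non-missing values: 1.5 -0.2 3.0
```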


  |
  |======================================                                |  55%
| In our previous discussion of logical operators, we introduced the `==`
| operator as a method of testing for equality between two objects. So, you
| might think the expression my_data == NA yields the same results as is.na().
| Give it a try.

>
> my_data == NA
  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

| You're the best!


  |
  |==========================================                            |  60%
| The reason you got a vector of all NAs is that NA is not really a value, but
| just a placeholder for a quantity that is not available. Therefore the
| logical expression is incomplete and R has no choice but to return a vector
| of the same length as my_data that contains all NAs.

...
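
A compact illustration of the point, using a made-up vector `v`: comparing anything to NA with `==` gives NA, even when both sides are NA, so `is.na()` is the test to reach for.

```r
v <- c(NA, 1.5, NA, -0.2)

v == NA   # NA NA NA NA, because the comparison itself is unknown
NA == NA  # NA, two unknown quantities cannot be shown to be equal
is.na(v)  # TRUE FALSE TRUE FALSE
```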


  |
  |==============================================                        |  65%
| Don't worry if that's a little confusing. The key takeaway is to be cautious
| when using logical expressions anytime NAs might creep in, since a single NA
| value can derail the entire thing.

...
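
The sketch below shows a few places where an NA can surface in logical expressions, along with some common ways to guard against it. The vector `v` and the thresholds are assumptions chosen for illustration; the lesson itself does not cover these functions.

```r
v <- c(2, NA, 7)

v > 5                     # FALSE NA TRUE
any(v > 5)                # TRUE, since one TRUE is enough to decide
all(v > 1)                # NA, the unknown element could be FALSE
all(v > 1, na.rm = TRUE)  # TRUE once the NA is dropped

# Logical indexing keeps an NA slot; which() drops it.
v[v > 5]                  # NA 7
v[which(v > 5)]           # 7

# if (NA > 5) would stop with an error; isTRUE() gives a safe FALSE instead.
isTRUE(NA > 5)            # FALSE
```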


  |
  |=================================================                     |  70%
| So, back to the task at hand. Now that we have a vector, my_na, that has a
| TRUE for every NA and FALSE for every numeric value, we can compute the total
| number of NAs in our data.

...


  |
  |====================================================                  |  75%
| The trick is to recognize that underneath the surface, R represents TRUE as
| the number 1 and FALSE as the number 0. Therefore, if we take the sum of a
| bunch of TRUEs and FALSEs, we get the total number of TRUEs.

...
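
A quick way to see the coercion for yourself, using a made-up logical vector `flags`: summing counts the TRUEs, and, as a small extension not in the lesson, `mean()` gives the proportion of TRUEs.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

as.integer(flags)  # 1 0 1 1
sum(flags)         # 3, the number of TRUEs
mean(flags)        # 0.75, the proportion of TRUEs
TRUE + TRUE        # 2
```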


  |
  |========================================================              |  80%
| Let's give that a try here. Call the sum() function on my_na to count the
| total number of TRUEs in my_na, and thus the total number of NAs in my_data.
| Don't assign the result to a new variable.

> sum(my_na)
[1] 51

| That's a job well done!


  |
  |============================================================          |  85%
| Pretty cool, huh? Finally, let's take a look at the data to convince
| ourselves that everything 'adds up'. Print my_data to the console.

>
> my_data
  [1]          NA          NA  1.09462984  0.59778656 -0.90944656          NA
  [7] -0.35787991 -0.03112132  0.26389767          NA          NA          NA
 [13]          NA          NA          NA          NA  0.21294024          NA
 [19] -1.48071872          NA  1.91079078  1.54674727          NA          NA
 [25] -0.59913377          NA          NA -0.98306919          NA          NA
 [31]          NA -0.27882350 -1.36030614          NA          NA  2.64448799
 [37] -2.12506858  0.25815328 -0.28095853  0.40540060 -0.39859703          NA
 [43]  0.22767670  1.17414183          NA  0.20077198 -0.25732459  1.47075231
 [49]  0.60883017  1.57914885          NA  0.67170119  0.67769134  0.46886597
 [55]  0.01949685          NA          NA -0.59312866  0.17993845 -0.07268996
 [61]          NA          NA          NA          NA          NA -1.19344507
 [67]          NA -0.36172476  0.91197623          NA          NA -1.13324675
 [73] -0.56856448          NA -1.70945322 -1.11652692          NA          NA
 [79]          NA  0.81674190 -1.08081702          NA          NA  0.67119044
 [85]          NA -0.90857825 -0.72647148          NA  1.03472122          NA
 [91]          NA          NA          NA          NA          NA          NA
 [97]          NA          NA  0.17035257  1.05240690

| All that hard work is paying off!
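
Rather than counting by eye, the check can also be done in code. This is a minimal sketch that rebuilds a vector the same way the lesson does; the seed is an arbitrary assumption added for reproducibility, so the counts will generally differ from the 51 seen in this session.

```r
set.seed(1)  # arbitrary seed, not part of the lesson
y <- rnorm(1000)
z <- rep(NA, 1000)
my_data <- sample(c(y, z), 100)

n_missing <- sum(is.na(my_data))   # number of NAs
n_present <- sum(!is.na(my_data))  # number of random draws

# Every element is either missing or present, so the counts add up.
n_missing + n_present == length(my_data)  # TRUE

# table() shows both counts at once.
table(is.na(my_data))
```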


  |
  |===============================================================       |  90%
| Now that we've got NAs down pat, let's look at a second type of missing value
| -- NaN, which stands for 'not a number'. To generate NaN, try dividing (using
| a forward slash) 0 by 0 now.

> 0/0
[1] NaN

| All that hard work is paying off!


  |
  |==================================================================    |  95%
| Let's do one more, just for fun. In R, Inf stands for infinity. What happens
| if you subtract Inf from Inf?

> Inf-Inf
[1] NaN

| That's correct!
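
As a brief follow-up sketch (these functions are not introduced in the lesson), base R provides `is.nan()` to tell NaN apart from NA. Note that `is.na()` returns TRUE for both.

```r
is.nan(0 / 0)  # TRUE
is.na(0 / 0)   # TRUE, is.na() also flags NaN
is.nan(NA)     # FALSE, NA is missing but is not NaN

1 / 0          # Inf
is.finite(c(1, Inf, NA, NaN))  # TRUE FALSE FALSE FALSE
```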


  |
  |======================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?

1: No
2: Yes