@sibyvt
Created May 13, 2016 08:18
Selection: 5
|
| | 0%
| Missing values play an important role in statistics and data analysis. Often,
| missing values must not be ignored, but rather they should be carefully
| studied to see if there's an underlying pattern or cause for their
| missingness.
...
|
|==== | 5%
| In R, NA is used to represent any value that is 'not available' or 'missing'
| (in the statistical sense). In this lesson, we'll explore missing values
| further.
...
|
|======= | 10%
| Any operation involving NA generally yields NA as the result. To illustrate,
| let's create a vector c(44, NA, 5, NA) and assign it to a variable x.
> x <- c(44, NA, 5, NA)
| Keep working like that and you'll get there!
|
|=========== | 15%
| Now, let's multiply x by 3.
> y <-x*3
| Not quite! Try again. Or, type info() for more options.
| Try x * 3.
> x*3
[1] 132 NA 15 NA
| Perseverance, that's the answer.
|
|============== | 20%
| Notice that the elements of the resulting vector that correspond with the NA
| values in x are also NA.
...
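As an aside, here is a minimal sketch of how NA propagates through arithmetic and summary functions. The names v, v_plus_one, total_with_na, and total_no_na are illustrative and not part of the lesson; na.rm is a standard argument of sum().

```r
# Illustrative vector (not from the lesson) with two missing entries
v <- c(44, NA, 5, NA)

# Arithmetic is element-wise, so only the NA positions stay NA
v_plus_one <- v + 1                   # 45 NA 6 NA

# A summary over the whole vector becomes NA if any element is NA
total_with_na <- sum(v)               # NA

# na.rm = TRUE drops the missing entries before summing
total_no_na <- sum(v, na.rm = TRUE)   # 49
```

Whether dropping NAs is appropriate depends on why the values are missing, which is the point the lesson opens with.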
|
|================== | 25%
| To make things a little more interesting, let's create a vector containing
| 1000 draws from a standard normal distribution with y <- rnorm(1000).
> y <- rnorm(1000)
| You are really on a roll!
|
|===================== | 30%
| Next, let's create a vector containing 1000 NAs with z <- rep(NA, 1000).
> z <- rep(NA, 1000)
| You nailed it! Good job!
|
|========================= | 35%
| Finally, let's select 100 elements at random from these 2000 values
| (combining y and z) such that we don't know how many NAs we'll wind up with
| or what positions they'll occupy in our final vector -- my_data <-
| sample(c(y, z), 100).
> my_data <-
+ sample(c(y, z), 100)
| That's the answer I was looking for.
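Because sample() and rnorm() are random, the number and positions of NAs differ from run to run. The sketch below assumes you want a repeatable draw and fixes the seed first. The seed value and the *_demo names are arbitrary choices, not part of the lesson, and the resulting NA count will not necessarily match the transcript.

```r
# Fix the random seed so the same draw can be repeated (123 is arbitrary)
set.seed(123)

y_demo <- rnorm(1000)      # 1000 draws from a standard normal distribution
z_demo <- rep(NA, 1000)    # 1000 missing values

# Draw 100 of the 2000 combined values, without replacement (the default)
demo_data <- sample(c(y_demo, z_demo), 100)

n_demo <- length(demo_data)  # 100
```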
|
|============================ | 40%
| Let's first ask the question of where our NAs are located in our data. The
| is.na() function tells us whether each element of a vector is NA. Call
| is.na() on my_data and assign the result to my_na.
>
> is.na()
Error in is.na() : 0 arguments passed to 'is.na' which requires 1
> my_na <- is.na()
Error in is.na() : 0 arguments passed to 'is.na' which requires 1
> my_na <- is.na(my_data)
| You are really on a roll!
|
|================================ | 45%
| Now, print my_na to see what you came up with.
> my_na
[1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
[25] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
[37] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
[49] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[61] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
[73] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
[85] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[97] TRUE TRUE FALSE FALSE
| Nice work!
|
|=================================== | 50%
| Everywhere you see a TRUE, you know the corresponding element of my_data is
| NA. Likewise, everywhere you see a FALSE, you know the corresponding element
| of my_data is one of our random draws from the standard normal distribution.
...
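A logical vector like my_na can also be used to locate or drop the missing entries. The following is a small sketch on a made-up vector; v, na_flags, na_positions, and observed are illustrative names.

```r
# Made-up vector with missing values in positions 1 and 3
v <- c(NA, 1.5, NA, -0.3)

na_flags <- is.na(v)              # TRUE FALSE TRUE FALSE

# which() converts the TRUE flags into their integer positions
na_positions <- which(na_flags)   # 1 3

# Negating the flags keeps only the observed (non-missing) values
observed <- v[!na_flags]          # 1.5 -0.3
```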
|
|====================================== | 55%
| In our previous discussion of logical operators, we introduced the `==`
| operator as a method of testing for equality between two objects. So, you
| might think the expression my_data == NA yields the same results as is.na().
| Give it a try.
>
> my_data == NA
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
| You're the best!
|
|========================================== | 60%
| The reason you got a vector of all NAs is that NA is not really a value, but
| just a placeholder for a quantity that is not available. Therefore the
| logical expression is incomplete and R has no choice but to return a vector
| of the same length as my_data that contains all NAs.
...
|
|============================================== | 65%
| Don't worry if that's a little confusing. The key takeaway is to be cautious
| when using logical expressions anytime NAs might creep in, since a single NA
| value can derail the entire thing.
...
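To make the caution concrete, here is a short sketch of how a single NA affects comparisons, along with two common ways to keep it from derailing a count. All variable names are illustrative and not from the lesson.

```r
v <- c(2, NA, 7)

# Comparing against NA is never TRUE or FALSE, only NA
eq_na <- v == NA                            # NA NA NA

# An ordinary comparison is NA wherever the input is NA
gt_three <- v > 3                           # FALSE NA TRUE

# Counting TRUEs needs na.rm = TRUE, otherwise the sum is NA
n_gt_three <- sum(gt_three, na.rm = TRUE)   # 1

# which() skips NA and returns only the positions that are TRUE
idx <- which(v > 3)                         # 3
```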
|
|================================================= | 70%
| So, back to the task at hand. Now that we have a vector, my_na, that has a
| TRUE for every NA and FALSE for every numeric value, we can compute the total
| number of NAs in our data.
...
|
|==================================================== | 75%
| The trick is to recognize that underneath the surface, R represents TRUE as
| the number 1 and FALSE as the number 0. Therefore, if we take the sum of a
| bunch of TRUEs and FALSEs, we get the total number of TRUEs.
...
|
|======================================================== | 80%
| Let's give that a try here. Call the sum() function on my_na to count the
| total number of TRUEs in my_na, and thus the total number of NAs in my_data.
| Don't assign the result to a new variable.
> sum(my_na)
[1] 51
| That's a job well done!
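The same TRUE-as-1 coercion means mean() gives the proportion of missing values. This is a minimal sketch on a made-up vector; flags, n_missing, and prop_missing are illustrative names.

```r
# Two of the four made-up values are missing
flags <- is.na(c(NA, 0.2, NA, 1.1))   # TRUE FALSE TRUE FALSE

n_missing <- sum(flags)       # 2: count of TRUEs
prop_missing <- mean(flags)   # 0.5: share of TRUEs
```

Applied to the lesson's data, mean(my_na) would be sum(my_na) divided by 100.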
|
|============================================================ | 85%
| Pretty cool, huh? Finally, let's take a look at the data to convince
| ourselves that everything 'adds up'. Print my_data to the console.
>
> my_data
[1] NA NA 1.09462984 0.59778656 -0.90944656 NA
[7] -0.35787991 -0.03112132 0.26389767 NA NA NA
[13] NA NA NA NA 0.21294024 NA
[19] -1.48071872 NA 1.91079078 1.54674727 NA NA
[25] -0.59913377 NA NA -0.98306919 NA NA
[31] NA -0.27882350 -1.36030614 NA NA 2.64448799
[37] -2.12506858 0.25815328 -0.28095853 0.40540060 -0.39859703 NA
[43] 0.22767670 1.17414183 NA 0.20077198 -0.25732459 1.47075231
[49] 0.60883017 1.57914885 NA 0.67170119 0.67769134 0.46886597
[55] 0.01949685 NA NA -0.59312866 0.17993845 -0.07268996
[61] NA NA NA NA NA -1.19344507
[67] NA -0.36172476 0.91197623 NA NA -1.13324675
[73] -0.56856448 NA -1.70945322 -1.11652692 NA NA
[79] NA 0.81674190 -1.08081702 NA NA 0.67119044
[85] NA -0.90857825 -0.72647148 NA 1.03472122 NA
[91] NA NA NA NA NA NA
[97] NA NA 0.17035257 1.05240690
| All that hard work is paying off!
|
|=============================================================== | 90%
| Now that we've got NAs down pat, let's look at a second type of missing value
| -- NaN, which stands for 'not a number'. To generate NaN, try dividing (using
| a forward slash) 0 by 0 now.
> 0/0
[1] NaN
| All that hard work is paying off!
|
|================================================================== | 95%
| Let's do one more, just for fun. In R, Inf stands for infinity. What happens
| if you subtract Inf from Inf?
> Inf-Inf
[1] NaN
| That's correct!
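NA and NaN behave differently under R's test functions, which matters when you count missing values. The sketch below uses a made-up vector; vals and the result names are illustrative.

```r
vals <- c(1, NA, NaN, Inf)

# is.na() is TRUE for both NA and NaN
na_or_nan <- is.na(vals)     # FALSE TRUE TRUE FALSE

# is.nan() is TRUE only for NaN
only_nan <- is.nan(vals)     # FALSE FALSE TRUE FALSE

# is.finite() is FALSE for NA, NaN, and Inf alike
finite <- is.finite(vals)    # TRUE FALSE FALSE FALSE
```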
|
|======================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?
1: No
2: Yes
dthirunavukkarasu commented Sep 5, 2021

How to proceed further
