@naomispence
Created March 27, 2017 17:08
Recall from the swirl lesson called "Working with Variables" that if a variable has missing values, R returns
NA when you ask for certain statistics, such as the mean and sd. Therefore, the first two lines of code
in the chunk below add na.rm=TRUE, which tells R to ignore the missing values when computing the mean and sd.
```{r}
mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours, na.rm=TRUE)
#note the added code section for geom_vline in the histogram below. Add title and labels.
ggplot_tvhours <-ggplot(GSS, aes(tvhours))
ggplot_tvhours + geom_histogram(binwidth =1, aes(y=(..count../sum(..count..))*100)) +
ggtitle("") +
labs(y="Percent", x="") +
geom_vline(xintercept=mean(GSS$tvhours, na.rm=TRUE), color="blue", linetype="dashed", size=1)
```
Describe what the code segment that begins with geom_vline adds to your histogram. Why is the line where it is?
Are you surprised by the placement of the vertical line? Be specific.
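As a reminder of what na.rm=TRUE does, here is a minimal sketch on a made-up vector. The name hours_toy and its values are hypothetical and are not taken from the GSS data.

```{r}
# Toy data (hypothetical values, not from the GSS): one respondent has a missing answer
hours_toy <- c(2, 4, NA, 6)

mean_with_na <- mean(hours_toy)                 # NA, because one value is missing
mean_without_na <- mean(hours_toy, na.rm=TRUE)  # mean of 2, 4, and 6
sd_without_na <- sd(hours_toy, na.rm=TRUE)      # sd of 2, 4, and 6

mean_with_na
mean_without_na
sd_without_na
```

Without na.rm=TRUE the mean is NA. With it, R computes the statistics from the three observed values only.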
Now let's convert (AKA transform) the unit of measurement from hours to minutes and
create a new variable that measures the number of minutes spent watching tv per day.
```{r}
GSS$tvmins <-GSS$tvhours*60
mean(GSS$tvmins, na.rm=TRUE)
sd(GSS$tvmins, na.rm=TRUE)
```
Now let's see what happens if we convert the units to standard deviations.
Note that multiplying by 1/sd is the same as dividing by sd.
```{r}
GSS$tvsd <- GSS$tvhours*1/sd(GSS$tvhours, na.rm=TRUE)
mean(GSS$tvsd, na.rm=TRUE)
sd(GSS$tvsd, na.rm=TRUE)
```
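If the expression tvhours*1/sd(...) looks odd, the following sketch checks on a made-up vector that it gives the same result as plain division. The names x_toy, s_toy, multiplied, and divided are hypothetical and used only for illustration.

```{r}
# Toy data (hypothetical values, not from the GSS)
x_toy <- c(1, 2, 3, 4, 5)
s_toy <- sd(x_toy)

# R evaluates * and / from left to right, so this is (x_toy * 1) / s_toy
multiplied <- x_toy * 1 / s_toy
divided <- x_toy / s_toy

# Check that the two versions agree
all.equal(multiplied, divided)
```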
Describe how the means for tvmins and tvsd are related to the mean for tvhours.
What is unique about the standard deviation of tvsd?
Now let's make histograms of the new variables. Notice that we are changing the binwidth.
Think about why we are changing the binwidth to 60 for minutes and 1/sd(tvhours)
for standard deviations.
If you want, after you run them this way, you can change the binwidth back to 1 and see what happens.
```{r}
ggplot_tvmins <-ggplot(GSS, aes(tvmins))
ggplot_tvmins + geom_histogram(binwidth =60, aes(y=(..count../sum(..count..))*100)) +
ggtitle("")+
labs(y="Percent", x="") +
geom_vline(xintercept=mean(GSS$tvmins, na.rm=TRUE), color="green", linetype="dashed", size=1)
```
```{r}
ggplot_tvsd <-ggplot(GSS, aes(tvsd))
ggplot_tvsd +
geom_histogram(binwidth =1/sd(GSS$tvhours, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) +
ggtitle("") +
labs(y="Percent", x="") +
geom_vline(xintercept=mean(GSS$tvsd, na.rm=TRUE), color="purple", linetype="dashed", size=1)
```
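Each histogram so far uses aes(y=(..count../sum(..count..))*100) so that the y-axis shows percent rather than count. The sketch below reproduces that arithmetic by hand on a made-up vector, assuming whole-number values so that each value gets its own bar. The names pct_toy, counts_toy, and percents_toy are hypothetical.

```{r}
# Toy data (hypothetical values, not from the GSS)
pct_toy <- c(0, 1, 1, 2, 2, 2, 2, 3)

# Number of observations at each value (the "count" for each bar)
counts_toy <- table(pct_toy)

# Divide each count by the total count and multiply by 100 to get percent
percents_toy <- as.numeric(counts_toy / sum(counts_toy) * 100)
percents_toy

# The percents across all bars add up to 100
sum(percents_toy)
```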
Now let's create some new variables where, instead of the actual values, we have each value's difference from the mean.
This is sometimes called "recentering." Then we will get the standard deviation and histogram for each new variable.
```{r}
GSS$tvhours0<-GSS$tvhours - mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours0, na.rm=TRUE)
ggplot_tvhours0 <-ggplot(GSS, aes(tvhours0))
ggplot_tvhours0 + geom_histogram(binwidth =1, aes(y=(..count../sum(..count..))*100)) +
ggtitle("") +
labs(y="Percent", x="") +
geom_vline(xintercept=mean(GSS$tvhours0, na.rm=TRUE), color="blue", linetype="dashed", size=.5)
```
```{r}
GSS$tvmins0<-GSS$tvmins - mean(GSS$tvmins, na.rm=TRUE)
sd(GSS$tvmins0, na.rm=TRUE)
ggplot_tvmins0 <-ggplot(GSS, aes(tvmins0))
ggplot_tvmins0 +
geom_histogram(binwidth =60, aes(y=(..count../sum(..count..))*100)) +
ggtitle("") + labs(y="Percent", x="") +
geom_vline(xintercept=mean(GSS$tvmins0, na.rm=TRUE), color="green", linetype="dashed", size=.5)
```
```{r}
GSS$tvsd0<-GSS$tvsd - mean(GSS$tvsd, na.rm=TRUE)
sd(GSS$tvsd0, na.rm=TRUE)
ggplot_tvsd0 <-ggplot(GSS, aes(tvsd0))
ggplot_tvsd0 +
geom_histogram(binwidth =1/sd(GSS$tvhours0, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) +
ggtitle("") +
labs(y="Percent", x="") +
geom_vline(xintercept=mean(GSS$tvsd0, na.rm=TRUE), color="purple", linetype="dashed", size=.5)
```
Why is subtracting the mean from each value sometimes called "recentering"?
What does a negative value on these variables mean?
Describe how the standard deviations for these recentered variables compare to the standard deviations
for the previous (comparable) variables.
When we convert the value of an observation into units of "standard deviations above the mean"
or "standard deviations below the mean" those new scores are called Z-SCORES.
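Here is a minimal sketch of computing z-scores on a made-up vector: subtract the mean, then divide by the standard deviation. The names z_input, z_toy, and z_from_scale are hypothetical. Base R's scale() function does the same two steps by default, so it is used here only as a cross-check.

```{r}
# Toy data (hypothetical values, not from the GSS)
z_input <- c(2, 4, 6)

# Recenter (subtract the mean), then convert to standard deviation units
z_toy <- (z_input - mean(z_input)) / sd(z_input)
z_toy

# scale() centers and divides by the sd by default; as.numeric() drops the matrix wrapper
z_from_scale <- as.numeric(scale(z_input))
all.equal(z_toy, z_from_scale)
```

A z-score of 0 means the observation is exactly at the mean; here the value 2 is one standard deviation below the mean and the value 6 is one standard deviation above it.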