naomispence/distributions and z-scores

## distributions and z-scores
Recall from the swirl lesson "Working with Variables" that when a variable has missing values, R returns
NA for statistics such as the mean and sd. The first two lines of code in the chunk below therefore add
na.rm=TRUE, which tells R to ignore the missing values when computing the mean and sd.
```{r}
mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours, na.rm=TRUE)

# Note the added geom_vline layer in the histogram below. Add a title and axis labels.
ggplot_tvhours <-ggplot(GSS, aes(tvhours))
ggplot_tvhours + geom_histogram(binwidth =1, aes(y=(..count../sum(..count..))*100)) +
    ggtitle("") +
    labs(y="Percent",     x="") +
    geom_vline(xintercept=mean(GSS$tvhours, na.rm=TRUE),  color="blue", linetype="dashed", size=1)
```
Describe what the code segment that begins with geom_vline adds to your histogram. Why is the line where it is?
Are you surprised by the placement of the vertical line? Be specific.
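
To see the effect of na.rm on a small scale, here is a minimal sketch using a made-up vector. The name `toy_na` is hypothetical and is not part of the GSS data.

```{r}
# Hypothetical toy data (not from the GSS): three reported values and one missing value
toy_na <- c(2, 4, NA, 6)

mean(toy_na)                 # NA, because one value is missing
mean(toy_na, na.rm = TRUE)   # 4, the mean of 2, 4, and 6
sd(toy_na, na.rm = TRUE)     # 2, computed from the three non-missing values
sum(is.na(toy_na))           # 1, the number of values that na.rm=TRUE skips
```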

Now let's convert (that is, transform) the unit of measurement from hours to minutes and
create a new variable that measures the number of minutes spent watching TV per day.
```{r}
GSS$tvmins <-GSS$tvhours*60
mean(GSS$tvmins, na.rm=TRUE)
sd(GSS$tvmins, na.rm=TRUE)
```

Now let's see what happens if we convert the units to standard deviations.
Note that multiplying by 1/sd is the same as dividing by sd.
```{r}
GSS$tvsd <- GSS$tvhours*1/sd(GSS$tvhours, na.rm=TRUE)
mean(GSS$tvsd, na.rm=TRUE)
sd(GSS$tvsd, na.rm=TRUE)
```
Describe how the means of tvmins and tvsd are related to the mean of tvhours.

What is unique about the standard deviation of tvsd?
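
The earlier note that multiplying by 1/sd is the same as dividing by sd can be checked on a small made-up vector. This is only a sketch; `toy_scale` and the other names below are hypothetical and not part of the GSS data.

```{r}
# Hypothetical toy data (not from the GSS), including one missing value
toy_scale <- c(0, 1, 2, 3, 4, NA)
toy_scale_sd <- sd(toy_scale, na.rm = TRUE)

by_multiplying <- toy_scale * (1 / toy_scale_sd)  # multiply by 1/sd
by_dividing    <- toy_scale / toy_scale_sd        # divide by sd
as_written     <- toy_scale * 1 / toy_scale_sd    # R reads this left to right: (x * 1) / sd

# All three versions should agree (up to floating-point rounding)
all.equal(by_multiplying, by_dividing)
all.equal(as_written, by_dividing)
```

Because `*` and `/` have the same precedence and are evaluated left to right, the form used in the chunk above, `tvhours*1/sd(...)`, gives the same result as dividing by the sd directly.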


Now let's make histograms of the new variables. Notice that we are changing the binwidth.
Think about why we are changing the binwidth to 60 for minutes and to 1/sd(tvhours)
for standard deviations.
If you want, after you run them this way, you can change the binwidth back to 1 and see what happens.
```{r}
ggplot_tvmins <-ggplot(GSS, aes(tvmins))
ggplot_tvmins + geom_histogram(binwidth =60, aes(y=(..count../sum(..count..))*100)) +
    ggtitle("")+
    labs(y="Percent",     x="") +
    geom_vline(xintercept=mean(GSS$tvmins, na.rm=TRUE),  color="green", linetype="dashed", size=1)
```

```{r}
ggplot_tvsd <-ggplot(GSS, aes(tvsd))
ggplot_tvsd +
    geom_histogram(binwidth =1/sd(GSS$tvhours, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) +
    ggtitle("") +
    labs(y="Percent",     x="") +
    geom_vline(xintercept=mean(GSS$tvsd, na.rm=TRUE),  color="purple", linetype="dashed", size=1)
```
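
One way to approach the binwidth question is to express the same one-hour bin in each unit. The sketch below uses a made-up vector (hypothetical names, not GSS data) and assumes the goal is for each bar to cover the same observations in all three histograms.

```{r}
# Hypothetical toy data (not from the GSS), measured in hours
toy_bins <- c(0, 1, 2, 2, 3, 5, 8)
toy_bins_sd <- sd(toy_bins)

# Width of a one-hour bin expressed in each unit
width_hours <- 1
width_mins  <- 1 * 60
width_sd    <- 1 / toy_bins_sd

# Range of the data measured in bin widths; the three results should match
bins_needed <- c(
  hours = diff(range(toy_bins)) / width_hours,
  mins  = diff(range(toy_bins * 60)) / width_mins,
  sd    = diff(range(toy_bins / toy_bins_sd)) / width_sd
)
bins_needed
```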

Now let's create some new variables where, instead of the actual values, we have the difference between each value and the mean.
This is sometimes called "recentering." Then we will get the standard deviation and histogram for each new variable.
```{r}
GSS$tvhours0<-GSS$tvhours - mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours0, na.rm=TRUE)
ggplot_tvhours0 <-ggplot(GSS, aes(tvhours0))
ggplot_tvhours0 + geom_histogram(binwidth =1, aes(y=(..count../sum(..count..))*100)) +
    ggtitle("") +
    labs(y="Percent",     x="") +
    geom_vline(xintercept=mean(GSS$tvhours0, na.rm=TRUE),  color="blue", linetype="dashed", size=.5)
```

```{r}
GSS$tvmins0<-GSS$tvmins - mean(GSS$tvmins, na.rm=TRUE)
sd(GSS$tvmins0, na.rm=TRUE)
ggplot_tvmins0 <-ggplot(GSS, aes(tvmins0))
ggplot_tvmins0 +
    geom_histogram(binwidth =60, aes(y=(..count../sum(..count..))*100)) +
    ggtitle("") +
    labs(y="Percent",     x="") +
    geom_vline(xintercept=mean(GSS$tvmins0, na.rm=TRUE),  color="green", linetype="dashed", size=.5)
```

```{r}
GSS$tvsd0<-GSS$tvsd - mean(GSS$tvsd, na.rm=TRUE)
sd(GSS$tvsd0, na.rm=TRUE)
ggplot_tvsd0 <-ggplot(GSS, aes(tvsd0))
ggplot_tvsd0 +
    geom_histogram(binwidth =1/sd(GSS$tvhours0, na.rm=TRUE), aes(y=(..count../sum(..count..))*100)) +
    ggtitle("") +
    labs(y="Percent",     x="") +
    geom_vline(xintercept=mean(GSS$tvsd0, na.rm=TRUE),  color="purple", linetype="dashed", size=.5)
```
Why is subtracting the mean from each value sometimes called "recentering"?

What does a negative value on these variables mean?

Describe how the standard deviations for these recentered variables compare to the standard deviations
for the previous (comparable) variables.

When we convert the value of an observation into units of "standard deviations above the mean"
or "standard deviations below the mean", those new scores are called Z-SCORES.
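
As a minimal sketch of that definition, the chunk below computes z-scores for a made-up vector by subtracting the mean and dividing by the sd, then compares the result with base R's `scale()`. The names `toy_z_hours` and `toy_z` are hypothetical and not part of the GSS data. Algebraically, this is the same quantity as the tvsd0 variable built earlier, since dividing by the sd and then subtracting the mean of the result equals (value - mean)/sd.

```{r}
# Hypothetical toy data (not from the GSS), including one missing value
toy_z_hours <- c(0, 1, 2, 2, 3, 5, 8, NA)

toy_mean <- mean(toy_z_hours, na.rm = TRUE)
toy_sd   <- sd(toy_z_hours, na.rm = TRUE)

# z-score: how many standard deviations each value lies above (+) or below (-) the mean
toy_z <- (toy_z_hours - toy_mean) / toy_sd

mean(toy_z, na.rm = TRUE)  # approximately 0
sd(toy_z, na.rm = TRUE)    # approximately 1

# scale() centers and rescales in one step; as.numeric() drops its matrix attributes
all.equal(toy_z, as.numeric(scale(toy_z_hours)))
```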