naomispence

## CI examples
#add these libraries
library(lsr)
library(dplyr)

#the code below limits the dataset to the 2014 survey data and creates a new dataset named GSS2014
GSS2014<-dplyr::filter(GSS, year==2014)

#get different confidence intervals for the same variable
ciMean(GSS2014$tvhours, na.rm=TRUE, conf =0.90)
ciMean(GSS2014$tvhours, na.rm=TRUE, conf =0.95)

## GSS regression 1
GSS2014$tvhours[GSS2014$tvhours >24]<-NA
#Look at the mean of an interval variable
mean(GSS2014$tvhours, na.rm=TRUE)
# This is a generalized linear model of an interval variable with just an intercept, no independent variable
results<-glm(tvhours~1, data=GSS2014)
summary(results)
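
With no independent variable, the intercept reported by `summary(results)` should match the mean computed just above it. A minimal sketch of that check, using a made-up data frame `toy` (hypothetical, not the GSS data):

```r
# Made-up daily TV hours for eight respondents (hypothetical data, not GSS)
toy <- data.frame(hours = c(0, 1, 2, 2, 3, 4, 5, NA))

# glm() drops rows with missing values by default (na.action = na.omit)
fit <- glm(hours ~ 1, data = toy)

# The intercept and the mean of the non-missing values agree
intercept <- unname(coef(fit)["(Intercept)"])
sample_mean <- mean(toy$hours, na.rm = TRUE)
print(c(intercept = intercept, mean = sample_mean))
```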

#create a dichotomous variable coded 0, 1 (or TRUE, FALSE)
GSS2014$nochild <-as.numeric(GSS2014$childs) <= 0
#Look at the proportion of 1 (TRUE) values of the dichotomous variable coded 0, 1
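
One way to get that proportion is to take the mean of the TRUE/FALSE variable, because R counts TRUE as 1 and FALSE as 0. The sketch below uses a made-up vector `childs` (hypothetical, not the GSS data):

```r
# Made-up number of children for six respondents (hypothetical data, not GSS)
childs <- c(0, 2, 0, 1, NA, 3)

# TRUE when the respondent has no children; a missing value stays NA
nochild <- childs <= 0

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUE values
prop_nochild <- mean(nochild, na.rm = TRUE)
print(prop_nochild)
```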

## ordinal bar graph
#create new temporary dataset that filters out the missing (NA) values so that they don't appear in the graph
polviewsnew<-dplyr::filter(GSS2014, !is.na(polviews))
#create a bar graph using the new temporary dataset
#after_stat() needs ggplot2 3.3.0 or later; scales::percent labels the proportions as percents
ggplot(polviewsnew, aes(x=polviews)) +
  geom_bar(stat="count", color="red", fill="white", aes(y=after_stat(count/sum(count)))) +
  ggtitle("Political Views") +
  labs(y="Percent", x="Political Views") +
  scale_y_continuous(labels=scales::percent)
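
The y aesthetic in the bar graph divides each category's count by the total count. The same arithmetic can be checked in a table before graphing. This is a small sketch with a made-up factor `polviews` (hypothetical, not the GSS variable):

```r
# Made-up political views for ten respondents (hypothetical data, not GSS)
polviews <- factor(
  c("Liberal", "Moderate", "Moderate", "Conservative", "Liberal",
    "Moderate", NA, "Conservative", "Moderate", "Liberal"),
  levels = c("Liberal", "Moderate", "Conservative")
)

# table() leaves out NA by default, like the filtered dataset
counts <- table(polviews)

# Each count divided by the total count gives the share shown by each bar
shares <- prop.table(counts)
print(round(shares, 2))
```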

## Getting measures of central tendency in R, part 1
**CHUNK 1 STARTS BELOW THIS LINE.**
```{r}
#YOU WILL ALWAYS NEED THIS FIRST CHUNK. WE WILL ADD TO IT DURING THE SEMESTER.
#THIS CHUNK LOADS THE LIBRARIES AND DATA THAT YOU NEED FOR YOUR WORK.
library(aws.s3)
library('lehmansociology')

s3load('gss.Rda', bucket = 'lehmansociologydata')
```

## graphing, part 1
TO DO A GRAPH IN RSTUDIO, YOU FIRST NEED TO ADD A NEW LIBRARY USING THE LINE OF CODE BELOW.
library('ggplot2')
(YOU SHOULD PASTE THIS LINE OF CODE INTO CHUNK 1 ON ITS OWN LINE NEAR THE OTHER LINES OF CODE THAT START WITH library)
(DON'T FORGET TO RUN CHUNK 1 AFTER PASTING SO THAT IT LOADS THE NEW LIBRARY FOR YOU.)


AFTER YOU'VE ADDED THE LIBRARY ggplot2, YOU CAN BEGIN TO USE THE COMMAND ggplot TO CREATE GRAPHS.
ALWAYS CONSIDER WHICH GRAPH IS APPROPRIATE FOR YOUR VARIABLE BASED ON LEVEL OF MEASUREMENT.
ALSO CONSIDER WHICH GRAPH WILL DISPLAY THE INFORMATION CLEARLY GIVEN THE VARIABLE'S VALUES.
BE SURE TO ALWAYS HAVE AXIS LABELS AND TITLES ON YOUR GRAPH THAT ARE CLEAR, ACCURATE, AND DESCRIBE THE GRAPH.
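
As an illustration of these points, here is a minimal sketch of a histogram with a title and axis labels. The data frame `toy` and the label wording are made up for the example (hypothetical, not the GSS data), and a histogram is used on the assumption that the variable is interval-ratio.

```r
library(ggplot2)

# Made-up TV hours for ten respondents (hypothetical data, not GSS)
toy <- data.frame(tvhours = c(0, 1, 1, 2, 2, 2, 3, 4, 5, 8))

# A histogram suits an interval-ratio variable; binwidth = 1 gives one bar per hour
p <- ggplot(toy, aes(x = tvhours)) +
  geom_histogram(binwidth = 1, color = "red", fill = "white") +
  ggtitle("Hours of TV Watched per Day") +
  labs(x = "Hours per day", y = "Number of respondents")

print(p)
```

For a nominal or ordinal variable, a bar graph (geom_bar) would usually be the better choice.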

## Getting measures of variation in R

```{r}
summary(GSS$childs)
IQR(GSS$childs)
var(GSS$childs)
sd(GSS$childs)

summary(GSS$chldidel)
IQR(GSS$chldidel, na.rm=TRUE)
var(GSS$chldidel, na.rm=TRUE)
```
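
To see how these measures relate to each other, and what na.rm=TRUE changes, here is a small sketch with a made-up vector `ideal` (hypothetical, not the GSS variable chldidel):

```r
# Made-up ideal number of children (hypothetical data, not GSS)
ideal <- c(0, 1, 2, 2, 2, 3, 3, 4, NA)

# Without na.rm = TRUE, sd() returns NA when a value is missing
print(sd(ideal))

# With na.rm = TRUE, the missing value is ignored
v <- var(ideal, na.rm = TRUE)
s <- sd(ideal, na.rm = TRUE)
q <- quantile(ideal, c(0.25, 0.75), na.rm = TRUE)
iqr <- IQR(ideal, na.rm = TRUE)

# The standard deviation is the square root of the variance,
# and the IQR is the third quartile minus the first quartile
print(c(variance = v, sd = s, iqr = iqr))
```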

## distributions and z-scores
Recall from the swirl lesson called "Working with Variables" that if a variable has missing values, R returns
NA when you ask for statistics such as the mean and sd. The first two lines of code in the chunk below therefore
add na.rm=TRUE, which tells R to ignore the missing values when it computes the mean and sd.
```{r}
mean(GSS$tvhours, na.rm=TRUE)
sd(GSS$tvhours, na.rm=TRUE)

#the histogram below puts percent on the y-axis. Add geom_vline, a title, and labels.
ggplot_tvhours <-ggplot(GSS, aes(tvhours))
ggplot_tvhours + geom_histogram(binwidth =1, aes(y=after_stat(count/sum(count)*100)))
```
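
The mean and sd are the two ingredients of a z-score, which expresses each value as a distance from the mean in standard deviation units. A minimal sketch, using a made-up vector `tvhours` (hypothetical, not the GSS variable):

```r
# Made-up TV hours (hypothetical data, not GSS)
tvhours <- c(0, 1, 2, 2, 3, 4, 5, NA, 7)

m <- mean(tvhours, na.rm = TRUE)
s <- sd(tvhours, na.rm = TRUE)

# z-score: (value - mean) / standard deviation; a missing value stays NA
z <- (tvhours - m) / s
print(round(z, 2))
```

Values above the mean get positive z-scores and values below it get negative ones.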

## confidence intervals, part 1
For this lab, you need to add new libraries. I recommend you put them in Chunk 1 with your other library code.

library(lsr)
library(dplyr)

#get different confidence intervals for the same variable
mean(GSS$tvhours, na.rm=TRUE)
ciMean(GSS$tvhours, na.rm=TRUE, conf =0.90)
ciMean(GSS$tvhours, na.rm=TRUE, conf =0.95)
ciMean(GSS$tvhours, na.rm=TRUE, conf=0.99)
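
To see why the three intervals differ, it can help to compute one by hand. The sketch below assumes the interval is the usual t-based one (mean plus or minus a critical t value times the standard error). The helper `t_ci` and the vector `tvhours` are made up for illustration and are not part of lsr or the GSS data.

```r
# Made-up TV hours (hypothetical data, not GSS)
tvhours <- c(0, 1, 2, 2, 3, 4, 5, NA, 7)

# Hypothetical helper: t-based confidence interval for a mean
t_ci <- function(x, conf = 0.95) {
  x <- x[!is.na(x)]                             # drop missing values
  n <- length(x)
  se <- sd(x) / sqrt(n)                         # standard error of the mean
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)  # critical t value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

ci90 <- t_ci(tvhours, conf = 0.90)
ci95 <- t_ci(tvhours, conf = 0.95)
ci99 <- t_ci(tvhours, conf = 0.99)
print(rbind(ci90, ci95, ci99))
```

The higher the confidence level, the larger the critical t value, so the interval gets wider.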

## regression, part 1
```{r}
#Look at the mean of an interval variable
mean(GSS$childs, na.rm=TRUE)
# This is a generalized linear model of an interval variable with just an intercept, no independent variable
results<-glm(childs~1, data=GSS)
summary(results)
# This is a generalized linear model of an interval variable with an intercept and an independent variable
results<-glm(childs~age, data=GSS)
summary(results)
#The line below gives us a scatterplot with a "best fitting line" through it. SEE IF YOU CAN ADD TITLE AND LABELS.
```
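
One way to draw a scatterplot with a best-fitting line in ggplot2 is to combine geom_point with geom_smooth(method = "lm"). This is a sketch under assumptions, not necessarily the line used in the lab. The data frame `toy`, the title, and the labels are made up for the example.

```r
library(ggplot2)

# Made-up ages and numbers of children (hypothetical data, not GSS)
toy <- data.frame(
  age    = c(22, 28, 35, 41, 47, 53, 60, 68),
  childs = c(0, 0, 1, 2, 2, 3, 2, 4)
)

# geom_point() draws the scatterplot; geom_smooth(method = "lm") adds the
# straight line that a linear regression of childs on age would fit
p <- ggplot(toy, aes(x = age, y = childs)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  ggtitle("Number of Children by Age") +
  labs(x = "Age in years", y = "Number of children")

print(p)
```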

## regression, part 2
#Look at the mean of an interval variable
mean(GSS$childs, na.rm=TRUE)
# This is a generalized linear model of an interval variable with just an intercept, no independent variable
regchilds<-glm(childs~1, data=GSS)
summary(regchilds)

# This is a generalized linear model of an interval variable with an intercept and one interval-ratio independent variable
regchilds2<-glm(childs~age, data=GSS)
summary(regchilds2)
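
The intercept and the age coefficient in the summary can be combined into a predicted value: intercept plus slope times age. A minimal sketch with a made-up data frame `toy` (hypothetical, not the GSS data), which also checks the arithmetic against predict():

```r
# Made-up ages and numbers of children (hypothetical data, not GSS)
toy <- data.frame(
  age    = c(22, 28, 35, 41, 47, 53, 60, 68),
  childs = c(0, 0, 1, 2, 2, 3, 2, 4)
)

fit <- glm(childs ~ age, data = toy)
b <- coef(fit)

# Predicted number of children at age 40: intercept + slope * 40
by_hand <- unname(b["(Intercept)"] + b["age"] * 40)
by_predict <- unname(predict(fit, newdata = data.frame(age = 40)))
print(c(by_hand = by_hand, by_predict = by_predict))

# With the default gaussian family, glm() estimates the same coefficients as lm()
same_as_lm <- isTRUE(all.equal(coef(fit), coef(lm(childs ~ age, data = toy))))
print(same_as_lm)
```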