vanatteveldt/lab2.Rmd

## lab2.Rmd
---
title: 'Lab 2: exploring the US elections (template)'
author: "(Your name)"
output:
  pdf_document:
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
# this chunk is not shown (include=FALSE), but can be used to setup some sane defaults
# note that naming chunks (in this case: setup) is optional, but if you do use names they need to be unique.

# To submit, use the 'Knit' button above to create a PDF document
# Note that setting 'Chunk output in console' in the options (gear wheel icon) is probably a good idea.

# disable warnings and message output:
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# prettier tables:
library(knitr)
```


*Note: The chunks below are not complete. You will need to fill in the blanks and knit to an output document to finish the lab!*

# Exploring the 2020 election

In this lab we will explore the 2020 US presidential election.
The goal is to see whether Biden was more successful in larger states.

## Getting data

First, let's download the data from [MIT election lab](https://electionlab.mit.edu/data)'s [Dataverse page](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX):

```{r data}
# install.packages("dataverse")   # if needed
library(dataverse)
elections = get_dataframe_by_name(
  filename = "1976-2020-president.tab",
  dataset = "doi:10.7910/DVN/42MVDX",
  server = "dataverse.harvard.edu")
elections
```

Use `select` to select only the `year`, `state`, `state_po`, `party_simplified`, `candidatevotes`, `totalvotes` columns.
Rename `party_simplified` to `party`.

```{r clean}
# Insert your code here
elections = ...
```
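As a hint, the selecting and renaming step can be sketched on a small made-up table. The data frame `toy` and its values are hypothetical stand-ins for the real download; only the column names are taken from the text above.

```r
library(dplyr)

# Made-up rows with (some of) the columns of the downloaded table
toy = data.frame(
  year = c(2020, 2020),
  state = c("ALABAMA", "ALABAMA"),
  state_po = c("AL", "AL"),
  office = c("US PRESIDENT", "US PRESIDENT"),
  party_simplified = c("DEMOCRAT", "REPUBLICAN"),
  candidatevotes = c(100, 200),
  totalvotes = c(300, 300)
)

# select() keeps only the listed columns;
# writing new = old renames a column while selecting it
toy_clean = select(toy, year, state, state_po,
                   party = party_simplified,
                   candidatevotes, totalvotes)
names(toy_clean)
```

The same call should work on the real `elections` table, since it only refers to columns by name.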

Now, filter the data to keep only the year 2020 and the Democratic party:

```{r filter}
# Insert your code here
```
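A minimal sketch of such a filter on made-up data. It assumes the party column holds the value `"DEMOCRAT"`; check the actual values in your data (for example with `table(elections$party)`) before relying on that.

```r
library(dplyr)

# Made-up rows in the shape of the cleaned table
toy = data.frame(
  year = c(2016, 2020, 2020),
  state_po = c("AA", "AA", "AA"),
  party = c("DEMOCRAT", "DEMOCRAT", "REPUBLICAN"),
  candidatevotes = c(90, 100, 200)
)

# Multiple conditions in filter() are combined with AND
toy_2020_dem = filter(toy, year == 2020, party == "DEMOCRAT")
toy_2020_dem
```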


To check, display (the top rows of) the data set:

```{r check}
head(elections)
```

## Computing values

Now, use `mutate` to compute the percentage of total votes that Biden (the Democratic candidate) received:

```{r compute-totals}
# Insert your code here
```
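The computation itself is a single `mutate` call. The sketch below uses made-up numbers, and the column name `perc` is just an illustrative choice.

```r
library(dplyr)

# Made-up vote counts for two fictional states
toy = data.frame(
  state_po = c("AA", "BB"),
  candidatevotes = c(40, 150),
  totalvotes = c(100, 200)
)

# mutate() adds a new column computed from existing columns
toy = mutate(toy, perc = 100 * candidatevotes / totalvotes)
toy$perc  # 40 75
```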

## Exploring total results

Using `arrange` and `head` and/or `tail`, show the states where Biden had the largest and smallest vote share (percentage):

```{r exploration}
# Insert your code here
```
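One possible way to combine these functions, sketched on made-up vote shares (the column `perc` and the state codes are hypothetical):

```r
library(dplyr)

# Made-up vote shares for four fictional states
toy = data.frame(
  state_po = c("AA", "BB", "CC", "DD"),
  perc = c(40, 75, 55, 30)
)

# Sort from highest to lowest share
toy_sorted = arrange(toy, desc(perc))

top = head(toy_sorted, 2)     # the two highest shares: BB, CC
bottom = tail(toy_sorted, 2)  # the two lowest shares: AA, DD
top
bottom
```

Using `arrange(toy, perc)` instead would sort in ascending order, so that `head` shows the lowest shares.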

[your interpretation]

## Statistical analysis

Now, using `cor.test`, see if there is a correlation between the total number of votes and Biden's vote share. Interpret the result briefly (1 or 2 sentences).

```{r correlation}
# Insert your code here
```

[your interpretation]
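
For reference, this is roughly how `cor.test` is called and how its result can be inspected. The numbers are made up and the column names are hypothetical; by default the function computes Pearson's correlation.

```r
# Made-up data: five fictional states
toy = data.frame(
  totalvotes = c(100, 200, 300, 400, 500),
  perc = c(42, 45, 51, 50, 58)
)

# cor.test() takes two numeric vectors of equal length
result = cor.test(toy$totalvotes, toy$perc)

result$estimate  # the correlation coefficient
result$p.value   # the p-value of the test that the correlation is zero
```

Printing `result` directly gives the full test summary, including the confidence interval.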


## lab3.Rmd
---
title: 'Lab 3: exploring the US elections (template)'
author: "(Your name)"
output:
  word_document:
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
# this chunk is not shown (include=FALSE), but can be used to setup some sane defaults
# note that naming chunks (in this case: setup) is optional, but if you do use names they need to be unique.

# To submit, use the 'Knit' button above to create a Word or PDF document
# Note that setting 'Chunk output in console' in the options (gear wheel icon) is probably a good idea.

# disable warnings and message output:
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# prettier tables:
library(knitr)
```


*Note: The chunks below are not complete. You will need to fill in the blanks and knit to an output document to finish the lab!*

# Exploring the elections

In this lab we will explore the US presidential elections over time.
The goal is to compute the popular vote for each election since 1976.

## Getting data

First, let's download the same data as for lab 2 from [MIT election lab](https://electionlab.mit.edu/data)'s [Dataverse page](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX):

```{r data}
library(dataverse)
elections = get_dataframe_by_name(
  filename = "1976-2020-president.tab",
  dataset = "doi:10.7910/DVN/42MVDX",
  server = "dataverse.harvard.edu")
elections
```

Compute the total votes per party per year by aggregating over the different states:

```{r clean}
# Insert your code here
```
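Aggregating over states is a `group_by` plus `summarise`. A minimal sketch on made-up rows (the values are hypothetical, the column names follow the downloaded data):

```r
library(dplyr)

# Made-up per-state results for one election year
toy = data.frame(
  year = c(2016, 2016, 2016, 2016),
  state_po = c("AA", "AA", "BB", "BB"),
  party_simplified = c("DEMOCRAT", "REPUBLICAN", "DEMOCRAT", "REPUBLICAN"),
  candidatevotes = c(10, 20, 30, 15)
)

# One row per year and party, summing the votes over all states
totals = toy %>%
  group_by(year, party_simplified) %>%
  summarise(votes = sum(candidatevotes), .groups = "drop")
totals
```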

Compute the share of the popular vote in each year.
You can either compute the share of the total number of votes,
or restrict to the two major parties and compare only with the total votes these parties received.

```{r filter}
# Insert your code here
```
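A sketch of the second option (share among the parties present in the table), using made-up totals. Grouping by year makes `sum(votes)` the yearly total rather than the overall total.

```r
library(dplyr)

# Made-up per-year totals for the two major parties
totals = data.frame(
  year = c(2016, 2016, 2020, 2020),
  party = c("DEMOCRAT", "REPUBLICAN", "DEMOCRAT", "REPUBLICAN"),
  votes = c(40, 60, 30, 20)
)

# Within each year, divide each party's votes by that year's total
shares = totals %>%
  group_by(year) %>%
  mutate(share = votes / sum(votes)) %>%
  ungroup()
shares$share  # 0.4 0.6 0.6 0.4
```

For the first option you would divide by the total over all parties instead, so third-party rows need to be kept in the table.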

Visualize the popular vote share over time:

```{r check}
# Insert your code here
```
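One possible visualization is a line plot with one line per party. The sketch below assumes the ggplot2 package and uses made-up shares, not real election results.

```r
library(ggplot2)

# Made-up vote shares (NOT real results)
shares = data.frame(
  year = c(2012, 2012, 2016, 2016, 2020, 2020),
  party = rep(c("DEMOCRAT", "REPUBLICAN"), 3),
  share = c(0.52, 0.48, 0.51, 0.49, 0.53, 0.47)
)

# One line per party, with points marking the election years
p = ggplot(shares, aes(x = year, y = share, colour = party)) +
  geom_line() +
  geom_point()
p
```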

# Bonus challenge 1: add the winners per election

A list of presidents is available from many places, for example here: <https://raw.githubusercontent.com/awhstin/Dataset-List/master/presidents.csv>

Add the winners / serving presidents to the visualization, for example by adding a label or background.

# Bonus challenge 2: add the electoral college

Add a line for the electoral college share. Note that the electoral votes per state change over time with each new census.

+ Download the census results from https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment.csv
+ Compute the electoral college as number of representatives plus the number of senate seats (2 for each state).
+ Note: DC has the same number of electors as the least populous state
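+ Compute the census year for each election year by dividing by 10, rounding down with `floor`, and multiplying by 10. Note that elections held in a census year itself (1980, 2000, 2020) still use the apportionment from the previous census.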
+ Join the data to the per-state election outcome data
+ Compute number of electoral college votes per year per candidate, assuming all states assign all electors to the state-wide winner*.
+ Visualize electoral vs popular vote for one of the two major parties.

* Note that this is not 100% true: Maine and Nebraska can split their electors by congressional district. Feel free to split their votes (in 2008, 2016 and 2020) by hand, but to compute this from the election data you would need the data per congressional district, which is not included here.
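
The rounding step from the list above can be sketched in base R. The example years are arbitrary picks from the data range.

```r
# A few election years from the data range
election_years = c(1976, 1984, 1992, 2004, 2016)

# Divide by 10, round down, multiply by 10
census_years = floor(election_years / 10) * 10
census_years  # 1970 1980 1990 2000 2010
```

Inside a `mutate` call this becomes something like `mutate(census_year = floor(year / 10) * 10)`, where `census_year` is a hypothetical column name.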


## lab4.Rmd
---
title: "Lab 4: Changing times"
author: "Your name(s)"
output: github_document
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
# Notice include=FALSE: this means this chunk is not included in the output
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# The line above sets the chunk defaults to echo=TRUE (include code in output)
# and warning/message=FALSE (hide warnings and messages).
# You can change that per chunk by adding echo=FALSE etc. to the {r, ...} part on top
# printr makes table output a bit nicer:
library(printr)
```
The goal of this week's lab is to predict/explain the *change* in votes in the US
presidential elections.
For this, we get the election outcome data (at the county level) for 2016 and 2020,
combine that with an interesting demographic at the county level,
and use correlation (or regression) to test whether that demographic explains the change in
(Democratic or Republican) vote share.

# Obtaining the needed data

## Election data

First, we get the election results from Harvard Dataverse
(https://dataverse.harvard.edu/file.xhtml?fileId=4788675&version=9.0).
To make this easy, we use the dataverse package:

```{r, cache=TRUE}
# Notice cache=TRUE: this means the data is only downloaded once
library(dataverse)
# You might have to install.packages("dataverse"), but don't include that in the Rmd
d = get_dataframe_by_name("countypres_2000-2020.tab",
                          dataset = "10.7910/DVN/VOQCHQ",
                          server = "dataverse.harvard.edu")
head(d)
```
Note that this data needs to be cleaned to keep only the main candidate of either
party, only keep 2016 and 2020, drop the different modes except for mode="TOTAL",
and remove unneeded columns.
```{r}
# Data cleaning code...
```
To compute shifts in vote share, we first compute the vote share for one of the
parties, then *pivot* the years into columns so you get the vote share of that
party in 2016 and 2020 as two columns, and then compute the change:
```{r}
# Code to compute shift in vote share
```
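The pivoting step can be sketched with `pivot_wider` from the tidyr package. The shares below are made up, and the column names `share`, `share_2016`, `share_2020` and `change` are illustrative choices rather than names from the data.

```r
library(dplyr)
library(tidyr)

# Made-up vote shares for one party in two fictional counties
toy = data.frame(
  county_fips = c("01001", "01001", "01003", "01003"),
  year = c(2016, 2020, 2016, 2020),
  share = c(0.70, 0.72, 0.60, 0.55)
)

# Turn the years into columns (share_2016, share_2020), then take the difference
wide = toy %>%
  pivot_wider(names_from = year, values_from = share, names_prefix = "share_") %>%
  mutate(change = share_2020 - share_2016)
wide
```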
Which counties have the largest shift towards or away from Trump?
```{r}
# Code to show highest and/or lowest change
```
## County-level demographics

You can download the county-level facts from 2016:

```{r}
library(readr)  # provides read_csv
url = "https://raw.githubusercontent.com/houstondatavis/data-jam-august-2016/master/csv/county_facts.csv"
facts = read_csv(url)
```
Alternatively, you can download per-county COVID cases,
but that requires a bit of extra work to combine with county population
and compute the relative deaths/cases:

```{r}
library(readr)  # provides read_csv
url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
covid = read_csv(url)
# Note that this data set contains cumulative cases/deaths per county per date.
# You should filter this to take the latest figure before election day.
# For this, you can
# - filter to exclude dates after election day,
# - arrange by date (descending) so the latest figure is on top,
# - group_by(fips) and filter(row_number()==1) to select the latest row per county
# You also need to combine it with the county population to get relative cases
```
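The filtering steps described in the comments can be sketched on a small made-up table with the same `date`, `fips` and `cases` columns. Grouping is done per county (`fips`), so that one row per county remains.

```r
library(dplyr)

# Made-up cumulative case counts for two fictional counties
covid_toy = data.frame(
  date = as.Date(c("2020-11-01", "2020-11-02", "2020-11-05",
                   "2020-11-01", "2020-11-02")),
  fips = c("01001", "01001", "01001", "01003", "01003"),
  cases = c(10, 12, 20, 5, 7)
)

election_day = as.Date("2020-11-03")

latest = covid_toy %>%
  filter(date < election_day) %>%   # drop figures from election day onwards
  arrange(desc(date)) %>%           # latest date on top
  group_by(fips) %>%
  filter(row_number() == 1) %>%     # keep the first (= latest) row per county
  ungroup() %>%
  arrange(fips)
latest
```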
(Of course, you can also find other relevant data on the web,
as long as it's county-level data and relevant for understanding political choices.)

## Combining the data

```{r}
# ...
```
# Analysis

```{r}
# Code to do correlation/regression analysis of chosen variable
# For bonus points, add a visualization (scatter plot, map plot) of the relevant variables
```
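As a starting point, a simple linear regression with `lm` could look like the sketch below. The data are made up, and `pct_college` is a hypothetical demographic variable standing in for whichever county-level variable you choose.

```r
# Made-up combined data: change in vote share and one demographic per county
combined = data.frame(
  change = c(0.05, 0.03, 0.00, -0.02, -0.04),
  pct_college = c(10, 20, 30, 40, 50)
)

# Regress the change in vote share on the chosen demographic
model = lm(change ~ pct_college, data = combined)

coef(model)     # intercept and slope
summary(model)  # standard errors, p-values and R-squared
```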
## Interpretation
From these results, we can see .... This means ...