Created
February 9, 2024 02:51
Intro to R for Colorado State University Straayer Center
## Title: Straayer Center Democracy Hackathon
## Instructor: Prof. Daniel Weitzel
## Email: daniel.weitzel@colostate.edu
## Date: 2024-01-29

## Hello and welcome to the Straayer Center Democracy Hackathon!
## In today's session you'll learn how to use R to generate insights from data.
## We'll visualize and transform data with a few lines of code!

## Notes for people new to R
## The # indicates that something is a comment; the text does not get evaluated by R.
## Text starting with a # is for humans. The computer treats it as something that does not exist.
## If you want to evaluate code you need to put the cursor somewhere in that line and then
## press Control+Enter (Windows and Linux) or Command+Enter (Mac).
## Try this below: put your cursor on the print() line and press either Control+Enter or Command+Enter
print("Hello world!")
# You just executed code. In your script you gave R the command to print "Hello world!" and it printed that
# text in the console of RStudio

# First things first
# HOW DOES THIS WORK?!?!
# R is statistical software that allows you to do data manipulation, statistical modeling,
# visualization of data and SO. MUCH. MORE.
# It basically is a giant calculator with some really fancy tricks!
# We can add things together
1 + 1
# We can subtract them
4 - 3
# We can multiply and divide
2 * 4
9 / 3
# We can do fancy math
3^3 # 3 to the power of 3 (3**3 also works, but ^ is the usual way to write it in R)
3 * pi # hold up, MATH WITH A WORD? Yep, R can do that!
# We can make R REMEMBER things with the assignment operator <- (= also works but is bad style outside of function arguments)
x # this gives an error: we have not told R what x is yet
x <- 5
x
y <- 4
y
z <- x + y
z
# Above we made R remember one number in an object
# But R can remember MANY numbers in an object
# you just need to tell it to combine the numbers
# Let's combine the height of five random people
height <- c(164, 189, 175, 153, 201)
height
# We can access specific heights with []
height[1]
height_maria <- height[5]
height_maria
# We can also generate insights from the data by presenting summary statistics
mean(height)
median(height)
sd(height)
summary(height)
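
# A small sketch that extends the height example above. The 170 cm cutoff is an
# arbitrary value picked for illustration, and height_cm is just a copy of the
# heights so that this snippet runs on its own.
# R applies arithmetic and comparisons to every element of a vector at once.
height_cm <- c(164, 189, 175, 153, 201)
height_m <- height_cm / 100   # divides each of the five heights by 100
height_m
is_tall <- height_cm > 170    # one TRUE or FALSE per person
is_tall
height_cm[is_tall]            # keeps only the heights where is_tall is TRUE
sum(is_tall)                  # TRUE counts as 1, so this counts the tall people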

## Libraries
## R is very powerful and capable already. However, users around the world
## have written extra packages to make R even more powerful. We will install
## and then load two of these packages.
## You always need to install a package once and then load it. Think of packages like apps on your phone.
## You don't download and install TikTok or GrubHub every time you use it. You download and install them once.
## After that you activate them by clicking on them on your screen.
## In R you don't click on an app (package), you call it with the library() command. See the library() calls further below.
## The two commented lines below check if the required packages are installed. If not, they install them.
#if (!require("pacman")) install.packages("pacman")
#pacman::p_load(tidyverse)
# We will be using the V-Dem (Varieties of Democracy) data set. Particularly, we will use their democracy index.
# It is called the Polyarchy Index (based on Robert Dahl's ideas about democracy) and the variable name is v2x_polyarchy
# If you want to continue to work with this data you can find the code book here: https://v-dem.net/documents/24/codebook_v13.pdf
# If you have not installed the V-Dem data set package you can do it here:
# First, you need to have the devtools package installed
#install.packages("devtools")
# now, install the vdemdata package directly from GitHub
#devtools::install_github("vdeminstitute/vdemdata")

# Libraries
# Before we start any coding we always need to load the libraries we are going to use in the project.
# Always add the libraries at the top so people know what they need to install to run your scripts
library(tidyverse) # a meta package with many libraries to work with data
library(vdemdata) # a data set on democracy
# Quick note: Let's also set the default plotting scheme to black and white.
# This is not necessary but it makes graphs prettier (pretty graphs are SUPER necessary)
theme_set(theme_bw())
# Load the V-Dem data
# The V-Dem data set comes with the vdemdata package. Once the package is loaded, the data is available
# as an object called vdem. Many other R packages ship data sets in the same way!
# Let's generate a data set on our computer that is called df_vdem (df for data frame) that stores the V-Dem data
df_vdem <- vdem
# You can inspect the data frame
names(df_vdem) # the names command lists all the variables in a data set
dim(df_vdem) # dim stands for dimensions. This command gives you the number of observations (rows) and the number of variables (columns)
# RStudio also has a tool to look at the data
View(df_vdem)

## So, we have a large-ish data set. What's next?
## R and other statistical programming languages are useful if you want to extract information from data
## If you give a person ALL the democracy scores for all the countries in the world they will be overwhelmed.
## They won't know what to do with this!
## As an analyst you need to provide key information from the data.
## One key piece of information might be the average democracy in the world! We can calculate the mean democracy easily
## You use the mean() function and then you add the name of the data set and, after a $, the name of the variable that you want to describe
## What's the mean level of democracy in the world?
mean(df_vdem$v2x_polyarchy)
## That didn't work! R returned NA.
## The reason: there are missing values in the data set and R wants you to tell it what it should do with missing data
mean(df_vdem$v2x_polyarchy, na.rm = TRUE)
## We can also calculate the standard deviation. That's a measure of how dispersed the data is. A smaller standard deviation means that the data is closer together
## Y'all probably want exams with high average points and low point dispersion.
sd(df_vdem$v2x_polyarchy, na.rm = TRUE)
## We can also use [] to make conditionals. Let's say we want to know the average democracy of the United States of America
## We can use the mean command from above to calculate that!
mean(df_vdem$v2x_polyarchy[df_vdem$country_name == "United States of America"], na.rm = TRUE)
## We can also combine multiple conditionals! Let's say we want to know the mean democracy score for the US since 2000
mean(df_vdem$v2x_polyarchy[df_vdem$country_name == "United States of America" & df_vdem$year >= 2000], na.rm = TRUE)
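
## A minimal sketch of what happens inside the [] above, using a made-up mini data set.
## The countries, years, and scores in toy are invented for illustration; they are not V-Dem values.
toy <- data.frame(country = c("A", "A", "B", "B"),
                  year    = c(1999, 2001, 1999, 2001),
                  score   = c(0.2, 0.4, NA, 0.8))
mean(toy$score)                  # NA, because one score is missing
mean(toy$score, na.rm = TRUE)    # removes the missing value before averaging
toy$country == "A"               # the condition is just a vector of TRUE and FALSE
toy$score[toy$country == "A"]    # only the scores where the condition is TRUE
# with & both conditions have to be TRUE for a row to be kept
mean(toy$score[toy$country == "A" & toy$year >= 2000], na.rm = TRUE)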

# Let's learn about the tidyverse
# The tidyverse is one of three coding paradigms in R (the other ones are base R and data.table)
# They all have advantages (base R is more stable, data.table is faster) but key is that the tidyverse is more widely used
# and it made writing R code VERY, VERY EASY
# Can you form a sentence in English? You can write tidyverse code!
# tidyverse code is very verbal and verbose
# Let's go through some of the verbs
# select: selects variables (columns) from the data frame
# filter: filters observations (rows) from the data frame
# mutate: adds new variables to the data frame or changes existing ones
# ggplot: do a grammar of graphics plot (they are GORGEOUS)
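
# A hedged sketch of the first three verbs on a tiny made-up data frame.
# toy_scores and its values are invented for illustration; they are not V-Dem data.
library(dplyr) # already part of the tidyverse, loaded again so this snippet runs on its own
toy_scores <- data.frame(country = c("A", "B", "C"),
                         year    = c(2020, 2020, 2021),
                         score   = c(0.2, 0.9, 0.5))
toy_selected <- select(toy_scores, country, score)          # keeps two of the three columns
toy_filtered <- filter(toy_scores, year == 2020)            # keeps only the rows from 2020
toy_mutated  <- mutate(toy_scores, score_pct = score * 100) # adds a column, keeps all rows
toy_selected
toy_filtered
toy_mutated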

# PIPES!
# One of the coolest features of the tidyverse (and R) is the pipe.
# You can pipe lines of code together.
# Pipes look either like this %>% or this |> and there is only a minimal difference between them
# There is a shortcut: Ctrl+Shift+M (Windows and Linux) or Command+Shift+M (Mac)
# Think of it like this:
# Wake up |>
# Turn off alarm |>
# Get out of bed |>
# Put on clothes |>
# Make coffee |>
# Talk to roommate
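
# A small sketch of what the pipe does: it hands the thing on its left to the
# function on its right as the first argument. The numbers are made up.
# Note: the |> pipe needs R 4.1 or newer.
numbers <- c(4, 9, 16)
nested_result <- round(mean(sqrt(numbers)), 1)            # nested: read from the inside out
piped_result  <- numbers |> sqrt() |> mean() |> round(1)  # piped: read from left to right
nested_result
piped_result
identical(nested_result, piped_result) # both ways give the same answer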

# Let's pipe the df_vdem data into a ggplot command and make a histogram of democracy (v2x_polyarchy)
df_vdem |>
  ggplot(aes(x = v2x_polyarchy)) +
  geom_histogram()
# Let's use pipes and filter to make a graph of democracy in the United States of America
# We first take the data set that we have and pipe it into the filter function
# We filter the country_name variable such that we only have the United States of America
# Then we use two variables in the plot. On the x-axis we want the years and on the y-axis
# we want the democracy score for that specific year
df_vdem |>
  filter(country_name %in% c("United States of America")) |>
  ggplot(aes(x = year, y = v2x_polyarchy)) +
  geom_point() + # we use geom_point to make points in the graph
  geom_line() # and we use geom_line to add a line to the graph
# Notice how we are adding layers to the graph with +
# Let's add another filter statement that reduces the data set to only the years
# after the end of the second world war (1945 but you knew that, right?!)
df_vdem |>
  filter(country_name %in% c("United States of America")) |>
  filter(year > 1945) |>
  ggplot(aes(x = year, y = v2x_polyarchy)) +
  geom_line()
# Let's go crazy and compare MULTIPLE countries.
# We can do that by adding other countries to the filter statement.
# We then need to add a facet_wrap command to tell ggplot to break up the plot
# into different facets for each country
# Do you want to add countries? You can do that!
# Just uncomment the line below and check out all the countries that exist.
# Copy the exact name into the filter statement. Make sure to use the exact name, "", and add a ,
#unique(df_vdem$country_name)
df_vdem |>
  filter(country_name %in% c("United States of America",
                             "Brazil", "Germany", "Mexico",
                             "Canada", "Hungary")) |>
  ggplot(aes(x = year, y = v2x_polyarchy)) +
  geom_line() +
  facet_wrap(~country_name)
# Careful scientific calculations came to the conclusion that this would take 37 hours in Excel and
# that the graphs would be very, very, very ugly

## So, we have now examined democracy visually.
## Do you know what the 25 most democratic countries were in 2022?
## You could head over to Buzzfeed or you could run the code below.
## We take the df_vdem data set, reduce it to the year 2022, select only the country_name and democracy variable
## then we arrange it from high to low for democracy (that is what desc() does) and then we slice the head with n = 25
## Sounds scary but actually just gives us the first 25 rows of the data set
df_vdem |>
  filter(year == 2022) |>
  dplyr::select(country_name, v2x_polyarchy) |>
  arrange(desc(v2x_polyarchy)) |>
  slice_head(n = 25)
# We can also slice the data set from the tail!
# What are the 25 least democratic countries in the world in 2022?
df_vdem |>
  filter(year == 2022) |>
  dplyr::select(country_name, v2x_polyarchy) |>
  arrange(desc(v2x_polyarchy)) |>
  slice_tail(n = 25)
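
# An alternative sketch: dplyr also has slice_max() and slice_min(), which pick the
# rows with the highest or lowest values without a separate arrange step.
# Shown here on invented toy data (toy_rank is not part of V-Dem).
library(dplyr)
toy_rank <- data.frame(country = c("A", "B", "C", "D"),
                       score   = c(0.3, 0.9, 0.1, 0.6))
toy_top    <- toy_rank |> slice_max(score, n = 2) # the two rows with the highest score
toy_bottom <- toy_rank |> slice_min(score, n = 2) # the two rows with the lowest score
toy_top
toy_bottom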
# A bit shocking but I'd take a vacation in ten of those...
# We now know what the most and least democratic countries are in 2022
# That's interesting but we might want to know how things have changed
# People don't go to a high school reunion to remember how people were in 1996
# They want to know how people have changed since 1996
# Let's examine how countries have changed over the last decade!
# We'll compare a country's democracy score from 2022 with its score from 2012
# This will require some *fancy* tidyverse work.
# Remember, right now the data set has the following structure: country_name, year, democracy score
# and we want the difference in the democracy score of each country over a decade
# Here's one (of many) ways to do this with the tidyverse
# We select the country_name, year, and democracy variable
# Then we reduce the data set to the year 2012 and the year 2022
# We sort by country and year so that the 2012 row always comes right before the 2022 row
# we use the group_by command, which is AMAZING (it tells R to do calculations based on groups)
# If we tell it to group by country it will do all calculations for the countries and not mix up numbers
# from different countries
# we then do a mutation, meaning we change the data set. We generate a new variable and call it polyarchy_change
# this variable is the current polyarchy score minus the previous polyarchy score.
# The lag() command tells R to use the value from the previous row (here: the 2012 score)
# We then ungroup(), that's not necessary but good practice
df_vdem_change <-
  df_vdem |>
  dplyr::select(country_name, year, v2x_polyarchy) |>
  filter(year %in% c(2012, 2022)) |>
  arrange(country_name, year) |>
  group_by(country_name) |>
  mutate(polyarchy_change = v2x_polyarchy - lag(v2x_polyarchy)) |>
  ungroup()
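
# A minimal sketch of how lag() behaves inside group_by(), on invented toy data
# (toy_change and its scores are made up for illustration).
# The first row of every group has no previous row, so lag() returns NA there.
library(dplyr)
toy_change <- data.frame(country = c("A", "A", "B", "B"),
                         year    = c(2012, 2022, 2012, 2022),
                         score   = c(0.50, 0.75, 0.75, 0.25))
toy_change_result <-
  toy_change |>
  arrange(country, year) |>          # make sure 2012 comes before 2022 in each country
  group_by(country) |>
  mutate(previous = lag(score),      # the score from the row above, within the same country
         change   = score - previous) |>
  ungroup()
toy_change_result
# Country A went up by 0.25 and country B went down by 0.5; the 2012 rows have NA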

# Let's visualize the data
# we take the new data we generated and pipe it into a drop_na command. This drops all rows that have a missing value
# (the 2012 rows have no previous score, so only the 2022 rows with a calculated change remain)
# after this we plot the country_name on the x-axis and the change in polyarchy on the y-axis
# we use the reorder command to reorder the country names based on the values of the polyarchy_change variable
# then we plot points and add a red line at 0 (indicating no change)
# coord_flip() swaps the two axes so that the country names end up on the side of the graph
df_vdem_change |>
  drop_na() |>
  ggplot(aes(x = reorder(country_name, polyarchy_change), y = polyarchy_change)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  coord_flip() +
  labs(title = "Development of democracy from 2012 to 2022",
       y = "Change in Democracy",
       x = "Country")
# This is a somewhat messy graph. Let's break it up into two graphs
# One where polyarchy_change is greater than 0, the countries that became more democratic
# as a twist: I am adding color to geom_point()
df_vdem_change |>
  drop_na() |>
  filter(polyarchy_change > 0) |>
  ggplot(aes(x = reorder(country_name, polyarchy_change), y = polyarchy_change)) +
  geom_point(color = "forestgreen") +
  coord_flip() +
  labs(title = "Development of democracy from 2012 to 2022",
       y = "Change in Democracy",
       x = "Country")
# and one graph where polyarchy_change is less than 0, the countries that experienced backsliding
# again with color in geom_point(), this time in red
df_vdem_change |>
  drop_na() |>
  filter(polyarchy_change < 0) |>
  ggplot(aes(x = reorder(country_name, polyarchy_change), y = polyarchy_change)) +
  geom_point(color = "firebrick") +
  coord_flip() +
  labs(title = "Development of democracy from 2012 to 2022",
       y = "Change in Democracy",
       x = "Country")
# NOTE: If you ever want to save a graph: just click on Export on the right side of RStudio
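
# If you prefer code over clicking, ggplot2 also has the ggsave() function.
# A hedged sketch with an invented toy plot: the file name toy_plot.png, the
# temporary folder, and the 6 x 4 inch size are assumptions picked for illustration.
library(ggplot2)
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8)), aes(x = x, y = y)) +
  geom_line()
plot_file <- file.path(tempdir(), "toy_plot.png")         # replace with a path in your project folder
ggsave(plot_file, plot = toy_plot, width = 6, height = 4) # width and height are in inches by default
file.exists(plot_file)                                    # checks that the file was written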

## How has democracy progressed in the entire world?
## Let's calculate the AVERAGE democracy for each year from 1789 to 2022 and see if the world has
## become more or less democratic
## We'll do this with the power of the tidyverse
## In this process we use the summarize command, which fundamentally alters the structure of the data set
## Normal mutate statements do not change the number of observations you have. If you use summarize you
## actually summarize the data. If you have a data set with a democracy score for ten countries and twelve years you have 120 observations
## if you summarize the data by year to get the average democracy score by year you will have only 12 observations (one for each year)
## This can be very useful but also very confusing
## Summarize requires that you tell it what you want to summarize and how you want to summarize
## We want to summarize the democracy variable and we want it to be summarized as the average (mean)
## We also need to group_by year because we want the annual average democracy score
df_vdem |>
  dplyr::select(country_name, year, v2x_polyarchy) |>
  group_by(year) |>
  summarize(mean_democracy = mean(v2x_polyarchy, na.rm = TRUE)) |>
  ggplot(aes(x = year, y = mean_democracy)) +
  geom_line() +
  ylim(0, 1)
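
## A small sketch of the ten countries and twelve years example from the comments above.
## toy_panel is invented: the country names, years, and scores are placeholders, not real data.
library(dplyr)
toy_panel <- expand.grid(country = paste("Country", 1:10), year = 2001:2012)
toy_panel$score <- seq(0, 1, length.out = nrow(toy_panel)) # placeholder scores between 0 and 1
nrow(toy_panel) # 120 observations: 10 countries times 12 years
toy_mutate <- toy_panel |>
  group_by(year) |>
  mutate(mean_score = mean(score)) |>
  ungroup()
nrow(toy_mutate) # still 120: mutate adds a column but keeps every row
toy_by_year <- toy_panel |>
  group_by(year) |>
  summarize(mean_score = mean(score))
nrow(toy_by_year) # 12: summarize collapses the data to one row per year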

## What is the relationship between economic development and democracy?
## Let's classify countries into HIGH and LOW economically developed countries. Countries that have a GDP above the global mean
## of a given year are highly developed and countries below are not highly developed.
## Similarly for democracy. Countries above the mean of Polyarchy in a given year are highly democratic and those below are not
## THIS IS OBVIOUSLY PROBLEMATIC but I want to teach you cool tricks and sometimes we have to simplify things for that
## We will use something called an ifelse statement. That's a super simple but INCREDIBLY powerful tool
## It works like this: if something is like that do this else do that
## if weather == cloudy bring an umbrella else bring sunglasses
## in code they look like this mutate(clothes = ifelse(weather == "cloudy", "Umbrella", "Sunglasses"))
## Coders sometimes struggle with ifelse statements because they type ifesle or forget parentheses and commas
df_gdp_democracy <-
  df_vdem |>
  dplyr::select(country_name, year, v2x_polyarchy, e_gdp) |>
  group_by(year) |>
  mutate(global_mean_gdp = mean(e_gdp, na.rm = TRUE),
         global_mean_democracy = mean(v2x_polyarchy, na.rm = TRUE),
         democracy_dummy = ifelse(v2x_polyarchy > global_mean_democracy, "High Democracy", "Low Democracy"),
         gdp_dummy = ifelse(e_gdp > global_mean_gdp, "High GDP", "Low GDP")) |>
  ungroup()
table(df_gdp_democracy$democracy_dummy, df_gdp_democracy$gdp_dummy)
# Fun fact: we could have been more efficient with our code. We can add the mean calculation into the ifelse statement:
# df_gdp_democracy <-
#   df_vdem |>
#   dplyr::select(country_name, year, v2x_polyarchy, e_gdp) |>
#   group_by(year) |>
#   mutate(democracy_dummy = ifelse(v2x_polyarchy > mean(v2x_polyarchy, na.rm = TRUE), "High Democracy", "Low Democracy"),
#          gdp_dummy = ifelse(e_gdp > mean(e_gdp, na.rm = TRUE), "High GDP", "Low GDP"))
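
# A tiny sketch of the weather example from the comments above, in base R.
# The weather vector is made up. Note what happens to the missing value:
# if the condition is NA, ifelse returns NA instead of picking one of the two options.
# The same happens in the dummies above for country-years without a GDP or democracy score.
weather <- c("cloudy", "sunny", NA, "cloudy")
clothes <- ifelse(weather == "cloudy", "Umbrella", "Sunglasses")
clothes # "Umbrella" "Sunglasses" NA "Umbrella"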

#########################################################################################################################
## More advanced
#if (!require("pacman")) install.packages("pacman")
#pacman::p_load(texreg)
library(texreg) # this is a package that makes beautiful tables. Usually I would put it at the top of the script...
## Let's say we want to examine the relationship between economic development and democracy further.
## We can in theory run a regression model with one variable as the outcome and another variable as the predictor
## In this scenario we would then see how much variation in one variable (e.g. differences in GDP) explains variation in another variable (e.g. differences in democracy)
## A couple of things:
## Regression (usually) does not reveal causal relationships. Just because we find a positive or negative relationship between variables does not mean that
## one variable causes another variable. We might run a regression examining the relationship between ice cream sales and violence. We'd likely find a positive relationship
## between the two variables! Do ice cream sales cause violence? They (probably) don't. The trick is that another variable is missing from the model and it causes variation in
## the outcome and the predictor. This variable is sunny weather. Sunny weather drives up ice cream sales and also drives up violence.
## In the code below we will estimate two linear regression models. One has democracy as the outcome and GDP as the explaining variable. The other model reverses that order
## We can't say whether higher GDP causes more democracy or higher democracy causes more GDP. We need better causal inference tools and a good theory for that.
## GDP is a variable that is highly skewed (some countries have very high GDPs). Income is also very skewed: you have standard incomes and then you have Jeff Bezos.
## A common solution to skewed variables is to log them. We can do this by generating a new variable with the log() function in R
df_vdem <-
  df_vdem |>
  mutate(e_gdp_log = log(e_gdp))
## Estimating two linear regression models. The first one has democracy as the outcome and uses logged GDP as the predictor
lm_1 <- lm(v2x_polyarchy ~ e_gdp_log, data = df_vdem)
## The second model flips the order and has logged GDP as the outcome and democracy as the predictor
lm_2 <- lm(e_gdp_log ~ v2x_polyarchy, data = df_vdem)
## Visualize the results
screenreg(list(lm_1, lm_2))
plotreg(lm_1)
plotreg(lm_2)
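
## A sketch for building intuition about what lm() estimates, using made-up numbers
## where we already know the answer (toy_x and toy_y are invented, not V-Dem data).
## The outcome is built as exactly 2 plus 3 times the predictor, so the regression
## should recover an intercept of about 2 and a slope of about 3.
toy_x <- 1:10
toy_y <- 2 + 3 * toy_x
toy_lm <- lm(toy_y ~ toy_x)
coef(toy_lm) # the first number is the intercept, the second is the slope for toy_x
## Real data is never this clean, which is why the models above come with standard errors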

#########################################################################################################################
## Interested in data analysis but intimidated by R?
## Don't be! R can be intimidating at first but it's just like learning a language.
## The R community is incredibly supportive, helpful, and inclusive.
## Check out rstats on Reddit and Twitter and follow @hadleywickham on Twitter.
## There are many, many free resources on learning R.
## Rule #1 of learning R: Don't ever pay!
## R for Data Science book: https://r4ds.had.co.nz/
## Coursera and Udemy also have courses but REMEMBER RULE #1 of R
## Exception to the rule: you can send money to people who wrote really useful packages
## If you plan to continue working with R: find a project!
## Come and talk with me! We can find a project that you can work on and I am happy to help you!
## You can also email me: daniel.weitzel@colostate.edu
## My office hours are Wednesday 2-5pm and you can book a slot at https://cal.com/weitzel
## At the end of the semester the Straayer Center will have a poster session for undergraduate research projects
## You can aim for presenting a poster at this session!
## Cool things to do:
## Analyze your WhatsApp conversations: https://r-posts.com/whatsr-package/
## Visualize organs of mammals: https://github.com/jespermaag/gganatogram