## r_intro.R
## Title: Straayer Center Democracy Hackathon
## Instructor: Prof. Daniel Weitzel
## Email: daniel.weitzel@colostate.edu
## Date: 2024-01-29

## Hello and welcome to the Straayer Center Democracy Hackathon!
## In today's session you'll learn how to use R to generate insights from data
## We'll visualize and transform data with a few lines of code!

## Notes for people new to R
## The # indicates that something is a comment, the text does not get evaluated by R.
## Text starting with a # is for humans. The computer treats it as something that does not exist.
## If you want to evaluate code you need to put the cursor somewhere in that line and then
## press Control+Enter (Windows and Linux) or Command+Enter (Mac)
## Try this below: put your cursor on the print() line and press either Control+Enter or Command+Enter
print("Hello world!")

# You just executed code. In your script you gave R the command to print "Hello world!" and it printed that
# text in the RStudio console

# First things first
# HOW DOES THIS WORK?!?!
# R is a statistical software that allows you to do data manipulation, statistical modeling,
# visualization of data and SO. MUCH. MORE.
# it basically is a giant calculator with some really fancy tricks!
# We can add things together
1 + 1

# We can subtract them
4-3

# We can multiply and divide
2 * 4
9 / 3

# We can do fancy math
3^3 # 3**3 also works, but ^ is the standard power operator in R
3 * pi # hold up, MATH WITH A WORD? Yep, R can do that!

# We can make R REMEMBER things with the assignment operator <- (= also works but is bad style outside of function arguments)
x # R does not know x yet, so this line throws an error (object 'x' not found)
x <- 5
x

y <- 4
y

z <- x + y
z

# Above we made R remember one number in an object
# But R can remember MANY numbers in an object
# you just need to tell it to combine the numbers
# Let's combine the height of five random people
height <- c(164, 189, 175, 153, 201)
height

# We can access specific heights with []
height[1]

height_maria <- height[5]
height_maria
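
# A quick sketch of a few more things [] can do. The vector height_demo is a
# made-up copy used only for illustration, so nothing above is changed.
height_demo <- c(164, 189, 175, 153, 201)
height_demo[c(1, 3)]            # the first and the third value
height_demo[-1]                 # everything except the first value
height_demo[height_demo > 170]  # only the values larger than 170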

# We can also generate insights from the data by presenting summary statistics
mean(height)
median(height)
sd(height)


summary(height)

## Libraries
## R is very powerful and capable already. However, users around the world
## have written extra packages to make R even more powerful. We will install
## and then load two of these packages.

## You always need to install a package once and then load it. Think of packages like apps on your phone
## You don't download and install TikTok or GrubHub every time you use it. You download and install them once.
## After that you activate them by clicking on them on your screen.
## In R you don't click on an app (package), you call it with the library() command. See the library() calls further down

## This function checks if the required packages are installed. If not, it installs them
#if (!require("pacman")) install.packages("pacman")
#pacman::p_load(tidyverse)

# We will be using the V-Dem (Varieties of Democracy) data set. Particularly, we will use their democracy index
# It is called the Polyarchy Index (based on Robert Dahl's ideas about democracy) and is called v2x_polyarchy
# If you want to continue to work with this data you can find the code book here: https://v-dem.net/documents/24/codebook_v13.pdf

# If you have not installed the V-Dem data set package you can do it here:
# First, you need to have the devtools package installed
#install.packages("devtools")
# now, install the vdemdata package directly from GitHub
#devtools::install_github("vdeminstitute/vdemdata")

# Libraries
# Before we start any coding we always need to load the libraries we are going to use in the project.
# Always add the libraries at the top so people know what they need to install to run your scripts
library(tidyverse)  # a meta package with many libraries to work with data
library(vdemdata)   # a data set on democracy

# Quick note: Let's also set the default plotting scheme to black and white.
# this is not necessary but it makes graphs prettier (pretty graphs are SUPER necessary)
theme_set(theme_bw())

# Load the V-Dem data
# The V-Dem data set ships with the vdemdata package as an object called vdem, so nothing needs to be downloaded separately.
# Many other R packages bundle data sets in the same way!
# Let's generate a data set on our computer that is called df_vdem (df for data frame) that stores the V-Dem data
df_vdem <- vdem

# You can inspect the data frame
names(df_vdem) # the names command lists all the variables in a data set
dim(df_vdem) # dim stands for dimensions. This command gives you the number of observations and the number of variables

# RStudio also has a tool to look at the data
View(df_vdem)

## So, we have a large-ish data set. What's next?
## R and other statistical programming languages are useful if you want to extract information from data
## If you give a person ALL the democracy scores for all the countries in the world they will be overwhelmed.
## They won't know what to do with this!
## As an analyst you need to provide key information from the data.
## One key piece of information might be the average level of democracy in the world! We can calculate the mean democracy score easily
## You use the mean() function and then you add the name of the data set and with a $ the name of the variable that you want to describe

## What's the mean level of democracy in the world?
mean(df_vdem$v2x_polyarchy)

## That didn't work! R returned NA.
## The reason: there are missing values in the data set and R wants you to tell it what it should do with missing data
mean(df_vdem$v2x_polyarchy, na.rm = TRUE)
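
## A tiny sketch of what just happened, using a made-up vector (scores_demo is
## not part of the V-Dem data): a single NA makes the mean NA unless we drop it.
scores_demo <- c(0.2, 0.8, NA)
mean(scores_demo)               # NA, because one value is missing
mean(scores_demo, na.rm = TRUE) # the mean of the two values that are not missing
sum(is.na(scores_demo))         # counts how many values are missing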

## We can also calculate the standard deviation. That's a measure of how dispersed the data is. Smaller standard deviation means that data is closer together
## Y'all probably want exams with high average points and low point dispersion.
sd(df_vdem$v2x_polyarchy, na.rm = TRUE)
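
## To see what "dispersed" means, here is a sketch with two made-up exams
## (exam_a and exam_b are invented numbers). Both have the same mean, but the
## points in exam_b are spread out much more, so its standard deviation is larger.
exam_a <- c(78, 80, 82)
exam_b <- c(60, 80, 100)
mean(exam_a)
mean(exam_b)
sd(exam_a)
sd(exam_b)
## sd() is the square root of the summed squared distances from the mean,
## divided by n - 1, so we can rebuild it by hand:
sqrt(sum((exam_b - mean(exam_b))^2) / (length(exam_b) - 1))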

## We can also use [] to make conditionals. Let's say we want to know the average democracy of the United States of America
## We can use the mean command from above to calculate that!
mean(df_vdem$v2x_polyarchy[df_vdem$country_name == "United States of America"], na.rm = TRUE)

## We can also combine multiple conditionals! Let's say we want to know the mean democracy score for the US since 2000
mean(df_vdem$v2x_polyarchy[df_vdem$country_name == "United States of America" & df_vdem$year >= 2000], na.rm = TRUE)
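
## What happens inside the []? A sketch with a made-up data frame (df_demo and
## its values are invented): the comparison creates TRUE/FALSE for every row and
## [] keeps only the rows that are TRUE. & means both conditions must be TRUE.
df_demo <- data.frame(country_name = c("A", "A", "B", "B"),
                      year = c(1999, 2005, 1999, 2005),
                      score = c(0.4, 0.6, 0.1, 0.3))
df_demo$country_name == "A"                         # TRUE for the two rows of country A
df_demo$country_name == "A" & df_demo$year >= 2000  # TRUE only for country A in 2005
mean(df_demo$score[df_demo$country_name == "A" & df_demo$year >= 2000])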

# Let's learn about the tidyverse
# The tidyverse is one of three coding paradigms in R (the other ones are base R and data.table)
# They all have advantages (base R is more stable, data.table is faster) but key is that the tidyverse is more widely used
# and it made writing R code VERY, VERY EASY
# Can you form a sentence in English? You can write tidyverse code!
# tidyverse code is very verbal and verbose
# Let's go through some of the verbs
# select: selects variables (columns) from the data frame
# filter: filters observations from the data frame
# mutate: creates new variables (columns) or changes existing ones in the data frame
# ggplot: do a grammar of graphics plot (they are GORGEOUS)

# PIPES!
# One of the coolest features of the tidyverse (and R) are pipes.
# You can pipe lines of code together.
# Pipes look either like this %>% or like this |> and there is only a minimal difference between them
# There is a shortcut in RStudio: Ctrl+Shift+M (Windows and Linux) or Command+Shift+M (Mac)
# Think of it like this:
# Wake up |>
#   Turn off alarm |>
#   Get out of bed |>
#   Put on clothes |>
#   Make coffee |>
#   Talk to roommate
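
# The same idea with real code, as a small sketch: the pipe hands the result on
# its left to the function on its right as the first argument.
# (The numbers are made up, and |> needs R 4.1 or newer.)
sum(sqrt(c(1, 4, 9)))           # nested version, read from the inside out
c(1, 4, 9) |> sqrt() |> sum()   # piped version, read from left to right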

# Let's pipe the df_vdem data into a ggplot command and make a histogram of democracy (v2x_polyarchy)
df_vdem |>
  ggplot(aes(x = v2x_polyarchy)) +
  geom_histogram()

# Let's use pipes and filter to make a graph of democracy in the United States of America
# We first take the data set that we have and pipe it into the filter function
# We filter the country_name variable such that we only have the United States of America
# Then we use two variables in the plot. On the x-axis we want the years and on the y axis
# we want the democracy score for that specific year
df_vdem |>
  filter(country_name %in% c("United States of America")) |>
  ggplot(aes(x = year, y = v2x_polyarchy)) +
  geom_point() + # we use geom_point to make points in the graph
  geom_line() # and we use geom_line to add a line to the graph

# Notice how we are adding layers to the graph with +

# Let's add another filter statement that reduces the data set to only the years
# after the end of the second world war (1945 but you knew that, right?!)
df_vdem |>
  filter(country_name %in% c("United States of America")) |>
  filter(year > 1945) |>
  ggplot(aes(x = year, y = v2x_polyarchy)) +
  geom_line()

# Let's go crazy and compare MULTIPLE countries.
# We can do that by adding other countries to the filter statement.
# We then need to add a facet_wrap command to tell ggplot to break up the plot
# into different facets for each country
# Do you want to add countries? You can do that!
# Just uncomment the line below and check out all the countries that exist.
# Copy the exact name into the filter statement. Make sure to use the exact name, "", and add a ,
#unique(df_vdem$country_name)

df_vdem |>
  filter(country_name %in% c("United States of America",
                             "Brazil", "Germany", "Mexico",
                             "Canada", "Hungary")) |>
  ggplot(aes(x = year, y = v2x_polyarchy)) +
  geom_line() +
  facet_wrap(~country_name)

# Careful scientific calculations came to the conclusion that this would take 37 hours in Excel and
# that the graphs would be very, very, very ugly

## So, we have now examined democracy visually.
## Do you know what the 25 most democratic countries were in 2022?
## You could head over to Buzzfeed or you could run the code below.
## We take the df_vdem data set, reduce it to the year 2022, select only the country_name and democracy variable
## then we arrange it from high to low for democracy and then we slice off the first 25 rows
## Sounds scary but it actually just gives us a slice of the data set from the top
df_vdem |>
  filter(year == 2022) |>
  dplyr::select(country_name, v2x_polyarchy) |>
  arrange(-v2x_polyarchy) |>
  slice_head(n = 25)

# We can also slice the data set from the tail!
# What are the 25 least democratic countries in the world in 2022?
df_vdem |>
  filter(year == 2022) |>
  dplyr::select(country_name, v2x_polyarchy) |>
  arrange(-v2x_polyarchy) |>
  slice_tail(n = 25)
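
# A sketch of arrange() and the slice functions on a made-up data set
# (df_rank_demo is invented for illustration). desc() is another way to sort
# from high to low, it does the same as the minus sign used above.
library(dplyr)
df_rank_demo <- tibble(country_name = c("A", "B", "C", "D"),
                       score = c(0.9, 0.2, 0.5, 0.7))
top_two <- df_rank_demo |> arrange(desc(score)) |> slice_head(n = 2)     # A and D
bottom_two <- df_rank_demo |> arrange(desc(score)) |> slice_tail(n = 2)  # C and B
top_two
bottom_two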

# A bit shocking but I'd do a vacation in ten of those...


# We now know what the most and least democratic countries are in 2022
# That's interesting but we might want to know how things have changed
# People don't go to a high school reunion to remember how people were in 1996
# They want to know how people have changed since 1996
# Let's examine how countries have changed over the last decade!
# We'll compare a country's democracy score from 2022 with its score from 2012

# This will require some *fancy* tidyverse work.
# Remember, right now the data set has the following structure: country_name, year, democracy score
# and we want the difference in the democracy score of each country over a decade
# Here's one (of many) ways to do this with the tidyverse
# We select the country_name, year, and democracy variable
# Then we reduce the data set to the year 2012 and the year 2022
# we use the group_by command, which is AMAZING (it tells R to do calculations based on groups)
# If we tell it to group by country it will do all calculations for the countries and not mix up numbers
# from different countries
# we then do a mutation, meaning we change the data set. We generate a new variable and call it polyarchy_change
# this variable is the current polyarchy score minus the previous polyarchy score.
# The lag() command is a command in R that tells it to use a previous score
# We then ungroup(), that's not necessary but good practice
df_vdem_change <-
  df_vdem |>
  dplyr::select(country_name, year, v2x_polyarchy) |>
  filter(year %in% c(2012, 2022)) |>
  arrange(country_name, year) |> # lag() uses the row order, so make sure 2012 comes before 2022
  group_by(country_name) |>
  mutate(polyarchy_change = v2x_polyarchy - lag(v2x_polyarchy)) |>
  ungroup()
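
# A sketch of what group_by() and lag() do together, on a made-up data set
# (df_lag_demo is invented). Within each country lag() looks one row up, so the
# first row of every country has no previous score and becomes NA.
library(dplyr)
df_lag_demo <- tibble(country_name = c("A", "A", "B", "B"),
                      year = c(2012, 2022, 2012, 2022),
                      score = c(0.5, 0.7, 0.9, 0.6)) |>
  group_by(country_name) |>
  mutate(score_change = score - dplyr::lag(score)) |>
  ungroup()
df_lag_demo  # country A improved by about 0.2, country B declined by about 0.3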

# Let's visualize the data
# we take the new data we generated and pipe it into a drop_na command. This drops all the missing data that we have
# after this we plot the country_name on the x axis and the change in polyarchy on the y axis
# we use the reorder command to reorder the country names based on the values of the polyarchy_change variable
# then we plot points and add a red line at 0 (indicating no change)

df_vdem_change |>
  drop_na() |>
  ggplot(aes(x = reorder(country_name, polyarchy_change), y = polyarchy_change)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  coord_flip() +
  labs(title = "Development of democracy from 2012 to 2022",
       y = "Change in Democracy",
       x = "Country")

# This is a somewhat messy graph. Let's break it up into two graphs
# One where polyarchy_change is greater than 0, the countries that became more democratic
# as a twist: I am adding color to geom_point()
df_vdem_change |>
  drop_na() |>
  filter(polyarchy_change > 0) |>
  ggplot(aes(x = reorder(country_name, polyarchy_change), y = polyarchy_change)) +
  geom_point(color = "forestgreen") +
  coord_flip() +
  labs(title = "Development of democracy from 2012 to 2022",
       y = "Change in Democracy",
       x = "Country")

# and one graph where polyarchy_change is less than 0, the countries that experienced backsliding
# as a twist: I am adding color to geom_point()
df_vdem_change |>
  drop_na() |>
  filter(polyarchy_change < 0) |>
  ggplot(aes(x = reorder(country_name, polyarchy_change), y = polyarchy_change)) +
  geom_point(color = "firebrick") +
  coord_flip() +
  labs(title = "Development of democracy from 2012 to 2022",
       y = "Change in Democracy",
       x = "Country")

# NOTE: If you ever want to save a graph: just click on Export on the right side of R Studio

## How has democracy progressed in the entire world?
## Let's calculate the AVERAGE democracy for each year from 1789 to 2022 and see if the world has
## become more or less democratic
## We'll do this with the power of the tidyverse
## In this process we use the summarize command, which fundamentally alters the structure of the data set
## Normal mutate statements do not change the number of observations you have. If you use summarize you
## actually summarize the data. If you have a data set with a democracy score for ten countries and twelve years you have 120 observations
## if you summarize the data by year to get the average democracy score by year you will have only 12 observations (one for each year)
## This can be very useful but also very confusing
## Summarize requires that you tell it what you want to summarize and how you want to summarize
## We want to summarize the democracy variable and we want it to be summarized as the average (mean)
## We also need to group_by year because we want the annual average democracy score
df_vdem |>
  dplyr::select(country_name, year, v2x_polyarchy) |>
  group_by(year) |>
  summarize(mean_democracy = mean(v2x_polyarchy, na.rm = TRUE)) |>
  ggplot(aes(x = year, y = mean_democracy)) +
  geom_line() +
  ylim(0,1)
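
## A sketch of the 120-to-12 example from above with made-up numbers
## (df_sum_demo is invented, the scores are random): ten countries and twelve
## years give 120 rows, after summarizing by year only 12 rows are left.
library(dplyr)
set.seed(2024)
df_sum_demo <- expand.grid(country_name = paste("Country", 1:10),
                           year = 2011:2022)
df_sum_demo$score <- runif(nrow(df_sum_demo))
df_sum_by_year <- df_sum_demo |>
  group_by(year) |>
  summarize(mean_score = mean(score))
nrow(df_sum_demo)     # 120 observations before summarizing
nrow(df_sum_by_year)  # 12 observations, one for each year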

## What is the relationship between economic development and democracy?
## Let's classify countries into HIGH and LOW economically developed countries. Countries that have a GDP above the global mean
## are highly developed and countries below are not highly developed.
## Similarly for democracy. Countries above the mean of Polyarchy are highly democratic and those below are not
## THIS IS OBVIOUSLY PROBLEMATIC but I want to teach you cool tricks and sometimes we have to simplify things for that
## We will use something called an ifelse statement. That's a super simple but INCREDIBLY powerful tool
## It works like this: if something is true do this, else do that
## if weather == cloudy bring an umbrella else bring sunglasses
## in code they look like this mutate(clothes = ifelse(weather == "cloudy", "Umbrella", "Sunglasses"))
## Coders sometimes struggle with ifelse statements because they type ifesle or forget parentheses and commas
df_gdp_democracy <-
  df_vdem |>
  dplyr::select(country_name, year, v2x_polyarchy, e_gdp) |>
  group_by(year) |>
  mutate(global_mean_gdp = mean(e_gdp, na.rm = TRUE),
         global_mean_democracy = mean(v2x_polyarchy, na.rm = TRUE),
         democracy_dummy = ifelse(v2x_polyarchy > global_mean_democracy, "High Democracy", "Low Democracy"),
         gdp_dummy = ifelse(e_gdp > global_mean_gdp, "High GDP", "Low GDP"))


table(df_gdp_democracy$democracy_dummy, df_gdp_democracy$gdp_dummy)
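
# The weather example from above as a runnable sketch (the weather vector is
# made up). ifelse() checks every element and returns the second argument where
# the condition is TRUE and the third argument where it is FALSE.
weather <- c("cloudy", "sunny", "cloudy")
clothes <- ifelse(weather == "cloudy", "Umbrella", "Sunglasses")
clothes
table(clothes)  # table() counts how often each value appears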

# Fun fact: we could have been more efficient with our code. We can add the mean calculation into the ifelse statement:
# df_gdp_democracy <-
#   df_vdem |>
#   dplyr::select(country_name, year, v2x_polyarchy, e_gdp) |>
#   group_by(year) |>
#   mutate(democracy_dummy = ifelse(v2x_polyarchy > mean(v2x_polyarchy, na.rm = TRUE), "High Democracy", "Low Democracy"),
#          gdp_dummy = ifelse(e_gdp > mean(e_gdp, na.rm = TRUE), "High GDP", "Low GDP"))

#########################################################################################################################
## More advanced
#if (!require("pacman")) install.packages("pacman")
#pacman::p_load(texreg)
library(texreg)  # this is a package that makes beautiful tables. Usually I would put it at the top of the script...

## Let's say we want to examine the relationship between economic development and democracy further.
## We can in theory run a regression model with one variable as the outcome and another variable as the predictor
## In this scenario we would then see how much variation in one variable (e.g. differences in GDP) explains variation in another variable (e.g. differences in democracy)
## A couple of things:
## Regression (usually) does not reveal causal relationships. Just because we find a positive or negative relationship between variables does not mean that
## one variable causes another variable. We might run a regression examining the relationship between ice cream sales and violence. We'd likely find a positive relationship
## between the two variables! Do ice cream sales cause violence? They (probably) don't. The trick is that another variable is missing from the model and it causes variation in
## the outcome and the predictor. This variable is sunny weather. Sunny weather drives up ice cream sales and also drives up violence.
## In the code below we will estimate two linear regression models. One has democracy as the outcome and GDP as the explaining variable. The other model reverses that order
## We can't say whether higher GDP causes more democracy or higher democracy causes more GDP. We need better causal inference tools and a good theory for that.
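
## A sketch of the ice cream example with simulated data. All numbers here are
## invented: sunny weather drives both variables and ice cream has no effect on violence.
set.seed(42)
sunny <- rnorm(500)
ice_cream <- 2 * sunny + rnorm(500)
violence <- 1.5 * sunny + rnorm(500)
lm_naive <- lm(violence ~ ice_cream)            # leaves out sunny weather
lm_control <- lm(violence ~ ice_cream + sunny)  # includes sunny weather
coef(lm_naive)["ice_cream"]    # clearly positive even though there is no causal effect
coef(lm_control)["ice_cream"]  # close to zero once sunny weather is in the model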

## GDP is a variable that is highly skewed (some countries have very high GDPs, just like income is also very skewed. You have standard incomes and then you have Jeff Bezos)
## A common solution to skewed variables is to log them. We can do this by generating a new variable with the log() function in R
df_vdem <-
  df_vdem |>
  mutate(e_gdp_log = log(e_gdp))
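
## A sketch of what log() does to skewed numbers (gdp_demo is made up).
## Every value is ten times larger than the one before, after logging the steps
## between them are all the same size. Careful: log(0) is -Inf in R.
gdp_demo <- c(1, 10, 100, 1000)
log(gdp_demo)
diff(log(gdp_demo))  # the distance between neighbors is always log(10)
log(0)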

## Estimating two linear regression models. The first one has democracy as the outcome and uses logged GDP as the predictor
lm_1 <- lm(v2x_polyarchy ~ e_gdp_log, data = df_vdem)
## The second model flips the order and has logged GDP as the outcome and democracy as the predictor
lm_2 <- lm(e_gdp_log ~ v2x_polyarchy, data = df_vdem)

## Visualize the results
screenreg(list(lm_1, lm_2))
plotreg(lm_1)
plotreg(lm_2)


#########################################################################################################################
## Interested in data analysis but intimidated by R?
## Don't be! R can be intimidating at first but it's just like learning a language.
## The R community is incredibly supportive, helpful, and inclusive.
## Check out rstats on Reddit and Twitter and follow @hadleywickham on Twitter.

## There are many, many free resources on learning R.
## Rule #1 of learning R: Don't ever pay!
## R for Data Science book: https://r4ds.had.co.nz/
## Coursera and Udemy also have courses but REMEMBER RULE #1 of R
## Exception to the rule: you can send money to people that wrote really useful packages

## If you plan to continue working with R: find a project!
## Come and talk with me! We can find a project that you can work on and I am happy to help you!
## You can also email me: daniel.weitzel@colostate.edu
## My office hours are Wednesday 2-5pm and you can book a slot at https://cal.com/weitzel
## At the end of the semester the Straayer Center will have a poster session for undergraduate research projects
## You can aim for presenting a poster at this session!

## Cool things to do:
## Analyze your WhatsApp conversations: https://r-posts.com/whatsr-package/
## Visualize organs of mammals: https://github.com/jespermaag/gganatogram