marioa/Rsoc.md

R for Social Scientists

Summary of the notes R for Social Scientists.
Contents


Setup
Before we start
Introduction to R
Data frames and tibbles
Data Manipulation using dplyr and tidyr

Setup

Need:

R - from CRAN.
RStudio - from RStudio.

Can check the version installed using sessionInfo(). 

Install tidyverse in advance because it takes a long time to install: install.packages("tidyverse").

Can go to the website www.tidyverse.org for more info.
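Before the workshop it can help to confirm what is already installed. The following is a minimal sketch using only base R; the helper name `is_installed` is hypothetical and not part of the original notes.

```r
# Print the version string of the running R installation
R.version.string

# Hypothetical helper: TRUE if a package can be loaded (it is not attached)
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # "stats" ships with R, so this is TRUE

# Only remind about the (slow) tidyverse install when it is missing
if (!is_installed("tidyverse")) {
  message("tidyverse is missing; run install.packages(\"tidyverse\")")
}
```

Checking first avoids re-running a long installation on a machine that already has the package.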
Before we start

R is an interpreted computer language, first released in August 1993.

RStudio is an IDE for R. Work began in 2010, with version 1.0 released in 2016.

R:

Is good for reproducibility: you encode your workflow in a script.
Has lots of extensions you can add and use (16066 packages in July 2020).
Can work on large data sets.
Has very good graphics.
A large and welcoming community.
R is free and cross-platform.

RStudio:

Free under the Affero General Public License (AGPL) v3, but can also be bought under a commercial licence.
Makes programming with R very easy.

Create a new project: File -> New Project -> New Directory -> New Project.

Use ~/data-carpentry
Create project
File->New File -> R script
Save as script.R

Go over the RStudio layout:

Top Left - Source: your scripts and documents
Bottom Left - Console: what R would look and be like without RStudio
Top Right - Environment/History: look here to see what you have done
Bottom Right - Files and more: see the contents of the project/working directory here, like your script.R file

Suggested layout for directories:

data/ to store your raw data and intermediate datasets.
data_output/ for derived/final data.
documents/ Used for outlines, drafts, and other text.
fig_output/ This folder can store the graphics that are generated by your scripts.
scripts/ A place to keep your R scripts for different analyses or plotting.

Use code sections.
Create some of these (on the console or through the Files interface):
# Create the directories
dir.create("data")
dir.create("data_output")
dir.create("fig_output")
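The suggested layout lists five folders, and `dir.create()` warns when a folder already exists. A minimal sketch of an idempotent setup follows; the function name `create_project_dirs` is hypothetical, and the demo runs in a temporary directory so it does not touch a real project.

```r
# Hypothetical helper: create the suggested folders under `root`,
# skipping any that already exist
create_project_dirs <- function(root = ".") {
  dirs <- file.path(root, c("data", "data_output", "documents",
                            "fig_output", "scripts"))
  for (d in dirs) {
    if (!dir.exists(d)) dir.create(d, recursive = TRUE)
  }
  invisible(dirs)
}

# Demo in a throwaway location; use create_project_dirs() inside a project
demo_root <- file.path(tempdir(), "data-carpentry-demo")
created <- create_project_dirs(demo_root)
dir.exists(created)   # TRUE for every folder

# Calling it a second time is harmless
create_project_dirs(demo_root)
```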

# Download the data
download.file("https://ndownloader.figshare.com/files/11492171",
              "data/SAFI_clean.csv", mode = "wb")
Interacting with R/RStudio:

Console
Script Editor

Ctrl+Enter or Cmd+Enter


Navigating between panes: Ctrl+1, Ctrl+2, Ctrl+3, ...
Getting out of the + prompt (press Esc)
Installing new packages using Tools->Install packages... or via the console:

install.packages("tidyverse")

Use of up arrow to repeat commands
Use of the history pane
tab completion

Introduction to R

Creating objects

3+5
12/7

# Assign values to objects (note the environment pane)
area_hectares <- 1.0     # Use alt/opt - to get <-
3 -> x                   # Object names are case sensitive and cannot start with 
                         # a number, avoid internal names

# As we have seen
area_hectares <- 1.0    # doesn't print anything
area_hectares
(area_hectares <- 1.0)  # putting parentheses around the call prints the value of `area_hectares`

# Can use in arithmetic
2.47 * area_hectares

# Can change the value
area_hectares <- 2.5
2.47 * area_hectares

# Assign to another object
area_acres <- 2.47 * area_hectares

# Reassign
area_hectares <- 50

##########
# Exercise: What do you think is the current content of the object area_acres? 123.5 or 6.175?
##########
Comments

area_hectares <- 1.0		          	# land area in hectares
area_acres <- area_hectares * 2.47	# convert to acres
area_acres			                  	# print land area in acres.

# In RStudio you can comment/uncomment regions of code with Ctrl/Cmd + Shift + C or Code->Comment/Uncomment Lines

##########
# Exercise: Create width and length variables, assign them values, calculate the area of a rectangle
#           (use a variable area) and print out the result.
#
#           Create height and weight variables, give them your height in m and weight in kg, and
#           calculate your BMI (weight / (height * height)).
##########
Functions

Functions are canned bits of code - some are built in, but you can also write
your own. Inputs are called arguments.
sqrt(4)
a <- 4
sqrt(a)
b <- sqrt(a)

# Let's look at another function
round(3.14159)

# Can see what arguments round takes
args(round)

# or if you want more information
help(round)
?round

# To round to 2 decimal places
round(3.14159, digits = 2)

# Can omit the argument names if you keep the same order
round(3.14159, 2)

# But if you change the order you have to name the arguments
round(digits = 2, x = 3.14159)

###########
# Exercise: Do ?round - what functions exist similar to round?
#            How do you use the digits parameter in the round function? What does a negative digits value mean?
###########
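The notes mention that you can write your own functions. A minimal sketch, reusing the hectares-to-acres conversion from earlier; the function name `hectares_to_acres` and its argument names are hypothetical.

```r
# Hypothetical helper: convert hectares to acres.
# The second argument has a default value, so it can be left out.
hectares_to_acres <- function(hectares, acres_per_hectare = 2.47) {
  hectares * acres_per_hectare
}

hectares_to_acres(2.5)                 # uses the default factor

# Named arguments can be given in any order
hectares_to_acres(acres_per_hectare = 2.471, hectares = 2.5)

# Functions are vectorised when the operations inside them are
hectares_to_acres(c(1, 2.5, 50))
```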

Vectors and data types

# Household interviewed
hh_members <- c(3, 7, 10, 6)
hh_members

# Can also have characters of the building material used to construct walls
respondent_wall_type <- c("muddaub", "burntbricks", "sunbricks")
respondent_wall_type

# Can find out how many elements we have
length(hh_members)
length(respondent_wall_type)

# We can find what type of data is stored in the vector
class(hh_members)
class(respondent_wall_type)

# You can use str() to get the structure of an object
str(hh_members)
str(respondent_wall_type)

# You can prefix/postfix elements to a vector
possessions <- c("bicycle", "radio", "television")
possessions <- c(possessions, "mobile_phone") # add to the end of the vector
possessions <- c("car", possessions) # add to the beginning of the vector
possessions

An atomic vector is the simplest R data type and is a linear vector of a single type. These are the basic building blocks that all R objects are built from. So far we have seen "numeric" (double) and "character" vectors. The other 4 atomic vector types are:

"logical" for TRUE and FALSE (the boolean data type)
"integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
"complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
"raw" for bitstreams that we won’t discuss further
You can check the type of your vector using the typeof() function and inputting your vector as the argument.
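As a small illustration of `typeof()`, the sketch below builds one value of each atomic type and asks R for its type. The variable names `examples` and `types` are made up for this example.

```r
# One value per atomic type, kept in a list so they are not coerced
examples <- list(TRUE, 2L, 2, 1 + 4i, "muddaub", as.raw(255))

# Apply typeof() to each element
types <- sapply(examples, typeof)
types
# "logical" "integer" "double" "complex" "character" "raw"

# class() reports both integer-free numbers as "numeric"
class(2)
```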

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).
What will happen in each of these examples? (hint: use class() to check the data type of your objects):

 num_char <- c(1, 2, 3, "a")
 num_logical <- c(1, 2, 3, TRUE)
 char_logical <- c("a", "b", "c", TRUE)
 tricky <- c(1, 2, 3, "4")
 
 # Automatic type coercion: logical -> integer -> double -> complex -> character
Subsetting vectors

Extract one or more objects from a vector
# One element - note start counting from 1
respondent_wall_type <- c("muddaub", "burntbricks", "sunbricks")
respondent_wall_type[2]

# Several elements
respondent_wall_type[c(3, 2)]

# Repeated elements
more_respondent_wall_type <- respondent_wall_type[c(1, 2, 3, 2, 1, 3)]
more_respondent_wall_type

Conditional subsetting

Return those values that are TRUE, ignore FALSE values.
hh_members <- c(3, 7, 10, 6)
hh_members[c(TRUE, FALSE, TRUE, TRUE)]

hh_members > 5    # will return logicals with TRUE for the indices that meet the condition

# so we can use this to select only the values above 5
hh_members[hh_members > 5]

# Can combine using & (AND) or | (OR)
c(TRUE,TRUE,FALSE,FALSE) & c(TRUE, FALSE, TRUE,FALSE)
c(TRUE,TRUE,FALSE,FALSE) | c(TRUE, FALSE, TRUE,FALSE)

hh_members[hh_members < 3 | hh_members > 5]

hh_members[hh_members >= 7 & hh_members == 3]   # no value meets both conditions, so this returns numeric(0)

# Use of %in%
possessions <- c("car", "bicycle", "radio", "television", "mobile_phone")
possessions[possessions == "car" | possessions == "bicycle"] # returns both car and bicycle

# Can shorten this using the %in% operator
possessions %in% c("car", "bicycle")

# so we can do the same thing using
possessions[possessions %in% c("car", "bicycle")]
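A logical vector can also be counted or turned into positions, which is often what you need after writing a condition. A short sketch using the same `hh_members` values; the name `large` is made up here.

```r
hh_members <- c(3, 7, 10, 6)

# TRUE where the household has more than 5 members
large <- hh_members > 5

which(large)   # positions of the TRUE values: 2 3 4
sum(large)     # TRUE counts as 1, so this is the number of matches: 3
mean(large)    # proportion of matches: 0.75
```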
Missing data

Represented as NA:
rooms <- c(2, 1, 1, NA, 4)

# Can give problems
mean(rooms)   # returns NA
max(rooms)    # returns NA

# You can tell it to ignore NAs
mean(rooms, na.rm = TRUE)
max(rooms, na.rm = TRUE)

# You can check if there are NA values
is.na(rooms)

# Can invert using NOT 
!is.na(rooms)

# so we can get
rooms[!is.na(rooms)]

# There is also something called complete.cases
complete.cases(rooms)
rooms[complete.cases(rooms)]

# Or you can tell it to omit NA cases
na.omit(rooms)
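Before dropping missing values it is usually worth knowing how many there are and where. A brief sketch on the same small `rooms` vector; `rooms_known` is a name invented for this example.

```r
rooms <- c(2, 1, 1, NA, 4)

anyNA(rooms)          # TRUE: at least one value is missing
sum(is.na(rooms))     # 1 missing value
which(is.na(rooms))   # it sits at position 4

# Removing NAs by hand gives the same mean as na.rm = TRUE
rooms_known <- rooms[!is.na(rooms)]
mean(rooms_known) == mean(rooms, na.rm = TRUE)   # TRUE
```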

###########
# Exercise
#
# 1. Using this vector of rooms, create a new vector with the NAs removed.
#
# rooms <- c(1, 2, 1, 1, NA, 3, 1, 3, 2, 1, 1, 8, 3, 1, NA, 1)
#
# 2. Use the function median() to calculate the median of the rooms vector.
#
# 3. Use R to figure out how many households in the set use more than 2 rooms for sleeping
####
Data frames and tibbles

A data.frame is a table: a collection of vectors (columns) all of the same length. Each column
holds values of a single type, but different columns can have different types (think of an SQL
table, a spreadsheet, a pandas DataFrame). A tibble is a modernised data.frame, a data.frame on steroids.
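To make the "collection of equal-length vectors" idea concrete, here is a minimal sketch that builds a tiny data frame by hand with base R. The values and the column `has_radio` are made up for illustration and are not taken from the SAFI data.

```r
# Three vectors of length 3, each with its own type
households <- data.frame(
  village   = c("God", "God", "Other"),
  no_membrs = c(3L, 7L, 10L),
  has_radio = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep text as character on older R versions
)

# One type per column, different types across columns
sapply(households, class)

dim(households)   # 3 rows, 3 columns
```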
SAFI (Studying African Farmer-Led Irrigation) is a study looking at farming and irrigation methods in Tanzania and Mozambique. The survey data was collected through interviews conducted between November 2016 and June 2017. For this lesson, we will be using a subset of the available data; the full teaching dataset is used in other lessons in this workshop.
# Use read_csv from the readr package included in tidyverse
library(tidyverse)

# Tell it to interpret "NULL"s as NAs - read as a tibble 
interviews <- read_csv("data/SAFI_clean.csv", na = "NULL")

interviews
View(interviews)
head(interviews)

?read_csv 

class(interviews)

Inspecting data frames

# examining the object you get a lot of information
interviews

# But you can query objects properties individually
dim(interviews)
nrow(interviews)
ncol(interviews)

head(interviews)
tail(interviews)

names(interviews)

str(interviews)
summary(interviews)
Indexing and subsetting data frames

We are now dealing with a 2-dimensional object.
# first element in the first column of the data frame (as a data.frame)
interviews[1, 1]

# first element in the 6th column (as a data.frame)
interviews[1, 6]

# first column of the data frame (as a data.frame)
interviews[1]

# first column of the data frame (as a vector)
interviews[[1]]

# first three elements in the 7th column (as a data.frame)
1:3      # Gives you a vector with three elements
c(1,2,3) # same as
interviews[1:3, 7]

# the 3rd row of the data frame (as a data.frame)
interviews[3, ]

# equivalent to head_interviews <- head(interviews)
head_interviews <- interviews[1:6, ]

# Negative values - everything but that value
interviews[, -1]          # The whole data frame, except the first column

interviews[-c(7:131), ]   # Equivalent to head(interviews)

# Use the column names
interviews["village"]       # Result is a data frame
interviews[, "village"]     # Result is a data frame
interviews[["village"]]     # Result is a vector
interviews$village          # Result is a vector
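The results noted above hold for a tibble such as `interviews` (created by `read_csv()`). A plain base R data.frame behaves differently for `[, "column"]`, which is a common source of confusion. A small sketch with a made-up data frame `df`:

```r
# A base data.frame (not a tibble) with invented values
df <- data.frame(village = c("a", "b", "c"),
                 rooms   = c(1, 3, 2),
                 stringsAsFactors = FALSE)

class(df["village"])                   # "data.frame"
class(df[, "village"])                 # "character": base R drops to a vector
class(df[, "village", drop = FALSE])   # "data.frame": drop = FALSE keeps the table
class(df[["village"]])                 # "character"

identical(df$village, df[["village"]]) # TRUE
```

If code must work on both tibbles and data frames, `[[ ]]` or `$` for vectors and `[ ]` with a single index for tables give the same kind of result in both.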

##########
# Exercise
#
# 1. Create a data frame (interviews_100) containing only the data in row 100 of the interviews dataset.
#
# 2. Notice how nrow() gave you the number of rows in a data frame?
#
#   o Use that number to pull out just that last row in the data frame.
#   o Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
#   o Pull out that last row using nrow() instead of the row number.
#   o Create a new data frame (interviews_last) from that last row.
# 3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object   
#    named interviews_middle.
# 
# 4. Combine nrow() with the - notation above to reproduce the behavior of head(interviews), keeping just the first 
#    through 6th rows of the interviews dataset.
#
#########

Factors

Factors represent categorical data.
respondent_floor_type <- factor(c("earth", "cement", "cement", "earth"))

levels(respondent_floor_type)

# Stores: 2 1 1 2

nlevels(respondent_floor_type)   # number of levels

respondent_floor_type # current order

# Can change the order by explicitly  telling it 
respondent_floor_type <- factor(respondent_floor_type, levels = c("earth", "cement"))
respondent_floor_type # after re-ordering

# Renaming an element in a factor
levels(respondent_floor_type)

levels(respondent_floor_type)[2] <- "brick"
levels(respondent_floor_type)

respondent_floor_type

# A factor is unordered by default (a nominal variable, as opposed to an ordinal one). Can make it ordered:
respondent_floor_type_ordered <- factor(respondent_floor_type, ordered = TRUE)
respondent_floor_type_ordered # after setting as ordered factor
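What an ordered factor buys you is comparison between levels. A minimal sketch with an invented `sizes` factor (not part of the SAFI data):

```r
# Levels are given from smallest to largest
sizes <- factor(c("small", "large", "medium"),
                levels  = c("small", "medium", "large"),
                ordered = TRUE)

sizes               # Levels: small < medium < large
sizes < "large"     # TRUE FALSE TRUE
as.integer(sizes)   # underlying codes: 1 3 2
max(sizes)          # large
```

On an unordered factor the `<` comparison returns NA with a warning, because the levels have no defined order.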
Converting factors

# Converting a factor to a character vector
as.character(respondent_floor_type)

# Does not work for numbers 
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct)                     # Wrong! And there is no warning...

# You can do it but it is a bit clunky
as.numeric(as.character(year_fct)) 

as.numeric(levels(year_fct))[year_fct]   # The recommended way.
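To see why `as.numeric()` on a factor is wrong, it helps to print the levels next to the result: the function returns the position of each value among the sorted levels, not the value itself.

```r
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

levels(year_fct)
# "1977" "1983" "1990" "1998"

# Level positions, not years
as.numeric(year_fct)
# 3 2 1 4 3

# Convert the levels once, then index them by the factor codes
as.numeric(levels(year_fct))[year_fct]
# 1990 1983 1977 1998 1990
```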
Renaming factors

Can use plot() with factors. Let's look at memb_assoc, which records whether interview respondents were or were not members of an irrigation association. This is a column in the data.frame or tibble that you loaded.
# create a vector from the data frame column "memb_assoc"
memb_assoc <- interviews$memb_assoc

# convert it into a factor
memb_assoc <- as.factor(memb_assoc)

# let's see what it looks like
memb_assoc

# bar plot of the number of interview respondents who were
# members of irrigation association:
plot(memb_assoc)

# The NAs are not shown; we can relabel these
memb_assoc <- interviews$memb_assoc

memb_assoc[is.na(memb_assoc)] <- "undetermined"

memb_assoc <- as.factor(memb_assoc)

memb_assoc

plot(memb_assoc)

###########
# Exercise: 
#
# Rename the levels of the factor to have the first letter in uppercase: “No”, ”Undetermined”, and “Yes”.
#
# Now that we have renamed the factor level to “Undetermined”, can you recreate the barplot such that
# “Undetermined” is last (after “Yes”)?
#
############
Formatting dates

Using dates can be a pain. We are going to take the interview_date and split into three columns: year, month, day.
# remind ourselves of the structure
str(interviews)

# Use lubridate: installed with tidyverse, but only attached by library(tidyverse) from tidyverse 2.0.0 onward
library(lubridate)

# Let's extract the dates and look at the structure
dates <- interviews$interview_date
str(dates)        # Recognised as a date

# extract the day, month and year
interviews$day <- day(dates)
interviews$month <- month(dates)
interviews$year <- year(dates)
interviews
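The same split can be done without lubridate, which is a different technique from the one the notes use. A sketch in base R with `format()`, using two made-up dates rather than the SAFI interview dates:

```r
# Two invented dates stored as Date objects
d <- as.Date(c("2016-11-17", "2017-04-02"))

# format() returns text, so convert to integer afterwards
day_num   <- as.integer(format(d, "%d"))   # 17 2
month_num <- as.integer(format(d, "%m"))   # 11 4
year_num  <- as.integer(format(d, "%Y"))   # 2016 2017

data.frame(date = d, day = day_num, month = month_num, year = year_num)
```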
Data Manipulation using dplyr and tidyr

dplyr provides functions to select, filter, modify data. tidyr allows you to modify the shape of your data.
## load the tidyverse
library(tidyverse)

interviews <- read_csv("data/SAFI_clean.csv", na = "NULL")

## inspect the data
interviews

We’re going to learn some of the most common dplyr functions:

select(): subset columns
filter(): subset rows on conditions
mutate(): create new columns by using information from other columns
group_by() and summarize(): create summary statistics on grouped data
arrange(): sort results
count(): count discrete values

Selecting columns and filtering rows

# select columns
select(interviews, village, no_membrs, years_liv)

# filter rows
filter(interviews, village == "God")

Pipes

# What if you wanted to combine a selection and a filter
interviews2 <- filter(interviews, village == "God")
interviews_god <- select(interviews2, no_membrs, years_liv)

# You could combine
interviews_god <- select(filter(interviews, village == "God"), no_membrs, years_liv)

# Simpler if you use pipes (shortcut Ctrl/Cmd + Shift + M)
interviews %>%
    filter(village == "God") %>%
    select(no_membrs, years_liv)

# Can put the result in an object
interviews_god <- interviews %>%
    filter(village == "God") %>%
    select(no_membrs, years_liv)

interviews_god

# I prefer assigning at the end with ->, because the flow then goes from left to right
interviews %>%
    filter(village == "God") %>%
    select(no_membrs, years_liv) -> interviews_god
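A pipe passes the value on its left as the first argument of the call on its right, so a piped chain is just a nested call read in the other direction. The sketch below shows this with the native pipe `|>`, which assumes R 4.1 or later and is a base R alternative to the `%>%` used in these notes; the names `nested` and `piped` are made up.

```r
hh_members <- c(3, 7, 10, 6)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(hh_members)), 2)

# The same computation as a pipeline, read left to right
piped <- hh_members |>
  sqrt() |>
  mean() |>
  round(2)

identical(nested, piped)   # TRUE
```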
    
##########    
# Exercise
#
# Using pipes, subset the interviews data to include interviews where respondents were members of 
# an irrigation association (memb_assoc) and retain only the columns affect_conflicts, liv_count, 
# and no_meals.
#
###########
Mutate

Create new columns:
# the ratio of number of household members to rooms used for sleeping (i.e. avg number of people per room):
interviews %>%
    mutate(people_per_room = no_membrs / rooms)
    
# Only want the rows where membership of an irrigation association is known, so remove NAs
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    mutate(people_per_room = no_membrs / rooms)
    
##########
# Exercise 
#
# Create a new data frame from the interviews data that meets the following criteria: contains only the village column 
# and a new column called total_meals containing a value that is equal to the total number of meals served in the
# household per day on average (no_membrs times no_meals). Only the rows where total_meals is greater than 20 should be 
# shown in the final data frame.
#
# Hint: think about how the commands should be ordered to produce this data frame!
#
##########
Split-apply-combine data analysis and the summarize() function

# group_by() to aggregate properties by a categorical variable
# summarise() to collapse by group or altogether into a single row
interviews %>%
    group_by(village) %>%
    summarize(mean_no_membrs = mean(no_membrs))
    
# Group by more than one column
interviews %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs))

# You may wish to release the groups
interviews %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs)) %>%
    ungroup()
    
# We may want to exclude respondents for whom we don't know whether they are members of an
# irrigation association (remove NAs)
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs))
    
# You can compute several summaries at once
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs),
              min_membrs = min(no_membrs))
              
# You may wish to order the output (ascending order)
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs)) %>%
    arrange(min_membrs)

# or in descending order
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs),
              min_membrs = min(no_membrs)) %>%
    arrange(desc(min_membrs))
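A missing value inside the column being summarised turns the whole group summary into NA, just as it does for `mean()` on a plain vector. A minimal sketch on an invented tibble `toy` (assumes dplyr is installed); the column `n_missing` is a made-up name.

```r
library(dplyr)

# Invented data: village B has one missing household size
toy <- tibble(
  village   = c("A", "A", "B", "B", "B"),
  no_membrs = c(3, 7, 10, 6, NA)
)

toy_summary <- toy %>%
  group_by(village) %>%
  summarize(mean_no_membrs = mean(no_membrs, na.rm = TRUE),  # ignore NAs
            n_missing      = sum(is.na(no_membrs)))          # but report them

toy_summary
# village A: mean 5, 0 missing; village B: mean 8, 1 missing
```

Reporting the number of missing values next to the mean keeps the removal visible in the output.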
Counting

interviews %>%
    count(village)
    
# You can sort in descending order
interviews %>%
    count(village, sort = TRUE)

##########
# Exercise
#
# How many households in the survey have an average of two meals per day? Three meals per day? 
# Are there any other numbers of meals represented?
#
# Use group_by() and summarize() to find the mean, min, and max number of household members for 
# each village. Also add the number of observations (hint: see ?n).
#
# What was the largest household interviewed in each month?
#
###########

Reshaping with pivot_wider and pivot_longer

A tidy dataset:

Each variable has its own column
Each observation has its own row
Each value must have its own cell
Each type of observational unit forms a table

Reshaping converts row values into columns (pivoting wider) and columns into row values (pivoting longer):
Pivoting wider

Use this when you want each distinct value in a column to become a column of its own.
pivot_wider() takes three principal arguments:

the data
the names_from column variable whose values will become new column names.
the values_from column variable whose values will fill the new column variables.

Further arguments include values_fill which, if set, fills in missing values with the value provided.
# Make the wall_type into columns
interviews %>% select(respondent_wall_type)

interviews_wide <- interviews %>%
    mutate(wall_type_logical = TRUE) %>%   # Auxiliary variable (the contents)
    pivot_wider(names_from = respondent_wall_type, 
                values_from = wall_type_logical, 
                values_fill = list(wall_type_logical = FALSE)) # Default fill value
                
Pivoting longer

You want to go the other way where columns become the values in a row.
pivot_longer() takes four principal arguments:

the data
cols are the names of the columns we use to fill the values variable (or to drop).
the names_to column variable we wish to create from column names.
the values_to column variable we wish to create and fill with values associated with the named columns.

We shall go from interviews_wide back to the original data.
interviews_long <- interviews_wide %>%
    pivot_longer(cols = c(burntbricks, cement, muddaub, sunbricks),
                 names_to = "respondent_wall_type", 
                 values_to = "wall_type_logical")
                 
# This creates 4 times as many rows (one per wall type for every interview) - we only want the rows that are TRUE
interviews_long <- interviews_wide %>%
    pivot_longer(cols = c(burntbricks, cement, muddaub, sunbricks),
                 names_to = "respondent_wall_type", 
                 values_to = "wall_type_logical") %>%
    filter(wall_type_logical) %>%
    select(-wall_type_logical)
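The round trip is easier to follow on a table small enough to read in full. The sketch below repeats the same steps on an invented three-row tibble (assumes dplyr and tidyr are installed); `toy`, `present`, and the values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Invented data: one wall type per household
toy <- tibble(
  key_ID    = 1:3,
  wall_type = c("muddaub", "burntbricks", "muddaub")
)

# Wider: one TRUE/FALSE column per wall type
toy_wide <- toy %>%
  mutate(present = TRUE) %>%
  pivot_wider(names_from  = wall_type,
              values_from = present,
              values_fill = list(present = FALSE))
toy_wide   # columns: key_ID, muddaub, burntbricks

# Longer again: keep only the TRUE rows to recover the original table
toy_long <- toy_wide %>%
  pivot_longer(cols      = c(muddaub, burntbricks),
               names_to  = "wall_type",
               values_to = "present") %>%
  filter(present) %>%
  select(-present)

identical(toy_long$wall_type, toy$wall_type)   # TRUE
```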
Applying pivot_wider() to clean our data

Split items_owned so that each item has its own column.
interviews_items_owned <- interviews %>%
    separate_rows(items_owned, sep=";") %>%
    mutate(items_owned_logical = TRUE) %>%
    pivot_wider(names_from = items_owned, 
                values_from = items_owned_logical, 
                values_fill = list(items_owned_logical = FALSE))

nrow(interviews_items_owned)

# Rename the NA column where the owner had no items
interviews_items_owned <- interviews_items_owned %>%
    rename(no_listed_items = `NA`)
    
# You can now do interesting analyses - number of bicycles by village
interviews_items_owned %>%
    filter(bicycle) %>%
    group_by(village) %>%
    count(bicycle)
    
# Average number of items owned by each village
interviews_items_owned %>%
    mutate(number_items = rowSums(select(., bicycle:car))) %>%
    group_by(village) %>%
    summarize(mean_items = mean(number_items))
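What `separate_rows()` does before the pivot is easiest to see on two rows. A sketch on an invented tibble (assumes dplyr and tidyr are installed); the names `toy`, `toy_rows`, `toy_items`, and `owned` are made up.

```r
library(dplyr)
library(tidyr)

# Invented data: items are packed into one ;-separated string
toy <- tibble(
  key_ID      = 1:2,
  items_owned = c("bicycle;radio", "radio")
)

# One row per household-item pair
toy_rows <- toy %>%
  separate_rows(items_owned, sep = ";")
nrow(toy_rows)    # 3

# Back to one row per household, one logical column per item
toy_items <- toy_rows %>%
  mutate(owned = TRUE) %>%
  pivot_wider(names_from  = items_owned,
              values_from = owned,
              values_fill = list(owned = FALSE))
nrow(toy_items)   # 2
```

Checking `nrow()` after the pivot, as the notes do, confirms that each household is back on a single row.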

###########
# Exercise
#
# Create a new data frame (named interviews_months_lack_food) that has one column for each month and records 
# TRUE or FALSE for whether each interview respondent was lacking food in that month.
#
# How many months (on average) were respondents without food if they did belong to an irrigation association? 
# What about if they didn’t?
#
###########

Exporting data

# Create new tibble with the values of months_lack_food and items_owned columns expanded
interviews_plotting <- interviews %>%
    # pivot wider by items_owned
    separate_rows(items_owned, sep=";") %>%
    mutate(items_owned_logical = TRUE) %>%
    pivot_wider(names_from = items_owned, 
                values_from = items_owned_logical, 
                values_fill = list(items_owned_logical = FALSE)) %>%
    rename(no_listed_items = `NA`) %>%
    # pivot wider by months_lack_food
    separate_rows(months_lack_food, sep=";") %>%
    mutate(months_lack_food_logical = TRUE) %>%
    pivot_wider(names_from = months_lack_food, 
                values_from = months_lack_food_logical, 
                values_fill = list(months_lack_food_logical = FALSE)) %>%
    # add some summary columns
    mutate(number_months_lack_food = rowSums(select(., Jan:May))) %>%
    mutate(number_items = rowSums(select(., bicycle:car)))

# Now write the data out - make sure the data_output directory exists.
# (The path argument is deprecated since readr 1.4.0, so the file is passed positionally.)
write_csv(interviews_plotting, "data_output/interviews_plotting.csv")
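A quick way to check an export is to read the file back and compare it with what was written. A minimal sketch on an invented data frame, written to a temporary directory so it does not touch the project (assumes readr is installed); `toy`, `out_dir`, and `out_file` are made-up names.

```r
library(readr)

# Invented data standing in for the real table
toy <- data.frame(village = c("A", "B"), number_items = c(3, 5))

# Create the output folder if it is missing
out_dir <- file.path(tempdir(), "data_output")
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)

out_file <- file.path(out_dir, "toy_plotting.csv")
write_csv(toy, out_file)

# Read it back and compare the contents
toy_back <- read_csv(out_file)
file.exists(out_file)                          # TRUE
identical(toy_back$village, toy$village)       # TRUE
all.equal(toy_back$number_items, toy$number_items)
```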