Austin G. Davis-Richardson
This guide will attempt to teach you to:
- load a dataframe into R
- Perform some transformations on said dataframe
- Calculate the F-Test statistic between two groups in the dataset
- Calculate the variance for each column in the dataset
- Visualize the difference in variance between two groups in the dataset
Download the CSV file and start R. Type (TYPE! Do not copy and paste) only the
lines that are not comments (prefixed by #).
# Load the CSV file using the read.csv function.
# The read.csv function creates a data.frame containing
# the data in the CSV file. You can think of it as an
# excel spreadsheet with columns and rows.
dat = read.csv('combined_with_cohort_and_run.csv')
# The csv file is now held in the variable `dat`
# Let's take a look at it
dat
# That should print a bunch of rows from the table
# and then end with the message:
# [ reached getOption("max.print") -- omitted 853 rows ]
# This just means that R stopped printing after reaching
# some limit on the number of lines it is willing to print.
# You can look at only the top few rows of the
# data.frame using the head function
head(dat)
# This prints the top 6 rows.
# You can look at only the column names:
colnames(dat)
# The data contains the number of reads per OTU per sample.
# We want to convert this to the proportion of total reads.
# So we want to divide each cell by the sum of it's row we
# can calculate row sum using the rowSums() function
rowSums(dat)
# Error in rowSums(dat) : 'x' must be numeric
# Ah, we've reached an error because our data.frame
# contains non-numeric information and therefore, we
# cannot calculate the sum of it
# let's move the non-numeric information for now so
# we can do numeric transformations on our data.
# The non-numeric columns are: status, cohort and run
# Let's first copy them to new data.frames
status = dat$status
cohort = dat$cohort
run = dat$run
# We can access a column by its name using the $ operator.
# Try
dat$Yersinia
# Now we need to remove the non-numeric columns from the
# original data.frame
# This is done by assining them to NULL - a special variable
# that means "nothing"
# So we're converting the columns to nothing or, in other words,
# removing them.
dat$status = NULL
dat$cohort = NULL
dat$run = NULL
# now this should work
rowSums(dat)
# We can divide the data by its rowsums:
dat/rowSums(dat)
# Let's save this as a new variable `prop`
prop = dat/rowSums(prop)
# Take a look at prop
head(prop)
# Now that we have the proportion of total reads, let'
# try another transformation: RPKM
# RPKM is defined as log(10^6 * data) [for each cell]
# We can calculate it using this formula and save it
# to the variable `rpkm. by default log calculates the natural log
rpkm = log(10^6 * prop)
# now we need to add back the status column to our
# transformed data.frames so we can check
# the difference in variance between cases and controls
prop$status = status
rpkm$status = status
# Check to see if the column is back in the data.frame
head(prop)
head(rpkm)
# RPKM introduced some -Infinities which will be a problem.
# We can convert them to 0s
rpkm[rpkm == -Inf] = 0
# This line of code reads: "in the rpkm table, select the
# values that are -Inf and set them to zero"
# The F-test is implemented by the var.test function.
# You can view the documentation for any
# function using the ? operator
?var.test
# Not only does the documentation contain instructions
# on using the function but often it also has information
# about how the test works. Scroll all the way down using
# the arrow keys and you will find a 'see also' function
# which may point you in the direction of other functions
# that may be useful.
# Press q to quit the documentation.
# The usage for var.test says to use the function in one
# of two ways: the first is
# var.test(x, y)
# The second is:
# var.test(formula, data, ...)
# A formula determines what two things will be controlled,
# and data is the data.frame that has the data you
# want to compare.
# Let's compare using the proportional data, the difference
# in variance in the genus Akkermansia
var.test(Akkermansia ~ status, prop)
# The formula says "compare Akkermansia across status"
# This will print out a bunch of information but most
# importantly, the p-value
# By default, the F-test uses a two-sided test to see
# if the variances are different. But what if we would like
# to see if the variance in one group is higher or lower? From the
# documentation, we can learn the use of the option `alternative`.
# The 3 values can be two.sided (default), less and greater
var.test(Akkermansia ~ status, prop, alternative = "greater")
var.test(Akkermansia ~ status, prop, alternative = "less")
# Let's do an F-test using the variances from all genera
# between cases and controls For this, we need to separate
# prop into two data frames. One for cases and one for controls
# and use the var.test function in the form:
# var.test(x, y) where x and y are the two sets of data we
# want to compare
# separate prop into cases and controls
prop.cases = prop[prop$status == 'case',]
prop.controls = prop[prop$status == 'controls',]
# Verify that we did indeed separate cases and controls
# by calling `head` on `prop.cases` and `prop.controls`
# Again, we need to ditch the non-numeric columns as they
# will confuse var.test
prop.cases$status = NULL
prop.controls$status = NULL
# And let's do the F-test
var.test(prop.cases, prop.controls)
# YEARGHHHH! An ERROR! WTF?!
# I have no idea what this error means but after some
# trial and googling I found that I needed to convert my
# data.frams to matrices
prop.cases = as.matrix(prop.cases)
prop.controls = as.matrix(prop.controls)
# I hate this about R -- the errors are cryptic or often
# just plain meaningless.
# You can usually get around them by googling the error.
var.test(prop.cases, prop.controls)
# There, we checked to see if the variance is different
# between cases and controls
# Let's see if it's higher
var.test(prop.cases, prop.controls, alternative="greater")
# And lower
var.test(prop.cases, prop.controls, alternatives="lower")
Homework, do the var.test for cases and controls in the RPKM data.
That completes the guide for today. Thank you!