@benmarwick
Last active March 30, 2020 04:09
A very short and simple tutorial for the basics of R. Based on http://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf
---
title: "Untitled"
author: "Ben Marwick"
date: "Wednesday, September 24, 2014"
output: html_document
---
## Introduction
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. You should read through this entire document very carefully before making any changes or pressing any buttons.
The first things you need to do (after having read this entire document very carefully) are to look up to the first few lines of the document and change the title (currently it says "Untitled"; you should call it something like "Week 3 Lab Report"), author (change it to your name) and date. In this introduction section you should include a few short sentences to introduce this report. Write something simple like 'This is the report on my analysis of the stone artefact assemblage for the week 3 lab' and tell us a little bit about that assemblage.
In the code chunk below we will read your stone artefact data from your Excel file into our R session. You won't see the code in your HTML output because of the `echo=FALSE` setting; you can change it to `TRUE` to show the code in the output (or delete the `echo=FALSE` bit altogether).
Read the comments in the code chunk carefully so you can edit it to make it work on your computer. A comment is a line of text that starts with # (hash or pound character). Remember that you can run one line of R code at a time in RStudio by placing your cursor in that line and pressing Control+Enter or clicking the 'Run' button to the upper right of this window pane. Running line-by-line is useful for testing that you're getting things right as you go.
```{r,echo=FALSE, message=FALSE, warning=FALSE}
# load the library that will allow us to read Excel files
library(gdata)
# Read your Excel file into our R session, this assumes that
# you've put your Excel file in a folder on your computer called
# "New folder" and then shared that folder with your VM. If your
# folder is called something different (it probably is) then you
# will need to change "sf_New_folder" in the line below.
# Also note that here the Excel file is called 'my_lab_3_data.xlsx',
# which is probably different to the name of your Excel file,
# you should change it to match the actual name of your Excel file.
############ !!!!!!!!!! ##########
############ Alert! Please read the above carefully! ##########
############ !!!!!!!!!! ##########
my_data <- read.xls("/media/sf_New_folder/my_lab_3_data.xlsx", stringsAsFactors = FALSE)
```
Usually after we read some data into R we want to make a few quick checks to see that it worked OK and that the data are as we expected. At your R console (where you see the > symbol), type `str(my_data)` and press return. This will tell you the *structure* of your dataset and give you the column names (directly after the $ signs) so you can refer to them later. Sometimes R changes the column names slightly when you import data because it doesn't like spaces in names.
In the output from the `str` function, after the column name is the data type, you might see `int` (for integer, or whole number), `chr` (character, like letters and words) and `num` (numeric, or decimal numbers) in your data. Numeric data types are most useful for statistical analysis, so if you need to convert something like your mass values from integer to numeric, here's what you'd do: `my_data$Mass <- as.numeric(my_data$Mass)` then you'd check to see if that worked with `str(my_data)`. Let's assume you've now got your metric variables as numeric data types (except for things like platform type and cortex location, which can stay as character types). Note that the variable names I use here might be slightly different to yours, so stay alert to those differences so you can edit the code to make it work with your data.
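To see what that conversion looks like on a small scale, here is a minimal sketch using a made-up data frame. The object `toy` and its values are invented for illustration and are not part of your assemblage; your own column names will probably differ.

```{r, message=FALSE, warning=FALSE}
# a made-up data frame, for illustration only
toy <- data.frame(Mass = c(12L, 7L, 30L),
                  Platform_type = c("plain", "cortical", "plain"),
                  stringsAsFactors = FALSE)
str(toy) # Mass is reported as 'int'
# convert the integer column to numeric
toy$Mass <- as.numeric(toy$Mass)
str(toy) # Mass is now reported as 'num'
```

The values themselves do not change, only the way R stores them.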
## Methods
Write a few short sentences here that summarise the methods you used to analyse your assemblage (hint: describe the equipment you used and how you took the measurements, and give a brief definition for each variable).
## Results
Now we can do some basic visualisation and analysis, building on what we know from the swirl tutorials, and our previous exercise.
We might start by visualising the distributions of each of our metric variables. The code chunk below shows how to plot a histogram for the platform width variable. You should explore your other variables by changing `Platform_width` in the chunk below (and the x-axis label) and include in your report one histogram for the variable that seems most interesting to you, along with a detailed comment about why the distribution you have chosen is interesting. Please use the vocabulary you learned in the swirl 'Data analysis' tutorial.
```{r, message=FALSE, warning=FALSE}
library(ggplot2)
ggplot(my_data, aes(Platform_width)) +
  geom_histogram() +
  xlab("Platform width (mm)")
```
We have discussed studies investigating the value of platform area as a predictor of flake mass. Let's have a quick look into that for this assemblage. The next code chunk will compute the value of platform area for each flake by multiplying platform width by platform thickness and make a plot of platform area and flake mass (for simplicity we'll assume that all the artefacts in your assemblage are complete flakes). You need to look back over your previous rmarkdown exercise to find the code to add a regression line to this plot, include this plot (with the regression line) in your output and write a detailed comment about the relationship you observe in your data.
```{r, message=FALSE, warning=FALSE}
# Let's compute the value for platform area for each flake
my_data$platform_area <- my_data$Platform_width * my_data$Platform_thickness
# Now let's plot that new variable by flake mass
library(ggplot2)
ggplot(my_data, aes(platform_area, Mass)) +
  geom_point()
```
Since we have two groups of flakes, we are of course very interested to compare them and see how they are different and how they are similar. This is the basis of analysing assemblages from different sites or different periods or locations within a single site.
We could make many histograms for each group and stack them together, but there is a simpler way to compare the groups. One basic technique for visualising differences between groups is the boxplot. The code chunk below will produce a boxplot that shows the distributions of the platform width variable for both groups. You should explore the other metric variables in your dataset by replacing `Platform_width` in the chunk below with other column names, and include in your report one interesting boxplot, and a detailed comment about why the comparison of that variable is interesting. Don't forget to update the axis labels! Please also use the vocabulary you learned in the swirl 'Data analysis' tutorial.
```{r, message=FALSE, warning=FALSE}
# already loaded the ggplot2 library in the previous chunk, so no need to do it again here
ggplot(my_data, aes(factor(Group), Platform_width)) +
  geom_boxplot() +
  xlab("Group") +
  ylab("Platform width (mm)")
```
You might be wondering how we can investigate the categorical variables like platform type. We have a few options; the simplest one is to make a table that summarises the count of each category in each group. We can do this with the code chunk below. To translate this code chunk into English, we'd say "take the data called 'my_data', then divide it into groups according to the 'Group' column and the 'Platform_type' column (since we are interested in that variable for the moment), then give me the counts of each platform type for each group". And so the output should be a table of three columns, one for Group, one for Platform_type and one for the count of each platform type in each group. Go ahead and explore your categorical variables with this code chunk and include in your report a table and a detailed comment about why the data in your table are interesting.
```{r, message=FALSE, warning=FALSE}
library(dplyr)
my_data %>% # you should read this %>% symbol as 'then'
  group_by(Group, Platform_type) %>%
  summarise(count = n())
```
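If you want to cross-check those counts, base R's `table` function produces the same numbers in a different layout. This is a minimal sketch on made-up data (the object `toy_counts_data` and its values are invented for illustration), not a replacement for the chunk above.

```{r, message=FALSE, warning=FALSE}
# made-up data, for illustration only
toy_counts_data <- data.frame(
  Group = c(1, 1, 1, 2, 2),
  Platform_type = c("plain", "plain", "cortical", "plain", "faceted"),
  stringsAsFactors = FALSE)
# count each platform type within each group (groups in rows, types in columns)
counts <- table(Group = toy_counts_data$Group,
                Platform_type = toy_counts_data$Platform_type)
counts
# the same counts as a three-column table; note that this layout also
# lists combinations that never occur, with a count of zero
as.data.frame(counts, stringsAsFactors = FALSE)
```

Unlike the `group_by` and `summarise` approach, this layout keeps the zero counts, which can be handy when you want to see which platform types are absent from a group.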
We spoke a little in class about the EL/M measurement (edge length divided by mass) as a convenient ratio for comparing the usable cutting edge of artefacts of different shapes and sizes. With the next code chunk we'll compute that ratio for these data, and see how the two groups compare. Include in your report a boxplot of the EL/M values for the two groups, and a detailed comment on that plot.
```{r, message=FALSE, warning=FALSE}
# Let's compute the EL/M values and add them to our data set as a new column
# called 'el_m' (slashes have a special meaning in R so we can't use them for
# variable names)
my_data$el_m <- (my_data$Length +
                   my_data$Width. +
                   my_data$Maximum_dimension) / my_data$Mass
# Now copy and paste the boxplot code from the chunk above and edit it so
# you get a boxplot for EL/M values for each group. Don't forget to edit the axis
# labels
```
## Conclusion
Write here a few short sentences that recap your key observations from this analysis. You should summarise your main findings from the plots and tables.
Now delete _all_ of my instructions (including this sentence) and leave only your full sentence answers and the code chunks (and the bit at the top between the dashes). Then knit the document by pressing the 'Knit HTML' button in RStudio and upload the resulting HTML document to canvas. You can find the resulting HTML file in the same folder that this Rmd document is saved in (do double-check that you've saved this Rmd file in a sensible location where you can easily find it).
---
title: "Untitled"
author: "Ben Marwick"
date: "Wednesday, September 24, 2014"
output: html_document
---
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. You should read through this entire document very carefully before making any changes or pressing any buttons.
The first things you need to do (after having read this entire document very carefully) are to look up to the first few lines of the document and change the title (currently it says "Untitled"; you should call it something like "Week 1 Lab: Markdown Exercise"), author (change it to your name) and date.
In the code chunk below we will read the stone artefact data from an Excel file into our R session. You won't see the code in your HTML output because of the `echo=FALSE` setting; you can change it to `TRUE` to show the code in the output (or delete the `echo=FALSE` bit altogether). Get the Excel file from here: http://core.tdar.org/dataset/375662 (you will need to sign up and log in to access the data). You can read a little about the context of the data at that webpage. The important detail is that it's a spreadsheet of stone artefact measurements.
Read the comments in the code chunk carefully so you can edit it to make it work on your computer. A comment is a line of text that starts with # (hash or pound character).
```{r,echo=FALSE, message=FALSE, warning=FALSE}
# load the library that will allow us to read Excel files
library(gdata)
# read the Excel file into our R session, this assumes that
# you've put the Excel file in a folder on your computer called
# "New folder" and then shared that folder with your VM. You
# will need to change "sf_New_folder" in the line below if you put
# the Excel file in a folder with a different name.
my_data <- read.xls("/media/sf_New_folder/emap---obsidian-flake-database.xls")
```
It's good science to visualize data to get a general impression of any patterns before doing any statistics, so here's a plot:
```{r, message=FALSE, warning=FALSE}
library(ggplot2)
ggplot(my_data, aes(factor(Site), Length)) +
  geom_boxplot() +
  xlab("turnip") +
  ylab("carrot")
```
There are some problems with the x and y axis labels on that plot, go ahead and edit the code in the chunk above to make the labels more sensible.
We can also look at the relationship between length and mass of the flakes. Have a look at the plot that appears below after you knit this document and write a sentence that identifies the site where mistakes were made during data entry for at least two flakes (hint: look to the lower right of the plot!). You may want to increase the point size to see better, study the comments in the code chunk below to learn how to do that.
```{r, message=FALSE, warning=FALSE}
# already loaded the library in the previous chunk, so no need to do it again here
ggplot(my_data, aes(Length, Weight, colour = factor(Site))) +
  geom_point(size = 1) # change the point size so you can see the points, try replacing 1 with 4
```
We can add linear regression lines to see if the length-mass relationship is the same at all sites.
```{r, message=FALSE, warning=FALSE}
# already loaded the library in the previous chunk, so no need to do it again here
ggplot(my_data, aes(Length, Weight, colour = factor(Site))) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE) # this is the line that adds the regression lines, see what happens when you change FALSE to TRUE
```
For most of the sites, length increases in proportion to mass. But at one site it seems that length actually decreases as mass increases. Write a sentence to explain simply how this might happen (without suggesting that the data are flawed).
In this next code chunk we will compute the average length of flakes at each site.
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(dplyr) # load the library
mean_length_by_site <- my_data %>% # take the data and...
  group_by(Site) %>% # group by site, then...
  summarise(mean = round(mean(Length, na.rm = TRUE), 2)) # calculate the mean length of all flakes in each site, and round to 2 places.
```
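The `na.rm = TRUE` argument in the chunk above tells `mean` to skip missing values. Here is a minimal sketch of why that matters, using made-up lengths (the vector `toy_lengths` is invented for illustration):

```{r, message=FALSE, warning=FALSE}
# made-up flake lengths in mm, with one missing measurement
toy_lengths <- c(20.4, NA, 31.3)
mean(toy_lengths) # the result is NA because one value is missing
mean(toy_lengths, na.rm = TRUE) # the mean of the two recorded values
round(mean(toy_lengths, na.rm = TRUE), 2) # the same, rounded to 2 places
```

Without `na.rm = TRUE`, a single missing measurement at a site would turn that site's mean into `NA`.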
And here we have a table of mean lengths for all the sites.
```{r, results="asis", message=FALSE, warning=FALSE}
library(stargazer) # library for making nice tables
stargazer(mean_length_by_site, type = "html", summary = FALSE, rownames = FALSE)
```
Here's an example of how we can use inline R code, rather than a code chunk. The smallest mean flake length is `r min(mean_length_by_site$mean, na.rm = TRUE)` mm which is found at the `r mean_length_by_site[which(mean_length_by_site$mean == min(mean_length_by_site$mean, na.rm = TRUE)), "Site"]` site.
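The site lookup in the second piece of inline code can also be written with `which.min`, which returns the position of the smallest value. This is a small sketch on a made-up table (the object `toy_means`, its site names and its values are invented for illustration):

```{r, message=FALSE, warning=FALSE}
# a made-up table of mean lengths, for illustration only
toy_means <- data.frame(Site = c("A", "B", "C"),
                        mean = c(31.2, 18.4, 25.9),
                        stringsAsFactors = FALSE)
min(toy_means$mean) # the smallest mean
toy_means$Site[which.min(toy_means$mean)] # the site that it belongs to
```

Note that `which.min` returns only the first position if two sites are tied for the smallest value.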
Now your turn - write a sentence with some inline R code to show the largest mean value in this dataset (hint: copy and paste my sentence with inline code above and replace the `min` function with the `max` function).
# Here's a heading, just to show you how the hash symbol is used to make headings.
You're nearly done with reading through the instructions and soon you can go back to the top and make changes and explore some of the code. You can run the R code in the code chunks as normal, one line at a time from the Rmd document as you edit it in RStudio. Place your cursor in the line of code you want to run and press CTRL+ENTER. If you're not sure what that means, see the YouTube video.
### Here's a lower level heading, more hashes makes for lower level headings
Now delete _all_ of my instructions (including this sentence) and leave only your full sentence answers and the code chunks (and the bit at the top between the dashes). Then knit the document by pressing the 'Knit HTML' button in RStudio and upload the resulting HTML document to canvas. You can find the resulting HTML file in the same folder that this Rmd document is in.
# 1. download and install R: http://cran.r-project.org/
# 2. download and install RStudio: http://www.rstudio.com/ide/download/
# 3. work through this tutorial below...
# if you get stuck and need to search for help, search google but
# include 'r help' before your keywords
## Start RStudio, observe the different windows
# in RStudio, click on 'File' then 'New' then 'R Script'. In the empty window
# that appears, paste in all of what you see here.
# Now
# place the cursor at the start of the first line, which
# begins with '# 1. download and install...' and press
# the 'Run' button in the upper right of the script window.
# This will send the first line of code to the console and run
# it, so you'll see that line appear in the console. Nothing
# further will happen because the # symbol indicates to R
# that the following text is a comment for humans, not a command for R.
# Don't worry, keep pressing run until you get to a line
# that does not start with # and then watch the console for
# the output...
## R as a calculator
2 + 2
3^17
# Exercise 1: Compute the difference between 2013 and the
# year you started at this university and divide
# this by the difference between 2013 and the year
# you were born. Multiply this by 100 to get
# the percentage of your life you have spent at
# this university. Use brackets if you need them.
## R as a workspace
a <- 2
b <- 3
a + b
# Exercise 2: Repeat the previous exercise but
# break it into several steps, storing values
# as data objects with logical names (names
# must start with a letter)
## R has many types of data
a # this is a scalar, a single number
d <- c(2, 4, 7, 9) # this is a vector, a row of numbers
# 'c' is a function called 'concatenate'
# NEVER use 'c' as a name for your data
d
d[2] # get the second number in the vector d
# specific elements of a vector can be addressed
# using [i] indexing
d[4] <- 8 # we can update the vector
d
e <- d + 55 # whole vectors can be operated on directly
e
m1 <- matrix(data = c(d, e), ncol = 2) # this is a matrix, a kind
# of table, also known as an array. ncol specifies the
# number of columns we want.
# matrices are good because they are very fast to calculate with
# the main limitation is that they can only hold one type of data (here, numbers).
m1[1, ] # get the first row of the matrix
m1[, 2] # get the second column
m1[, 3] # what's wrong here?
# the pattern is [row, column] for addressing a matrix
# Exercise 3: What is the value of the third row and second
# column of m1?
f <- c("this", "is", "a", "vector") # vectors can also hold characters
f
# we can make a table of numbers or numbers and characters
# using a data frame
df1 <- data.frame(number1 = d, number2 = e, character1 = f)
# on the left side of the "=" is the column name, on the right
# are the values to put in that column
df1[2] # get the second column of the data frame
df1$number1 # the dollar sign is used to address columns in
# data frames
df1[df1$number1 < 7, ] # the square brackets and dollar sign
# can be combined to subset the data frame. Here we say "give
# me all the rows where the value in the 'number1' column
# is less than seven"
df1[df1$number1 == 4, ] # all rows where 'number1' is
# equal to four
df1[df1$number1 < 7.5 & df1$number2 > 57, ] # conditions for
# subsetting can be combined to make very specific subsets.
# Here the '&' means 'and'
# Exercise 4: What is the value of the 'character1' column when
# number2 is less than 62 and number1 is greater than 3?
list1 <- list(one = a, two = e, three = df1, four = f)
# a list is a type of data object that is a collection of other
# objects. Lists are special because they can hold objects of
# different lengths, unlike a matrix or data frame.
list1 # have a look
list1$two # access the list item 'two'
list1[[4]] # access the fourth item in the list
list1$one + list1$three[, 1] # this is like adding two vectors
## R has many functions
mean(e) # calculate the mean value of our vector 'e'
sd(e) # standard deviation
sum(e) # add up all the elements of vector 'e'
summary(e) # functions that are combinations of functions
seq(1, 10, 0.5) # generate a sequence from 1 to 10 by 0.5
sample(1:10, 5) # take a random sample of 5 numbers from the
# sequence of 1 to 10
rnorm(50) # take 50 numbers randomly from a normal
# distribution
# to get information about how to use a function, use '?'
?mean # opens the help page
example(mean) # the examples are often instructive
# Exercise 5: Make your own data frame that has four columns
# and 200 rows using 'seq', 'sample' and 'rnorm' at least
# once each. Use the 'colMeans' function to calculate the
# mean of each column (check the help file to learn the
# syntax for colMeans)
## R is good for visualisation
plot(d, e) # basic scatterplot
with(df1, plot(number1, number2)) # the same
hist(rnorm(1000)) # basic histogram
boxplot(df1[, c(1:2)]) # box and whisker of first two cols
# the basic built-in plot functions are good for a quick
# look at data, but better plots are easily possible
library(ggplot2) # my favourite plotting package
# if you get an error message at this point, you probably simply
# need to install the ggplot2 package, running this line
# install.packages("ggplot2")
# (without the # symbol) should fix it
# lots of good help online, esp. http://docs.ggplot2.org/current/
# here's a fancy box-whisker type plot
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_violin(scale = "count", adjust = 0.5) +
  geom_jitter(height = 0)
# here's a scatterplot with lines of best fit for each factor
ggplot(mtcars, aes(qsec, wt)) +
  stat_smooth(method = lm, aes(fill = factor(cyl))) +
  geom_point()
# A kitchen sink example...
library(scales)
ggplot(mtcars, aes(qsec, wt)) + # make the basic plot object
  xlab(paste("1/4 mile time (s) (r = ", with(mtcars, round(cor(qsec, wt), 4)), ", p-value = ",
             round(anova(with(mtcars, lm(wt ~ qsec)))$'Pr(>F)'[1], 4), ")", sep = "")) +
  # paste in the r and p values with the axis label
  ylab("Weight (lb/1000)") +
  geom_text(label = row.names(mtcars)) + # specify that the data points are the sample names
  geom_rug(size = 0.1) + # specify that we want the rug plot
  geom_smooth(method = "lm", fill = alpha("grey80", 0.05), colour = "grey80") +
  # specify a line of best fit calculated by a linear model
  # fill is the colour of the standard error area
  # with fill set to "white" it won't show at all
  # colour is the colour of the line
  theme(panel.background = element_blank(), # suppress default background
        panel.grid.major = element_blank(), # suppress default major gridlines
        panel.grid.minor = element_blank(), # suppress default minor gridlines
        axis.ticks = element_blank(), # suppress tick marks
        axis.title.x = element_text(size = 17), # increase axis title size slightly
        axis.title.y = element_text(size = 17, angle = 90), # increase axis title size slightly and rotate
        axis.text.x = element_text(size = 12), # increase size of numbers on x-axis
        axis.text.y = element_text(size = 12), # increase size of numbers on y-axis
        aspect.ratio = 1) # make the plot square-ish
# Exercise 6: use ggplot to plot your 4x200 data frame
# as a scatterplot and fit a loess line of best fit with a
# span of 0.3
# Next: work through this nice interactive online tutorial: http://tryr.codeschool.com/
# Next next: watch a dozen or so of these crazy short videos that cover lots of useful things to do with R: http://www.twotorials.com/
# More next: a pretty solid, basic and easy-to-read intro: http://www.computerworld.com/s/article/9239625/Beginner_s_guide_to_R_Introduction
# Even more next: A good place to find answers to questions (and ask and answer yourself): http://stackoverflow.com/questions/tagged/r
# 1. download and install R: http://cran.r-project.org/
# 2. download and install RStudio: http://www.rstudio.com/ide/download/
# 3. there are lots of great intros to R on the web, here's one
# to help you find your way around RStudio: http://www.youtube.com/watch?v=lVKMsaWju8w
# 4. work through this tutorial below...
# if you get stuck and need to search for help, search google but
# include 'r help' before your keywords
## Start RStudio, observe the different windows
# in RStudio, click on File then New then R Script. In the empty window
# that appears, paste in all of what you see here. Now
# place the cursor at the start of the first line, which
# begins with '# 1. download and install...' and press
# the button 'Run' in the upper right of the script window.
# This will send the first line of code to the console and run
# it, so you'll see that line appear in the console. Nothing
# further will happen because the # symbol indicates to R
# that the following text is a comment for humans, not a command for R.
# Don't worry, keep pressing run until you get to a line
# that does not start with # and then watch the console for
# the output...
## R as a calculator
2 + 2
3^17
# Exercise 1: Compute the difference between 2019 and the
# year you started at this university and divide
# this by the difference between 2019 and the year
# you were born. Multiply this by 100 to get
# the percentage of your life you have spent at
# this university. Use brackets if you need them.
## R as a workspace
a <- 2
b <- 3
a + b
# Exercise 2: Repeat the previous exercise but
# break it into several steps, storing values
# as data objects with logical names (names
# must start with a letter)
## R has many types of data
a # this is a scalar, a single number
d <- c(2, 4, 7, 9) # this is a vector, a row of numbers
# 'c' is a function called 'concatenate'
# NEVER use 'c' as a name for your data
d
d[2] # get the second number in the vector d
# specific elements of a vector can be addressed
# using [i] indexing
d[4] <- 8 # we can update the vector
d
e <- d + 55 # whole vectors can be operated on directly
e
m1 <- matrix(data = c(d, e), ncol = 2) # this is a matrix, a kind
# of table, also known as an array. ncol specifies the
# number of columns we want.
# matrices are good because they are very fast to calculate with
# the main limitation is that they can only hold one type of data (here, numbers).
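# Strictly, a matrix holds values of a single type. A minimal sketch of
# what happens when numbers and text are mixed (the object 'mixed' is
# made up for illustration): R converts everything to character.
mixed <- matrix(c(1, 2, "a", "b"), ncol = 2)
mixed # note the quote marks around the numbers
class(mixed[1, 1]) # "character", the numbers have become text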
m1[1, ] # get the first row of the matrix
m1[, 2] # get the second column
m1[, 3] # what's wrong here?
# the pattern is [row, column] for addressing a matrix
# Exercise 3: What is the value of the third row and second
# column of m1?
f <- c("this", "is", "a", "vector") # vectors can also hold characters
f
# we can make a table of numbers or numbers and characters
# using a data frame
df1 <- data.frame(number1 = d, number2 = e, character1 = f)
# on the left side of the "=" is the column name, on the right
# is the values to put in that column
df1[, 2] # get the second column of the data frame
df1$number1 # the dollar sign is used to address columns in
# data frames
df1[df1$number1 < 7, ] # the square brackets and dollar sign
# can be combined to subset the data frame. Here we say "give
# me all the rows where the value in the 'number1' column
# is less than seven"
df1[df1$number1 == 4, ] # all rows where 'number1' is
# equal to four
df1[df1$number1 < 7.5 & df1$number2 > 57, ] # conditions for
# subsetting can be combined to make very specific subsets.
# Here the '&' means 'and'
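# A sketch of what happens inside the square brackets (the data frame
# 'toy_df' and the vector 'keep' are made up for illustration): the
# condition is first evaluated to a vector of TRUE and FALSE, one
# value per row, and only the TRUE rows are kept.
toy_df <- data.frame(number1 = c(2, 4, 7, 8), number2 = c(57, 59, 62, 63))
keep <- toy_df$number1 < 7.5 & toy_df$number2 > 57
keep # one TRUE or FALSE for each row
toy_df[keep, ] # only the rows where 'keep' is TRUE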
# Exercise 4: What is the value of the 'character1' column when
# number2 is less than 62 and number1 is greater than 3?
list1 <- list(one = a, two = e, three = df1, four = f)
# a list is a type of data object that is a collection of other
# objects. Lists are special because they can hold objects of
# different lengths, unlike a matrix or data frame.
list1 # have a look
list1$two # access the list item 'two'
list1[[4]] # access the fourth item in the list
list1$one + list1$three[,1] # this is like adding two vectors
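# A short sketch for exploring a list (the list 'toy_list' is made up
# for illustration): names() shows the item names, length() gives the
# number of items, and [[ ]] accepts a name as well as a position.
toy_list <- list(one = 2, two = c(57, 59, 62, 63), four = c("a", "b"))
names(toy_list) # the names of the items
length(toy_list) # how many items the list holds
toy_list[["two"]] # the same as toy_list$two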
## R has many functions
mean(e) # calculate the mean value of our vector 'e'
sd(e) # standard deviation
sum(e) # add up all the elements of vector 'e'
summary(e) # some functions combine several others, this one reports the min, quartiles, mean and max
seq(1, 10, 0.5) # generate a sequence from 1 to 10 by 0.5
sample(1:10, 5) # take a random sample of 5 numbers from the
# sequence of 1 to 10
rnorm(50) # take 50 numbers randomly from a normal
# distribution
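# Because sample() and rnorm() are random, you will get different
# numbers each time you run them. A sketch of how to make the results
# repeatable (the seed value 42 is arbitrary, and the object names are
# made up for illustration): set.seed() fixes the random number
# generator, so the same call returns the same numbers.
set.seed(42)
first_draw <- sample(1:10, 5)
set.seed(42)
second_draw <- sample(1:10, 5)
identical(first_draw, second_draw) # TRUE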
# to get information about how to use a function, use '?'
?mean # opens the help page
example(mean) # the examples are often instructive
# Exercise 5: Make your own data frame that has four columns
# and 50 rows using 'seq', 'sample' and 'rnorm' at least
# once each. Use the 'colMeans' function to calculate the
# mean of each column (check the help file to learn the
# syntax for colMeans)
## R is good for visualisation
plot(d,e) # basic scatterplot
with(df1, plot(number1, number2)) # the same
hist(rnorm(1000)) # basic histogram
boxplot(df1[,c(1:2)]) # box and whisker of first two cols
# the basic built-in plot functions are good for a quick
# look at data, but better plots are easily possible
library(ggplot2) # my favourite plotting package
# if you get an error message at this point, you probably simply
# need to install the ggplot2 package, running this line
# install.packages("ggplot2")
# (without the # symbol) should fix it
# lots of good help online, esp. http://docs.ggplot2.org/current/
# here's a nice box-whisker type plot
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_violin(scale = "count", adjust = 0.5) +
  geom_jitter(height = 0)
# here's a scatterplot with lines of best fit for each factor
ggplot(mtcars, aes(qsec, wt)) +
  stat_smooth(method = lm, aes(fill = factor(cyl))) +
  geom_point()
# An elaborate example...
library(scales)
ggplot(mtcars, aes(qsec, wt)) + # make the basic plot object
  xlab(paste("1/4 mile time (s) (r = ", with(mtcars, round(cor(qsec, wt), 4)), ", p-value = ",
             round(anova(with(mtcars, lm(wt ~ qsec)))$'Pr(>F)'[1], 4), ")", sep = "")) +
  # paste in the r and p values with the axis label
  ylab("Weight (lb/1000)") +
  geom_text(label = row.names(mtcars)) + # specify that the data points are the sample names
  geom_rug(size = 0.1) + # specify that we want the rug plot
  geom_smooth(method = "lm", fill = alpha("grey80", 0.05), colour = "grey80") +
  # specify a line of best fit calculated by a linear model
  # fill is the colour of the standard error area
  # with fill set to "white" it won't show at all
  # colour is the colour of the line
  theme(panel.background = element_blank(), # suppress default background
        panel.grid.major = element_blank(), # suppress default major gridlines
        panel.grid.minor = element_blank(), # suppress default minor gridlines
        axis.ticks = element_blank(), # suppress tick marks
        axis.title.x = element_text(size = 17), # increase axis title size slightly
        axis.title.y = element_text(size = 17, angle = 90), # increase axis title size slightly and rotate
        axis.text.x = element_text(size = 12), # increase size of numbers on x-axis
        axis.text.y = element_text(size = 12), # increase size of numbers on y-axis
        aspect.ratio = 1) # make the plot square-ish
# Exercise 6: use ggplot to plot your 4x50 data frame
# as a scatterplot and fit a loess line of best fit with a
# span of 0.3
# A good place to find answers to questions (and ask and answer yourself): http://stackoverflow.com/questions/tagged/r