@wang-zhijun
Last active August 17, 2017 02:41

Comparing two variances is useful in several cases, including:

When you want to perform a two-sample t-test, which requires checking the equality of the variances of the two samples

When you want to compare the variability of a new measurement method to an old one. Does the new method reduce the variability of the measure?

> data("ToothGrowth")
> head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5

> my_data <- ToothGrowth
> res.ftest <- var.test(len ~ supp, data = my_data)
> res.ftest

	F test to compare two variances

data:  len by supp
F = 0.6386, num df = 29, denom df = 29, p-value = 0.2331
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3039488 1.3416857
sample estimates:
ratio of variances 
         0.6385951 

> res.ftest$estimate
ratio of variances 
         0.6385951 
> res.ftest$p.value
[1] 0.2331433
> 

> a <- rnorm(10000, mean =0 , sd = 1)
> plot(a)

The first function we look at is dnorm. Given a set of values, it returns the height of the probability density at each point. If you only give the points, it assumes you want a mean of zero and a standard deviation of one. There are options to use different values for the mean and standard deviation, though:

> dnorm(4,mean=4)
[1] 0.3989423
> dnorm(0)
[1] 0.3989423

The second function we examine is pnorm. Given a number or a list it computes the probability that a normally distributed random number will be less than that number. This function also goes by the rather ominous title of the “Cumulative Distribution Function.” It accepts the same options as dnorm:
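A couple of quick sketches (the 0.5 values follow from the symmetry of the normal distribution):

```r
# Probability that a standard normal value is less than 0
pnorm(0)
# [1] 0.5

# Mean and sd can be overridden, as with dnorm
pnorm(0, mean = 2, sd = 1)

# Upper-tail probability instead of the default lower tail
pnorm(0, lower.tail = FALSE)
# [1] 0.5
```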


There are two kinds of the corpus data type, the permanent corpus, PCorpus, and the volatile corpus, VCorpus. In essence, the difference between the two has to do with how the collection of documents is stored in your computer. In this course, we will use the volatile corpus, which is held in your computer's RAM rather than saved to disk, just to be more memory efficient.

To make a volatile corpus, R needs to interpret each element in our vector of text, coffee_tweets, as a document. And the tm package provides what are called Source functions to do just that! In this exercise, we'll use a Source function called VectorSource() because our text data is contained in a vector. The output of this function is called a Source object. Give it a shot!

library(qdap)
# Load tm
library(tm)

# Make a vector source: coffee_source
coffee_source <- VectorSource(coffee_tweets)
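The Source object can then be turned into the corpus itself with VCorpus(); this sketch assumes coffee_tweets is a character vector of tweet text:

```r
library(tm)

# coffee_tweets: assumed character vector of tweets
coffee_source <- VectorSource(coffee_tweets)

# Make the volatile corpus; each vector element becomes one document
coffee_corpus <- VCorpus(coffee_source)
```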

Count the number of rows

tweets <- read.csv("coffee.csv", stringsAsFactors = FALSE)

# Print out the number of rows in tweets
nrow(tweets)

The qdap package offers a better alternative. You can easily find the top 4 most frequent terms (including ties) in text by calling the freq_terms() function and specifying 4.

> library(qdap)

> text
[1] "Text mining usually involves the process of structuring the input text. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods."

> frequent_terms <- freq_terms(text, 4)
> freq_terms(text, 4)
   WORD        FREQ
1  text           3
2  the            3
3  of             2
4  analysis       1
5  analytical     1
6  and            1

> plot(frequent_terms)
> str(frequent_terms)
Classes 'freq_terms', 'all_words' and 'data.frame':	28 obs. of  2 variables:
 $ WORD: chr  "text" "the" "of" "analysis" ...
 $ FREQ: num  3 3 2 1 1 1 1 1 1 1 ...

The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in the new column, but this behavior can be altered via the sep argument.

bmi_cc <- unite(bmi_cc_clean, Country_ISO, Country, ISO, sep = "-")
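A self-contained toy sketch of the default underscore separator (hypothetical values):

```r
library(tidyr)

df <- data.frame(Country = c("Japan", "France"), ISO = c("JP", "FR"))

# Default separator is an underscore:
# produces a single Country_ISO column, "Japan_JP" and "France_FR"
unite(df, Country_ISO, Country, ISO)

# Custom separator via sep: "Japan-JP" and "France-FR"
unite(df, Country_ISO, Country, ISO, sep = "-")
```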


The separate() function allows you to separate one column into multiple columns.


> treatments
  patient treatment year_mo response
1       X         A 2010-10        1
2       Y         A 2010-10        4
3       X         B 2012-08        2
4       Y         B 2012-08        5
5       X         C 2014-12        3
6       Y         C 2014-12        6


> separate(treatments, year_mo, c("year", "month"))
  patient treatment year month response
1       X         A 2010    10        1
2       Y         A 2010    10        4
3       X         B 2012    08        2
4       Y         B 2012    08        5
5       X         C 2014    12        3
6       Y         C 2014    12        6

> head(bmi_cc)
             Country_ISO  year  bmi_val
1         Afghanistan/AF Y1980 21.48678
2             Albania/AL Y1980 25.22533
3             Algeria/DZ Y1980 22.25703
4             Andorra/AD Y1980 25.66652
5              Angola/AO Y1980 20.94876
6 Antigua and Barbuda/AG Y1980 23.31424

bmi_cc_clean <- separate(bmi_cc, col = Country_ISO, into = c("Country", "ISO"), sep = "/")

# Print the head of the result
bmi_cc_clean

lm() and predict()

Using a linear model, predict the number of views for the next three days (days 22, 23, and 24). Use predict() and the predefined future_days data frame. Assign the result to linkedin_pred.

> linkedin
 [1]  5  7  4  9 11 10 14 17 13 11 18 17 21 21 24 23 28 35 21 27 23
> days <- 1:21
> linkedin_lm <- lm(linkedin ~ days)
> linkedin_lm

Call:
lm(formula = linkedin ~ days)

Coefficients:
(Intercept)         days  
      3.967        1.194
> future_days <- data.frame(days = 22:24)
> linkedin_pred <- predict(linkedin_lm, future_days)
> plot(linkedin ~ days, xlim = c(1, 24))
> points(22:24, linkedin_pred, col = "green")

Try to experiment with this code to increase or decrease POSIXct objects:

> now <- Sys.time()
> now
[1] "2017-06-14 06:10:53 UTC"
> now + 3600
[1] "2017-06-14 07:10:53 UTC"
> now - 3600*24
[1] "2017-06-13 06:10:53 UTC"
> str(now)
 POSIXct[1:1], format: "2017-06-14 06:10:53"
> birth <- as.POSIXct("1879-03-14 14:37:23")

> birth
[1] "1879-03-14 14:37:23 UTC"
> str(birth)
 POSIXct[1:1], format: "1879-03-14 14:37:23"

> death <- as.POSIXct("1955-04-18 03:47:12")

> einstein <- death - birth
> einstein
Time difference of 27792.55 days
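The same difference can also be computed with difftime(), which lets you choose the units explicitly:

```r
birth <- as.POSIXct("1879-03-14 14:37:23", tz = "UTC")
death <- as.POSIXct("1955-04-18 03:47:12", tz = "UTC")

# Equivalent to death - birth, but with explicit units
difftime(death, birth, units = "days")
difftime(death, birth, units = "weeks")
```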

-- as.Date

Use as.Date() to convert the astro vector to a vector containing Date objects. You will need the %d, %b and %Y symbols to specify the format. Store the resulting vector as astro_dates.

> meteo
           spring            summer              fall            winter 
    "March 1, 15"      "June 1, 15" "September 1, 15"  "December 1, 15"
> as.Date(meteo, format="%b %d, %y")
[1] "2015-03-01" "2015-06-01" "2015-09-01" "2015-12-01"
> as.Date(meteo, format="%B %d, %y")
[1] "2015-03-01" "2015-06-01" "2015-09-01" "2015-12-01"


> astro
       spring        summer          fall        winter 
"20-Mar-2015" "25-Jun-2015" "23-Sep-2015" "22-Dec-2015"
> as.Date(astro, format="%d-%b-%Y")
[1] "2015-03-20" "2015-06-25" "2015-09-23" "2015-12-22"



> astro_dates
[1] "2015-03-20" "2015-06-25" "2015-09-23" "2015-12-22"
> meteo_dates
[1] "2015-03-01" "2015-06-01" "2015-09-01" "2015-12-01"
> max(abs(meteo_dates - astro_dates))
Time difference of 24 days

> dat <- data.frame(
   time = factor(c("Lunch","Dinner"), levels=c("Lunch","Dinner")),
   total_bill = c(14.89, 17.23)
)
> ggplot(data = dat, mapping= aes(x=time, y=total_bill)) + geom_bar(stat = "identity") 

> ggplot(data = dat, mapping= aes(x=time, y=total_bill)) + geom_bar(stat = "identity", aes(fill=time))

> ggplot(data = dat, mapping= aes(x=time, y=total_bill, fill = time)) + geom_bar(stat = "identity") + guides(fill=FALSE)

Draw Female and Male side by side

> ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(stat="identity")
> ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(stat="identity", position = position_dodge())

The black is the border; use custom-defined colors

> ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(stat="identity", position = position_dodge(), colour="black") + scale_fill_manual(values=c("#999999", "#E69F00"))

size=0.3 makes the black border thinner

ggplot(data=dat1, aes(x=time, y=total_bill, fill=sex)) + geom_bar(stat="identity", position = position_dodge(), colour="black", size=.3) + scale_fill_manual(values=c("#999999", "#E69F00"))

Color each bar

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity))


On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.

?geom_bar shows that the default value for stat is count

The y-axis shows count

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

This is equivalent

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))

You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()


The two below are the same

Imagine if you wanted to change the y-axis to display cty instead of hwy. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to ggplot().

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

Since drv has 3 values (4, f, r), three lines are drawn

> ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, linetype=drv))

To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.

> n_distinct(mpg$drv)
[1] 3
> n_distinct(mpg$cyl)
[1] 4
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + facet_grid(drv ~ cyl)

drv has 3 levels and cyl has 4, so 3 × 4 = 12 panels are drawn in total


> daily <- group_by(flights, year, month, day)
> daily %>%  summarise(flights = n())
Source: local data frame [365 x 4]
Groups: year, month [?]

# A tibble: 365 x 4
    year month   day flights
   <int> <int> <int>   <int>
 1  2013     1     1     842
 2  2013     1     2     943
 3  2013     1     3     914
 4  2013     1     4     915
 5  2013     1     5     720
 6  2013     1     6     832
 7  2013     1     7     933
 8  2013     1     8     899
 9  2013     1     9     902
10  2013     1    10     932
# ... with 355 more rows


> daily %>% ungroup() %>% summarise(flights = n()) # no longer grouped by date
# A tibble: 1 x 1
  flights
    <int>
1  336776

To count the number of distinct (unique) values, use n_distinct(x).

n_distinct(flights$carrier)
[1] 16

Pipeline

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
#> Source: local data frame [365 x 4]
#> Groups: year, month [?]
#> 
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.44
#> 2  2013     1     2 13.68
#> 3  2013     1     3 10.91
#> 4  2013     1     4  8.97
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # ... with 359 more rows

Visualise the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram) display the count with bars; frequency polygons (geom_freqpoly) display the counts with lines. Frequency polygons are more suitable when you want to compare the distribution across the levels of a categorical variable.

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay)
  )

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

group_by destination, then compute the number of flights, mean distance, and mean arrival delay for each destination

> by_dest <- group_by(flights, dest)

> delay <- summarise(by_dest, count=n(), dist = mean(distance, na.rm=TRUE), delay=mean(arr_delay, na.rm=TRUE))


summarise() collapses a data frame to a single row:

> by_day <- group_by(flights, year, month, day)


> summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   delay
#>   <dbl>
#> 1  12.6

> summarise(by_day, delay=mean(dep_delay, na.rm=TRUE))
Source: local data frame [365 x 4]
Groups: year, month [?]

# A tibble: 365 x 4
    year month   day     delay
   <int> <int> <int>     <dbl>
 1  2013     1     1 11.548926
 2  2013     1     2 13.858824
 3  2013     1     3 10.987832
 4  2013     1     4  8.951595
 5  2013     1     5  5.732218
 6  2013     1     6  7.148014
 7  2013     1     7  5.417204
 8  2013     1     8  2.553073
 9  2013     1     9  2.276477
10  2013     1    10  2.844995
# ... with 355 more rows

> View(summarise(by_day, delay=mean(dep_delay, na.rm=TRUE)))


> (x <- 1:4) 
[1] 1 2 3 4

> cumsum(x) # cumulative sums: first 1 element, first 2, first 3, first 4
[1]  1  3  6 10


> cummean(x) # cumulative means: first 1 element, first 2, first 3, first 4
[1] 1.0 1.5 2.0 2.5

If you only want to keep the new variables, use transmute():

> transmute(flights, gain=arr_delay- dep_delay, hours = air_time/ 60, gain_per_hour = gain/hours)

select() can be used to rename variables, but it’s rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren’t explicitly mentioned:

rename(flights, tail_num = tailnum)

Rename tailnum to tail_num


# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))

Another option is to use select() in conjunction with the everything() helper. This is useful if you have a handful of variables you’d like to move to the start of the data frame.

select(flights, time_hour, air_time, everything())

Move time_hour and air_time to the front

Print the columns whose names contain "time"

> select(flights, contains("TIME"))


Note: the group aesthetic will tell ggplot() to draw a single linear model through all the points.

ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(aes(group = 2), method = "lm", se = FALSE, linetype = 2)

Differencing allows you to remove the longer-term time trend. For time series exhibiting seasonal trends, seasonal differencing can be applied to remove these periodic patterns. For example, monthly data may exhibit a strong twelve month pattern. In such situations, changes in behavior from year to year may be of more interest than changes from month to month, which may largely follow the overall seasonal pattern.

diff(..., lag = 4)

A, B, C, D, E, F

With lag = 4, each value is compared to the one four positions earlier: (B+C+D+E) - (A+B+C+D) = E - A


Differencing a time series can remove a time trend. The function diff() will calculate the first difference or change series. A difference series lets you examine the increments or changes in a given time series. It always has one fewer observations than the original series.

# Generate the first difference of z
dz <- diff(z)
  
# Plot dz

ts.plot(dz)
# View the length of z and dz, respectively

length(z)
length(dz)
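A concrete sketch of both the first difference and a lagged difference:

```r
x <- c(1, 2, 4, 7, 11, 16)

# First difference: one fewer observation than x
diff(x)
# [1] 1 2 3 4 5

# Lag-4 difference: each value minus the one four positions earlier
diff(x, lag = 4)
# [1] 10 14
```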

The logarithmic function log() is a data transformation that can be applied to positively valued time series data. It slightly shrinks observations that are greater than one towards zero, while greatly shrinking very large observations. This property can stabilize variability when a series exhibits increasing variability over time. It may also be used to linearize a rapid growth pattern over time.

# Log rapid_growth
linear_growth <- log(rapid_growth)
  
# Plot linear_growth using ts.plot()
 
ts.plot(linear_growth)

time_series <- ts(data_vector, start=2004, frequency = 4)

sort(social_vec,decreasing=TRUE)

> round(c(-3.6, 4.7))
[1] -4  5

basics <- function(x) {
  c(min = min(x), mean = mean(x), max = max(x))
}

# Apply basics() over temp using vapply()
vapply(temp, basics, numeric(3))

sapply()

Applies a function over a list or vector, and tries to simplify the resulting list to a vector or array.
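A quick comparison of the two:

```r
temp <- list(a = 1:3, b = 4:6)

# sapply() simplifies the result when it can
sapply(temp, mean)
# a b
# 2 5

# vapply() is the stricter variant: you declare the expected shape
vapply(temp, mean, numeric(1))
```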


runif(n, min = 0, max = 1) n : number of observations. If length(n) > 1, the length is taken to be the number required.

runif(10)

Check whether x and y are exactly identical

identical(x, y)
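For example:

```r
x <- c(1, 2, 3)
y <- c(1, 2, 3)
identical(x, y)
# [1] TRUE

# Type matters: an integer is not identical to a double
identical(1L, 1)
# [1] FALSE
```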

Recall the bizarre pattern that you saw in the scatterplot between brain weight and body weight among mammals in a previous exercise. Can we use transformations to clarify this relationship?

ggplot2 provides several different mechanisms for viewing transformed relationships. The coord_trans() function transforms the coordinates of the plot. Alternatively, the scale_x_log10() and scale_y_log10() functions perform a base-10 log transformation of each axis. Note the differences in the appearance of the axes.

# Scatterplot with coord_trans()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() + 
  coord_trans(x = "log10", y = "log10")

# Scatterplot with scale_x_log10() and scale_y_log10()
ggplot(data = mammals, aes(x = BodyWt, y = BrainWt)) +
  geom_point() +
  scale_x_log10() + scale_y_log10()

If it is helpful, you can think of boxplots as scatterplots for which the variable on the x-axis has been discretized.

The cut() function takes two arguments: the continuous variable you want to discretize and the number of breaks that you want to make in that continuous variable in order to discretize it.

# Boxplot of weight vs. weeks
ggplot(data = ncbirths, 
       aes(x = cut(weeks, breaks = 5), y = weight)) + 
  geom_boxplot()

# Check out the currently attached packages again
search()

Take a sequence of vector, matrix or data-frame arguments and combine by columns or rows, respectively. These are generic functions with methods for other R classes.

You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example:

big_matrix <- cbind(matrix1, matrix2, vector1, ...)

# Construct star_wars_matrix
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), 
                                           c("US", "non-US")))

# The worldwide box office figures
worldwide_vector <- rowSums(star_wars_matrix)

# Bind the new variable worldwide_vector as a column to star_wars_matrix
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)

You can combine the results of the rbind() function with the colSums() function!


To conveniently add elements to lists you can use the c() function, which you also used to build vectors:

ext_list <- c(my_list , my_val)

# shining_list, the list containing movie name, actors and reviews, is pre-loaded in the workspace

# We forgot something; add the year to shining_list
shining_list_full <- c(shining_list, year=1980)

# Have a look at shining_list_full
str(shining_list_full)

Use for my_vector the name vec, for my_matrix the name mat and for my_df the name df.

# Vector with numerics from 1 up to 10
my_vector <- 1:10 

# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)

# First 10 elements of the built-in data frame mtcars
my_df <- mtcars[1:10,]

# Adapt list() call to give the components names
my_list <- list(my_vector, my_matrix, my_df)
names(my_list) <- c("vec", "mat", "df")
# Print out my_list
my_list


The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.

The easiest way to visualize the effect of gather() is that it makes wide datasets long. As you saw in the video, running the following command on wide_df will make it long:

gather(wide_df, my_key, my_val, -col)

gather(mtcars, key, value, -gear)


paste(): concatenate vectors after converting to character.
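A few quick examples:

```r
paste("Your speed is", 64)
# [1] "Your speed is 64"

# sep controls the separator between arguments
paste("a", "b", sep = "-")
# [1] "a-b"

# collapse joins the elements of a single vector
paste(c("x", "y", "z"), collapse = ", ")
# [1] "x, y, z"
```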

# Initialize the speed variable
speed <- 64

# Extend/adapt the while loop
while (speed > 30) {
  print(paste("Your speed is",speed))
  if (speed > 48 ) {
    print("Slow down big time!")
    speed = speed - 11
  } else {
    print("Slow down!")
    speed = speed - 6
  }
}

By default, arrange() arranges the rows from smallest to largest. Arrange dtc so that flights by the same carrier appear next to each other. Within each carrier, flights that have smaller departure delays appear before flights that have higher departure delays. Do this in a one-liner.

# Arrange dtc according to carrier and departure delays
arrange(dtc, UniqueCarrier, DepDelay)

# Arrange the flights in hflights by their total delay (the sum of DepDelay and ArrDelay). Try to do this directly inside arrange().
# Arrange flights by total delay (normal order).
arrange(hflights, DepDelay+ArrDelay)
arrange(flights, desc(arr_delay))

Missing values are always sorted at the end:

> df <- tibble(x=c(5,2,NA))
> arrange(df, x)
# A tibble: 3 x 1
      x
  <dbl>
1     2
2     5
3    NA
> str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3 obs. of  1 variable:
 $ x: num  5 2 NA
> arrange(df, desc(x))
# A tibble: 3 x 1
      x
  <dbl>
1     5
2     2
3    NA

g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)


# Add the new variable GroundTime to g1. Save the result as g2.
g2 <- mutate(g1, GroundTime=TaxiIn + TaxiOut)

# Add the new variable AverageSpeed to g2. Save the result as g3.

g3 <- mutate(g2, AverageSpeed=Distance/AirTime * 60)
# Print out g3
g3

# Add a second variable loss_ratio to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss/DepDelay)

# Add the three variables as described in the third instruction: m2
m2 <- mutate(hflights, TotalTaxi = TaxiIn+TaxiOut, ActualGroundTime=ActualElapsedTime-AirTime, Diff=TotalTaxi- ActualGroundTime )

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)


rings_vector <- planets_df$rings
  
# Print out rings_vector
rings_vector

# Adapt the code so that instead of only the name column, all columns for planets that have rings are selected.
planets_df[rings_vector, ]


# Use subset() on planets_df to select planets that have a diameter smaller than Earth. Because the diameter variable is a relative measure of the planet's diameter w.r.t that of planet Earth, your condition is diameter < 1.
subset(planets_df, subset = diameter < 1)


# Use order() to create positions
positions <-  order(planets_df$diameter)

# Use positions to sort planets_df
planets_df[positions, ]

On the right we've included a generic version of the select functions that you've coded earlier: select_el(). It takes a vector as its first argument, and an index as its second argument. It returns the vector's element at the specified index.

# Definition of split_low
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split <- strsplit(pioneers, split = ":")
split_low <- lapply(split, tolower)

# Generic select function
select_el <- function(x, index) {
  x[index]
}

# Use lapply() twice on split_low: names and year
names <- lapply(split_low, select_el, 1)
years <- lapply(split_low, select_el, 2)

Use anonymous function inside lapply() lapply(list(1,2,3), function(x) { 3 * x })

# Definition of split_low
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split <- strsplit(pioneers, split = ":")
split_low <- lapply(split, tolower)

# Transform: use anonymous function inside lapply

names <- lapply(split_low, function(x) {x[1]})

years <- lapply(split_low, function(x) {x[2]})

In R, the function rowSums() conveniently calculates the totals for each row of a matrix. This function creates a new vector:

rowSums(my_matrix)

# Construct star_wars_matrix
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), 
                                           c("US", "non-US")))

# Calculate worldwide box office figures
worldwide_vector <- rowSums(star_wars_matrix)

rownames, colnames

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Construct matrix
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)

# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

# Name the columns with region
colnames(star_wars_matrix) <- region
rownames(star_wars_matrix) <- titles


# Name the rows with titles
star_wars_matrix


The argument byrow indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we just place byrow = FALSE. The third argument nrow indicates that the matrix should have three rows.

matrix(1:9, byrow=TRUE, nrow=3)
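Side by side:

```r
# Filled by rows
matrix(1:9, byrow = TRUE, nrow = 3)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9

# Filled by columns (the default)
matrix(1:9, byrow = FALSE, nrow = 3)
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9
```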

The cor(x, y) function will compute the Pearson product-moment correlation between variables, x and y. Since this quantity is symmetric with respect to x and y, it doesn't matter in which order you put the variables.

At the same time, the cor() function is very conservative when it encounters missing data (e.g. NAs). The use argument allows you to override the default behavior of returning NA whenever any of the values encountered is NA. Setting the use argument to "pairwise.complete.obs" allows cor() to compute the correlation coefficient for those observations where the values of x and y are both not missing.

# Compute correlation
ncbirths %>%
  summarize(N = n(), r = cor(weight, mage))

# Compute correlation for all non-missing pairs
ncbirths %>%
  summarize(N = n(), r = cor(weight, weeks, use = "pairwise.complete.obs"))

# Create factor_speed_vector
speed_vector <- c("fast", "slow", "slow", "fast", "insane")
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "fast", "insane"))

# Factor value for second data analyst
da2 <- factor_speed_vector[2]

# Factor value for fifth data analyst
da5 <- factor_speed_vector[5]

# Is data analyst 2 faster than data analyst 5?
da2 > da5

By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered. With the argument levels you give the values of the factor in the correct order.

speed_vector <- c("fast", "slow", "slow", "fast", "insane")

# Convert speed_vector to ordered factor vector
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "fast", "insane"))

# Print factor_speed_vector
factor_speed_vector
summary(factor_speed_vector)

Check how R constructs and prints nominal and ordinal variables.

# Temperature
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector

The old (base R) way to plot points

> plot(iris$Sepal.Length, iris$Sepal.Width)
> points(iris$Petal.Length, iris$Petal.Width, col="red")
> 

> plot(mtcars$wt, mtcars$mpg, col=mtcars$cyl)

Limitations

  • Plot doesn't get redrawn
  • Plot is drawn as an image
  • Need to manually add legend
  • No unified framework for plotting


tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using tbl_df().

library(hflights)

hflights <- tbl_df(hflights)

dim(hflights) # number of observations and variables
> output <- vector("double", ncol(df))
> output
numeric(0)

> df <- data.frame(
+     a = rnorm(5),
+     b = rnorm(5),
+     c = rnorm(5),
+     d = rnorm(5)
+ )
> df
           a           b           c            d
1 -0.5166697 -0.62743621  0.37563561 -0.496738863
2 -1.7507334  0.01831663  0.31026217  0.011395161
3  0.8801042  0.70524346  0.00500695  0.009859946
4  1.3700104 -0.64701901 -0.03763026  0.678271423
5 -1.6873268  0.86818087  0.72397606  1.029563029
> for(i in 1:ncol(df)) {
+     print(median(df[[i]]))
+ }
[1] -0.5166697
[1] 0.01831663
[1] 0.3102622
[1] 0.01139516
> 
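The idiomatic pattern combines the pre-allocated output vector from earlier with seq_along(), so the results are stored rather than just printed:

```r
df <- data.frame(a = rnorm(5), b = rnorm(5), c = rnorm(5), d = rnorm(5))

# Pre-allocate one slot per column, then fill by position
output <- vector("double", ncol(df))
for (i in seq_along(df)) {
  output[[i]] <- median(df[[i]])
}
output
```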

Remove the second element and show the rest

> v = c(2,4,6,8)
> v[-2]
[1] 2 6 8

Out-of-Range Index

> v[5]
[1] NA

Extract a new vector from an existing vector

> s = c("aa", "bb", "cc", "dd", "ee") 
> s[c(2,4)]
[1] "bb" "dd"
> n = s[c(2,4)]
> n
[1] "bb" "dd"

> n = s[2:4]
> n
[1] "bb" "cc" "dd"

Logical Index vector

> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(FALSE, TRUE, FALSE, TRUE, FALSE) 
> s[b]
[1] "bb" "dd"
> s[!b]
[1] "aa" "cc" "ee"

ggplot(data = <DATA>) +   <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy))

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph, but it’s not very interesting so I’m not going to show it here.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot.

aes() specify which variables to map to the x and y axes.

Distinguish class by a different color; by a different size; by a different alpha (transparency); by a different shape (at most 6 shapes)

> ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy,color=class))
> ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy,size=class))
Warning message:
Using size for a discrete variable is not advised. 
> ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy,alpha=class))
> ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy,shape=class))


> ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy,alpha=class), color="blue")

split your plot into facets, subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap().

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

> n_distinct(mpg$class) # 7 panels will be drawn; class: 2seater compact midsize minivan pickup subcompact suv
[1] 7

> ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow=3)
> ggplot(data = mpg) +  geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow=4)

A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.

> n = c(2, 3, 5) 
> s = c("aa", "bb", "cc") 
> 
> b = c(TRUE, FALSE, TRUE) 
> df = data.frame(n, s, b)
> df
  n  s     b
1 2 aa  TRUE
2 3 bb FALSE
3 5 cc  TRUE

data frame

> data <- read.table(header=T, text='
+     subject sex size
+     1       M   7
+     2       F   6
+     3       F   9
+     4       M   11
+ ')

# Get the element at row 1, column 3
> data[1,3]
[1] 7
> data[3,2]
[1] F
Levels: F M
> data[4, "size"]
[1] 11


> data[c(2,3)]
  sex size
1   M    7
2   F    6
3   F    9
4   M   11
> data[c(2,3),]
  subject sex size
2       2   F    6
3       3   F    9


# Column 2 of rows 1 through 4
> data[1:4, 2]
[1] M F F M
> (y <- seq(1,10, by=5))
[1] 1 6

Computers cannot store numbers with infinite precision, so use near() to test whether two floating-point values are effectively equal.

> library(tidyverse)
> near(sqrt(2)^2 ,2)
[1] TRUE
> sqrt(2)^2 == 2
[1] FALSE

The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

> filter(nycflights13::flights, month == 1, day == 1)

so if you want to save the result, you’ll need to use the assignment operator, <-:

jan1 <- filter(nycflights13::flights, month == 1, day == 1)

A useful shorthand is x %in% y, which selects every row where x is one of the values in y.

nov_dec <- filter(nycflights13::flights, month %in% c(11, 12))

NA represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.

> NA > 5
[1] NA
> NA == NA
[1] NA


> x = NA
> is.na(x)
[1] TRUE
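Because NA comparisons evaluate to NA, filter() silently drops rows with missing values; a small sketch of keeping them explicitly with is.na():

```r
library(dplyr)

df <- tibble::tibble(x = c(1, NA, 3))
filter(df, x > 1)            # the NA row is dropped: only x == 3 remains
filter(df, is.na(x) | x > 1) # ask for missing values explicitly to keep them
```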

Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors).
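A quick way to see the factor-conversion difference (note: since R 4.0, data.frame() itself defaults to stringsAsFactors = FALSE, so the contrast below forces the old behaviour explicitly):

```r
library(tibble)

# Old data.frame default: character columns become factors
df <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
class(df$x)  # "factor"

# Tibbles never convert character vectors
tb <- tibble(x = c("a", "b"))
class(tb$x)  # "character"
```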

tibble() is a nice way to create data frames. It encapsulates best practices for data frames:

tibble(x = letters)
#> # A tibble: 26 x 1
#>       x
#>   <chr>
#> 1     a
#> 2     b
#> 3     c
#> 4     d
#> # ... with 22 more rows



> tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
# A tibble: 3 x 2
      x          y
  <int>     <list>
1     1  <int [5]>
2     2 <int [10]>
3     3 <int [20]>

> tibble(x = 1:3, y = list(1:5, 1:10))
 Error: Column `y` must be length 1 or 3, not 2
> 

> y = list(1:5)
> y
[[1]]
[1] 1 2 3 4 5

> y = c(1:5)
> y
[1] 1 2 3 4 5


> y = list(1:5, 1:8)
> y
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 1 2 3 4 5 6 7 8



> x = c(1, NA, 3)
> x
[1]  1 NA  3
> tibble(x = c(1, NA, 3))
# A tibble: 3 x 1
      x
  <dbl>
1     1
2    NA
3     3

Plotting points

> cars <- c(1,3,6,4,9)
> plot(cars)



# Graph cars using blue points overlaid by a line
plot(cars, type="o", col="blue")

# Create a title with a red, bold/italic font
title(main="Autos", col.main="red", font.main=4)

> iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

## Because header is set to FALSE, instead of the attribute names you will see generated column names such as "V1" or "V2"
> head(iris)
   V1  V2  V3  V4          V5
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa


## To simplify working with the data set,
## it is a good idea to set the column names yourself: you can do this with the function names()
> names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width         Species
1            5.1         3.5          1.4         0.2     Iris-setosa
2            4.9         3.0          1.4         0.2     Iris-setosa
3            4.7         3.2          1.3         0.2     Iris-setosa
4            4.6         3.1          1.5         0.2     Iris-setosa
5            5.0         3.6          1.4         0.2     Iris-setosa
6            5.4         3.9          1.7         0.4     Iris-setosa
7            4.6         3.4          1.4         0.3     Iris-setosa

The goal of ggvis is to make it easy to build interactive graphics for exploratory data analysis. ggvis has a similar underlying theory to ggplot2 (the grammar of graphics), but it’s expressed a little differently

mtcars %>% ggvis(x = ~wt, y = ~mpg) %>% layer_points()
mtcars %>% ggvis(~wt,  ~mpg) %>% layer_points()

> mtcars %>% ggvis(x = ~wt, y = ~mpg, fill = ~vs) %>% layer_points()
> mtcars %>% ggvis(x = ~wt, y = ~mpg, stroke = ~vs) %>% layer_points()
> mtcars %>% ggvis(x = ~wt, y = ~mpg, size = ~vs) %>% layer_points()

The $ operator allows you to extract elements by name from a named list. For example:

> x <- list(a=1, b=2, c=3)
> x$b
[1] 2

You can find the names of a list using names()

> names(x)
[1] "a" "b" "c"

Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble():

> as_tibble(iris)
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
          <dbl>       <dbl>        <dbl>       <dbl>      <fctr>
 1          5.1         3.5          1.4         0.2 Iris-setosa
 2          4.9         3.0          1.4         0.2 Iris-setosa
 3          4.7         3.2          1.3         0.2 Iris-setosa
 4          4.6         3.1          1.5         0.2 Iris-setosa
 5          5.0         3.6          1.4         0.2 Iris-setosa
 6          5.4         3.9          1.7         0.4 Iris-setosa
 7          4.6         3.4          1.4         0.3 Iris-setosa
 8          5.0         3.4          1.5         0.2 Iris-setosa
 9          4.4         2.9          1.4         0.2 Iris-setosa
10          4.9         3.1          1.5         0.1 Iris-setosa
# ... with 140 more rows

Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form.

> tribble(
+     ~x, ~y, ~z,
+     #--|--|----
+     "a", 2, 3.6,
+     "b", 1, 8.5
+ )
# A tibble: 2 x 3
      x     y     z
  <chr> <dbl> <dbl>
1     a     2   3.6
2     b     1   8.5

myString <- "Hello World"
print(myString)  # [1] "Hello World"

v <- TRUE
print(class(v)) # [1] "logical"

x <- 1
print(class(x)) # [1] "numeric"

y <- 2L
print(class( y)) # [1] "integer"

z <- 2+5i
print(class( z )) # [1] "complex"

m <- "TRUE"
print(class( m )) # [1] "character"

n <- charToRaw("Hello")
print(class( n )) # [1] "raw"  ("Hello" is stored as 48 65 6c 6c 6f)

Rscript test.R


> a <- array(c('green', 'yellow'), dim = c(3,3,2))
> a
, , 1

     [,1]     [,2]     [,3]    
[1,] "green"  "yellow" "green" 
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green" 

, , 2

     [,1]     [,2]     [,3]    
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green" 
[3,] "yellow" "green"  "yellow"

> apple_colors <- c('green','green','yellow','red','red','red','green', 'blue')
> 
> factor_apple <- factor(apple_colors)
> 
> print(nlevels(factor_apple))
[1] 4
> print(factor_apple)
[1] green  green  yellow red    red    red    green  blue  
Levels: blue green red yellow

Data frames are tabular data objects. Unlike a matrix, each column of a data frame can contain a different mode of data.

BMI <- data.frame(
   gender = c("Male", "Male","Female"), 
   height = c(152, 171.5, 165), 
   weight = c(81,93, 78),
   Age = c(42,38,26)
)
print(BMI)


  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26


while loop

> v <- c("Hello","while loop")
> cnt <- 2
> while (cnt < 7) {
+     print(v)
+     cnt = cnt + 1
+ }
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"
[1] "Hello"      "while loop"

for loop with next

> v <- LETTERS[1:6]
> for ( i in v) {
+    
+    if (i == "D") {
+       next
+    }
+    print(i)
+ }
[1] "A"
[1] "B"
[1] "C"
[1] "E"
[1] "F"

>  new.function <- function() {
+    for(i in 1:5) {
+       print(i^2)
+    }
+ }	
> new.function()
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

new.function <- function(a = 3, b = 6) {
   result <- a * b
   print(result)
}

# Call the function without giving any argument.
new.function() # [1] 18

# Call the function with giving new values of the argument.
new.function(9,5) # [1] 45

paste

> a <- "Hello"
> b <- 'How'
> c <- "are you? "
> 
> print(paste(a,b,c))
[1] "Hello How are you? "
> 
> print(paste(a,b,c, sep = "-"))
[1] "Hello-How-are you? "

> linkedin <- c(16, 9, 13, 5, 2, 17, 14)
> linkedin > 10
[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
> facebook <- c(17, 7, 5, 16, 8, 13, 14)
> linkedin < facebook # compare the vectors element-wise
[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

> linkedin <- c(16, 9, 13, 5, 2, 17, 14)
> facebook <- c(17, 7, 5, 16, 8, 13, 14)
> views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)
> views
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]   16    9   13    5    2   17   14
[2,]   17    7    5   16    8   13   14

> linkedin <- c(16, 9, 13, 5, 2, 17, 14)
> last <- tail(linkedin, 1)
> last
[1] 14

library(ggplot2)

# Change the command below so that cyl is treated as factor
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point()

Grammar of Graphics

2 principles

  • Graphics = distinct layers of grammatical elements
  • Meaningful plots through aesthetic mapping

ggplot(diamonds, aes(x = carat, y = price)) + geom_smooth(method="gam")
ggplot(diamonds, aes(x = carat, y = price, col=clarity)) + geom_point(alpha=0.4)
> dia_plot <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point(alpha = 0.2) # a fixed alpha belongs outside aes()
> 
> dia_plot
> dia_plot + geom_smooth(se = FALSE) # no error shading: set the se argument of geom_smooth() to FALSE
`geom_smooth()` using method = 'gam'
> dia_plot + geom_smooth(se = TRUE) # se = TRUE (the default) draws the confidence band
`geom_smooth()` using method = 'gam'

library(readxl)
mydata <- read_excel("exercise1.xlsx")

# Create a ts object called myts
myts <- ts(mydata[, 2:4], start = c(1981, 1), frequency = 4) # take columns 2, 3, 4; "frequency" is the number of observations per cycle (normally a year, but sometimes a week, a day, an hour, etc.)
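As a self-contained illustration of the frequency argument (synthetic data, not the exercise file):

```r
# Eight quarterly observations starting in Q1 1981 (frequency = 4)
x <- ts(rnorm(8), start = c(1981, 1), frequency = 4)
frequency(x) # 4
cycle(x)     # the quarter of each observation: 1 2 3 4 1 2 3 4
```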

Use which.max() to spot the outlier in the gold series

> goldoutlier <- which.max(gold)

> goldoutlier
[1] 770

library(fpp2)

autoplot(myts, facets = TRUE)

# Create plots of the a10 data

autoplot(a10)

# Produce a polar coordinate season plot for the a10 data
ggseasonplot(a10, polar = TRUE)

# Restrict the ausbeer data to start in 1992
beer <- window(ausbeer, start=1992)
ggsubseriesplot(beer) # plot the same season's values across years

# Path to the hotdogs.txt file: path
path <- file.path("data", "hotdogs.txt") # the data directory

# Import the hotdogs.txt file: hotdogs
hotdogs <- read.table(path, 
                      sep = "", 
                      col.names = c("type", "calories", "sodium")) # name the columns type, calories, sodium

# Call head() on hotdogs
head(hotdogs)

Setting the colClasses argument to a vector of strings specifying each column's class: type is a factor, sodium is numeric. Note that a class of "NULL" skips that column entirely, so calories is dropped here even though it appears in col.names.

hotdogs2 <- read.delim("hotdogs.txt", header = FALSE, 
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL", "numeric"))