@graebnerc
Created March 24, 2022 17:37
The original code for the disaster R Markdown file, as well as a corrected version
---
title: "What a desaster!"
author: "Claudius"
date: '2022-04-06'
output: pdf_document
---
# Packages used
```{r}
library(tidyverse)
library(DataScienceExercises)
library(knitr)
```
# Exploring flight data
In this short text we explore the following data set on flights departing
from New York.
```{r}
base_data <- DataScienceExercises::nycflights21_small[1:200, ]
data.frame(head(DataScienceExercises::nycflights21_small, 50))
```
To get a first look at the relationships among the variables, consider the
following scatter plots:
```{r, out.width='40%', out.height='40%', fig.pos="center"}
arrival_dep <- ggplot(data = base_data) +
  geom_point(mapping = aes(x = arr_delay, y = dep_delay),
             alpha = 0.5, color = "#00395B") +
  ggplot2::theme_bw() +
  labs(x = "Arrival delay", y = "Departure delay") +
  theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank(),
    panel.border = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(colour = "grey"),
    axis.ticks = ggplot2::element_line(colour = "grey")
  )
arrival_dist <- ggplot(data = base_data) +
  geom_point(mapping = aes(x = arr_delay, y = distance),
             alpha = 0.5, color = "#00395B") +
  ggplot2::theme_bw() +
  labs(x = "Arrival delay", y = "Distance") +
  theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank(),
    panel.border = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(colour = "grey"),
    axis.ticks = ggplot2::element_line(colour = "grey")
  )
arrival_month <- ggplot(data = base_data) +
  geom_point(mapping = aes(y = arr_delay, x = month),
             alpha = 0.5, color = "#00395B") +
  ggplot2::theme_bw() +
  labs(x = "Month", y = "Arrival delay") +
  theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank(),
    panel.border = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(colour = "grey"),
    axis.ticks = ggplot2::element_line(colour = "grey")
  )
arrival_carrier <- ggplot(data = base_data) +
  geom_point(mapping = aes(y = arr_delay, x = carrier),
             alpha = 0.5, color = "#00395B") +
  ggplot2::theme_bw() +
  labs(x = "Carrier", y = "Arrival delay") +
  theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank(),
    panel.border = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(colour = "grey"),
    axis.ticks = ggplot2::element_line(colour = "grey")
  )
ggpubr::ggarrange(
  arrival_dep, arrival_dist,
  arrival_month, arrival_carrier,
  ncol = 2, nrow = 2)
```
This suggests that there is a strong correlation between departure and arrival
delay. To compute the correlation we might use the following R code:
```{r, echo=FALSE}
cor(base_data$arr_delay, base_data$dep_delay)
```
There is indeed a very strong correlation. But is it significant? Let's check
it using the Pearson correlation test:
```{r}
cor.test(base_data$arr_delay, base_data$dep_delay, method = "pearson")
```
Of course, these are just preliminary results; from a methodological point of
view, there is still much to do...
---
title: "What a beauty!"
author: "Claudius"
date: '2022-04-06'
output: pdf_document
header-includes:
- \usepackage{setspace}
- \onehalfspacing
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  warning = FALSE, message = FALSE)
```
# Packages used
```{r}
library(tidyverse)
library(DataScienceExercises)
library(knitr)
```
# Exploring flight data
In this short text we explore the following data set on flights departing
from New York.
```{r, echo=FALSE}
base_data <- DataScienceExercises::nycflights21_small[1:200, ]
knitr::kable(head(DataScienceExercises::nycflights21_small, 5))
```
To get a first look at the relationships among the variables, consider the
following scatter plots:
```{r, fig.align='center', out.width='70%', out.height='70%', echo=FALSE}
arrival_dep <- ggplot(data = base_data) +
  geom_point(mapping = aes(x = arr_delay, y = dep_delay),
             alpha = 0.5, color = "#00395B") +
  ggplot2::theme_bw() +
  labs(x = "Arrival delay", y = "Departure delay") +
  theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank(),
    panel.border = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(colour = "grey"),
    axis.ticks = ggplot2::element_line(colour = "grey")
  )
arrival_dist <- ggplot(data = base_data) +
  geom_point(mapping = aes(x = arr_delay, y = distance),
             alpha = 0.5, color = "#00395B") +
  ggplot2::theme_bw() +
  labs(x = "Arrival delay", y = "Distance") +
  theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank(),
    panel.border = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(colour = "grey"),
    axis.ticks = ggplot2::element_line(colour = "grey")
  )
arrival_month <- ggplot(data = base_data) +
  geom_point(mapping = aes(y = arr_delay, x = month),
             alpha = 0.5, color = "#00395B") +
  ggplot2::theme_bw() +
  labs(x = "Month", y = "Arrival delay") +
  theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank(),
    panel.border = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(colour = "grey"),
    axis.ticks = ggplot2::element_line(colour = "grey")
  )
arrival_carrier <- ggplot(data = base_data) +
  geom_point(mapping = aes(y = arr_delay, x = carrier),
             alpha = 0.5, color = "#00395B") +
  ggplot2::theme_bw() +
  labs(x = "Carrier", y = "Arrival delay") +
  theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank(),
    panel.border = ggplot2::element_blank(),
    axis.line = ggplot2::element_line(colour = "grey"),
    axis.ticks = ggplot2::element_line(colour = "grey")
  )
ggpubr::ggarrange(
  arrival_dep, arrival_dist,
  arrival_month, arrival_carrier,
  ncol = 2, nrow = 2)
```
These plots suggest that there is a strong correlation between departure and
arrival delay. To compute the correlation we might use the following R code:
```{r, echo=TRUE, results='hide'}
cor_coef <- cor(base_data$arr_delay, base_data$dep_delay)
```
This produces a correlation coefficient of `r round(cor_coef, 3)`, suggesting
that there is indeed a very strong correlation.
But is it significant? Let's check it using the Pearson correlation test:
```{r}
c_test <- cor.test(
  x = base_data$arr_delay,
  y = base_data$dep_delay,
  method = "pearson")
```
The most relevant statistics are:
```{r, echo=FALSE}
knitr::kable(tibble(
  "t-stat" = c_test$statistic,
  "df" = c_test$parameter,
  "p-val" = c_test$p.value,
  "95% conf interval" = paste0(
    "[",
    paste(round(c_test$conf.int, 3), collapse = "; "),
    "]")
), align = "c")
```
Of course, these are just preliminary results; from a methodological point of
view, there is still much to do...
# The corrections we did
To make this document look *much* nicer immediately, the following changes
were made:
* Suppress warnings and messages by default
* Set line spacing to one and a half (just looked it up on the internet)
* Show only the first lines of the table at the beginning instead of the
whole table
* Do not show the R code for this table, since it is not meaningful here
* Use `knitr::kable()` to print tables
* Do not show the code for preparing the plots; it is not necessary to
understand the message
* Adjust the `out.width` and `out.height` options in the plot chunk so that
the plots are easier to read
* Show the code used to compute the correlation coefficient in a readable
way, but hide its raw output and report the relevant value concisely in the
text
* Report the result of the Pearson correlation test in a more concise way

Of course, the last sentence of the document still holds: to analyze these
data in a meaningful way, we must put more thought into choosing an
appropriate analysis method!
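The last two corrections (reporting the correlation coefficient and the Pearson test concisely) could also be handled by a small helper that turns a `cor.test()` result into one short sentence. The following is only a sketch under stated assumptions: `summarise_cor_test()` is a hypothetical helper name that does not appear in the original files, and the built-in `mtcars` data stands in for the flight data so that the example runs without the `DataScienceExercises` package.

```r
# Hypothetical helper (not part of the original documents): condenses the
# result of a Pearson cor.test() into a single sentence.
summarise_cor_test <- function(test, digits = 3) {
  stopifnot(inherits(test, "htest"))
  # Indexing drops the conf.level attribute, leaving plain numbers
  ci <- round(test$conf.int, digits)
  # Very small p-values are reported as a bound rather than as 0
  p_txt <- if (test$p.value < 0.001) {
    "p < 0.001"
  } else {
    paste0("p = ", round(test$p.value, digits))
  }
  paste0(
    "r = ", round(unname(test$estimate), digits),
    ", t(", unname(test$parameter), ") = ",
    round(unname(test$statistic), digits),
    ", ", p_txt,
    ", 95% CI [", ci[1], "; ", ci[2], "]"
  )
}

# mtcars ships with R and is used here only as stand-in data
example_test <- cor.test(x = mtcars$mpg, y = mtcars$wt, method = "pearson")
summary_text <- summarise_cor_test(example_test)
print(summary_text)
```

In the corrected document, such a helper could be called inline on the existing test object, for example `summarise_cor_test(c_test)`, as an alternative to the `knitr::kable()` table.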