jmclawson/special_imports.rmd

## special_imports.rmd
---
title: "Special cases of data importing in R"
---

The typical methods of importing data make it straightforward to import a single CSV file. But data will often be prepared in some other format. Some common scenarios include a folder of smaller CSV files or data prepared for use with SAS.

## Importing many CSV files

It is common to read in multiple CSV files and combine their data frames. Beyond a certain number, the process should be automated.

### Prepare sample data

For the purposes of this section, the first step is to prepare a sample data set to make this document replicable anywhere. This code divides data from the `nycflights13` package into a folder of 16 smaller CSV files to be read.

```{r}
library(dplyr)
library(readr)

if(!dir.exists("carriers")){dir.create("carriers")}

nycflights13::flights %>%
  group_by(carrier) %>%
  group_walk(~ write_csv(.x, file = paste0("carriers/", .y$carrier, ".csv")))
```

Each of the CSV files in the "carriers" directory now has a subset of the full data, minus its carrier name. These files can be imported all at once to recreate the original data set.

### Identify the files for importing

To read a list of files, it's necessary first to have a list. The `list.files()` function makes it easy to find all the files in a folder that match any given pattern, specified with regular expressions.

```{r}
all_filenames <- list.files(path = "carriers", pattern = "*.csv", full.names = TRUE)
```

### Import all the files

The `map()` function from `purrr` makes it easy to use `read_csv()` on every item in the list of file names. Especially when reading a large number of files, the `read_csv()` function can be noisy, but it can be quietened by setting `show_col_types` to `FALSE`. The resulting object is a list of data frames:

```{r}
library(purrr)
library(readr)

imported_data <- map(all_filenames, read_csv, show_col_types = FALSE)

class(imported_data)
```

Use `bind_rows()` to combine all of these data frames into a single large data frame.

```{r}
combined_data <- imported_data |>
  bind_rows()

class(combined_data)
dim(combined_data)
```

Because the column was encoded into the file names of the CSVs, the table lacks the distinguishing "carrier" information. It can be added by setting names with `set_names()` before appropriating these names for a named id column in `bind_rows()`.

```{r}
library(stringr)

carrier_names <- all_filenames |>
  str_remove_all("carriers/") |>
  str_remove_all("[.]csv")

combined_data <- imported_data |>
  set_names(carrier_names) |>
  bind_rows(.id = "carrier")

dim(combined_data)
```

Relocating the "carrier" column and rearranging the rows are all that's necessary to get back to something like the original.

```{r}
combined_data <-
  combined_data |>
  relocate(carrier, .before = flight) |>
  arrange(year, month, day, dep_time)
```

If the process is used regularly, it might be wise to create a function.

```{r}
combine_csvs <- function(dir, id = ".id"){

  full_paths <- list.files(dir,
                           pattern = "*.csv",
                           full.names = TRUE)

  simple_names <- full_paths |>
    stringr::str_remove_all(paste0(dir,"[/]")) |>
    stringr::str_remove_all("[.]csv")

  full_paths |>
    purrr::map(readr::read_csv, show_col_types = FALSE) |>
    purrr::set_names(simple_names) |>
    dplyr::bind_rows(.id = {{id}})
}
```

## Importing SAS data with `haven`

Reading other data types such as data from SAS is easily done with the `haven` package.

### Prepare sample data

For the purposes of this section, the first step is to download a sample data set once so that it can later be imported.

```{r}
if(!file.exists("medical.sas7bdat")) {
    download.file("http://www.principlesofeconometrics.com/sas/medical.sas7bdat",
                  destfile = "medical.sas7bdat")
  }
```

### Load the package and read the data with read_sas()

The `read_sas()` function is straightforward.

```{r}
library(haven)
medical <- read_sas("medical.sas7bdat")
class(medical)
attributes(medical)$label
```

### Access column labels

Data imported from SAS might have labels attached to columns. These are shown in RStudio's data viewer, but they can also be accessed with code using the `attr()` function to get the "label" attribute.

```{r}
attr(medical$ID, "label")
```

To see the attributes for multiple columns at once, either use the `purrr` package's `map()` function, or write a simple loop.

```{r}
for(i in 1:ncol(medical)){
  this_column <- paste(colnames(medical)[i],
                       attr(medical[[i]], "label"))
  message(this_column)
}
```
	---
	title: "Special cases of data importing in R"
	---

	The typical methods of importing data make it straightforward to import a single CSV file. But data will often be prepared in some other format. Some common scenarios include a folder of smaller CSV files or data prepared for use with SAS.

	## Importing many CSV files

	It is common to read in multiple CSV files and combine their data frames. Beyond a certain number, the process should be automated.

	### Prepare sample data

	For the purposes of this section, the first step is to prepare a sample data set to make this document replicable anywhere. This code divides data from the `nycflights13` package into a folder of 16 smaller CSV files to be read.

	```{r}
	library(dplyr)
	library(readr)

	if(!dir.exists("carriers")){dir.create("carriers")}

	nycflights13::flights %>%
	group_by(carrier) %>%
	group_walk(~ write_csv(.x, file = paste0("carriers/", .y$carrier, ".csv")))
	```

	Each of the CSV files in the "carriers" directory now has a subset of the full data, minus its carrier name. These files can be imported all at once to recreate the original data set.

	### Identify the files for importing

	To read a list of files, it's necessary first to have a list. The `list.files()` function makes it easy to find all the files in a folder that match any given pattern, specified with regular expressions.

	```{r}
	all_filenames <- list.files(path = "carriers", pattern = "*.csv", full.names = TRUE)
	```

	### Import all the files

	The `map()` function from `purrr` makes it easy to use `read_csv()` on every item in the list of file names. Especially when reading a large number of files, the `read_csv()` function can be noisy, but it can be quietened by setting `show_col_types` to `FALSE`. The resulting object is a list of data frames:

	```{r}
	library(purrr)
	library(readr)

	imported_data <- map(all_filenames, read_csv, show_col_types = FALSE)

	class(imported_data)
	```

	Use `bind_rows()` to combine all of these data frames into a single large data frame.

	```{r}
	combined_data <- imported_data \|>
	bind_rows()

	class(combined_data)
	dim(combined_data)
	```

	Because the column was encoded into the file names of the CSVs, the table lacks the distinguishing "carrier" information. It can be added by setting names with `set_names()` before appropriating these names for a named id column in `bind_rows()`.

	```{r}
	library(stringr)

	carrier_names <- all_filenames \|>
	str_remove_all("carriers/") \|>
	str_remove_all("[.]csv")

	combined_data <- imported_data \|>
	set_names(carrier_names) \|>
	bind_rows(.id = "carrier")

	dim(combined_data)
	```

	Relocating the "carrier" column and rearranging the rows are all that's necessary to get back to something like the original.

	```{r}
	combined_data <-
	combined_data \|>
	relocate(carrier, .before = flight) \|>
	arrange(year, month, day, dep_time)
	```

	If the process is used regularly, it might be wise to create a function.

	```{r}
	combine_csvs <- function(dir, id = ".id"){

	full_paths <- list.files(dir,
	pattern = "*.csv",
	full.names = TRUE)

	simple_names <- full_paths \|>
	stringr::str_remove_all(paste0(dir,"[/]")) \|>
	stringr::str_remove_all("[.]csv")

	full_paths \|>
	purrr::map(readr::read_csv, show_col_types = FALSE) \|>
	purrr::set_names(simple_names) \|>
	dplyr::bind_rows(.id = {{id}})
	}
	```

	## Importing SAS data with `haven`

	Reading other data types such as data from SAS is easily done with the `haven` package.

	### Prepare sample data

	For the purposes of this section, the first step is to download a sample data set once so that it can later be imported.

	```{r}
	if(!file.exists("medical.sas7bdat")) {
	download.file("http://www.principlesofeconometrics.com/sas/medical.sas7bdat",
	destfile = "medical.sas7bdat")
	}
	```

	### Load the package and read the data with read_sas()

	The `read_sas()` function is straightforward.

	```{r}
	library(haven)
	medical <- read_sas("medical.sas7bdat")
	class(medical)
	attributes(medical)$label
	```

	### Access column labels

	Data imported from SAS might have labels attached to columns. These are shown in RStudio's data viewer, but they can also be accessed with code using the `attr()` function to get the "label" attribute.

	```{r}
	attr(medical$ID, "label")
	```

	To see the attributes for multiple columns at once, either use the `purrr` package's `map()` function, or write a simple loop.

	```{r}
	for(i in 1:ncol(medical)){
	this_column <- paste(colnames(medical)[i],
	attr(medical[[i]], "label"))
	message(this_column)
	}
	```