For medium (not big) data with RAM issues, my go-to native R solution is {fst} + {data.table}. The key ideas are:
- load only needed data in RAM
- during data wrangling, avoid shallow/deep copies of objects whenever possible; use 'reference semantics' (e.g. {data.table}) to modify objects in place
Quote from Lightning fast serialization of data frames using the fst package
For a few years now, solid state disks (SSD’s) have been getting larger in capacity, faster and much cheaper. It’s not uncommon to find a high-performance SSD with speeds of up to multiple GB/s in a medium-end laptop. At the same time, the number of cores per CPU keeps growing. The combination of these two trends are opening up the way for data science to work on large data sets using a very modest computer setup.
Use {fst} for fast serialization and random access (read only what you need into memory)
Resources
- {fstpackage/fst}
- Lightning fast serialization of data frames using the fst package
- Multi-threaded LZ4 and ZSTD compression from R
suppressMessages({ library(fst) ; library(data.table) })
# {fst}'s only non-base dependency is {Rcpp}; {data.table} depends only on base R
sapply(c('fst', 'data.table'),
       function(x) tools::package_dependencies(x, recursive = TRUE))
# $fst.fst
# [1] "Rcpp" "methods" "utils"
# $data.table.data.table
# [1] "methods"
n_rows <- 2e8
dt <- data.table(
  logical = sample(c(TRUE, FALSE, NA), n_rows, replace = TRUE, prob = c(.6, .2, .2)),
  integer = sample(seq_len(100L), n_rows, replace = TRUE),
  real    = sample(sample(seq_len(1e4), 20) / 100, n_rows, replace = TRUE),
  factor  = as.factor(sample(labels(UScitiesD), n_rows, replace = TRUE))
)
format(object.size(dt), 'auto')
# [1] "3.7 Gb"
write_fst(dt, 'data.fst', compress = 50)
utils:::format.object_size(file.size('data.fst'), 'auto')
# [1] "1.1 Gb"
library(fst)
threads_fst(8) # allow fst to use 8 threads
dt_orig <- read_fst('data.fst', as.data.table = TRUE)
# Random access: read only what you need into memory
dt_small <- read_fst('data.fst',
                     columns = c('logical', 'integer'),
                     from = 1, to = nrow(dt_orig) / 2, as.data.table = TRUE)
sapply(list(dt_orig, dt_small),
       function(x) format(object.size(x), 'auto'))
# [1] "3.7 Gb" "762.9 Mb"
Then use {data.table} for fast, memory-efficient data wrangling, in particular its reference semantics to modify objects in place: no shallow/deep copies are made, which reduces memory usage and, in turn, improves speed. A small sketch of these techniques follows the resource list below.
Resources
- Get started with {data.table}
- Reference semantics
- Keys and fast binary search based subset
- Secondary indices and auto indexing
- Elegantly assigning multiple columns in data.table with lapply()
- Add a row by reference at the end of a data.table object
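A minimal sketch of the keyed-subset, secondary-index and by-reference ideas covered by the links above (the toy table and column names are mine, purely for illustration):
dt <- data.table(id = 1:5, city = c('A', 'B', 'C', 'D', 'E'), value = runif(5))
# Reference semantics: add/overwrite a column in place, no copy of dt is made
dt[, value2 := value * 100]
# Key: physically reorder by 'id' once, then subset via fast binary search
setkey(dt, id)
dt[.(3L)]             # binary search on the key instead of a full vector scan
# Secondary index: keep the current row order, but index 'city' for fast lookups
setindex(dt, city)
dt['C', on = 'city']
# Appending rows: rbindlist() is the usual fast approach (true by-reference
# row addition is the subject of the last link above)
dt <- rbindlist(list(dt, data.table(id = 6L, city = 'F', value = 0.5, value2 = 50)))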
Some of my favorite techniques from the links above for efficiently applying the same function to many columns:
dt <- data.table( matrix(runif(10000), nrow = 100) )
# A few variants
for (col in paste0('V', 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))
for (col in paste0('V', 20:100))
  dt[, (col) := sqrt(dt[[col]])]
lapply(paste0('V', 20:100), function(col) dt[, (col) := sqrt(get(col))])
# I prefer `purrr::map` to `for`
library(purrr)
map(paste0('V', 20:100), ~ dt[, (.) := sqrt(get(.))])
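For completeness, the single-call .SDcols idiom (which, if I recall correctly, is what the 'Elegantly assigning multiple columns in data.table with lapply()' link boils down to) updates all target columns at once, still by reference:
cols <- paste0('V', 20:100)
dt[, (cols) := lapply(.SD, sqrt), .SDcols = cols]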