
{data.table} + {fst}

For medium-sized (not big) data that runs into RAM limits, when I want a native R solution, I often use {fst} + {data.table}. The keys are

  • load only the needed data into RAM
  • during data wrangling, whenever possible, do NOT shallow/deep copy objects; use 'reference semantics' (e.g. {data.table}) to modify in-place (see the sketch after this list)
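A minimal combined sketch of both keys, assuming a file 'data.fst' like the one created further below (names are illustrative):

library(fst) ; library(data.table)

# Key 1: pull only the needed columns (and rows) from disk into RAM
dt <- read_fst('data.fst', columns = c('integer', 'real'), as.data.table = TRUE)

# Key 2: transform by reference, so no shallow/deep copy of dt is made
dt[, real := sqrt(real)]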

Quote from Lightning fast serialization of data frames using the fst package

For a few years now, solid state disks (SSD’s) have been getting larger in capacity, faster and much cheaper. It’s not uncommon to find a high-performance SSD with speeds of up to multiple GB/s in a medium-end laptop. At the same time, the number of cores per CPU keeps growing. The combination of these two trends are opening up the way for data science to work on large data sets using a very modest computer setup.

{fst}

Use {fst} for fast serialization and random access (read only what you need into memory)

Resources

suppressMessages({ library(fst) ; library(data.table) })
# {fst}'s only non-base dependency is {Rcpp}; {data.table} has none
sapply(c('fst', 'data.table'), 
       function(x) tools::package_dependencies(x, recursive = TRUE))
# $fst.fst
# [1] "Rcpp"    "methods" "utils"

# $data.table.data.table
# [1] "methods"

n_rows <- 2e8
dt <- data.table(
  logical = sample(c(TRUE, FALSE, NA), n_rows, replace = TRUE, c(.6, .2, .2)),
  integer = sample(seq_len(100L), n_rows, replace = TRUE),
  real    = sample( sample(seq_len(1e4), 20) / 100, n_rows, replace = TRUE ),
  factor  = as.factor(sample(labels(UScitiesD), n_rows, replace = TRUE))
)
format(object.size(dt), 'auto')
# [1] "3.7 Gb"

write_fst(dt, 'data.fst', compress = 50)
utils:::format.object_size(file.size('data.fst'), 'auto')
# [1] "1.1 Gb"

library(fst)
threads_fst(8)  # allow fst to use 8 threads

dt_orig <- read_fst('data.fst', as.data.table = TRUE)
# Random access: read only what you need into memory 
dt_small <- read_fst('data.fst', 
  columns = c('logical', 'integer'),
  from = 1, to = nrow(dt_orig) / 2, as.data.table = TRUE)

sapply(list(dt_orig, dt_small), 
       function(x) format(object.size(x), 'auto'))
# [1] "3.7 Gb"   "762.9 Mb"

Then use {data.table} reference semantics to modify objects in-place, that is, without making shallow/deep copies. This reduces memory use and, in turn, improves speed.
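For example, continuing with dt_small from above (a minimal sketch using the column names of that example):

# Each of these modifies dt_small by reference: no shallow/deep copy is made
dt_small[is.na(logical), logical := FALSE]      # update a subset in-place
dt_small[, logical_int := as.integer(logical)]  # add a derived column in-place
set(dt_small, j = 'logical', value = NULL)      # drop a column in-place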

{data.table}

Use {data.table} for fast and memory-efficient data wrangling, specifically its reference semantics, which modify objects in-place (no shallow/deep copies), reducing memory use and, in turn, improving speed.
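A quick way to see the effect (a minimal sketch, not from the original gist): data.table::address() shows that := leaves the object at the same memory address, while base R assignment typically copies the whole data frame.

library(data.table)

dt <- data.table(x = runif(1e6))
address(dt)        # memory address before
dt[, y := x * 2]   # add a column by reference
address(dt)        # same address: modified in-place, no copy made

df <- as.data.frame(dt)
tracemem(df)       # base R reports whenever df gets duplicated
df$z <- df$x * 2   # typically triggers a copy of the data frame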

Resources

Some of my favorite techniques, based on the links above, for efficiently applying the same function to many columns:

dt <- data.table( matrix(runif(10000), nrow = 100) )

# A few variants

for (col in paste0('V', 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))

for (col in paste0('V', 20:100))
  dt[, (col) := sqrt(dt[[col]])]

lapply(paste0('V', 20:100), function(col) dt[, (col) := sqrt(get(col))])
# I prefer `purrr::map` to `for`
library(purrr)
map(paste0('V', 20:100), ~ dt[, (.) := sqrt(get(.))])