For medium (not big) data with RAM issues, my go-to native R solution is {fst} + {data.table}. The key ideas are:
- load only needed data in RAM
- during data wrangling, avoid shallow/deep copies of objects whenever possible; use 'reference semantics' (e.g. {data.table}) to modify objects in place
Quote from Lightning fast serialization of data frames using the fst package
For a few years now, solid state disks (SSD’s) have been getting larger in capacity, faster and much cheaper. It’s not uncommon to find a high-performance SSD with speeds of up to multiple GB/s in a medium-end laptop. At the same time, the number of cores per CPU keeps growing. The combination of these two trends are opening up the way for data science to work on large data sets using a very modest computer setup.
Use {fst} for fast serialization and random access (read only what you need into memory)
Resources
- {fstpackage/fst}
- Lightning fast serialization of data frames using the fst package
- Multi-threaded LZ4 and ZSTD compression from R
suppressMessages({ library(fst) ; library(data.table) })
# {fst}'s only non-base dependency is {Rcpp}; {data.table} depends only on base R
sapply(c('fst', 'data.table'),
       function(x) tools::package_dependencies(x, recursive = TRUE))
# $fst.fst
# [1] "Rcpp" "methods" "utils"
# $data.table.data.table
# [1] "methods"
n_rows <- 2e8
dt <- data.table(
  logical = sample(c(TRUE, FALSE, NA), n_rows, replace = TRUE, prob = c(.6, .2, .2)),
  integer = sample(seq_len(100L), n_rows, replace = TRUE),
  real    = sample(sample(seq_len(1e4), 20) / 100, n_rows, replace = TRUE),
  factor  = as.factor(sample(labels(UScitiesD), n_rows, replace = TRUE))
)
format(object.size(dt), 'auto')
# [1] "3.7 Gb"
write_fst(dt, 'data.fst', compress = 50)
utils:::format.object_size(file.size('data.fst'), 'auto')
# [1] "1.1 Gb"
library(fst)
threads_fst(8) # allow fst to use 8 threads
dt_orig <- read_fst('data.fst', as.data.table = TRUE)
# Random access: read only what you need into memory
dt_small <- read_fst('data.fst',
                     columns = c('logical', 'integer'),
                     from = 1, to = nrow(dt_orig) / 2, as.data.table = TRUE)
sapply(list(dt_orig, dt_small),
       function(x) format(object.size(x), 'auto'))
# [1] "3.7 Gb" "762.9 Mb"
Then use {data.table} for fast, memory-efficient data wrangling, in particular its reference semantics to modify objects in place: no shallow/deep copies are made, which reduces memory usage and, in turn, improves speed. A small sketch of these techniques follows the resource list below.
Resources
- Get started with {data.table}
- Reference semantics
- Keys and fast binary search based subset
- Secondary indices and auto indexing
- Elegantly assigning multiple columns in data.table with lapply()
- Add a row by reference at the end of a data.table object
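A minimal sketch of the keyed-subset, secondary-index and by-reference ideas covered by the links above (the toy table and column names are mine, purely for illustration):
dt <- data.table(id = 1:5, city = c('A', 'B', 'C', 'D', 'E'), value = runif(5))
# Reference semantics: add/overwrite a column in place, no copy of dt is made
dt[, value2 := value * 100]
# Key: physically reorder by 'id' once, then subset via fast binary search
setkey(dt, id)
dt[.(3L)]             # binary search on the key instead of a full vector scan
# Secondary index: keep the current row order, but index 'city' for fast lookups
setindex(dt, city)
dt['C', on = 'city']
# Appending rows: rbindlist() is the usual fast approach (true by-reference
# row addition is the subject of the last link above)
dt <- rbindlist(list(dt, data.table(id = 6L, city = 'F', value = 0.5, value2 = 50)))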
Some of my favorite techniques from the links above for efficiently applying the same function to many columns:
dt <- data.table( matrix(runif(10000), nrow = 100) )
# A few variants
for (col in paste0('V', 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))
for (col in paste0('V', 20:100))
  dt[, (col) := sqrt(dt[[col]])]
lapply(paste0('V', 20:100), function(col) dt[, (col) := sqrt(get(col))])
# I prefer `purrr::map` to `for`
library(purrr)
map(paste0('V', 20:100), ~ dt[, (.) := sqrt(get(.))])
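For completeness, the single-call .SDcols idiom (which, if I recall correctly, is what the 'Elegantly assigning multiple columns in data.table with lapply()' link boils down to) updates all target columns at once, still by reference:
cols <- paste0('V', 20:100)
dt[, (cols) := lapply(.SD, sqrt), .SDcols = cols]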