@colearendt
Last active December 5, 2018 03:53
async processes to speed up database logging

Is async in R always the right answer?

I have a database logging operation that takes about 1/3s to 1/2s. I do a lot of these logging operations, enough that I am worried they could become a bottleneck that keeps my app from feeling snappy.
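
For context, here is a minimal sketch of the kind of logging call being benchmarked. The real log_event() is not included in this gist, so the table and column names below are assumptions:

log_event <- function(con, schema, prefix, event, session) {
  # hypothetical events table named "<prefix>_events" in the given schema
  tbl <- DBI::Id(schema = schema, table = paste0(prefix, "_events"))
  DBI::dbWriteTable(
    con,
    tbl,
    data.frame(
      event = event,
      session = session,
      timestamp = Sys.time(),
      stringsAsFactors = FALSE
    ),
    append = TRUE
  )
}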

bench::mark(log_event(con, schema=schema, prefix = prefix, event = "benchmark", session = "session"))
# # A tibble: 1 x 14
#   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr total_time result memory time 
#   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list> <list> <lis>
# 1 "log_even… 376ms 379ms  379ms 382ms      2.64    44.5KB     0     2      758ms <data… <Rpro… <bch…
# # ... with 1 more variable: gc <list>

I wanted to explore whether throwing that logging operation into a background process would speed things up. After all, I only need:

  • the connection object
  • the query

Two approaches here:

  • callr (except it does not support serializing a DBI::connection object)
  • future / promises (which are supported well in Shiny)

callr

Faster, but no dice

explore2 <- function(query) {
  # r_bg() starts a fresh R session, so `query` has to be passed in via `args`
  callr::r_bg(function(query) {Sys.sleep(1); print(query)}, args = list(query = query))
}

# results of bench::mark
# # A tibble: 1 x 14
#   expression    min   mean median    max `itr/sec` mem_alloc  n_gc n_itr total_time result memory
#   <chr>      <bch:> <bch:> <bch:> <bch:>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list> <list>
# 1 {...       27.3ms 45.7ms 43.7ms 70.7ms      21.9    94.1KB     0    12      548ms <S3: … <Rpro…
# # ... with 2 more variables: time <list>, gc <list>

I suspect this is fast largely because nothing is copied over from the parent session; it is just a bare R session starting up and taking the work off my hands.

Of course, this doesn't work for my case because I cannot serialize the connection object. I could pass the connection parameters instead and have the background process open its own connection, as sketched below, but that will get annoying... (still, it runs in about 1/10th of the time).
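
If I went that route, it might look something like this (a sketch only; the driver and the shape of conn_params are assumptions):

log_event_bg <- function(query, conn_params) {
  callr::r_bg(
    function(query, conn_params) {
      # each background process has to open and clean up its own connection
      con <- do.call(DBI::dbConnect, c(list(RPostgres::Postgres()), conn_params))
      on.exit(DBI::dbDisconnect(con), add = TRUE)
      DBI::dbExecute(con, query)
    },
    args = list(query = query, conn_params = conn_params)
  )
}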

There are also some thoughts about using callr::r_session$new() to keep a persistent background session around, but it has some limitations as well. A rough sketch of that idea follows.
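
Roughly (the connection details and the events table here are hypothetical):

library(callr)

# one long-lived background R session that owns its own connection
rs <- r_session$new()

rs$run(function() {
  # hypothetical connection, stored in the background session's global env
  .GlobalEnv$con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "logs")
  TRUE
})

# non-blocking: the insert runs in the background session; results can be
# collected later with rs$poll_process() / rs$read() if needed
rs$call(function(query) DBI::dbExecute(con, query),
        args = list(query = "INSERT INTO app_events (event) VALUES ('benchmark')"))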

future

Faster... but not worth the effort. The time taken is the same order of magnitude, so there is little benefit, and definitely not enough to warrant the added complexity.

library(future)

explore <- function(query, con) {
  future({Sys.sleep(1); DBI::dbGetQuery(con, query)})
}

# results of bench::mark
# # A tibble: 1 x 14
#   expression   min  mean median   max `itr/sec` mem_alloc  n_gc n_itr total_time result memory time 
#   <chr>      <bch> <bch> <bch:> <bch>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list> <list> <lis>
# 1 {...       223ms 229ms  224ms 240ms      4.37    55.6KB     0     3      687ms <S3: … <Rpro… <bch…
# # ... with 1 more variable: gc <list>
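
One wiring detail the snippet above glosses over: future() only runs off the main R process once a plan() is set, and in Shiny the result is typically chained with promises rather than waited on. A hedged sketch of that fire-and-forget pattern (the function name is mine, and whether a connection opened in the parent is usable from a worker depends on the plan and the driver):

library(future)
library(promises)

plan(multisession)  # background R workers; assumption: the connection still works there

log_event_async <- function(con, query) {
  fut <- future({ DBI::dbGetQuery(con, query) })
  # fire-and-forget: surface a warning if the background query fails
  then(fut,
       onFulfilled = function(value) invisible(value),
       onRejected  = function(err) warning("logging failed: ", conditionMessage(err)))
}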

Now what?

I would love to have either:

  • r_bg able to serialize a connection object, or
  • the ability to run a persistent background process (one that can be restarted if it fails) that I can throw my async tasks at, as a sort of queue. I made such a request, and we shall see what comes of it!

For now, I will probably just stay synchronous. It is easier to reason about, and not a whole lot slower than the only working alternative I have come up with; 1/3s to 1/2s does not seem slow enough to gain much benefit from asynchronous code in R today. Either that, or I will hard-code my connection parameters so that the async process can open a connection itself. Creating and dropping a database connection is a lot of overhead for each background process, though.
