Skip to content

Instantly share code, notes, and snippets.

@iangow
Created June 9, 2024 21:32
Show Gist options
  • Save iangow/78159b1594a95dcdd092f037e15f583b to your computer and use it in GitHub Desktop.
Save iangow/78159b1594a95dcdd092f037e15f583b to your computer and use it in GitHub Desktop.
Another version of `port_sorts.qmd`.

Fast Portfolio Sorts Revisited

library(tidyverse)
library(dtplyr)
library(data.table)
library(bench)
library(DBI)
tidy_finance <- dbConnect(
  RSQLite::SQLite(),
  "data/tidy_finance_r.sqlite",
  extended_types = TRUE
)

crsp_monthly_dplyr <- tbl(tidy_finance, "crsp_monthly") |>
  select(permno, month, ret_excess, mktcap_lag) |>
  collect()

crsp_monthly_dt <- copy(as.data.table(crsp_monthly_dplyr))
crsp_monthly_dtplyr <- lazy_dt(copy(as.data.table(crsp_monthly_dplyr)))

The following version of assign_portfolio() operates on a single vector. Perhaps this is a better set-up for data.table … it certainly seems more “tidy” than the “whole data frame” approach in the code here.1

assign_portfolio <- function(x, n_portfolios) {
  probs <- seq(0, 1, length.out = n_portfolios + 1)
  breakpoints <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  findInterval(x, breakpoints, all.inside = TRUE)
}

Now, I apply assign_portfolio() in three different ways. First, using a regular dplyr data frame.

sort_dplyr <- function() {
  crsp_monthly_dplyr |> 
    group_by(month) |> 
    mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |> 
    group_by(portfolio) |> 
    summarize(ret = mean(ret_excess)) |> 
    as.data.frame()
}

Second, using data.table and dtplyr.

sort_dt <- function() {
  crsp_monthly_dt |> 
    group_by(month) |> 
    mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |> 
    group_by(portfolio) |> 
    summarize(ret = mean(ret_excess)) |> 
    as.data.frame()
}

Third, using data.table and dtplyr, but applied to the version returned by lazy_dt().

sort_dtplyr <- function() {
  crsp_monthly_dtplyr |> 
    group_by(month) |> 
    mutate(portfolio = assign_portfolio(mktcap_lag, n_portfolios = 10)) |> 
    group_by(portfolio) |> 
    summarize(ret = mean(ret_excess)) |> 
    as.data.frame()
}

Now, run these three.

iterations <- 100

results <- bench::mark(
  sort_dplyr(), sort_dt(), sort_dtplyr(), 
  iterations = iterations
)
autoplot(results, type = "violin") +
  labs(y = NULL, x = NULL, 
       title = "Execution time of porfolio sorts using dplyr and dtplyr",
       subtitle = "'sort_dt' uses `dtplyr` and 'sort_dtplyr' simply adds `lazy_dt()`")

Footnotes

  1. Of course, the pipe (|>) operates on a whole data frame, but I think this makes it seem even odder to have a function that is applied using the pipe that takes and returns a data frame.

@iangow
Copy link
Author

iangow commented Jun 9, 2024

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment