@jthomasmock
Created October 13, 2023 19:37
``` r
library(dplyr)
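# vector operation: assign each row to a chunk of `chunk_size` rows
# (relies on row_number(), so it must be called inside a dplyr verb such as mutate())
chunk_vec <- function(chunk_size) {
  ceiling(row_number() / chunk_size)
}
# operate on dataframe: add the chunk id, then split into a list of tibbles
chunk_df <- function(.data, chunk_size) {
  mutate(.data, chunk = chunk_vec(chunk_size)) %>%
    group_by(chunk) %>%
    group_split()
}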
mtcars |>
chunk_df(12)
#> <list_of<
#> tbl_df<
#> mpg : double
#> cyl : double
#> disp : double
#> hp : double
#> drat : double
#> wt : double
#> qsec : double
#> vs : double
#> am : double
#> gear : double
#> carb : double
#> chunk: double
#> >
#> >[3]>
#> [[1]]
#> # A tibble: 12 × 12
#> mpg cyl disp hp drat wt qsec vs am gear carb chunk
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 1
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 1
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 1
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 1
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 1
#> 11 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4 1
#> 12 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3 1
#>
#> [[2]]
#> # A tibble: 12 × 12
#> mpg cyl disp hp drat wt qsec vs am gear carb chunk
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3 2
#> 2 15.2 8 276. 180 3.07 3.78 18 0 0 3 3 2
#> 3 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4 2
#> 4 10.4 8 460 215 3 5.42 17.8 0 0 3 4 2
#> 5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4 2
#> 6 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 2
#> 7 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 2
#> 8 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 2
#> 9 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 2
#> 10 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2 2
#> 11 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 2
#> 12 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4 2
#>
#> [[3]]
#> # A tibble: 8 × 12
#> mpg cyl disp hp drat wt qsec vs am gear carb chunk
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2 3
#> 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 3
#> 3 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 3
#> 4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 3
#> 5 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 3
#> 6 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 3
#> 7 15 8 301 335 3.54 3.57 14.6 0 1 5 8 3
#> 8 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 3
```
<sup>Created on 2023-10-13 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
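
The chunk assignment in `chunk_vec()` comes down to dividing each row position by the chunk size and rounding up. A minimal base R sketch of that arithmetic, outside dplyr, assuming 8 rows and a chunk size of 3 (both values are made up for illustration):

```r
# Illustrative values only; they do not come from the gist
chunk_size <- 3

# Row positions 1..8 divided by the chunk size, rounded up to the chunk id
chunk_id <- ceiling(seq_len(8) / chunk_size)
chunk_id
#> [1] 1 1 1 2 2 2 3 3
```

The last chunk is simply shorter when the row count is not a multiple of the chunk size, which is why the third tibble above has 8 rows instead of 12.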
@JosiahParry

An alternative take on the problem is to compute the indices of each chunk outside of dplyr.
The function below returns a list with two elements, `start` and `end`, which hold the start and end row indices of each chunk.

#' Get chunk indices
#' 
#' For a given number of items and a chunk size, determine the start and end
#' positions of each chunk.
#' 
#' @param n the number of rows
#' @param m the chunk size
chunk_indices <- function(n, m) {
  n_chunks <- ceiling(n/m) 
  chunk_starts <- seq(1, n, by = m) 
  chunk_ends <- seq_len(n_chunks) * m
  chunk_ends[n_chunks] <- n
  list(start = chunk_starts, end = chunk_ends)
}
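
One way these indices might be used is to slice a data frame in base R. This is only a sketch: `split_by_indices()` is a hypothetical helper that is not part of the gist or the comment, and `chunk_indices()` is repeated here so the block runs on its own.

```r
# Repeated from the comment above so this block is self-contained
chunk_indices <- function(n, m) {
  n_chunks <- ceiling(n / m)
  chunk_starts <- seq(1, n, by = m)
  chunk_ends <- seq_len(n_chunks) * m
  chunk_ends[n_chunks] <- n
  list(start = chunk_starts, end = chunk_ends)
}

# Hypothetical helper: slice `df` into a list of data frames, one per chunk
split_by_indices <- function(df, chunk_size) {
  idx <- chunk_indices(nrow(df), chunk_size)
  # Pair each start with its end and take that row range
  Map(function(s, e) df[s:e, , drop = FALSE], idx$start, idx$end)
}

chunks <- split_by_indices(mtcars, 12)
vapply(chunks, nrow, integer(1))
#> [1] 12 12  8
```

Unlike `chunk_df()`, this version adds no `chunk` column and keeps the original row names. It assumes the data frame has at least one row, since `seq(1, n, by = m)` errors when `n` is 0.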
