Example algorithm for choosing a single batch-size. This example does a trial-run with a few records and tries to extrapolate the $batch_size that we can get away with (see the rough PHP sketch after the list).

  • Take the first 10 records. Import them as a batch. Record performance characteristics of the trial-run:
    • The $base_request_memory is the memory_get_usage() at the start of the page-load (before executing any imports).
    • The $avg_rec_size is based on the average size of an import record in this batch.
    • The $avg_rec_duration is derived from microtime(1) (before+after importing the trial batch).
    • The $avg_rec_memory is derived from memory_get_usage() or memory_get_peak_usage() (before+after importing the trial batch).
  • Pick general constraints
    • The $max_rec_size is based on the largest record in the entire import.
    • The $max_request_duration is min(ini_get(max_execution_time), Civi::settings()->get('import_max_duration'))
    • The $max_request_memory is min(ini_get(memory_limit), Civi::settings()->get('import_max_memory')) - $base_request_memory.
    • The $scale_factor is $max_rec_size/$avg_rec_size.
  • Calculate batch size based on different resource-constraints.
    • The $batch_size_by_memory is $max_request_memory / ($avg_rec_memory * $scale_factor).
    • The $batch_size_by_duration is $max_request_duration / ($avg_rec_duration * $scale_factor).
    • The $batch_size is min($batch_size_by_memory, $batch_size_by_duration)
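
As a rough illustration (untested, like the formula itself), the calculation might look something like this in PHP. The helpers `importBatch()` and `recordSize()` are hypothetical stand-ins for whatever the real importer provides; `parseIniBytes()` only exists to turn `memory_limit` shorthand like `128M` into bytes. The setting names `import_max_duration` and `import_max_memory` are the ones used above.

```php
<?php
// Sketch of the batch-size heuristic described above. importBatch($records) and
// recordSize($record) are hypothetical placeholders for the real importer.

// Convert a php.ini shorthand value like "128M" to bytes.
// (Does not handle the special value -1 / "unlimited".)
function parseIniBytes(string $value): int {
  $value = trim($value);
  $number = (float) $value;
  switch (strtoupper(substr($value, -1))) {
    case 'G': return (int) ($number * 1024 * 1024 * 1024);
    case 'M': return (int) ($number * 1024 * 1024);
    case 'K': return (int) ($number * 1024);
    default: return (int) $number;
  }
}

function estimateBatchSize(array $allRecords): int {
  $trial = array_slice($allRecords, 0, 10);

  // --- Trial run: import the first 10 records and measure them. ---
  $base_request_memory = memory_get_usage();
  $startTime = microtime(TRUE);
  importBatch($trial);                                   // hypothetical importer call
  $trialDuration = microtime(TRUE) - $startTime;
  $trialMemory = memory_get_peak_usage() - $base_request_memory;

  $avg_rec_duration = $trialDuration / count($trial);
  $avg_rec_memory = $trialMemory / count($trial);
  $avg_rec_size = array_sum(array_map('recordSize', $trial)) / count($trial);

  // --- General constraints. ---
  $max_rec_size = max(array_map('recordSize', $allRecords));
  $max_request_duration = min(
    (int) ini_get('max_execution_time'),
    Civi::settings()->get('import_max_duration')
  );
  $max_request_memory = min(
    parseIniBytes(ini_get('memory_limit')),
    Civi::settings()->get('import_max_memory')
  ) - $base_request_memory;
  $scale_factor = $max_rec_size / $avg_rec_size;

  // --- Batch size under each constraint; take whichever is smaller. ---
  $batch_size_by_memory = $max_request_memory / ($avg_rec_memory * $scale_factor);
  $batch_size_by_duration = $max_request_duration / ($avg_rec_duration * $scale_factor);

  return max(1, (int) floor(min($batch_size_by_memory, $batch_size_by_duration)));
}
```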

This formula is not at all tested and could be totally wrong. It makes some significant assumptions, e.g.

  • It assumes that the first 10 records are representative of the entire data-set.
  • It assumes that runtime scales linearly with qty+size of records. Well, that feels true...
  • It assumes that memory usage scales linearly with qty+size of records. This feels kinda wrong/misleading. It depends on how batches are loaded, what kind of memory-leaks there are, and what other data-structures are loaded. I would guess that caches+code have more impact on the overall memory footprint. But who knows...