Example algorithm for choosing a single batch-size. This example does a trial-run with a few records and tries to extrapolate the `$batch_size` that we can get away with. (A rough PHP sketch of the same heuristic follows the outline below.)
- Take the first 10 records. Import them as a batch. Record performance characteristics of the trial-run:
    - The `$base_request_memory` is the `memory_get_usage()` at the start of the page-load (before executing any imports).
    - The `$avg_rec_size` is based on the average size of an import record in this batch.
    - The `$avg_rec_duration` is derived from `microtime(1)` (before+after importing the trial batch).
    - The `$avg_rec_memory` is derived from `memory_get_usage()` or `memory_get_peak_usage()` (before+after importing the trial batch).
- Pick general constraints:
    - The `$max_rec_size` is based on the largest record in the entire import.
    - The `$max_request_duration` is `min(ini_get('max_execution_time'), Civi::settings()->get('import_max_duration'))`.
    - The `$max_request_memory` is `min(ini_get('memory_limit'), Civi::settings()->get('import_max_memory')) - $base_request_memory`.
    - The `$scale_factor` is `$max_rec_size / $avg_rec_size`.
- Calculate batch size based on the different resource-constraints:
    - The `$batch_size_by_memory` is `$max_request_memory / ($avg_rec_memory * $scale_factor)`.
    - The `$batch_size_by_duration` is `$max_request_duration / ($avg_rec_duration * $scale_factor)`.
    - The `$batch_size` is `min($batch_size_by_memory, $batch_size_by_duration)`.
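
Below is an untested PHP sketch of that heuristic, assuming a bootstrapped CiviCRM environment. The `parseIniBytes()` helper and the use of `strlen(serialize(...))` as a record-size proxy are assumptions added for illustration; the actual import work is passed in as a callback because this sketch does not know how records are imported; and `import_max_duration`/`import_max_memory` are the (proposed) settings named in the outline above.

```php
<?php

/**
 * Convert a php.ini shorthand value like "128M" or "1G" to bytes.
 * (Does not handle "-1" / "0" meaning "unlimited".)
 */
function parseIniBytes(string $value): int {
  $number = (float) $value;
  switch (strtoupper(substr(trim($value), -1))) {
    case 'G': return (int) ($number * 1024 ** 3);
    case 'M': return (int) ($number * 1024 ** 2);
    case 'K': return (int) ($number * 1024);
    default:  return (int) $number;
  }
}

/**
 * Estimate a batch size by importing the first 10 records as a trial run.
 *
 * @param array $allRecords      All records queued for import.
 * @param callable $importBatch  Callback that actually imports an array of records.
 */
function estimateBatchSize(array $allRecords, callable $importBatch): int {
  // --- Trial run: import the first 10 records and measure. ---
  $trial = array_slice($allRecords, 0, 10);
  $base_request_memory = memory_get_usage();   // before executing any imports
  $start = microtime(TRUE);

  $importBatch($trial);

  $trial_duration = microtime(TRUE) - $start;
  $trial_memory = memory_get_peak_usage() - $base_request_memory;

  // Record size is approximated by the length of the serialized record.
  $sizeOf = fn($rec) => strlen(serialize($rec));
  $avg_rec_size = array_sum(array_map($sizeOf, $trial)) / count($trial);
  // max() guards against divide-by-zero in degenerate cases.
  $avg_rec_duration = max(1e-6, $trial_duration / count($trial));
  $avg_rec_memory = max(1, $trial_memory / count($trial));

  // --- General constraints. ---
  $max_rec_size = max(array_map($sizeOf, $allRecords));
  $max_request_duration = min(
    (int) ini_get('max_execution_time'),        // note: 0 means "unlimited"
    Civi::settings()->get('import_max_duration')
  );
  $max_request_memory = min(
    parseIniBytes(ini_get('memory_limit')),
    Civi::settings()->get('import_max_memory')
  ) - $base_request_memory;
  $scale_factor = $max_rec_size / $avg_rec_size;

  // --- Batch size under each resource constraint; take the smaller. ---
  $batch_size_by_memory = $max_request_memory / ($avg_rec_memory * $scale_factor);
  $batch_size_by_duration = $max_request_duration / ($avg_rec_duration * $scale_factor);

  return max(1, (int) floor(min($batch_size_by_memory, $batch_size_by_duration)));
}
```

Usage would be something like `estimateBatchSize($rows, function ($batch) { /* run the real per-record import here */ });`, with the caller supplying whatever actually performs the import.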
This formula is not at all tested and could be totally wrong. It makes some significant assumptions, e.g.
- It assumes that the first 10 records are representative of the entire data-set.
- It assumes that runtime scales linearly with qty+size of records. Well, that feels true...
- It assumes that memory usage scales linearly with qty+size of records. This feels kinda wrong/misleading. It depends on how batches are loaded, what kind of memory-leaks there are, and what other data-structures are loaded. I would guess that caches+code have more impact on the overall memory footprint. But who knows...