Example algorithm for choosing a single batch-size. This example does a trial-run with a few records and tries to extrapolate the $batch_size that we can get away with (see the rough PHP sketch after the list).

  • Take the first 10 records. Import them as a batch. Record performance characteristics of the trial-run:
    • The $base_request_memory is the memory_get_usage() at the start of the page-load (before executing any imports).
    • The $avg_rec_size is based on the average size of an import record in this batch.
    • The $avg_rec_duration is derived from microtime(1) (before+after importing the trial batch).
    • The $avg_rec_memory is derived from memory_get_usage() or memory_get_peak_usage() (before+after importing the trial batch).
  • Pick general constraints
    • The $max_rec_size is based on the largest record in the entire import.
    • The $max_request_duration is min(ini_get(max_execution_time), Civi::settings()->get('import_max_duration'))
    • The $max_request_memory is min(ini_get(memory_limit), Civi::settings()->get('import_max_memory')) - $base_request_memory.
    • The $scale_factor is $max_rec_size/$avg_rec_size.
  • Calculate batch size based on different resource-constraints.
    • The $batch_size_by_memory is $max_request_memory / ($avg_rec_memory * $scale_factor).
    • The $batch_size_by_duration is $max_request_duration / ($avg_rec_duration * $scale_factor).
    • The $batch_size is min($batch_size_by_memory, $batch_size_by_duration)
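
As a rough illustration (untested, like the formula itself), the calculation might look something like this in PHP. The helpers `importBatch()` and `recordSize()` are hypothetical stand-ins for whatever the real importer provides; `parseIniBytes()` only exists to turn `memory_limit` shorthand like `128M` into bytes. The setting names `import_max_duration` and `import_max_memory` are the ones used above.

```php
<?php
// Sketch of the batch-size heuristic described above. importBatch($records) and
// recordSize($record) are hypothetical placeholders for the real importer.

// Convert a php.ini shorthand value like "128M" to bytes.
// (Does not handle the special value -1 / "unlimited".)
function parseIniBytes(string $value): int {
  $value = trim($value);
  $number = (float) $value;
  switch (strtoupper(substr($value, -1))) {
    case 'G': return (int) ($number * 1024 * 1024 * 1024);
    case 'M': return (int) ($number * 1024 * 1024);
    case 'K': return (int) ($number * 1024);
    default: return (int) $number;
  }
}

function estimateBatchSize(array $allRecords): int {
  $trial = array_slice($allRecords, 0, 10);

  // --- Trial run: import the first 10 records and measure them. ---
  $base_request_memory = memory_get_usage();
  $startTime = microtime(TRUE);
  importBatch($trial);                                   // hypothetical importer call
  $trialDuration = microtime(TRUE) - $startTime;
  $trialMemory = memory_get_peak_usage() - $base_request_memory;

  $avg_rec_duration = $trialDuration / count($trial);
  $avg_rec_memory = $trialMemory / count($trial);
  $avg_rec_size = array_sum(array_map('recordSize', $trial)) / count($trial);

  // --- General constraints. ---
  $max_rec_size = max(array_map('recordSize', $allRecords));
  $max_request_duration = min(
    (int) ini_get('max_execution_time'),
    Civi::settings()->get('import_max_duration')
  );
  $max_request_memory = min(
    parseIniBytes(ini_get('memory_limit')),
    Civi::settings()->get('import_max_memory')
  ) - $base_request_memory;
  $scale_factor = $max_rec_size / $avg_rec_size;

  // --- Batch size under each constraint; take whichever is smaller. ---
  $batch_size_by_memory = $max_request_memory / ($avg_rec_memory * $scale_factor);
  $batch_size_by_duration = $max_request_duration / ($avg_rec_duration * $scale_factor);

  return max(1, (int) floor(min($batch_size_by_memory, $batch_size_by_duration)));
}
```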

This formula is not at all tested and could be totally wrong. It makes some significant assumptions, e.g.

  • It assumes that the first 10 records are representative of the entire data-set.
  • It assumes that runtime scales linearly with qty+size of records. Well, that feels true...
  • It assumes that memory usage scales linearly with qty+size of records. This feels kinda wrong/misleading. It depends on how batches are loaded, what kind of memory-leaks there are, and what other data-structures are loaded. I would guess that caches+code have more impact on the overall memory footprint. But who knows...