Patch Magento AMQP framework to prevent RabbitMQ from blowing up when consumer jobs require long processing
The default AMQP settings have an unlimited prefetch count. This means the server keeps sending messages to the consumer as quickly as it can, regardless of whether the client is actually reading them.
When a consumer worker is processing a long-running job, it is not reading the socket while RMQ is still sending. The TCP buffer fills up and the connection is dropped before the original job has even had a chance to finish. This has major stability implications:
- The original job will not be reported as finished because the connection was interrupted, so it will be retried many times, possibly ad infinitum.
- The jobs that the server sent but the consumer never read stay in the UNACKed state, which means RMQ always keeps them in memory. With a large number of jobs this can make memory usage skyrocket.
All of this may cause a snowball effect that makes the whole node unstable, with jobs being processed slowly or the node even crashing. When RMQ goes over the high memory watermark it temporarily blocks connections that publish messages, causing application errors and slowing everything down even more.
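The difference between an unlimited and a limited prefetch count can be illustrated with a toy model. This is only a sketch and not RabbitMQ's or Magento's actual implementation: the `simulate` function and its parameters are hypothetical, and the consumer is assumed to be stuck on one long job without acknowledging anything. The only protocol fact it relies on is that a prefetch count of 0 means "no limit" in AMQP 0-9-1.

```python
from collections import deque


def simulate(total_messages, prefetch_count):
    """Toy broker model: how many messages end up unacked on a busy consumer.

    prefetch_count == 0 means "unlimited", mirroring AMQP basic.qos.
    The consumer acks nothing, because it is busy with one long job.
    Returns (unacked_in_flight, still_queued_on_broker).
    """
    queue = deque(range(total_messages))
    unacked = []
    # The broker keeps pushing until the queue is empty or the limit is hit.
    while queue and (prefetch_count == 0 or len(unacked) < prefetch_count):
        unacked.append(queue.popleft())
    return len(unacked), len(queue)


print("unlimited prefetch:", simulate(10_000, prefetch_count=0))
print("prefetch_count=1:  ", simulate(10_000, prefetch_count=1))
```

With no limit, every queued message is pushed to the busy consumer and sits unacked. With a limit of 1, only the job being processed is in flight and the rest stay queued on the broker, where other consumers can pick them up.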
This is a quick fix/hack that sets the prefetch count to a fixed value of 1. I would argue that this setting fits 99% of common Magento queue workloads, the main one being bulk API async processing. In that case the messaging is used for job processing, each job takes a significant amount of time to process, and a high prefetch count does not improve performance at all. Ideally this setting should be configurable, with a default set at a sane value (let's say 10?).
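As a rough sketch of what "configurable with a sane default" could look like, the snippet below resolves the prefetch count from configuration and applies it to a channel. It is written in Python for illustration only and is not part of this patch. The `AMQP_PREFETCH_COUNT` variable, `resolve_prefetch_count`, and `apply_prefetch` are hypothetical names, and the default of 10 is simply the "let's say 10" suggested above. The channel is assumed to expose a pika-style `basic_qos(prefetch_count=...)` method. Where the value would actually live in Magento is left open.

```python
import os

# Hypothetical default, taken from the "let's say 10" suggestion.
DEFAULT_PREFETCH_COUNT = 10


def resolve_prefetch_count(env=None):
    """Read the prefetch count from a hypothetical AMQP_PREFETCH_COUNT variable."""
    env = os.environ if env is None else env
    raw = env.get("AMQP_PREFETCH_COUNT")
    if raw is None or raw.strip() == "":
        return DEFAULT_PREFETCH_COUNT
    value = int(raw)
    if value < 0:
        raise ValueError("prefetch count must be >= 0 (0 means unlimited)")
    return value


def apply_prefetch(channel, env=None):
    """Apply the limit to any channel exposing a pika-style basic_qos()."""
    count = resolve_prefetch_count(env)
    channel.basic_qos(prefetch_count=count)
    return count
```

Passing the environment as a plain mapping keeps the sketch easy to exercise without a broker, for example `resolve_prefetch_count({"AMQP_PREFETCH_COUNT": "1"})` to reproduce the fixed value used in this patch.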
The solution to this problem is open for discussion.
And I spent hours trying to debug the whole TCP stack 😅