RabbitMQ and CloudAMQP

To get optimal performance, make sure your queues stay as short as possible at all times.
Longer queues impose more processing overhead.
For optimal performance, queue length should stay close to 0.

In scenarios with many consumers and publishers, the best approach is to centralize the creation of exchanges and queues: remove that permission from the applications and have an admin team manage those resources.

To guarantee message delivery, use consumer acknowledgements, broker (publisher) confirms, and durable queues with persistent messages.
Quorum queues can also be used, depending on the scenario.
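
A minimal sketch of that combination with the RabbitMQ Java client; the host, queue name, message body and class name are illustrative assumptions:

import com.rabbitmq.client.*;

public class ReliablePublishConsume {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumption: broker running locally
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            // durable queue: survives broker restarts
            channel.queueDeclare("task-queue", true, false, false, null);
            // publisher confirms: the broker acks every published message
            channel.confirmSelect();
            channel.basicPublish("", "task-queue",
                    MessageProperties.PERSISTENT_TEXT_PLAIN, // delivery_mode=2 (persistent)
                    "hello".getBytes());
            channel.waitForConfirmsOrDie(5_000); // wait for the broker ack (or fail)

            // consumer ack: autoAck=false, acknowledge only after processing
            channel.basicConsume("task-queue", false,
                    (tag, delivery) -> {
                        // ... process the message here ...
                        channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
                    },
                    tag -> { /* consumer cancelled by the broker */ });
            Thread.sleep(1000); // let the consumer run briefly before the connection closes
        }
    }
}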

ZeroMQ is a brokerless, asynchronous messaging library; it provides message queues.
The ZeroMQ API provides sockets that represent many-to-many connections between endpoints.
ZeroMQ patterns:

  • Request-reply (Asynchronous and Synchronous Request/Response)
  • Publish-subscribe
  • Push-pull (pipeline)
  • Exclusive pair

Request-reply -> this is a remote procedure call and task distribution pattern.
Publish-subscribe -> connects a set of publishers to a set of subscribers.
Push-pull -> fan-out / fan-in pattern that can have multiple steps, and loops. This is a parallel task distribution and collection pattern.
Exclusive pair -> connects two sockets in an exclusive pair.

If you set up a brokerless messaging network, three things that you might need are:
discovery, availability and management.

Discovery is the problem of keeping track of which peers exist, can send messages and can join.
Availability is the problem of dealing with peers disappearing from time to time, keeping a copy of messages until those peers reappear, and so on.
Management is knowing who is connected to whom, and who may connect to whom.

Models like JGroups usually make management simple.

A broker is an end-point leader; it is also an intermediary.
A commonly held misconception about brokers is that they are ‘centralized’.

Peer to peer models are not inherently more or less simple than brokered models.
If you do not need discovery, availability, management, or intermediation then it may be simpler to not use them.

From a RabbitMQ point of view ZeroMQ is a ‘smart client’ that can use its buffers like a queue.

Brokers

Brokers are a type of middleware.
This kind of middleware gives developers a standardized way to handle the flow of data between an application's components so they can focus on the core logic. It can act as a distributed communications layer that lets applications on different platforms communicate with each other.

Message brokers can validate, store, route and deliver messages to the appropriate destinations. They act as intermediaries between applications, allowing senders to emit messages without knowing where the recipients are, whether they are active or how many of them exist. This makes it easier to decouple processes and services within systems.

To provide reliable message storage and guaranteed delivery, message brokers usually rely on a substructure or component called a message queue, which stores and orders the messages until the consuming applications can process them. In a message queue, messages are stored in the exact order in which they were transmitted and remain in the queue until receipt is confirmed.

Default Capacity

12 messages per second is roughly 1,000,000 per day.
RabbitMQ handles 4,000 to 10,000 messages per second on typical modern hardware.
Fewer than 1,000 messages per second is trivial for RabbitMQ.
You will only hit bottlenecks when using thousands of queues and clients;
otherwise you will not have performance problems,
and there is no reason to create a cluster other than resilience.

From 1,000 messages per second (86 million per day, 2.5 billion per month),
you can start sizing the cluster for more throughput.

Size the Cluster for the Worst Case

Always size the cluster for the moment an error occurs and there is a flood of error messages.
If a broker goes down, how will the rest of the cluster cope?
If some consumers lose their connections, how will the cluster cope?

Memory Usage

Classic queues and classic mirrored queues keep messages in memory, but start moving them out of memory when free memory runs low.
It is not possible to predict exactly when that behaviour will kick in, because it is dynamic.
Use lazy queues if you always need free memory.
Quorum queues keep all messages in memory by default, even under low-memory conditions.
When queues grow too large, memory alarms fire and publishers may end up blocked.
This behaviour can be changed with arguments such as x-max-in-memory-length, but that has a price: messages will start being read back from disk.
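
For example, a sketch of declaring a quorum queue with a bounded in-memory length through queue arguments, using the Java client (java.util.Map/HashMap) and assuming an open Channel named channel; the queue name and limit are illustrative:

Map<String, Object> qArgs = new HashMap<>();
qArgs.put("x-queue-type", "quorum");          // declare a quorum queue
qArgs.put("x-max-in-memory-length", 100000);  // keep at most ~100k messages in memory
channel.queueDeclare("orders", true, false, false, qArgs);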

Redundancy and Scalability

3 brokers for redundancy
9 brokers for scalability
9 brokers with quorum queues and a replication factor of 3

Message Size

Messages larger than 1 MB can generate heavy network traffic and saturate memory.

Adverse Conditions

A broker can go down for several reasons:

  • installation of an OS patch
  • disk failure
  • network loss
  • memory failure

One test scenario: evaluate a peak of 30,000 messages per second for a few minutes.
A very viral marketing campaign can lead to much higher traffic than usual.
Identify attacks, or system failures, that could cause a spike in messages.

Load Tests (Benchmarks)

One tool is PerfTest (a load generator).
However, PerfTest is a synthetic test; to simulate the real environment, the ideal is to use the applications themselves.

Peer Discovery

To form a cluster, new ("blank") nodes need to be able to discover their peers.
This can be done using a variety of mechanisms (backends).
Some mechanisms assume all cluster members are known ahead of time (for example, listed in the config file),
others are dynamic (nodes can come and go).

Available Discovery Mechanisms

The following mechanisms are built into the core and always available:

  • Config file
  • Pre-configured DNS A/AAAA records

Additional peer discovery mechanisms are available via plugins.
The following peer discovery plugins ship with supported RabbitMQ versions:
  • AWS (EC2)
  • Kubernetes
  • Consul
  • etcd

Depending on the backend (mechanism) used, the process of peer discovery may involve contacting external services,
for example, an AWS API endpoint, a Consul node or performing a DNS query.

If peer discovery isn't configured, or it repeatedly fails,
or no peers are reachable,
a node that wasn't a cluster member in the past will initialise from scratch and proceed as a standalone node.
If a node previously was a cluster member,
it will try to contact and rejoin its "last seen" peer for a period of time.
In this case, no peer discovery will be performed. This is true for all backends.

As a rule of thumb, a cluster that has only been partially formed — that is,
only a subset of nodes has joined it — must be considered fully available by clients.

Existing cluster members will not perform peer discovery.
Instead they will try to contact their previously known peers.

Configuring Peer Discovery

Config File Peer Discovery Backend:

cluster_formation.peer_discovery_backend = classic_config

DNS Peer Discovery Backend:

cluster_formation.peer_discovery_backend = dns
cluster_formation.dns.hostname = discovery.eng.example.local

Peer Discovery on AWS (EC2):
The plugin provides two ways for a node to discover its peers:

  • Using EC2 instance tags
  • Using AWS autoscaling group membership

cluster_formation.peer_discovery_backend = aws
cluster_formation.aws.region = us-east-1
cluster_formation.aws.access_key_id = ANIDEXAMPLE
cluster_formation.aws.secret_key = WjalrxuTnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY

Kubernetes for peer discovery:

cluster_formation.peer_discovery_backend = k8s
cluster_formation.k8s.host = kubernetes.default.example.local

Consul Peer Discovery:

cluster_formation.peer_discovery_backend = consul
cluster_formation.consul.host = consul.eng.example.local
cluster_formation.consul.acl_token = acl-token-value

Etcd Peer Discovery:

cluster_formation.peer_discovery_backend = etcd
cluster_formation.etcd.endpoints.1 = one.etcd.eng.example.local:2379
cluster_formation.etcd.endpoints.2 = two.etcd.eng.example.local:2479
cluster_formation.etcd.endpoints.3 = three.etcd.eng.example.local:2579
cluster_formation.etcd.username = rabbitmq
cluster_formation.etcd.password = s3kR37

Some backends (Consul, etcd) support node health checks or TTLs. These checks should not be confused with monitoring health checks.

Erlang Cookie

If the file does not exist, the Erlang VM will try to create one with a randomly generated value when the RabbitMQ server starts up. Using such generated cookie files is appropriate only in development environments. Since each node generates its own value independently, this strategy is not really viable in a clustered environment.
Erlang cookie generation should be done at cluster deployment time, ideally using automation and orchestration tools, especially in distributed deployments.

What is Replicated?

All data/state required for the operation of a RabbitMQ broker is replicated across all nodes. An exception to this are message queues, which by default reside on one node, though they are visible and reachable from all nodes. To replicate queues across nodes in a cluster, use a queue type that supports replication.

Clustering and Observability

Client connections, channels and queues will be distributed across cluster nodes. Operators need to be able to inspect and monitor such resources across all cluster nodes.
RabbitMQ CLI tools such as rabbitmq-diagnostics and rabbitmqctl provide commands that inspect resources and cluster-wide state. Some commands focus on the state of a single node (e.g. rabbitmq-diagnostics environment and rabbitmq-diagnostics status), others inspect cluster-wide state. Some examples of the latter include rabbitmqctl list_connections, rabbitmqctl list_mqtt_connections, rabbitmqctl list_stomp_connections, rabbitmqctl list_users, rabbitmqctl list_vhosts and so on.

Management UI works similarly: a node that has to respond to an HTTP API request will fan out to other cluster members and aggregate their responses. In a cluster with multiple nodes that have management plugin enabled, the operator can use any node to access management UI. The same goes for monitoring tools that use the HTTP API to collect data about the state of the cluster. There is no need to issue a request to every cluster node in turn.

Node Failure - Classic Queues

Non-replicated classic queues can also be used in clusters. Non-mirrored queue behaviour in case of node failure depends on queue durability.

Clustering - WAN - LAN

Clustering is meant to be used across LAN. It is not recommended to run clusters that span WAN. The Shovel or Federation plugins are better solutions for connecting brokers across a WAN.

Metrics and Statistics

Every node stores and aggregates its own metrics and stats, and provides an API for other nodes to access it. Some stats are cluster-wide, others are specific to individual nodes.

Creating a Cluster

To create a cluster, first bring up the machines with RabbitMQ.
Then each joining node must be reset.
The reset removes all resources and data from the broker.
In other words, a broker cannot become a member of a cluster and keep its existing data at the same time.
If you need to keep that data, use a Blue/Green deployment or backup/restore.
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@rabbit1
rabbitmqctl start_app
rabbitmqctl cluster_status
To add another broker, simply join it to one of the brokers already in the cluster.

Schema Syncing - Restarting

When a broker stops, it selects a peer to synchronize with after it restarts.
For example, when rabbit@broker1 restarts, it selects rabbit@broker2;
after restarting, broker1 will try to synchronize with broker2 to rejoin the cluster.
By default there are 10 retry attempts, each with a 30-second timeout.

When an entire cluster is shut down, the last broker to go down has no peer left when it stops.
On restart, brokers try to contact their peers for 5 minutes (default).
The brokers can be restarted in any order.

During upgrades, sometimes the last broker to stop must be the first to come up after the upgrade.
That broker is designated to migrate the cluster schema to the other brokers.

A persistent message to be confirmed must be written to disk or ack’d on all the queues it was delivered to.
With regard to confirms, persistent messages delivered to non-durable queues behave like transient messages.
Queue deletion, queue purge and basic.reject{requeue=false} simulate a consumer acknowledgement.
With respect to per-queue ttl, message expiry simulates a consumer acknowledgement.

If more than one of these conditions are met, only the first causes a confirm to be sent.

The broker may always set the multiple bit in the basic.acks.
A basic.ack with multiple set means that all messages up-to-and-including delivery-tag are acknowledged.
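
A sketch of handling those broker acks with the Java client's ConfirmListener, assuming an open Channel named channel:

channel.confirmSelect();
channel.addConfirmListener(
        (deliveryTag, multiple) -> {
            // ack: multiple=true means every message up to and including deliveryTag is confirmed
        },
        (deliveryTag, multiple) -> {
            // nack: the broker could not take responsibility for these messages; republish or log them
        });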

Heterogeneous Cluster

Let’s say our cluster of workers doesn’t run in exactly the same hardware.
Some machines have some hardware features that give them an advantage over the others in the cluster based on the type of task we are running.
For example some machines have SSDs and our tasks require a lot of I/O;
or perhaps the tasks need faster CPUs to perform calculations;
or more RAM in order to cache results for future computations.
In any case, if we have two consumers ready to receive more messages
and one of them runs on a better machine,
RabbitMQ should pick the consumer on the better machine and deliver the message to it.

Data Locality

Another use for consumer priorities is to benefit from data locality.
In RabbitMQ queue contents live in the node where the queue was originally declared,
and in case of mirrored queues there will be a master node that will coordinate the queue.
We can use a consumer priority to tell RabbitMQ to first deliver messages to consumers connected to the master node.
To do that, the consumer that connects to the master node sets a higher priority for itself.
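
A sketch of a consumer setting a priority for itself with the x-priority consumer argument in the Java client, assuming an open Channel named channel; the queue name and priority value are illustrative:

Map<String, Object> consumerArgs = new HashMap<>();
consumerArgs.put("x-priority", 10); // higher number = preferred consumer
channel.basicConsume("orders", false, consumerArgs,
        (tag, delivery) -> {
            // ... process the message ...
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        },
        tag -> { /* consumer cancelled */ });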

An enterprise service bus (ESB) is an architectural pattern sometimes used in service-oriented architectures (SOA) implemented across enterprises.
In an ESB, a centralized software platform combines communication protocols and data formats into a "common language" that all services and applications in the architecture can share.
It can, for example, convert requests it receives from one protocol (such as XML) to another (such as JSON).
ESBs transform message payloads using an automated process.
The centralized software platform also handles other orchestration logic, such as connectivity,
routing and request processing.

ESB infrastructures are complex, can cause problems during integration, and are expensive.
They are hard to troubleshoot in production environments,
they are not easy to scale, and upgrading them is tedious.

Message brokers are a "lightweight" alternative to ESBs,
as they provide similar functionality:
a mechanism for communication between services, only simpler and cheaper.
They are well suited to the microservices architectures that have become more common as ESBs have fallen out of favour.

If a consumer is overloaded, more consumers can be added to the queue, but they will compete for messages:
distribution is round-robin,
a load balance between them, and message ordering is lost if there is a sequence of events tied to some id or similar.

If a broker is overloaded, a consistent-hash exchange can be used to distribute the queue's messages across brokers while still keeping message order (messages with the same routing key always land on the same queue). It is effectively a single queue spread across brokers, but the distribution is not replication; the data is not replicated.

For very high message volumes, RabbitMQ has streams, which scale further than regular queues.

Kafka supports up to 1,000,000 messages per second.
RabbitMQ supports between 4,000 and 10,000 messages per second.

RAM nodes can be used; they store the internal database tables in RAM only.
This does not include messages, message indexes, queue indexes, etc.
They are useful when there is a lot of creation and removal of queues, exchanges and bindings.
A cluster needs at least one Disk Node broker.
There are several risks in using RAM nodes.

Without the feature flags subsystem,
it would be impossible to have a RabbitMQ 3.8.0 node inside a cluster where other nodes are running RabbitMQ 3.7.x.
Indeed, the 3.7.x nodes would be unable to understand the data structures or the database schema from a 3.8.0 node.

That’s why RabbitMQ today prevents this from happening by comparing versions and by denying clustering when
versions are considered incompatible (the policy considers different minor/major versions to be incompatible).

RabbitMQ 3.8.0 now has the feature flags subsystem.
If a cluster contains brokers running 3.7.x and one broker is upgraded to 3.8.0,
that broker will not use 3.8.0 features, because the feature flags subsystem prevents it;
for it to start using them, all other brokers in the cluster must be upgraded first.

Once a feature flag is enabled, it becomes impossible to add a broker running an older RabbitMQ version to the cluster.

To enable a feature flag, use rabbitmqctl, the management UI or another tool.

High Availability and Disaster Recovery

HA refers to some kind of automatic fail-over mechanism to replace a failed instance of a piece of software with another.
The failure can be in the server, disk, network, etc.

Disaster Recovery usually refers to the response to a more critical incident (a disaster), such as the total loss of a data center.
Disaster Recovery tries to avert a partial or total permanent failure.
It tries to avoid the loss of a system, which usually involves building a redundant system that is geographically separate.

HA and Disaster Recovery fall within the scope of Business Continuity.

Business Continuity Planning

Ultimately, we want to be able to recover quickly from a major incident (disaster recovery),
and to remain available during minor incidents (high availability).

Deploying a system that can recover from failures and disasters can be very expensive, both financially and in terms of performance.
The implementation must weigh the cost of the deployment against the cost of data loss and service interruption.

The types of data must be taken into account:

  • Transient
  • Persistent
  • Secondary
  • Source of truth

To protect data in the event of a disaster, there are a few approaches:
Backups and Replication

Replication can be asynchronous or synchronous.
Asynchronous: the client does not wait for the replicas to receive the data before confirming.
Synchronous: the client waits.

With asynchronous replication there is low latency, but a risk of data loss.

Asynchronous replication plus backups is usually the solution for Disaster Recovery.

CAP Theorem

CAP classifies systems as:

  • AP - Availability in a partitioned network
  • CP - Consistency in a partitioned network

CP systems lose availability if there is not enough redundancy to satisfy them.
CP replicates data synchronously and only confirms to the client when the operation has been replicated to the brokers.
This eliminates data loss, at the cost of higher latency.
CP is for when consistency is the most important thing.

AP remains available even when it cannot achieve redundancy.
AP replicates asynchronously; there may be data loss, but latency is low.
AP is for when availability or latency is the most important thing.

Synchronous replication offers data safety at the cost of availability and extra latency; it is usually not viable across multiple data centers.
Asynchronous replication offers high availability and low latency, and works well across multiple data centers.

Backups can be slow for recovering data, but they have the benefit of being able to go back in time through the data.

Quorum queues are CP and tolerate the loss of brokers as long as a quorum (majority) keeps working.
Quorum queues use their own method to detect network partitions; it is not the same one RabbitMQ uses for the cluster.
Quorum queues favour consistency over availability in the case of a partition.

Classic mirrored queues use RabbitMQ's partition recovery mechanism.
Classic mirrored queues can be configured for availability (AP) or for consistency (CP).
Using auto-heal or ignore allows all brokers to keep running even if one or more brokers leave the network. However, when recovering from the partition, there may be data loss.
Using pause_minority, consistency is the priority, protecting the data.

RabbitMQ Supports

  • synchronous replication using quorum queues
  • asynchronous cluster-to-cluster replication using federation and shovels
  • backup support
  • rack awareness support

High Availability with RabbitMQ

A good recommendation is to use a cluster within one data center and use quorum queues.

Availability Zones

Availability zones are data centers connected by ultra-reliable, low-latency links, but they are not geographically separated.

Schema Replication

There are two ways to replicate the schema:
export/import of the definitions to JSON, or
the Tanzu RabbitMQ plugin that synchronizes the schema to a secondary cluster.

Federation and Shovel

Federation/Shovel forward messages from one cluster to another; they do not replicate message state. So the primary cluster may already have consumed messages that the secondary still holds.
If the primary cluster goes down and the secondary takes over, applications that already consumed those messages on the primary will end up consuming them a second time on the secondary. The only way to eliminate this problem is idempotency, or some mechanism to deduplicate messages (e.g. recording an id in Redis).

Backups

RabbitMQ supports backups, but the support is limited: the cluster must be shut down so the data directory can be backed up, which makes this backup policy impractical in many scenarios.

Management Plugin

The RabbitMQ management plugin provides an HTTP-based API for management and monitoring of RabbitMQ nodes and clusters, along with a browser-based UI and a command line tool, rabbitmqadmin.

Cowboy, the embedded Web server used by the management plugin, provides a number of options that can be used to customize the behavior of the server.

rabbitmqadmin is a Python command line tool that interacts with the HTTP API.

The API is intended to be used for basic observability tasks.
Prometheus and Grafana are recommended for long term metric storage, alerting, anomaly detection, and so on.

Authenticating with OAuth 2

RabbitMQ can be configured to use JWT-encoded OAuth 2.0 access tokens to authenticate client applications and management UI users.
When doing so, the management UI does not automatically redirect users to authenticate against the OAuth 2 server; this must be configured separately.
Currently, UAA is the only supported authorization server.

Mirrored Queues

All operations for a given queue are first applied on the queue's leader node and then propagated to mirrors

Consumers are connected to the leader regardless of which node they connect to

Queue mirroring therefore enhances availability, but does not distribute load across nodes

If the node that hosts queue leader fails, the oldest mirror will be promoted to the new leader as long as it's synchronised. Unsynchronised mirrors can be promoted, too, depending on queue mirroring parameters.

Adequate monitoring of the system is critically important as it is the only way to spot problematic trends (e.g. channel leaks, growing file descriptor usage from poor connection management) early.

The way applications are designed and use RabbitMQ client libraries is a major contributor to the overall system resilience. Applications that use resources inefficiently or leak them will eventually affect the rest of the system. For example, an app that continuously opens connections but never closes them will exhaust cluster nodes out of file descriptors so no new connections will be accepted. This and similar problems can manifest themselves in more complex scenarios, e.g those collectively known as the thundering herd problem.

Network can fail in many ways, sometimes pretty subtle (e.g. high ratio packet loss). Disrupted TCP connections take a moderately long time (about 11 minutes with default configuration on Linux, for example) to be detected by the operating system. AMQP 0-9-1 offers a heartbeat feature to ensure that the application layer promptly finds out about disrupted connections (and also completely unresponsive peers). Heartbeats also defend against certain network equipment which may terminate "idle" TCP connections when there is no activity on them for a certain period of time.

TCP keepalives is a TCP stack feature that serves a similar purpose and can be very useful (possibly in combination with heartbeats) but requires kernel tuning in order to be practical with most operating systems and distributions.

To configure the heartbeat timeout in the Java client, set it with ConnectionFactory#setRequestedHeartbeat before creating a connection:

ConnectionFactory cf = new ConnectionFactory();

// set the heartbeat timeout to 60 seconds
cf.setRequestedHeartbeat(60);

Prometheus in combination with Grafana or the ELK stack have a number of benefits compared to other monitoring options:

  • Decoupling of the monitoring system from the system being monitored
  • More powerful and customizable user interface

Legacy Intrusive Health Check

Earlier versions of RabbitMQ provided a single opinionated and intrusive health check command (and its respective HTTP API endpoint):
// DO NOT USE: this health check is very intrusive, resource-intensive, prone to false positives
// and as such, deprecated
rabbitmq-diagnostics node_health_check

The above check forced every connection, queue leader replica, and channel in the system to emit certain metrics. With a large number of concurrent connections and queues, this can be very resource-intensive and too likely to produce false positives.

Monitoring Tools

  • AppDynamics
  • AWS CloudWatch
  • collectd
  • DataDog
  • Dynatrace
  • Nagios
  • New Relic
  • Prometheus
  • Zabbix

Health Checks as Readiness Probes

Some health checks return an error if the node is still waiting for peers to sync the schema tables:
rabbitmq-diagnostics check_running
Others, like the one below, only require the node to be running:
rabbitmq-diagnostics ping

In some environments, node restarts are controlled with a designated health check. The checks verify that one node has started and the deployment process can proceed to the next one. If the check does not pass, the deployment of the node is considered to be incomplete and the deployment process will typically wait and retry for a period of time. One popular example of such environment is Kubernetes where an operator-defined readiness probe can prevent a deployment from proceeding when the OrderedReady pod management policy is used.

There are more specific health checks and broader ones; some evaluate whether the cluster is healthy, others whether the broker started correctly.

rabbitmq-diagnostics -q listeners
checks the listening ports; 25672, 5672, 5671, 15672, 15671 are the ports for inter-node and CLI tool communication, amqp, amqp/ssl, http, https

rabbitmq-diagnostics check_port_connectivity

rabbitmq-diagnostics -q memory_breakdown --unit "MB"
shows the memory usage of each component of the broker

rabbitmq-diagnostics check_running
checks whether the broker is running

rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms

rabbitmq-diagnostics -q status

Stage 5
Includes all checks in stage 4 plus checks that there are no failed virtual hosts.
rabbitmq-diagnostics check_virtual_hosts is a command that checks whether any virtual host dependencies may have failed. This is done for all virtual hosts.
rabbitmq-diagnostics -q check_virtual_hosts
// if the check succeeded, exit code will be 0

Ways to Monitor

  • Prometheus: the recommended tool. It is decoupled from the broker itself, so it monitors only the broker and not itself, has low overhead and long-term metric storage, and with Grafana you get dashboards and advanced metric charts that are easy to configure.
  • Management Plugin: collects metrics and displays them in the UI; convenient for development environments.
    The management plugin is intertwined with the broker itself, which pollutes the monitoring.
    It has a lot of overhead, mainly in RAM, only keeps recent data, and has a fairly basic UI.
  • Kubernetes Operator: enabled by default when RabbitMQ is deployed on Kubernetes with the Operator.
  • Interactive Command Line-Based Observer Tool: the rabbitmq-diagnostics observer command line tool, similar to top, htop, vmstat.
    It gives access to many metrics: runtime version info, CPU, scheduler stats, memory allocation, top processes by CPU, network link stats, TCP socket stats...

Clustering can be used to achieve different goals, depending on the configuration:

  • Increased data safety through replication.
  • Increased availability for clients.
  • Higher total data throughput.

A network connection failure between brokers affects data consistency and availability.

Different systems have different requirements for consistency and for tolerance of unavailability.
Because of this, there are different strategies for handling cluster partitions.

A network partition is detected when a broker cannot contact another broker for 60 seconds (default).

If no partition recovery strategy is configured, after the cluster recovers
the split will persist even though the network between all brokers is working again.

Recovering from a split-brain manually

First, choose the partition you trust the most.
Any changes on the other partitions will be lost.
Stop all brokers in the other partitions, then start them again.
After that, restart the brokers in the trusted partition you chose.
Depending on the case, if it is simple, it is possible to restart all brokers at once,
as long as you make sure that the first node to start belongs to the trusted partition.

Recovering from partitions automatically

RabbitMQ also offers two ways to deal with network partitions automatically: pause-minority mode or pause-if-all-down mode.
There are two additional modes: autoheal, and the default behaviour, ignore.

In pause-minority mode RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority.
This configuration prevents split-brain and is therefore able to automatically recover from network partitions without inconsistencies.

In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes.
This is close to the pause-minority mode, however, it allows an administrator to decide which nodes to prefer, instead of relying on the context.
In pause-if-all-down mode, if the administrator listed the two nodes in rack A, only nodes in rack B will pause.
Note that it is possible the listed nodes get split across both sides of a partition:
in this situation, no node will pause.
That is why there is an additional ignore/autoheal argument to indicate how to recover from the partition.

In autoheal mode RabbitMQ will automatically decide on a winning partition if a partition is deemed to have occurred
and will restart all nodes that are not in the winning partition
The winning partition is the one which has the most clients connected (or if this produces a draw, the one with the most nodes; and if that still produces a draw then one of the partitions is chosen in an unspecified way).

ignore: use when network reliability is the highest practically possible and node availability is of topmost importance. For example, all cluster nodes can be in the same rack or equivalent, connected with a switch, and that switch is also the route to the outside world.

Note that pause_minority mode will do nothing to defend against partitions caused by cluster nodes being suspended. This is because the suspended node will never see the rest of the cluster vanish, so will have no trigger to disconnect itself from the cluster.

The persistence layer has two components: the queue index and the message store.
The queue index is responsible for maintaining knowledge about where a given message is in a queue, along with whether it has been delivered and acknowledged.

The message store is a key-value store for messages, shared among all queues in the server.

By default, RabbitMQ will not accept any new messages when it detects that it's using more than 40% of the available memory (as reported by the OS): vm_memory_high_watermark.relative = 0.4. This is a safe default and care should be taken when modifying this value, even when the host is a dedicated RabbitMQ node.

The current 50MB disk_free_limit default works very well for development and tutorials. Production deployments require a much greater safety margin. Insufficient disk space will lead to node failures and may result in data loss as all disk writes will fail.

Operating systems limit maximum number of concurrently open file handles, which includes network sockets.

RabbitMQ nodes authenticate to each other using a shared secret stored in a file. On Linux and other UNIX-like systems, it is necessary to restrict cookie file access only to the OS users that will run RabbitMQ and CLI tools.

It is important that the value is generated in a reasonably secure way (e.g. not computed from an easy to guess value). This is usually done using deployment automation tools at the time of initial deployment. Those tools can use default or placeholder values: don't rely on them. Allowing the runtime to generate a cookie file on one node and copying it to all other nodes is also a poor practice: it makes the generated value more predictable since the generation algorithm is known.

We recommend using TLS connections when possible, at least to encrypt traffic. Peer verification (authentication) is also recommended.

Single node clusters can be sufficient when simplicity is preferred over everything else: development, integration testing and certain QA environments.

Three node clusters are the next step up. They can tolerate a single node failure (or unavailability) and still maintain quorum. Simplicity is traded off for availability, resiliency and, in certain cases, throughput.

It is important to pick a partition handling strategy before going into production. When in doubt, use the pause_minority strategy with an odd number of nodes (3, 5, 7, and so on).

Data locality will be best when producers (publishers) connect to RabbitMQ nodes where queue leaders are running. Such topology is difficult to achieve in practice.

All metadata (definitions: virtual hosts, users, queues, exchanges, bindings, etc.) is replicated across all nodes in the cluster, and most metadata changes are synchronous in nature.

The cost of propagating such changes goes up with the number of cluster nodes, both during operations and node restarts. Users who find themselves in need of clusters with node counts in double digits should consider using independent clusters for separate parts of the system where possible.

Quorum queues

To protect messages when a broker goes down, quorum queues can be used to distribute (replicate) a queue's messages across brokers.
Quorum queues are durable, FIFO, and use a consensus algorithm to handle network partitions.
Quorum queues are used in scenarios where fault tolerance and data safety matter more than the lowest possible latency and advanced queue configuration.
3 brokers (nodes) is the practical minimum for a quorum queue to be replicated.
A quorum queue can be configured with 3 replicas even if the cluster has 10 brokers.
Ideally a quorum queue should have an odd number of nodes so that, in case of a partition, one side has a majority.
Every quorum queue has a primary replica. That replica is called queue leader. All queue operations go through the leader first and then are replicated to followers (mirrors). This is necessary to guarantee FIFO ordering of messages.
A broker can only hold replicas of a quorum queue if it is declared (included) in the replica list.
Leaders should be distributed across the brokers so that no single broker is overloaded with leaders.
When a broker joins or leaves the cluster, the quorum can be rebalanced:
rabbitmq-queues rebalance quorum
A quorum queue should be able to tolerate a minority of queue members becoming unavailable with no or little effect on availability.
A message is considered delivered once a majority of the nodes confirm receipt.
If 2 out of 3 brokers become unavailable, the quorum queue becomes unavailable.
Follower replicas that are disconnected or electing a leader do not receive messages.
Quorum queues keep track of the number of unsuccessful delivery attempts and expose it in the "x-delivery-count" header that is included with any redelivered message.
It is possible to set a delivery limit for a queue using a policy argument, delivery-limit.
A message may be redelivered many times without the consumer ever managing to ack it; to avoid a potential infinite loop, a delivery-limit policy can be created. When the limit is reached, the message is sent to a dead letter exchange or dropped. The count is kept by incrementing the x-delivery-count header of the message on each redelivery.
Repeated requeues: internally, quorum queues are implemented using a log where all operations, including messages, are persisted. To avoid this log growing too large it needs to be truncated regularly. To be able to truncate a section of the log, all messages in that section need to be acknowledged. Usage patterns that continuously reject or nack the same message with the requeue flag set to true could cause the log to grow in an unbounded fashion and eventually fill up the disks.
Since RabbitMQ 3.10, messages that are rejected or nacked back to a quorum queue are returned to the back of the queue if no delivery-limit is set. This avoids the above scenario where repeated requeues cause the Raft log to grow in an unbounded manner. If a delivery-limit is set, the original behaviour of returning the message near the head of the queue is used.
The quorum queue log can grow a lot when many messages are rejected and requeued; since version 3.10, rejected messages go back to the end of the queue unless a delivery-limit is set.
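
A sketch of declaring a quorum queue with a delivery limit through queue arguments in the Java client (the limit can also be set with a delivery-limit policy), assuming an open Channel named channel; the queue name and value are illustrative:

Map<String, Object> qqArgs = new HashMap<>();
qqArgs.put("x-queue-type", "quorum");
qqArgs.put("x-delivery-limit", 5); // after 5 failed deliveries the message is dead-lettered or dropped
channel.queueDeclare("payments", true, false, false, qqArgs);
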
With classic queues, all deliveries are performed from the leader replica. Quorum queues can deliver messages from queue replicas as well, so as long as consumers connect to a node where a quorum queue replica is hosted, messages delivered to those consumers will be performed from the local node.
Assuming all cluster members are available, a client can connect to any node and perform any operation. Nodes will route operations to the quorum queue leader or queue leader replica transparently to clients.

Rabbit
Streams through publish/subscribe, and services with a direct reply-to feature.
Load balancing can be achieved with a Work Queue.
Applications must correlate requests with replies over multiple topics for a service (request-reply) pattern.

Pulsar
Streams through publish/subscribe.
Multiple competing consumer patterns support load balancing.
Application code must correlate requests with replies over multiple topics for a service (request-reply) pattern.

Kafka
Streams through publish/subscribe.
Load balancing can be achieved with consumer groups.
Application code must correlate requests with replies over multiple topics for a service (request-reply) pattern.

NATS
Streams and Services through built-in publish/subscribe, request-reply, and load-balanced queue subscriber patterns.
Dynamic request permissioning and request subject obfuscation is supported.

gRPC
One service, which may have streaming semantics, per channel.
Load Balancing for a service can be done either client-side or by using a proxy.

Delivery Guarantees

NATS
At most once, at least once, and exactly once is available in JetStream.
gRPC
At most once.
Kafka
At least once, exactly once.
Pulsar
At most once, at least once, and exactly once.
Rabbit
At most once, at least once.

Multi-tenancy and Sharing

NATS
NATS supports true multi-tenancy and decentralized security through accounts and defining shared streams and services.
gRPC
N/A
Kafka
Multi-tenancy is not supported.
Pulsar
Multi-tenancy is implemented through tenants; built-in data sharing across tenants is not supported.
Each tenant can have its own authentication and authorization scheme.
Rabbit
Multi-tenancy is supported with vhosts; data sharing is not supported.

Authentication

NATS
NATS supports TLS, NATS credentials, NKEYS (NATS ED25519 keys), username and password, or simple token.
gRPC
TLS, ALTS, token, channel and call credentials, and a plug-in mechanism.
Kafka
Supports Kerberos and TLS.
Supports JAAS and an out-of-box authorizer implementation that uses ZooKeeper to store connection and subject.
Pulsar
TLS Authentication, Athenz, Kerberos, JSON Web Token Authentication.
Rabbit
TLS, SASL, username and password, and pluggable authorization.

NATS’s ideal use case is for architectures with a need for highly-dynamic RPC and non-persistent PubSub messaging. NATS is optimized for performance and ease of use. This comes with some limitations — specifically, smaller messages, fewer delivery guarantees, limited persistence. NATS has become more popular lately as an alternative to Service Mesh implementations for internal service communication. NATS is also easier to deploy and scale than RabbitMQ. Try adding a node to RabbitMQ in production - it’s not as easy as deploying a new NATS server instance.

RabbitMQ is great for worker queues and less-dynamic PubSub. Sure you can do RPC with RabbitMQ, but topology configuration is more complex and deliberate. RabbitMQ is great at making sure messages aren’t dropped (whether by RMQ itself or a subscriber that fails to process a message). These transactional guarantees make RMQ slower than NATS but provide safety in situations where you don’t want to make consumers responsible for their own reliability.

A vhost serves to separate environments: one RabbitMQ environment for dev, another for prod, etc.
It can also be used to separate projects on the same server.

A vhost separates applications that use the same RabbitMQ instance.
Different users can have different access privileges to different vhosts and queues,
and exchanges can be created so that they only exist in one vhost.

Acknowledgements and confirms can be issued both by consumers to the server and in the other direction, from the server to the producer.

4 exchange types: Direct, Topic, Fanout, Headers

Direct sends messages directly to queues based on a routing key.
Topic filters messages according to a pattern.
Fanout sends messages to all bound queues.
Headers routes based on information contained in the message headers.

Durable exchanges survive server restarts and last until they are deleted.
Temporary exchanges exist until the RabbitMQ instance is shut down.
Auto-deleted exchanges are removed once the last object is unbound from the exchange.

Think of the routing key as an "address" that the exchange uses to decide on how to route the message.
A message goes to the queue(s) whose binding key exactly matches the routing key of the message.

Note: If the message routing key does not match any binding key, the message is discarded.

Default exchange
The default exchange is a pre-declared, nameless direct exchange.
When the default exchange is used, the message is delivered to the queue whose name equals the message's routing key.
Every queue is automatically bound to the default exchange with a routing key equal to the queue name.
In other words, a binding with a routing key equal to the queue name is created automatically.
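
A sketch with the Java client, assuming an open Channel named channel and an illustrative queue name:

// publishing to the default exchange (empty name) routes by queue name
channel.queueDeclare("task-queue", true, false, false, null);
channel.basicPublish("", "task-queue", null, "hello".getBytes());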

TOPIC EXCHANGE
Topic exchanges route messages to queues based on wildcard matches.
To be routed, the message's routing key must match the routing pattern specified by the queue binding.
Messages can be routed to one or more queues.
The routing key must be a list of words, delimited by a period (.).
The routing patterns may contain an asterisk ("*") to match a word in a specific position of the routing key.
A pound symbol ("#") indicates a match on zero or more words.
(e.g., a routing pattern of agreements.*.*.b.* only matches routing keys where the first word is agreements and the fourth word is "b")
(e.g., a routing pattern of agreements.eu.berlin.# matches any routing key beginning with agreements.eu.berlin)

The consumers indicate which topics they are interested in (like subscribing to a feed of an individual tag).
The consumer creates a queue and sets up a binding with a given routing pattern to the exchange.
All messages with a routing key that match the routing pattern are routed to the queue and stay there until the consumer handles the message.

The routing pattern is the filter, containing the wildcards, and lives on the binding between the queue and the exchange.
The routing key is the information carried by the message.
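
A sketch of a topic exchange with the Java client, assuming an open Channel named channel; the exchange, queue and pattern names are illustrative:

channel.exchangeDeclare("agreements", BuiltinExchangeType.TOPIC, true);
channel.queueDeclare("berlin-agreements", true, false, false, null);
// the binding carries the routing pattern (with wildcards)
channel.queueBind("berlin-agreements", "agreements", "agreements.eu.berlin.#");
// the routing key on the message is matched against the bound patterns
channel.basicPublish("agreements", "agreements.eu.berlin.store", null, "...".getBytes());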

Headers Exchange
A special argument named "x-match", added in the binding between exchange and queue, specifies if all headers must match or just one.
The "x-match" property can have two different values: "any" or "all", where "all" is the default value.
A value of "all" means all header pairs (key, value) must match, while value of "any" means at least one of the header pairs must match.

It's worth noting that in a header exchange, the actual order of the key-value pairs in the message is irrelevant.
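
A sketch of a headers exchange binding with the Java client, assuming an open Channel named channel; the exchange, queue and header names are illustrative:

Map<String, Object> bindArgs = new HashMap<>();
bindArgs.put("x-match", "all"); // "all": every listed header must match; "any": at least one
bindArgs.put("format", "pdf");
bindArgs.put("type", "report");
channel.exchangeDeclare("docs", BuiltinExchangeType.HEADERS, true);
channel.queueDeclare("pdf-reports", true, false, false, null);
channel.queueBind("pdf-reports", "docs", "", bindArgs); // the routing key is ignored by headers exchanges

Map<String, Object> headers = new HashMap<>();
headers.put("format", "pdf");
headers.put("type", "report");
AMQP.BasicProperties props = new AMQP.BasicProperties.Builder().headers(headers).build();
channel.basicPublish("docs", "", props, "...".getBytes());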

Dead Letter Exchange
A message is considered "dead" when it has reached the end of its time-to-live,
when the queue exceeds the max length (messages or bytes) configured for the queue, or
when the message has been rejected by the queue or negatively acknowledged by the consumer
and is not marked for re-queueing.

A dead-lettered message can be republished to an exchange called a dead letter exchange.
Routing can use a routing key specified for the dead letter exchange or the message's original routing key.

A client may accidentally or maliciously route messages using non-existent routing keys.
To avoid complications from lost information, collecting unroutable messages in a RabbitMQ alternate exchange is an easy, safe backup.
RabbitMQ handles unroutable messages in two ways, based on the mandatory flag set on the publish operation.
The server either returns the message when the flag is set to "true" or silently drops the message when it is set to "false".

Alternate Exchange
RabbitMQ lets you define an alternate exchange to apply logic to unroutable messages.

Confirmation Mode
A channel in confirmation mode requires each published message to be ‘acked’ or ‘nacked’ by the server,
thereby indicating that it has been handled.

Management UI
Unacked is the number of messages for which the server is waiting for acknowledgement.

Features show the parameters for the exchange (e.g. D stands for durable, and AD for auto-delete).

Message TTL - The time a message published to a queue can live before being discarded.
What happens to a discarded message? Is it lost?

Auto-expire - The time a queue can be unused before it is automatically deleted.

Max length - How many ready messages a queue can hold before it starts to drop them.
If the limit is exceeded, will the messages be lost?

Max length bytes - The total size of ready messages a queue can hold before it starts to drop them.

Shovel?

Enable lazy queues

Lazy queues are able to support long queues (millions of messages).
Lazy queues, added in RabbitMQ 3.6, write messages to disk immediately.

Messages are only loaded into memory when needed, thereby minimizing the RAM usage, but increasing the throughput time.
It is a feature that prevents the server from being overwhelmed if many messages arrive at once,
at the cost of every message being written to disk; it minimizes RAM usage.

We recommend using lazy queues when you know that you will have large queues from time to time.

If many messages are sent at once (e.g., when processing batch jobs), or
if there is a risk that consumers will not consistently keep up with the speed of the publishers,
lazy queues are recommended.
Disable lazy queues if high performance is required,
if queues are always short, or if a max-length policy exists.
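
A sketch of declaring a lazy queue through a queue argument with the Java client, assuming an open Channel named channel (a policy with the queue-mode key is an alternative); the queue name is illustrative:

Map<String, Object> lazyArgs = new HashMap<>();
lazyArgs.put("x-queue-mode", "lazy"); // write messages to disk immediately
channel.queueDeclare("batch-jobs", true, false, false, lazyArgs);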

Queues

Queues are single-threaded.
A queue supports up to about 50,000 messages.

RabbitMQ queues are bound to the node where they were first declared.
All messages routed to a specific queue will end up on the node where that queue resides.
Splitting queues evenly between nodes manually is possible,
but the downside is having to remember where each queue is located.

Multiple Cores

Two plugins that help if there are multiple nodes or a single node cluster with multiple
cores are the consistent hash exchange plugin and RabbitMQ Sharding.

Consistent Hash Exchange Plugin
The consistent hash exchange plugin lets you use an exchange to load-balance messages between queues.
Messages sent to that exchange are distributed consistently and evenly
across many queues based on the message's routing key.
The plugin hashes the routing key and spreads the messages across the queues that have a binding to that exchange.
Doing this manually can quickly become problematic
without pushing a lot of information about the number of queues and their bindings into the publisher.
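
A sketch with the Java client, assuming the rabbitmq_consistent_hash_exchange plugin is enabled and an open Channel named channel; the exchange and queue names are illustrative:

channel.exchangeDeclare("events-hash", "x-consistent-hash", true);
channel.queueDeclare("events-1", true, false, false, null);
channel.queueDeclare("events-2", true, false, false, null);
// for this exchange type the binding key is a numeric weight, not a pattern
channel.queueBind("events-1", "events-hash", "1");
channel.queueBind("events-2", "events-hash", "1");
// messages with the same routing key always hash to the same queue, preserving per-key order
channel.basicPublish("events-hash", "order-4711", null, "...".getBytes());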

RabbitMQ sharding

https://github.com/rabbitmq/rabbitmq-sharding.

A queue name starting with amq. is reserved for internal use by the broker.

A queue TTL (time-to-live) policy can be set on the queue.
A TTL policy of 28 days will delete queues that have not had messages consumed from them in the last 28 days.

An auto-delete queue is deleted when the last consumer is cancelled or when the channel/connection is closed.

An exclusive queue is used only by the connection that declared it, and is deleted when that connection is closed.

Sending many small messages might be a bad alternative.
A better idea can be to bundle the small messages into one larger message and let the consumer split it up.
However, when bundling multiple messages, remember that this might affect the processing time.

If one of the bundled messages fails, will all of them need to be reprocessed?
Bandwidth and architecture will dictate the best way to set up messages with respect to payload size.

Each connection uses about 100 KB of RAM (and even more, if TLS is used).

Connections should be long-lived so that channels can be opened and closed more frequently, if required.
Even channels should be long-lived if possible.
Do not open a channel every time a message is published.

Costs:

• AMQP connection: 7 TCP packets
• AMQP channel: 2 TCP packets
• AMQP publish: 1 TCP packet (more for larger messages)
• AMQP close channel: 2 TCP packets
• AMQP close connection: 2 TCP packets

The client can either ack(knowledge) the message when it receives it, or when the client has completely processed the message.

If the messages are something that must be processed to completion, the consumer should only ack after processing.

Messages, exchanges, and queues that are not durable and persistent are lost during a broker restart.
They are not durable by default.

Durable Queues and Persistent Messages
Queues should be declared as "durable" and messages should be sent with delivery mode "persistent".

Use persistent messages and durable queues
If you cannot afford to lose any messages,
make sure that your queue is declared as "durable",
and your messages are sent with delivery mode "persistent"
(delivery_mode=2)

Encryption
Connecting to RabbitMQ over AMQPS is the AMQP protocol wrapped in TLS.
AMQPS has a performance cost; a VPC can be used instead.

PRE-FETCH
The RabbitMQ default pre-fetch setting gives clients an unlimited buffer,
meaning that RabbitMQ by default sends as many messages as it can to any consumer that looks ready to accept them.

Messages that are sent are cached by the RabbitMQ client library in the consumer until processed.
Pre-fetch limits how many messages the client can receive before acknowledging a message.

A large pre-fetch count, can deliver many messages to one single consumer,
and keep other consumers in an idling state.
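
A sketch with the Java client, assuming an open Channel named channel:

// allow at most 1 unacknowledged message per consumer, so work is spread evenly
channel.basicQos(1);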

Creating a CloudAMQP instance with one node: a single-node RabbitMQ is very fast,
since messages are not synchronized between nodes.
Creating a CloudAMQP instance with two nodes results in half the performance.

With three nodes created in a CloudAMQP instance, performance is roughly a quarter.

The nodes are located in different availability zones: when a cluster is created, each node is placed in a different availability zone.

Pause minority is a partition handling strategy in a three node cluster that protects data from being inconsistent due to net-splits.

If the system has many consumers, and/or a long processing time,
the best practice is to set the pre-fetch count to one (1) to evenly distribute the messages among all resources.
Please note that if a client is set up to auto-ack messages, the pre-fetch value has no effect.

HiPE
Enabling HiPE gives RabbitMQ higher throughput, but startup time increases.
Enabling HiPE means that RabbitMQ is compiled at startup.

REMEMBER TO ENABLE HA ON NEW VHOSTS
When creating clusters, an HA policy is needed so that the nodes are synchronized.
A common mistake on CloudAMQP clusters is that users create a new vhost but forget
to enable an HA-policy for it.
Messages will therefore not be synced between nodes.

Direct exchanges are the fastest.

Dead Letter
A queue that is declared with the x-dead-letter-exchange property sends messages
which are either rejected, nacked (negative acknowledged) or expired (with TTL),
to the specified dead-letter-exchange.
Specifying x-dead-letter-routing-key replaces the message's routing key when the message is dead-lettered.

Message TTL
Declaring a queue with the x-message-ttl property means that messages will be
discarded from the queue if they haven’t been consumed within the time specified.

Set max-length if needed
Drop or dead-letter messages from the front of the queue (i.e. the oldest messages in the queue).
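
A sketch combining these queue arguments with the Java client, assuming an open Channel named channel; the names and values are illustrative:

Map<String, Object> queueArgs = new HashMap<>();
queueArgs.put("x-dead-letter-exchange", "dlx");            // where rejected/expired messages go
queueArgs.put("x-dead-letter-routing-key", "orders.dead"); // replaces the original routing key
queueArgs.put("x-message-ttl", 60000);                     // discard messages not consumed within 60 s
queueArgs.put("x-max-length", 10000);                      // drop/dead-letter from the head beyond 10k messages
channel.exchangeDeclare("dlx", BuiltinExchangeType.FANOUT, true);
channel.queueDeclare("orders", true, false, false, queueArgs);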

Split queues over different cores
Better performance is achieved by splitting queues into different cores,
and into different nodes if possible, as queue performance is limited to one CPU core.

Enable lazy queues
Lazy queues write messages to disk immediately,
which spreads the work out over time instead of risking a performance hit somewhere down the line.
The result of a lazy queue is a more predictable, smooth performance curve without sudden drops.

RabbitMQ HA – Two (2) nodes
Two (2) nodes are ideal for high availability.
For messaging you do not need more than that, and there is no split-brain problem.

CloudAMQP locates each node in a cluster in different availability zones (AWS).
Additionally, queues are automatically mirrored and replicated (HA) between availability zones.

Optional federation use between clouds
A message broker or a load balancer will rarely be a bottleneck or a weak point of the system: they run no business logic and hold no rules, so breaking a broker or a load balancer is very unlikely, giving close to 100% availability.

Clustering is not recommended between clouds or regions,
and therefore there is no plan at CloudAMQP to spread nodes across regions or datacenters.
If one cloud region goes down, the CloudAMQP cluster also goes down, but this scenario would be extremely rare.
Instead, cluster nodes are spread across availability zones within the same region.

Clustering a message broker across clouds, or across very distant regions, suffers in throughput and latency, which makes it impractical.
Federation can be used for high availability instead; it is a different mechanism from clustering with replicas.

Management statistics rate mode
Setting the rate mode of RabbitMQ management statistics to ‘detailed’ has a serious impact on performance.
This setting is not recommended in production.

Limited use of priority queues
Each priority level uses an internal queue on the Erlang VM, which consumes resources.
In most uses it is sufficient to have no more than five (5) priority levels.

Policy
A policy contains a pattern (a regular expression) that is matched against queue (or exchange) names.
A policy can apply to an upstream set or a single upstream of exchanges and/or queues.

Protocols
One protocol can be used when publishing while another can be used to consume.
It has no impact on the message itself.
The MQTT protocol, with its minimal design, is perfect for embedded systems, mobile phones, and other memory- and bandwidth-sensitive applications.

MQTT
MQ Telemetry Transport is a "lightweight" publish-subscribe messaging protocol.
The protocol is often used in the IoT (Internet of Things) world of connected devices.

MQTT lacks authorization and error notifications from the server to clients, which are significant limitations in some scenarios.

STOMP
STOMP does not deal with queues and topics; instead, it uses a SEND semantic with a destination string.

HTTP
RabbitMQ can transmit messages over HTTP.
The RabbitMQ-management plugin provides an HTTP-based API for management and monitoring of the RabbitMQ server.
CloudAMQP HTTP assigned port number is 443.

RabbitMQ

RabbitMQ's key differentiator is its message-routing capabilities.
Kafka's key differentiator is scalability, although it is bounded by the brokers' disk capacity.
Pulsar's differentiator is effectively unlimited scalability.
Redis is simple: fire-and-forget pub/sub.
NATS has many features.

Kafka

Data processing and analytics pipelines are where Kafka clearly shines.

Exchanges

Exchanges route messages to queues and other exchanges
A common misconception about exchanges in RabbitMQ is that they are "things" you send messages to. In fact, they are routing rules.
Exchanges are rules: not a message store and not a message receiver, just routing rules that direct a message to a queue. When a message arrives at the broker it is held in memory and routed to a queue.

LDAP

RabbitMQ can use LDAP to perform authentication and authorisation by deferring to external LDAP servers.
This functionality is provided by a built-in plugin.
The plugin primarily targets OpenLDAP and Microsoft Active Directory.

Federation vs Shovel

The federation plugin doesn't move messages, it only allows remote consumers to dequeue messages.
The shovel plugin really drains messages from a broker; it is used to move messages.

Transactions

The Tx class allows publish and ack operations to be batched into atomic units of work.
The intention is that all publish and ack requests issued within a transaction will complete successfully or none of them will.
Transactions that cover multiple queues may be non-atomic,
given that queues can be created and destroyed asynchronously,
and such events do not form part of any transaction.
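
A hedged sketch of the tx.* class with the Java client, assuming an open Channel; exchange and queue names are illustrative:

channel.txSelect();                        // put the channel in transactional mode
try {
    channel.basicPublish("", "orders", null, "order 123 created".getBytes());
    channel.basicPublish("", "audit",  null, "order 123 created".getBytes());
    channel.txCommit();                    // both publishes take effect together
} catch (Exception e) {
    channel.txRollback();                  // discard everything published since the last commit
}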

RabbitMQ compatibility

The use of any RabbitMQ-specific extensions makes it harder to swap RabbitMQ for a different AMQP broker - sender-selected distribution is no exception.

JMS

RabbitMQ is not a JMS provider, but it includes a plugin that supports the JMS Queue and Topic messaging models,
allowing new and existing JMS applications to connect to RabbitMQ.

To use JMS with RabbitMQ, you need 2 components:

  • JMS client library
  • RabbitMQ JMS topic selector plugin

The RabbitMQ JMS topic selector plugin supports message selectors for JMS topics; message selectors allow a JMS application to filter messages using an expression based on SQL syntax.

Some JMS 1.1 features are unsupported in the RabbitMQ JMS Client:

  • The JMS Client does not support server sessions.
  • XA transaction support interfaces are not implemented.
  • Queue selectors are not yet implemented.

BCC CC

The routing logic in AMQP 0-9-1 does not offer a way for message publishers to select intended recipients unless they bind their queues to the target destination (an exchange).
The RabbitMQ broker treats the "CC" and "BCC" message headers in a special way to overcome this limitation. This is the equivalent of entering multiple recipients in the "CC" or "BCC" field of an email.
The values associated with the "CC" and "BCC" header keys will be added to the routing key if they are present.
The "BCC" key and value will be removed from the message prior to delivery, offering some confidentiality among consumers.
A publisher cannot choose specific recipients unless their queues are bound to the target destination.
RabbitMQ uses the CC and BCC headers to overcome this limitation.
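
A hedged sketch of sender-selected distribution with the Java client, assuming an open Channel; exchange and routing-key names are illustrative:

// Extra routing keys go into the "CC"/"BCC" headers as lists of strings.
// "BCC" entries are removed from the message before delivery.
Map<String, Object> headers = new HashMap<String, Object>();
headers.put("CC", Arrays.asList("invoicing", "analytics"));
headers.put("BCC", Collections.singletonList("audit"));

AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
        .headers(headers)
        .build();

// The message is routed with "billing" plus every key listed in CC/BCC.
channel.basicPublish("orders-exchange", "billing", props, "order 123".getBytes());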

Sharding

Message distribution between shards (partitioning) is achieved with a custom exchange type that distributes messages by applying a hashing function to the routing key.
The plugin provides a new exchange type, "x-modulus-hash", that will use a hashing function to partition messages routed to a logical queue across a number of regular queues (shards).
The "x-modulus-hash" exchange will hash the routing key used to publish the message and then it will apply a Hash mod N to pick the queue where to route the message, where N is the number of queues bound to the exchange.
This exchange will completely ignore the binding key used to bind the queue to the exchange.
There are other exchanges with similar behaviour: the Consistent Hash Exchange or the Random Exchange. Those were designed with regular queues in mind, not this plugin, so "x-modulus-hash" is highly recommended.
If message partitioning is the only feature necessary and the automatic scaling of the number of shards (covered below) is not needed or desired, consider using Consistent Hash Exchange instead of this plugin.
Let's say you declared the exchange images to be a sharded exchange. Then RabbitMQ creates several "shard" queues behind the scenes.
TL;DR: if you have a shard called images, then you can directly consume from a queue called images.

Policies

RabbitMQ policies allow for flexible, centralised attribute configuration of queues and exchanges.

Alternate Exchanges

Alternate Exchanges route messages that were otherwise unroutable.
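
A hedged sketch of declaring an alternate exchange with the Java client, assuming an open Channel; exchange names are illustrative:

// Anything the bindings of "orders" cannot route is re-published to "unroutable"
// instead of being silently dropped.
Map<String, Object> args = new HashMap<String, Object>();
args.put("alternate-exchange", "unroutable");

channel.exchangeDeclare("unroutable", "fanout", true);
channel.exchangeDeclare("orders", "direct", true, false, args);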

Queue Message TTL

Per-Queue Message TTL determines how long an unconsumed message can live in a queue before it is automatically deleted.

Queue TTL

Queue TTL determines how long an unused queue can live before it is automatically deleted.

Dead Letter Exchanges

Dead Letter Exchanges ensure messages get re-routed when they are rejected or expire.

Wireshark

Capturing Traffic with Wireshark.

Quorum Queue

A quorum queue is a replicated queue that provides high availability and data safety.

Quorum Queue Purposes

Their intended use is for topologies where queues exist for a long time and are critical to certain aspects of system operation, therefore fault tolerance and data safety is more important than, say, lowest possible latency and advanced queue features.
This is a replicated queue to provide high availability and data safety.
Each quorum queue is a replicated queue; it has a leader and multiple followers.
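
A minimal sketch of declaring a quorum queue with the Java client, assuming an open Channel; the queue name is illustrative:

// Quorum queues are selected with the x-queue-type argument;
// they must be declared durable and non-exclusive.
Map<String, Object> args = new HashMap<String, Object>();
args.put("x-queue-type", "quorum");
channel.queueDeclare("orders", true, false, false, args);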

Ack

Acknowledging Multiple Deliveries at Once.
basic.nack is used for negative acknowledgements (note: this is a RabbitMQ extension to AMQP 0-9-1)
Delivery Identifiers: Delivery Tags
Negatively acknowledging a delivery:
Such deliveries can be
  • discarded,
  • dead-lettered, or
  • requeued by the broker.
This behaviour is controlled by the requeue field.
When the field is set to true, the broker will requeue the delivery with the specified delivery tag.
Alternatively, when this field is set to false, the message will be routed to a Dead Letter Exchange if it is configured, otherwise it will be discarded.
When a message is requeued, it will be placed to its original position in its queue, if possible. If not (due to concurrent deliveries and acknowledgements from other consumers when multiple consumers share a queue), the message will be requeued to a position closer to queue head.

basic.reject vs basic.nack

It is possible to reject or requeue multiple messages at once using the basic.nack method.
This is what differentiates it from basic.reject. It accepts an additional parameter, multiple.
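
A hedged sketch of the two calls (they are alternatives for a given delivery) with the Java client, assuming an open Channel and a Delivery named "delivery":

long tag = delivery.getEnvelope().getDeliveryTag();

// Reject just this one delivery and requeue it:
channel.basicReject(tag, true);

// ...or negatively acknowledge this delivery and every earlier unacked delivery
// on the channel (multiple=true) without requeueing (requeue=false), so they are
// dead-lettered if a DLX is configured, otherwise discarded:
channel.basicNack(tag, true, false);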

Asynchronous ack

Because messages are sent (pushed) to clients asynchronously, there is usually more than one message "in flight" on a channel at any given moment. In addition, manual acknowledgements from clients are also inherently asynchronous in nature.

Confirm

For routable messages, the basic.ack is sent when a message has been accepted by all the queues. For persistent messages routed to durable queues, this means persisting to disk. For quorum queues, this means that a quorum of replicas have accepted and confirmed the message to the elected leader.
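
A minimal sketch of publisher confirms with the Java client, assuming an open Channel; queue name and timeout are illustrative:

channel.confirmSelect();                    // put the channel in confirm mode

channel.basicPublish("", "orders",
        com.rabbitmq.client.MessageProperties.PERSISTENT_TEXT_PLAIN,
        "order 123 created".getBytes());

// Blocks until the broker confirms everything published so far;
// throws if a nack arrives or the 5-second timeout expires.
channel.waitForConfirmsOrDie(5000);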

Exclusive Queue

Used by only one connection; the queue will be deleted when that connection closes.

Auto-delete queue

A queue that has had at least one consumer is deleted when the last consumer unsubscribes.
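
Both properties are just flags on queue.declare. A small sketch with the Java client, assuming an open Channel; the queue name is illustrative:

// queueDeclare(name, durable, exclusive, autoDelete, arguments)
// exclusive=true  -> only this connection may use the queue; it is deleted when the connection closes
// autoDelete=true -> the queue is deleted when its last consumer unsubscribes
channel.queueDeclare("replies." + java.util.UUID.randomUUID(), false, true, true, null);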

Ack Ordering

In RabbitMQ, applications must not depend on the order of acks.
Acks can be batched and are asynchronous.

Consistent hash exchange plugin

The consistent hash exchange plugin allows you to use an exchange to load-balance messages between queues.
Messages sent to the exchange are consistently and equally distributed across many queues, based on the routing key of the message.
The plugin creates a hash of the routing key and spreads the messages out between queues that have a binding to that exchange.
It could quickly become problematic to do this manually, without pushing knowledge about the number of queues and their bindings into the publisher.
Basically, it evenly distributes the messages sent to an exchange across several queues.
When a queue receives more messages than the broker can keep up with, putting 2 consumers on the same queue introduces concurrency and the order of events is lost. For example, order 123 was created, updated, and then cancelled; with two consumers, one may pick up the first event and the other the second, losing consistency.
With the consistent hash exchange it is possible to send all the events of a given order (for example, 123) to the same queue: one consumer gets every message for that order, while another consumer gets the events of other orders. The logical queue is the same, but there is one physical queue per consumer.
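
A hedged sketch with the Java client, assuming the rabbitmq_consistent_hash_exchange plugin is enabled and an open Channel; exchange and queue names are illustrative:

// With this exchange type, the binding key is a numeric weight, not a pattern.
channel.exchangeDeclare("orders", "x-consistent-hash", true);

channel.queueDeclare("orders-0", true, false, false, null);
channel.queueDeclare("orders-1", true, false, false, null);
channel.queueBind("orders-0", "orders", "1");   // weight 1
channel.queueBind("orders-1", "orders", "1");   // weight 1

// All events for order 123 hash to the same physical queue, so their relative order is kept.
channel.basicPublish("orders", "123", null, "order 123 created".getBytes());
channel.basicPublish("orders", "123", null, "order 123 cancelled".getBytes());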

Consistent Hash Exchange vs n Exchanges + n Queues

We can bind 5 queues to 5 exchanges directly, instead of using consistent hash exchange.
But each of those queues would receive every message (not a subset).

RabbitMQ sharding

The RabbitMQ sharding plugin does the partitioning of queues automatically; i.e., once you define an exchange as sharded, then the supporting queues are automatically created on every cluster node and messages are sharded accordingly.
RabbitMQ sharding shows one queue to the consumer, but in reality, it consists of many queues running in the background.
The RabbitMQ sharding plugin gives you a centralized place where you can send your messages, plus load balancing across many nodes, by adding queues to the other nodes in the cluster.
This plugin creates several queues, one on each broker, and those queues appear to the consumer as a single queue.
It is used to parallelize consumption of a queue without overloading a single broker.
The problem is that with 10 brokers and only 7 consumers, 3 queues will sit idle, possibly with messages never being consumed.
Another problem: with 12 consumers, 2 queues end up with 2 consumers each (or 1 queue with 3), which may be wrong depending on the strategy; the cluster administrator has to watch for this, since the plugin offers no option to control it.
Kafka's consumer groups are a similar feature that solves all of these problems.

Exchange vs Topic JMS

A fanout exchange mimics the topics you get with other queue-based pub/sub systems.

Message Priority - Consumer Priority

Message TTL - Queue TTL

Dead Letter Exchanges

Messages from a queue can be "dead-lettered"; that is, republished to an exchange when any of the following events occur:
1- The message is negatively acknowledged by a consumer using basic.reject or basic.nack with the requeue parameter set to false;
2- the message expires due to per-message TTL; or
3- the message is dropped because its queue exceeded a length limit.
Note that expiration of a queue will not dead letter the messages in it.
Dead letter exchanges (DLXs) are normal exchanges.
For any given queue, a DLX can be defined by clients using the queue's arguments, or in the server using policies.

Nodes are Equal Peers

Some distributed systems have leader and follower nodes.
This is generally not true for RabbitMQ.
All nodes in a RabbitMQ cluster are equal peers: there are no special nodes in RabbitMQ core.

Communication between nodes

RabbitMQ nodes and CLI tools (e.g. rabbitmqctl) use a cookie to determine whether they are allowed to communicate with each other. For two nodes to be able to communicate they must have the same shared secret called the Erlang cookie.

Simple Authentication and Security Layer (SASL)

It is a framework for authentication and data security in Internet protocols.
It decouples authentication mechanisms from application protocols.
A SASL mechanism implements a series of challenges and responses:

EXTERNAL

where authentication is implicit in the context (e.g., for protocols already using IPsec or TLS)

ANONYMOUS

for unauthenticated guest access

PLAIN

a simple cleartext password mechanism, defined in RFC 4616

OTP

a one-time password mechanism. Obsoletes the SKEY mechanism.

CRAM-MD5

a simple challenge-response scheme based on HMAC-MD5.

DIGEST-MD5

partially HTTP Digest compatible challenge-response scheme based upon MD5.
DIGEST-MD5 offered a data security layer.

OAUTHBEARER

OAuth 2.0 bearer tokens (RFC 6750), communicated through TLS

OAUTH10A

OAuth 1.0a message-authentication-code tokens (RFC 5849)

As of 2012 protocols currently supporting SASL include:

  • Advanced Message Queuing Protocol (AMQP)
  • Lightweight Directory Access Protocol (LDAP)
  • memcached
  • Post Office Protocol (POP)
  • Remote framebuffer protocol used by VNC
  • Simple Mail Transfer Protocol (SMTP)

Scheduling Messages with RabbitMQ

RabbitMQ Delayed Message Plugin

rabbitmq-plugins enable rabbitmq_delayed_message_exchange

Using the Exchange

To use the Delayed Message Exchange:

Map<String, Object> args = new HashMap<String, Object>();
args.put("x-delayed-type", "direct");
channel.exchangeDeclare("my-exchange", "x-delayed-message", true, false, args);

Delaying Messages

To delay a message:

byte[] messageBodyBytes = "delayed payload".getBytes();
AMQP.BasicProperties.Builder props = new AMQP.BasicProperties.Builder();
Map<String, Object> headers = new HashMap<String, Object>();
headers.put("x-delay", 5000); // delay in milliseconds before the exchange routes the message
props.headers(headers);
channel.basicPublish("my-exchange", "", props.build(), messageBodyBytes);

Stream

A stream remains an AMQP 0.9.1 queue, so it can be bound to any exchange after its creation, just as any other RabbitMQ queue.
As streams never delete any messages, any consumer can start reading/consuming from any point in the log. This is controlled by the x-stream-offset consumer argument.
a stream with three replicas can tolerate one node failure without losing availability. A stream with five replicas can tolerate two, and so on.
Messages in a queue are read and then destroyed; in a stream, the log can be read from any desired point, as many times as needed.
Declaring a queue with an x-queue-type argument set to stream will create a stream with a replica on each configured RabbitMQ node. Streams are quorum systems, so an odd number of cluster nodes is strongly recommended.
Queues are designed to be drained to empty and are optimized in that direction.
Streams, on the other hand, can perform very well even with millions of messages.
Every stream has a primary writer (the leader) and zero or more replicas.
A stream requires a quorum of the declared nodes to be available to function.
Streams replicate data across multiple nodes and publisher confirms are only issued once the data has been replicated to a quorum of stream replicas.
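
A hedged sketch of declaring and consuming a stream through the AMQP 0.9.1 Java client, assuming an open Channel; the stream name and offset are illustrative:

// Streams must be durable and non-exclusive.
Map<String, Object> args = new HashMap<String, Object>();
args.put("x-queue-type", "stream");
channel.queueDeclare("events", true, false, false, args);

// Consuming a stream requires manual acks and a pre-fetch;
// x-stream-offset picks the starting point ("first", "last", "next", an offset, or a timestamp).
channel.basicQos(100);
channel.basicConsume("events", false,
        Collections.singletonMap("x-stream-offset", "first"),
        (consumerTag, message) -> {
            // ... process the message ...
            channel.basicAck(message.getEnvelope().getDeliveryTag(), false);
        },
        consumerTag -> { });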

Stream e Multiplos Consumers

To publish the same message to multiple consumers using queues, you end up having to create one queue per subscriber.
That is costly, especially with persistent queues and replicas for resilience.
Streams solve this problem:
multiple consumers without degrading performance.

Offsets

An application stores an offset periodically (for example, every 10,000 messages) on the server.

Inter-node and CLI Traffic Compression

For the data to be compressed, the following conditions MUST be met:

Both RabbitMQ nodes must support inter-node traffic compression. In other words, both nodes must run VMware Tanzu RabbitMQ. The open source edition does not support this feature.

Both nodes must share at least one compression algorithm in common.

Inter-node traffic compression and TLS can't be used at the same time: they are mutually exclusive at the moment.

Rolling upgrades between certain versions are not supported.
Full Stop Upgrades covers the process for those cases.

The Blue/Green deployment strategy offers the benefit of making the upgrade process safer at the cost of temporary increasing infrastructure footprint. The safety aspect comes from the fact that the operator can abort an upgrade by switching applications back to the existing cluster.

Priority queue on disk data currently cannot be migrated in place between 3.6 and 3.7 (a later series).
If an upgrade is performed in place, such queues would start empty (without any messages) after node restart.
To migrate an environment with priority queues and preserve their content (messages), a blue-green upgrade strategy should be used.

With some distributions (e.g. the generic binary UNIX) you can install a newer version of RabbitMQ without removing or replacing the old one, which can make upgrade faster. You should make sure the new version uses the same data directory.

RabbitMQ does not support downgrades; it's strongly advised to back node's data directory up before upgrading.

Depending on what versions are involved in an upgrade, RabbitMQ cluster may provide an opportunity to perform upgrades without cluster downtime using a procedure known as rolling upgrade. A rolling upgrade is when nodes are stopped, upgraded and restarted one-by-one, with the rest of the cluster still running while each node is being upgraded.

If rolling upgrades are not possible, the entire cluster should be stopped, then restarted. This is referred to as a full stop upgrade.

During a rolling upgrade, client connection recovery will make sure that connections are rebalanced.

Starting with RabbitMQ 3.8.8, nodes can be put into maintenance mode to prepare them for shutdown during rolling upgrades.

Full-Stop Upgrades are when an entire cluster is stopped for upgrade, the order in which nodes are stopped and started is important.
During an upgrade, the last disc node to go down must be the first node to be brought online.

Quorum queues depend on a quorum of nodes being online; in a rolling-upgrade context this means a quorum of nodes must remain available at all times during the upgrade.
