A few years ago, I was building a SaaS application for podcasting. Early in the life of the system, I ran into a major failure. The front-end crashed. Hard. The site was down and my error reporting system was filling up my inbox with issues related to connectivity and the message queue system. My house seemed to be crumbling all around me, and I was in full panic mode.
To have it go down during business hours (U.S. time) would be less than ideal - or, really, a disaster. But in spite of what turned out to be 2 weeks of intermittent crashes and problems, I didn’t lose any important data or have any service outage in serving podcast episodes. Yeah, the front end website was down a lot. But the primary service - sending RSS and podcast episodes to subscribers - ran without a hitch... all thanks to a robust architecture centered around RabbitMQ!
A Series Of Unfortunate Events
It started with a network problem between my server and RabbitMQ instance service. I spent almost an entire working day babysitting the system, restarting it every time it crashed. Sometimes the network between the two services would glitch every 10 minutes. Sometimes it would go for a couple of hours just fine. Eventually, the network issues were resolved and everything became stable again.
But that’s only where the problems started. A few days later, I started getting more crash reports with a different issue… this time it was database related. I had forgotten about the free database plan I was using, and that plan hit its limits. Any attempt to write data to the database failed - except for the media service that delivered podcast episodes to people. It was still humming along and serving episodes, in spite of the main website being down.
Worse still, in the middle of trying to fix this, I managed to delete my production database by accident... with no backup solution in place! I was in a terrified, stomach-churning panic at this point. I thought I was going to have to shut down my business, having lost all customer data. Fortunately, I had made a copy of the data for testing purposes a few hours prior. A quick copy from that testing version restored the service to where it had been, with no significant loss.
A few hours later, the database problems were solved, I had a proper backup solution in place and things seemed to be stable once again.
Except “Stable” Was More Like “Frozen In Time”
A few days later, one of the podcasters that used the service contacted me to let me know that the reports were not showing any new data for the last few days… the same number of days that it had been since the database issue started. After a bit of digging around, I figured out that there was a series of cascading failures caused by the database being down for a bit.
I had backup code in place for the times when the queue service is interrupted. This code saved all messages to my database before sending them through the queue. Once the data was processed on the other end of the queue, it was marked as such in the database and I knew I could remove the data.
When the database was having issues, none of the backup records were being written. The queueing service did its job, though, and held on to the messages – nearly 60,000 of them. But my code was failing when it tried to find the corresponding database record for each message queue message it was processing, and things were crashing because of this.
Catching Up As If It Were Never Down
A few minutes later, I had the queue processing code fixed up to handle that scenario, and the 60,000 messages stuck in the queue were processing through. It took about 3 hours to finish processing the backlog of messages.
The good news is that once the data was finished processing, the service looked like it never had any issues. All of the analytics information from several days of down time showed up. The previous hiccups in service on the front end were long forgotten. My customers had the reports they needed, and they were all up to date again.
Once again, RabbitMQ had saved my SaaS from critical failure by holding all of my messages until I was able to fix the code and process them correctly.
Do You See The Trend?
During the message queue outage, the media services still worked. They sent episodes to the listeners, even though they couldn’t send tracking events to the message queue. The code to send a message to the queue happened asynchronously, after the request for the episode was fulfilled. Serve the episode first, and then try to send the message queue message.
During the database outage, the media services still worked… I could still read data from the database, but I couldn’t write data. The media services don’t rely on being able to write data. I explicitly made a decision quite some time ago, to allow the database write to fail and still serve the podcast episode. As long as the database can be read, the files can be served.
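The "serve first, track later" decision can be sketched in a few lines. This is an illustration of the idea, not code from the actual system - the names (serveEpisode, loadEpisode, writeTrackingRecord-style calls) are hypothetical:

```javascript
// Sketch: serve the episode first, and let the tracking write fail without
// failing the request. All names here are illustrative, not from SignalLeaf.
function serveEpisode(episodeId, db) {
  const episode = db.read(episodeId); // read access is the only hard requirement

  try {
    db.write({ episodeId, servedAt: Date.now() }); // best-effort tracking write
  } catch (err) {
    // swallow the failure: tracking is not on the critical path
    console.error('tracking write failed:', err.message);
  }

  return episode; // the listener still gets the file
}

// A fake database whose writes always fail, to show the episode is still served
const brokenDb = {
  read: (id) => ({ id, title: 'Episode ' + id }),
  write: () => { throw new Error('write limit reached'); }
};
```

The point of the try/catch placement is the design decision from the story: the only operation allowed to abort the request is the read.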
By keeping the critical part of my service as simple as possible – I just need read access to the database, to serve episodes – I was able to keep the most important parts of the system up and running while the rest of it crashed and burned around me.
The additional features that I want in my system happen outside the critical path by sending messages across RabbitMQ and handling it behind the scenes. Things ran smoothly because of the messaging system and patterns that I had put in place.
Learn The Patterns That Will Make It Easier
As with any tool in software development, there are patterns and anti-patterns of use. There's also a blurry line between pattern and anti-pattern in many cases. A good use for one method in one scenario might be a terrible use in another scenario. So how do you know which is which? What patterns should you look at, to begin with? And how do these patterns apply to RabbitMQ?
Starting tomorrow, you'll get the answers you need. You'll learn the core patterns that saved my SaaS and allowed podcast listeners to continue getting episodes while the code around the fringes burned and crumbled.
So stay tuned for the emails that are headed to your inbox! Your application, your architecture, and your sanity will thank you.
When I first started working with RabbitMQ, I missed a fundamental part of how I should have been working with it. I was using the "send to queue" feature of the driver that I had found. What I didn't know, back then, was that I bypassed more than half of the benefits that RabbitMQ provides by stuffing messages directly into a queue instead of letting them be routed through the message broker.
Broker? Queue? What?
The thing that tripped me up was not understanding the idea of a broker vs a queue. I had come from a background where we put messages directly into queues. It was what we had available, because we didn't want to spend a ton of extra money for the larger message broker in the system that we were using. But RabbitMQ includes a message broker, and I didn't realize this, nor did I realize the value that this provides.
So, what is the "broker" part of RabbitMQ? In a nutshell, the broker is the feature set that accepts a message, examines it and determines where that message should be delivered. It's the exchanges and the route bindings to the queues, as well as the code that moves the message to the right queue. And this is where some of the most significant value can be found in RabbitMQ - the intelligence that you can build into the routing and bindings.
What Kind Of Intelligence?
Routing in RabbitMQ is done through bindings, and bindings give you a lot of options to create a very intelligent RabbitMQ setup.
Think about this: you have a birthday party invitation you want to mail. This invitation needs to go out to 25 people. Each of these people has a postal address and mail box. So, you create 25 copies and send a copy to each person's postal address. With e-mail, however, you only have to write the message once and the email system will make a copy of it for all 25 people on the recipient list.
RabbitMQ's bindings have the ability to work more like email - they can send multiple copies of a single message to multiple queues. This allows your code to be more efficient because it only has to send 1 message and a virtually unlimited number of queues can receive the message.
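The "email" behavior can be modeled in a few lines. This is a toy in-memory simulation of what a fanout exchange does for you - it is not the RabbitMQ API itself, just an illustration of publish-once, copy-to-every-queue:

```javascript
// A toy in-memory model of a fanout exchange: publish a message once,
// and every bound queue receives its own copy. This simulates the broker's
// behavior; it is not the amqplib API.
class FanoutExchange {
  constructor() { this.queues = []; }
  bind(queue) { this.queues.push(queue); }
  publish(message) {
    // the broker copies the message into every bound queue
    for (const queue of this.queues) queue.push({ ...message });
  }
}

const invites = new FanoutExchange();
const inboxes = [[], [], []]; // three subscribers, each with its own queue
inboxes.forEach((q) => invites.bind(q));

invites.publish({ event: 'party.invitation', from: 'me' });
```

The producer only called publish once, yet every inbox holds its own independent copy - exactly the property the email analogy describes.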
Send To Queue Is An Anti-Pattern
It took a while for me to realize this, but I now see the "send to queue" feature of RabbitMQ as an anti-pattern.
Sure, it's built into the library and protocol. And it's convenient, right? But that doesn't mean you should use it. It's one of those features that exists to make demos simple and to handle some specific scenarios. But generally speaking, "send to queue" is an anti-pattern.
When you're a message producer, you only care about sending the message to the right exchange with the right routing key. When you're a message consumer, you care about the message destination - the queue to which you are subscribed. A message may be sent to the same exchange, with the same routing key, every day, thousands of times per day. But, that doesn't mean it will arrive in the same queue every time.
As message consumers come online and go offline, they can create new queues and bindings and remove old queues and bindings. This perspective of message producers and consumers informs the nature of queues: postal boxes that can change when they need to.
An Ever-Changing Landscape
The dynamic nature of queues and bindings means that you are likely to cause problems by having a message producer assume a specific queue is always there. Sending a message directly to a queue means you might be sending it in to the void, with no queue for it to arrive at. When you do this, you lose messages or get errors thrown. Either way, it's not good.
As I said before, there is a place where "send to queue" is valuable in a system design. It is there for a reason, and you will learn at least one place where this is valuable in the upcoming lessons. But in the vast majority of cases, you should avoid "send to queue" in favor of publishing a message to an exchange.
What Kind Of Data?
Now that you're set for publishing messages, what kind of data should you be putting in the message body? Stay tuned for tomorrow's RMQPatterns email and you'll hear about two very common and useful patterns of data for message bodies.
I was talking with Rob Conery about RabbitMQ vs (insert a bunch of other great options, here) recently. In the conversation, he mentioned a pattern that he was seeing with using Redis as a pub/sub server:
One thing I really like is that they suggest you don’t push your data into the queue - rather you should use IDs and pull the data out when you need to. So for order processing you would save the order in a raw form, then queue a job to run payment on the order, send email, etc, with each job saving and updating the DB as needed.
Have to say, that model seems really clean to me.
- Rob Conery
Rob is suggesting that instead of sending a complete document through the message queue - with all the details of the record or whatever it is you're dealing with - you just send an ID from a database table. On the other side of the message queue, where code is processing the message, you would load the record from the database using the ID.
One Of Many Good Patterns
I often use this pattern - send just an ID across the queue, and then load the record from the database. This is a great way to go when you have a database available on both ends (the message producer and message consumer). You can reduce the size of the message, which helps to improve performance of the queueing system and network traffic in some cases. You can also use this to protect yourself against bad messages - messages with "fake" or malicious data. By having an ID, you are forced to look up the data instead of relying on the data in the message body.
However, this is just one of many options that you can use and choose from in any given situation. Even with the usefulness of this pattern, it is important to understand that a database is not an integration layer. Some of your message consumer processes won't, and/or shouldn't, have access to the same database as the producer. Sometimes you need this level of protection, and sometimes you need to trust the document source and the data in the document.
There is no "one size fits all" for messaging patterns.
Sending A Complete Doc vs An ID
There's nothing wrong with sending just an ID. But there's also nothing wrong with sending a complete document that has all the data associated with it, across a message broker and queue.
For example, if I wanted to just send an ID to process analytics requests for SignalLeaf, I would have to make a database call first. This call would write a record so that I would have an ID. But I don't want to do this. It slows things down too much. Instead, I'm going to send a single message across RabbitMQ that has all the data I need. This will keep everything snappy and fast!
Conversely, when I receive a webhook call from Stripe, SignalLeaf will write the webhook event to a database table. Then it will send a message across RabbitMQ with the event ID. On the other end, I'll load the record from the database and verify the event with the Stripe API. This protects me from malicious data and forgery for analytics, and keeps the message small with just an ID.
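Side by side, the two message-body shapes might look like this. The field names are illustrative, not the actual SignalLeaf or Stripe payloads:

```javascript
// Full-document message: everything the consumer needs is in the body,
// so no database is required on the consuming side. Field names are
// hypothetical examples.
const analyticsMessage = {
  episodeId: 'ep-123',
  listenerIp: '203.0.113.9',
  userAgent: 'PodcatcherApp/2.1',
  requestedAt: '2016-01-01T12:00:00Z'
};

// ID-only message: the consumer must load the record itself, which also
// forces it to verify the data instead of trusting the message body.
const webhookMessage = { eventId: 'evt-456' };

// A consumer of the ID-only message treats the database as the source
// of truth and looks the record up by ID.
function handleWebhookMessage(msg, db) {
  return db.find(msg.eventId);
}
```

Note the trade-off in each shape: the full document is bigger but self-contained; the ID-only message is tiny but couples the consumer to the database.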
Both of these patterns are very useful and chances are, you'll use both patterns in the same system.
Just Documents?
Data patterns are useful in the way they imply behavioral patterns, as you can see. Having a limited data set vs a large data set can change whether or not you need access to a given database. But is there anything beyond the simple data patterns for documents, when dealing with messages? Are these just "documents" that have no context or meaning, other than the document contents?
Stick around for the next RMQPatterns email. There's more to messages than just the body / document.
Events tell you that something happened and give you an opportunity to respond accordingly. Sometimes an event includes data about what changed. Sometimes an event is nothing more than the statement that something changed or happened - it includes no additional information other than the event name. Either way, events are important as they allow other parts of a system to react.
But what's the best way to send an event, through a message? What are the concerns that need to be addressed, and how can RabbitMQ facilitate events effectively?
Event Names
Like JavaScript UI frameworks, it is common for event message types to be named with a hierarchy: "student.enrolled", "program.completed", "connection.lost", etc. These names reference a thing ("student", "program" and "connection") and tell you what happened to that thing ("enrolled", "completed" and "lost").
Note that the event is usually named with past-tense language. This isn't a request for something to be done, it is a statement that something has changed or has happened in the past. An event name like "enroll.student" would be backward and would not imply an event. Rather, the previously stated "student.enrolled" would be more appropriate as it tells us what happened or what changed with the student.
In RabbitMQ, the event name can be a part of the message wrapper or meta-data (aka the message "envelope"), but it is more often a part of the exchange through which the message flows.
Publish/Subscribe For Event Messages
Event messages are typically sent through publish/subscribe ("pub/sub") channels - exchanges with queue bindings that allow a single message to be published to all current subscribers and missed by any subscriber not currently there.
Think of it like a "click" event handler in an application, or a "change" event on a data model. You can have as many subscribers as you want for that click event or change event, but you must have the event subscriber in place if you want it to receive the event. If you trigger a change event on the model, and then later add a subscriber to that event, the subscriber will not receive the previous change event. It will wait until the next change event occurs. The same holds true for the click event or any other event you want to publish in a UI.
Events in messaging work the same way. They are temporal; handled by any active subscriber and missed by anything that connects later. With that in mind, pub/sub is usually implemented with a fanout exchange and exclusive, auto-delete queues in RabbitMQ.
A fanout exchange type ensures every bound queue gets a copy of every message by ignoring any routing key that you specify in the binding or message. To prevent messages from piling up in queues endlessly, an exclusive and auto-deleted queue should be used for each subscriber. This allows a subscriber to receive messages when it is connected, and have the queue deleted when the subscriber disconnects.
Exchange Per Event Type
It's common to have exchanges grouped by object type. There may be an exchange called "Account" where messages related to accounts are published and brokered. This exchange would likely use routing keys for queue binding, though. And that would imply a direct or topic exchange. Since your pub/sub setup needs to be a fanout that ignores routing keys, you will want to create a separate exchange. But what do you name the exchange? How many types of events can be pushed through a single fanout?
It may be tempting to create a fanout exchange with a name like "Account.Events" or "Account.PubSub". This would allow any event message to be published through the exchange. It would place undue burden on your subscriber code, though. Each subscriber would have to know which type of event it should listen for and return events of the wrong type back to the queue. Worse yet, all of this would need to happen in code, instead of letting RabbitMQ do the work for you. This will cause network and message thrashing as a process picks up a message and returns it over and over and over, hoping for the right subscriber to get the message eventually.
A better solution for event messages and pub/sub is to have an exchange per event type.
Rather than sending an "Account.Closed" or "Account.Updated" event through a single exchange, create an "Account.Closed" exchange and a separate "Account.Updated" exchange. Any subscriber that knows how to handle the closed event will create a non-durable, exclusive queue with a binding to the "Account.Closed" exchange. The same applies to the updated event and exchange, or any other event / exchange that you need. When a message is published to this exchange and is routed to the queue, you know the code can handle it because the subscriber would not be there if it couldn't.
Missed Messages and Time To Live
Messages are missed entirely if a subscriber is not present when the event is sent. However, subscribers that are present may find themselves overwhelmed with messages and a backlog may build up. If this happens in a situation where old data can be ignored with no repercussions, you can use RabbitMQ's Time To Live ("TTL") feature to drop old messages.
Say, for example, you have a system that produces system diagnostics messages as described in Chapter 8 of the RabbitMQ Layout ebook. This system could produce a new diagnostics message 50 times a second, with every 10th message restarting the sequence. If your subscriber code only processes 10 messages per second, it would only take 30 seconds to build a backlog of 1,200 messages. With the sequence repeating every 10 messages, you would have 120 copies of each message's data in the queue!
To avoid this monstrous and growing backlog, set a TTL on the subscriber's queue. Having a TTL in place will cause messages to be dropped after the specified time period has elapsed. In this example, a TTL of 1 second would prevent the backlog from growing too large while still giving you 5 instances of each message in the sequence. Once your subscriber disconnects and the auto-delete queue is removed, no more backlog will build up because there will not be a queue in which messages can accumulate.
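In amqplib (the Node.js RabbitMQ client), the exclusive, auto-delete, TTL-limited subscriber queue described above is expressed as options to assertQueue. The exchange name and the live-channel calls in the comment are hypothetical; the option names are amqplib's:

```javascript
// Queue options in the shape amqplib's assertQueue accepts, matching the
// subscriber queue described above. This is a configuration sketch, not a
// complete consumer.
const subscriberQueueOptions = {
  exclusive: true,   // only this connection may use the queue
  autoDelete: true,  // remove the queue when the subscriber disconnects
  arguments: {
    'x-message-ttl': 1000 // drop messages older than 1 second
  }
};

// With a live channel, the subscriber setup would look roughly like:
//   const q = await channel.assertQueue('', subscriberQueueOptions);
//   await channel.bindQueue(q.queue, 'diagnostics', '');
//   await channel.consume(q.queue, handleDiagnosticsMessage);
```

Passing an empty string as the queue name asks RabbitMQ to generate a unique, server-named queue - a common convention for these throwaway subscriber queues.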
From Past To Present
Event messages are an important part of systems that need to coordinate behaviors. But the "toss it over the fence" nature of pub/sub has its share of limitations. For example, if you need to guarantee that one and only one copy of a message is processed, so that the associated functionality is only executed once, pub/sub is a bad idea.
So, how do you get a single message to be processed only once? Start by moving out of the past tense of events, and then stay tuned for tomorrow's email where you'll learn more about the handling of command messages!
Knowing what happened in the past is important and event messages give us that information. But when you need to ensure one copy of a message is handled, and no more than one, look to command messages. With a command, you can tell another system to do work as if you were calling an HTTP or RPC based API, but with all the benefit of asynchronous messaging.
Command Names
In contrast to events, command names are typically written in the present tense. Think of a "command" in the language sense and you'll get the idea of how commands should be named in messages. While language can have an implied subject, as in the command "Run!", a command message should have the subject explicitly stated. This will prevent confusion and simplify the handling of commands.
Command names should represent high-level concepts, not specific details. Rather than saying "turn the steering wheel in a counter-clockwise motion while depressing the pedal to flood the engine with additional fuel", the command may be "vehicle.turn". This command doesn't specify which direction to turn, though. The specifics of the turn should be included in the message document, allowing the command to be generalized. A single message handler can then deal with turning left or right, by varying degrees, rather than having separate handlers for different directions.
Once again, command names in RabbitMQ can be a part of the message envelope, but don't need to be. Unlike events, where the name is typically part of the exchange, the command name is often part of the routing key, with the command subject being the exchange name.
Point-to-Point For Commands
If you ask a coworker to get you a coffee, you probably don't want all 30 of your coworkers that heard you to bring you coffee. By the same token, using a fanout exchange for a command is a bad idea. You don't want every bound queue and subscriber to handle the command. You want a point-to-point channel - an exchange that uses routing keys to send the message to a single queue so that a single worker process can handle it.
Think of it like an API on an object. If you have a "vehicle" object and you want to turn it, you would call "vehicle.turn(/*some parameters*/)". If the "vehicle" variable is null or undefined, you will get an error and the code won't run. You have to ensure the object is there before you make that call, if you want the code to execute.
In RabbitMQ, command messages are usually handled by direct exchanges. This is similar to the object's method call in that a single routing key is matched exactly to the binding key of an exchange and queue. If the routing key is "vehicle.turn", then the binding key between the exchange and the queue must also be "vehicle.turn". Unlike the object example, however, no error will be thrown if a binding does not exist for that key. The message will be sent into the void, instead.
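The exact-match-or-void behavior can be simulated in a few lines. As with the earlier fanout sketch, this is a toy in-memory model of the broker's routing, not the RabbitMQ API:

```javascript
// A toy direct-exchange model: a message is delivered only where the
// routing key exactly matches a binding key; with no match, it vanishes.
class DirectExchange {
  constructor() { this.bindings = new Map(); } // bindingKey -> queue
  bind(bindingKey, queue) { this.bindings.set(bindingKey, queue); }
  publish(routingKey, message) {
    const queue = this.bindings.get(routingKey);
    if (queue) queue.push(message); // exact match required
    // no match: the message is silently dropped, and no error is thrown
  }
}

const vehicle = new DirectExchange();
const turnQueue = [];
vehicle.bind('vehicle.turn', turnQueue);

vehicle.publish('vehicle.turn', { direction: 'left', degrees: 15 });
vehicle.publish('vehicle.stop', { hard: true }); // no binding: lost silently
```

The second publish is the trap the text describes - nothing fails loudly, the message is simply gone.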
Exchange Per Object
Unlike event messages where a single exchange should be dedicated to a single event, command messages can easily be grouped into a single exchange. A single "vehicle" exchange can handle a multitude of commands, using routing keys to ensure the command message is delivered to the correct queue.
You should note, however, that it is possible for a single message to hit multiple destination queues by having more than one binding on the exchange, using the same binding key. Take care not to allow this to happen without very explicit intention. Having multiple handlers for a single command message can result in duplicated work and results.
Once, and Only Once!
The danger of having a single command message handled multiple times may be minimal in some situations. In others, it may be detrimental. Imagine a command to process a credit card transaction, and having it handled by several instances of the processor! Your customers would be charged multiple times and you may end up with claims of fraud against you.
Because of this very real danger in executing a command twice, it is often important to introduce idempotence in to message handling. This allows the same message to be processed multiple times, without affecting the result multiple times. In other words, a message to charge a credit card may be handled three or four times, but only the first successful handling would actually charge the card.
Idempotence can be implemented through various means. One of the most common is to use a unique identifier on a message, and only process that ID once. This typically requires a database or other data storage mechanism to hold on to the ID and the state of the message processing. There are other methods of handling idempotence, but this is a subject deserving its own research for your specific scenario and needs.
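The unique-identifier approach can be sketched like this. A Set stands in for the database table a real system would use to persist processed IDs, and the "charge" is simulated:

```javascript
// A minimal idempotent handler: each message carries a unique ID, and the
// ID is recorded the first time it is processed. The in-memory Set is a
// stand-in for the database a real system would use to persist these IDs.
const processedIds = new Set();
let chargeCount = 0;

function handleChargeCommand(message) {
  if (processedIds.has(message.id)) {
    return false; // duplicate delivery: acknowledge and do nothing
  }
  processedIds.add(message.id);
  chargeCount += 1; // stand-in for actually charging the card
  return true;
}
```

In a production system, the "check then record" step would need to be atomic (a unique database constraint, for instance), or two concurrent deliveries could both slip through the check.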
Sometimes A Command Needs A Response
Telling someone to go do something is one thing... but what if you need them to give you a response to that work being done? Sending a command through a message will only get you half the story in that case. You need a way to reply to the message, and have the reply end up in the right place in your code. But... how?
Stay tuned for tomorrow's email where you'll see how an ID and a return address can make your replies easy to handle.
Often when work is requested the result of that work needs to be known - even if it's only an acknowledgement that the work was done. With the command pattern for messaging, you can easily request work to be done. But this pattern is a great example of "fire and forget" programming where you don't care about the status or results as long as you know the request was made.
To correct for the lack of information that a command message may introduce, a response can be produced. In order to get the response back to the right place, however, the original message will need a few more bits of meta-data along with it.
Request Names and Channels
Requests are similar to commands in many ways, including the use of direct exchanges. Having the same semantics of point-to-point communication, with a single handler for the request, helps to keep the code from causing problems with multiple handlers.
Request names are often similar to command names, as well. However, some requests will have a name that directly implies a response value, such as "vehicle.list" or "student.courses". These names imply a "get" behavior, to return a list of data to the requestor. This isn't just an arbitrary distinction in naming, however. Keeping commands separated from requests is critical to system architecture and preventing side effects. Before getting into that, however, it's important to understand how to facilitate a request/response scenario.
Correlation ID and Reply-To Queues
There are two pieces of information that must be included in a request, if a response is expected. The first is a correlation ID. This is a unique ID that the requestor generates for that one request. It is included in the meta-data of the message and must be included in the response message. Having the correlation ID in the response allows the requestor to know which chunk of code should handle the response. Without the correlation ID, it would be impossible to have multiple requests going out at the same time.
Having a correlation ID is only half the solution, however. It is necessary for the requesting code to handle the response... but how does the response get back to the requesting code in the first place? This is where the reply-to queue comes in. In RabbitMQ, the response is sent through an exclusive (private) queue that the requestor sets up just for responses. The location or address of this private queue is sent with the request message in the same manner as the correlation ID. Once the request handler code has produced the desired response, it sends the response message to the reply-to queue, setting the correlation ID from the original request.
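The requestor's side of this bookkeeping can be sketched as follows. The reply-to queue is modeled as a plain function call, and the names (sendRequest, handleReply) are illustrative:

```javascript
// A sketch of the requestor's correlation bookkeeping: every request gets
// a unique correlation ID and a callback, and each response is matched
// back to its request by that ID. Delivery over the reply-to queue is
// modeled here as a direct function call.
let nextId = 0;
const pendingRequests = new Map(); // correlationId -> response callback

function sendRequest(body, onResponse) {
  const correlationId = 'corr-' + (nextId++);
  pendingRequests.set(correlationId, onResponse);
  // a real system would publish { body, correlationId, replyTo } here
  return correlationId;
}

function handleReply(correlationId, response) {
  const callback = pendingRequests.get(correlationId);
  if (!callback) return; // no matching request: discard the response
  pendingRequests.delete(correlationId);
  callback(response);
}

const results = [];
const idA = sendRequest({ name: 'vehicle.list' }, (r) => results.push(r));
const idB = sendRequest({ name: 'student.courses' }, (r) => results.push(r));

handleReply(idB, ['math', 'physics']); // replies can arrive out of order
handleReply(idA, ['truck']);
```

Because the replies are matched by correlation ID rather than by arrival order, two in-flight requests can be resolved in any order - which is exactly why the ID is required.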
A Do-It-Yourself Pattern, And sendToQueue
RabbitMQ provides both a "reply-to" and "correlationId" field in a message's properties. This lets you facilitate the request/reply functionality, easily. However, it is important to note that the request/reply behaviors are not directly baked in to RabbitMQ. Your code is responsible for using the reply-to and correlation ID fields appropriately.
This may sound like a bad idea, but it does open up new opportunities and create more flexibility in using RabbitMQ. For example, commands can still have a response even though they are not explicitly required to have one like request/response is.
It's also important to note that this is one place where the "sendToQueue" method of RabbitMQ will be used. In the use of a "reply-to" queue, you don't want to go through an exchange that will dynamically determine where to send your message. Rather, you want to place the response message directly in to the reply-to queue. To do this, the "sendToQueue" method will be pulled out of your toolbox. But be aware that this is one of the few places where this method is to be used.
Commands w/ Responses, and Separating Requests
In yesterday's "vehicle.turn" command, the system could turn the wheels by a specified amount and send a response stating that it completed. This is similar to an API on an object that alters the state of the object and provides a response indicating success or failure of the state change. In this situation, you are still executing a command, but you are also getting a response from the command.
When the message implies data to be retrieved with no state being changed, however, you're most likely looking at a request / response scenario. A "vehicle.list" request may return a list of vehicles, but would not alter the list of vehicles in any way. Altering the vehicle list while making a request for information would break the semantics of a request. You did not ask to alter the list. You only asked for the list.
A command should be used to alter the state or data of a system. A request, on the other hand, should never alter the state of the system. This separation of a command vs a query (request) is an important concept in architecture, as discussed in my interview with Anders Ljusberg on the subject of CQRS and messaging (part of my RabbitMQ For Developers bundle). Without the separation of commands and queries, your code will have unpredictable results and state changes. This quickly leads to bad data, race conditions and other problems in your system.
Handling Request Timeouts
With request/response scenarios, there are a few important things to keep in mind:
the request is idempotent (repeatable without modifying any system state, as mentioned above)
the request handler responds within a reasonable time
the response can be thrown away if the requestor is no longer around
The last item on this list is easily handled with exclusive, auto-delete queues for the reply-to queue. The first one is a matter of code quality that must be enforced at a design level. Handling a response in a timely manner, however, implies a timeout and an action to be taken if the specified time period elapses.
In addition to the Time To Live (TTL) used on event messages, the timeout required for a request/response scenario also needs to happen in code.
The typical use case for a request/response scenario is to retrieve data that a user needs to see, from some external system. When the request is made, a timeout can be set to a reasonable time frame (a few milliseconds... or maybe even a few seconds, depending on the exact circumstances). When the timeout elapses, the requesting code should provide notice that the response is missing and move on. If the response comes back later, the correlation ID should no longer match any request that is waiting. When the correlation ID fails to match, the response will be discarded.
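The timeout path can be sketched as an extension of the correlation bookkeeping: when the timeout fires, the correlation ID is forgotten, so a late response no longer matches anything and is thrown away. The timer is triggered explicitly here to keep the sketch synchronous; function names are illustrative:

```javascript
// Simulating the timeout path: once a request's correlation ID is removed,
// a late-arriving response fails to match and is discarded. The timeout is
// invoked explicitly here rather than with a real timer, for clarity.
const pending = new Map(); // correlationId -> response callback
let discardedResponses = 0;

function request(correlationId, callback) {
  pending.set(correlationId, callback);
}

function timeoutElapsed(correlationId) {
  pending.delete(correlationId); // give up: show "data not available"
}

function responseArrived(correlationId, data) {
  const callback = pending.get(correlationId);
  if (!callback) {
    discardedResponses += 1; // late or unknown: throw it away
    return;
  }
  pending.delete(correlationId);
  callback(data);
}

let delivered = null;
request('req-1', (data) => { delivered = data; });
timeoutElapsed('req-1');             // the user has already moved on
responseArrived('req-1', [1, 2, 3]); // arrives too late: discarded
```

Counting discarded responses, as done here, is one way to feed the "log the missed response for system analysis" option mentioned below.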
From a user perspective, expecting to see some data on the screen, several things could happen. The system could provide a default response, could show a "data not available" message of some type, or it could show an error message to the user saying that the information is not available.
What About Exceptionally Long Running Processes?
There are many possible outcomes of the response timing out, including logging the missed response for system analysis and statistics. But whatever the result is, the request should not be allowed to live for extended periods of time, or indefinitely. If you find yourself looking at a long running process that takes minutes or even hours to execute, you should throw request/response out the window entirely.
So, how do you handle an exceptionally long running process?
Stay tuned for tomorrow's email, as I discuss an alternative to request/response to handle just that scenario.