
@HarshaNalluru
Last active March 4, 2021 00:34
#2048-Solver

Problem

In some scenarios when receiving in "peekLock" mode, once the incoming delivery buffer is full (capacity 2048, i.e. there are 2048 outstanding deliveries), new messages are not received.

The drain request triggered by the timeout hangs forever; users would have to force-exit their application in this scenario.

Interestingly, the above behaviour is only seen when receiving from unpartitioned queues. Drain requests seem to work as expected when receiving from partitioned queues and result in returning zero messages.

Note: This isn't a problem as long as users are settling the messages that are being received.
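For context, a rough repro sketch of the scenario, assuming @azure/service-bus v7 against an unpartitioned queue (the connection string and queue name are placeholders):

```ts
import { ServiceBusClient } from "@azure/service-bus";

// Repro sketch: receive in peekLock mode and never settle anything, so
// outstanding deliveries pile up until rhea's incoming buffer (2048) is full.
async function main(): Promise<void> {
  const client = new ServiceBusClient("<connection-string>");
  const receiver = client.createReceiver("<unpartitioned-queue>"); // peekLock is the default

  // Ask for more than 2048 messages and never complete/abandon any of them.
  // Once 2048 deliveries are outstanding, no new messages arrive and the
  // drain triggered by maxWaitTimeInMs hangs, so this call never resolves.
  const messages = await receiver.receiveMessages(3000, { maxWaitTimeInMs: 60_000 });
  console.log(`received ${messages.length} messages`); // never reached
}

main().catch(console.error);
```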

Background

We have given a repro to the service team, and they couldn't figure out what difference between partitioned and unpartitioned queues could cause this.

Going towards a solution

No matter what the service does, it might be a better idea to solve this problem pre-emptively, meaning we never let the circular buffer fill up entirely.

If the circular buffer is full, return the already collected messages for the batching receiver, and notify the user of the streaming receiver through processError.

RHEA says - "If autoaccept is disabled on a receiver, app should ensure that it accepts/releases/rejects the messages received."

Following are the options we could go with to address the problem pre-emptively.

Option 1

When autoaccept = false, keep a count of all the received messages and decrease the count whenever a message is settled (sketched below).
If the count reaches 2048:
	Batching -> return with the already collected messages
	New Batching -> return 0 messages
	Streaming -> raise an error on processError
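
A minimal sketch of the counting idea, assuming a hypothetical helper inside the SB SDK (the class name and wiring are assumptions, not existing SDK code):

```ts
// Hypothetical helper for Option 1: track the number of unsettled deliveries
// when autoaccept is false, and tell the receivers when rhea's circular
// buffer (2048 entries) is about to be exhausted.
const RHEA_DELIVERY_BUFFER_CAPACITY = 2048;

class OutstandingDeliveryTracker {
  private count = 0;

  // Called for every message handed to the user in peekLock mode.
  onMessageReceived(): void {
    this.count++;
  }

  // Called from the settlement path (complete/abandon/defer/deadletter).
  onMessageSettled(): void {
    this.count = Math.max(0, this.count - 1);
  }

  // Receivers consult this before asking rhea for more credit.
  get bufferExhausted(): boolean {
    return this.count >= RHEA_DELIVERY_BUFFER_CAPACITY;
  }
}
```

The batching receiver would check bufferExhausted and return whatever it has collected; the streaming receiver would surface the condition via processError instead of pumping more credit.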

Option 2

Enhanced Option 1
	○ Allow configuring the circular buffer size at rhea

Option 3

Rhea should provide us with a new buffer_overflow event.
Upon buffer_overflow (sketched below):
	Batching -> return with the already collected messages
	New Batching -> return 0 messages
	Streaming -> raise an error on processError
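
A sketch of how the SDK might consume such an event; note that "buffer_overflow" does not exist in rhea today, so the event name and this wiring are purely the proposal, not an existing API:

```ts
import { EventEmitter } from "events";

// Hypothetical wiring for Option 3. Assumes rhea would emit "buffer_overflow"
// on the receiver link when the incoming delivery buffer (2048) fills up.
function wireBufferOverflow(
  rheaReceiver: EventEmitter,                // the underlying rhea receiver link
  mode: "batching" | "streaming",
  resolveBatchEarly: () => void,             // resolve receiveMessages() with what was collected
  processError: (err: Error) => void         // the user's processError handler
): void {
  rheaReceiver.on("buffer_overflow", () => {
    if (mode === "batching") {
      resolveBatchEarly();
    } else {
      processError(
        new Error("The incoming delivery buffer (2048) is full; settle received messages to continue receiving.")
      );
    }
  });
}
```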

Option 4

Tweaking Option 1:
Rhea should allow accessing the buffer size at the SB SDK level (sketched below).
If the size reaches capacity:
	Batching -> return with the already collected messages
	New Batching -> return 0 messages
	Streaming -> raise an error on processError
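
A sketch of the check the SDK would run before issuing more credit. Today the incoming circular buffer is internal to rhea's session, so the size/capacity view below is an assumption about what rhea could expose:

```ts
// Hypothetical view of rhea's incoming delivery buffer for Option 4.
interface IncomingDeliveryBufferView {
  size: number;     // outstanding (unsettled) deliveries currently buffered
  capacity: number; // 2048 by default
}

function shouldStopIssuingCredit(buffer: IncomingDeliveryBufferView): boolean {
  // Stop one short of capacity so the drain/flow machinery never wedges.
  return buffer.size >= buffer.capacity - 1;
}
```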

Option 5

Delivery Manager Map keyed by delivery.id (sketched below)
	○ Gets populated only if autoaccept is false
	○ The onMessageSettled trigger would remove the delivery.id from the Map
If the Map size reaches 2048:
	Batching -> return with the already collected messages
	New Batching -> return 0 messages
	Streaming -> raise an error on processError
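
A minimal sketch of such a delivery manager at the SDK level (names are assumptions, not existing SDK code):

```ts
// Hypothetical delivery manager for Option 5: track unsettled deliveries by
// delivery.id when autoaccept is false, and report when rhea's limit is hit.
const RHEA_DELIVERY_BUFFER_CAPACITY = 2048;

class DeliveryManager<TDelivery> {
  private readonly outstanding = new Map<number, TDelivery>();

  onMessageReceived(deliveryId: number, delivery: TDelivery): void {
    this.outstanding.set(deliveryId, delivery);
  }

  // Hooked into the settlement path (complete/abandon/defer/deadletter).
  onMessageSettled(deliveryId: number): void {
    this.outstanding.delete(deliveryId);
  }

  get isFull(): boolean {
    return this.outstanding.size >= RHEA_DELIVERY_BUFFER_CAPACITY;
  }
}
```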

Option 6

Enhanced Option 5
	○ Allow configuring the circular buffer size at rhea

Coming to the pros/cons..

Option 2 and Option 6 talk about allowing the circular buffer size to be configured at rhea; no matter which option we pick, that change should happen independently. Option 1 and Option 5 are equivalent (maintaining all the delivery ids vs. just the count).

So, we just have to pick between Options 3, 4, and 5:

  • Option 3 - Rhea provides us with a new buffer_overflow event
  • Option 4 - Rhea allows accessing the buffer size at the SB SDK level
  • Option 5 - Tracking the count at SB SDK

I personally like Option 4 as it would not require us to maintain a copy of the buffer state (compared to Option 5), and it might need less convincing at rhea (compared to Option 3).

Option 7

Only ever set a maximum of 2047 credits (in peekLock) on the link instead of maxMessageCount or maxConcurrentCalls; this way, the buffer would never be full (sketched below).
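
A sketch of capping the credits issued by the SDK. Receiver.addCredit() exists in rhea-promise; the helper class and the way credits are replenished on settlement are assumptions for illustration:

```ts
import type { Receiver } from "rhea-promise";

// Option 7 sketch: never put more than 2047 credits on the link in peekLock
// mode, regardless of maxMessageCount / maxConcurrentCalls.
const MAX_PEEK_LOCK_CREDITS = 2047; // one less than rhea's 2048 buffer

class CappedCreditManager {
  private issued = 0;

  constructor(private readonly receiver: Receiver) {}

  // Called instead of receiver.addCredit(maxMessageCount / maxConcurrentCalls).
  request(requested: number): number {
    const granted = Math.min(requested, MAX_PEEK_LOCK_CREDITS - this.issued);
    if (granted > 0) {
      this.issued += granted;
      this.receiver.addCredit(granted);
    }
    return granted; // callers top up later as messages get settled
  }

  // Called when a message is settled, freeing room for more credit.
  onSettled(): void {
    this.issued = Math.max(0, this.issued - 1);
  }
}
```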

With the solution of checking the buffer size for every message (from the investigations while testing Option 4), I do stop receiving after 2047 messages, but since there were still credits on the link, more messages were in flight even though we had stopped receiving.

I could verify this by receiving the messages again: the delivery count was 1 for 2050 messages, as opposed to the 2047 I had in hand. If the credit initialization itself keeps the cap, the problem would be avoided entirely (as expected).
@richardpark-msft

I can't tell if you're saying that the rhea team (well, one person) has told us that option #6 is not an option, but it's clearly the "best" one (i.e., it should be fine to let the customer determine what their appropriate limit is when it's purely a client-side limitation).

RHEA says - "If autoaccept is disabled on a receiver, app should ensure that it accepts/releases/rejects the messages received."

@HarshaNalluru
Author

Option 6 is still on the table.
"rhea" has a TODO comment in its src code to allow configuring the buffer size.

Even if we go with Option 6, users still have to settle the messages at some point to receive more.
