
@HarshaNalluru
Last active March 4, 2021 00:34
#2048-Solver

Problem

In some scenarios when receiving in "peekLock" mode, once the incoming delivery buffer is full (capacity 2048, i.e. there are 2048 outstanding deliveries), new messages are not received.

The drain request triggered by the timeout hangs forever; users would have to force-exit their application in this scenario.

Interestingly, the above behaviour is only seen when receiving from unpartitioned queues. Drain requests seem to work as expected when receiving from partitioned queues and result in returning zero messages.

Note: This isn't a problem as long as users are settling the messages that are being received.
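For context, a rough repro sketch of the scenario, assuming @azure/service-bus v7 against an unpartitioned queue (the connection string and queue name are placeholders):

```ts
import { ServiceBusClient } from "@azure/service-bus";

// Repro sketch: receive in peekLock mode and never settle anything, so
// outstanding deliveries pile up until rhea's incoming buffer (2048) is full.
async function main(): Promise<void> {
  const client = new ServiceBusClient("<connection-string>");
  const receiver = client.createReceiver("<unpartitioned-queue>"); // peekLock is the default

  // Ask for more than 2048 messages and never complete/abandon any of them.
  // Once 2048 deliveries are outstanding, no new messages arrive and the
  // drain triggered by maxWaitTimeInMs hangs, so this call never resolves.
  const messages = await receiver.receiveMessages(3000, { maxWaitTimeInMs: 60_000 });
  console.log(`received ${messages.length} messages`); // never reached
}

main().catch(console.error);
```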

Background

We have given a repro to the service team, and they couldn't figure out what difference between partitioned and unpartitioned queues could cause this.

Going towards a solution

No matter what the service does, it might be a better idea to solve this problem pre-emptively, meaning we never let the circular buffer fill up entirely.

If the circular buffer is full, return the already collected messages for the batching receiver, and notify the user of the streaming receiver through processError.

RHEA says - "If autoaccept is disabled on a receiver, app should ensure that it accepts/releases/rejects the messages received."

Following are the options we could go with to address the problem pre-emptively.

Option 1

When autoaccept = false, keep a count of all the received messages and decrease the count whenever a message is settled (sketched below).
If the count reaches 2048:
	Batching -> return with the already collected messages
	New Batching -> return 0 messages
	Streaming -> raise an error on processError
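
A minimal sketch of the counting idea, assuming a hypothetical helper inside the SB SDK (the class name and wiring are assumptions, not existing SDK code):

```ts
// Hypothetical helper for Option 1: track the number of unsettled deliveries
// when autoaccept is false, and tell the receivers when rhea's circular
// buffer (2048 entries) is about to be exhausted.
const RHEA_DELIVERY_BUFFER_CAPACITY = 2048;

class OutstandingDeliveryTracker {
  private count = 0;

  // Called for every message handed to the user in peekLock mode.
  onMessageReceived(): void {
    this.count++;
  }

  // Called from the settlement path (complete/abandon/defer/deadletter).
  onMessageSettled(): void {
    this.count = Math.max(0, this.count - 1);
  }

  // Receivers consult this before asking rhea for more credit.
  get bufferExhausted(): boolean {
    return this.count >= RHEA_DELIVERY_BUFFER_CAPACITY;
  }
}
```

The batching receiver would check bufferExhausted and return whatever it has collected; the streaming receiver would surface the condition via processError instead of pumping more credit.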

Option 2

Enhanced Option 1
	○ Allow configuring the circular buffer size at rhea

Option 3

Rhea should provide us with a new buffer_overflow event.
Upon buffer_overflow (sketched below):
	Batching -> return with the already collected messages
	New Batching -> return 0 messages
	Streaming -> raise an error on processError
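
A sketch of how the SDK might consume such an event; note that "buffer_overflow" does not exist in rhea today, so the event name and this wiring are purely the proposal, not an existing API:

```ts
import { EventEmitter } from "events";

// Hypothetical wiring for Option 3. Assumes rhea would emit "buffer_overflow"
// on the receiver link when the incoming delivery buffer (2048) fills up.
function wireBufferOverflow(
  rheaReceiver: EventEmitter,                // the underlying rhea receiver link
  mode: "batching" | "streaming",
  resolveBatchEarly: () => void,             // resolve receiveMessages() with what was collected
  processError: (err: Error) => void         // the user's processError handler
): void {
  rheaReceiver.on("buffer_overflow", () => {
    if (mode === "batching") {
      resolveBatchEarly();
    } else {
      processError(
        new Error("The incoming delivery buffer (2048) is full; settle received messages to continue receiving.")
      );
    }
  });
}
```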

Option 4

Tweaking Option 1:
Rhea should allow accessing the buffer size at the SB SDK level (sketched below).
If the size reaches capacity:
	Batching -> return with the already collected messages
	New Batching -> return 0 messages
	Streaming -> raise an error on processError
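
A sketch of the check the SDK would run before issuing more credit. Today the incoming circular buffer is internal to rhea's session, so the size/capacity view below is an assumption about what rhea could expose:

```ts
// Hypothetical view of rhea's incoming delivery buffer for Option 4.
interface IncomingDeliveryBufferView {
  size: number;     // outstanding (unsettled) deliveries currently buffered
  capacity: number; // 2048 by default
}

function shouldStopIssuingCredit(buffer: IncomingDeliveryBufferView): boolean {
  // Stop one short of capacity so the drain/flow machinery never wedges.
  return buffer.size >= buffer.capacity - 1;
}
```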

Option 5

Delivery Manager Map keyed by delivery.id (sketched below)
	○ Gets populated only if autoaccept is false
	○ The onMessageSettled trigger would remove the delivery.id from the Map
If the Map size reaches 2048:
	Batching -> return with the already collected messages
	New Batching -> return 0 messages
	Streaming -> raise an error on processError
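
A minimal sketch of such a delivery manager at the SDK level (names are assumptions, not existing SDK code):

```ts
// Hypothetical delivery manager for Option 5: track unsettled deliveries by
// delivery.id when autoaccept is false, and report when rhea's limit is hit.
const RHEA_DELIVERY_BUFFER_CAPACITY = 2048;

class DeliveryManager<TDelivery> {
  private readonly outstanding = new Map<number, TDelivery>();

  onMessageReceived(deliveryId: number, delivery: TDelivery): void {
    this.outstanding.set(deliveryId, delivery);
  }

  // Hooked into the settlement path (complete/abandon/defer/deadletter).
  onMessageSettled(deliveryId: number): void {
    this.outstanding.delete(deliveryId);
  }

  get isFull(): boolean {
    return this.outstanding.size >= RHEA_DELIVERY_BUFFER_CAPACITY;
  }
}
```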

Option 6

Enhanced Option 5
	○ Allow configuring the circular buffer size at rhea

Coming to the pros/cons..

Option 2 and Option 6 talk about allowing the circular buffer size to be configured at rhea; no matter which option we pick, that change should happen independently. Option 1 and Option 5 are equivalent (maintaining all the delivery ids vs. just the count).

So, we just have to pick between Options 3, 4, and 5:

  • Option 3 - Rhea provides us with a new buffer_overflow event
  • Option 4 - Rhea allows accessing the buffer size at the SB SDK level
  • Option 5 - Tracking the count at SB SDK

I personally like Option 4 as it would not require us to maintain a copy of the buffer state (compared to Option 5), and it might need less convincing at rhea (compared to Option 3).

Option 7

Only ever set a maximum of 2047 credits (in peekLock) on the link instead of maxMessageCount or maxConcurrentCalls; this way, the buffer would never be full (sketched below).
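
A sketch of capping the credits issued by the SDK. Receiver.addCredit() exists in rhea-promise; the helper class and the way credits are replenished on settlement are assumptions for illustration:

```ts
import type { Receiver } from "rhea-promise";

// Option 7 sketch: never put more than 2047 credits on the link in peekLock
// mode, regardless of maxMessageCount / maxConcurrentCalls.
const MAX_PEEK_LOCK_CREDITS = 2047; // one less than rhea's 2048 buffer

class CappedCreditManager {
  private issued = 0;

  constructor(private readonly receiver: Receiver) {}

  // Called instead of receiver.addCredit(maxMessageCount / maxConcurrentCalls).
  request(requested: number): number {
    const granted = Math.min(requested, MAX_PEEK_LOCK_CREDITS - this.issued);
    if (granted > 0) {
      this.issued += granted;
      this.receiver.addCredit(granted);
    }
    return granted; // callers top up later as messages get settled
  }

  // Called when a message is settled, freeing room for more credit.
  onSettled(): void {
    this.issued = Math.max(0, this.issued - 1);
  }
}
```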

With the solution of checking the buffer size for every message (from the investigations while testing Option 4), I do stop receiving after 2047 messages, but since there were still credits on the link, more messages were in flight even though we had stopped receiving.

I could verify this by receiving the messages again: the delivery count was 1 for 2050 messages, as opposed to the 2047 I had in hand. If the credit initialization itself keeps the cap, the problem would be avoided entirely (as expected).
@richardpark-msft

I can't tell if you're saying that the rhea team (well, one person) has told us that option #6 is not an option, but it's clearly the "best" one (i.e., it should be fine to let the customer determine what their appropriate limit is when it's purely a client-side limitation).

RHEA says - "If autoaccept is disabled on a receiver, app should ensure that it accepts/releases/rejects the messages received."

@HarshaNalluru
Author

Option 6 is still on the table.
"rhea" has a TODO comment in its src code to allow configuring the buffer size.

Even if we go with Option 6, users still have to settle the messages at some point to receive more.
