@helephant
Last active November 23, 2022 10:30
Do step functions wait for an available lambda if they hit a concurrency limit or do they fail?

We have a process that spawns lots of lambdas, and we were worried it might hit a concurrency limit. We wanted to know what a step function would do in that situation. The AWS documentation says that poll-based services have a built-in retry mechanism when lambdas are throttled, so we were hoping that the subsequent steps would simply wait until there was enough capacity.

So I built a step function with three steps, each with a sleep in it, and set the reserved concurrency to 1 for the middle step's lambda. Then I kicked off three executions of the step function at the same time.

The result was that the second instance of the step function failed on the throttled lambda with the error message: "Lambda.TooManyRequestsException"

There are retry options in step functions, but it is interesting to know that, on their own, they don't wait for a lambda slot to become available.

exports.handler = async (event) => {
  // evil code just for debugging purposes
  function sleep(time) {
    return new Promise((resolve) => setTimeout(resolve, time));
  }

  // Usage!
  await sleep(30000);

  const response = {
    statusCode: 200,
    body: JSON.stringify("<number> function finished"),
  };
  return response;
};
{
  "StartAt": "FirstStepFunction",
  "States": {
    "FirstStepFunction": {
      "Type": "Task",
      "Resource": "<arn to lambda 1>",
      "Next": "SecondStepFunction"
    },
    "SecondStepFunction": {
      "Type": "Task",
      "Resource": "<arn to lambda 2>",
      "Next": "ThirdStepFunction"
    },
    "ThirdStepFunction": {
      "Type": "Task",
      "Resource": "<arn to lambda 3>",
      "End": true
    }
  }
}
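
The definition above has no `Retry` field, so nothing in it tells the step function to try again after a throttle. As a minimal sketch, this is roughly how the middle step could retry on the error seen in this experiment. It is written as a JavaScript object that serializes to the JSON form above. The `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` values are illustrative assumptions, not settings that were tested here, and `<arn to lambda 2>` is still a placeholder.

```javascript
// Sketch: the throttled middle step with a Retry policy added.
// IntervalSeconds, MaxAttempts and BackoffRate are assumed values for illustration.
const secondStepWithRetry = {
  Type: "Task",
  Resource: "<arn to lambda 2>", // placeholder, as in the definition above
  Retry: [
    {
      // Error name reported by the failed execution in this experiment
      ErrorEquals: ["Lambda.TooManyRequestsException"],
      IntervalSeconds: 5, // wait before the first retry
      MaxAttempts: 6, // retries allowed before the state fails
      BackoffRate: 2.0, // multiplier applied to the wait on each retry
    },
  ],
  Next: "ThirdStepFunction",
};

// Serialize to the JSON form used in a state machine definition.
const secondStepJson = JSON.stringify(secondStepWithRetry, null, 2);
console.log(secondStepJson);
```

The printed JSON could replace the `SecondStepFunction` entry in the definition. Whether six attempts is enough depends on how long the competing executions hold the single reserved slot.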
@kbradl16

Interesting, any new insight since you made this?

@helephant
Author

I can say that this applies to SQS and SNS as well, which is why AWS recommends you always have multiple retries configured.

The documentation really isn't very clear on what's event-based and what's poll-based, or on how the retries work.

Are you interested in anything in particular? We didn't end up using a step function for our process, but I have spent lots of time working with lambdas, API GWs, queues and topics over the last 12 months.

@kbradl16

Gotcha. I’m using step functions and starting to look into scaling so was just curious if you found any best practices for lots of concurrent step function executions.

@renedacosta

Hi, I'm currently looking into using step functions to better orchestrate my serverless application and wanted to understand how they deal with lambda throttling. In particular, when an invocation fails all retries because the lambda is throttled, would this be captured by the step function, or will I still need a DLQ?

My current architecture is a system of SQS queues and lambdas that perform various validations on CSV files before attempting to ingest the data into a database. Depending on whether it passes or fails, each lambda sends the event to a subsequent SQS queue or DLQ, which is then polled by a different lambda.
I'd like to migrate all this logic into a state machine so I can better trace the files, but I am worried about the possibility of losing data due to throttling.

Any advice or suggestions on this would be great.

@harnekmanj

I am currently investigating Step Functions for orchestrating a workflow, and I had the same question: how to make sure the compute resources (in this case the Lambda function) are available, and whether the step function can retry if they are throttled.

The page https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html, under the section "Configuring a queue for use with Lambda", recommends setting the visibility timeout to at least 6 times the timeout that's configured for the Lambda function.

As I understand it, the same guideline could be followed when configuring Retry in a step function. Per the section "Retrying after an Error" on https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html, "IntervalSeconds" could be set to around 6 times the Lambda timeout. If you would rather start with a shorter interval, Retry also lets you define a "BackoffRate" so that the wait grows with each attempt.
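
To make the trade-off between the two options concrete, here is a small sketch that computes the waits a Retry policy would produce: the wait before retry n is `IntervalSeconds` multiplied by `BackoffRate` to the power n - 1. The helper `retryDelays` and the 30 second Lambda timeout are assumptions for illustration (30 seconds matches the sleep in the gist's handler), and the sketch ignores any maximum-delay cap or jitter settings.

```javascript
// Sketch: waits produced by a Step Functions Retry policy.
// The wait before retry n (1-based) is IntervalSeconds * BackoffRate^(n - 1).
// retryDelays is a hypothetical helper, not part of any AWS SDK.
function retryDelays({ intervalSeconds, backoffRate, maxAttempts }) {
  const delays = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    delays.push(intervalSeconds * Math.pow(backoffRate, attempt - 1));
  }
  return delays;
}

// Assumed example: a Lambda with a 30 second timeout.
const lambdaTimeoutSeconds = 30;

// Option 1: start at 6x the timeout and keep the wait constant.
const fixed = retryDelays({
  intervalSeconds: 6 * lambdaTimeoutSeconds,
  backoffRate: 1,
  maxAttempts: 3,
});

// Option 2: start at 1x the timeout and let BackoffRate grow the wait.
const growing = retryDelays({
  intervalSeconds: lambdaTimeoutSeconds,
  backoffRate: 2,
  maxAttempts: 4,
});

console.log(fixed); // [ 180, 180, 180 ]
console.log(growing); // [ 30, 60, 120, 240 ]
```

With these assumed numbers, the second option retries much sooner but only reaches a wait longer than 6 times the timeout on its fourth retry.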

@mhvelplund

As everyone here knows by now: no, they don't wait.
Step functions do allow you to configure a retry scheme with incrementing backoff, triggered by lambda throttling.
