New workflow to get rid of LambdaManageASG #457

mello7tre · 2021-04-18T13:45:35Z


Introduction

During the development of the event-based branch, we ran into the problem of concurrent executions of the main Lambda, in particular around increasing the ASG max size and suspending Auto Scaling while the on-demand instance is replaced with the spot one.
To solve this, I introduced another Lambda, LambdaManageASG, with concurrent execution set to 1, invoked by the main one.

I now think I have found a better approach.
I want to share it first as an idea, so that you can point out any drawbacks or problems I have not taken into account.
Fortunately the changes are small, but the advantages are great.
Let me explain them.

Idea

When the main AutoSpotting [AS] Lambda detects a new spot instance launch (event) or finds a spot instance that is not attached (cron), instead of processing it, it simply inserts a message into a FIFO SQS queue.
The message can be the same event that triggered the Lambda, but when we send it we specify the AutoScalingGroup [ASG] name as the message group ID.
(https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagegroupid-property.html)
The FIFO SQS queue then triggers the Lambda again, but this time we execute the "standard code" to replace the on-demand instance with the spot one and, once it has been successfully replaced, we delete the message from the queue.
(Optionally, we can skip the instance-ID-to-ASG discovery, as we already know the ASG name.)
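To make the message shape concrete, here is a minimal Go sketch of how the forwarded event could be wrapped. The `fifoMessage` type and the `newReplacementMessage` helper are hypothetical names used only for illustration, not AutoSpotting code or AWS SDK types. Their fields stand in for the `MessageBody`, `MessageGroupId` and `MessageDeduplicationId` parameters of SQS `SendMessage`. Deriving the deduplication ID from the ASG name and instance ID is an assumption, not part of the proposal; a FIFO queue needs either an explicit deduplication ID or content-based deduplication enabled.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fifoMessage is a local stand-in for the SendMessage parameters that
// matter here. It is not an AWS SDK type.
type fifoMessage struct {
	Body            string // the original event, forwarded unchanged
	GroupID         string // would be sent as MessageGroupId
	DeduplicationID string // would be sent as MessageDeduplicationId
}

// newReplacementMessage (hypothetical helper) wraps the triggering event
// and uses the ASG name as the message group ID, as proposed above.
func newReplacementMessage(asgName, instanceID, rawEvent string) fifoMessage {
	// Assumption: one logical message per (ASG, instance) pair, so the
	// event path and the cron path produce the same deduplication ID.
	sum := sha256.Sum256([]byte(asgName + "/" + instanceID))
	return fifoMessage{
		Body:            rawEvent,
		GroupID:         asgName,
		DeduplicationID: hex.EncodeToString(sum[:]),
	}
}

func main() {
	m := newReplacementMessage("web-asg", "i-0abc", `{"detail":{"instance-id":"i-0abc"}}`)
	fmt.Println("group:", m.GroupID)
	fmt.Println("dedup:", m.DeduplicationID)
}
```

With this choice, a spot instance reported by both the event and the cron path shortly after each other would be enqueued only once, which may or may not be the desired behavior.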

A FIFO SQS queue with message group IDs has this advantage:
While messages that belong to a particular message group ID are invisible, no other consumer can process messages with the same message group ID.

Summary

So in the end we have:

- Spot requests (on-demand launch events) are handled in parallel.
- On-demand replacements (spot launch events) are handled:
  - in parallel if they belong to different ASGs
  - sequentially if they belong to the same ASG
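The ordering guarantee summarized above can be modeled locally without AWS. The following Go sketch is only a simulation of the per-group behavior, not of SQS itself: `processByGroup` and `recordOrder` are hypothetical names, and each goroutine plays the role of the single consumer that SQS allows per message group at a time.

```go
package main

import (
	"fmt"
	"sync"
)

// message models a queued replacement request.
type message struct {
	GroupID string // ASG name, used as the message group ID
	Body    string // instance ID, for illustration only
}

// processByGroup calls handle for every message. Messages sharing a
// GroupID are handled one at a time in arrival order, while different
// groups are handled concurrently.
func processByGroup(msgs []message, handle func(message)) {
	// Partition by group, preserving arrival order inside each group.
	groups := make(map[string][]message)
	for _, m := range msgs {
		groups[m.GroupID] = append(groups[m.GroupID], m)
	}

	var wg sync.WaitGroup
	for _, batch := range groups {
		wg.Add(1)
		go func(batch []message) { // one worker per ASG
			defer wg.Done()
			for _, m := range batch {
				handle(m)
			}
		}(batch)
	}
	wg.Wait()
}

// recordOrder returns, per group, the order in which bodies were handled.
func recordOrder(msgs []message) map[string][]string {
	var mu sync.Mutex
	order := make(map[string][]string)
	processByGroup(msgs, func(m message) {
		mu.Lock()
		defer mu.Unlock()
		order[m.GroupID] = append(order[m.GroupID], m.Body)
	})
	return order
}

func main() {
	order := recordOrder([]message{
		{GroupID: "web-asg", Body: "i-1"},
		{GroupID: "api-asg", Body: "i-2"},
		{GroupID: "web-asg", Body: "i-3"},
	})
	fmt.Println(order["web-asg"], order["api-asg"])
}
```

In this model the two `web-asg` replacements never overlap, while the `api-asg` one is free to run at the same time, which is the property that made the dedicated single-concurrency Lambda necessary before.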

We need to configure the visibility timeout properly.
There are two possibilities:

- Use the same value as the main Lambda's max execution time (15 min): simple, but if the Lambda gets stuck or crashes we are unable to replace an on-demand instance belonging to the same ASG for 15 min.
- Create a second Lambda identical to the main one (same code, permissions and so on) but with a max execution time of 60 sec (which should be sufficient to replace a single on-demand instance with a spot one). This second Lambda is the SQS queue target. (The CloudFormation template must be changed and becomes bigger.)

Usually AS is pretty stable and does not crash, so I think we can go for the first, simpler approach.

Code Changes

1. We need to handle the SQS event in EventHandler [core.go] and, once the relevant ASG name has been discovered, send a message with the group ID equal to the ASG name.
2. We can add a parameter (sqs=True/False) to handleNewSpotInstanceLaunch [core.go]:
   2.1. sqs==True: we execute a new func that sends the SQS message after having discovered the ASG name, and end execution.
   2.2. sqs==False: we execute the standard code to replace the relevant on-demand instance and, once it has been successfully replaced, execute a new func to delete the SQS message.
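The two branches of the `sqs` parameter could look roughly like the Go sketch below. It is only an illustration under assumptions: the real signature of `handleNewSpotInstanceLaunch` in core.go is not shown in this discussion, and the `queue` interface, the `spotLaunch` struct, the `memQueue` fake and `errReplaceFailed` are all hypothetical names introduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// queue abstracts the two SQS operations the proposal needs.
// Hypothetical interface, not part of AutoSpotting.
type queue interface {
	Send(groupID, body string) error
	Delete(receiptHandle string) error
}

// spotLaunch carries what the handler needs; field names are assumptions.
type spotLaunch struct {
	InstanceID    string
	ASGName       string
	ReceiptHandle string // only set when the event arrived through SQS
}

var errReplaceFailed = errors.New("replacement failed")

// handleNewSpotInstanceLaunch sketches the proposed flow:
// sqs == true enqueues and returns, sqs == false replaces and then deletes.
func handleNewSpotInstanceLaunch(q queue, ev spotLaunch, sqs bool, replace func(spotLaunch) error) error {
	if sqs {
		// Step 2.1: ASG name is known, enqueue with it as group ID and stop.
		return q.Send(ev.ASGName, ev.InstanceID)
	}
	// Step 2.2: run the standard replacement code.
	if err := replace(ev); err != nil {
		// The message is not deleted, so it becomes visible again once
		// the visibility timeout expires.
		return fmt.Errorf("replacing on-demand for %s: %w", ev.InstanceID, err)
	}
	return q.Delete(ev.ReceiptHandle)
}

// memQueue is an in-memory fake used only to exercise the sketch.
type memQueue struct {
	sent    []string
	deleted []string
}

func (m *memQueue) Send(groupID, body string) error {
	m.sent = append(m.sent, groupID+":"+body)
	return nil
}

func (m *memQueue) Delete(receiptHandle string) error {
	m.deleted = append(m.deleted, receiptHandle)
	return nil
}

func main() {
	q := &memQueue{}
	ev := spotLaunch{InstanceID: "i-0abc", ASGName: "web-asg", ReceiptHandle: "rh-1"}
	ok := func(spotLaunch) error { return nil }
	fail := func(spotLaunch) error { return errReplaceFailed }

	fmt.Println(handleNewSpotInstanceLaunch(q, ev, true, ok))    // enqueue only
	fmt.Println(handleNewSpotInstanceLaunch(q, ev, false, fail)) // message kept
	fmt.Println(handleNewSpotInstanceLaunch(q, ev, false, ok))   // message deleted
	fmt.Println(q.sent, q.deleted)
}
```

Deleting the message only after a successful replacement means a failed attempt is retried when the message reappears, at the cost of blocking that ASG's group until the visibility timeout expires.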

As you can see, the changes are not big, but we need to dismantle or change the part of the code related to the execution of LambdaManageASG.

Let me know what you think of this approach.
If you agree that it can lead to a better workflow, I can begin writing some of the necessary code and create a working Lambda to test.

Best regards, Alberto

cristim · 2021-04-24T19:54:37Z

Maintainer

I love the idea, we're essentially reducing the complexity: we delete that Python Lambda together with its IAM configuration and replace the calling code with code working with an SQS queue.

Regarding the visibility timeout, I think we can figure it out as we use this. Let's keep it simple for now and just use something relatively sensible.

We can also use the same queue later for processing the regional events instead of calling the main Lambda function synchronously.

1 reply

mello7tre Apr 25, 2021
Author

#458
