I love the idea. We're essentially reducing the complexity: we delete that Python Lambda together with its IAM configuration and replace the calling code with code that works with an SQS queue. Regarding the visibility timeout, I think we can figure it out as we use this; let's keep it simple for now and just use something relatively sensible. We can also use the same queue later for processing the regional events instead of calling the main Lambda function synchronously.
-
Introduction
During the development of the event-based branch, we ran into a problem with concurrent executions of the main Lambda, in particular when increasing the ASG max size and suspending autoscaling while the on-demand instance is replaced with the spot one.
To solve this I introduced another Lambda, LambdaManageASG, invoked by the main one, with its concurrent executions limited to 1.
I now think I have found a better approach.
I want to share it as an idea first, so that you can point out any drawbacks or problems that I have not taken into account.
Fortunately the changes are small, but the advantages are great.
Let me explain them.
Idea
When the main AutoSpotting [AS] Lambda detects a new spot instance launch (event) or finds a spot instance that is not attached (cron), instead of processing it, it simply inserts a message into a FIFO SQS queue.
The message can be the same event that triggered the Lambda, but when we send it we specify the AutoScalingGroup [ASG] name as the message group ID.
(https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagegroupid-property.html)
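A minimal sketch of the sending side, assuming a boto3-style SQS client is passed in. The function name `enqueue_replacement` and the choice of a SHA-256 hash of the body as deduplication ID are illustrative assumptions, not part of the proposal:

```python
import hashlib
import json


def enqueue_replacement(sqs_client, queue_url, event, asg_name):
    """Send the triggering event to the FIFO queue, grouped by ASG name.

    sqs_client is expected to behave like boto3.client("sqs").
    """
    body = json.dumps(event, sort_keys=True)
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        # All messages for one ASG share a group ID, so they are
        # handed out in order and not processed concurrently.
        MessageGroupId=asg_name,
        # FIFO queues need a deduplication ID unless content-based
        # deduplication is enabled on the queue. Hashing the body is
        # an assumption made for this sketch.
        MessageDeduplicationId=hashlib.sha256(body.encode("utf-8")).hexdigest(),
    )
```

One thing to keep in mind with this choice: SQS drops a message whose deduplication ID repeats within the deduplication interval, so an identical event sent again shortly afterwards (for example by the cron run) would be ignored.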
The FIFO SQS queue then triggers the Lambda again, but this time we execute the "standard code" to replace the on-demand instance with the spot one, and once it has been replaced successfully we delete the message from the queue.
(We can optionally bypass the instance ID to ASG discovery, as we already know the ASG name.)
FIFO SQS queues and message group IDs have this advantage:
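The receiving side could look roughly like this. It is only a sketch: `process_sqs_records` and the `replace_instance` callback are hypothetical names standing in for the existing replacement code, and the client is assumed to behave like boto3's SQS client. The record fields used (`body`, `receiptHandle`, `attributes.MessageGroupId`) are the ones Lambda passes for FIFO queue messages.

```python
import json


def process_sqs_records(event, replace_instance, sqs_client, queue_url):
    """Run the replacement for each SQS record, then delete the message.

    replace_instance(original_event, asg_name) is a placeholder for the
    "standard code"; it should raise on failure.
    """
    processed = []
    for record in event.get("Records", []):
        # The group ID is the ASG name, so no instance-to-ASG lookup is needed.
        asg_name = record["attributes"]["MessageGroupId"]
        original_event = json.loads(record["body"])

        # If this raises, the message is not deleted and becomes visible
        # again once the visibility timeout expires.
        replace_instance(original_event, asg_name)

        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=record["receiptHandle"],
        )
        processed.append(asg_name)
    return processed
```

When the queue is wired to the Lambda through an event source mapping, Lambda normally deletes the batch itself after the handler returns without error, so the explicit `delete_message` call may turn out to be redundant in that setup.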
When messages that belong to a particular message group ID are invisible, no other consumer can process messages with the same message group ID.
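To make that guarantee concrete, here is a toy in-memory model of the behaviour. It is not how SQS is implemented and it simplifies things (one message per receive, no visibility timeout); `ToyFifoQueue` is a made-up class used only for illustration.

```python
class ToyFifoQueue:
    """Toy model: a group with an in-flight message is skipped on receive."""

    def __init__(self):
        self._messages = []      # (group_id, body) in arrival order
        self._in_flight = set()  # group IDs currently being processed

    def send(self, group_id, body):
        self._messages.append((group_id, body))

    def receive(self):
        """Return the oldest message whose group is not in flight, or None."""
        for index, (group_id, body) in enumerate(self._messages):
            if group_id not in self._in_flight:
                self._in_flight.add(group_id)
                del self._messages[index]
                return group_id, body
        return None

    def delete(self, group_id):
        """Mark the in-flight message of this group as done."""
        self._in_flight.discard(group_id)


queue = ToyFifoQueue()
queue.send("asg-a", "i-1")
queue.send("asg-a", "i-2")
queue.send("asg-b", "i-3")

first = queue.receive()    # ("asg-a", "i-1")
second = queue.receive()   # ("asg-b", "i-3"), because asg-a is busy
third = queue.receive()    # None, the remaining message belongs to asg-a
queue.delete("asg-a")
fourth = queue.receive()   # ("asg-a", "i-2")
```

In other words, two replacements for the same ASG are serialized, while different ASGs can still be processed in parallel.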
Summary
So at the end we have:
We need to properly configure the visibility timeout.
There are two possibilities:
simple, but if the Lambda gets stuck or crashes we are unable to replace an on-demand instance belonging to the same ASG for 15 min.
(The CloudFormation template must be changed and becomes bigger.)
Usually AS is pretty stable and does not crash, so I think that we can go for the first, simpler approach.
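For the simpler approach, the queue settings could be sketched as below. The helper name `fifo_queue_attributes` is hypothetical, and the 900-second default is just the 15 minutes mentioned above expressed in seconds. The attribute names and the 0 to 43200 second range are the ones SQS accepts.

```python
def fifo_queue_attributes(visibility_timeout_seconds=900):
    """Build the Attributes map for creating the FIFO queue.

    SQS expects attribute values as strings and allows a visibility
    timeout between 0 and 43200 seconds (12 hours).
    """
    if not 0 <= visibility_timeout_seconds <= 43200:
        raise ValueError("visibility timeout must be within 0..43200 seconds")
    return {
        "FifoQueue": "true",
        "VisibilityTimeout": str(visibility_timeout_seconds),
    }
```

With boto3 this map would be passed as `Attributes` to `create_queue`, with a queue name ending in `.fifo`; the same two properties exist on the CloudFormation `AWS::SQS::Queue` resource.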
Code Changes
2.1. sqs==True: we execute a new func that sends the SQS message after having discovered the ASG name, and end execution.
2.2. sqs==False: we execute the standard code to replace the relevant on-demand instance and, once it has been replaced successfully, execute a new func to delete the SQS message.
As you can see, the changes are not big, but we need to dismantle or change the part of the code related to the execution of LambdaManageASG.
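The branching in 2.1 and 2.2 could be sketched like this. How the `sqs` flag is computed is not spelled out above, so this sketch assumes it is True whenever the incoming event did not come from the queue; `is_from_sqs`, `handle`, and the two callbacks are hypothetical names.

```python
def is_from_sqs(event):
    """True if every record in the event was delivered by SQS."""
    records = event.get("Records") or []
    return bool(records) and all(
        record.get("eventSource") == "aws:sqs" for record in records
    )


def handle(event, enqueue, replace_and_delete):
    """Dispatch between the two paths.

    enqueue(event): discovers the ASG name and sends the SQS message.
    replace_and_delete(event): runs the standard replacement code and
    then deletes the SQS message.
    """
    sqs = not is_from_sqs(event)  # assumption: flag means "needs enqueueing"
    if sqs:
        enqueue(event)
        return "enqueued"
    replace_and_delete(event)
    return "replaced"
```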
Let me know what you think of this approach.
If you agree with me that it can lead to a better workflow, I can begin to write some of the necessary code and create a working Lambda to test.
Best regards, Alberto