Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Goroutines may be prioritized incorrectly #54

Open
eytan-avisror opened this issue Apr 10, 2020 · 2 comments
Open

Goroutines may be prioritized incorrectly #54

eytan-avisror opened this issue Apr 10, 2020 · 2 comments

Comments

@eytan-avisror
Copy link
Collaborator

It was noticed when a huge spike of terminating instances happen (150 instances + another 150 instances after few minutes), we may not have enough goroutines or do not prioritize some goroutines efficiently.

The result was that and event that was received started processing 5-6 minutes later and by that time the heartbeat had already expired.

It's possible that newWorker goroutine is trying to spawn but there are so many other goroutines that it cannot.

We should load test and evaulate:

  • Giving more priority to newWorker and sendHeartbeat goroutines.
  • Overall reduce number of goroutines we spin up

The goal is to make sure we can handle large spikes of scale-downs without starvation

@eytan-avisror
Copy link
Collaborator Author

One improvement that may improve this is refactoring server.go
This block should not be called if message already exist, we should check

// process messags from channel
for message := range mgr.eventStream {
go mgr.newWorker(message)
}

We should move these checks up

event, err := readMessage(message)
if err != nil {
err = errors.Wrap(err, "failed to read message")
mgr.RejectEvent(err, event)
return
}
event.SetQueueURL(url)
if !event.IsValid() {
err = errors.Wrap(err, "received invalid event")
mgr.RejectEvent(err, event)
return
}
if event.IsAlreadyExist(mgr.workQueue) {
err := errors.New("event already exists")
mgr.RejectEvent(err, event)
return
}

@eytan-avisror
Copy link
Collaborator Author

Message validation has been refactored to avoid spinning up worker goroutines for invalid/duplicate messages, this might help a bit.
But it seems the best improvement we can make here is around the deregistration waiters which spin up N * Cluster TargetGroups * Terminating Instances goroutines, which make up 99% of the goroutines we spin up.
We should come up with some approach to do this differently if possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant