Circuit breaker implementation in scriptorium #22730
Open
Description
This PR adds circuit breaker functionality to the scriptorium lambda. It handles exceptions for which restarting the service does not help and we instead want to wait and retry. For example, when MongoDB is unavailable and scriptorium cannot write ops to the db, restarting the service doesn't help; instead, we wait and retry after some time. The circuit breaker pattern helps in such cases by maintaining an open/closed/halfOpen state.
In scriptorium, all calls to the db are wrapped by the circuit breaker. When such errors occur, the circuit opens and pauses the lambda (i.e. pauses the incoming messages). After some time, the circuit moves to the halfOpen state and calls a healthCheck function: if it succeeds, the circuit closes and the incoming messages resume; otherwise the circuit stays open and the lambda stays paused.
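The open/halfOpen/closed flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `SketchCircuitBreaker`, the `LambdaControls` interface, and every parameter name are hypothetical, and the sketch is synchronous with an injectable clock for brevity, whereas real db calls and health checks would be asynchronous.

```typescript
// Hypothetical sketch; none of these names come from the PR itself.
type CircuitState = "closed" | "open" | "halfOpen";

// Assumed hooks for pausing and resuming the lambda's incoming messages.
interface LambdaControls {
    pause(): void;
    resume(): void;
}

class SketchCircuitBreaker {
    private state: CircuitState = "closed";
    private failures = 0;
    private openedAt = 0;

    constructor(
        private readonly lambda: LambdaControls,
        private readonly healthCheck: () => boolean,
        private readonly errorThreshold: number,
        private readonly resetTimeoutMs: number,
        private readonly now: () => number = Date.now,
    ) {}

    getState(): CircuitState {
        return this.state;
    }

    // Wraps a db call: counts consecutive failures and opens the circuit
    // (pausing the lambda) once the threshold is reached.
    fire<T>(dbCall: () => T): T {
        if (this.state !== "closed") {
            throw new Error("circuit is not closed; call rejected");
        }
        try {
            const result = dbCall();
            this.failures = 0;
            return result;
        } catch (error) {
            this.failures += 1;
            if (this.failures >= this.errorThreshold) {
                this.state = "open";
                this.openedAt = this.now();
                this.lambda.pause();
            }
            throw error;
        }
    }

    // Meant to be called periodically while the circuit is open.
    tryReset(): void {
        if (this.state !== "open" || this.now() - this.openedAt < this.resetTimeoutMs) {
            return;
        }
        this.state = "halfOpen";
        let healthy = false;
        try {
            healthy = this.healthCheck();
        } catch {
            healthy = false;
        }
        if (healthy) {
            // Health check passed: close the circuit and resume messages.
            this.state = "closed";
            this.failures = 0;
            this.lambda.resume();
        } else {
            // Still unhealthy: stay open and restart the reset timer.
            this.state = "open";
            this.openedAt = this.now();
        }
    }
}
```

In this sketch the lambda stays paused through every failed health check, and only a successful check resumes it, which mirrors the behaviour described in the paragraph above.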
We can configure various options, such as the error threshold, the reset timeout, and the errors for which we want to engage the circuit breaker. Also, if the circuit is not able to close and resume within a configurable period, we fall back to restarting the service to avoid waiting endlessly.
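As a rough illustration of the options mentioned above, a configuration shape and the two decisions it drives might look like the sketch below. The interface, field names, and values are all assumptions made for the example, not the settings introduced by this PR.

```typescript
// Hypothetical option names and values; the real config keys may differ.
interface SketchBreakerOptions {
    // Consecutive failures allowed before the circuit opens.
    errorThreshold: number;
    // How long the circuit stays open before trying a health check.
    resetTimeoutMs: number;
    // Substrings of error messages that should engage the breaker.
    engageOnErrors: string[];
    // How long the circuit may stay open before we restart the service.
    restartFallbackMs: number;
}

const exampleOptions: SketchBreakerOptions = {
    errorThreshold: 3,
    resetTimeoutMs: 30_000,
    engageOnErrors: ["MongoNetworkError", "connection timed out"],
    restartFallbackMs: 10 * 60_000,
};

// Only errors matching the configured list engage the breaker;
// anything else keeps the existing error handling path.
function shouldEngageBreaker(error: Error, options: SketchBreakerOptions): boolean {
    return options.engageOnErrors.some(
        (pattern) => error.name.includes(pattern) || error.message.includes(pattern),
    );
}

// If the circuit has been open for longer than the fallback window,
// give up waiting and restart the service instead.
function shouldFallBackToRestart(
    openedAtMs: number,
    nowMs: number,
    options: SketchBreakerOptions,
): boolean {
    return nowMs - openedAtMs >= options.restartFallbackMs;
}
```

Keeping the error filter and the restart fallback as separate checks means an unrelated error never opens the circuit, and an open circuit can never wait longer than the configured fallback window.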
This PR covers scriptorium only. Once we validate it and roll it out in production, we will add the same pattern to the document lambdas too.
Summary of changes made in this PR:
Testing
We will roll this out slowly by testing in each ring.