Circuit breaker implementation in scriptorium #22730

Open
wants to merge 7 commits into
base: main

shubhi1092
Contributor

@shubhi1092 shubhi1092 commented Oct 4, 2024

Description

This PR adds circuit breaker functionality to the scriptorium lambda. It handles exceptions where restarting the service does not help and we want to wait and retry instead. For example, when MongoDB is unavailable or down and scriptorium cannot write ops to the db, restarting the service doesn't help; instead we wait and retry after some time. The circuit breaker pattern helps in such cases by maintaining an open/closed/halfOpen state.
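To make the open/closed/halfOpen states concrete, here is a minimal sketch of the transitions as a pure function. The type and event names (`CircuitState`, `CircuitEvent`, `nextState`) are hypothetical and chosen for illustration; they are not necessarily what this PR uses.

```typescript
// Hypothetical names throughout; this is an illustration, not the code in this PR.
type CircuitState = "closed" | "open" | "halfOpen";

type CircuitEvent =
    | "thresholdExceeded" // too many qualifying failures while closed
    | "resetTimeoutElapsed" // waited long enough while open
    | "healthCheckPassed"
    | "healthCheckFailed";

// Pure transition function: returns the next state for a given event.
// Events that do not apply to the current state leave it unchanged.
function nextState(state: CircuitState, event: CircuitEvent): CircuitState {
    switch (state) {
        case "closed":
            return event === "thresholdExceeded" ? "open" : "closed";
        case "open":
            return event === "resetTimeoutElapsed" ? "halfOpen" : "open";
        case "halfOpen":
            if (event === "healthCheckPassed") return "closed";
            if (event === "healthCheckFailed") return "open";
            return "halfOpen";
    }
}
```

Keeping the transitions in one pure function makes them easy to unit test independently of Kafka or MongoDB.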

In scriptorium, all calls to the db are wrapped by the circuit breaker. When such errors occur, the circuit opens and pauses the lambda (i.e. pauses the incoming messages). After some time, the circuit moves to the halfOpen state and calls a healthCheck function: if it succeeds, the circuit closes and the incoming messages resume; otherwise it stays open and paused.
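The flow above could be sketched roughly as follows. This is a simplified illustration, not the implementation in this PR: `SimpleCircuitBreaker`, `BreakerHooks`, `execute`, and `probe` are hypothetical names, and the sketch assumes the threshold is a simple count of consecutive failures and that something external (for example a timer) calls `probe` once the reset timeout has elapsed.

```typescript
// Hypothetical sketch; class and callback names are assumptions, not this PR's API.
type BreakerState = "closed" | "open" | "halfOpen";

interface BreakerHooks {
    healthCheck: () => Promise<void>; // rejects while the db is still unavailable
    pause: () => void; // stop consuming incoming messages
    resume: () => void; // start consuming incoming messages again
}

class SimpleCircuitBreaker {
    private state: BreakerState = "closed";
    private failures = 0;

    constructor(
        private readonly hooks: BreakerHooks,
        private readonly errorThreshold: number,
    ) {}

    public getState(): BreakerState {
        return this.state;
    }

    // Runs a db call through the breaker; opens the circuit after repeated failures.
    public async execute<T>(dbCall: () => Promise<T>): Promise<T> {
        if (this.state !== "closed") {
            throw new Error("circuit is not closed; call rejected");
        }
        try {
            const result = await dbCall();
            this.failures = 0;
            return result;
        } catch (error) {
            this.failures++;
            if (this.failures >= this.errorThreshold) {
                this.state = "open";
                this.hooks.pause();
            }
            throw error;
        }
    }

    // Meant to be called after the reset timeout: open -> halfOpen -> closed or open.
    public async probe(): Promise<void> {
        if (this.state !== "open") return;
        this.state = "halfOpen";
        try {
            await this.hooks.healthCheck();
            this.state = "closed";
            this.failures = 0;
            this.hooks.resume();
        } catch {
            this.state = "open"; // health check failed: stay open and paused
        }
    }
}
```

A caller would wrap each write as `breaker.execute(() => writeOps(...))`, where `writeOps` stands in for whatever db call is being protected.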

We can configure various options, such as the error threshold, the reset timeout, and the errors for which we want to engage the circuit breaker. Also, if the circuit is not able to close or resume within a configurable period, we fall back to restarting the service to avoid waiting endlessly.
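As an illustration of what such options might look like, here is a small sketch. The option names, the default values, and the error-matching regular expression are all assumptions made for this example; the actual configuration keys and defaults in this PR may differ.

```typescript
// Hypothetical option names and values; not the actual configuration in this PR.
interface CircuitBreakerOptions {
    errorThreshold: number; // failures before the circuit opens
    resetTimeoutMs: number; // wait before moving from open to halfOpen
    fallbackToRestartTimeoutMs: number; // max time the circuit may stay open before restarting
    errorFilter: (error: Error) => boolean; // true means this error should engage the breaker
}

const defaultOptions: CircuitBreakerOptions = {
    errorThreshold: 5,
    resetTimeoutMs: 30000,
    fallbackToRestartTimeoutMs: 180000,
    // Assumed example: only connectivity-style errors engage the breaker.
    errorFilter: (error) => /unavailable|timed out|ECONNREFUSED/i.test(error.message),
};

// Decides whether the circuit has been open long enough to give up and restart.
function shouldFallbackToRestart(
    openedAtMs: number,
    nowMs: number,
    options: CircuitBreakerOptions,
): boolean {
    return nowMs - openedAtMs >= options.fallbackToRestartTimeoutMs;
}
```

Errors rejected by the filter (for example, a validation error) would be handled by the existing error path instead of opening the circuit.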

This PR is for scriptorium, and once we validate and roll this out in production, we will add the same pattern for document lambdas too.

Summary of changes made in this PR:

  • Circuit Breaker Implementation: Adds a circuit breaker pattern to scriptorium->db calls, with various configuration options for error thresholds, reset timeouts, and error filters.
  • Pause and Resume Methods: Adds pause and resume methods for lambdas, context, documentContext, partition, partitionManager, kafkaRunner, rdKafkaConsumer, and lambda to manage message flow during circuit breaker states.
  • Health Check for MongoDB: Adds a health check method to the MongoDB class and exposes a healthCheck property from the MongoManager class.
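The pause and resume methods need to reach the Kafka consumer through several layers. A rough sketch of that delegation is below; the `Pausable` interface, the `FakeConsumer` and `ForwardingLayer` classes, and the layering order are all hypothetical simplifications, and the real classes have richer signatures.

```typescript
// Hypothetical sketch: the real classes have richer signatures and responsibilities.
interface Pausable {
    pause(): void;
    resume(): void;
}

// Stand-in for the Kafka consumer at the bottom of the chain.
class FakeConsumer implements Pausable {
    private paused = false;
    public pause(): void {
        this.paused = true;
    }
    public resume(): void {
        this.paused = false;
    }
    public isPaused(): boolean {
        return this.paused;
    }
}

// A layer that simply forwards pause/resume to the layer beneath it.
class ForwardingLayer implements Pausable {
    constructor(
        public readonly name: string,
        private readonly next: Pausable,
    ) {}
    public pause(): void {
        this.next.pause();
    }
    public resume(): void {
        this.next.resume();
    }
}

// The layering order here is assumed for illustration only.
const consumer = new FakeConsumer();
const layerNames = ["kafkaRunner", "partitionManager", "partition", "context"];
const outermost: Pausable = layerNames.reduce<Pausable>(
    (next, name) => new ForwardingLayer(name, next),
    consumer,
);
```

Calling `outermost.pause()` when the circuit opens stops consumption at the bottom of the chain, and `outermost.resume()` restarts it once the circuit closes.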

Testing

  • Added unit tests for circuit breaker.
  • Tested the scriptorium end-to-end functionality locally by forcing the db to be unavailable in the local setup.
  • Testing in dev cluster (in progress)

We will roll this out slowly by testing in each ring.

@github-actions bot added the labels area: server (Server related issues (routerlicious)), base: main (PRs targeted against main branch), and changeset-present, and removed the changeset-present label on Oct 4, 2024
@shubhi1092 force-pushed the shuagarwal/scriptorium-circuit-breaker branch from da2459b to 798511b on October 4, 2024 22:03
@shubhi1092 marked this pull request as ready for review on October 4, 2024 22:05
@shubhi1092 requested a review from a team as a code owner on October 4, 2024 22:05
Labels
area: server Server related issues (routerlicious) base: main PRs targeted against main branch changeset-present
1 participant