core: improve worker lifetime #9439

Merged: 4 commits, Nov 18, 2024

bougue-pe
Contributor

🔍 review the commits separately, ignore whitespace on first commit.

fixes #9388

Done:

  • fix rabbit's connection closing
  • stop worker when unable to process requests after infra load

Details:
After an infra load, the worker is still able to process requests if the load:

  • ... went fine: happy path (just process requests)
  • ... failed with a perennial (permanent) error: the expected behavior is to reject requests on the same version of the same infra

If the infra load failed with a temporary error (and after possible retries), it is better to just stop the worker and let the orchestrator decide whether to retry.
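
The three outcomes described above can be summarized as a small decision function. This is only an illustrative sketch, not the project's code: `LoadOutcome`, `WorkerAction` and `action_after_infra_load` are hypothetical names introduced here, and the real worker may structure this decision differently.

```python
from enum import Enum, auto


class LoadOutcome(Enum):
    """Hypothetical outcomes of an infra load (names are illustrative)."""
    LOADED = auto()           # happy path
    PERENNIAL_ERROR = auto()  # would fail again for the same infra version
    TEMPORARY_ERROR = auto()  # might succeed if the load were retried later


class WorkerAction(Enum):
    """Hypothetical actions the worker can take after the load."""
    PROCESS_REQUESTS = auto()  # serve requests normally
    REJECT_REQUESTS = auto()   # keep running, answer requests with the error
    STOP_WORKER = auto()       # exit and let the orchestrator decide


def action_after_infra_load(outcome: LoadOutcome) -> WorkerAction:
    """Map the infra load outcome to the behavior described in this PR."""
    if outcome is LoadOutcome.LOADED:
        return WorkerAction.PROCESS_REQUESTS
    if outcome is LoadOutcome.PERENNIAL_ERROR:
        return WorkerAction.REJECT_REQUESTS
    # Temporary error: no retry is implemented, so the worker stops.
    return WorkerAction.STOP_WORKER
```

The point of the last branch is that the worker does not try to be clever: stopping hands the retry decision to whatever supervises it.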

No retry is implemented so far (easier to test the behavior... and easier to code 😇).
Hand-tested on the cases I could think of and reproduce; more testing is welcome.

@bougue-pe bougue-pe added the area:core (Work on Core Service) and kind:architecture (Software architecture work) labels Oct 23, 2024
@bougue-pe bougue-pe requested a review from a team as a code owner October 23, 2024 16:26
@github-actions github-actions bot removed the area:core (Work on Core Service) label Oct 23, 2024
@codecov-commenter

codecov-commenter commented Oct 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 37.83%. Comparing base (cb99790) to head (401947e).
Report is 33 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #9439      +/-   ##
==========================================
- Coverage   37.84%   37.83%   -0.01%     
==========================================
  Files         990      990              
  Lines       90915    90920       +5     
  Branches     1176     1176              
==========================================
- Hits        34404    34401       -3     
- Misses      56057    56065       +8     
  Partials      454      454              
Flag Coverage Δ
editoast 73.23% <ø> (-0.02%) ⬇️
front 20.11% <ø> (-0.01%) ⬇️
gateway 2.18% <ø> (ø)
osrdyne 3.28% <ø> (ø)
railjson_generator 87.49% <ø> (ø)
tests 86.74% <ø> (ø)

@bougue-pe bougue-pe force-pushed the peb/core/improve_worker_lifetime branch from 58671c6 to 3bc9bad Compare October 23, 2024 16:47
Contributor

@woshilapin woshilapin left a comment

Will definitely improve the situation, thank you.

Contributor

@eckter eckter left a comment

LGTM (not tested though)

Though I feel a little uncomfortable about the isCacheable flag in ErrorType. Apparently it was there but not set for any error; now we set it for InfraSoftLoadingError. But then most of the logic is done by checking for InfraSoftLoadingError specifically.

Maybe we could just remove the flag? Or always use it instead of checking for this specific error.

@bougue-pe
Contributor Author

@eckter

> Though I feel a little uncomfortable about the isCacheable flag in ErrorType. Apparently it was there but not set for any error; now we set it for InfraSoftLoadingError. But then most of the logic is done by checking for InfraSoftLoadingError specifically.
>
> Maybe we could just remove the flag? Or always use it instead of checking for this specific error.

I'm OK with removing the flag (I won't have time to do it before 4 November 2024).
There's a bit of work to keep things coherent and ensure consistency (between handler, path_finding and simulation):

  • I'd probably rename the flag to "unrecoverable"
  • then also add a unique "filter_recoverable" method to be used in all 3 places
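
The proposal in the two bullets above could look roughly like the following. This is a hedged sketch only: the thread does not say what `filter_recoverable` would actually do, so the behavior here (return unrecoverable errors to the caller, re-raise everything else) is an assumption, and `ErrorType`, `WorkerError` and `run_step` are simplified, hypothetical stand-ins rather than the project's types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorType:
    """Illustrative stand-in for an error type carrying the proposed flag."""
    name: str
    unrecoverable: bool


class WorkerError(Exception):
    """Hypothetical exception wrapping an ErrorType."""

    def __init__(self, error_type: ErrorType):
        super().__init__(error_type.name)
        self.error_type = error_type


def filter_recoverable(error: WorkerError) -> WorkerError:
    """Assumed behavior: keep unrecoverable errors, re-raise the others."""
    if not error.error_type.unrecoverable:
        # The error might go away on retry: let it propagate to the caller.
        raise error
    # Retrying cannot help: hand the error back so it can be reported.
    return error


def run_step(step):
    """Single wrapper that all call sites could share (hypothetical)."""
    try:
        return step()
    except WorkerError as error:
        return filter_recoverable(error)
```

Routing every call site through one helper means the flag is the only place that decides how an error is treated, instead of each call site checking for a specific error.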

@bougue-pe bougue-pe force-pushed the peb/core/improve_worker_lifetime branch 3 times, most recently from a243a35 to da81c3d Compare November 7, 2024 17:33
@bougue-pe
Contributor Author

@eckter : introduced a 'simple' isRecoverable in da81c3d

This led to cases where the worker would hang forever when throwing an exception.

Signed-off-by: Pierre-Etienne Bougué <bougue.pe@proton.me>
After an infra load, the worker is still able to process requests if the load:
* ... went fine: happy path (just process requests)
* ... failed with a perennial (permanent) error: the expected behavior is to
  reject requests on the same version of the same infra

If the infra load failed with a temporary error (and after possible retries),
it is better to just stop the worker and let the orchestrator decide whether to retry.

No retry implemented so far.

Signed-off-by: Pierre-Etienne Bougué <bougue.pe@proton.me>
So far any recoverable error is "cacheable/serializable" for simulation
or pathfinding.

Signed-off-by: Pierre-Etienne Bougué <bougue.pe@proton.me>
make sure handle() is called only when not recoverable

Signed-off-by: Pierre-Etienne Bougué <bougue.pe@proton.me>
@bougue-pe bougue-pe force-pushed the peb/core/improve_worker_lifetime branch from 954a328 to 401947e Compare November 15, 2024 20:20
@bougue-pe bougue-pe added this pull request to the merge queue Nov 18, 2024
Merged via the queue into dev with commit 67a6cc9 Nov 18, 2024
27 checks passed
@bougue-pe bougue-pe deleted the peb/core/improve_worker_lifetime branch November 18, 2024 15:19
Labels
kind:architecture Software architecture work
Development

Successfully merging this pull request may close these issues.

core: health check should be KO if invalid infra or non-existing (404 from editoast)
6 participants