Reduce arbitrator retry attempts to keep the operation at 1hz #2415
PR Details
Description
This PR aims to fix an issue where the arbitrator freezes because of too many repeated failed service calls.
With a 500 ms service call timeout, the arbitrator can afford at most 2 attempts per call (the initial call plus one retry) and still operate at 1 Hz. The cycle will run slightly longer than the 1 s period because successful calls also take time to return, but that is acceptable.
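The budget above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are hypothetical and are not part of the arbitrator code, and it assumes every attempt blocks for the full timeout and that attempts run one after another.

```python
def worst_case_block_ms(timeout_ms: int, attempts: int) -> int:
    """Worst-case wait when every sequential attempt times out."""
    return timeout_ms * attempts


def max_attempts(period_ms: int, timeout_ms: int) -> int:
    """Largest number of sequential attempts that fits in one period."""
    return period_ms // timeout_ms


if __name__ == "__main__":
    # Values taken from this PR description: 500 ms timeout, 1 Hz operation.
    print(worst_case_block_ms(500, 10))  # previous setting: 5000 ms
    print(worst_case_block_ms(500, 2))   # this PR: 1000 ms
    print(max_attempts(1000, 500))       # 2
```

With 10 attempts the worst case is 5 s of blocking on a single failing call, while 2 attempts keeps it at the 1 s period.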
The arbitrator should still retry at least once, because a ROS service call can occasionally fail due to a ROS error.
Without a retry, the next planning cycle would come 1 second later, which is too late for some planning tasks, such as checking for a red light in lci_strategic_plugin.
This may not be a full fix (we should consider threaded calls so that the 500 ms waits are not sequential), but it should significantly reduce the issues we encounter when running carma-platform out of the box.
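As a rough illustration of the threaded-call idea mentioned above (not what this PR implements), the sketch below issues several blocking calls concurrently so that their timeouts overlap instead of adding up. Everything here is an assumption made for illustration: `call_with_timeout` is a stand-in that simulates a service call with `time.sleep`, the plugin names are hypothetical, and no ROS API is used.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_timeout(name: str, reply_delay_s: float, timeout_s: float):
    """Simulated blocking call: succeeds only if the reply beats the timeout."""
    time.sleep(min(reply_delay_s, timeout_s))
    return name, reply_delay_s <= timeout_s


def call_all_concurrently(reply_delays: dict, timeout_s: float) -> dict:
    """Run all simulated calls in parallel; total wait is about one timeout."""
    with ThreadPoolExecutor(max_workers=max(1, len(reply_delays))) as pool:
        futures = [
            pool.submit(call_with_timeout, name, delay, timeout_s)
            for name, delay in reply_delays.items()
        ]
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    # Hypothetical plugins: one responsive, two that never reply in time.
    delays = {"plugin_a": 0.01, "plugin_b": 5.0, "plugin_c": 5.0}
    start = time.monotonic()
    results = call_all_concurrently(delays, timeout_s=0.5)
    print(results, f"elapsed ~{time.monotonic() - start:.2f} s")
```

In this simulation the two unresponsive calls time out in parallel, so the total wait stays near a single 500 ms timeout rather than growing with the number of failing calls.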
Related GitHub Issue
#2385
Related Jira Key
https://usdot-carma.atlassian.net/browse/CAR-6039
Motivation and Context
Demos after 4.5.0 frequently hit this issue by default because some nodes were converted to ROS2 without thorough integration testing. As a result, some nodes fail to activate, either for that reason or simply because of the host machine's performance, and repeated calls to those failed nodes freeze the arbitrator for up to 5 s with 10 retry attempts.
How Has This Been Tested?
Integration tested locally.
Types of changes
Checklist: