2024 06 25 Eclipse iceoryx developer meetup
Mathias Kraus edited this page Jun 25, 2024 · 4 revisions
Date: 2024/06/25
Time: 17:00 CET
https://giphy.com/gifs/hail-hypnotoad-rou0CTAp6Z8VW/fullscreen
- https://github.com/eclipse-iceoryx/iceoryx/issues/2193
- https://github.com/eclipse-iceoryx/iceoryx/issues/325
- Mathias Kraus, ekxide IO GmbH
- Graham
- Niclas
- Hrudhansh
- Discuss the root cause of #2193 and #325
- Possible solutions
- Work distribution
1.1 The keep alive thread does not wake up after the system time changes
- the thread is waiting in a semaphore timed_wait call
- timed_wait requires CLOCK_REALTIME, which is affected by changes in the local time
- additionally, the heartbeat is sent as a timestamp
- the timestamp uses the monotonic clock
- jumps into the future can still happen with the monotonic clock
- when RouDi checks the last heartbeat, its own current timestamp might have been taken after a jump and is then too far in the future compared to the heartbeat timestamp
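The effect of a clock change on the timed wait can be illustrated with a minimal sketch. It only models the deadline arithmetic; `realtime_deadline_after` and `remaining_wait_ns` are hypothetical helpers, not iceoryx functions, and the actual `sem_timedwait` call is left out. The assumption is a POSIX system where the timed wait takes an absolute `CLOCK_REALTIME` deadline.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

constexpr std::int64_t NS_PER_SEC = 1000000000LL;

// Converts a timespec to nanoseconds for simple arithmetic.
std::int64_t to_ns(const timespec& ts) {
    return static_cast<std::int64_t>(ts.tv_sec) * NS_PER_SEC + ts.tv_nsec;
}

// Absolute deadline in the form a POSIX timed semaphore wait expects:
// the current CLOCK_REALTIME value plus a relative timeout.
// (Hypothetical helper, not part of iceoryx.)
timespec realtime_deadline_after(std::int64_t timeout_ns) {
    assert(timeout_ns >= 0);
    timespec now{};
    clock_gettime(CLOCK_REALTIME, &now);
    const std::int64_t deadline_ns = to_ns(now) + timeout_ns;
    timespec deadline{};
    deadline.tv_sec = static_cast<time_t>(deadline_ns / NS_PER_SEC);
    deadline.tv_nsec = static_cast<long>(deadline_ns % NS_PER_SEC);
    return deadline;
}

// Models how long the waiter still blocks, given the value CLOCK_REALTIME
// reports as "now". Setting the clock back makes this grow, setting it
// forward makes the wait expire immediately.
// (Hypothetical helper, not part of iceoryx.)
std::int64_t remaining_wait_ns(const timespec& deadline, const timespec& realtime_now) {
    const std::int64_t remaining = to_ns(deadline) - to_ns(realtime_now);
    return remaining > 0 ? remaining : 0;
}
```

With a 100 ms timeout, a wall clock that is set back by one hour right after the deadline was computed leaves the thread blocked for roughly one hour plus 100 ms, which matches the observation that the keep alive thread does not wake up.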
1.2 There is a mutex to guard adding and removing subscriber queues to the publisher
- the mutex also needs to be locked when the publisher accesses the subscriber queues for publishing
- this is quite fast since there is no contention unless subscribers are added to or removed from the publisher
- nevertheless, there is a small window in which the application could terminate abnormally and leave a locked mutex behind
- this usually only happens in one of the following cases:
- a crash -> can be prevented on the application side, e.g. by not running threads in parallel when publishing samples
- not handling signals, which leads to the destructors not being run -> can be prevented by handling signals
- sending SIGKILL -> unfortunately this is also done by RouDi when monitoring is turned on and RouDi does not receive a heartbeat for some time; RouDi assumes the application is unresponsive and sends a SIGKILL in order to safely reclaim the resources
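The signal handling case can be sketched as follows. This is a minimal, generic C++ pattern and not taken from iceoryx; `request_shutdown`, `install_shutdown_handlers` and `run_until_shutdown` are hypothetical names. The handler only sets a flag, the publish loop returns normally, and the destructors in the caller then run. SIGKILL cannot be caught, so this does not help in the RouDi monitoring case.

```cpp
#include <cassert>
#include <csignal>

// Set from the signal handler; sig_atomic_t is safe to write there.
volatile std::sig_atomic_t g_shutdown_requested = 0;

// Signal handler: does nothing except record the shutdown request.
extern "C" void request_shutdown(int) {
    g_shutdown_requested = 1;
}

// Installs the handler for the usual termination signals.
void install_shutdown_handlers() {
    std::signal(SIGINT, request_shutdown);
    std::signal(SIGTERM, request_shutdown);
}

// Hypothetical publish loop: runs until a shutdown was requested or the
// iteration limit is reached, and returns the number of iterations done.
// Returning normally lets the caller unwind and run its destructors.
int run_until_shutdown(int max_iterations) {
    assert(max_iterations >= 0);
    int iterations = 0;
    while (g_shutdown_requested == 0 && iterations < max_iterations) {
        // publishing samples would happen here
        ++iterations;
    }
    return iterations;
}
```

An application would call `install_shutdown_handlers()` once at startup and keep the publisher in a scope that is left when `run_until_shutdown` returns.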
2.1 Possible solution for the monitoring issue
- create a timer_create abstraction
- use that abstraction in combination with a blocking semaphore wait instead of the semaphore timed_wait
- use a counter for the heartbeat instead of the timestamp
- add a mechanism for RouDi to check when the counter does not change for some time to detect unresponsive applications
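A counter based heartbeat could look roughly like the sketch below. All names (`Heartbeat`, `HeartbeatWatcher`, `max_missed_checks`) are assumptions for illustration and not the iceoryx API; in a real setup the counter would live in shared memory and RouDi would call `check` from a periodic timer. The point is that no timestamps are compared, so clock jumps cannot produce a bogus age.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Application side: a counter that is incremented periodically.
// (Hypothetical type; would be placed in shared memory.)
struct Heartbeat {
    std::atomic<std::uint64_t> counter{0};

    void beat() { counter.fetch_add(1, std::memory_order_relaxed); }
};

// RouDi side: counts consecutive checks in which the counter did not change.
// (Hypothetical type.)
class HeartbeatWatcher {
  public:
    explicit HeartbeatWatcher(std::uint32_t max_missed_checks)
        : max_missed_(max_missed_checks) {
        assert(max_missed_checks > 0);
    }

    // Call at a fixed interval. Returns true while the application is
    // considered responsive, false once the counter stayed unchanged for
    // max_missed_checks consecutive calls.
    bool check(const Heartbeat& hb) {
        const std::uint64_t current = hb.counter.load(std::memory_order_relaxed);
        if (current != last_seen_) {
            last_seen_ = current;
            missed_ = 0;
        } else {
            ++missed_;
        }
        return missed_ < max_missed_;
    }

  private:
    std::uint64_t last_seen_{0};
    std::uint32_t missed_{0};
    std::uint32_t max_missed_;
};
```

The detection time is then the check interval multiplied by `max_missed_checks`, and only the check interval on the RouDi side depends on a timer.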
2.2 Possible solution for the locking issue
- revisit https://github.com/eclipse-iceoryx/iceoryx/pull/748
- create a lock-free ring buffer for history
- a workaround might be to skip the resource cleanup if the mutex is still locked and just leak the resource
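The workaround could be sketched like this. It is only an illustration under assumptions: `QueueListLock` is a hypothetical stand-in for the inter-process mutex, and `PublisherPort`, `CleanupResult` and `cleanup_after_termination` are made-up names, not iceoryx types. The idea is that the cleanup tries to take the lock without blocking and leaves the data untouched if the terminated application still holds it.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the inter-process mutex that guards the
// subscriber queue list of a publisher.
class QueueListLock {
  public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }

    void unlock() {
        const bool was_locked = locked_.exchange(false, std::memory_order_release);
        assert(was_locked);
        (void)was_locked;
    }

  private:
    std::atomic<bool> locked_{false};
};

// Hypothetical publisher data; the ints stand for subscriber queue ids.
struct PublisherPort {
    QueueListLock lock;
    std::vector<int> subscriber_queues;
};

enum class CleanupResult { Released, Leaked };

// Cleanup after the owning application terminated.
CleanupResult cleanup_after_termination(PublisherPort& port) {
    if (!port.lock.try_lock()) {
        // The application died while holding the lock, so the queue list
        // may be half modified. Skip the cleanup and leak it on purpose.
        return CleanupResult::Leaked;
    }
    port.subscriber_queues.clear();
    port.lock.unlock();
    return CleanupResult::Released;
}
```

The trade-off is that leaked resources stay allocated until the whole system is restarted, which is why this is a workaround and not a replacement for the lock-free approach.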
3.1 ekxide might be contracted to fix the monitoring issue
3.2 Graham is looking into implementing the workaround
- Mathias checks if the workaround is feasible and gives some hints on how to proceed