-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JBPM-10209] Switching to CMT #2362
Conversation
a5dcd5f
to
34617bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments and questions (mainly about tracing and catching exceptions made in a different way than in the previous implementation).
Reproducer works like a charm with this patch.
Superb work @fjtirado !
...ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EJBTimerScheduler.java
Show resolved
Hide resolved
...ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EJBTimerScheduler.java
Show resolved
Hide resolved
...ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EJBTimerScheduler.java
Show resolved
Hide resolved
...ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EJBTimerScheduler.java
Show resolved
Hide resolved
...b/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EjbSchedulerService.java
Show resolved
Hide resolved
...ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EJBTimerScheduler.java
Show resolved
Hide resolved
ok. some notes in there. changing to BMT won't allow to control ejb timer retry mechanism and this will cause an escalation of timers and logs in the user. (infinite numbers of retries when you are using interval triggers) BMT does not suspend any transaction in this case as this is a timer timeout call (there was no previous tx in flight) Require NEW is the same as Required in this case when you call an ejb timeout method because there was not any tx in flight CMT or BMT are just demarcation of the tx and the timeout should only happen in one place so it is unlikely to be the root cause of a deadlock as this is the first demarcation in the call. (tx.begin cannot cause the deadlock) To cause a a deadlock you need two different resources and try to hold them in one tx. Let's try to simulate an scenario: 1 tx -> BMT tx demarcation holding resource A then resource B or any different calling tx being it is not going to fix the situation in a deadlock. You need to see which tables are involved in the deadlock first to see where the problem lies. (deadlock analisys in the database. This depends on the database but usually it comes with a tool) |
SonarCloud Quality Gate failed. 0 Bugs 29.2% Coverage Catch issues before they fail your Quality Gate with our IDE extension SonarLint |
@elguardian This fix was tested with two reproducers and in customer environmment and it prevents the deadlock (it was more concurrency than deadlock, but Oracle sent the same exception for both). The tx that is suspended is the one that is invoking method |
@fjtirado are the reproducers you mentioned available somewhere? |
@elguardian @martinweiler I agree that we might have an issue now the we have switched to CMT with interval triggers. I would really appreciate your insight with that, because Im not completely sure I have all the details. In particular, I did not fully understood why having a commit before calling recoverTimerJobInstance is an issue. Can you please elaborate on that?. In any case the solution with CMT might be easy after all. I believe the timeout method, when using CMT creates a transaction (it is implicitly required), so if invokeTransaction method comitting is an issue, we can remove it so everything is executed within the transaction created when executeTimerJob is invoked by the EJB container and execute the call and the retry that deals with any exception within the same transaction. |
|
@martinweiler here is the automated test ( Commenting the Dockerfile jar replacement lines, you can get the deadlock scenario |
@gmunozfe @fjtirado I ran the reproducer locally, and confirm that there are no more deadlocks with your fix. I also ran the reproducer from JBPM-8688 (as this is where BMT was introduced) and it still works. I also tested the reproducer from RHPAM-3826 / RHPAM-3574, as these are related to the removeUnrecoverableTimer method which has been removed, but did not find an issue with this test either. Great work @fjtirado My main overall concern is that this is quite a significant change for a micro release, and given the history of issues we had to deal with in regards to timers, that we will introduce further issues down the road, and/or introduce a regression (which might not be caused by the tests). |
* [JBPM-10209] Switching to CMT * [JBPM-10209] A bit of refactor * [JBPM-10209] Gonzalos comments * [JBPM-10209] Disposable handling
* [JBPM-10209] Switching to CMT * [JBPM-10209] A bit of refactor * [JBPM-10209] Gonzalos comments * [JBPM-10209] Disposable handling
@elguardian @fjtirado regarding the discussion on CMT vs BMT, my knowledge in that area is limited, however I'd like to share the following. On 7.13.4/8.0.4 (unpatched), the timer creation happens before the transaction of the caller is ended:
With the changed code, the timer creation happens at the end of the caller transaction (after the sleep task):
This prevents the timer execution to clash with the active process execution, so I'd say the new behavior is correct. What concerns me is that this change is quite significant and happening in a very sensitive area, and it seems this PR got merged in a rush. cc @krisv @porcelli |
I share similar concern of @martinweiler in regard the rush to merge. I felt that we were still discussing it.... and all of sudden it got merged. |
...ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EJBTimerScheduler.java
Show resolved
Hide resolved
@porcelli @martinweiler There are still issues with timers, not related with this PR, that we are addressing with #2365. These issues were introduced by https://github.com/kiegroup/jbpm/pull/2143/files#diff-155546a1876a109b453b736d70a280e1721992e71ab093adfe42949b064aa852R104-R107 |
@fjtirado i understand that these changes aimed to fix a problem. but concerns were raised and i felt that we didn't have enough time to properly discuss. Sorry to disagree on terms addressing critical customer issue... this is the open source project, and there are better mechanisms to address downstream needs to provide quick fixes to customers. |
@porcelli The discussion was not about the content of the code changes, but about this PR being part of a micro release, which is not related with open source either, as far as I know. Quoting Martin "My main overall concern is that this is quite a significant change for a micro release, and given the history of issues we had to deal with in regards to timers, that we will introduce further issues down the road, and/or introduce a regression (which might not be caused by the tests).". Since to officially release the patch to the customer. it MUST be merged into main, I think the discussion should not be about this PR being merged in a rush o not (it has to be merged) but about which further work we need to do about timers and about the content of such micro release. |
I understand all the shared concerns. It is clear this is a very sentive area in the code. We are confident about the quality of the changes despite the fact that this area is inherently sensitive and still not fully covered neither by the previous implementation nor the new one. The changes merged provide not only the same coverage as before but it also includes now manual test cases reproducers that were not working with the previous implementation and are working now. So This PR does provide even better coverage than before. Regarding the suitability to merge this PR into main, there was no rush to merge in my opinion, since the team (@gmunozfe @fjtirado) has been more than 2 months working on reproducers, testing, and finally bringing a reasonable fix. Besides, we continue improving the code quality with some other edge cases that we found from the past implementation, so our aim is to still bring some more improvements and extra coverage going forward. @martinweiler ^ is there anything we could improve in the testing of those changes to raise the quality even more? |
EJBTimerScheduler bean was using BMT.
A BTM bean will always suspend the incoming transaction, therefore any operation within the bean, for example, scheduling a timer, will be performed in a different transaction.
This PR switch back to CMT, removing the risk of a short delay timer colliding with the parent transaction.
Since CMT was changed to BMT to be able to control delays and recovery when the timer is executed, as an alternative, we use CMT with requires_new to force a new transaction (achieving the same result than by handling UserTransaction). Also, the recoverTimerJobInstance was refactor a bit to make it clear that essentially, whatever the scenario (sessionNotFound, intervalTrigger or failed times) the recovery is executed within another transaction (as the regular call)