Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JBPM-10209] Switching to CMT #2362

Merged
merged 4 commits into from
Nov 16, 2023
Merged

[JBPM-10209] Switching to CMT #2362

merged 4 commits into from
Nov 16, 2023

Conversation

fjtirado
Copy link
Contributor

@fjtirado fjtirado commented Nov 14, 2023

EJBTimerScheduler bean was using BMT.
A BTM bean will always suspend the incoming transaction, therefore any operation within the bean, for example, scheduling a timer, will be performed in a different transaction.
This PR switch back to CMT, removing the risk of a short delay timer colliding with the parent transaction.
Since CMT was changed to BMT to be able to control delays and recovery when the timer is executed, as an alternative, we use CMT with requires_new to force a new transaction (achieving the same result than by handling UserTransaction). Also, the recoverTimerJobInstance was refactor a bit to make it clear that essentially, whatever the scenario (sessionNotFound, intervalTrigger or failed times) the recovery is executed within another transaction (as the regular call)

@fjtirado fjtirado marked this pull request as draft November 14, 2023 13:36
@fjtirado fjtirado added backport-7.67.x Generate backport PR for 7.67.x branch backport-7.67.x-blue Generate backport PR for 7.67.x-blue branch labels Nov 14, 2023
@fjtirado fjtirado marked this pull request as ready for review November 14, 2023 15:39
@fjtirado fjtirado force-pushed the JBPM-10209 branch 5 times, most recently from a5dcd5f to 34617bc Compare November 14, 2023 16:07
Copy link
Member

@gmunozfe gmunozfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments and questions (mainly about tracing and catching exceptions made in a different way than in the previous implementation).
Reproducer works like a charm with this patch.
Superb work @fjtirado !

@elguardian
Copy link
Member

ok. some notes in there. changing to BMT won't allow to control ejb timer retry mechanism and this will cause an escalation of timers and logs in the user. (infinite numbers of retries when you are using interval triggers)

BMT does not suspend any transaction in this case as this is a timer timeout call (there was no previous tx in flight)
There is no difference regarding tx or not if you use BMT or CMT in this case regarding any deadlock.

Require NEW is the same as Required in this case when you call an ejb timeout method because there was not any tx in flight

CMT or BMT are just demarcation of the tx and the timeout should only happen in one place so it is unlikely to be the root cause of a deadlock as this is the first demarcation in the call. (tx.begin cannot cause the deadlock)

To cause a a deadlock you need two different resources and try to hold them in one tx. Let's try to simulate an scenario:

1 tx -> BMT tx demarcation holding resource A then resource B
2 tx -> BMT tx demarcation hodling resource B and the resource A

or any different
1 tx -> CMT tx demarcation holding resource A then resource B
2 tx -> CMT tx demarcation holding resource B then resource A

calling tx being it is not going to fix the situation in a deadlock. You need to see which tables are involved in the deadlock first to see where the problem lies. (deadlock analisys in the database. This depends on the database but usually it comes with a tool)

@martinweiler

Copy link

sonarcloud bot commented Nov 15, 2023

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 2 Code Smells

29.2% 29.2% Coverage
0.0% 0.0% Duplication

idea Catch issues before they fail your Quality Gate with our IDE extension sonarlint SonarLint

@fjtirado
Copy link
Contributor Author

fjtirado commented Nov 15, 2023

@elguardian
BMT suspend the transaction (if it exist), see attached picture
image(5)

This fix was tested with two reproducers and in customer environmment and it prevents the deadlock (it was more concurrency than deadlock, but Oracle sent the same exception for both). The tx that is suspended is the one that is invoking method internalSchedule, since the timer is commited by the transaction newly created by the BMT bean, it is visible to the container before the suspended transaction is commited and then there is concurrency over the same process instance (between the timer that expires before the transaction that scheduled it committed and the transaction that schedules the timer). Note this only ocurs when the timer expiration timer is shorter than the time the transaction takes to be commited.

@martinweiler
Copy link
Contributor

@fjtirado are the reproducers you mentioned available somewhere?

@fjtirado
Copy link
Contributor Author

fjtirado commented Nov 15, 2023

@elguardian @martinweiler I agree that we might have an issue now the we have switched to CMT with interval triggers. I would really appreciate your insight with that, because Im not completely sure I have all the details. In particular, I did not fully understood why having a commit before calling recoverTimerJobInstance is an issue. Can you please elaborate on that?. In any case the solution with CMT might be easy after all. I believe the timeout method, when using CMT creates a transaction (it is implicitly required), so if invokeTransaction method comitting is an issue, we can remove it so everything is executed within the transaction created when executeTimerJob is invoked by the EJB container and execute the call and the retry that deals with any exception within the same transaction.

@fjtirado
Copy link
Contributor Author

fjtirado commented Nov 15, 2023

@fjtirado are the reproducers you mentioned available somewhere?

@martinweiler Gonzalo attached the reprrodcuer BPMN to the JIRA

@gmunozfe
Copy link
Member

gmunozfe commented Nov 15, 2023

@fjtirado are the reproducers you mentioned available somewhere?

@martinweiler Gonzalo attached the reprrodcuer BPMN to the JIRA

@martinweiler here is the automated test (timerIsNotTriggerUntilTaskCommitted) with the reproducer (we're using 2 kie-servers and PostgreSQL containers)
kiegroup/jbpm-playground@master...gmunozfe:jbpm-playground:timer_test

Commenting the Dockerfile jar replacement lines, you can get the deadlock scenario

@martinweiler
Copy link
Contributor

@gmunozfe @fjtirado I ran the reproducer locally, and confirm that there are no more deadlocks with your fix. I also ran the reproducer from JBPM-8688 (as this is where BMT was introduced) and it still works. I also tested the reproducer from RHPAM-3826 / RHPAM-3574, as these are related to the removeUnrecoverableTimer method which has been removed, but did not find an issue with this test either. Great work @fjtirado

My main overall concern is that this is quite a significant change for a micro release, and given the history of issues we had to deal with in regards to timers, that we will introduce further issues down the road, and/or introduce a regression (which might not be caused by the tests).

@fjtirado fjtirado merged commit ababa5f into kiegroup:main Nov 16, 2023
4 of 5 checks passed
github-actions bot pushed a commit that referenced this pull request Nov 16, 2023
* [JBPM-10209] Switching to CMT

* [JBPM-10209] A bit of refactor

* [JBPM-10209] Gonzalos comments

* [JBPM-10209] Disposable handling
github-actions bot pushed a commit that referenced this pull request Nov 16, 2023
* [JBPM-10209] Switching to CMT

* [JBPM-10209] A bit of refactor

* [JBPM-10209] Gonzalos comments

* [JBPM-10209] Disposable handling
@martinweiler
Copy link
Contributor

@elguardian @fjtirado regarding the discussion on CMT vs BMT, my knowledge in that area is limited, however I'd like to share the following. On 7.13.4/8.0.4 (unpatched), the timer creation happens before the transaction of the caller is ended:

2023-11-16 10:28:06,645 DEBUG [org.jboss.as.ejb3.timer] (default task-1) Scheduling a single action timer [id=76d18752-b1a6-47bd-9955-6a01a7855e97 timedObjectId=kie-server.kie-server.EJBTimerScheduler auto-timer?:false persistent?:true timerService=org.jboss.as.ejb3.timerservice.TimerServiceImpl@19c26ae1 previousRun=null initialExpiration=Thu Nov 16 10:28:07 MST 2023 intervalDuration(in milli sec)=0 nextExpiration=Thu Nov 16 10:28:07 MST 2023 timerState=ACTIVE info=EjbTimerJob [timerJobInstance=GlobalJpaTimerJobInstance [timerServiceId=deadlock-test:1.0.0-SNAPSHOT-timerServiceId, getJobHandle()=EjbGlobalJobHandle [uuid=462-430-timer-1]]]] starting at 972 milliseconds from now
2023-11-16 10:28:10,651 INFO  [stdout] (default task-1) === end sleep Exception Before Task3 ===

With the changed code, the timer creation happens at the end of the caller transaction (after the sleep task):

2023-11-16 10:15:58,676 INFO  [stdout] (default task-1) === end sleep Exception Before Task3 ===
2023-11-16 10:15:58,739 DEBUG [org.jboss.as.ejb3.timer] (default task-1) commit timer creation: [id=6d500923-80be-438a-a0c3-0c292fe6bb5e timedObjectId=kie-server.kie-server.EJBTimerScheduler auto-timer?:false persistent?:true timerS
ervice=org.jboss.as.ejb3.timerservice.TimerServiceImpl@7fac3af7 previousRun=null initialExpiration=Thu Nov 16 10:15:55 MST 2023 intervalDuration(in milli sec)=0 nextExpiration=Thu Nov 16 10:15:55 MST 2023 timerState=CREATED info=Ejb
TimerJob [timerJobInstance=GlobalJpaTimerJobInstance [timerServiceId=deadlock-test:1.0.0-SNAPSHOT-timerServiceId, getJobHandle()=EjbGlobalJobHandle [uuid=458-429-timer-1]]]]
2023-11-16 10:15:58,740 DEBUG [org.jboss.as.ejb3.timer] (default task-1) Scheduling a single action timer [id=6d500923-80be-438a-a0c3-0c292fe6bb5e timedObjectId=kie-server.kie-server.EJBTimerScheduler auto-timer?:false persistent?:t
rue timerService=org.jboss.as.ejb3.timerservice.TimerServiceImpl@7fac3af7 previousRun=null initialExpiration=Thu Nov 16 10:15:55 MST 2023 intervalDuration(in milli sec)=0 nextExpiration=Thu Nov 16 10:15:55 MST 2023 timerState=ACTIVE
 info=EjbTimerJob [timerJobInstance=GlobalJpaTimerJobInstance [timerServiceId=deadlock-test:1.0.0-SNAPSHOT-timerServiceId, getJobHandle()=EjbGlobalJobHandle [uuid=458-429-timer-1]]]] starting at -3,093 milliseconds from now

This prevents the timer execution to clash with the active process execution, so I'd say the new behavior is correct.

What concerns me is that this change is quite significant and happening in a very sensitive area, and it seems this PR got merged in a rush. cc @krisv @porcelli

@porcelli
Copy link
Member

I share similar concern of @martinweiler in regard the rush to merge. I felt that we were still discussing it.... and all of sudden it got merged.

fjtirado added a commit that referenced this pull request Nov 17, 2023
* [JBPM-10209] Switching to CMT

* [JBPM-10209] A bit of refactor

* [JBPM-10209] Gonzalos comments

* [JBPM-10209] Disposable handling

Co-authored-by: Francisco Javier Tirado Sarti <65240126+fjtirado@users.noreply.github.com>
@fjtirado
Copy link
Contributor Author

I share similar concern of @martinweiler in regard the rush to merge. I felt that we were still discussing it.... and all of sudden it got merged.

@porcelli @martinweiler
There was no rush, but quick resolution because this PR fixes a critical customer issue that has being around for months. The issues was caused by this change https://github.com/kiegroup/jbpm/pull/1520/files#diff-5669039d6beade81bd11322820ff3c35dca83d5ada6986d268e1ef9879f06a65R53

There are still issues with timers, not related with this PR, that we are addressing with #2365. These issues were introduced by https://github.com/kiegroup/jbpm/pull/2143/files#diff-155546a1876a109b453b736d70a280e1721992e71ab093adfe42949b064aa852R104-R107

@porcelli
Copy link
Member

@fjtirado i understand that these changes aimed to fix a problem. but concerns were raised and i felt that we didn't have enough time to properly discuss.

Sorry to disagree on terms addressing critical customer issue... this is the open source project, and there are better mechanisms to address downstream needs to provide quick fixes to customers.

@fjtirado
Copy link
Contributor Author

fjtirado commented Nov 21, 2023

Sorry to disagree on terms addressing critical customer issue... this is the open source project, and there are better mechanisms to address downstream needs to provide quick fixes to customers.

@porcelli The discussion was not about the content of the code changes, but about this PR being part of a micro release, which is not related with open source either, as far as I know. Quoting Martin "My main overall concern is that this is quite a significant change for a micro release, and given the history of issues we had to deal with in regards to timers, that we will introduce further issues down the road, and/or introduce a regression (which might not be caused by the tests).". Since to officially release the patch to the customer. it MUST be merged into main, I think the discussion should not be about this PR being merged in a rush o not (it has to be merged) but about which further work we need to do about timers and about the content of such micro release.

@dgutierr
Copy link
Member

dgutierr commented Nov 21, 2023

I understand all the shared concerns. It is clear this is a very sentive area in the code. We are confident about the quality of the changes despite the fact that this area is inherently sensitive and still not fully covered neither by the previous implementation nor the new one. The changes merged provide not only the same coverage as before but it also includes now manual test cases reproducers that were not working with the previous implementation and are working now. So This PR does provide even better coverage than before.

Regarding the suitability to merge this PR into main, there was no rush to merge in my opinion, since the team (@gmunozfe @fjtirado) has been more than 2 months working on reproducers, testing, and finally bringing a reasonable fix. Besides, we continue improving the code quality with some other edge cases that we found from the past implementation, so our aim is to still bring some more improvements and extra coverage going forward.

@martinweiler ^ is there anything we could improve in the testing of those changes to raise the quality even more?

fjtirado added a commit that referenced this pull request Nov 21, 2023
* [JBPM-10209] Switching to CMT

* [JBPM-10209] A bit of refactor

* [JBPM-10209] Gonzalos comments

* [JBPM-10209] Disposable handling

Co-authored-by: Francisco Javier Tirado Sarti <65240126+fjtirado@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-7.67.x Generate backport PR for 7.67.x branch backport-7.67.x-blue Generate backport PR for 7.67.x-blue branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants