Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JBPM-10197] Improving performance of getTimerByName #2369

Closed
wants to merge 1 commit into from

Conversation

fjtirado
Copy link
Contributor

@fjtirado fjtirado commented Nov 27, 2023

JIRA:

link

A huge hack, because it made certain assumptions that should not be made if TimerService API was functional for the usage we are giving it (which is even a huger hack in my opinion ;)), but maybe worthy for users having performance issues with getTimerByName().

If Timer implementation is wildfly and it is using DB as persistence mechanism and if DB is oracle or postgresql, we can actually try to reduce the number of timers to be checked in getTimerByName using a bit of reflection and a couple of native queries. The hacky part is isolated into a single class.

The Wildfly hacky assumptions are:

  1. TimerService implementation wildffy class name
  2. The field containing the persistence instance is called persistence
  3. The class name of the instance assigned to field persistence
  4. The method used to deserialize info column is called deSerialize
  5. That there is a table called ejb_timer with a string column called info.

If assumptions 1), 2) and 3) are changed by a future wildfly release, linear search will be done (so we are covered on that regard). 4) and 5) are trickier, since the exception will get propagated.

DISCLAIMER
I did not want to do this, but I consider this approach better than potentially creating duplicates (see closed PRs over the same JIRA) or performing a linear search over thousands of timers once we know how Wildfly is working. If Application Server is not widlfly, or the persistence is not on DB or widfly change the class name, we will revert to linear search, so it is just a hack that boost performance of a certain wildfly setup and a couple of DBs, it does not imply losing any existing functionality if the setup is not matched .

PS
If we are ok with the possibility of assuming Wildfly, my idea, before moving into "ready for review" and once this has been tested with an existing problematic setup, is to switch to ServiceLoader mechanism and try to add as much DBs as possible, so the classes related with Wildfly are not in the EJB timer module, but in a separate one that might be added only for wildfly setup (this will allow including the Wildfly dependencues, the usage of instanceof and even calling the deSerialize method through direct call). Before that I want to verify that this approach really boost performance.

@elguardian
Copy link
Member

This is not really a good approach IMO you are going to cause more problem that anything and it is completely unnecesary. As performance problems was fixed storing the ejb timer info in the table. It is standard and and supported by Java EE environments.Also it supports clustering properly.

Copy link
Member

@elguardian elguardian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above.

@fjtirado
Copy link
Contributor Author

fjtirado commented Nov 29, 2023

This is not really a good approach IMO you are going to cause more problem that anything and it is completely unnecesary. As performance problems was fixed storing the ejb timer info in the table. It is standard and and supported by Java EE environments.Also it supports clustering properly.

I agree except for the necesity (if it was not necessary, I wont be coding this ;)). The reason we are proposing this hack is because the EJBTimerMapping solution is far from bulletproof (there are users seeing lot of EJBTimerMapping entryies pointing to a null external timer) and, in that situation, the system collapses performing the serial search when timers are registered at startup (since there is an exception processing the EJBTimerMapping, the system performs the linear seach for all the id null entries). Any other solution that avoid that search (see other closed PRs for this JIRA) will potentially lead to duplicate timers, so breaking encapsulation (which we can hide anyway by using service loader) seems one possibiilty (the other option being manually rebuilding EJBTimerMappingInfo cache offline)

Anyway, Im not in favour of commiting this, it is just a branch in case the same situation with EJBTimerMappingInfo arise again and we can do a specific patch to address it

@elguardian
Copy link
Member

We are aware of the problem but the solution is not following up with a hack but fix properly the null (just an effect of another problem). If that happens means there is something wrong with it (root cause not clear in here). Walking this path it is not a solution for other app servers or even a real thing as there configuration in the eap could make this not work (like using a different DB or other persistent solutions) within the ejb timer subsystem. So this solution will work in a very specific cases.
Necessity is not defined by if this piece of code can fix (more likely just mitigate the issue) but fix the solution in place is correct and make it behave correctly.
If you have a solution that is compliant with the spec and should work in any app server, we should strive to make the solution work properly and not circumvent with another piece of code that will only work in specific cases as you are not accessing the ejb db timer solution.
I am seeing lately very vague analysis in very complex portioins of code that could lead to potential bit problems. We should analyse the problem carefully and make a proper diagnose of the problem (root cause) and no the effeects (null).

@martinweiler @porcelli ^^ be aware of this as this solution should not be merged.

@martinweiler
Copy link
Contributor

I agree with @elguardian that introducing a solution specific to only a subset of the supported databases is not meeting the supportability requirements of the downstream product, and is thus problematic to be merged even upstream.
@fjtirado your comment highlights issues with the TimerMappingInfo logic. Are you able to pinpoint those so that these can be fixed?

@albfan
Copy link
Contributor

albfan commented Dec 4, 2023

Solution is already provided: https://issues.redhat.com/browse/EAPSUP-1336

Keep an index of timers that can be lost is an smell coding. That shouldn't be a problem if traverse all timers is not that slow, but it is: For each N active timer, due to column info is not cached, JBPM needs to do N select queries to get that info (really slow) just to read into info column the index JBPM knows.

That issue propose to be able to search for that know index (the usual concept of an externalid) as EJB timers is just a service to provide timers to external apps, it should manage the concept of real index the external app uses for timers.

This hack is just the evidence that TImerMappingInfo should be removed, and that externaId should be added to EJB timers spec

The goal here is not to avoid getTimersByName (basically what TimerMappingInfo does) but point out that we cannot have a method that slown down JBPM because we have millions of timers.

I tried to convince engineers of EJB timers about this obvious change on spec (the hack just prove it) but there's no progress on it

@fjtirado
Copy link
Contributor Author

fjtirado commented Dec 4, 2023

@elguardian @martinweiler @porcelli We are not planning to merge this. As I already mention (twice) it is a hack that can be used as specific patch for specific users, nothing more than that.
@albfan Analysing why TimerMappingInfo was corrupted was not feasible since we do not have all the required information to do that (we do not even know if the table is corrupted because of manual interaction, because of a migration or because a bug). We just know that somehow it becomes corrupted.
@elguardian I really appreciate your constructive criticism, but the word lately related with timer changes is not accurate, since the change (which causes concurrency issues) to move from BMT to CMT (based of the false assumption that transactions are inherited by BMT, see #2362 (comment) and https://github.com/kiegroup/jbpm/pull/1520/files#diff-5669039d6beade81bd11322820ff3c35dca83d5ada6986d268e1ef9879f06a65R53) was done a few years ago and the change to add TimerMappingInfo (which can also be considered a hack, because it is persistence cache and introduced another set of issues, see one example) was done a year ago.

[JBPM-10197] Martin's comments
Revert "[JBPM-10197] Martin's comments"

This reverts commit 88396a0.

Revert "[JBPM-10197] Handle TimerMappingInfo in error scenarios"

This reverts commit 53b1e10.

[JBPM-10197] Alternative approach
Update WildflyEJBTimerRetriever.java
[JBPM-10197] Sonar warnings
Copy link

sonarcloud bot commented Dec 4, 2023

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

43.0% 43.0% Coverage
0.0% 0.0% Duplication

idea Catch issues before they fail your Quality Gate with our IDE extension sonarlint SonarLint

@fjtirado
Copy link
Contributor Author

fjtirado commented Dec 5, 2023

@martinweiler @elguardian
A user reported that, for several rows in TimerMappingInfo, both externalTimerId and info columns are null.
The only way I found this can happen, considering the following code https://github.com/kiegroup/jbpm/blob/main/jbpm-runtime-manager/src/main/java/org/jbpm/runtime/manager/impl/AbstractRuntimeManager.java#L211-L263 is if GlobalJpaTimerJobInstance has externalTimeId and getTimerInfo returns null.
For EJB timers, externalTimerId and info might not be set if hastNextFireTime is null here https://github.com/kiegroup/jbpm/blob/main/jbpm-services/jbpm-services-ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EJBTimerScheduler.java#L205
Also, I have doubts about this logic https://github.com/kiegroup/jbpm/blob/main/jbpm-services/jbpm-services-ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EjbSchedulerService.java#L71-L79
If ctx.isNew is false, what happens here https://github.com/kiegroup/jbpm/blob/main/jbpm-flow/src/main/java/org/jbpm/process/instance/timer/TimerManager.java#L278, then, if the timer is not found (I do not know if that situation is possible), a new one will be created and the performance will be linear in any case, because it has to check all existing timers before triggering a new one.

@elguardian
Copy link
Member

@fjtirado my concern about changing from BMT to CMT was related to transaction demarcation. There is nothing in my observations related to inheriting anything (I am talking about tx demarcation). I did never said that. I would suggest to stay on the topic instead answering to something that noone was putting in question.

The main problem of CMT is that you will lose control of retry mechanism and wildly does have a strategy in place to deal with problems (two policies actually) one coming from the spec (immediate automatic retry) and the other coming from the app server itself (several retries after some time). Actually that functionality is lost or broken at this point.

Please read again my observation again (#2362 (comment)) which is the first sentence.

Related to TimerMappingInfo is a solution compliant with the ejb spec and respected the design of ejb. It is a way to introduce a way to exchange information among cluster members so it is not a cache

@elguardian elguardian closed this Dec 11, 2023
@fjtirado
Copy link
Contributor Author

@fjtirado my concern about changing from BMT to CMT was related to transaction demarcation. There is nothing in my observations related to inheriting anything (I am talking about tx demarcation). I did never said that. I would suggest to stay on the topic instead answering to something that noone was putting in question.

The main problem of CMT is that you will lose control of retry mechanism and wildly does have a strategy in place to deal with problems (two policies actually) one coming from the spec (immediate automatic retry) and the other coming from the app server itself (several retries after some time). Actually that functionality is lost or broken at this point.

Please read again my observation again (#2362 (comment)) which is the first sentence.

Related to TimerMappingInfo is a solution compliant with the ejb spec and respected the design of ejb. It is a way to introduce a way to exchange information among cluster members so it is not a cache

@elguardian Switching to BMT suspend the transaction that schedules the timer. That causes the timer to be scheduled before the transaction that schedule it is commited (if the timer delay is short enough). The broken Wildfly retry mechanism can be avoided (using CMT or BMT) by not propagating any exception back to the container (so the container thinks the timer execution has been completed successfully)

@elguardian
Copy link
Member

elguardian commented Dec 11, 2023

@fjtirado The tx related to the timer is the one in flight when the timer times out/consumed is operated, so not very certain what is related to suspend tx you are mentioned. You can look into the code yourself

Related to the @timeout method invocation.
https://github.com/wildfly/wildfly/blob/main/ejb3/src/main/java/org/jboss/as/ejb3/timerservice/TimedObjectInvokerImpl.java#L72-L85

The callout will invoke interceptors including tx if there is any. If the bean is not BMT it won't invoke any tx so BMT won't suspend any transaction in a timeout call. if that is what you are referring.

In any case the timer service can span it's own tx
https://github.com/wildfly/wildfly/blob/main/ejb3/src/main/java/org/jboss/as/ejb3/timerservice/TimerServiceImpl.java#L417-L455

If the timeout fails and you don't propagate the error, you will lose the timer for sure.

@fjtirado
Copy link
Contributor Author

fjtirado commented Dec 11, 2023

@elguardian Im not referring to the timeout method. Im referring to this one https://github.com/kiegroup/jbpm/blob/main/jbpm-services/jbpm-services-ejb/jbpm-services-ejb-timer/src/main/java/org/jbpm/services/ejb/timer/EJBTimerScheduler.java#L202
That one is invoked by the JBPM engine when a timer needs to be scheduled. If we use BMT, the container transaction in which the engine is running is supended, TimerService creates a new one, so a timer is scheduled before the engine has actually comitted the transaction that creates it. It is all described in the PR that switch back to CMT and there is a reproducer for that issue.

@fjtirado fjtirado changed the title [JBPM-10097] Improving performance of getTimerByName [JBPM-10197] Improving performance of getTimerByName Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants