-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ReentrantLock
instead of synchronized
in DeserializerCache
to avoid deadlock on pinning
#4430
Conversation
The detailed thread stacktrace was produced by The following unit test illustrates the underlying problem. If you set both boolean flag to
|
The InfoWorld article makes this sound like a Java bug that Oracle are seeking to fix. While Jackson will probably replace synchronized blocks with explicit locks, this is probably not going to happen in a patch release. |
@cowtowncoder From my testing, Reentrant locks are slower than synchronized blocks. This change could affect performance for the majority of users. One option is to make the behaviour configurable. One code path uses synchronized blocks and the other ReentrantLocks. |
@pjfanning I agree that the "proper" fix would be to fix the JVMs problematic behavior observed when synchronized threads park (more specifically "threads holding an object monitor") and this indeed being worked on by the JVM team, but I'm a bit worried that this will take months (if not years) to complete. For info there is currently 18 usages of |
@oddbjornkvalsund You are free to fork jackson-databind and make whatever changes suit you. Unfortunately, I need to worry about all other jackson-databind users. I don't feel like hacking away at jackson-databind because of JDK bugs is something that should be rushed into. |
@pjfanning Sure, I have already forked and submitted this change for the one specific part of the code that caused problems for me. I'll be happy to dig into the remaining, hang on... |
@pjfanning There, I've now changed the remaining usages of I realize that there is quite possibly a ton of As to the more fundamental question of
...I have no strong opinions, but I do feel that if you choose to wait then you should give a big warning to users that Jackson cannot be used with JDK 21 or newer until this issue is fixed. The deadlock issue I observed is very real and hits very hard. |
is discouraged on virtual threads* |
Not every Java 21/22 user is using virtual threads. Let's also be straight - this a major bug in Java. It is not a Jackson bug. @oddbjornkvalsund if you want to cast stones - cast them in the right direction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 from me - but I guess it's up to @cowtowncoder
My preference is make the locking configurable so that users can choose to use synchronized blocks or ReentrantLocks. If we can agree on where to put the configuration (maybe JsonFactory) and what to call it, I can do a POC that uses the config and refactor the DeserializerCache to support both lock types. If the POC looks good, we can roll it out elsewhere in jackson-databind and beyond that into other Jackson modules.
@mariofusco Making you aware of this issue since you are the biggest driver of the existing Jackson changes to avoid ThreadLocal caching. |
@@ -69,6 +70,8 @@ protected abstract static class DateBasedDeserializer<T> | |||
*/ | |||
protected final DateFormat _customFormat; | |||
|
|||
private final ReentrantLock _customFormatLock = new ReentrantLock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is not 100% correct - If you have 2 deserializer instances with the same _customFormat instance then synchronizing it will prevent concurrent use of that _customFormat instance - your lock would not work properly in that case.
It is sort of annoying that Java never made DateFormat thread-safe. We could look into finding thread-safe alternatives - or adding a thread-safe class of our own that wraps DataFormat but that uses a locking mechanism to control access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't researched this but I suspect we could find some java.time
API to do date/time formatting that is thread safe - and therefore not need locking here at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, the existing DateFormat
should be replaced with a java.time.format.DateTimeFormatter
which is immutable and thread-safe and at that point this synchronization could be of course totally removed. However I guess that this could be addressed with a different pull request, immediately after this will be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure replacement would work, unfortunately, since we are talking about support for java.util.Date
/ java.util.Calendar
; and if I remember correctly, format Strings used with SimpleDateFormat
are different from those used with Java 8 date/time (and Joda).
So the better fix for users is to upgrade from java.util types to java.time.
So while I agree with all the ugliness of java.util world, I am not sure spending time on better support there is worth the effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this Date code is not on critical path - users can avoid it using java.time classes instead of java.util Data/Calendar
so this could be left alone - and maybe tackled later
@@ -42,6 +43,8 @@ public abstract class TypeDeserializerBase | |||
*/ | |||
protected final JavaType _defaultImpl; | |||
|
|||
protected final ReentrantLock _defaultImplLock = new ReentrantLock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same problem as I described for DateDeserializers - what if JavaType _defaultImpl
instance is used by more than 1 TypeDeserializerBase instance. This set of refactors seem really dangerous to me. I hate the idea of hacking like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is wrong. In my opinion the ReentrantLock
should be moved inside the JavaType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think JavaType
would be the place: locking here is specifically to try to avoid duplicate lookups for deserializer used for "default type" (of given polymorphic base type).
However. Since this is not done for correctness but as (possibly misguided?) performance optimization, maybe even use of basic AtomicReference
would work (accepting possible race condition).
@pjfanning I have not intended to cast any stones or blame anyone and I'm sorry if I came across as blunt or pointing fingers. I am pragmatic about this and only focused on making things work. 🙏 The complexities with Thanks for identifying the bugs in my code! I will address those if there proves to be any interest in merging this PR. 👍 |
With the removal of biased locking I believe that this is no longer so true in modern JVMs. Just to give you one more datapoint we replaced all |
@cowtowncoder @pjfanning @oddbjornkvalsund I gave a second look at this pull request and in particular tried to replace the legacy Regarding the problem at hand I believe that it is not necessary to replace all the synchronized blocks (and in particular it is not needed for the block parsing the date) but only for the blocks that perform an I/O operation inside them. In fact pinning is a problem only when the virtual thread performs an I/O operation and then it should be unmounted from its carrier but it can't since it is pinned. In essence my suggestion is:
|
@mariofusco That sounds like a good approach. Feel free to close this PR! For me, avoiding setting the "jdk.tracePinnedThreads" system property made the problem go away for now, but since this is the approach suggested by the JDK team for identifying pinning issues, it's not great that I can't use it. |
Used in that way that system property doesn't tell the whole story. As I tried to explain a pinned virtual thread is not necessarily a problem: |
@@ -107,29 +110,41 @@ public synchronized int size() { | |||
*/ | |||
public JsonSerializer<Object> untypedValueSerializer(Class<?> type) | |||
{ | |||
synchronized (this) { | |||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this should never cause issues? There's no I/O operations within Map
(LRUMap
), TypeKey
is stateless.
In fact... I don't think synchronization should be needed at all here: LRUMap
states:
Since Jackson 2.14, the implementation
...
Is thread-safe and does NOT require external synchronization
so could it be this is legacy, pre-2.14 locking, that was just left for no good reason...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading through the code, this may not be as simple as I thought (see #4442): I think synchronized
actually has secondary purpose to act as monitor for resolution of cyclic dependencies for POJO serializers (BeanSerializer
). And if so it cannot really be safely removed (but is also a big potential performance concern at same time).
Ok, with a quick (but I hope not too quick) look, I agree that a minimal patch (PR) would be preferable -- just tackling observed case of
|
Could we go back to having this PR being about the one lock in DeserializerCache ? This locking code needs due respect and every change needs to be analysed on its own merits. We will look at each case - we just don't need to try to handle every case in 1 PR. |
@pjfanning Right, that's what I tried to say (but with too many words) :) |
@pjfanning Could you come up with minimal PR for |
@oddbjornkvalsund could you change this PR to just keep the DeserializerCache change? We can deal with the other classes separately. |
I created a speculative PR - #4456 |
Hello @oddbjornkvalsund @cowtowncoder, I just updated to Spring Boot 3 (jackson 2.17.1) and I start having the bellow issue, is it related to this change? I'm using webflux with reactive-feign. Downgrading jackson to 2.17.0 fix the issue.
|
According to stacktrace, the issue is from JooHyukKim@48124cd.
Idk how it's possible... maybe what happend DeserializationCache in was it....
By intuition, we could probably use |
Hello @oddbjornkvalsund not sure if it helps, but I'm using webflux (Mono/Flux) with reactive-feign. |
@wdallastella Before proceeding with a fix discussion, could you come up with a reproduction? That would help. PS: Idk where to discuss this.. 🤔 |
The That said, for ReentrantLock, this should only happen if the same thread tries to obtain more than 2147483647 recursive locks (which would suggest a deeper problem) |
I agree - moving the lock call to before the try is more correct - and we should make that change. But yes, something really weird is happening. The lock acquisition should not fail. @wdallastella can you open a new issue? I don't think we should be having a detailed discussion on a closed PR. |
@pjfanning I'll open the the issue tomorrow. By the way, I build
If I ignore (like bellow), it works, but we might have issues if there is really a blocking call when using reactor.
|
@wdallastella Blocking call is... well, technically, yes, Locks (ReentrantLock) will block on contention. But in most cases blocking time should be low, and perhaps more importantly there really doesn't seem to be any easy or practical way around that potential. But this isn't I/O with long blocking times, timeouts and so on. This assuming that it is due to use of |
So the immediate concern is resolved for 2.17 (reverting to simpler earlier locking which may be sub-optimal but has no correctness issues); and in 2.18 locking is outside try-block. But I am bit concerned about how this happened all in all, based on all comments: I guess it must be related to nested calls via Anyway. Would be good to have testing for code in 2.18 branch; 2.17 has |
Thanks @cowtowncoder. Regards the |
Correct: cache should not need to lock for already constructed and cached deserializers, but typically just once (theoretically more often if the number of deserializers exceeds cache size etc). It is interesting that there is warning for |
I believe that's just a tool limitation: https://togithub.com/reactor/BlockHound/issues/402 it's common to have to define allowlists for blockhound - for example, spring also defines an exception for a class that uses ReentrantLock - but jackson doesn't depend on project reactor, which makes that tricky given they couldn't reproduce the issue on 2.18 (after adding the blockhound ignore), perhaps something else went wrong rather than 2 billion recursive locks |
oh. I believe the when so the standard explanation of 2 billion recursive locks was a red herring in this case |
Very interesting -- but makes sense wrt lock mismatch and gives more confidence that we don't have something wrong for common usage. I still think it's safer to keep updated version for 2.18.0 (basically admitting that making the change in 2.17.1 was a bad idea despite seeming simple nature), in "abundance of caution". Thank you for digging through this @iProdigy . |
@iProdigy good catch, I tested with
|
@cowtowncoder would it make sense to reinstate the ReentrantLock? Since we now know what went wrong. We can keep the change to move the lock call outside the try. |
It seems otherwise 🤔. I believe
But I also agree with @iProdigy that it is very possible some exception is thrown by Blockhound. Sum it up, what could have be happend in 2.17.0 when try {
_incompleteDeserializersLock.lock(); // 1. Blockhound throws exception
... rest of the impl...
} finally {
// 2. We just ignored exception thrown by Blockhound
// then trying to unlock that current thread does not have hold of.
_incompleteDeserializersLock.unlock();
}
PS: @iProdigy explained |
yes, the particular error that can be thrown during
|
I see, thank you for sharing, @iProdigy !
Since we know the cause and solution of Blockhound allowlist, I agree with @pjfanning for using Using |
Ok. Since everyone else is content with the fix, yes, let's do that for 2.17.2. I'll revert my temporary revert from 2.17. :) That one final (?) PR is #4568, and we should be good. It will take a while until we publish 2.17.2 tho (only 1 month since 2.17.1 and this is the first fix for any of components) |
DeserializerCache
usessynchronized
to protect the_incompleteDeserializers
map from concurrent modification. This deadlocks when using virtual threads + IO within the synchronized block and you end up with a non-responsive VM and stacktraces looking like the one below. Note in particular the threads that have- waiting to lock <0x00000006104ab5e8> (a java.util.HashMap)
.Replacing the
synchronized
keyword with aReentrantLock
fixes the problem. See relevant info here: https://www.infoworld.com/article/3713220/java-virtual-threads-hit-with-pinning-issue.html and here: https://inside.java/2024/02/21/quality-heads-up/.