Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang in PreClose method during dup2 system call #605

Open
wants to merge 1 commit into
base: openj9
Choose a base branch
from

Conversation

shruacha1234
Copy link
Contributor

During fcntl, dup2 system calls if read/write happens then PreClose method gets hang. This issue is not seen if we Signal/Kill the thread first and then call the preClose method.

Signed-off-by: Shruthi.Shruthi1 Shruthi.Shruthi1@ibm.com

@shruacha1234 shruacha1234 force-pushed the PreClose_hang branch 2 times, most recently from d863fa7 to 34cdd7b Compare December 27, 2022 10:28
@shruacha1234
Copy link
Contributor Author

@pshipton @keithc-ca Can you please review this PR

@pshipton
Copy link
Member

pshipton commented Jan 3, 2023

Pls update the original source rather than copying it to the closed directory and modifying it.

@shruacha1234 shruacha1234 force-pushed the PreClose_hang branch 2 times, most recently from 7bf6291 to bbc3c5d Compare January 5, 2023 06:12
Comment on lines -110 to +116
nd.preClose(fd);
NativeThread.signal(th);
nd.preClose(fd);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a much more detailed description of why the original order is wrong and why it should apply to all "unix" platforms. Further, the same order (preClose then signal) also exists in each of these files

src/java.base/share/classes/sun/nio/ch/DatagramChannelImpl.java
src/java.base/share/classes/sun/nio/ch/ServerSocketChannelImpl.java
src/java.base/share/classes/sun/nio/ch/SocketChannelImpl.java
src/java.base/unix/classes/sun/nio/ch/SinkChannelImpl.java
src/jdk.sctp/unix/classes/sun/nio/ch/sctp/SctpChannelImpl.java
src/jdk.sctp/unix/classes/sun/nio/ch/sctp/SctpMultiChannelImpl.java
src/jdk.sctp/unix/classes/sun/nio/ch/sctp/SctpServerChannelImpl.java

Please either correct each of those instances or explain why they don't need similar treatment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think you've convinced me that reordering those calls makes sense. Other threads that might retry their read or write operation will fail because the channel has been closed (notice assert !isOpen(); on line 823.

However, you haven't explained why the same pattern that exists in the classes I listed above doesn't need correcting.

Copy link
Contributor Author

@shruacha1234 shruacha1234 Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keithc-ca
The changes are applicable to the files you have listed. I have made the changes to those files

During fcntl, dup2 system calls if read/write happens then PreClose method gets hang. This issue is not seen if we Signal/Kill the thread first and then call the preClose method.

Signed-off-by: Shruthi.Shruthi1 <Shruthi.Shruthi1@ibm.com>
@shruacha1234
Copy link
Contributor Author

@keithc-ca
The preClose() method internally calls the dup2() system call. If there is a dup2() call on a file descriptor on a thread while another thread is blocked in a read/write operation on the same file descriptor, then the dup2() call is known to hang. Currently, preClose() experiences a hang because we call dup2() before killing the reader/writer thread(s). The solution/workaround is to first kill the reader/writer thread(s) and then do the preClose() sequence. This reordering in the different channel implementations resolves the problem.

This is a known issue on AIX (and z/OS) since Java 6 and the solution mentioned above has been adopted in IBM Java since then.

We wanted to take this solution to OpenJDK, but there we require a Java reproducer test. Unfortunately, we haven't been able to reproduce the issue using a simple Java test case. This has been reported by an IBM customer who is able to reproduce the issue consistently on an AIX machine, but only with the IBM Integration Server running with the Terracotta server. Without a simple Java reproducer, OpenJDK is unlikely to accept a bug report and hence a PR. So, we want to propose this solution to the extensions repo.

@shruacha1234
Copy link
Contributor Author

@keithc-ca Can you please review the changes made to the files you have listed

@keithc-ca
Copy link
Member

I think we need to tread carefully here. I think we need to consider whether the change here will make the problem reported in eclipse-openj9/openj9#12582 more likely. @pshipton, do you have any thoughts or opinions on that?

@pshipton
Copy link
Member

I think we need to consider whether the change here will make the problem reported in eclipse-openj9/openj9#12582 more likely.

We can run grinders on the test with this change.

Rather than assuming we won't get anything from OpenJDK, we should take this to OpenJDK even without a testcase. Even if we can't get the change made at OpenJDK, we should try, and we may get some useful information.

I'm also thinking that if the problem only affects AIX and z/OS, the changes should be restricted to those platforms rather than modify other platforms and potentially cause a regression or change in behavior there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants