HG_Respond with extra buf lead to libfabric run out of recv entry #735

Open
hsx6876 opened this issue May 21, 2024 · 0 comments

hsx6876 commented May 21, 2024

Describe the bug
Calling HG_Respond with an extra buffer causes libfabric's rxm provider to run out of recv_entry.

I found this while testing the DAOS project. If the server calls HG_Respond with an extra buffer after the client RPC has timed out and the client has called HG_Cancel, the client does not handle the response, so no acknowledgment is sent back to the server.

The server posts a recv_expected to wait for the client ack, which consumes a recv_entry in ofi_rxm. Because the ack never arrives, that recv_entry is never released. Eventually the recv_entry pool runs out, and this hg_context can no longer post any receives.

The expected behavior is for HG/NA to drop the recv_expected that is waiting for the ack, so that its recv_entry is freed. Adding a timeout mechanism might be one way to do this.
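To make the suggestion concrete, here is a minimal sketch of the idea, not Mercury or libfabric code. It models the recv_entry pool as a small fixed array and gives every pending ack receive a deadline, so that a periodic reap pass can reclaim entries whose ack never arrived. All names here (`ack_pool`, `ack_pool_post`, `ack_pool_complete`, `ack_pool_reap`, `POOL_SIZE`) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool size, kept tiny so exhaustion is easy to see. */
#define POOL_SIZE 4

/* One slot stands in for a recv_entry held by a posted ack receive. */
struct ack_slot {
    bool     in_use;
    uint64_t deadline_ms; /* after this time the ack is treated as lost */
};

struct ack_pool {
    struct ack_slot slots[POOL_SIZE];
};

/* Post a receive for a client ack.
 * Returns the slot index, or -1 when the pool is exhausted. */
int ack_pool_post(struct ack_pool *pool, uint64_t now_ms, uint64_t timeout_ms)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool->slots[i].in_use) {
            pool->slots[i].in_use = true;
            pool->slots[i].deadline_ms = now_ms + timeout_ms;
            return (int)i;
        }
    }
    return -1;
}

/* The ack arrived: release the slot so it can be reused. */
void ack_pool_complete(struct ack_pool *pool, int idx)
{
    assert(idx >= 0 && idx < POOL_SIZE);
    pool->slots[idx].in_use = false;
}

/* Drop every pending receive whose deadline has passed.
 * Returns the number of slots reclaimed. */
size_t ack_pool_reap(struct ack_pool *pool, uint64_t now_ms)
{
    size_t reclaimed = 0;

    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (pool->slots[i].in_use && now_ms >= pool->slots[i].deadline_ms) {
            pool->slots[i].in_use = false;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

Without the reap pass, `ack_pool_post` starts returning -1 once `POOL_SIZE` acks are outstanding, which mirrors the exhaustion described above. In a real fix, reclaiming a slot would also have to cancel the posted expected receive in NA/libfabric; that step is not shown here.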

//

In the meantime, we are adding a timeout detection mechanism to crt_reply to prevent this from happening.
