fix(NODE-6370) response messages to large commands can be lost under load #4245

hunkydoryrepair · 2024-09-21T21:23:13Z

Description

NODE-6370 describes scenarios where executing commands can hang indefinitely.
Debugging uncovered that the response messages were sometimes being lost because the 'data' listener on the socket was not active in time, which was caused by awaiting the 'drain' event before adding the 'data' listener (in readMany).
This changes when the drain event is handled. This would generally only be observable in these scenarios:

The command being sent was large enough that it could not be fully written to the kernel socket and was queued in user memory.
The load on the driver process is such that the task awaiting the 'drain' event would not be resumed before the data is received from the server. This was reproducible only scenarios where a lot of network activity is happening at once, as well as a lot of other synchronous activity (like mongoose schema parsing). In those conditions, the 'data' event may be emitted and handled (or in this case, not handled) before the task awaiting 'drain' is resumed.

What is changing?

The drain event is not awaited when the response will be read. The response cannot arrive until after socket is drained, so is unnecessary in that case. This ensures that the 'data' event will be listened to.

Is there new documentation needed for these changes?

No

What is the motivation for this change?

Current driver is unsafe for use in production environments that experience high load and save large documents.

Release Highlight

Fill in title or leave empty for no highlight

Double check the following

Ran npm run check:lint script
Self-review completed using the steps outlined here
PR title follows the correct format: type(NODE-xxxx)[!]: description
- Example: feat(NODE-1234)!: rewriting everything in coffeescript
Changes are covered by tests
New TODOs have a related JIRA ticket

(this area has a lot of tests covering it. A test for the actual issue is impractical)

…load

nbbeeken · 2024-09-27T19:38:02Z

Thanks again for your help @hunkydoryrepair! Rather than elevating the promise, we went with pause/resume so we can leave the logic contained in the write step as it relates to a new feature we're working on. Let us know if you get a chance to check if this passes your use case if you have the bandwidth.

nbbeeken · 2024-09-27T19:38:28Z

Fixed by: #4249

fix(NODE-6370) response messages to large commands can be lost under …

1e31bfe

…load

dariakp added tracked-in-jira Ticket filed in MongoDB's Jira system External Submission PR submitted from outside the team labels Sep 27, 2024

nbbeeken closed this Sep 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(NODE-6370) response messages to large commands can be lost under load #4245

fix(NODE-6370) response messages to large commands can be lost under load #4245

hunkydoryrepair commented Sep 21, 2024 •

edited

Loading

nbbeeken commented Sep 27, 2024

nbbeeken commented Sep 27, 2024

fix(NODE-6370) response messages to large commands can be lost under load #4245

fix(NODE-6370) response messages to large commands can be lost under load #4245

Conversation

hunkydoryrepair commented Sep 21, 2024 • edited Loading

Description

What is changing?

Is there new documentation needed for these changes?

What is the motivation for this change?

Release Highlight

Fill in title or leave empty for no highlight

Double check the following

nbbeeken commented Sep 27, 2024

nbbeeken commented Sep 27, 2024

hunkydoryrepair commented Sep 21, 2024 •

edited

Loading