-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRIVERS-2170 Server info on retryable errors must reflect the originating server #1480
DRIVERS-2170 Server info on retryable errors must reflect the originating server #1480
Conversation
Hi Jamis, is there a succinct snippet from the Ruby driver that implements this behavior that you can share for this PR? Would a test for this behavior be feasible? I'm much more inclined to having testable behavior in a specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed with @kkloberdanz that it would be nice to test this behavior, but I'm also not sure how that would work given that (AFAIK) reporting the server from which the error originated is not required by any spec
error occurred. Specifically, if a retry attempt fails, the server reported | ||
with the error MUST correspond to the server that was selected for the retry | ||
and MUST NOT simply be carried over from the original error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This last sentence seems at odds with the following language in the description of DRIVERS-2170:
I think they could implement this ticket just by ensuring that the attached server is always the originating server of the error being reported. For example, a driver that always attaches the first attempt's server to such an exception should be changed to conditionally attach the first or second accordingly.
Can we clarify here that the server reported should be the one that corresponds to the error that gets returned to the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a conflict between this text and the JIRA description. The JIRA issue was requesting that any server info attached to a retry attempt error indicate the server used for that retry attempt. "Server that was selected for the retry" suggests the same thing.
What if we replace this entire paragraph with:
If a driver associates server information (e.g. the server address or description) with an error, the driver MUST ensure that the reported server information corresponds to the server that originated the error.
This does away with any reference to "selected" and instead focuses on what was used for the operation or retry attempt.
To @kkloberdanz's earlier comment (#1480 (comment)), I don't think there's any syntax in the unified test format that would be useful for testing this (see: expectedError).
A possible prose test would entail comparing server information reported in a driver exception to the servers inferred from command monitoring events for the original and second attempt. But any meaningful test would also require setting up fail points on two different servers, since consecutive errors on the same host (e.g. failing insert
twice on a primary) would behavior the same irrespective of this spec language.
We intentionally don't have any prose tests that deal with failovers (Atlas failover testing is its own mess), and I don't think that's worth the trouble for this spec change.
DRIVERS-2170 was prompted by my own personal pedantry (Ruby historically did The Right Thing™) and I sincerely regret the sorrow, wailing, and gnashing of teeth it has caused among DBX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the simplified paragraph. I'll make that change.
Hey Kyle, thanks for taking a look at the PR. The following example from the Ruby driver is taken from code that works with the underlying network socket, and replaces network errors with a driver-specific exception class, and annotates them with (among other things) the server address: rescue Errno::ETIMEDOUT => e
raise Error::SocketTimeoutError, "#{e.class}: #{e} (for #{human_address})"
rescue IOError, SystemCallError => e
raise Error::SocketError, "#{e.class}: #{e} (for #{human_address})"
rescue OpenSSL::SSL::SSLError => e
raise Error::SocketError, "#{e.class}: #{e} (for #{human_address})" I share your reluctance to include a spec update without tests, but:
Ultimately, I'm not sure that adding mandated tests to the spec for this behavior can or should be done in a way that excludes the drivers for which this does not apply, and manages to cover the potential errors that this relates to. I'm open to discussing it, though, if you have thoughts on how this ought to be done. (This is my first rodeo with writing a spec change, so I'm sure there's plenty I could be doing better, here!) |
Some drivers (e.g. Ruby) attach server information to certain errors for diagnostic purposes. This PR updates the language in the retryable reads and writes specifications to require that these drivers ensure that the server information that is associated with an exception reflects the server that actually originated the exception, even in the presence of one more more retries.
Please complete the following before merging:
NOTE: Tests for this situation (server information associated with exceptions during a retry) have proven challenging to write robustly. TL;DR -- I don't have a good solution for it, and currently cannot claim to have reliably tested this, though the code in the Ruby driver has been present since mid-2020.
I've found the most success with subscribing to command events, waiting for (e.g.) a 'find' event, and then forcing the primary to step down (per https://www.mongodb.com/docs/manual/tutorial/force-member-to-be-primary/) so that the retry hits a different server. I also capture logging output in the test, so I can then compare a message logged for the first exception (with the first server) with the message of the exception raised when the retry fails, and confirm that the servers differ. This mostly works, but (1) it only works for replica sets, and (2) it seems to fail non-deterministically, apparently due to some race condition associated with the setting of a failpoint.