Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRIVERS-2170 Server info on retryable errors must reflect the originating server #1480

Conversation

jamis
Copy link
Contributor

@jamis jamis commented Dec 5, 2023

Some drivers (e.g. Ruby) attach server information to certain errors for diagnostic purposes. This PR updates the language in the retryable reads and writes specifications to require that these drivers ensure that the server information that is associated with an exception reflects the server that actually originated the exception, even in the presence of one more more retries.

Please complete the following before merging:

  • Update changelog.
  • Make sure there are generated JSON files from the YAML test files.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

NOTE: Tests for this situation (server information associated with exceptions during a retry) have proven challenging to write robustly. TL;DR -- I don't have a good solution for it, and currently cannot claim to have reliably tested this, though the code in the Ruby driver has been present since mid-2020.

I've found the most success with subscribing to command events, waiting for (e.g.) a 'find' event, and then forcing the primary to step down (per https://www.mongodb.com/docs/manual/tutorial/force-member-to-be-primary/) so that the retry hits a different server. I also capture logging output in the test, so I can then compare a message logged for the first exception (with the first server) with the message of the exception raised when the retry fails, and confirm that the servers differ. This mostly works, but (1) it only works for replica sets, and (2) it seems to fail non-deterministically, apparently due to some race condition associated with the setting of a failpoint.

@jamis jamis marked this pull request as ready for review December 11, 2023 20:29
@jamis jamis requested a review from a team as a code owner December 11, 2023 20:29
@jamis jamis requested review from kkloberdanz and removed request for a team December 11, 2023 20:29
@kkloberdanz
Copy link
Contributor

Hi Jamis, is there a succinct snippet from the Ruby driver that implements this behavior that you can share for this PR? Would a test for this behavior be feasible? I'm much more inclined to having testable behavior in a specification.

Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed with @kkloberdanz that it would be nice to test this behavior, but I'm also not sure how that would work given that (AFAIK) reporting the server from which the error originated is not required by any spec

Comment on lines 337 to 339
error occurred. Specifically, if a retry attempt fails, the server reported
with the error MUST correspond to the server that was selected for the retry
and MUST NOT simply be carried over from the original error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last sentence seems at odds with the following language in the description of DRIVERS-2170:

I think they could implement this ticket just by ensuring that the attached server is always the originating server of the error being reported. For example, a driver that always attaches the first attempt's server to such an exception should be changed to conditionally attach the first or second accordingly.

Can we clarify here that the server reported should be the one that corresponds to the error that gets returned to the user?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a conflict between this text and the JIRA description. The JIRA issue was requesting that any server info attached to a retry attempt error indicate the server used for that retry attempt. "Server that was selected for the retry" suggests the same thing.

What if we replace this entire paragraph with:

If a driver associates server information (e.g. the server address or description) with an error, the driver MUST ensure that the reported server information corresponds to the server that originated the error.

This does away with any reference to "selected" and instead focuses on what was used for the operation or retry attempt.


To @kkloberdanz's earlier comment (#1480 (comment)), I don't think there's any syntax in the unified test format that would be useful for testing this (see: expectedError).

A possible prose test would entail comparing server information reported in a driver exception to the servers inferred from command monitoring events for the original and second attempt. But any meaningful test would also require setting up fail points on two different servers, since consecutive errors on the same host (e.g. failing insert twice on a primary) would behavior the same irrespective of this spec language.

We intentionally don't have any prose tests that deal with failovers (Atlas failover testing is its own mess), and I don't think that's worth the trouble for this spec change.

DRIVERS-2170 was prompted by my own personal pedantry (Ruby historically did The Right Thing™) and I sincerely regret the sorrow, wailing, and gnashing of teeth it has caused among DBX.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplified paragraph. I'll make that change.

@jamis
Copy link
Contributor Author

jamis commented Dec 15, 2023

Hi Jamis, is there a succinct snippet from the Ruby driver that implements this behavior that you can share for this PR? Would a test for this behavior be feasible? I'm much more inclined to having testable behavior in a specification.

Hey Kyle, thanks for taking a look at the PR. The following example from the Ruby driver is taken from code that works with the underlying network socket, and replaces network errors with a driver-specific exception class, and annotates them with (among other things) the server address:

rescue Errno::ETIMEDOUT => e
  raise Error::SocketTimeoutError, "#{e.class}: #{e} (for #{human_address})"
rescue IOError, SystemCallError => e
  raise Error::SocketError, "#{e.class}: #{e} (for #{human_address})"
rescue OpenSSL::SSL::SSLError => e
  raise Error::SocketError, "#{e.class}: #{e} (for #{human_address})"

I share your reluctance to include a spec update without tests, but:

  1. This change only applies to the small subset of drivers that are already annotating errors with server information, and
  2. These errors may originate from a variety of different sources in the code--from network errors, to server errors--and what is more, this spec change only applies to those errors that arise as a result of a second error during a retry, where the retry was performed on a different server than the first attempt. This specific sequence of events has proven particularly challenging to reproduce in a test for even a single exception, let alone the full set of exceptions that this behavior might apply to.

Ultimately, I'm not sure that adding mandated tests to the spec for this behavior can or should be done in a way that excludes the drivers for which this does not apply, and manages to cover the potential errors that this relates to. I'm open to discussing it, though, if you have thoughts on how this ought to be done. (This is my first rodeo with writing a spec change, so I'm sure there's plenty I could be doing better, here!)

@jamis jamis merged commit 7145efa into mongodb:master Jan 9, 2024
4 checks passed
@jamis jamis deleted the drivers-2170-retryable-errors-and-originating-server branch January 9, 2024 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants