DRIVERS-2170 Server info on retryable errors must reflect the originating server #1480

jamis · 2023-12-05T19:09:37Z

Some drivers (e.g. Ruby) attach server information to certain errors for diagnostic purposes. This PR updates the language in the retryable reads and writes specifications to require that these drivers ensure that the server information associated with an exception reflects the server that actually originated the exception, even in the presence of one or more retries.

Please complete the following before merging:

Update changelog.
Make sure there are generated JSON files from the YAML test files.
Test changes in at least one language driver.
Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

NOTE: Tests for this situation (server information associated with exceptions during a retry) have proven challenging to write robustly. TL;DR -- I don't have a good solution for it, and currently cannot claim to have reliably tested this, though the code in the Ruby driver has been present since mid-2020.

I've found the most success with subscribing to command events, waiting for (e.g.) a 'find' event, and then forcing the primary to step down (per https://www.mongodb.com/docs/manual/tutorial/force-member-to-be-primary/) so that the retry hits a different server. I also capture logging output in the test, so I can then compare a message logged for the first exception (with the first server) with the message of the exception raised when the retry fails, and confirm that the servers differ. This mostly works, but (1) it only works for replica sets, and (2) it seems to fail non-deterministically, apparently due to some race condition associated with the setting of a failpoint.
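
The final comparison step described above can be sketched in isolation. This is a minimal sketch, not the test from the Ruby driver: both helper names are hypothetical, and it assumes that the logged message and the raised exception's message each end with a "(for <address>)" suffix.

```ruby
# Hypothetical helper: pull the server address out of a message that ends
# with "(for <address>)". The suffix format is an assumption of this sketch.
def address_from_message(message)
  match = message.match(/\(for (.+)\)\s*\z/)
  match && match[1]
end

# Hypothetical check: true only when both messages name a server and the
# two servers differ.
def servers_differ?(first_attempt_message, retry_message)
  first = address_from_message(first_attempt_message)
  second = address_from_message(retry_message)
  !first.nil? && !second.nil? && first != second
end
```

A test would pass the captured log line for the first failure and the message of the exception raised after the retry, and assert that servers_differ? returns true.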

kkloberdanz · 2023-12-14T20:56:21Z

Hi Jamis, is there a succinct snippet from the Ruby driver that implements this behavior that you can share for this PR? Would a test for this behavior be feasible? I'm much more inclined to having testable behavior in a specification.

isabelatkinson

Agreed with @kkloberdanz that it would be nice to test this behavior, but I'm also not sure how that would work given that (AFAIK) reporting the server from which the error originated is not required by any spec.

isabelatkinson · 2023-12-14T21:29:11Z

source/retryable-reads/retryable-reads.rst

+error occurred. Specifically, if a retry attempt fails, the server reported
+with the error MUST correspond to the server that was selected for the retry
+and MUST NOT simply be carried over from the original error.


This last sentence seems at odds with the following language in the description of DRIVERS-2170:

I think they could implement this ticket just by ensuring that the attached server is always the originating server of the error being reported. For example, a driver that always attaches the first attempt's server to such an exception should be changed to conditionally attach the first or second accordingly.

Can we clarify here that the server reported should be the one that corresponds to the error that gets returned to the user?

jmikola · 2023-12-15

I don't see a conflict between this text and the JIRA description. The JIRA issue was requesting that any server info attached to a retry attempt error indicate the server used for that retry attempt. "Server that was selected for the retry" suggests the same thing.

What if we replace this entire paragraph with:

If a driver associates server information (e.g. the server address or description) with an error, the driver MUST ensure that the reported server information corresponds to the server that originated the error.

This does away with any reference to "selected" and instead focuses on what was used for the operation or retry attempt.
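
To illustrate what the proposed wording asks of a driver, here is a minimal sketch of a retry path that annotates each error with the server used for that attempt. All names (OperationError, attempt_on, with_one_retry) are hypothetical and are not taken from any driver; real retry logic involves server selection and retryability checks that are left out here.

```ruby
# Hypothetical error class that carries the address of the server it came from.
class OperationError < StandardError
  attr_reader :server_address

  def initialize(message, server_address)
    super("#{message} (for #{server_address})")
    @server_address = server_address
  end
end

# Runs the operation against one server and annotates any IOError with the
# server used for this attempt.
def attempt_on(server)
  yield(server)
rescue IOError => e
  raise OperationError.new(e.message, server)
end

# One initial attempt plus one retry. If the retry fails, the error that
# propagates was built during the retry, so it names retry_server and
# carries nothing over from the first failure.
def with_one_retry(first_server, retry_server, &operation)
  attempt_on(first_server, &operation)
rescue OperationError
  attempt_on(retry_server, &operation)
end
```

Because the annotation happens inside attempt_on, at the point where the error is raised, the retry path has no opportunity to reuse the first attempt's server.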

To @kkloberdanz's earlier comment (#1480 (comment)), I don't think there's any syntax in the unified test format that would be useful for testing this (see: expectedError).

A possible prose test would entail comparing server information reported in a driver exception to the servers inferred from command monitoring events for the original and second attempt. But any meaningful test would also require setting up fail points on two different servers, since consecutive errors on the same host (e.g. failing insert twice on a primary) would behave the same irrespective of this spec language.
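
The comparison half of such a prose test could look roughly like the sketch below. AddressRecorder and reported_retry_server? are hypothetical names, and the recorder assumes only that events respond to command_name and address. With the Ruby driver, an object like this could be registered through command monitoring (for example client.subscribe(Mongo::Monitoring::COMMAND, recorder)); that wiring and the fail point setup are left out.

```ruby
# Hypothetical subscriber: records the address of every started command
# with a given name.
class AddressRecorder
  attr_reader :addresses

  def initialize(command_name)
    @command_name = command_name
    @addresses = []
  end

  def started(event)
    @addresses << event.address.to_s if event.command_name == @command_name
  end

  # No-ops so the object offers a started/succeeded/failed interface.
  def succeeded(_event); end

  def failed(_event); end
end

# Hypothetical check: exactly two attempts were observed, they hit different
# servers, and the error names the server of the second attempt.
def reported_retry_server?(recorder, error_address)
  first = recorder.addresses.first
  last = recorder.addresses.last
  recorder.addresses.size == 2 && first != last && error_address.to_s == last
end
```

The check fails both when the error still names the first server and when the two attempts landed on the same host, which is the case where the test would be meaningless.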

We intentionally don't have any prose tests that deal with failovers (Atlas failover testing is its own mess), and I don't think that's worth the trouble for this spec change.

DRIVERS-2170 was prompted by my own personal pedantry (Ruby historically did The Right Thing™) and I sincerely regret the sorrow, wailing, and gnashing of teeth it has caused among DBX.

jamis · 2023-12-15

I like the simplified paragraph. I'll make that change.

jamis · 2023-12-15T18:47:59Z

Hi Jamis, is there a succinct snippet from the Ruby driver that implements this behavior that you can share for this PR? Would a test for this behavior be feasible? I'm much more inclined to having testable behavior in a specification.

Hey Kyle, thanks for taking a look at the PR. The following example from the Ruby driver is taken from code that works with the underlying network socket. It replaces network errors with a driver-specific exception class and annotates them with (among other things) the server address:

rescue Errno::ETIMEDOUT => e
  raise Error::SocketTimeoutError, "#{e.class}: #{e} (for #{human_address})"
rescue IOError, SystemCallError => e
  raise Error::SocketError, "#{e.class}: #{e} (for #{human_address})"
rescue OpenSSL::SSL::SSLError => e
  raise Error::SocketError, "#{e.class}: #{e} (for #{human_address})"
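
For readers who want to run the pattern outside the driver, here is a self-contained sketch of the same idea. The Sketch module, its error classes, and map_socket_errors are hypothetical stand-ins and not the driver's actual structure; only the rescue-and-annotate shape mirrors the excerpt above.

```ruby
require 'openssl'

module Sketch
  # Hypothetical stand-ins for the driver's error classes.
  class SocketError < StandardError; end
  class SocketTimeoutError < SocketError; end

  # Runs the block and re-raises low-level errors as Sketch errors whose
  # message names the address of the server being talked to.
  def self.map_socket_errors(human_address)
    yield
  rescue Errno::ETIMEDOUT => e
    # Must come before SystemCallError, which is its superclass.
    raise SocketTimeoutError, "#{e.class}: #{e} (for #{human_address})"
  rescue IOError, SystemCallError, OpenSSL::SSL::SSLError => e
    raise SocketError, "#{e.class}: #{e} (for #{human_address})"
  end
end
```

Since the address is taken from the socket that raised, an error produced while talking to a retry's server names that server without any extra bookkeeping in the retry logic.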

I share your reluctance to include a spec update without tests, but:

This change only applies to the small subset of drivers that are already annotating errors with server information, and
These errors may originate from a variety of different sources in the code--from network errors, to server errors--and what is more, this spec change only applies to those errors that arise as a result of a second error during a retry, where the retry was performed on a different server than the first attempt. This specific sequence of events has proven particularly challenging to reproduce in a test for even a single exception, let alone the full set of exceptions that this behavior might apply to.

Ultimately, I'm not sure that adding mandated tests to the spec for this behavior can or should be done in a way that excludes the drivers for which this does not apply, and manages to cover the potential errors that this relates to. I'm open to discussing it, though, if you have thoughts on how this ought to be done. (This is my first rodeo with writing a spec change, so I'm sure there's plenty I could be doing better, here!)

jamis added 2 commits December 5, 2023 11:45

DRIVERS-2170 server info on retryable errors must reflect originating server

59ec7d9

update changelogs

686f8e6

jamis marked this pull request as ready for review December 11, 2023 20:29

jamis requested a review from a team as a code owner December 11, 2023 20:29

jamis requested review from kkloberdanz and removed request for a team December 11, 2023 20:29

kevinAlbs requested a review from isabelatkinson December 14, 2023 20:17

isabelatkinson requested changes Dec 14, 2023

simplify the paragraph

2dccf6f

kkloberdanz approved these changes Dec 18, 2023

Merge branch 'master' into drivers-2170-retryable-errors-and-originating-server

028b928

isabelatkinson approved these changes Jan 8, 2024

jamis merged commit 7145efa into mongodb:master Jan 9, 2024
4 checks passed

jamis deleted the drivers-2170-retryable-errors-and-originating-server branch January 9, 2024 16:33

jamis commented Dec 5, 2023 (edited)

kkloberdanz commented Dec 14, 2023

isabelatkinson left a comment

isabelatkinson Dec 14, 2023

jmikola Dec 15, 2023

jamis Dec 15, 2023

jamis commented Dec 15, 2023
