
DRIVERS-1670: Add log messages to SDAM spec #1419

Merged: 2 commits into mongodb:master on Jul 20, 2023

Conversation

BorisDog (Contributor):

DRIVERS-1670

Please complete the following before merging:

  • Update changelog.
  • Make sure there are generated JSON files from the YAML test files.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

@BorisDog BorisDog requested review from a team as code owners May 12, 2023 21:35
@BorisDog BorisDog requested review from jmikola, jess-sig, kmahar and isabelatkinson and removed request for a team May 12, 2023 21:35
@BorisDog BorisDog changed the title Drivers 1670 DRIVERS-1670: Add log messages to SDAM spec May 12, 2023
@jess-sig jess-sig removed their request for review May 12, 2023 23:18
kmahar (Contributor) previously requested changes on May 15, 2023:

some initial comments - so far, mostly a bunch of nits around adding comments to help readers understand the YAML tests.
also, did we decide against adding any logs that don't correspond to existing SDAM events? I remember discussing with @jyemin that Java had some possibly useful SDAM logs we could consider including.

- level: debug
component: topology
data:
message: "Server heartbeat failed"
kmahar (Contributor):

do we only expect to see one heartbeat failure or should there be one for each host?

BorisDog (Contributor, Author):

I think just the primary heartbeat is supposed to fail.

kmahar (Contributor):

is that the case? I'd think the failpoint would apply on any member of the topology.

that said, I realize now we only set this to times: 2, so it's (in theory) hit once by the RTT monitor and once for a heartbeat, on the first server that we happen to reach out to.

however, I wonder if it's possible a driver has an interleaving of RTT monitoring commands and server monitoring commands that would lead us to not hit this failpoint on a server monitoring command? i.e. if two RTT monitoring commands hit the failpoint and so it gets disabled before any server monitoring commands encounter it.

it might be more robust to just set the failpoint to always on?
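For reference, a minimal sketch of what the alwaysOn variant could look like as a unified-test-format failPoint operation; the client anchor, appName, and errorCode here are illustrative, not taken from the actual test file:

    - name: failPoint
      object: testRunner
      arguments:
        client: *setupClient               # hypothetical client reference
        failPoint:
          configureFailPoint: failCommand
          mode: alwaysOn                   # instead of mode: { times: 2 }
          data:
            failCommands: ["hello", "isMaster"]
            appName: heartbeatFailTest     # hypothetical; scopes the failpoint to the monitored client
            errorCode: 91                  # example error code (ShutdownInProgress)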

@@ -101,7 +101,7 @@ Terms
Server
``````

-A mongod or mongos process.
+A mongod or mongos process, or a load balancer.
kmahar (Contributor):

I think we should have load balanced mode tests as well and likely add something to this section of the load balancer spec: https://github.com/mongodb/specifications/blob/master/source/load-balancers/load-balancers.rst#monitoring

BorisDog (Contributor, Author):

Added test and short description to load-balancer spec.

@BorisDog BorisDog requested review from jmikola and kmahar May 15, 2023 22:46
@BorisDog BorisDog requested a review from a team as a code owner May 16, 2023 18:38
@BorisDog BorisDog requested review from alcaeus and removed request for a team May 16, 2023 18:38
source/unified-test-format/unified-test-format.rst (outdated; resolved)

Log Messages
^^^^^^^^^^^^

Please refer to the `SDAM logging specification <../server-discovery-and-monitoring/server-discovery-and-monitoring-monitoring.rst#log-messages>`_
kmahar (Contributor):

may be helpful to also refer users to the monitoring section and point out the same considerations apply for the messages that correspond to the events

BorisDog (Contributor, Author):

Sorry I didn't fully understand. Current reference is to the logging section in monitoring.
Do you mean instead of listing the events here just refer to the relevant events?

kmahar (Contributor):

sorry, I meant the monitoring section directly above this one, which describes some details around the emitted events, e.g.

    :code:`TopologyDescriptionChangedEvent`. The :code:`previousDescription` field MUST
    have :code:`TopologyType` :code:`Unknown` and no servers. The :code:`newDescription`
    MUST have :code:`TopologyType` :code:`LoadBalanced` and one server with
    :code:`ServerType` :code:`Unknown`.

I just meant that the same things stated above should apply for the corresponding log messages
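As a rough illustration of that parallel, the load-balanced event assertion quoted above would translate into a log expectation along these lines; the matchers are illustrative, with the exact data fields defined by the SDAM logging spec:

    - level: debug
      component: topology
      data:
        message: "Topology description changed"
        previousDescription:
          $$exists: true    # Unknown topology with no servers
        newDescription:
          $$exists: true    # LoadBalanced with one server of ServerType Unknown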

kmahar (Contributor) commented May 16, 2023:

my remaining comments are fairly minor, so I don't need to re-review.

my previous question still applies but I'll defer to all of you on that:

also, did we decide against adding any logs that don't correspond to existing SDAM events?

also, it might be good to have someone who owns SDAM review this too.

@BorisDog BorisDog requested review from jmikola, a team, tom-selander and ShaneHarvey and removed request for a team and tom-selander May 16, 2023 22:32
jmikola (Member) left a comment:

Changes within unified-test-format/ LGTM. Not reviewing the other files.

Note that you'll need to rebase/merge master to fix the conflict in unified-test-format.rst. Please maintain a blank line after the "Please note schema version bumps" RST comment when doing so.


"Starting Topology Monitoring" Log Message
------------------------------------------
This message MUST be published under the same circumstances as a ``TopologyOpeningEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``TopologyOpeningEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyOpeningEvent`` as detailed in `Events API <#events-api>`_.
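The fix in each of these suggestions is the same one-character move: in reStructuredText, the trailing underscore that turns inline markup into a hyperlink reference must sit outside the closing backtick:

    `Events API <#events-api>`_    (correct: renders as a link to the #events-api anchor)
    `Events API <#events-api>_`    (wrong: underscore inside the backticks renders as literal text, not a link)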

BorisDog (Contributor, Author):

All these syntactic changes done.


"Stopped Topology Monitoring" Log Message
------------------------------------------
This message MUST be published under the same circumstances as a ``TopologyClosedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``TopologyClosedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyClosedEvent`` as detailed in `Events API <#events-api>`_.


"Starting Server Monitoring" Log Message
----------------------------------------
This message MUST be published under the same circumstances as a ``ServerOpeningEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerOpeningEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerOpeningEvent`` as detailed in `Events API <#events-api>`_.


"Stopped Server Monitoring" Log Message
----------------------------------------
This message MUST be published under the same circumstances as a ``ServerClosedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerClosedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerClosedEvent`` as detailed in `Events API <#events-api>`_.


"Topology Description Changed" Log Message
------------------------------------------
This message MUST be published under the same circumstances as a ``TopologyDescriptionChangedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``TopologyDescriptionChangedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyDescriptionChangedEvent`` as detailed in `Events API <#events-api>`_.


"Server Heartbeat Started" Log Message
--------------------------------------
This message MUST be published under the same circumstances as a ``ServerHeartbeatStartedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerHeartbeatStartedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatStartedEvent`` as detailed in `Events API <#events-api>`_.


"Server Heartbeat Succeeded" Log Message
----------------------------------------
This message MUST be published under the same circumstances as a ``ServerHeartbeatSucceededEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerHeartbeatSucceededEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatSucceededEvent`` as detailed in `Events API <#events-api>`_.


* - durationMS
- Int
- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatSucceededEvent`` in `Events API <#events-api>_` for details
Contributor:

Suggested change:
-- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatSucceededEvent`` in `Events API <#events-api>_` for details
+- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatSucceededEvent`` in `Events API <#events-api>`_ for details


"Server Heartbeat Failed" Log Message
-------------------------------------
This message MUST be published under the same circumstances as a ``ServerHeartbeatFailedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerHeartbeatFailedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatFailedEvent`` as detailed in `Events API <#events-api>`_.

* - durationMS
- Int
- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatFailedEvent`` in `Events API <#events-api>_` for details
Contributor:

Suggested change:
-- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatFailedEvent`` in `Events API <#events-api>_` for details
+- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatFailedEvent`` in `Events API <#events-api>`_ for details

isabelatkinson (Contributor) left a comment:

Tests are passing in Rust except for two issues in the standalone logging file; I've left suggestions in comments.

Comment on lines 312 to 311
"serverConnectionId": {
"$$exists": true
}
isabelatkinson (Contributor):

Rust emits a ServerHeartbeatStartedEvent before the hello sent when establishing a connection, which means that the server connection ID is not yet known. I think we discussed previously that the C# driver does not do this, but I don't think the spec mandates either approach. The condition for this event in the existing SDAM monitoring spec is:

    Published when the server monitor sends its hello or legacy hello call to the server.

Can we remove this requirement to accommodate drivers that do emit a heartbeat started event during connection establishment?

BorisDog (Contributor, Author):

Removed.

}
}
],
"messages": [
isabelatkinson (Contributor) commented Jul 7, 2023:

The expected messages assertion here is failing in Rust because we see multiple heartbeat failed events. I think this is reasonable/expected behavior: the failpoint is always on, so the monitor continues to send hellos, listen for hello replies, and emit heartbeat failed events when they fail. I can make the test work either by adding ignoreExtraMessages here or by changing the failpoint mode above to { "times": 2 } (2 instead of 1, to account for the retry) rather than always on. Does either of those sound like a fine change here, or am I missing something?

BorisDog (Contributor, Author):

Good catch, thanks. Added ignoreExtraMessages.
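A sketch of what that fix looks like in the unified test format, with a hypothetical client anchor: ignoreExtraMessages lets the runner accept additional heartbeat-failed messages beyond those listed while the failpoint stays alwaysOn:

    expectLogMessages:
      - client: *client                # hypothetical client reference
        ignoreExtraMessages: true      # extra failed-heartbeat messages are tolerated
        messages:
          - level: debug
            component: topology
            data:
              message: "Server heartbeat failed"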

Comment on lines 501 to 503
"serverConnectionId": {
"$$exists": true
},
isabelatkinson (Contributor):

This check (and ditto in the "Failing heartbeat" tests for other topologies) is flaky for me in CI. I think it's due to a race between setting the failpoint for hellos and establishing a monitoring connection. If a monitor establishes a connection first, the sequence of events will look something like this:

  1. Send a hello for initial connection establishment. Emit a "Server heartbeat started" log message with no serverConnectionId, as we haven't talked to the server yet.
  2. The server replies with a successful hello response, including a serverConnectionId. Emit a "Server heartbeat succeeded" message containing the serverConnectionId from the response.
  3. The failpoint for hellos is set.
  4. Begin a new server heartbeat and emit a "Server heartbeat started" log message with the serverConnectionId stored from the previous successful response.
  5. The heartbeat fails. Emit a "Server heartbeat failed" log message with the serverConnectionId stored from the previous successful response.

However, if the failpoint is set before the monitoring connection is established, the following will occur:

  1. The failpoint for hellos is set.
  2. Send a hello for initial connection establishment. Emit a "Server heartbeat started" log message with no serverConnectionId, as we haven't talked to the server yet.
  3. The heartbeat fails. Emit a "Server heartbeat failed" log message. The serverConnectionId will not be present because we do not have a stored value from a previous successful heartbeat.

Let me know if this makes sense to you. If so, I think we'll need to remove this check from each "Failing heartbeat" test because of the previously discussed discrepancy between drivers regarding when heartbeat events begin to be emitted.

BorisDog (Contributor, Author):

Yes this makes sense, thanks for catching this @isabelatkinson.
Could you please confirm that tests are stable after removing serverConnectionId?

isabelatkinson (Contributor):

Yes, I get a clean patch with the ID removed from all of the failing heartbeat tests.
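For reference, a sketch of the relaxed expectation after the removal, using the fields shown in the snippets above: the flaky $$exists check on serverConnectionId is simply dropped, since the field's presence depends on whether a successful hello happened to precede the failpoint:

    - level: debug
      component: topology
      data:
        message: "Server heartbeat failed"
        durationMS:
          $$exists: true
        # no serverConnectionId assertion: the ID is only present if a
        # successful heartbeat completed before the failpoint was set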

isabelatkinson (Contributor) left a comment:

lgtm!

BorisDog (Contributor, Author):

    some initial comments - so far, mostly a bunch of nits around adding comments to help readers understand the YAML tests. also, did we decide against adding any logs that don't correspond to existing SDAM events? I remember discussing with @jyemin that Java had some possibly useful SDAM logs we could consider including.

Will be done in https://jira.mongodb.org/browse/DRIVERS-2676.

@@ -0,0 +1,150 @@
{
"description": "loadbalanced-logging",
"schemaVersion": "1.14",
Contributor:

The schema version here doesn't match the associated yml file.

BorisDog (Contributor, Author):

Thanks, done.

prestonvasquez (Contributor) left a comment:

LGTM!

@@ -1177,7 +1177,7 @@ The structure of each object is as follows:
 `CMAP
 <../connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#events>`__
 events, and ``sdam`` for `SDAM
-<../server-discovery-and-monitoring/server-discovery-and-monitoring-monitoring.rst#events>`__
+<../server-discovery-and-monitoring/server-discovery-and-monitoring-logging-and-monitoring.rst#events>`__
ShaneHarvey (Member) commented Jul 19, 2023:

@BorisDog should we rename all this file references too?

BorisDog (Contributor, Author):

The intention is to rename server-discovery-and-monitoring-monitoring.rst to server-discovery-and-monitoring-logging-and-monitoring.rst as the last step of the PR, so the references are updated to reflect that.

ShaneHarvey (Member):

I'm fine with the rename as long as it's recognized by git, although I don't think renaming it adds much value since 1) it's already such a long file name and 2) "monitoring" and "logging" are somewhat synonymous.

BorisDog (Contributor, Author):

I have failed to force GitHub to recognize the renaming within the PR (we need to ensure it's recognized in git history).
The naming follows the command-logging-and-monitoring change, where logging was added to monitoring. I see value in having consistent naming across the specs, as long as it does not introduce unnecessary overhead.

Side note: for me the monitoring-monitoring suffix has been a bit confusing, so the longer name with logging is a bit clearer :)

BorisDog (Contributor, Author):

Reverted all renaming residue. Due to the non-trivial commit history in this PR, the renaming will have to be considered in a separate ticket.

@BorisDog BorisDog merged commit 9f68487 into mongodb:master Jul 20, 2023
3 checks passed