DRIVERS-1670: Add log messages to SDAM spec #1419
Conversation
Some initial comments - so far, mostly a bunch of nits around adding comments to help readers understand the YAML tests.

Also, did we decide against adding any logs that don't correspond to existing SDAM events? I remember discussing with @jyemin that Java had some possibly useful SDAM logs we could consider including.
Resolved review threads:
- ...e/server-discovery-and-monitoring/server-discovery-and-monitoring-logging-and-monitoring.rst (outdated)
- source/server-discovery-and-monitoring/tests/unified/logging-replicaset.yml (outdated)
- source/server-discovery-and-monitoring/tests/unified/logging-replicaset.yml
- source/server-discovery-and-monitoring/tests/unified/logging-replicaset.yml
- source/server-discovery-and-monitoring/tests/unified/logging-replicaset.yml (outdated)
- source/server-discovery-and-monitoring/tests/unified/logging-sharded.yml (outdated)
```yaml
- level: debug
  component: topology
  data:
    message: "Server heartbeat failed"
```
Do we only expect to see one heartbeat failure, or should there be one for each host?
I think just the primary heartbeat is supposed to fail.
Is that the case? I'd think the failpoint would apply on any member of the topology.

That said, I realize now we only set this to `times: 2`, so it's (in theory) hit once by the RTT monitor and once for a heartbeat, on the first server that we happen to reach out to.

However, I wonder if it's possible a driver has an interleaving of RTT monitoring commands and server monitoring commands that would lead us to not hit this failpoint on a server monitoring command, i.e. if two RTT monitoring commands hit the failpoint and so it gets disabled before any server monitoring commands encounter it.

It might be more robust to just set the failpoint to always on?
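For concreteness, switching the failpoint to always-on might look like the sketch below in the unified test format. This is an illustrative sketch only, not the test's actual content: the `*setupClient` and `*appName` anchors and the surrounding fields are assumptions.

```yaml
# Hypothetical failPoint operation; anchors (*setupClient, *appName) are assumed.
- name: failPoint
  object: testRunner
  arguments:
    client: *setupClient
    failPoint:
      configureFailPoint: failCommand
      # alwaysOn instead of { times: 2 }: the RTT monitor and the server monitor
      # can then interleave in any order without exhausting the failpoint.
      mode: alwaysOn
      data:
        failCommands: ["hello", "isMaster"]
        appName: *appName
        closeConnection: true
```

Note that with an always-on failpoint the message expectations would also need to tolerate repeated heartbeat-failed messages, since the monitor keeps retrying.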
Resolved review threads:
- source/server-discovery-and-monitoring/tests/unified/logging-standalone.yml (outdated)
- source/server-discovery-and-monitoring/tests/unified/logging-standalone.yml (outdated)
```diff
@@ -101,7 +101,7 @@ Terms
 Server
 ``````

-A mongod or mongos process.
+A mongod or mongos process, or a load balancer.
```
I think we should have load balanced mode tests as well, and likely add something to this section of the load balancer spec: https://github.com/mongodb/specifications/blob/master/source/load-balancers/load-balancers.rst#monitoring
Added test and short description to load-balancer spec.
Resolved review threads:
- source/unified-test-format/tests/invalid/ignoreLogMessage-data-required.yml (outdated)
- source/unified-test-format/tests/invalid/expectedLogMessagesForClient-ignoreExtraLogs-type.yml (outdated)
- source/unified-test-format/tests/invalid/ignoreLogMessage-component-enum.yml (outdated)
- source/server-discovery-and-monitoring/tests/unified/logging-replicaset.yml (outdated)
- source/server-discovery-and-monitoring/tests/unified/logging-replicaset.yml (outdated)
- source/server-discovery-and-monitoring/tests/unified/logging-sharded.yml (outdated)
```rst
Log Messages
^^^^^^^^^^^^

Please refer to the `SDAM logging specification <../server-discovery-and-monitoring/server-discovery-and-monitoring-monitoring.rst#log-messages>`_
```
May be helpful to also refer users to the monitoring section and point out that the same considerations apply for the messages that correspond to the events.
Sorry, I didn't fully understand. The current reference is to the logging section in monitoring.
Do you mean that instead of listing the events here we should just refer to the relevant events?
Sorry, I meant the monitoring section directly above this, which describes some details around the emitted events, e.g.

> ``TopologyDescriptionChangedEvent``. The ``previousDescription`` field MUST have ``TopologyType`` ``Unknown`` and no servers. The ``newDescription`` MUST have ``TopologyType`` ``LoadBalanced`` and one server with ``ServerType`` ``Unknown``.

I just meant that the same things stated above should apply for the corresponding log messages.
My remaining comments are fairly minor, so I don't need to re-review. My previous question still applies, but I'll defer to all of you on that.

Also, it might be good to have someone who owns SDAM review this too.
Changes within unified-test-format/ LGTM. Not reviewing the other files.

Note that you'll need to rebase/merge master to fix the conflict in `unified-test-format.rst`. Please maintain a blank line after the "Please note schema version bumps" RST comment when doing so.
Resolved review threads:
- source/server-discovery-and-monitoring/tests/unified/logging-replicaset.yml
- source/server-discovery-and-monitoring/tests/unified/logging-sharded.yml
- source/server-discovery-and-monitoring/tests/unified/logging-standalone.yml
- source/server-discovery-and-monitoring/server-discovery-and-monitoring-monitoring.rst (outdated)
- source/server-discovery-and-monitoring/server-discovery-and-monitoring-monitoring.rst (outdated)
```rst
"Starting Topology Monitoring" Log Message
------------------------------------------
```

Suggested change (move the backtick before the trailing underscore to form a valid RST hyperlink):

```diff
-This message MUST be published under the same circumstances as a ``TopologyOpeningEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyOpeningEvent`` as detailed in `Events API <#events-api>`_.
```
All these syntactic changes done.
```rst
"Stopped Topology Monitoring" Log Message
------------------------------------------
```

Suggested change (RST link syntax fix):

```diff
-This message MUST be published under the same circumstances as a ``TopologyClosedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyClosedEvent`` as detailed in `Events API <#events-api>`_.
```
```rst
"Starting Server Monitoring" Log Message
----------------------------------------
```

Suggested change (RST link syntax fix):

```diff
-This message MUST be published under the same circumstances as a ``ServerOpeningEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerOpeningEvent`` as detailed in `Events API <#events-api>`_.
```
```rst
"Stopped Server Monitoring" Log Message
----------------------------------------
```

Suggested change (RST link syntax fix):

```diff
-This message MUST be published under the same circumstances as a ``ServerClosedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerClosedEvent`` as detailed in `Events API <#events-api>`_.
```
```rst
"Topology Description Changed" Log Message
------------------------------------------
```

Suggested change (RST link syntax fix):

```diff
-This message MUST be published under the same circumstances as a ``TopologyDescriptionChangedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyDescriptionChangedEvent`` as detailed in `Events API <#events-api>`_.
```
```rst
"Server Heartbeat Started" Log Message
--------------------------------------
```

Suggested change (RST link syntax fix):

```diff
-This message MUST be published under the same circumstances as a ``ServerHeartbeatStartedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatStartedEvent`` as detailed in `Events API <#events-api>`_.
```
```rst
"Server Heartbeat Succeeded" Log Message
----------------------------------------
```

Suggested change (RST link syntax fix):

```diff
-This message MUST be published under the same circumstances as a ``ServerHeartbeatSucceededEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatSucceededEvent`` as detailed in `Events API <#events-api>`_.
```
```rst
* - durationMS
  - Int
```

Suggested change (RST link syntax fix):

```diff
-  - The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatSucceededEvent`` in `Events API <#events-api>_` for details
+  - The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatSucceededEvent`` in `Events API <#events-api>`_ for details
```
```rst
"Server Heartbeat Failed" Log Message
-------------------------------------
```

Suggested change (RST link syntax fix):

```diff
-This message MUST be published under the same circumstances as a ``ServerHeartbeatFailedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatFailedEvent`` as detailed in `Events API <#events-api>`_.
```
```rst
* - durationMS
  - Int
```

Suggested change (RST link syntax fix):

```diff
-  - The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatFailedEvent`` in `Events API <#events-api>_` for details
+  - The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatFailedEvent`` in `Events API <#events-api>`_ for details
```
Tests are passing in Rust except two issues in the standalone logging file; I've left suggestions in comments.
```json
"serverConnectionId": {
  "$$exists": true
}
```
Rust emits a `ServerHeartbeatStartedEvent` before the hello sent when establishing a connection, which means that the server connection ID is not yet known. I think we discussed previously that the C# driver does not do this, but I don't think the spec mandates either approach. The condition for this event in the existing SDAM monitoring spec is:

> Published when the server monitor sends its hello or legacy hello call to the server.

Can we remove this requirement to accommodate drivers that do emit a heartbeat started event during connection establishment?
Removed.
```json
    }
  }
],
"messages": [
```
The expected messages assertion here is failing in Rust because we see multiple heartbeat failed events. I think this is reasonable/expected behavior: the failpoint is always on, so the monitor continues to send `hello`s / listen for `hello` replies and emits a heartbeat failed event each time one fails. I can make the test work either by adding `ignoreExtraMessages` here or by changing the fail point mode above to `{ "times": 2 }` (2 instead of 1 to account for the retry) rather than always on. Does either of those sound like a fine change here, or am I missing something?
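To illustrate the first option, a sketch of how `ignoreExtraMessages` could sit in the expectation is below. This is an assumption-laden sketch, not the actual test content: the `*client` anchor and the exact `data` fields shown are placeholders.

```yaml
# Hypothetical log message expectation; the *client anchor is assumed.
expectLogMessages:
  - client: *client
    # Tolerate the extra "Server heartbeat failed" messages that an
    # always-on failpoint keeps producing as the monitor retries.
    ignoreExtraMessages: true
    messages:
      - level: debug
        component: topology
        data:
          message: "Server heartbeat failed"
          failure: { $$exists: true }
```

The alternative, `mode: { times: 2 }` on the failpoint, bounds the number of failures instead, at the cost of the ordering concerns raised earlier in this review.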
Good catch, thanks. Added `ignoreExtraMessages`.
Resolved review threads:
- source/server-discovery-and-monitoring/tests/unified/logging-loadbalanced.json (outdated)
- source/server-discovery-and-monitoring/tests/unified/logging-loadbalanced.yml (outdated)
```json
"serverConnectionId": {
  "$$exists": true
},
```
This check (and ditto in the "Failing heartbeat" tests for other topologies) is flaky for me in CI. I think it's due to a race between setting the failpoint for hellos and establishing a monitoring connection. If a monitor establishes a connection first, the sequence of events will look something like this:
- Send a hello for initial connection establishment. Emit a "Server heartbeat started" log message with no serverConnectionId, as we haven't talked to the server yet.
- The server replies with a successful hello response, including a serverConnectionId. Emit a "Server heartbeat succeeded" message containing the serverConnectionId from the response.
- The failpoint for hellos is set.
- Begin a new server heartbeat and emit a "Server heartbeat started" log message with the serverConnectionId stored from the previous successful response.
- The heartbeat fails. Emit a "Server heartbeat failed" log message with the serverConnectionId stored from the previous successful response.
However, if the failpoint is set before the monitoring connection is established, the following will occur:
- The failpoint for hellos is set.
- Send a hello for initial connection establishment. Emit a "Server heartbeat started" log message with no serverConnectionId, as we haven't talked to the server yet.
- The heartbeat fails. Emit a "Server heartbeat failed" log message. The serverConnectionId will not be present because we do not have a stored value from a previous successful heartbeat.
Let me know if this makes sense to you. If so, I think we'll need to remove this check from each "Failing heartbeat" test because of the previously discussed discrepancy between drivers regarding when heartbeat events begin to be emitted.
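With the `serverConnectionId` check removed, the failing-heartbeat expectation could be reduced to something like the sketch below. The field set shown is an assumption for illustration, not the actual test content.

```yaml
# Hypothetical reduced expectation for the "Failing heartbeat" tests.
- level: debug
  component: topology
  data:
    message: "Server heartbeat failed"
    # serverConnectionId is deliberately not asserted: depending on whether
    # the failpoint was set before or after the monitor's first successful
    # hello, the driver may or may not have a stored server connection ID.
```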
Yes, this makes sense, thanks for catching this @isabelatkinson.
Could you please confirm that tests are stable after removing `serverConnectionId`?
Yes, I get a clean patch with the ID removed from all of the failing heartbeat tests.
LGTM!

Will be done in https://jira.mongodb.org/browse/DRIVERS-2676.
```diff
@@ -0,0 +1,150 @@
+{
+  "description": "loadbalanced-logging",
+  "schemaVersion": "1.14",
```
The schema version here doesn't match the associated yml file.
Thanks, done.
LGTM!
```diff
@@ -1177,7 +1177,7 @@ The structure of each object is as follows:
 `CMAP
 <../connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#events>`__
 events, and ``sdam`` for `SDAM
-<../server-discovery-and-monitoring/server-discovery-and-monitoring-monitoring.rst#events>`__
+<../server-discovery-and-monitoring/server-discovery-and-monitoring-logging-and-monitoring.rst#events>`__
```
@BorisDog, should we rename all these file references too?
The intention is to rename `server-discovery-and-monitoring-monitoring.rst` to `server-discovery-and-monitoring-logging-and-monitoring.rst` as the last step of the PR, so the references are updated to reflect that.
I'm fine with the rename as long as it's recognized by git, although I don't think renaming it adds much value since 1) it's already such a long file name and 2) "monitoring" and "logging" are somewhat synonymous.
I have failed to force GitHub to recognize the renaming within the PR (need to ensure it's recognized in git history).

The naming follows the command-logging-and-monitoring change, where `logging` was added to `monitoring`. I see value in having consistent naming across the spec, as long as it does not introduce unnecessary overhead.

Side note: for me the `monitoring-monitoring` suffix has been a bit confusing, so the longer name with `logging` is a bit clearer :)
Reverted all renaming residue. Due to non-trivial commit history in this PR, renaming has to be considered in a separate ticket.
DRIVERS-1670