
DRIVERS-1670: Add log messages to SDAM spec #1419

Merged: 2 commits into mongodb:master on Jul 20, 2023

Conversation

BorisDog (Contributor):

DRIVERS-1670

Please complete the following before merging:

  • Update changelog.
  • Make sure there are generated JSON files from the YAML test files.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

@BorisDog BorisDog requested review from a team as code owners May 12, 2023 21:35
@BorisDog BorisDog requested review from jmikola, jess-sig, kmahar and isabelatkinson and removed request for a team May 12, 2023 21:35
@BorisDog BorisDog changed the title Drivers 1670 DRIVERS-1670: Add log messages to SDAM spec May 12, 2023
@jess-sig jess-sig removed their request for review May 12, 2023 23:18
kmahar (Contributor) previously requested changes on May 15, 2023:

some initial comments - so far, mostly a bunch of nits around adding comments to help readers understand the YAML tests.
also, did we decide against adding any logs that don't correspond to existing SDAM events? I remember discussing with @jyemin that Java had some possibly useful SDAM logs we could consider including.

- level: debug
component: topology
data:
message: "Server heartbeat failed"
kmahar (Contributor):

do we only expect to see one heartbeat failure or should there be one for each host?

BorisDog (Contributor, Author):

I think just the primary heartbeat is supposed to fail.

kmahar (Contributor):

is that the case? I'd think the failpoint would apply on any member of the topology.

that said, I realize now we only set this to times: 2, so it's (in theory) hit once by the RTT monitor and once for a heartbeat, on the first server that we happen to reach out to.

however, I wonder if it's possible a driver has an interleaving of RTT monitoring commands and server monitoring commands that would lead us to not hit this failpoint on a server monitoring command? i.e. if two RTT monitoring commands hit the failpoint and so it gets disabled before any server monitoring commands encounter it.

it might be more robust to just set the failpoint to always on?
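For reference, a minimal sketch of what the alwaysOn variant could look like as a unified-test-format failPoint operation; the client anchor, appName, and errorCode here are illustrative, not taken from the actual test file:

    - name: failPoint
      object: testRunner
      arguments:
        client: *setupClient               # hypothetical client reference
        failPoint:
          configureFailPoint: failCommand
          mode: alwaysOn                   # instead of mode: { times: 2 }
          data:
            failCommands: ["hello", "isMaster"]
            appName: heartbeatFailTest     # hypothetical; scopes the failpoint to the monitored client
            errorCode: 91                  # example error code (ShutdownInProgress)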

@@ -101,7 +101,7 @@ Terms
Server
``````

-A mongod or mongos process.
+A mongod or mongos process, or a load balancer.
kmahar (Contributor):

I think we should have load balanced mode tests as well and likely add something to this section of the load balancer spec: https://github.com/mongodb/specifications/blob/master/source/load-balancers/load-balancers.rst#monitoring

BorisDog (Contributor, Author):

Added test and short description to load-balancer spec.

@BorisDog BorisDog requested review from jmikola and kmahar May 15, 2023 22:46
@BorisDog BorisDog requested a review from a team as a code owner May 16, 2023 18:38
@BorisDog BorisDog requested review from alcaeus and removed request for a team May 16, 2023 18:38
source/unified-test-format/unified-test-format.rst (outdated; resolved)

Log Messages
^^^^^^^^^^^^

Please refer to the `SDAM logging specification <../server-discovery-and-monitoring/server-discovery-and-monitoring-monitoring.rst#log-messages>`_
kmahar (Contributor):

may be helpful to also refer users to the monitoring section and point out the same considerations apply for the messages that correspond to the events

BorisDog (Contributor, Author):

Sorry I didn't fully understand. Current reference is to the logging section in monitoring.
Do you mean instead of listing the events here just refer to the relevant events?

kmahar (Contributor):

sorry, I meant the monitoring section directly above this one, which describes some details around the emitted events, e.g.

    :code:`TopologyDescriptionChangedEvent`. The :code:`previousDescription` field MUST
    have :code:`TopologyType` :code:`Unknown` and no servers. The :code:`newDescription`
    MUST have :code:`TopologyType` :code:`LoadBalanced` and one server with
    :code:`ServerType` :code:`Unknown`.

I just meant that the same things stated above should apply for the corresponding log messages
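As a rough illustration of that parallel, the load-balanced event assertion quoted above would translate into a log expectation along these lines; the matchers are illustrative, with the exact data fields defined by the SDAM logging spec:

    - level: debug
      component: topology
      data:
        message: "Topology description changed"
        previousDescription:
          $$exists: true    # Unknown topology with no servers
        newDescription:
          $$exists: true    # LoadBalanced with one server of ServerType Unknown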

kmahar (Contributor) commented May 16, 2023:

my remaining comments are fairly minor, so I don't need to re-review.

my previous question still applies but I'll defer to all of you on that:

also, did we decide against adding any logs that don't correspond to existing SDAM events?

also, it might be good to have someone who owns SDAM review this too.

@BorisDog BorisDog requested review from jmikola, a team, tom-selander and ShaneHarvey and removed request for a team and tom-selander May 16, 2023 22:32
jmikola (Member) left a comment:

Changes within unified-test-format/ LGTM. Not reviewing the other files.

Note that you'll need to rebase/merge master to fix the conflict in unified-test-format.rst. Please maintain a blank line after the "Please note schema version bumps" RST comment when doing so.


"Starting Topology Monitoring" Log Message
------------------------------------------
This message MUST be published under the same circumstances as a ``TopologyOpeningEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``TopologyOpeningEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyOpeningEvent`` as detailed in `Events API <#events-api>`_.
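The fix in each of these suggestions is the same one-character move: in reStructuredText, the trailing underscore that turns inline markup into a hyperlink reference must sit outside the closing backtick:

    `Events API <#events-api>`_    (correct: renders as a link to the #events-api anchor)
    `Events API <#events-api>_`    (wrong: underscore inside the backticks renders as literal text, not a link)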

BorisDog (Contributor, Author):

All these syntactic changes done.


"Stopped Topology Monitoring" Log Message
------------------------------------------
This message MUST be published under the same circumstances as a ``TopologyClosedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``TopologyClosedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyClosedEvent`` as detailed in `Events API <#events-api>`_.


"Starting Server Monitoring" Log Message
----------------------------------------
This message MUST be published under the same circumstances as a ``ServerOpeningEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerOpeningEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerOpeningEvent`` as detailed in `Events API <#events-api>`_.


"Stopped Server Monitoring" Log Message
----------------------------------------
This message MUST be published under the same circumstances as a ``ServerClosedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerClosedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerClosedEvent`` as detailed in `Events API <#events-api>`_.


"Topology Description Changed" Log Message
------------------------------------------
This message MUST be published under the same circumstances as a ``TopologyDescriptionChangedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``TopologyDescriptionChangedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``TopologyDescriptionChangedEvent`` as detailed in `Events API <#events-api>`_.


"Server Heartbeat Started" Log Message
--------------------------------------
This message MUST be published under the same circumstances as a ``ServerHeartbeatStartedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerHeartbeatStartedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatStartedEvent`` as detailed in `Events API <#events-api>`_.


"Server Heartbeat Succeeded" Log Message
----------------------------------------
This message MUST be published under the same circumstances as a ``ServerHeartbeatSucceededEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerHeartbeatSucceededEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatSucceededEvent`` as detailed in `Events API <#events-api>`_.


* - durationMS
- Int
- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatSucceededEvent`` in `Events API <#events-api>_` for details
Contributor:

Suggested change:
-- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatSucceededEvent`` in `Events API <#events-api>_` for details
+- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatSucceededEvent`` in `Events API <#events-api>`_ for details


"Server Heartbeat Failed" Log Message
-------------------------------------
This message MUST be published under the same circumstances as a ``ServerHeartbeatFailedEvent`` as detailed in `Events API <#events-api>_`.
Contributor:

Suggested change:
-This message MUST be published under the same circumstances as a ``ServerHeartbeatFailedEvent`` as detailed in `Events API <#events-api>_`.
+This message MUST be published under the same circumstances as a ``ServerHeartbeatFailedEvent`` as detailed in `Events API <#events-api>`_.

* - durationMS
- Int
- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatFailedEvent`` in `Events API <#events-api>_` for details
Contributor:

Suggested change:
-- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatFailedEvent`` in `Events API <#events-api>_` for details
+- The execution time for the heartbeat in milliseconds. See ``ServerHeartbeatFailedEvent`` in `Events API <#events-api>`_ for details

isabelatkinson (Contributor) left a comment:

Tests are passing in Rust except for two issues in the standalone logging file; I've left suggestions in comments.

Comment on lines 312 to 311
"serverConnectionId": {
"$$exists": true
}
isabelatkinson (Contributor):

Rust emits a ServerHeartbeatStartedEvent before the hello sent when establishing a connection, which means that the server connection ID is not yet known. I think we discussed previously that the C# driver does not do this, but I don't think the spec mandates either approach. The condition for this event in the existing SDAM monitoring spec is:

    Published when the server monitor sends its hello or legacy hello call to the server.

Can we remove this requirement to accommodate drivers that do emit a heartbeat started event during connection establishment?

BorisDog (Contributor, Author):

Removed.

}
}
],
"messages": [
isabelatkinson (Contributor) commented Jul 7, 2023:

The expected messages assertion here is failing in Rust because we see multiple heartbeat failed events. I think this is reasonable/expected behavior: the failpoint is always on, so the monitor continues to send hellos, listen for hello replies, and emit heartbeat failed events when they fail. I can make the test work either by adding ignoreExtraMessages here or by changing the failpoint mode above to { "times": 2 } (2 instead of 1, to account for the retry) rather than always on. Does either of those sound like a fine change here, or am I missing something?

BorisDog (Contributor, Author):

Good catch, thanks. Added ignoreExtraMessages.
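A sketch of what that fix looks like in the unified test format, with a hypothetical client anchor: ignoreExtraMessages lets the runner accept additional heartbeat-failed messages beyond those listed while the failpoint stays alwaysOn:

    expectLogMessages:
      - client: *client                # hypothetical client reference
        ignoreExtraMessages: true      # extra failed-heartbeat messages are tolerated
        messages:
          - level: debug
            component: topology
            data:
              message: "Server heartbeat failed"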

Comment on lines 501 to 503
"serverConnectionId": {
"$$exists": true
},
isabelatkinson (Contributor):

This check (and ditto in the "Failing heartbeat" tests for other topologies) is flaky for me in CI. I think it's due to a race between setting the failpoint for hellos and establishing a monitoring connection. If a monitor establishes a connection first, the sequence of events will look something like this:

  1. Send a hello for initial connection establishment. Emit a "Server heartbeat started" log message with no serverConnectionId, as we haven't talked to the server yet.
  2. The server replies with a successful hello response, including a serverConnectionId. Emit a "Server heartbeat succeeded" message containing the serverConnectionId from the response.
  3. The failpoint for hellos is set.
  4. Begin a new server heartbeat and emit a "Server heartbeat started" log message with the serverConnectionId stored from the previous successful response.
  5. The heartbeat fails. Emit a "Server heartbeat failed" log message with the serverConnectionId stored from the previous successful response.

However, if the failpoint is set before the monitoring connection is established, the following will occur:

  1. The failpoint for hellos is set.
  2. Send a hello for initial connection establishment. Emit a "Server heartbeat started" log message with no serverConnectionId, as we haven't talked to the server yet.
  3. The heartbeat fails. Emit a "Server heartbeat failed" log message. The serverConnectionId will not be present because we do not have a stored value from a previous successful heartbeat.

Let me know if this makes sense to you. If so, I think we'll need to remove this check from each "Failing heartbeat" test because of the previously discussed discrepancy between drivers regarding when heartbeat events begin to be emitted.

BorisDog (Contributor, Author):

Yes this makes sense, thanks for catching this @isabelatkinson.
Could you please confirm that tests are stable after removing serverConnectionId?

isabelatkinson (Contributor):

Yes, I get a clean patch with the ID removed from all of the failing heartbeat tests.
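For reference, a sketch of the relaxed expectation after the removal, using the fields shown in the snippets above: the flaky $$exists check on serverConnectionId is simply dropped, since the field's presence depends on whether a successful hello happened to precede the failpoint:

    - level: debug
      component: topology
      data:
        message: "Server heartbeat failed"
        durationMS:
          $$exists: true
        # no serverConnectionId assertion: the ID is only present if a
        # successful heartbeat completed before the failpoint was set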

isabelatkinson (Contributor) left a comment:

lgtm!

BorisDog (Contributor, Author):

    some initial comments - so far, mostly a bunch of nits around adding comments to help readers understand the YAML tests. also, did we decide against adding any logs that don't correspond to existing SDAM events? I remember discussing with @jyemin that Java had some possibly useful SDAM logs we could consider including.

Will be done in https://jira.mongodb.org/browse/DRIVERS-2676.

@@ -0,0 +1,150 @@
{
"description": "loadbalanced-logging",
"schemaVersion": "1.14",
Contributor:

The schema version here doesn't match the associated yml file.

BorisDog (Contributor, Author):

Thanks, done.

prestonvasquez (Contributor) left a comment:

LGTM!

@@ -1177,7 +1177,7 @@ The structure of each object is as follows:
 `CMAP
 <../connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#events>`__
 events, and ``sdam`` for `SDAM
-<../server-discovery-and-monitoring/server-discovery-and-monitoring-monitoring.rst#events>`__
+<../server-discovery-and-monitoring/server-discovery-and-monitoring-logging-and-monitoring.rst#events>`__
ShaneHarvey (Member) commented Jul 19, 2023:

@BorisDog should we rename all this file references too?

BorisDog (Contributor, Author):

The intention is to rename server-discovery-and-monitoring-monitoring.rst to server-discovery-and-monitoring-logging-and-monitoring.rst as the last step of the PR, so the references are updated to reflect that.

ShaneHarvey (Member):

I'm fine with the rename as long as it's recognized by git, although I don't think renaming it adds much value since 1) it's already such a long file name and 2) "monitoring" and "logging" are somewhat synonymous.

BorisDog (Contributor, Author):

I have failed to force GitHub to recognize the renaming within the PR (we need to ensure it's recognized in git history).
The naming follows the command-logging-and-monitoring change, where logging was added to monitoring. I see value in having consistent naming across the specs, as long as it does not introduce unnecessary overhead.

Side note: for me the monitoring-monitoring suffix has been a bit confusing, so the longer name with logging is a bit clearer :)

BorisDog (Contributor, Author):

Reverted all renaming residue. Due to the non-trivial commit history in this PR, the renaming will have to be considered in a separate ticket.

@BorisDog BorisDog merged commit 9f68487 into mongodb:master Jul 20, 2023
3 checks passed