DRIVERS-2578 Drivers use polling SDAM on AWS Lambda
Disable streaming SDAM by default on AWS Lambda and similar FaaS platforms.
Introduce the sdamMode=stream/poll/auto URI option.
ShaneHarvey committed Aug 21, 2023
1 parent 674bee7 commit e29f381
Showing 7 changed files with 374 additions and 11 deletions.
5 changes: 5 additions & 0 deletions source/faas-automated-testing/faas-automated-testing.rst
@@ -243,6 +243,8 @@ the function implementation the driver MUST:
- Drivers MUST record the durations and counts of the heartbeats, the durations of the
commands, as well as keep track of the number of open connections, and report this information in
the function response as JSON.
- Drivers MUST assert no ServerHeartbeat events contain the ``awaited=True`` flag to
confirm that the streaming protocol is disabled (`DRIVERS-2578`_).
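
A minimal sketch of how a Lambda function might perform this assertion, written in Python. The ``HeartbeatEvent`` record and both helper functions are hypothetical names used for illustration only; a real driver would collect these events from its own SDAM event listener.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HeartbeatEvent:
    """Hypothetical record of one ServerHeartbeat event seen by the function."""

    kind: str  # "started", "succeeded", or "failed"
    awaited: bool  # True only when the check used the streaming protocol
    duration_ms: float = 0.0


def awaited_heartbeats(events: Iterable[HeartbeatEvent]) -> List[HeartbeatEvent]:
    """Return every heartbeat event that was produced by a streaming check."""
    return [event for event in events if event.awaited]


def assert_no_awaited_heartbeats(events: Iterable[HeartbeatEvent]) -> None:
    """Raise AssertionError if any heartbeat used the streaming protocol."""
    offending = awaited_heartbeats(events)
    if offending:
        raise AssertionError(
            f"streaming protocol is enabled: {len(offending)} awaited heartbeat(s)"
        )
```

Calling ``assert_no_awaited_heartbeats`` on the recorded events before building the JSON response lets the function fail loudly if streaming was not disabled.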


Running in Continuous Integration
@@ -368,6 +370,9 @@ Description of the behaviour of run-deployed-lambda-aws-tests.sh:
Changelog
=========

:2023-08-21: Drivers MUST assert that the streaming protocol is disabled in the Lambda function.
:2023-08-17: Fixed URI typo, added host note, increase assume role duration.
:2023-06-22: Updated evergreen configuration to use task groups.
:2023-04-14: Added list of supported variants, added additional template config.

.. _DRIVERS-2578: https://jira.mongodb.org/browse/DRIVERS-2578
92 changes: 83 additions & 9 deletions source/server-discovery-and-monitoring/server-monitoring.rst
@@ -91,6 +91,28 @@ Round trip time. The client's measurement of the duration of one hello or legacy
The RTT is used to support `localThresholdMS`_ from the Server Selection spec
and `timeoutMS`_ from the `Client Side Operations Timeout Spec`_.

FaaS
````

A Function-as-a-Service (FaaS) environment like AWS Lambda.

sdamMode
````````

The sdamMode option configures which server monitoring protocol to use. Valid modes are
"stream", "poll", or "auto". The default value MUST be "auto":

- With "stream" mode, the client MUST use the streaming protocol when the server supports
it or fall back to the polling protocol otherwise.
- With "poll" mode, the client MUST use the polling protocol.
- With "auto" mode, the client MUST behave the same as "poll" mode when running on a FaaS
platform or the same as "stream" mode otherwise. The client detects that it's
running on a FaaS platform via the same rules for generating the ``client.env``
handshake metadata field in the `MongoDB Handshake spec`_.

Multi-threaded or asynchronous drivers MUST implement this option.
See `Why disable the streaming protocol on FaaS platforms like AWS Lambda?`_ and
`Why introduce a knob for sdamMode?`_.
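
The three modes above can be sketched in Python as follows. This is an illustration under stated assumptions, not the normative algorithm: the function names are hypothetical, and the environment variable list is an assumption based on the FaaS detection rules in the MongoDB Handshake spec, which remains the authoritative source (including how it treats environments that match several providers).

```python
from typing import Mapping

VALID_SDAM_MODES = ("stream", "poll", "auto")

# Assumption: these names mirror the FaaS detection rules of the MongoDB
# Handshake spec; consult that spec for the authoritative list.
_FAAS_ENV_VARS = (
    "AWS_LAMBDA_RUNTIME_API",
    "FUNCTIONS_WORKER_RUNTIME",
    "K_SERVICE",
    "FUNCTION_NAME",
    "VERCEL",
)


def is_faas(env: Mapping[str, str]) -> bool:
    """Best-effort FaaS detection from environment variables."""
    if env.get("AWS_EXECUTION_ENV", "").startswith("AWS_Lambda_"):
        return True
    return any(env.get(name) for name in _FAAS_ENV_VARS)


def streaming_enabled(sdam_mode: str, env: Mapping[str, str]) -> bool:
    """Map an sdamMode value to whether the streaming protocol may be used."""
    if sdam_mode not in VALID_SDAM_MODES:
        raise ValueError(f"invalid sdamMode: {sdam_mode!r}")
    if sdam_mode == "stream":
        return True
    if sdam_mode == "poll":
        return False
    # "auto": behave like "poll" on FaaS and like "stream" everywhere else.
    return not is_faas(env)
```

In an application this would typically be called as ``streaming_enabled(mode, os.environ)``; passing the environment explicitly keeps the sketch easy to test.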

Monitoring
''''''''''
@@ -203,7 +225,7 @@ Clients use the streaming protocol when supported
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a monitor discovers that the server supports the streamable hello or legacy hello
-command, it MUST use the `streaming protocol`_.
+command and the client does not have `streaming disabled`_, it MUST use the `streaming protocol`_.

Single-threaded monitoring
``````````````````````````
@@ -491,6 +513,22 @@ connected to a server that supports the awaitable hello or legacy hello commands
This protocol requires an extra thread and an extra socket for
each monitor to perform RTT calculations.

.. _streaming is disabled:

Streaming disabled
``````````````````

The streaming protocol MUST be disabled when either:

- the client is configured with sdamMode=poll, or
- the client is configured with sdamMode=auto and a FaaS platform is detected, or
- the server does not support streaming (e.g. MongoDB < 4.4).

When the streaming protocol is disabled the client MUST use the `polling protocol`_
and MUST NOT start an extra thread or connection for `Measuring RTT`_.

See `Why disable the streaming protocol on FaaS platforms like AWS Lambda?`_.
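
The rules in this section can be summarised by a small Python sketch. The ``MonitoringPlan`` type and ``plan_monitoring`` function are hypothetical names introduced for illustration; they only restate the three conditions listed above and their consequence for the RTT monitor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringPlan:
    """Hypothetical summary of how one server will be monitored."""

    protocol: str  # "streaming" or "polling"
    start_rtt_monitor: bool  # whether to use an extra thread and connection for RTT


def plan_monitoring(
    sdam_mode: str, faas_detected: bool, server_supports_streaming: bool
) -> MonitoringPlan:
    """Apply the "streaming disabled" conditions to pick a protocol."""
    streaming_disabled = (
        sdam_mode == "poll"
        or (sdam_mode == "auto" and faas_detected)
        or not server_supports_streaming
    )
    if streaming_disabled:
        # Polling measures RTT on the monitoring connection itself,
        # so no dedicated RTT thread or connection is started.
        return MonitoringPlan(protocol="polling", start_rtt_monitor=False)
    return MonitoringPlan(protocol="streaming", start_rtt_monitor=True)
```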

Streaming hello or legacy hello
```````````````````````````````

@@ -584,8 +622,8 @@ current monitoring connection. (See `Drivers cancel in-progress monitor checks`_
Polling Protocol
''''''''''''''''

-The polling protocol is used to monitor MongoDB <= 4.4 servers. The client
-`checks`_ a server with a hello or legacy hello command and then sleeps for
+The polling protocol is used to monitor MongoDB <= 4.4 servers or when `streaming is disabled`_.
+The client `checks`_ a server with a hello or legacy hello command and then sleeps for
heartbeatFrequencyMS before running another check.

Marking the connection pool as ready (CMAP only)
Expand Down Expand Up @@ -661,6 +699,12 @@ The event API here is assumed to be like the standard `Python Event
heartbeatFrequencyMS = heartbeatFrequencyMS
minHeartbeatFrequencyMS = 500
stableApi = stableApi
if sdamMode == "stream":
streamingEnabled = True
elif sdamMode == "poll":
streamingEnabled = False
else: # sdamMode == "auto"
streamingEnabled = not isFaas()
# Internal Monitor state:
connection = Null
@@ -671,8 +715,6 @@ The event API here is assumed to be like the standard `Python Event
         rttMonitor = RttMonitor(serverAddress, stableApi)
     def run():
-        # Start the RttMonitor.
-        rttMonitor.run()
         while this monitor is not stopped:
             previousDescription = description
             try:
@@ -700,7 +742,10 @@ The event API here is assumed to be like the standard `Python Event
             serverSupportsStreaming = description.type != Unknown and description.topologyVersion != Null
             connectionIsStreaming = connection != Null and connection.moreToCome
             transitionedWithNetworkError = isNetworkError(description.error) and previousDescription.type != Unknown
-            if serverSupportsStreaming or connectionIsStreaming or transitionedWithNetworkError:
+            if streamingEnabled and serverSupportsStreaming and not rttMonitor.started:
+                # Start the RttMonitor.
+                rttMonitor.run()
+            if (streamingEnabled and (serverSupportsStreaming or connectionIsStreaming)) or transitionedWithNetworkError:
                 continue
             wait()
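
The two conditions introduced in this hunk can be isolated as pure functions, which makes them easy to unit test. This is a sketch in Python; the function names are hypothetical and the parameters simply mirror the pseudocode variables.

```python
def should_start_rtt_monitor(
    streaming_enabled: bool,
    server_supports_streaming: bool,
    rtt_monitor_started: bool,
) -> bool:
    """The RTT monitor is started lazily, and only when streaming will be used."""
    return streaming_enabled and server_supports_streaming and not rtt_monitor_started


def should_check_immediately(
    streaming_enabled: bool,
    server_supports_streaming: bool,
    connection_is_streaming: bool,
    transitioned_with_network_error: bool,
) -> bool:
    """Decide whether the monitor skips wait() before the next check."""
    streaming = streaming_enabled and (
        server_supports_streaming or connection_is_streaming
    )
    return streaming or transitioned_with_network_error
```

With streaming disabled, a successful check always falls through to ``wait()``, which is what restores the heartbeatFrequencyMS polling cadence. A network error still triggers an immediate retry in either mode.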
Expand Down Expand Up @@ -733,13 +778,13 @@ The event API here is assumed to be like the standard `Python Event
response = connection.handshakeResponse
elif connection.moreToCome:
response = read next helloCommand exhaust response
elif previousDescription.topologyVersion:
elif streamingEnabled and previousDescription.topologyVersion:
# Initiate streaming hello or legacy hello
if connectTimeoutMS != 0:
set connection timeout to connectTimeoutMS+heartbeatFrequencyMS
response = call {helloCommand: 1, helloOk: True, topologyVersion: previousDescription.topologyVersion, maxAwaitTimeMS: heartbeatFrequencyMS}
else:
# The server does not support topologyVersion.
# The server does not support topologyVersion or streamingEnabled=False.
response = call {helloCommand: 1, helloOk: True}
# If the server supports hello, then response.helloOk will be true
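
The command selection changed in this hunk can be illustrated with a small Python sketch. ``build_hello_command`` is a hypothetical helper; the field names come from the pseudocode above, and the sketch only covers choosing between an awaitable and a plain hello, not connection setup or exhaust reads.

```python
from typing import Any, Dict, Optional


def build_hello_command(
    hello_command: str,
    streaming_enabled: bool,
    topology_version: Optional[Dict[str, Any]],
    heartbeat_frequency_ms: int,
) -> Dict[str, Any]:
    """Build the hello (or legacy hello) document for the next check."""
    command: Dict[str, Any] = {hello_command: 1, "helloOk": True}
    if streaming_enabled and topology_version:
        # Awaitable hello: the server replies when its topology changes
        # or when maxAwaitTimeMS elapses.
        command["topologyVersion"] = topology_version
        command["maxAwaitTimeMS"] = heartbeat_frequency_ms
    # Otherwise a plain hello is sent: either the server reported no
    # topologyVersion or streaming is disabled.
    return command
```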
Expand Down Expand Up @@ -1140,6 +1185,32 @@ the "awaited" field on server heartbeat events so that applications can
differentiate a slow heartbeat in the polling protocol from a normal
awaitable hello or legacy hello heartbeat in the new protocol.

Why disable the streaming protocol on FaaS platforms like AWS Lambda?
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The streaming protocol requires an extra connection and thread per monitored
server which is expensive on platforms like AWS Lambda. The extra connection
is particularly inefficient when thousands of AWS instances and thus
thousands of clients are used.

Additionally, the streaming protocol relies on the assumption that the client
can read the server's heartbeat responses in a timely manner, otherwise the
client will be acting on stale information. In many FaaS platforms, like AWS
Lambda, host applications will be suspended and resumed many minutes later.
This behavior causes a build up of heartbeat responses and the client can end
up spending a long time in a catch up phase processing outdated responses.
This problem was discovered in `DRIVERS-2246`_.

We decided to make polling the default behavior when running on FaaS platforms
like AWS Lambda to improve scalability, performance, and reliability.

Why introduce a knob for sdamMode?
''''''''''''''''''''''''''''''''''

The sdamMode knob provides a workaround in cases where the polling
protocol would be a better choice but the driver is not running on a FaaS
platform. It also provides a workaround in case the FaaS detection
logic becomes outdated or inaccurate.
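
As an illustration of the knob in use, the following Python sketch reads ``sdamMode`` from a connection string. ``sdam_mode_from_uri`` is a hypothetical helper, not a driver API. It assumes that option keys are matched case-insensitively, that values must match exactly, and that the first occurrence wins; a real driver would follow its own URI options handling.

```python
from urllib.parse import parse_qsl, urlsplit

VALID_SDAM_MODES = ("stream", "poll", "auto")


def sdam_mode_from_uri(uri: str, default: str = "auto") -> str:
    """Extract and validate the sdamMode option from a MongoDB URI."""
    query = urlsplit(uri).query
    for key, value in parse_qsl(query):
        # Assumption: option keys are case-insensitive, values are not.
        if key.lower() == "sdammode":
            if value not in VALID_SDAM_MODES:
                raise ValueError(f"invalid sdamMode: {value!r}")
            return value
    return default
```

For example, an application running on an undetected FaaS platform could append ``sdamMode=poll`` to its connection string to force the polling protocol.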

Changelog
---------
@@ -1159,6 +1230,7 @@ Changelog
:2022-04-05: Preemptively cancel in progress operations when SDAM heartbeats timeout.
:2022-10-05: Remove spec front matter reformat changelog.
:2022-11-17: Add minimum RTT tracking and remove 90th percentile RTT.
:2023-08-21: Add sdamMode and default to the Polling Protocol on FaaS.

----

@@ -1183,4 +1255,6 @@ Changelog
.. _Why synchronize clearing a server's pool with updating the topology?: server-discovery-and-monitoring.rst#why-synchronize-clearing-a-server-s-pool-with-updating-the-topology?
.. _Client Side Operations Timeout Spec: /source/client-side-operations-timeout/client-side-operations-timeout.rst
.. _timeoutMS: /source/client-side-operations-timeout/client-side-operations-timeout.rst#timeoutMS
-.. _Why does the pool need to support closing in use connections as part of its clear logic?: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#Why-does-the-pool-need-to-support-closing-in-use-connections-as-part-of-its-clear-logic?
+.. _Why does the pool need to support closing in use connections as part of its clear logic?: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#Why-does-the-pool-need-to-support-closing-in-use-connections-as-part-of-its-clear-logic?
.. _DRIVERS-2246: https://jira.mongodb.org/browse/DRIVERS-2246
.. _MongoDB Handshake spec: /source/mongodb-handshake/handshake.rst#client-env
129 changes: 129 additions & 0 deletions source/server-discovery-and-monitoring/tests/unified/sdamMode.json
@@ -0,0 +1,129 @@
{
"description": "sdamMode",
"schemaVersion": "1.3",
"tests": [
{
"description": "connect with sdamMode=auto",
"operations": [
{
"name": "createEntities",
"object": "testRunner",
"arguments": {
"entities": [
{
"client": {
"id": "client0",
"uriOptions": {
"sdamMode": "auto"
}
}
},
{
"database": {
"id": "dbSdamModeAuto",
"client": "client0",
"databaseName": "sdam-tests"
}
}
]
}
},
{
"name": "runCommand",
"object": "dbSdamModeAuto",
"arguments": {
"commandName": "ping",
"command": {
"ping": 1
}
},
"expectResult": {
"ok": 1
}
}
]
},
{
"description": "connect with sdamMode=stream",
"operations": [
{
"name": "createEntities",
"object": "testRunner",
"arguments": {
"entities": [
{
"client": {
"id": "client1",
"uriOptions": {
"sdamMode": "stream"
}
}
},
{
"database": {
"id": "dbSdamModeStream",
"client": "client1",
"databaseName": "sdam-tests"
}
}
]
}
},
{
"name": "runCommand",
"object": "dbSdamModeStream",
"arguments": {
"commandName": "ping",
"command": {
"ping": 1
}
},
"expectResult": {
"ok": 1
}
}
]
},
{
"description": "connect with sdamMode=poll",
"operations": [
{
"name": "createEntities",
"object": "testRunner",
"arguments": {
"entities": [
{
"client": {
"id": "client2",
"uriOptions": {
"sdamMode": "poll"
}
}
},
{
"database": {
"id": "dbSdamModePoll",
"client": "client2",
"databaseName": "sdam-tests"
}
}
]
}
},
{
"name": "runCommand",
"object": "dbSdamModePoll",
"arguments": {
"commandName": "ping",
"command": {
"ping": 1
}
},
"expectResult": {
"ok": 1
}
}
]
}
]
}
67 changes: 67 additions & 0 deletions source/server-discovery-and-monitoring/tests/unified/sdamMode.yml
@@ -0,0 +1,67 @@
description: sdamMode

schemaVersion: "1.3"

tests:
- description: "connect with sdamMode=auto"
operations:
- name: createEntities
object: testRunner
arguments:
entities:
- client:
id: &client0 client0
uriOptions:
sdamMode: "auto"
- database:
id: &dbSdamModeAuto dbSdamModeAuto
client: *client0
databaseName: sdam-tests
- name: runCommand
object: *dbSdamModeAuto
arguments:
commandName: ping
command: { ping: 1 }
expectResult: { ok: 1 }

- description: "connect with sdamMode=stream"
operations:
- name: createEntities
object: testRunner
arguments:
entities:
- client:
id: &client1 client1
uriOptions:
sdamMode: "stream"
- database:
id: &dbSdamModeStream dbSdamModeStream
client: *client1
databaseName: sdam-tests
- name: runCommand
object: *dbSdamModeStream
arguments:
commandName: ping
command: { ping: 1 }
expectResult: { ok: 1 }

- description: "connect with sdamMode=poll"
operations:
- name: createEntities
object: testRunner
arguments:
entities:
- client:
id: &client2 client2
uriOptions:
sdamMode: "poll"
- database:
id: &dbSdamModePoll dbSdamModePoll
client: *client2
databaseName: sdam-tests
- name: runCommand
object: *dbSdamModePoll
arguments:
commandName: ping
command: { ping: 1 }
expectResult: { ok: 1 }