Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PYTHON-4636 - Avoid blocking I/O calls in async code paths #1870

Merged
merged 18 commits into from
Oct 3, 2024

Conversation

NoahStapp
Copy link
Contributor

No description provided.

caseyclements
caseyclements previously approved these changes Sep 19, 2024
Copy link
Contributor

@caseyclements caseyclements left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ShaneHarvey
Copy link
Member

We should use the loop.recv_into approach in #1871 to avoid any blocking calls.

pymongo/network_layer.py Show resolved Hide resolved
finished = next(iter(result[0]))
next(iter(result[1])).cancel()
if finished == read_task:
return finished.result() # type: ignore[return-value]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we used read_task.result() would that avoid the type error?

Comment on lines 141 to 143
read += conn.recv_into(mv[read:])
if read == 0:
raise OSError("connection closed")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be something like:

                read = conn.recv_into(mv[read:])
                if read == 0:
                    raise OSError("connection closed")
                total_read += read

try:
read = conn.recv_into(mv[total_read:])
except BLOCKING_IO_ERRORS:
await asyncio.sleep(0.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RIP windows perf. Can you schedule some of the windows tasks to see the impact?

For anyone else following along we'll fix this when we migrate to asyncio streams in: https://jira.mongodb.org/browse/PYTHON-4493

Copy link
Contributor Author

@NoahStapp NoahStapp Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a new change, it's already in master:

await asyncio.sleep(0.5)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but the existing code was only for send() whereas this new code is for recv so the perf impact could be different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll schedule some Windows tasks.

@blink1073
Copy link
Member

The test_find_raw_retryable_reads error persists.

@NoahStapp
Copy link
Contributor Author

The test_find_raw_retryable_reads error persists.

Our theory is that #1875 will fix it, the error is caused by the retried command timing out on server selection. We think that's due to the condition lock deadlocking across background threads.

@blink1073
Copy link
Member

Cool, let's merge that one first then to verify.

@blink1073
Copy link
Member

I think we have to defer this PR, we're seeing failures in test.asynchronous.test_monitoring.AsyncTestCommandMonitoring across several variants:

 [2024/09/30 18:37:48.295] FAILURE: pymongo.errors.ServerSelectionTimeoutError: No primary available for writes, Timeout: 30s, Topology Description: <TopologyDescription id: 66fb353287d73424ce49c8c7, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('localhost:27017: timed out (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>, <ServerDescription ('localhost', 27018) server_type: RSSecondary, rtt: 0.006545153367957028>, <ServerDescription ('localhost', 27019) server_type: RSArbiter, rtt: 0.0075699916319154>]> ()

@NoahStapp
Copy link
Contributor Author

I think we have to defer this PR, we're seeing failures in test.asynchronous.test_monitoring.AsyncTestCommandMonitoring across several variants:

 [2024/09/30 18:37:48.295] FAILURE: pymongo.errors.ServerSelectionTimeoutError: No primary available for writes, Timeout: 30s, Topology Description: <TopologyDescription id: 66fb353287d73424ce49c8c7, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('localhost:27017: timed out (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>, <ServerDescription ('localhost', 27018) server_type: RSSecondary, rtt: 0.006545153367957028>, <ServerDescription ('localhost', 27019) server_type: RSArbiter, rtt: 0.0075699916319154>]> ()

All tests passing after merging master in.

@ShaneHarvey
Copy link
Member

Looks like there are still some test failures. I just scheduled some Windows and macOS tasks as well.

@NoahStapp
Copy link
Contributor Author

Looks like there are still some test failures. I just scheduled some Windows and macOS tasks as well.

Yeah they seem to be flakey, none consistently fail so far.

@blink1073
Copy link
Member

I fixed PYTHON-4798 which will make MacOS and Windows run again in PRs.

@blink1073
Copy link
Member

test.asynchronous.test_client.TestClient.test_socket_timeout failure looks relevant: FAILURE: OSError: [WinError 10038] An operation was attempted on something that is not a socket ().

@ShaneHarvey
Copy link
Member

Yeah they seem to be flakey, none consistently fail so far.

Sure any one task may be flaky at a low probability but looking at the whole patch about 10% of tasks are failing. That's something we need to fix.

Also on Windows test_network_error_message seems to block forever.

@NoahStapp
Copy link
Contributor Author

Yeah they seem to be flakey, none consistently fail so far.

Sure any one task may be flaky at a low probability but looking at the whole patch about 10% of tasks are failing. That's something we need to fix.

Also on Windows test_network_error_message seems to block forever.

Agreed that we need to fix them, I meant that there wasn't a consistent pattern in which tests fail and how often.

self._run()
else:
asyncio.run(self._run())
self._run()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was this test failing in async?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was trying to call asyncio.run while in an active loop since it's an async test, which isn't supported.

@NoahStapp NoahStapp merged commit b111cbf into mongodb:master Oct 3, 2024
35 of 36 checks passed
@NoahStapp NoahStapp deleted the PYTHON-4636 branch October 3, 2024 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants