fix: Release GIL during server.stop() to allow request release callbacks to complete #381

rmccorm4 · 2024-07-17T01:50:53Z

What does the PR do?

This PR fixes an issue where server.stop() in the L0_python_api::test_api::test_stop() unit test would intermittently fail after waiting the full server exit timeout for all "live models" to be unloaded. The "live models" were not being unloaded because the relevant request object was not destructed before server.stop() was called. The Triton C++ request object holds a reference to a Triton Model object, so while the request is alive the model cannot be destructed and unloaded, which prevents the server from shutting down gracefully.

The root cause was that the final reference to the request object is dropped by the request release callback inside the Python core bindings, and this callback needs to acquire the Python GIL. If server.stop() ran first while holding the GIL, the request release callback (and request destruction) was blocked for the full exit timeout, until server.stop() returned or raised. In turn, the server.stop() call was blocked for the full exit timeout waiting for the request (and model) to be destructed.

The solution in this PR is to release the GIL internally while making the call to TRITONSERVER_ServerStop. This allows the PyTritonRequestReleaseCallback to acquire the GIL, proceed, release the final reference to the request, and destroy the request and model, so the server can shut down gracefully.
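The interaction described above can be modeled in pure Python. This is only an illustrative sketch, not the code from this PR: a `threading.Lock` stands in for the GIL, a worker thread stands in for the request release callback, and `simulate_stop`, `release_gil`, and `exit_timeout` are hypothetical names made up for the example. In pybind11 bindings, releasing the GIL around a blocking C call is commonly done with `py::gil_scoped_release`; the exact mechanism used in the diff is not shown here.

```python
import threading


def simulate_stop(release_gil: bool, exit_timeout: float = 0.5) -> bool:
    """Model server.stop() racing with a request release callback.

    Returns True if the "server" stopped gracefully, False if the exit
    timeout expired first. All names here are hypothetical stand-ins.
    """
    gil = threading.Lock()                 # stand-in for the Python GIL
    request_released = threading.Event()   # set once the request is destroyed
    stop_holds_gil = threading.Event()     # forces the problematic ordering

    def release_callback() -> None:
        # Wait until stop() holds the lock, so the race always goes the bad way.
        stop_holds_gil.wait()
        with gil:                          # the callback needs the "GIL"
            request_released.set()         # final reference dropped

    worker = threading.Thread(target=release_callback)
    worker.start()

    gil.acquire()                          # stop() is entered holding the "GIL"
    stop_holds_gil.set()
    if release_gil:
        # Fixed behavior: drop the lock while waiting for shutdown.
        gil.release()
        graceful = request_released.wait(timeout=exit_timeout)
        gil.acquire()
    else:
        # Old behavior: wait for shutdown while still holding the lock.
        graceful = request_released.wait(timeout=exit_timeout)
    gil.release()
    worker.join()
    return graceful


if __name__ == "__main__":
    print(simulate_stop(release_gil=False))  # False: exit timeout expires
    print(simulate_stop(release_gil=True))   # True: callback runs, clean stop
```

With `release_gil=False` the callback cannot make progress until the timeout has already expired, which mirrors the intermittent failure in test_stop. With `release_gil=True` the callback completes almost immediately and the wait returns early.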

NOTE: See the Caveats below.

Checklist

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

Related PRs:

N/A

Where should the reviewer start?

Test plan:

L0_python_api
CI Pipeline ID: 16660542

Caveats:

There is a pre-existing issue with the current bindings and test: if you omit the server.stop() call and let the server object simply go out of scope, it may hit the same issue that this PR fixes for the explicit call to server.stop(). This is because the C++ implementation of TRITONSERVER_ServerDelete also calls server->Stop().

This leads to an "Exit timeout expired" message being printed to STDOUT for some tests when running the tests with pytest -s -v test_api.py, for example:

$ pytest -v -s test_api.py
...
test_api.py::ModelTests::test_create_request PASSED
test_api.py::AllocatorTests::test_allocate_on_cpu_and_reshape SKIPPED (Skipping test, torch not installed)
test_api.py::AllocatorTests::test_allocate_on_gpu_and_reshape SKIPPED (Skipping test, torch not installed)
test_api.py::AllocatorTests::test_memory_allocator_exception PASSED
test_api.py::AllocatorTests::test_memory_fallback_to_cpu PASSED Exit timeout expired. Exiting immediately.  <---
test_api.py::AllocatorTests::test_unsupported_memory_type PASSED
test_api.py::TensorTests::test_cpu_to_gpu PASSED
test_api.py::TensorTests::test_gpu_tensor_from_dl_pack SKIPPED (Skipping gpu memory, torch not installed)
test_api.py::TensorTests::test_tensor_from_numpy SKIPPED (Skipping test, torch not installed)
test_api.py::ServerTests::test_invalid_option_type PASSED
test_api.py::ServerTests::test_invalid_repo PASSED
test_api.py::ServerTests::test_model_repository_not_specified PASSED
test_api.py::ServerTests::test_not_started PASSED
test_api.py::ServerTests::test_ready PASSED
test_api.py::ServerTests::test_stop PASSED
test_api.py::InferenceTests::test_basic_inference PASSED Exit timeout expired. Exiting immediately.  <---
test_api.py::InferenceTests::test_gpu_output PASSED
test_api.py::InferenceTests::test_parameters PASSED Exit timeout expired. Exiting immediately.  <---

This issue shouldn't be ignored, and it may have a solution similar to the one in this PR. However, a couple of naive attempts to apply the same fix there caused crashes/segfaults, so it will require further investigation. Since this was already broken beforehand, I'd like to merge this fix first to reduce flakiness in CI and investigate the follow-up separately.

Background

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

krishung5

Nice find!

python/tritonserver/_c/tritonserver_pybind.cc

rmccorm4 added 3 commits July 16, 2024 18:41

Release GIL during server.stop() to allow request release callbacks to complete during shutdown

41ac76b

Remove xfail from test_stop now that is passes

7c2cb10

Update copyright

ed4fd98

rmccorm4 requested review from nnshah1, GuanLuo and krishung5 July 17, 2024 19:47

GuanLuo approved these changes Jul 17, 2024

Tabrizian approved these changes Jul 17, 2024

krishung5 approved these changes Jul 17, 2024

rmccorm4 merged commit d2abb8b into main Jul 17, 2024
1 check passed

rmccorm4 deleted the rmccormick-fix-L0_python_api branch July 17, 2024 22:57

nnshah1 reviewed Jul 19, 2024

python/tritonserver/_c/tritonserver_pybind.cc

rmccorm4 mentioned this pull request Jul 24, 2024

test: Improve stability of server shutdown test in L0_python_api::test_api::test_stop #383

Merged
