Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Release GIL during server.stop() to allow request release callbacks to complete #381

Merged
merged 3 commits into from
Jul 17, 2024

Conversation

rmccorm4
Copy link
Contributor

@rmccorm4 rmccorm4 commented Jul 17, 2024

What does the PR do?

This PR fixes an issue where server.stop() in the L0_python_api::test_api::test_stop() unit test would intermittently fail waiting the full server exit timeout, waiting for all "live models" to be unloaded. However, the "live models" were not getting unloaded because the relevant request object was not getting destructed before server.stop(). The Triton C++ request object holds a reference to a Triton Model object, preventing the model from getting destructed and unloaded, thus preventing the server from shutting down gracefully.

The root cause was that the final reference to the request object would be decremented by the request release callback internally in the python core bindings - but this callback was trying to acquire the Python GIL. If server.stop() was executed first and acquired the GIL, the request release callback (and request destruction) would be blocked for the full exit timeout until server.stop() returns/raises. Similarly, the server.stop() call would be blocked waiting for the request (and model) to be destructed for the full exit timeout.

The solution in this PR is to release the GIL internally while making the call to TRITONSERVER_ServerStop, which allows the PyTritonRequestReleaseCallback to acquire the GIL, proceed, release the final reference to the request, destroy the request and model, and allow the server to gracefully shutdown.

NOTE: See the Caveats below.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

N/A

Where should the reviewer start?

Test plan:

  • L0_python_api
  • CI Pipeline ID: 16660542

Caveats:

There was a pre-existing issue with the current bindings and test, where if you omit the server.stop() call and let the server object simply go out of scope, it may run into the same issue this PR fixes with the manual call to server.stop(). This is because the C++ implementation of TRITONSERVER_ServerDelete also calls server->Stop().

This leads to a "Exit timeout expired" message being printed to STDOUT for some tests when run the test with pytest -s -v test_api.py, for example:

$ pytest -v -s test_api.py
...
test_api.py::ModelTests::test_create_request PASSED
test_api.py::AllocatorTests::test_allocate_on_cpu_and_reshape SKIPPED (Skipping test, torch not installed)
test_api.py::AllocatorTests::test_allocate_on_gpu_and_reshape SKIPPED (Skipping test, torch not installed)
test_api.py::AllocatorTests::test_memory_allocator_exception PASSED
test_api.py::AllocatorTests::test_memory_fallback_to_cpu PASSED Exit timeout expired. Exiting immediately.  <---
test_api.py::AllocatorTests::test_unsupported_memory_type PASSED
test_api.py::TensorTests::test_cpu_to_gpu PASSED
test_api.py::TensorTests::test_gpu_tensor_from_dl_pack SKIPPED (Skipping gpu memory, torch not installed)
test_api.py::TensorTests::test_tensor_from_numpy SKIPPED (Skipping test, torch not installed)
test_api.py::ServerTests::test_invalid_option_type PASSED
test_api.py::ServerTests::test_invalid_repo PASSED
test_api.py::ServerTests::test_model_repository_not_specified PASSED
test_api.py::ServerTests::test_not_started PASSED
test_api.py::ServerTests::test_ready PASSED
test_api.py::ServerTests::test_stop PASSED
test_api.py::InferenceTests::test_basic_inference PASSED Exit timeout expired. Exiting immediately.  <---
test_api.py::InferenceTests::test_gpu_output PASSED
test_api.py::InferenceTests::test_parameters PASSED Exit timeout expired. Exiting immediately.  <---

This issue shouldn't be ignored, and may have a similar solution to this PR. However, a couple naive attempts to apply the same fix to this issue caused some crashes/segfaults, so it will require further investigation to fix and this was already broken beforehand - so I'd like to merge this fix in first to reduce flakiness in CI and investigate the follow-up separately.

Background

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Copy link
Contributor

@krishung5 krishung5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find!

@rmccorm4 rmccorm4 merged commit d2abb8b into main Jul 17, 2024
1 check passed
@rmccorm4 rmccorm4 deleted the rmccormick-fix-L0_python_api branch July 17, 2024 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants