
test: Relax flaky comparison in AsyncWorkQueue parallel test #379

Merged: 4 commits merged into main from rmccormick-awq-flaky on Jul 15, 2024

Conversation

@rmccorm4 (Contributor) commented on Jul 12, 2024

This test has had a FIXME for a long time:

    // FIXME manual testing shows parallelized time is between 30% to 33.3%
    // for 128 M total elements

In CI, this comparison is flaky and has failed roughly 15-20% of the time over the past 7 months:

Status Summary [2023-12-08 - 2024-07-12]: success=425, failed=77, success_rate=84.66%

For the time being, I think it would improve pipeline health to relax this condition to expect any speedup from parallelism whatsoever, and revisit the test later; the justification for revisiting is below.

Alternatives:

  • We could relax this test to a threshold we're more confident in, such as a 1.5-2x speedup instead of 3x. However, that would require some due diligence to find a consistently reliable threshold, so for the sake of time/priority I went with checking for any speedup whatsoever (see the sketch after this list).
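In gtest terms, the relaxation amounts to something like the following. This is a minimal before/after sketch, not the exact test source; the duration variables are hypothetical and assumed to be comparable numeric timings:

    // Before: require a ~3x speedup from parallel copies, which is flaky
    // under variable CI load.
    EXPECT_LE(parallel_copy_seconds, serial_copy_seconds / 3.0);

    // After: only require that the parallel path is faster at all.
    EXPECT_LT(parallel_copy_seconds, serial_copy_seconds);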

Justifications

In practice, the AsyncWorkQueue is initialized with the --buffer-manager-thread-count value (sketched below). When used via the tritonserver binary, however, that value doesn't appear to propagate correctly to the backends that use the AsyncWorkQueue, which led to L0_parallel_copy (the e2e test of AsyncWorkQueue) being disabled in the past. This backs up the reasoning that it is OK to relax this unit test's strictness: the underlying feature isn't being utilized correctly in Triton/backends today.
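For context, the intended wiring looks roughly like this. It is a sketch based on the public in-process C API in tritonserver.h, with error handling omitted; this option is the knob the CLI flag maps to:

    #include "tritonserver.h"

    void ConfigureBufferManagerThreads() {
      TRITONSERVER_ServerOptions* opts = nullptr;
      TRITONSERVER_ServerOptionsNew(&opts);

      // Equivalent of `--buffer-manager-thread-count 4` on the CLI. The core
      // is then expected to initialize the shared AsyncWorkQueue thread pool
      // with this count so backends can accelerate input copies.
      TRITONSERVER_ServerOptionsSetBufferManagerThreadCount(opts, 4);
    }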

This test will become more important once the e2e functionality works as expected. That would involve taking a closer look at:

  • re-enabling L0_parallel_copy
  • investigating the issue with the AsyncWorkQueue "singleton" appearing to refer to different objects in core vs. backend

Some background on the flaw in AsyncWorkQueue today: the core sees the singleton at address 0x7061331fe1d0, correctly initialized with a thread pool, while the backend sees it at address 0x706126096610, uninitialized with an empty thread pool. The "singleton" therefore doesn't appear to behave as expected:

    $ tritonserver --model-store models --buffer-manager-thread-count 4
    ...
    [DEBUG][common::AsyncWorkQueue] Successfully initialized AsyncWorkQueue thread pool with worker_count=4
    [DEBUG][common::AsyncWorkQueue] singleton address called from core: 0x7061331fe1d0
    ...
    # When checking triton::common::AsyncWorkQueue::WorkerCount() in `backend_input_collector.cc`,
    # which uses AsyncWorkQueue to accelerate copies in backends:
    [DEBUG][common::AsyncWorkQueue] singleton address called from WorkerCount(): 0x706126096610
    [DEBUG][common::AsyncWorkQueue] singleton thread_pool is null, so return WorkerCount() = 0 (zero)
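This kind of duplication typically happens when the code defining the singleton is statically linked into more than one link unit, e.g. into both the tritonserver binary and a backend .so: each unit then carries its own copy of the static storage. A minimal illustration of the pitfall, using hypothetical names rather than the actual Triton sources:

    // async_queue.h: hypothetical stand-in for the real AsyncWorkQueue.
    #pragma once
    #include <cstddef>

    class AsyncQueue {
     public:
      // Meyers singleton: `instance` exists once *per link unit* that
      // contains this function. If this code is statically linked into
      // both the server binary and a backend .so (without the symbol
      // being shared/exported), each side gets a distinct instance at a
      // distinct address, matching the two addresses in the log above.
      static AsyncQueue& Get() {
        static AsyncQueue instance;
        return instance;
      }

      void Initialize(std::size_t worker_count) { worker_count_ = worker_count; }
      std::size_t WorkerCount() const { return worker_count_; }

     private:
      std::size_t worker_count_ = 0;
    };

The usual remedies are to build the common code as a single shared library that both sides link against, or to hand the initialized instance across the core/backend boundary explicitly.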

@rmccorm4 requested review from fpetrini15 and GuanLuo on July 12, 2024 21:49
@rmccorm4 changed the title from "Disable flaky AsyncWorkQueue parallel test" to "Relax flaky comparison in AsyncWorkQueue parallel test" on Jul 12, 2024
@rmccorm4 changed the title from "Relax flaky comparison in AsyncWorkQueue parallel test" to "test: Relax flaky comparison in AsyncWorkQueue parallel test" on Jul 12, 2024
@fpetrini15 (Contributor) left a comment

Synced offline with Ryan regarding the changes. Logic is reasonable, so I'll mark this as approved! Thanks for fixing 🚀

@rmccorm4 merged commit 8153c2e into main on Jul 15, 2024
1 check passed
@rmccorm4 deleted the rmccormick-awq-flaky branch on July 15, 2024 16:58