
test: Relax flaky comparison in AsyncWorkQueue parallel test #379

Merged: 4 commits merged into main from rmccormick-awq-flaky on Jul 15, 2024

Conversation

@rmccorm4 (Contributor) commented on Jul 12, 2024

This test has had a FIXME for a long time:

    // FIXME manual testing shows parallelized time is between 30% to 33.3%
    // for 128 M total elements

In CI, this comparison is flaky and has failed roughly 15-20% of the time over the past 7 months:

Status Summary [2023-12-08 - 2024-07-12]: success=425, failed=77, success_rate=84.66%

For the time being, I think it would improve pipeline health to relax this condition to expect any speedup from parallelism whatsoever, and revisit the test later; the justification for revisiting is below.

Alternatives:

  • We could relax this test to a threshold we're more confident in, such as a 1.5-2x speedup instead of 3x. However, that would require some due diligence to find a consistently reliable threshold, so for the sake of time/priority I went with checking for any speedup whatsoever (see the sketch after this list).
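In gtest terms, the relaxation amounts to something like the following. This is a minimal before/after sketch, not the exact test source; the duration variables are hypothetical and assumed to be comparable numeric timings:

    // Before: require a ~3x speedup from parallel copies, which is flaky
    // under variable CI load.
    EXPECT_LE(parallel_copy_seconds, serial_copy_seconds / 3.0);

    // After: only require that the parallel path is faster at all.
    EXPECT_LT(parallel_copy_seconds, serial_copy_seconds);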

Justifications

In practice, the AsyncWorkQueue is initialized with the --buffer-manager-thread-count value (sketched below). When used via the tritonserver binary, however, that value doesn't appear to propagate correctly to the backends that use the AsyncWorkQueue, which led to L0_parallel_copy (the e2e test of AsyncWorkQueue) being disabled in the past. This backs up the reasoning that it is OK to relax this unit test's strictness: the underlying feature isn't being utilized correctly in Triton/backends today.
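For context, the intended wiring looks roughly like this. It is a sketch based on the public in-process C API in tritonserver.h, with error handling omitted; this option is the knob the CLI flag maps to:

    #include "tritonserver.h"

    void ConfigureBufferManagerThreads() {
      TRITONSERVER_ServerOptions* opts = nullptr;
      TRITONSERVER_ServerOptionsNew(&opts);

      // Equivalent of `--buffer-manager-thread-count 4` on the CLI. The core
      // is then expected to initialize the shared AsyncWorkQueue thread pool
      // with this count so backends can accelerate input copies.
      TRITONSERVER_ServerOptionsSetBufferManagerThreadCount(opts, 4);
    }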

This test will become more important once the e2e functionality works as expected. That would involve taking a closer look at:

  • re-enabling L0_parallel_copy
  • investigating the issue with the AsyncWorkQueue "singleton" appearing to refer to different objects in core vs. backend

Some background on the flaw in AsyncWorkQueue today: the core sees the singleton at address 0x7061331fe1d0, correctly initialized with a thread pool, while the backend sees it at address 0x706126096610, uninitialized with an empty thread pool. The "singleton" therefore doesn't appear to behave as expected:

    $ tritonserver --model-store models --buffer-manager-thread-count 4
    ...
    [DEBUG][common::AsyncWorkQueue] Successfully initialized AsyncWorkQueue thread pool with worker_count=4
    [DEBUG][common::AsyncWorkQueue] singleton address called from core: 0x7061331fe1d0
    ...
    # When checking triton::common::AsyncWorkQueue::WorkerCount() in `backend_input_collector.cc`,
    # which uses AsyncWorkQueue to accelerate copies in backends:
    [DEBUG][common::AsyncWorkQueue] singleton address called from WorkerCount(): 0x706126096610
    [DEBUG][common::AsyncWorkQueue] singleton thread_pool is null, so return WorkerCount() = 0 (zero)
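This kind of duplication typically happens when the code defining the singleton is statically linked into more than one link unit, e.g. into both the tritonserver binary and a backend .so: each unit then carries its own copy of the static storage. A minimal illustration of the pitfall, using hypothetical names rather than the actual Triton sources:

    // async_queue.h: hypothetical stand-in for the real AsyncWorkQueue.
    #pragma once
    #include <cstddef>

    class AsyncQueue {
     public:
      // Meyers singleton: `instance` exists once *per link unit* that
      // contains this function. If this code is statically linked into
      // both the server binary and a backend .so (without the symbol
      // being shared/exported), each side gets a distinct instance at a
      // distinct address, matching the two addresses in the log above.
      static AsyncQueue& Get() {
        static AsyncQueue instance;
        return instance;
      }

      void Initialize(std::size_t worker_count) { worker_count_ = worker_count; }
      std::size_t WorkerCount() const { return worker_count_; }

     private:
      std::size_t worker_count_ = 0;
    };

The usual remedies are to build the common code as a single shared library that both sides link against, or to hand the initialized instance across the core/backend boundary explicitly.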

@rmccorm4 requested review from fpetrini15 and GuanLuo on July 12, 2024 21:49
@rmccorm4 changed the title from "Disable flaky AsyncWorkQueue parallel test" to "Relax flaky comparison in AsyncWorkQueue parallel test" on Jul 12, 2024
@rmccorm4 changed the title from "Relax flaky comparison in AsyncWorkQueue parallel test" to "test: Relax flaky comparison in AsyncWorkQueue parallel test" on Jul 12, 2024
@fpetrini15 (Contributor) left a comment

Synced offline with Ryan regarding the changes. Logic is reasonable, so I'll mark this as approved! Thanks for fixing 🚀

@rmccorm4 merged commit 8153c2e into main on Jul 15, 2024
1 check passed
@rmccorm4 deleted the rmccormick-awq-flaky branch on July 15, 2024 16:58