
perf: Upgrade vLLM version to 0.6.3.post1 #76

Merged: 7 commits merged into main from jacky-vllm-0.6.3.post1 on Dec 20, 2024

Conversation

@kthui (Contributor) commented on Dec 6, 2024

What does the PR do?

  • Enable the use of ZMQ on vLLM.
  • Refactor how the vLLM engine is started.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated GitHub labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

triton-inference-server/server#7858

Where should the reviewer start?

N/A

Test plan:

This is a performance improvement, so any issues should be covered by existing test cases.

  • CI Pipeline ID: 21162737

Caveats:

  • The ZMQ performance improvement is not applicable when metrics are enabled.
  • The backend may not shut down cleanly when metrics are enabled.

Background

vLLM >= 0.6.x delivers a performance improvement through its ZMQ-based multiprocess frontend, so the vLLM backend can take advantage of it by enabling ZMQ as well.
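
For context, a minimal sketch of how such a toggle can look with vLLM 0.6.3.post1's build_async_engine_client_from_engine_args context manager. The run_engine wrapper here is illustrative only, not this PR's actual code; the real wiring is visible in the diff excerpts further down.

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.entrypoints.openai.api_server import (
        build_async_engine_client_from_engine_args,
    )

    async def run_engine(engine_args: AsyncEngineArgs, enable_metrics: bool):
        # When metrics are enabled, frontend multiprocessing (the ZMQ path) is
        # disabled because the metrics hooks need direct access to the
        # in-process engine; otherwise the faster ZMQ client is used.
        async with build_async_engine_client_from_engine_args(
            engine_args=engine_args,
            disable_frontend_multiprocessing=enable_metrics,
        ) as engine_client:
            ...  # serve requests through engine_client until shutdown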

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

@kthui added the PR: perf (A code change that improves performance) label on Dec 6, 2024
@kthui marked this pull request as ready for review on December 6, 2024 18:46
        # statement.
        async with build_async_engine_client_from_engine_args(
            engine_args=self._aync_engine_args,
            disable_frontend_multiprocessing=self._enable_metrics,
Collaborator:

This one is a bit confusing. How is disable_frontend_multiprocessing related to whether we want to enable metrics or not?

@kthui (Contributor Author), Dec 6, 2024:

Yes, because the engine interface that the metrics were relying on is no longer exposed over the ZMQ multi-process path, so ZMQ has to be disabled when metrics are enabled.

I think we will need to revisit this soon. I will create a thread offline to discuss the options.

src/model.py Outdated
Comment on lines 343 to 365
    def _setup_metrics(self):
        self._vllm_metrics = None
        # TODO: Do not read metrics directly from the vLLM engine, read from prometheus
        #       client to allow the use of ZMQ process when metrics are enabled. See
        #       https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/entrypoints/openai/api_server.py#L222-L245
        if self._enable_metrics:
            try:
                labels = {
                    "model": self.args["model_name"],
                    "version": self.args["model_version"],
                }
                # Add vLLM custom metrics
                engine_config = self._llm_engine.engine.model_config
                self._vllm_metrics = VllmStatLogger(
                    labels, engine_config.max_model_len, self.logger
                )
                self._llm_engine.add_logger("triton", self._vllm_metrics)
            except pb_utils.TritonModelException as e:
                if "metrics not supported" in str(e):
                    # Metrics are disabled at the server
                    self.logger.log_info("[vllm] Metrics not supported")
                else:
                    raise e
Collaborator:

Can we do some refactoring of this logic, please?

  1. Let's call _setup_metrics prior to _init_engine.
  2. Let's initialize self._enable_metrics in _setup_metrics.
  3. Do something like:
    def _setup_metrics(self):
        self._vllm_metrics = None

        self._enable_metrics = (
            self._get_bool_config_param("REPORT_CUSTOM_METRICS")
            and not self._aync_engine_args.disable_log_stats
        )
        if not self._enable_metrics:
            return
        # TODO: Do not read metrics directly from the vLLM engine, read from prometheus
        #       client to allow the use of ZMQ process when metrics are enabled. See
        #       https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/entrypoints/openai/api_server.py#L222-L245
        try:
            labels = {
                "model": self.args["model_name"],
                "version": self.args["model_version"],
            }
            # Add vLLM custom metrics
            engine_config = self._llm_engine.engine.model_config
            self._vllm_metrics = VllmStatLogger(
                labels, engine_config.max_model_len, self.logger
            )
            self._llm_engine.add_logger("triton", self._vllm_metrics)
        except pb_utils.TritonModelException as e:
            if "metrics not supported" in str(e):
                # Metrics are disabled at the server
                self.logger.log_info("[vllm] Metrics not supported")
            else:
                raise e

Collaborator:

This way all metric-related stuff will be under the same function, and we'll somewhat follow the single-responsibility idea.

Collaborator:

Or do we need the engine set up before we can set up metrics?

Collaborator:

Discussed offline; I'll take care of the refactor in a follow-up PR.
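
For reference, a rough sketch of the direction the TODO above points at: reading vLLM's exported metrics from the process-wide Prometheus registry instead of reaching into the engine object. Nothing here is code from this PR; the helper name and the example metric name are assumptions.

    from prometheus_client import REGISTRY

    def read_vllm_metric_total(name: str, labels: dict) -> float:
        # Hypothetical helper: scan the default Prometheus registry for a
        # vLLM-exported metric family and sum the samples whose labels match.
        total = 0.0
        for family in REGISTRY.collect():
            if family.name != name:
                continue
            for sample in family.samples:
                if all(sample.labels.get(k) == v for k, v in labels.items()):
                    total += sample.value
        return total

    # Example (metric name assumed from vLLM's usual "vllm:" prefix):
    # read_vllm_metric_total("vllm:generation_tokens", {"model_name": "my_model"})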

src/model.py Outdated
Comment on lines 243 to 245
        self._enable_metrics = (
            self._get_bool_config_param("REPORT_CUSTOM_METRICS")
            and not self._aync_engine_args.disable_log_stats
Collaborator:

With the refactoring suggested here, https://github.com/triton-inference-server/vllm_backend/pull/76/files#r1874032158, we can remove it.

@kthui (Contributor Author), Dec 7, 2024:

The metrics-enabled check also depends on the engine args, so I added this: Fix engine args dependency issue.

Confirmed this version passed L0_backend_vllm.

            self._llm_engine_shutdown_event.is_set() is False
        ), "Cannot create tasks after shutdown has been requested"
        coro = self._generate(request)
        asyncio.run_coroutine_threadsafe(coro, self._event_loop)
Collaborator:

Adding a note to future self: take care of the returned future object.
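
A minimal sketch of what taking care of that future could look like: keep a reference and surface exceptions from asyncio.run_coroutine_threadsafe instead of dropping the returned future. The submit_generate helper, the pending set, and the logger.log_error call are assumptions for illustration, not code from this PR.

    import asyncio
    from concurrent.futures import Future

    def submit_generate(coro, event_loop: asyncio.AbstractEventLoop,
                        pending: set, logger) -> Future:
        # Submit the coroutine to the engine's event loop and keep the returned
        # future so failures are logged rather than silently discarded.
        future = asyncio.run_coroutine_threadsafe(coro, event_loop)
        pending.add(future)

        def _on_done(fut: Future) -> None:
            pending.discard(fut)
            if not fut.cancelled() and fut.exception() is not None:
                logger.log_error(f"[vllm] generate task failed: {fut.exception()}")

        future.add_done_callback(_on_done)
        return future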

@kthui requested a review from oandreeva-nv on December 7, 2024 00:27
oandreeva-nv previously approved these changes on Dec 7, 2024
@oandreeva-nv (Collaborator) left a comment:

LGTM with the follow-up refactor.

@@ -170,6 +170,7 @@ def test_vllm_metrics(self):
            total_prompts,
        )

        # TODO: Revisit this test due to the removal of best_of
Collaborator:

Could you please clarify the revisit part? In case this is assigned to another engineer, what are the steps needed for this revision?

Contributor Author:

Yes, there are two things to check:

  1. What is the goal of this test?
  2. Why did the request_params_n_sum assert fail? Is it expected with the vLLM 0.6.3.post1 update?

Basically, if we are updating or deleting the request_params_n_sum assert, we need to know why it is safe to do so.

oandreeva-nv previously approved these changes on Dec 9, 2024
@oandreeva-nv (Collaborator) left a comment:

lgtm

rmccorm4 previously approved these changes on Dec 18, 2024
@rmccorm4 (Contributor) left a comment:

LGTM. Acknowledging that the known issues with metrics and shutdown will be follow-up fixes, as described in the Caveats section.

Comment on lines +231 to +233
        self._llm_engine_shutdown_event = asyncio.Event()
        self._event_thread = threading.Thread(
            target=asyncio.run, args=(self._run_llm_engine(),)
Collaborator:

I found out why metrics were not shutting down properly: it seems self._llm_engine_shutdown_event, being an asyncio event, was accessed from a different thread, which was causing issues. I managed to run into this with metrics OFF on the ZMQ route as well.

Will include a fix in a follow-up PR.
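
For illustration, a minimal sketch (an assumption about the shape of such a fix, not the follow-up PR itself) of signaling an asyncio.Event owned by the engine's event loop from another thread. asyncio primitives are not thread-safe, so the set() call has to be scheduled onto the loop that owns the event:

    import asyncio

    def request_shutdown(event_loop: asyncio.AbstractEventLoop,
                         shutdown_event: asyncio.Event) -> None:
        # Safe from any thread: hop onto the owning loop before touching the event.
        event_loop.call_soon_threadsafe(shutdown_event.set)

    # Calling shutdown_event.set() directly from another thread may never wake
    # coroutines awaiting the event, which matches the shutdown hang described above.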

@rmccorm4 (Contributor), Dec 20, 2024:

"an asyncio event was accessed from different thread"

This sounds a bit similar to the errors we're seeing on the L0_model_control_stress_*_vllm tests for r24.12, in case your changes or investigation end up being helpful there.

Collaborator:

Seems to be unrelated 😭 Pipeline: 21686181, those are still red.

Contributor Author:

Those are not related; they are due to the Python AsyncIO client.

Contributor:

The same debugging approach may help on the client side, though.

@oandreeva-nv dismissed stale reviews from rmccorm4 and themself via 7348662 on December 20, 2024 23:11
@oandreeva-nv (Collaborator) left a comment:

LGTM!

@kthui merged commit 2f5bfbd into main on Dec 20, 2024
3 checks passed
@kthui deleted the jacky-vllm-0.6.3.post1 branch on December 20, 2024 23:15