
perf: Upgrade vLLM version to 0.6.3.post1 #76

Merged: 7 commits merged into main from jacky-vllm-0.6.3.post1 on Dec 20, 2024

Conversation

@kthui (Contributor) commented on Dec 6, 2024

What does the PR do?

  • Enable the use of ZMQ on vLLM.
  • Refactor how the vLLM engine is started.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated GitHub labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

triton-inference-server/server#7858

Where should the reviewer start?

N/A

Test plan:

This is a performance improvement, so any issues should be covered by existing test cases.

  • CI Pipeline ID: 21162737

Caveats:

  • The ZMQ performance improvement is not applicable when metrics are enabled.
  • The backend may not shut down cleanly when metrics are enabled.

Background

vLLM >= 0.6.x delivers a performance improvement through its ZMQ-based multiprocess frontend, so the vLLM backend can take advantage of it by enabling ZMQ as well.
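
For context, a minimal sketch of how such a toggle can look with vLLM 0.6.3.post1's build_async_engine_client_from_engine_args context manager. The run_engine wrapper here is illustrative only, not this PR's actual code; the real wiring is visible in the diff excerpts further down.

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.entrypoints.openai.api_server import (
        build_async_engine_client_from_engine_args,
    )

    async def run_engine(engine_args: AsyncEngineArgs, enable_metrics: bool):
        # When metrics are enabled, frontend multiprocessing (the ZMQ path) is
        # disabled because the metrics hooks need direct access to the
        # in-process engine; otherwise the faster ZMQ client is used.
        async with build_async_engine_client_from_engine_args(
            engine_args=engine_args,
            disable_frontend_multiprocessing=enable_metrics,
        ) as engine_client:
            ...  # serve requests through engine_client until shutdown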

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

@kthui added the PR: perf (A code change that improves performance) label on Dec 6, 2024
@kthui marked this pull request as ready for review on December 6, 2024 18:46
        # statement.
        async with build_async_engine_client_from_engine_args(
            engine_args=self._aync_engine_args,
            disable_frontend_multiprocessing=self._enable_metrics,
Collaborator:

This one is a bit confusing. How is disable_frontend_multiprocessing related to whether we want to enable metrics or not?

@kthui (Contributor Author), Dec 6, 2024:

Yes, because the engine interface that the metrics were relying on is no longer exposed over the ZMQ multi-process path, so ZMQ has to be disabled when metrics are enabled.

I think we will need to revisit this soon. I will create a thread offline to discuss the options.

src/model.py Outdated
Comment on lines 343 to 365
    def _setup_metrics(self):
        self._vllm_metrics = None
        # TODO: Do not read metrics directly from the vLLM engine, read from prometheus
        #       client to allow the use of ZMQ process when metrics are enabled. See
        #       https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/entrypoints/openai/api_server.py#L222-L245
        if self._enable_metrics:
            try:
                labels = {
                    "model": self.args["model_name"],
                    "version": self.args["model_version"],
                }
                # Add vLLM custom metrics
                engine_config = self._llm_engine.engine.model_config
                self._vllm_metrics = VllmStatLogger(
                    labels, engine_config.max_model_len, self.logger
                )
                self._llm_engine.add_logger("triton", self._vllm_metrics)
            except pb_utils.TritonModelException as e:
                if "metrics not supported" in str(e):
                    # Metrics are disabled at the server
                    self.logger.log_info("[vllm] Metrics not supported")
                else:
                    raise e
Collaborator:

Can we do some refactoring of this logic, please?

  1. Let's call _setup_metrics prior to _init_engine.
  2. Let's initialize self._enable_metrics in _setup_metrics.
  3. Do something like:
    def _setup_metrics(self):
        self._vllm_metrics = None

        self._enable_metrics = (
            self._get_bool_config_param("REPORT_CUSTOM_METRICS")
            and not self._aync_engine_args.disable_log_stats
        )
        if not self._enable_metrics:
            return
        # TODO: Do not read metrics directly from the vLLM engine, read from prometheus
        #       client to allow the use of ZMQ process when metrics are enabled. See
        #       https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/entrypoints/openai/api_server.py#L222-L245
        try:
            labels = {
                "model": self.args["model_name"],
                "version": self.args["model_version"],
            }
            # Add vLLM custom metrics
            engine_config = self._llm_engine.engine.model_config
            self._vllm_metrics = VllmStatLogger(
                labels, engine_config.max_model_len, self.logger
            )
            self._llm_engine.add_logger("triton", self._vllm_metrics)
        except pb_utils.TritonModelException as e:
            if "metrics not supported" in str(e):
                # Metrics are disabled at the server
                self.logger.log_info("[vllm] Metrics not supported")
            else:
                raise e

Collaborator:

This way all metric-related stuff will be under the same function, and we'll somewhat follow the single-responsibility idea.

Collaborator:

Or do we need the engine set up before we can set up metrics?

Collaborator:

Discussed offline; I'll take care of the refactor in a follow-up PR.
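
For reference, a rough sketch of the direction the TODO above points at: reading vLLM's exported metrics from the process-wide Prometheus registry instead of reaching into the engine object. Nothing here is code from this PR; the helper name and the example metric name are assumptions.

    from prometheus_client import REGISTRY

    def read_vllm_metric_total(name: str, labels: dict) -> float:
        # Hypothetical helper: scan the default Prometheus registry for a
        # vLLM-exported metric family and sum the samples whose labels match.
        total = 0.0
        for family in REGISTRY.collect():
            if family.name != name:
                continue
            for sample in family.samples:
                if all(sample.labels.get(k) == v for k, v in labels.items()):
                    total += sample.value
        return total

    # Example (metric name assumed from vLLM's usual "vllm:" prefix):
    # read_vllm_metric_total("vllm:generation_tokens", {"model_name": "my_model"})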

src/model.py Outdated
Comment on lines 243 to 245
        self._enable_metrics = (
            self._get_bool_config_param("REPORT_CUSTOM_METRICS")
            and not self._aync_engine_args.disable_log_stats
Collaborator:

With the refactoring suggested here, https://github.com/triton-inference-server/vllm_backend/pull/76/files#r1874032158, we can remove it.

@kthui (Contributor Author), Dec 7, 2024:

The metrics-enabled check also depends on the engine args, so I added this: Fix engine args dependency issue.

Confirmed this version passed L0_backend_vllm.

            self._llm_engine_shutdown_event.is_set() is False
        ), "Cannot create tasks after shutdown has been requested"
        coro = self._generate(request)
        asyncio.run_coroutine_threadsafe(coro, self._event_loop)
Collaborator:

Adding a note to future self: take care of the returned future object.
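
A minimal sketch of what taking care of that future could look like: keep a reference and surface exceptions from asyncio.run_coroutine_threadsafe instead of dropping the returned future. The submit_generate helper, the pending set, and the logger.log_error call are assumptions for illustration, not code from this PR.

    import asyncio
    from concurrent.futures import Future

    def submit_generate(coro, event_loop: asyncio.AbstractEventLoop,
                        pending: set, logger) -> Future:
        # Submit the coroutine to the engine's event loop and keep the returned
        # future so failures are logged rather than silently discarded.
        future = asyncio.run_coroutine_threadsafe(coro, event_loop)
        pending.add(future)

        def _on_done(fut: Future) -> None:
            pending.discard(fut)
            if not fut.cancelled() and fut.exception() is not None:
                logger.log_error(f"[vllm] generate task failed: {fut.exception()}")

        future.add_done_callback(_on_done)
        return future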

@kthui requested a review from oandreeva-nv on December 7, 2024 00:27
oandreeva-nv previously approved these changes on Dec 7, 2024
@oandreeva-nv (Collaborator) left a comment:

LGTM with the follow-up refactor.

@@ -170,6 +170,7 @@ def test_vllm_metrics(self):
            total_prompts,
        )

        # TODO: Revisit this test due to the removal of best_of
Collaborator:

Could you please clarify the revisit part? In case this is assigned to another engineer, what are the steps needed for this revision?

Contributor Author:

Yes, there are two things to check:

  1. What is the goal of this test?
  2. Why did the request_params_n_sum assert fail? Is it expected with the vLLM 0.6.3.post1 update?

Basically, if we are updating or deleting the request_params_n_sum assert, we need to know why it is safe to do so.

oandreeva-nv previously approved these changes on Dec 9, 2024
@oandreeva-nv (Collaborator) left a comment:

lgtm

rmccorm4 previously approved these changes on Dec 18, 2024
@rmccorm4 (Contributor) left a comment:

LGTM. Acknowledging that the known issues with metrics and shutdown will be follow-up fixes, as described in the Caveats section.

Comment on lines +231 to +233
        self._llm_engine_shutdown_event = asyncio.Event()
        self._event_thread = threading.Thread(
            target=asyncio.run, args=(self._run_llm_engine(),)
Collaborator:

I found out why metrics were not shutting down properly: it seems self._llm_engine_shutdown_event, being an asyncio event, was accessed from a different thread, which was causing issues. I managed to run into this with metrics OFF on the ZMQ route as well.

Will include a fix in a follow-up PR.
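
For illustration, a minimal sketch (an assumption about the shape of such a fix, not the follow-up PR itself) of signaling an asyncio.Event owned by the engine's event loop from another thread. asyncio primitives are not thread-safe, so the set() call has to be scheduled onto the loop that owns the event:

    import asyncio

    def request_shutdown(event_loop: asyncio.AbstractEventLoop,
                         shutdown_event: asyncio.Event) -> None:
        # Safe from any thread: hop onto the owning loop before touching the event.
        event_loop.call_soon_threadsafe(shutdown_event.set)

    # Calling shutdown_event.set() directly from another thread may never wake
    # coroutines awaiting the event, which matches the shutdown hang described above.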

@rmccorm4 (Contributor), Dec 20, 2024:

"an asyncio event was accessed from different thread"

This sounds a bit similar to the errors we're seeing on the L0_model_control_stress_*_vllm tests for r24.12, in case your changes or investigation end up being helpful there.

Collaborator:

Seems to be unrelated 😭 Pipeline: 21686181, those are still red.

Contributor Author:

Those are not related; they are due to the Python AsyncIO client.

Contributor:

The same debugging approach may help on the client side, though.

@oandreeva-nv dismissed stale reviews from rmccorm4 and themself via 7348662 on December 20, 2024 23:11
@oandreeva-nv (Collaborator) left a comment:

LGTM!

@kthui merged commit 2f5bfbd into main on Dec 20, 2024
3 checks passed
@kthui deleted the jacky-vllm-0.6.3.post1 branch on December 20, 2024 23:15