perf: Upgrade vLLM version to 0.6.3.post1 #76
Conversation
* Fix engine start fail error propagation
* Temporary patch metrics tests
# statement.
async with build_async_engine_client_from_engine_args(
    engine_args=self._aync_engine_args,
    disable_frontend_multiprocessing=self._enable_metrics,
This one is a bit confusing: how is disable_frontend_multiprocessing related to whether we want to enable metrics or not?
Yes, because the engine interface that the metrics rely on is no longer exposed over the ZMQ multi-process client, so ZMQ has to be disabled when metrics are enabled.
I think we will need to revisit this soon. I will create a thread offline discussing the options.
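For context, a minimal sketch of the coupling, assuming the vLLM v0.6.3.post1 entrypoint helper shown in the diff above (build_async_engine_client_from_engine_args) and the AsyncLLMEngine.add_logger interface; this is illustrative, not the exact backend code:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.entrypoints.openai.api_server import (
    build_async_engine_client_from_engine_args,
)

async def start_engine(engine_args: AsyncEngineArgs, enable_metrics: bool, stat_logger):
    # Passing disable_frontend_multiprocessing=enable_metrics means:
    #   metrics ON  -> ZMQ multi-process frontend OFF -> in-process engine,
    #                  which still exposes add_logger() for the stat logger;
    #   metrics OFF -> ZMQ frontend stays ON for the performance win.
    async with build_async_engine_client_from_engine_args(
        engine_args=engine_args,
        disable_frontend_multiprocessing=enable_metrics,
    ) as engine:
        if enable_metrics:
            # Not reachable over the ZMQ client, hence the coupling above.
            engine.add_logger("triton", stat_logger)
        ...  # serve requests with `engine` until shutdown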
src/model.py
Outdated
def _setup_metrics(self):
    self._vllm_metrics = None
    # TODO: Do not read metrics directly from the vLLM engine, read from prometheus
    # client to allow the use of ZMQ process when metrics are enabled. See
    # https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/entrypoints/openai/api_server.py#L222-L245
    if self._enable_metrics:
        try:
            labels = {
                "model": self.args["model_name"],
                "version": self.args["model_version"],
            }
            # Add vLLM custom metrics
            engine_config = self._llm_engine.engine.model_config
            self._vllm_metrics = VllmStatLogger(
                labels, engine_config.max_model_len, self.logger
            )
            self._llm_engine.add_logger("triton", self._vllm_metrics)
        except pb_utils.TritonModelException as e:
            if "metrics not supported" in str(e):
                # Metrics are disabled at the server
                self.logger.log_info("[vllm] Metrics not supported")
            else:
                raise e
Can we do some refactoring of this logic, please?
- Let's call _setup_metrics prior to _init_engine
- Let's initialize self._enable_metrics in _setup_metrics
- Do something like:
def _setup_metrics(self):
    self._vllm_metrics = None
    self._enable_metrics = (
        self._get_bool_config_param("REPORT_CUSTOM_METRICS")
        and not self._aync_engine_args.disable_log_stats
    )
    if not self._enable_metrics:
        return
    # TODO: Do not read metrics directly from the vLLM engine, read from prometheus
    # client to allow the use of ZMQ process when metrics are enabled. See
    # https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/entrypoints/openai/api_server.py#L222-L245
    try:
        labels = {
            "model": self.args["model_name"],
            "version": self.args["model_version"],
        }
        # Add vLLM custom metrics
        engine_config = self._llm_engine.engine.model_config
        self._vllm_metrics = VllmStatLogger(
            labels, engine_config.max_model_len, self.logger
        )
        self._llm_engine.add_logger("triton", self._vllm_metrics)
    except pb_utils.TritonModelException as e:
        if "metrics not supported" in str(e):
            # Metrics are disabled at the server
            self.logger.log_info("[vllm] Metrics not supported")
        else:
            raise e
This way all metric-related stuff will be under the same function and we'll somewhat follow the single responsibility idea.
or do we need an engine set up before we can set up metrics?
Discussed offline; I'll take care of the refactor in a follow-up PR.
src/model.py
Outdated
self._enable_metrics = (
    self._get_bool_config_param("REPORT_CUSTOM_METRICS")
    and not self._aync_engine_args.disable_log_stats
With the refactoring suggested here: https://github.com/triton-inference-server/vllm_backend/pull/76/files#r1874032158, we can remove it.
The metrics-enabled check also depends on the engine args, so I added this: Fix engine args dependency issue
Confirmed this version passed L0_backend_vllm.
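To spell out the dependency that commit has to respect, here is a rough initialization sketch (not the backend's actual code; _build_engine_args is a hypothetical placeholder for however the args are constructed):

def initialize(self, args):
    self.args = args
    # The engine args must exist first, because the metrics check reads
    # disable_log_stats from them ...
    self._aync_engine_args = self._build_engine_args()  # hypothetical helper
    self._enable_metrics = (
        self._get_bool_config_param("REPORT_CUSTOM_METRICS")
        and not self._aync_engine_args.disable_log_stats
    )
    # ... and self._enable_metrics must be known before the engine client is
    # built, because it selects disable_frontend_multiprocessing (ZMQ on/off).
    self._init_engine()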
    self._llm_engine_shutdown_event.is_set() is False
), "Cannot create tasks after shutdown has been requested"
coro = self._generate(request)
asyncio.run_coroutine_threadsafe(coro, self._event_loop)
Adding a note to future self: take care of the returned future object.
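A possible shape for that follow-up, using only the standard library (a sketch of an in-method change; the _pending_futures bookkeeping set is hypothetical):

future = asyncio.run_coroutine_threadsafe(coro, self._event_loop)
self._pending_futures.add(future)  # hypothetical set for draining on shutdown

def _on_done(fut):
    # Surface exceptions that would otherwise be silently swallowed by the
    # discarded concurrent.futures.Future.
    self._pending_futures.discard(fut)
    if fut.cancelled():
        return
    exc = fut.exception()
    if exc is not None:
        self.logger.log_error(f"[vllm] request coroutine failed: {exc}")

future.add_done_callback(_on_done)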
LGTM with the follow up refactor
@@ -170,6 +170,7 @@ def test_vllm_metrics(self):
            total_prompts,
        )

        # TODO: Revisit this test due to the removal of best_of
Could you please clarify the revisit part? In case this is assigned to another engineer, what are the steps needed for this revision?
Yes, there are two things to check:
- What is the goal of this test?
- Why did the request_params_n_sum assert fail? Is it expected with the vLLM 0.6.3.post1 update?
Basically, if we are updating/deleting the request_params_n_sum assert, we need to know why it is safe to update.
lgtm
LGTM - acknowledging the known issues with metrics and shutdown will be follow-up fixes, as described in the Caveats section.
self._llm_engine_shutdown_event = asyncio.Event()
self._event_thread = threading.Thread(
    target=asyncio.run, args=(self._run_llm_engine(),)
I found out why metrics were not shutting down properly: it seems self._llm_engine_shutdown_event, being an asyncio event, was accessed from a different thread, which was causing issues. I managed to run into this with the metrics-off ZMQ route as well.
Will include the fix in a follow-up PR.
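For reference, the usual fix for that pattern is to marshal the call onto the loop that owns the event rather than touching it cross-thread (a sketch using this PR's attribute names; not necessarily the fix that will land):

# asyncio.Event is not thread-safe, so a foreign thread must not call
# .set() on self._llm_engine_shutdown_event directly. Instead:
self._event_loop.call_soon_threadsafe(self._llm_engine_shutdown_event.set)

# Alternatively, use a threading.Event for cross-thread signalling and only
# translate it into loop-side state from within the engine's own event loop.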
"an asyncio event was accessed from different thread"
This sounds a bit similar to the errors we're seeing on the L0_model_control_stress_*_vllm tests for r24.12, in case your changes or investigation end up being helpful there.
Seems to be unrelated 😭 Pipeline: 21686181, those are still red.
those are not related, they are due to the Python AsyncIO client.
Same debug approach may help on client side though
LGTM!
What does the PR do?
Checklist
<commit_type>: <Title>
Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
triton-inference-server/server#7858
Where should the reviewer start?
N/A
Test plan:
This is a performance improvement, so any issues should be covered by existing test cases.
Caveats:
Background
There is a performance improvement in vLLM >= 0.6.x from the use of ZMQ, so the vLLM backend can take advantage of it by also enabling ZMQ.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A