V2: All model requests fail if the expected count of model replicas exceeds the count of server replicas #5124

lynnmatrix · 2023-09-05T07:59:14Z

Describe the bug

All model requests fail if the expected count of model replicas exceeds the count of server replicas

To reproduce

Set the number of server (Triton) replicas to 1.
Set the number of replicas of any model (M) to 2.
Send concurrent requests to model M.

Expected behaviour

At least half of the requests succeed

Environment

Seldon core v2.6.0

agrski · 2023-09-05T09:26:58Z

Hi @lynnmatrix,

Is this if you try to schedule a new model and send it requests, or when an existing model loses some amount of availability?

lynnmatrix · 2023-09-05T09:34:41Z

@agrski It is the first case, in that I am adding more replicas to an existing model.

agrski · 2023-09-05T14:48:56Z

If this is a completely new model, it's likely that it simply isn't scheduling.

In any case, could you please check the Model resource, assuming you're running in Kubernetes, and post the status field, e.g. available via:

kubectl -n <namespace> get model <model name> -o jsonpath='{.status}' | jq

lynnmatrix · 2023-09-06T02:16:25Z

@agrski It goes like this. With Triton replicas=1 and model M replicas=1, all requests to model M work well.
Next, change model M replicas to 2 (triggered manually or by autoscaling) while keeping Triton replicas=1. The model M status becomes unready, and all requests to model M fail.

kubectl -n <namespace> get model <model name> -o jsonpath='{.status}' | jq

{
  "conditions": [
    {
      "lastTransitionTime": "2023-09-06T02:07:09Z",
      "message": "ScheduleFailed",
      "reason": "****",
      "status": "False",
      "type": "ModelReady"
    },
    {
      "lastTransitionTime": "2023-09-06T02:07:09Z",
      "message": "ScheduleFailed",
      "reason": "****",
      "status": "False",
      "type": "Ready"
    }
  ],
  "replicas": 2
}
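
For anyone who wants to check this condition from a script rather than by eye, here is a minimal Python sketch. It assumes the status has exactly the shape shown above; `failing_conditions` is a hypothetical helper name, not part of Seldon Core.

```python
import json

# Status JSON as posted above (the redacted "reason" values are kept as-is).
STATUS = """
{
  "conditions": [
    {"lastTransitionTime": "2023-09-06T02:07:09Z", "message": "ScheduleFailed",
     "reason": "****", "status": "False", "type": "ModelReady"},
    {"lastTransitionTime": "2023-09-06T02:07:09Z", "message": "ScheduleFailed",
     "reason": "****", "status": "False", "type": "Ready"}
  ],
  "replicas": 2
}
"""


def failing_conditions(status: dict) -> list:
    """Return (type, message) for every condition whose status is not "True"."""
    return [
        (cond["type"], cond.get("message", ""))
        for cond in status.get("conditions", [])
        if cond.get("status") != "True"
    ]


status = json.loads(STATUS)
failing = failing_conditions(status)
for cond_type, message in failing:
    print(f"{cond_type}: {message}")
```

In practice the JSON would come from the `kubectl ... | jq` command above instead of the embedded string.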

agrski · 2023-09-06T15:35:44Z

Thanks for providing the extra details @lynnmatrix. It looks like the scheduler is attempting to fully reschedule the model and removing its existing assignment(s). This might be just the routing being affected and the model remaining loaded on the one Triton server, or it might be unloading that as well.

I believe the desired behaviour should be for existing assignments to remain and for the model to be considered partially available.
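
To make the desired behaviour concrete, here is a rough Python sketch of a scheduling step that never drops existing assignments. It is only an illustration under assumptions: `ModelState`, `schedule`, and the state names `Available`, `PartiallyAvailable`, and `ScheduleFailed` are hypothetical and do not mirror the actual scheduler code.

```python
from dataclasses import dataclass, field


@dataclass
class ModelState:
    """Hypothetical view of one model's scheduling state."""
    desired_replicas: int
    assigned: list = field(default_factory=list)  # server replica indices hosting the model


def schedule(model: ModelState, server_replicas: list) -> str:
    """Add assignments where possible and keep the ones that already exist."""
    # A server replica can host at most one replica of a given model.
    candidates = [r for r in server_replicas if r not in model.assigned]
    missing = max(model.desired_replicas - len(model.assigned), 0)
    model.assigned.extend(candidates[:missing])

    if len(model.assigned) >= model.desired_replicas:
        return "Available"
    if model.assigned:
        return "PartiallyAvailable"  # existing replicas keep serving traffic
    return "ScheduleFailed"


# Scenario from this issue: one Triton replica, model M scaled from 1 to 2.
model_m = ModelState(desired_replicas=2, assigned=[0])
state = schedule(model_m, server_replicas=[0])
print(state, model_m.assigned)
```

With one server replica and two desired model replicas, the sketch leaves the existing assignment on replica 0 in place and reports partial availability instead of failing outright.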

lynnmatrix · 2023-09-07T02:40:49Z

I looked at the code. After model scheduling fails (because there are no server replicas available to host a new model replica), model.server is reset to empty. Envoy then tries to remove and re-add the model's route, but fails to re-add it because model.server is empty.
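
The sequence described in this comment can be modelled with a small Python sketch. It is based only on the description here, not on the real Seldon Core code: the dictionaries and the functions `on_schedule_failed` and `refresh_route` are hypothetical stand-ins.

```python
# Hypothetical in-memory stand-ins for scheduler and routing state.
model_server = {"M": "triton"}  # model name -> server it is assigned to
routes = {"M": "triton"}        # model name -> server its route points at


def on_schedule_failed(model: str) -> None:
    """Model the regression: a failed schedule resets the model's server."""
    model_server[model] = ""


def refresh_route(model: str) -> bool:
    """Remove the model's route, then re-add it if a server is known."""
    routes.pop(model, None)
    server = model_server.get(model, "")
    if not server:
        return False  # nothing to route to, so the route stays removed
    routes[model] = server
    return True


# Scaling M to 2 replicas with a single server replica fails to schedule.
on_schedule_failed("M")
route_restored = refresh_route("M")
print(route_restored, routes)
```

After the failed schedule the route for M is gone and cannot be restored, which matches every request to M failing even though one replica was serving before.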

ukclivecox · 2023-09-07T06:33:24Z

@sakoush This sounds like a regression bug from the reset server "fix" for other reasons?

sakoush · 2023-09-07T10:30:57Z

There was a regression, and it is fixed in #5074.
@lynnmatrix could you try using v2? We also have an RC with this fix that is likely to go out in the next week or so.

lynnmatrix · 2023-09-07T10:44:58Z

Thanks. #5074 can fix this issue

sakoush · 2023-10-09T11:41:15Z

@lynnmatrix I will close this issue as it seems it is fixed in your case now.

lynnmatrix added the bug label Sep 5, 2023

agrski added the v2 label Sep 5, 2023

sakoush closed this as completed Oct 9, 2023
