V2: All model requests fail if the expected count of model replicas exceeds the count of server replicas #5124

Closed
lynnmatrix opened this issue Sep 5, 2023 · 10 comments

@lynnmatrix

Describe the bug

All model requests fail if the expected count of model replicas exceeds the count of server replicas

To reproduce

  1. Set the count of server (Triton) replicas to 1
  2. Set the count of replicas for any model (M) to 2
  3. Send concurrent requests to model M

Expected behaviour

At least half of the requests succeed

Environment

Seldon Core v2.6.0

@lynnmatrix lynnmatrix added the bug label Sep 5, 2023
@agrski
Contributor

agrski commented Sep 5, 2023

Hi @lynnmatrix,

Is this if you try to schedule a new model and send it requests, or when an existing model loses some amount of availability?

@agrski agrski added the v2 label Sep 5, 2023
@lynnmatrix
Author

@agrski It is the first case: adding more replicas to an existing model.

@agrski
Contributor

agrski commented Sep 5, 2023

If this is a completely new model, it's likely that it simply isn't scheduling.

In any case, could you please check the Model resource, assuming you're running in Kubernetes, and post the status field, e.g. available via:

kubectl -n <namespace> get model <model name> -o jsonpath='{.status}' | jq

@lynnmatrix
Author

@agrski It goes like this. With Triton replicas=1 and model M replicas=1, all requests for model M work well.
Next, change model M replicas to 2 (triggered manually or by autoscaling) while keeping Triton replicas=1. The model M status becomes unready, and all requests to model M fail.

kubectl -n <namespace> get model <model name> -o jsonpath='{.status}' | jq

{
  "conditions": [
    {
      "lastTransitionTime": "2023-09-06T02:07:09Z",
      "message": "ScheduleFailed",
      "reason": "****",
      "status": "False",
      "type": "ModelReady"
    },
    {
      "lastTransitionTime": "2023-09-06T02:07:09Z",
      "message": "ScheduleFailed",
      "reason": "****",
      "status": "False",
      "type": "Ready"
    }
  ],
  "replicas": 2
}
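
A status like the one above can also be checked programmatically. The following is a minimal Python sketch, assuming the JSON printed by the kubectl command has been captured as a string; the helper name `failed_conditions` is made up for illustration and is not part of Seldon Core.

```python
import json

# Status as posted above (the "reason" values were redacted in the original).
STATUS_JSON = """
{
  "conditions": [
    {"lastTransitionTime": "2023-09-06T02:07:09Z", "message": "ScheduleFailed",
     "reason": "****", "status": "False", "type": "ModelReady"},
    {"lastTransitionTime": "2023-09-06T02:07:09Z", "message": "ScheduleFailed",
     "reason": "****", "status": "False", "type": "Ready"}
  ],
  "replicas": 2
}
"""


def failed_conditions(status: dict) -> list[tuple[str, str]]:
    """Return (type, message) for every condition whose status is not "True"."""
    return [
        (c["type"], c.get("message", ""))
        for c in status.get("conditions", [])
        if c.get("status") != "True"
    ]


status = json.loads(STATUS_JSON)
failed = failed_conditions(status)
for cond_type, message in failed:
    print(f"{cond_type}: {message}")
```

Here both `ModelReady` and `Ready` are reported with the message `ScheduleFailed`, which matches the observation that the model becomes unready as soon as the requested replica count cannot be scheduled.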

@agrski
Contributor

agrski commented Sep 6, 2023

Thanks for providing the extra details @lynnmatrix. It looks like the scheduler is attempting to fully reschedule the model and removing its existing assignment(s). This might be just the routing being affected and the model remaining loaded on the one Triton server, or it might be unloading that as well.

I believe the desired behaviour should be for existing assignments to remain and for the model to be considered partially available.
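
As a rough illustration of that desired behaviour, here is a toy Python sketch. It is not the Seldon scheduler; the function name, the `ScheduleResult` type, and the state strings are all hypothetical and chosen only to show the idea of keeping existing assignments when the requested replica count cannot be met.

```python
from dataclasses import dataclass, field


@dataclass
class ScheduleResult:
    # Server replicas currently hosting the model (hypothetical representation).
    assigned: list[str] = field(default_factory=list)
    state: str = "Unavailable"


def schedule_keep_existing(desired: int, server_replicas: list[str],
                           existing: list[str]) -> ScheduleResult:
    """Hypothetical scheduling step that never drops a working assignment
    just because the desired replica count cannot be satisfied."""
    # Keep current placements that still point at a live server replica,
    # trimmed to the desired count in case of a scale-down.
    assigned = [s for s in existing if s in server_replicas][:desired]
    # Fill any remaining slots with server replicas not already in use.
    for s in server_replicas:
        if len(assigned) >= desired:
            break
        if s not in assigned:
            assigned.append(s)
    if len(assigned) >= desired:
        state = "Available"
    elif assigned:
        state = "PartiallyAvailable"
    else:
        state = "Unavailable"
    return ScheduleResult(assigned, state)


# One Triton replica, model scaled from 1 to 2 replicas.
result = schedule_keep_existing(desired=2, server_replicas=["triton-0"],
                                existing=["triton-0"])
print(result)
```

With one server replica and two requested model replicas, the sketch keeps the existing placement on `triton-0` and reports the model as partially available instead of clearing the assignment.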

@lynnmatrix
Author

I looked at the code. After model scheduling fails (because there are no server replicas available to host a new model replica), model.server is reset to empty. Envoy then tries to remove and re-add the model's route, but fails to re-add it because model.server is empty.
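
The sequence described above can be modelled with a small toy example. This Python sketch is only an illustration of the failure mode, not the actual Seldon code; the `routes` dict stands in for the Envoy route table, and `try_schedule` and `sync_route` are hypothetical names.

```python
# Model name -> server name; a stand-in for the Envoy route table.
routes: dict[str, str] = {"M": "triton"}
model = {"name": "M", "server": "triton", "replicas": 1}


def try_schedule(model: dict, desired: int, server_replicas: int) -> bool:
    """Toy version of the failing path: on failure the server field is cleared."""
    if desired > server_replicas:
        model["server"] = ""  # the reset described above
        return False
    model["replicas"] = desired
    return True


def sync_route(model: dict, routes: dict[str, str]) -> None:
    """Remove the model's route, then re-add it only if a server is set."""
    routes.pop(model["name"], None)
    if model["server"]:
        routes[model["name"]] = model["server"]


ok = try_schedule(model, desired=2, server_replicas=1)
sync_route(model, routes)
print(ok, routes)
```

After the failed scheduling attempt the route for M is removed and never restored, so every request to M fails even though one replica was serving traffic before the scale-up.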

@ukclivecox
Contributor

@sakoush This sounds like a regression bug from the reset server "fix" for other reasons?

@sakoush
Member

sakoush commented Sep 7, 2023

There was a regression, and it is fixed in #5074.
@lynnmatrix could you try using v2? We also have an RC with this fix that is likely to go out in the next week or so.

@lynnmatrix
Author

Thanks. #5074 fixes this issue.

@sakoush
Member

sakoush commented Oct 9, 2023

@lynnmatrix I will close this issue as it seems it is fixed in your case now.

@sakoush sakoush closed this as completed Oct 9, 2023