feat(scheduler): Report lack of dataflow engines in pipeline statuses #5080

Conversation

@agrski (Contributor) commented Aug 10, 2023

Why

Issues

N/A

Motivation

Currently, the pipeline status returned from the scheduler to the operator does not report when there are no dataflow engines available on which to schedule that pipeline. The pipeline is left in the PipelineCreate state with no further details; users must check the scheduler's logs to determine the cause.

This PR ameliorates this situation by propagating this information back to the pipeline status in the scheduler-internal data store, which in turn is used to inform subscribers such as the Core v2 operator.
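
For illustration only, here is a rough Go sketch of that idea, using hypothetical types and names rather than the scheduler's actual API: when a pipeline event is handled and no dataflow engines are registered, the reason is recorded on the pipeline's status in the internal store, where subscribers (such as the operator's status reconciler) can pick it up.

// Hypothetical sketch, not the scheduler's real types or API.
package main

import "fmt"

type PipelineState int

const (
    PipelineCreate PipelineState = iota
    PipelineReady
)

type PipelineStatus struct {
    State   PipelineState
    Message string
}

type PipelineStore struct {
    statuses map[string]PipelineStatus
}

func (s *PipelineStore) setStatus(name string, state PipelineState, msg string) {
    s.statuses[name] = PipelineStatus{State: state, Message: msg}
    // In the real scheduler, such an update would also be published to store
    // subscribers, which is how the Core v2 operator learns the new message.
}

func (s *PipelineStore) handlePipelineEvent(name string, numDataflowEngines int) {
    if numDataflowEngines == 0 {
        // Keep the pipeline in its creation state, but record why it is stuck.
        s.setStatus(name, PipelineCreate, "no dataflow engines available to handle pipeline")
        return
    }
    s.setStatus(name, PipelineReady, "")
}

func main() {
    store := &PipelineStore{statuses: map[string]PipelineStatus{}}
    store.handlePipelineEvent("dummy-pipeline", 0)
    fmt.Println(store.statuses["dummy-pipeline"].Message)
}
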

This PR may, at first glance, look like it changes more than required; there is good reason for this. The chainer server (dataflow engine server) is both a consumer of pipeline events on the internal event bus and a publisher to that same stream of events. The simple solution of just updating the pipeline state when no dataflow engines are registered creates a potentially infinite loop: the server component receives an event, sees there are no engines, publishes an event, then receives the very event it has just published. This loop runs very quickly and constantly spawns new coroutines; with a single pipeline, it can OOM the scheduler (running with the default 1GB of memory) in roughly 40 seconds.

To break this cycle, we need a mechanism for de-duplicating or otherwise ignoring events. The solution provided here is to add an event source, so that the producer of a message can ignore anything produced by itself. This is relatively simple and also quite generic: there is no need to cache particular event IDs or to split things up into many separate topics.
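
As a minimal sketch of the event-source idea, assuming a simple in-process pub/sub bus (the names below are hypothetical, not the scheduler's real API): each publisher tags its events with its own source identifier, and its subscription filters out events carrying that identifier, which breaks the publish/consume cycle described above.

// Hypothetical sketch of source-tagged events on an in-process bus.
package main

import "fmt"

type PipelineEvent struct {
    PipelineName string
    Source       string // identifies which component published this event
}

type Handler func(PipelineEvent)

type EventBus struct {
    subscribers []Handler
}

func (b *EventBus) Publish(ev PipelineEvent) {
    for _, h := range b.subscribers {
        h(ev)
    }
}

// Subscribe registers a handler that ignores events published by `source`
// itself, so a component that both consumes and publishes on the same stream
// cannot trigger itself in a loop.
func (b *EventBus) Subscribe(source string, h Handler) {
    b.subscribers = append(b.subscribers, func(ev PipelineEvent) {
        if ev.Source == source {
            return // self-published event: ignore
        }
        h(ev)
    })
}

func main() {
    bus := &EventBus{}
    const chainerSource = "dataflow-engine-server"

    bus.Subscribe(chainerSource, func(ev PipelineEvent) {
        // React to the event, then publish an update tagged with our own source.
        fmt.Println("chainer server handling", ev.PipelineName)
        bus.Publish(PipelineEvent{PipelineName: ev.PipelineName, Source: chainerSource})
    })

    // An event from another component is handled exactly once; the re-published,
    // self-tagged event is dropped by the filter instead of looping forever.
    bus.Publish(PipelineEvent{PipelineName: "dummy-pipeline", Source: "scheduler"})
}
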

What

Summary of changes

  • Set pipeline status message when no dataflow engines are available.
  • Add source for pipeline events for publishers to ignore their own output.

Testing

Prior to my changes, the pipeline status is:

kubectl -n seldon get pipelines.mlops.seldon.io dummy-pipeline -o json \
  | jq '.status.conditions[] | select(.type == "PipelineReady")'
{
  "lastTransitionTime": "2023-08-10T14:01:18Z",
  "reason": "PipelineCreate",
  "status": "False",
  "type": "PipelineReady"
}

I built a local scheduler image and loaded it into a kind cluster:

make -C scheduler docker-build-scheduler
make -C scheduler kind-image-install-scheduler
kubectl rollout restart -n seldon sts seldon-scheduler

then deleted and re-created a dummy pipeline. The status now looks like:

kubectl -n seldon get pipelines.mlops.seldon.io dummy-pipeline -o json \
  | jq '.status.conditions[] | select(.type == "PipelineReady")'
{
  "lastTransitionTime": "2023-08-09T11:11:09Z",
  "message": "no dataflow engines available to handle pipeline",
  "reason": "PipelineCreate",
  "status": "False",
  "type": "PipelineReady"
}

@agrski agrski self-assigned this Aug 10, 2023
@agrski agrski marked this pull request as ready for review August 10, 2023 13:04
@agrski agrski added the v2 label Aug 10, 2023
@jesse-c (Contributor) left a comment

LGTM

We were able to reproduce the issue yesterday, apply the fix, and confirm it!

@agrski agrski merged commit 8d9547e into SeldonIO:v2 Aug 14, 2023
3 checks passed
@agrski agrski deleted the update-pipeline-status-to-report-lack-of-dataflow-engines branch August 14, 2023 11:31