Improve the handling of sh.keptn.event.get-sli.triggered events where SLIs don't produce a value
#704
Replies: 2 comments 1 reply
-
@arthurpitman thanks for the summary and simplifying it further!
-
As a starter, it would be nice to see at least which metrics failed. At the moment there is zero output. If I have an SLO with [...] What would help tremendously is at least showing (somehow) that [...] A customer I'm working with has exactly this issue, which has led to me having to create a utility which reads an [...]
-
Current situation
SLIs that produce no data during a given evaluation result in a `sh.keptn.event.get-sli.finished` event with `result` set to `fail`. This in turn causes the evaluation to fail, as the lighthouse-service does not process the event further.
Desired behavior
In cases where the dynatrace-service is able to query the Dynatrace tenant successfully, but the processing of the result leads to no single numerical value (i.e. when no data is available during the timeframe or no single value can be selected from the returned results), `success` of the corresponding element in the `indicatorValues` is set to `false` but the overall `result` is set to `warning`.
This would mean the dynatrace-service would only function as an SLI provider, providing indicators with as much information as possible, making its behavior more obvious and easy to document. It would no longer perform any sort of evaluation; this would be done entirely by the lighthouse-service, depending on the scoring of the SLO configuration.
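To make the proposal more concrete, here is a minimal Python sketch of what the `data` block of such a `sh.keptn.event.get-sli.finished` event could look like when one SLI has no value. The nesting under `get-sli` and the `metric`/`value`/`success`/`message` fields reflect my understanding of the Keptn event format and should be checked against the spec; the helper names `indicator` and `build_get_sli_finished`, the metric names, and the message text are hypothetical and not taken from the dynatrace-service code.

```python
from typing import Optional


def indicator(metric: str, value: Optional[float], message: str = "") -> dict:
    """Build one entry of indicatorValues (hypothetical helper).

    success is False whenever no single numerical value could be produced.
    """
    return {
        "metric": metric,
        "value": value if value is not None else 0,
        "success": value is not None,
        "message": message,
    }


def build_get_sli_finished(indicators: list) -> dict:
    """Assemble the event data for the case where the tenant was queried
    successfully; error cases (fail / errored) are deliberately not covered."""
    all_have_values = all(i["success"] for i in indicators)
    return {
        "status": "succeeded",
        # Proposed behavior: a missing value downgrades to warning, not fail.
        "result": "pass" if all_have_values else "warning",
        "get-sli": {"indicatorValues": indicators},
    }


# Illustrative metric names, not from the original discussion.
event_data = build_get_sli_finished([
    indicator("response_time_p95", 412.5),
    indicator("error_rate", None, "no data points in the evaluation timeframe"),
])
```

With this shape, the lighthouse-service would still receive every indicator and could decide from the SLO scoring whether the missing `error_rate` value matters.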
In short, if the user has set up the evaluation to tolerate a certain number of failing objectives, then it could still pass even when some SLIs don't produce a value. Other classes of errors would still be handled as before: situations where the dynatrace-service was unable to process the request would lead to a `sh.keptn.event.get-sli.finished` event with `status` of `errored`, while situations where it is misconfigured (bad SLI syntax or errors from the Dynatrace tenant) would still result in a `result` set to `fail`.
The overall behavior can be summarized as follows:
| Situation | status | result | Effect |
| --- | --- | --- | --- |
| The dynatrace-service was able to retrieve the metric from the Dynatrace API, i.e. the chain of calls succeeded (~HTTP 200-class), and to process the result to produce a single value | `succeeded` | `pass` | `success` field of each indicator in `indicatorValues` is set to `true`<br>- an SLI retrieval was performed<br>-> good case (lighthouse-service can continue with evaluation task) |
| As above, but the processing of the result cannot produce a single value<br>- due to no data points being returned,<br>- multiple data points being returned,<br>- or not being able to select a single result value (USQL queries) | `succeeded` | `warning` | `success` field of the individual indicator in `indicatorValues` is set to `false`<br>- an SLI retrieval was performed<br>-> lighthouse-service has to deal with SLIs that have no value |
| An HTTP 400-class error occurs, due to e.g.:<br>- syntax errors in SLI files,<br>- dashboard syntax errors,<br>- or failure to retrieve the metric from the Dynatrace API | `succeeded` | `fail` | `success` field of each affected indicator in `indicatorValues` is set to `false`<br>- a partial SLI retrieval may have been performed but it should be disregarded<br>-> `get-sli.finished` terminates the sequence execution |
| The dynatrace-service was unable to process the request (encounters HTTP 500-class errors) | `errored` | `fail` | `success` field of all indicators in `indicatorValues` is set to `false`<br>- most likely no SLI retrieval was performed<br>-> `get-sli.finished` terminates the sequence execution |
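The four cases above can be read as a small decision table. The following Python sketch encodes them under one assumption of mine that the discussion does not state explicitly: when indicators have different outcomes, the most severe one determines the overall `status` and `result`. The `Outcome` enum and the `summarize` function are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Dict


class Outcome(Enum):
    VALUE = "single value produced"
    NO_SINGLE_VALUE = "query succeeded, but no single value"
    CLIENT_ERROR = "HTTP 400-class error or bad SLI/dashboard syntax"
    SERVER_ERROR = "request could not be processed (HTTP 500-class)"


# (status, result) for each case, in the same order as the four cases.
STATUS_RESULT = {
    Outcome.VALUE: ("succeeded", "pass"),
    Outcome.NO_SINGLE_VALUE: ("succeeded", "warning"),
    Outcome.CLIENT_ERROR: ("succeeded", "fail"),
    Outcome.SERVER_ERROR: ("errored", "fail"),
}

# Assumption: the worst outcome wins, ordered from least to most severe.
SEVERITY = [
    Outcome.VALUE,
    Outcome.NO_SINGLE_VALUE,
    Outcome.CLIENT_ERROR,
    Outcome.SERVER_ERROR,
]


def summarize(outcomes: Dict[str, Outcome]) -> dict:
    """Map per-indicator outcomes to the overall event fields.

    Expects at least one indicator.
    """
    worst = max(outcomes.values(), key=SEVERITY.index)
    status, result = STATUS_RESULT[worst]
    errored = worst is Outcome.SERVER_ERROR
    return {
        "status": status,
        "result": result,
        # In the errored case every indicator is marked unsuccessful.
        "success": {
            metric: (outcome is Outcome.VALUE and not errored)
            for metric, outcome in outcomes.items()
        },
        # pass and warning hand over to the lighthouse-service; fail stops.
        "sequence_continues": result != "fail",
    }
```

For example, `summarize({"rt": Outcome.VALUE, "err": Outcome.NO_SINGLE_VALUE})` yields `status` `succeeded` and `result` `warning`, with only `err` marked as unsuccessful, so the evaluation would proceed.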