Improve visibility and facilitate troubleshooting #554
nikos912000
started this conversation in
Ideas, user requests and proposals
Hi @Devatoria @ptnapoleon 👋
I am starting this discussion following exploratory work and user feedback while integrating with our CD platform (Spinnaker).
It might help to share some background on our side. Since errors and feedback surface deep down the stack (in the controller), we are working on adding links to internal observability platforms to facilitate troubleshooting.
We analysed what is currently available in terms of reporting, which I thought would be worth sharing.
Metrics
Both the controller and injector Pods report metrics. These metrics are documented and include tags which help with troubleshooting.
Logs
There is already enough context in the logs to tie them to a Disruption. This allows logging platforms to parse them and extract the required fields. In turn, this enables deep links that can be shown in UIs.
Sample log:
Events
On Disruption CR
Events, including ones coming from webhooks, are reported on the Disruption Custom Resource and are available in observability platforms. These events include tags which help with filtering down to a particular Disruption.
Sample:
These events are available in observability platforms (e.g. DataDog's Events Explorer):
On targets
The controller records events on the target Nodes/Pods.
Sample:
Ideas on improving visibility and troubleshooting
Document reported events
It would be useful to document any reporting features, such as structured logs and metrics/events with tags. I am happy to document these based on the above analysis if that helps.
Dynamic feedback on the status
Strictly speaking, the injectionStatus is a state rather than a status. IMO, providing more info in the Custom Resource's status would be useful for client-side integrations. A string field could be introduced for that. Clients could then poll the status sub-resource and get the injectionStatus (state) and the value of the new field (status).
Introduction of a Failed state and context behind its cause
I think one key state that is currently missing is one that reflects failures. Even with dynamic targeting, if a Disruption is not valid or the injector Pods fail, there is a clear failure state. The workaround when reflecting the status in a UI would be to combine the current states with a timeout (if the state hasn't moved to Injected within X seconds, then it has failed), but this sounds a bit ugly.
A few questions:
[...] Failed state in your opinion? And a field that reflects the errors? This could be the field suggested above, which stores dynamic feedback.
[...] injectionStatus? And would that require major design changes due to that logic being in webhooks?
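For reference, the timeout workaround mentioned above (treating a Disruption as failed when its state has not reached Injected within X seconds) could be sketched client-side roughly as follows. This is only an illustration in Python, not how the controller or any existing client works: the function name, the 120-second default, and the labels returned for the UI are assumptions made for the example, and Injected is the only state name taken from this discussion.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical value for the "X seconds" in the workaround.
INJECTION_TIMEOUT = timedelta(seconds=120)


def derive_ui_status(injection_status, created_at, now, timeout=INJECTION_TIMEOUT):
    """Map a Disruption's injectionStatus to a label a UI could display.

    The returned labels other than "Injected" are made up for this sketch.
    """
    if injection_status == "Injected":
        return "Injected"
    # There is no explicit failure state, so failure is inferred
    # from the time elapsed since the Disruption was created.
    if now - created_at > timeout:
        return "Failed (timed out)"
    return "InProgress"


# Usage: any state other than "Injected" is handled the same way, so the
# value below is just a placeholder for "not yet injected".
created = datetime(2022, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
not_yet_injected = "NotInjected"
early = derive_ui_status(not_yet_injected, created, created + timedelta(seconds=30))
late = derive_ui_status(not_yet_injected, created, created + timedelta(seconds=300))
print(early, "|", late)
```

The weakness is visible in the second branch: the client can only guess that something went wrong and has no way to say why. A Failed state together with a field carrying the error would let that branch be replaced by a plain read of the status sub-resource.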