Define mean time to resolve an incident (mean time to resolve, MTTR) #3191
Comments
Some initial research for whoever picks up this ticket: DataDog may work great for this (see https://docs.datadoghq.com/dora_metrics/failures/). We can hook up DD with PagerDuty using a webhook, and MTTR will be calculated automatically from incident start and end times. As stated in that link, though, DORA metrics with DD is currently in public beta, which could be a red flag. More information: https://docs.datadoghq.com/dora_metrics/setup/.
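For anyone who wants to sanity-check whatever number a tool reports, here is a minimal sketch of the calculation described above: MTTR as the average of (end time minus start time) over resolved incidents. The function name, the tuple layout, and the sample timestamps are all hypothetical; this is not how DataDog or PagerDuty compute it internally, just the plain arithmetic.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved_at - started_at) over resolved incidents.

    `incidents` is an iterable of (started_at, resolved_at) tuples.
    Incidents with resolved_at set to None are treated as still open
    and skipped (an assumption; the ticket does not specify this).
    Returns None when there is nothing to average.
    """
    durations = [end - start for start, end in incidents if end is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data, not taken from any real incident history.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),    # 60 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30)),  # 30 minutes
    (datetime(2024, 1, 3, 8, 0), None),                           # still open
]

mttr = mean_time_to_resolve(incidents)
print(mttr)
```

Skipping open incidents is one possible choice; whichever rule the team settles on should match what "resolved" means in the acceptance criteria.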
I feel like this is exactly what I would want, but yeah, I don't feel comfortable with using a public beta 😅 We might have to do something a bit more manual for now, unfortunately.
for consideration:
It looks like PagerDuty has a dashboard that shows MTTR over a chosen time range, under Analytics -> Dashboard. Definitely worth discussing whether this works well enough for now for this ticket. Mason suggested DataDog would be a better long-term solution, since our metrics and incident visualization would be consolidated, sometime after DD's DORA metrics feature exits public beta.
Maybe we can make this a 16th min? We need to have something even if the second iteration is DD.
The nice thing about the PagerDuty dashboard is that it already has a baseline established from our history of using the tool for incident management. It also seems to support breakdowns by severity level out of the box. The goal would be that the current features essentially trivialize this ticket: we just have to make sure the team knows how to find this dashboard, and we can easily pull a graphic to include in any reports to the enablement team.
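If we ever need to reproduce the severity breakdown outside the dashboard (for example, from exported incident data), a rough sketch could look like the following. The function name, the record layout, and the severity labels are assumptions made for illustration; they are not PagerDuty's export format or API.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mttr_by_severity(incidents):
    """Group resolved incidents by severity and average their durations.

    `incidents` is an iterable of (severity, started_at, resolved_at)
    tuples; every incident passed in is assumed to be resolved.
    """
    buckets = defaultdict(list)
    for severity, started_at, resolved_at in incidents:
        buckets[severity].append(resolved_at - started_at)
    return {
        severity: sum(durations, timedelta()) / len(durations)
        for severity, durations in buckets.items()
    }


# Hypothetical sample data with made-up severity labels.
sample = [
    ("SEV1", datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),   # 2 h
    ("SEV1", datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 10, 0)),   # 1 h
    ("SEV2", datetime(2024, 1, 7, 13, 0), datetime(2024, 1, 7, 13, 20)), # 20 min
]

breakdown = mttr_by_severity(sample)
for severity, value in sorted(breakdown.items()):
    print(severity, value)
```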
Hi @brostk, would you be able to comment on your progress on this ticket? Do you need more time, or will this carry over to the next sprint?
Sure - I'm about to finish up the documentation and post it on a wiki page. Currently working with the team to make sure we're on the same page about acceptance criteria 3 and 5. I'd expect to finish today.
thanks @brostk !
As the VRO Team, we aim to define the mean time to resolve an incident. This will help us understand our average resolution time and identify areas for process improvement.
AC:
Notes:
Resolution should be considered "fixed" from the partner team's perspective.