
Define mean time to resolve an incident (mean time to resolve, MTTR) #3191

Closed
meganhicks opened this issue Jul 15, 2024 · 9 comments

meganhicks commented Jul 15, 2024

As the VRO Team, we aim to define the mean time to resolve an incident. This will help us understand our average resolution time and identify areas for process improvement.

AC:

  1. This should only be for incidents where VRO is the root cause of the problem.
  2. Determine how VRO will measure the mean time to resolve an incident (MTTR). Ensure that MTTR is calculated by severity level.
  3. Determine how the team will form a baseline for this metric.
  4. Identify the tool(s) the team will use to measure the metric.
  5. Establish the cadence and process the team will follow to measure MTTR. The Enablement Team has requested that this metric be reported at each Sprint Review; this is negotiable.
  6. Create documentation titled "Metrics" to ensure points 1-3 are agreed upon and communicated across the team.
  7. Ensure responsibilities align with the "on-call" documentation.

Notes:
Resolution should be considered "fixed" from the partner team's perspective.

@meganhicks meganhicks changed the title Tracking average time to resolve an incident (mean time to resolve, MTTR) Define mean time to resolve an incident (mean time to resolve, MTTR) Jul 15, 2024
@meganhicks meganhicks mentioned this issue Jul 16, 2024

brostk commented Jul 29, 2024

Some initial research for whoever picks up this ticket: DataDog may work great for this (see https://docs.datadoghq.com/dora_metrics/failures/). We can hook DD up to PagerDuty using a webhook, and MTTR will be calculated automatically from incident start and end times. DORA metrics in DD are currently in public beta, as noted in that link, which could be a red flag.

More information: https://docs.datadoghq.com/dora_metrics/setup/.
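If the DD DORA beta is a blocker, the per-severity MTTR math itself is simple enough to do by hand from exported incident data. A minimal sketch, assuming incidents exported from PagerDuty as records with hypothetical `start`, `end`, and `severity` fields (illustrative names, not an actual API):

```python
from datetime import datetime, timedelta

def mttr_by_severity(incidents):
    """Mean time-to-resolve (as a timedelta) keyed by severity level.

    `incidents` is a list of dicts with `severity` (str) plus `start`
    and `end` (datetime) fields -- assumed export format, not an API.
    """
    durations = {}
    for inc in incidents:
        durations.setdefault(inc["severity"], []).append(inc["end"] - inc["start"])
    # Average the resolution durations within each severity bucket.
    return {sev: sum(ds, timedelta()) / len(ds) for sev, ds in durations.items()}

incidents = [
    {"severity": "sev1",
     "start": datetime(2024, 7, 1, 9, 0), "end": datetime(2024, 7, 1, 10, 0)},
    {"severity": "sev1",
     "start": datetime(2024, 7, 2, 9, 0), "end": datetime(2024, 7, 2, 12, 0)},
    {"severity": "sev2",
     "start": datetime(2024, 7, 3, 9, 0), "end": datetime(2024, 7, 3, 9, 30)},
]

print(mttr_by_severity(incidents))
# sev1 averages 2 hours; sev2 averages 30 minutes
```

This would satisfy the "calculated by severity level" criterion without any beta-feature dependency, at the cost of a manual export step.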

@lisac lisac assigned lisac and unassigned lisac Jul 31, 2024
@msnwatson msnwatson self-assigned this Aug 1, 2024
@msnwatson

> Some initial research for whoever picks up this ticket: DataDog may work great for this (see https://docs.datadoghq.com/dora_metrics/failures/). We can hook DD up to PagerDuty using a webhook, and MTTR will be calculated automatically from incident start and end times. DORA metrics in DD are currently in public beta, as noted in that link, which could be a red flag.
>
> More information: https://docs.datadoghq.com/dora_metrics/setup/.

I feel like this is exactly what I would want, but yeah, I don't feel comfortable using a public beta 😅 We might have to do something a bit more manual for now, unfortunately.

@msnwatson msnwatson removed their assignment Aug 2, 2024

lisac commented Aug 2, 2024

for consideration:
the Incident Report slack workflow is integrated with Pagerduty. As part of the opening actions of the workflow, an incident in Pagerduty is created by the slack workflow (similar to how it creates a GitHub issue for the incident). New as of this week, the slack workflow is also set up to mark the incident in Pagerduty as resolved, IF the responder follows through with slack workflow steps. It's also an option to go into the Pagerduty web ui directly to mark the incident as resolved.


brostk commented Aug 6, 2024

It looks like PagerDuty has a dashboard that shows MTTR over a selectable time range, under Analytics -> Dashboard. Definitely worth discussing whether this works well enough for this ticket for now. Mason suggested DataDog would be a better long-term solution, since our metrics and incident visualization would be consolidated, sometime after DD's DORA metrics feature exits public beta.

@meganhicks meganhicks mentioned this issue Aug 6, 2024
@meganhicks

Maybe we can make this a 16th min? We need to have something even if the second iteration is DD.

@msnwatson

The nice thing about the PagerDuty dashboard is that it already has a baseline established from our history of using the tool for incident management. It also seems to support breakdowns by severity level out of the box. The hope is that these existing features essentially trivialize this ticket: we just have to make sure the team knows how to find the dashboard, and we can easily pull a graphic to include in any reports to the Enablement Team.
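If we ever want raw numbers rather than the dashboard graphic, PagerDuty incident payloads expose timestamps we could compute from directly. A minimal sketch, assuming PagerDuty-style incident fields (`status`, `created_at`, `last_status_change_at` in UTC ISO-8601) -- treat the exact field names as an assumption to verify against the PagerDuty REST API docs:

```python
from datetime import datetime

ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"  # assumed UTC timestamp format in payloads

def resolution_seconds(incident):
    """Seconds from incident creation to resolution, or None if unresolved.

    Assumes PagerDuty-style fields: for a resolved incident,
    `last_status_change_at` approximates the resolution time.
    """
    if incident.get("status") != "resolved":
        return None
    start = datetime.strptime(incident["created_at"], ISO_FMT)
    end = datetime.strptime(incident["last_status_change_at"], ISO_FMT)
    return (end - start).total_seconds()

sample = {
    "status": "resolved",
    "created_at": "2024-08-01T09:00:00Z",
    "last_status_change_at": "2024-08-01T10:30:00Z",
}
print(resolution_seconds(sample))  # 5400.0 (an hour and a half)
```

Probably unnecessary if the dashboard covers us, but it keeps the baseline reproducible if we later migrate to DD.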

@brostk brostk self-assigned this Aug 8, 2024
@BerniXiongA6

Hi @brostk, would you be able to comment on your progress on this ticket? Do you need more time, or will this carry over to the next sprint?


brostk commented Aug 12, 2024

> Hi @brostk, would you be able to comment on your progress on this ticket? Do you need more time, or will this carry over to the next sprint?

Sure - I'm about to finish up the documentation and post it on a wiki page. Currently working with the team to make sure we're on the same page about acceptance criteria 3 and 5. I'd expect to finish today.

@BerniXiongA6

thanks @brostk !

@brostk brostk closed this as completed Aug 12, 2024