
Define mean time to resolve an incident (mean time to resolve, MTTR) #3191

Closed
meganhicks opened this issue Jul 15, 2024 · 9 comments

meganhicks commented Jul 15, 2024

As the VRO Team, we aim to define the mean time to resolve an incident. This will help us understand our average resolution time and identify areas for process improvement.

AC:

  1. This should only be for incidents where VRO is the root cause of the problem.
  2. Determine how VRO will measure the mean time to resolve an incident (MTTR). Ensure that MTTR is calculated by severity level.
  3. Determine how the team will form a baseline for this metric.
  4. Identify the tool(s) the team will use to measure the metric.
  5. Establish the cadence and process the team will follow to measure MTTR. The Enablement Team has requested that this metric be reported at each Sprint Review; this is negotiable.
  6. Create documentation titled "Metrics" to ensure points 1-3 are agreed upon and communicated across the team.
  7. Ensure responsibilities align with the "on-call" documentation.

Notes:
Resolution should be considered "fixed" from the partner team's perspective.

@meganhicks meganhicks changed the title Tracking average time to resolve an incident (mean time to resolve, MTTR) Define mean time to resolve an incident (mean time to resolve, MTTR) Jul 15, 2024
@meganhicks meganhicks mentioned this issue Jul 16, 2024

brostk commented Jul 29, 2024

Some initial research for whoever picks up this ticket: DataDog may work great for this (see https://docs.datadoghq.com/dora_metrics/failures/). We can hook DD up to PagerDuty using a webhook, and MTTR will be calculated automatically from incident start and end times. DORA metrics in DD are currently in public beta, as noted in that link, which could be a red flag.

More information: https://docs.datadoghq.com/dora_metrics/setup/.
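If the DD DORA beta is a blocker, the per-severity MTTR math itself is simple enough to do by hand from exported incident data. A minimal sketch, assuming incidents exported from PagerDuty as records with hypothetical `start`, `end`, and `severity` fields (illustrative names, not an actual API):

```python
from datetime import datetime, timedelta

def mttr_by_severity(incidents):
    """Mean time-to-resolve (as a timedelta) keyed by severity level.

    `incidents` is a list of dicts with `severity` (str) plus `start`
    and `end` (datetime) fields -- assumed export format, not an API.
    """
    durations = {}
    for inc in incidents:
        durations.setdefault(inc["severity"], []).append(inc["end"] - inc["start"])
    # Average the resolution durations within each severity bucket.
    return {sev: sum(ds, timedelta()) / len(ds) for sev, ds in durations.items()}

incidents = [
    {"severity": "sev1",
     "start": datetime(2024, 7, 1, 9, 0), "end": datetime(2024, 7, 1, 10, 0)},
    {"severity": "sev1",
     "start": datetime(2024, 7, 2, 9, 0), "end": datetime(2024, 7, 2, 12, 0)},
    {"severity": "sev2",
     "start": datetime(2024, 7, 3, 9, 0), "end": datetime(2024, 7, 3, 9, 30)},
]

print(mttr_by_severity(incidents))
# sev1 averages 2 hours; sev2 averages 30 minutes
```

This would satisfy the "calculated by severity level" criterion without any beta-feature dependency, at the cost of a manual export step.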

@lisac lisac assigned lisac and unassigned lisac Jul 31, 2024
@msnwatson msnwatson self-assigned this Aug 1, 2024
@msnwatson

> Some initial research for whoever picks up this ticket: DataDog may work great for this (see https://docs.datadoghq.com/dora_metrics/failures/). We can hook DD up to PagerDuty using a webhook, and MTTR will be calculated automatically from incident start and end times. DORA metrics in DD are currently in public beta, as noted in that link, which could be a red flag.
>
> More information: https://docs.datadoghq.com/dora_metrics/setup/.

I feel like this is exactly what I would want, but yeah, I don't feel comfortable using a public beta 😅 We might have to do something a bit more manual for now, unfortunately.

@msnwatson msnwatson removed their assignment Aug 2, 2024

lisac commented Aug 2, 2024

for consideration:
the Incident Report slack workflow is integrated with Pagerduty. As part of the opening actions of the workflow, an incident in Pagerduty is created by the slack workflow (similar to how it creates a GitHub issue for the incident). New as of this week, the slack workflow is also set up to mark the incident in Pagerduty as resolved, IF the responder follows through with slack workflow steps. It's also an option to go into the Pagerduty web ui directly to mark the incident as resolved.


brostk commented Aug 6, 2024

It looks like PagerDuty has a dashboard that shows MTTR over a selectable time range, under Analytics -> Dashboard. Definitely worth discussing whether this works well enough for this ticket for now. Mason suggested DataDog would be a better long-term solution, since our metrics and incident visualization would be consolidated, sometime after DD's DORA metrics feature exits public beta.

@meganhicks meganhicks mentioned this issue Aug 6, 2024
@meganhicks

Maybe we can make this a 16th min? We need to have something even if the second iteration is DD.

@msnwatson

The nice thing about the PagerDuty dashboard is that it already has a baseline established from our history of using the tool for incident management. It also seems to support breakdowns by severity level out of the box. The hope is that these existing features essentially trivialize this ticket: we just have to make sure the team knows how to find the dashboard, and we can easily pull a graphic to include in any reports to the Enablement Team.
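If we ever want raw numbers rather than the dashboard graphic, PagerDuty incident payloads expose timestamps we could compute from directly. A minimal sketch, assuming PagerDuty-style incident fields (`status`, `created_at`, `last_status_change_at` in UTC ISO-8601) -- treat the exact field names as an assumption to verify against the PagerDuty REST API docs:

```python
from datetime import datetime

ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"  # assumed UTC timestamp format in payloads

def resolution_seconds(incident):
    """Seconds from incident creation to resolution, or None if unresolved.

    Assumes PagerDuty-style fields: for a resolved incident,
    `last_status_change_at` approximates the resolution time.
    """
    if incident.get("status") != "resolved":
        return None
    start = datetime.strptime(incident["created_at"], ISO_FMT)
    end = datetime.strptime(incident["last_status_change_at"], ISO_FMT)
    return (end - start).total_seconds()

sample = {
    "status": "resolved",
    "created_at": "2024-08-01T09:00:00Z",
    "last_status_change_at": "2024-08-01T10:30:00Z",
}
print(resolution_seconds(sample))  # 5400.0 (an hour and a half)
```

Probably unnecessary if the dashboard covers us, but it keeps the baseline reproducible if we later migrate to DD.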

@brostk brostk self-assigned this Aug 8, 2024
@BerniXiongA6

Hi @brostk, would you be able to comment on your progress on this ticket? Do you need more time, or will this carry over to the next sprint?


brostk commented Aug 12, 2024

> Hi @brostk, would you be able to comment on your progress on this ticket? Do you need more time, or will this carry over to the next sprint?

Sure - I'm about to finish up the documentation and post it on a wiki page. Currently working with the team to make sure we're on the same page about acceptance criteria 3 and 5. I'd expect to finish today.

@BerniXiongA6

thanks @brostk !

@brostk brostk closed this as completed Aug 12, 2024