A curated list of awesome incident management resources and collections.
- Introduction
- Tools
- Documentation and Articles
- Best Practices
- Training and Courses
- Contributing
- License
Incident management involves identifying, analyzing, and responding to incidents in a way that minimizes damage and reduces recovery time and costs. This list aims to provide high-quality resources and tools to help you manage incidents effectively.
- HolmesGPT - An open-source, AI-powered incident response tool that uses large language models for automated incident analysis and response.
- incident-bot - A bot for managing incidents in Slack, created by EchoBoomer.
- slack-groups - A Slack app for grouping users and channels, created by Pipedrive.
- template-incident-management - A template for managing incidents using the Slack API, created by Slack.
- grafana-on-call - An on-call management tool by Grafana.
- TheHive - A scalable, open-source security incident response platform that supports collaborative investigation of security incidents.
- OneUptime - A complete open-source observability platform that includes incident management, on-call rotations, log analysis, performance tracking, and more.
- Grafana - An open-source platform for monitoring and observability.
- Prometheus - An open-source monitoring system and time series database.
- Zabbix - An enterprise-level open-source monitoring solution.
- Nagios - An open-source system and network monitoring application.
- Loki - A log aggregation system designed to work with Grafana.
- Graylog - An open-source log management platform.
- Check_MK - A comprehensive IT monitoring system.
- Mimir - Scalable, highly available long-term storage for Prometheus metrics.
- Thanos - A highly available Prometheus setup with long-term storage capabilities.
- PagerDuty - A digital operations management platform for alerting, on-call scheduling, and incident response.
- Splunk On-Call (formerly VictorOps) - A platform designed to streamline incident response by centralizing alerts and communications.
- Opsgenie - An incident management and response tool by Atlassian that integrates with monitoring and chat tools.
- Datadog - A monitoring and security platform for cloud applications.
- New Relic - A comprehensive observability platform built to help you monitor, debug, and improve your entire stack.
- Splunk - A platform for searching, monitoring, and analyzing machine-generated big data via a web-style interface.
- Blameless E-Books - A collection of free e-books on incident management and related topics by Blameless.
- Incident Management: The Complete Guide - A comprehensive guide from Splunk covering various aspects of incident management.
- Incident Response Plans: The Complete Guide - A guide from Splunk to creating and maintaining effective incident response plans.
- Post-Mortem Template - A comprehensive template for conducting post-mortems by ghostinthewires.
- Site Reliability Engineering - The official SRE book by Google, which covers incident management among other topics.
- The Phoenix Project - A novel about IT, DevOps, and helping your business win, which includes insights into incident management.
- PagerDuty Incident Response Documentation - Comprehensive incident response documentation by PagerDuty.
- Incident Management Handbook - A detailed guide on incident management best practices by Atlassian.
- Incident Management at Scale - An article on how Dropbox handles incident management.
- Wheel of Misfortune - An open-source role-playing game for practicing incident management through simulated outage scenarios.
- Coursera SRE Course - A course by Google on Coursera that includes incident management practices.
Contributions are welcome! Please read the contribution guidelines first.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.