Each week a person from the team volunteers to be the "runner." This person watches honeybadger for all our projects and ensures that newly discovered errors are appropriately ticketed and followed through on.
Our team found that without someone specifically designated to track exceptions in our application that the rate of exceptions would only go up. By creating tickets and designating priority issues for the team to work on we hope to create more stable software for our users.
Running happens on a weekly cycle. There is a new runner every week and the goal of the runner is to ensure that exceptions which come in have actionable tickets with sufficient information to close them without honeybadger history.
Volunteer to be the runner at stand-up or the work cycle wrap-up.
Ensure your notifications are on for all of DLS' applications.
Look through new errors from the weekend or night before. For any that have no tickets create a ticket with enough of a backtrace to tell what's going on, a descriptive title of what caused the error, and any background information that might help work it. Apply the "maintenance / research" label so it appears in the inbox of our maintenance board.
Do the same as above, but notify #incident_reports if it's indicative of a major issue that needs immediate response.
Choose one ticket that you either created or saw repeatedly to be worked on during maintenance/research week. Mark it "high-priority" by clicking the pin in the maintenance board and mention it at stand-up.
Running is exhausting and distracting. Don't volunteer too often and lean on your team when you need support. If, for example, a downtime event leads to too many errors to get through while keeping up with what you're working on you can always ask in #dls to split the workload. We're here for you, thank you for volunteering!
- Staging errors are often due to development / testing. But it could be an opportunity to catch an error before we see it in production. If you're not sure, check in with the team.
- Errors that look like attack probes should be ticketed in princeton_ansible for lack of a better place. Operations will use them to look at Signal Sciences configuration.
- Errors that look like SIGHUP from closing a console on a box can be set to "ignore".