The art of reporting incidents

During my PhD, I have been responsible for several pieces of lab infrastructure (data storage, microscopes). To learn how to be a responsible service provider, I have watched many videos (Tom Limoncelli, Alice Goldfuss) and even purchased a couple of books (the best is The Practice of System and Network Administration).

What I’ve learned is that network/system/computer administrators have figured out a pretty good way to manage infrastructure, manage expectations, and make sure systems with users run smoothly, while minimizing pain for the people in charge. We can borrow a lot of this knowledge and reuse it in the setting of a research lab.

One aspect of managing systems is responding to incidents. An incident is basically anything bad or unexpected that happens, no matter the reason. For example, the university network stops working. Or there is an electrical outage. Or I make a configuration change that blocks all users from accessing their data.

All of these events have one thing in common: something is affecting the service beyond the user's original expectations. The first step in remediation, often, must be clear and honest communication of the situation to users, or to any interested party (think of the students who use the system, but also the PI who runs the lab). We often see that this is not done clearly enough, or at the right time, or using the right tools.

A bad way to report an incident: a piece of paper on the outside of the building; no official emails; no alarms raised inside.

The basis of my thinking about this was shamelessly stolen from Tom Limoncelli (for example, Radical Ideas Enterprises Can Learn From The Cloud).

The way we inform people of any issue should follow this minimal checklist:

  • Inform in a timely fashion, ideally as soon as the issue has been discovered and an initial assessment has been done. How long to wait depends on the relative risk of the condition. If we suspect a gas leak, we should not wait, and should inform all parties immediately. If a fridge seems to be broken, we should first check whether it is plugged in before reporting.
  • Be clear about the incident area. “There is an issue with the system” is not as clear as “Network connectivity has been down since 10am”. The purpose of communicating as much as possible is to reassure everyone that you are on top of things and transparent about what’s going on. It also removes unnecessary worries: “Data is inaccessible due to a network issue” makes it clear that the data is still intact.
  • As you describe what has happened, make sure to include things that didn’t happen (to the best of your knowledge). The network is down, but data is safe. The power in room 123 is off, but emergency power in room 123 is still running, so the microscope is still working. The fridge seems to be broken, but the temperature sensor still says -20C.
  • Be clear about what has been done so far to investigate and remedy the issue. “Something happened with the lights in room 321, we called facilities” conveys that you are on top of things.
  • Make sure to be clear that you will update people on the issue. It might not be your job to fix the issue, but it is your job to communicate. There is no electricity? Cool, provide a contact for the person in charge, or be the point of contact yourself. It is OK to delegate or give away responsibility. “Contacted facilities, please refer all questions to John Doe, as there is nothing I can do” means you are managing people’s expectations and providing transparency once again. Ideally, provide a time when you will update (“Will report back by 4pm with updates”).
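
If you send these reports often, the checklist above can be turned into a small template. What follows is only a rough sketch in Python, not a tool I am claiming anyone uses: the class name IncidentReport, its field names, and the "10am" start time are all made up for illustration. It covers the content items of the checklist (what happened, what did not happen, what has been done, who to contact, when the next update comes); timeliness is still up to you.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    """Hypothetical template: one field per content item of the checklist."""
    what_happened: str = ""   # the specific incident area
    since: str = ""           # when the issue started or was discovered
    not_affected: List[str] = field(default_factory=list)   # things that did NOT happen
    actions_taken: List[str] = field(default_factory=list)  # what has been done so far
    contact: str = ""         # point of contact (you, or whoever you delegated to)
    next_update: str = ""     # when people will hear from you again

    def missing_items(self) -> List[str]:
        """Return the names of checklist fields that are still empty."""
        values = {
            "what_happened": self.what_happened,
            "since": self.since,
            "not_affected": self.not_affected,
            "actions_taken": self.actions_taken,
            "contact": self.contact,
            "next_update": self.next_update,
        }
        return [name for name, value in values.items() if not value]

    def render(self) -> str:
        """Build the plain-text body of the first email about the incident."""
        lines = [f"INCIDENT: {self.what_happened} (since {self.since})"]
        lines.append("Not affected: " + "; ".join(self.not_affected))
        lines.append("Done so far: " + "; ".join(self.actions_taken))
        lines.append(f"Point of contact: {self.contact}")
        lines.append(f"Next update: {self.next_update}")
        return "\n".join(lines)


# Example values follow the power-outage scenario from the checklist;
# the start time is invented for the sake of the example.
report = IncidentReport(
    what_happened="The power in room 123 is off",
    since="10am",
    not_affected=["emergency power in room 123 is still running, so the microscope is still working"],
    actions_taken=["contacted facilities"],
    contact="John Doe",
    next_update="will report back by 4pm",
)

if not report.missing_items():
    print(report.render())
```

Calling missing_items() before sending is the useful part: if it returns anything, the email is not yet answering a question that someone will otherwise ask you in person.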

Making sure you check all these boxes in your very first email / report about the incident will allow people to make decisions about their work; it will provide confidence that the incident is being dealt with professionally; and it will save you time by heading off questions like “has X been affected” and “who should I contact about this”.