What is an ITSM Major Incident? ITIL doesn't say.
One of the enigmatic parts of ITIL is Major Incidents. Here are my tips for better Major Incident Management.
If you look at ITIL 2011 Service Operation, there is a paragraph under 126.96.36.199 headed Major Incidents (MI). It tells you that:
- There is a separate procedure
- You must agree what constitutes a MI
- Form a separate team
- Keep incident and problem separate
...um... that's about it.
On Figure 4.3 in that book, the incident process flow branches to a box marked "Major Incident" and never comes back. That box is defined.... nowhere.
So you are pretty much on your own when it comes to Major Incidents. Here are a few of my thoughts, as I'm hot on Major Incident Management, or MIM.
MIM is very important. It needs to be well defined.
Some organisations equate a MI with a Priority 1 Incident (or a Severity 1 Incident). I don't think the mapping is that crisp. Incident priority is for sorting and prioritising (and measuring and reporting). A MI is about abandoning the normal process and switching to different procedures. As discussed below, a MI is about having to invent process. So a MI is about the recognition that normal Incident and Problem Management are not going to cut it. A Major Incident is a declaration of a state of emergency.
[From a comment by BoonNam Goh (thank-you!):
A major incident is mid-way between a normal incident and a disaster (where the IT Service Continuity Management process kicks in). It is mid-way towards a disaster in terms of impact (especially public impact) but is not yet a disaster in terms of having to activate Disaster Recovery (usually in a major incident, the infrastructure or the bulk of it is still intact, so it does not make sense to go to the DRC).
Since Major Incident and ITSCM are similar, some of the activities and organisation structures pre-planned for use in a major incident could be a lighter variation or a reuse of those used for ITSCM (e.g. notification of organisation management, involvement of the comms manager etc).]
Don't bother defining what constitutes a MI. Many circumstances will be unforeseen: guidelines on spotting one don't help. A MI is like art: hard to define but you know it when you see it. I call this the "Oh shit!" test.
So put the effort into defining who declares a MI. Specify the roles that can push the big red button: e.g. service desk manager, service delivery manager, operations manager, business owner.
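If your tooling lets you script the declaration step, the idea can be sketched roughly as below. This is a minimal illustration, not a feature of any particular ITSM product: the `Incident` record, its field names, and `declare_major_incident` are all hypothetical, and the role names are simply the examples listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Roles allowed to push the big red button (examples from the text above;
# substitute your own organisation's list).
DECLARING_ROLES = {
    "service desk manager",
    "service delivery manager",
    "operations manager",
    "business owner",
}


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    ref: str
    priority: int                      # for sorting and reporting only
    major: bool = False
    declared_by: Optional[str] = None


def declare_major_incident(incident: Incident, person: str, role: str) -> Incident:
    """Flag an incident as Major if the declarer holds an authorised role.

    Deliberately no check on priority or impact: spotting an MI is a human
    judgement call, so the only rule enforced is who may make that call.
    """
    if role.lower() not in DECLARING_ROLES:
        raise PermissionError(f"{role!r} cannot declare a Major Incident")
    incident.major = True
    incident.declared_by = person
    return incident
```

Note that nothing here stops an authorised person declaring a Priority 3 incident to be Major, which fits the point that the mapping between priority and MI is not crisp.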
Likewise the process cannot be tightly defined (to be fair to ITIL with their mystery process box). A MI is all about Case Management: you need to take each one on its merits and work it out as you go along. For much more on Case Management see my Standard+Case approach.
What needs to be well defined are:
- Policy: if people are making decisions on the fly give them principles, guidelines, rules, bounds, goals, inputs, and outputs.
- Roles and responsibilities: especially a Comms Manager and a Technical Manager, who work back to back - one faces outwards and one inwards. One of these could be the overall Major Incident Manager or it could be a separate person.
- Procedures: comms plan, war-rooms, supplier mobilisation, RCA...
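One way to keep yourself honest about those three items is a simple readiness check you can run before a drill. The sketch below is only an assumption about how you might record it: `MajorIncidentPlan`, `gaps`, and the two `REQUIRED_` lists are hypothetical names, and the required entries are just the examples from the list above, not a complete set.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative minimums taken from the list above; extend for your own shop.
REQUIRED_ROLES = ("Major Incident Manager", "Comms Manager", "Technical Manager")
REQUIRED_PROCEDURES = ("comms plan", "war-room", "supplier mobilisation", "RCA")


@dataclass
class MajorIncidentPlan:
    """Hypothetical record of what has been defined ahead of an MI."""
    policy: List[str] = field(default_factory=list)            # principles, bounds, goals
    roles: Dict[str, str] = field(default_factory=dict)        # role -> named person
    procedures: Dict[str, str] = field(default_factory=dict)   # procedure -> where it is documented

    def gaps(self) -> List[str]:
        """Return whatever is still undefined, in a fixed order."""
        missing = []
        if not self.policy:
            missing.append("policy")
        missing += [f"role: {r}" for r in REQUIRED_ROLES if not self.roles.get(r)]
        missing += [f"procedure: {p}" for p in REQUIRED_PROCEDURES
                    if not self.procedures.get(p)]
        return missing
```

The same person can be named against more than one role, which allows for the Comms or Technical Manager doubling as the overall Major Incident Manager.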
The Major Incident Manager is not automatically the same person as the Incident Manager. The skillsets are different. See Choose your Major Incident Manager.
MIM is about restoring service. Problem resolution is a closely related but distinct process. Don't let chasing the problem distract tech staff from getting the service back on the air as their top priority. Best to have two teams: incident resolution and problem resolution. But then I've been saying that for years, most recently here.
MIM is as much about managing the impacted customers as it is about managing service restoration.
Once you have it defined, then rehearse, rehearse, rehearse. You traipse down the fire stairs twice a year, but the only time you practise MIM is when it happens, right? Organisations who get lots of MIs are well practised in MIM. Good stable production environments get complacent and really screw up when the inevitable MI happens. Stay sharp: do MIM drills.
Note: there is such an animal as a Major Problem. Service has been restored but it is still out there ready to strike again, and it passes the "Oh shit" test. Proceed as above, with slightly less urgency.
For more on MIM see