How ITIL gets Incident vs Problem wrong
In ITIL, we don't separate Incidents from Problems properly. This causes a muddy and confused definition of both. Join me as I try one more time to make this clear.
In my post on Standard+Case I introduced this idea:
If you seek to break a rock, many types of stone have a crystal structure which creates fracture planes: hit the rock in the right place and it drops into two; hit it in the wrong direction and it just chips.
When you attempt to categorise stuff into two or more types, if division comes easily and is clear to everyone then you have found a categorisation which reveals something about the underlying nature of the information. If it is hard to do and the results are messy and debatable, then you are trying to force an unnatural taxonomy onto the data.
(That's a really important post BTW. If you read no other post on this blog, read that one.)
I have been thinking about that concept in the context of my Sisyphean quest to get the ITIL definition of Incident fixed. This may take a fair bit of explanation. If you are interested enough, stick with me. The rest of you go read about Kanban or BYOD or JobsToBeDone or something else fizzy and exciting.
According to ITIL, an Incident is an unplanned interruption to a service, or the failure of a component of a service that hasn't yet impacted service.
And yet according to ITIL the purpose of Incident Management is to restore normal service as quickly as possible and minimise business impact.
So if a component has failed and the current impact on service is zero, why does incident management give a flying fox about fixing it?
I think ITIL has the definition (but not description) of Incident Management exactly right and the definition of Incidents awfully wrong. That "component" crap shouldn't be in there.
But wait, there is more.
According to ITIL, a Problem is a cause of one or more Incidents.
And according to ITIL, Problem Management proactively prevents Incidents from happening and minimises the impact of incidents (...by removing causes of incidents, though the definition doesn't explicitly say that).
If a Problem isn't a problem until it causes at least one incident, what is proactive about that?
Clearly, in order for proactive Problem Management to even exist, the definition of a Problem should be the cause or potential cause of zero or more Incidents.
A failed service component is a Problem that will potentially cause Incidents. How it ever got dumped in the Incident definition is beyond me. My suspicion is that in many organisations Incident Management behaves with urgency and Problem Management doesn't, and that was why "failed component" got called an incident. It's a dumb reason but I can't think of a more plausible one.
What ITIL also gets wrong is the description of the Incident Management process: Incident resolution (SO 184.108.40.206) can include the fixing of a fault, or the fault becomes a Problem to "prevent...recurrence" (SO 220.127.116.11). WTF? We create a problem sometimes but not others? Sometimes we do root cause analysis as part of Incident Management and sometimes as part of Problem Management? Sometimes the fault is the responsibility of the Incident Manager and team, other times the Problem Manager? When we do statistical reporting on causes, we need to somehow extract them from (multiple) incident records as well as problem records? A Problem Manager looking at their portfolio of problems won't see many of the current issues? This rock is chipping, not splitting.
So the formal ITIL definition of Incident Management is fine: restore the service, whatever it takes, e.g. workarounds. The description of Incident Management fails to honour this. It seems to me the implicit ITIL definition of Incident Management is:
- restore the service, and find the underlying fault and fix it, and in vaguely defined circumstances create a Problem for the cause instead of fixing it.
And the implicit ITIL definition of Problem Management is:
- remove the causes of incidents, except in vaguely defined circumstances where Incident Management is going to remove them.
Coming back to my initial concept of cleaving a rock, let's revisit that mess with some nice crisp definitions of our own:
- An Incident is a user reporting an unplanned interruption to a service.
- An interruption to a service is a reduction in quality below the agreed service levels.
- The purpose of Incident Management is to restore service to the user.
- A Problem is a cause or potential cause of Incidents.
- The cause of an interruption is a Problem. Every time.
- The purpose of Problem Management is to remove Problems.
- When Incident Management identifies a cause or suspected cause of an Incident, it immediately creates a Problem record, after which Incident Management continues to focus on restoring service to the users, including helping to prioritise the related Problem.
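As a rough illustration, the definitions above could be sketched as a small data model. This is only a sketch under my own assumptions: the class, field and method names (`ServiceRecords`, `raise_problem`, `link_cause` and so on) are hypothetical and do not come from ITIL or from any real ITSM tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Problem:
    """A cause or potential cause of zero or more Incidents."""
    problem_id: int
    description: str
    removed: bool = False
    incident_ids: List[int] = field(default_factory=list)


@dataclass
class Incident:
    """A user reporting an unplanned interruption to a service."""
    incident_id: int
    user: str
    summary: str
    service_restored: bool = False
    problem_id: Optional[int] = None


class ServiceRecords:
    """Hypothetical in-memory store; a real tool would persist these."""

    def __init__(self) -> None:
        self.incidents: Dict[int, Incident] = {}
        self.problems: Dict[int, Problem] = {}

    def report_incident(self, user: str, summary: str) -> Incident:
        incident = Incident(len(self.incidents) + 1, user, summary)
        self.incidents[incident.incident_id] = incident
        return incident

    def raise_problem(self, description: str) -> Problem:
        # Proactive case: a Problem can exist with no Incidents at all.
        problem = Problem(len(self.problems) + 1, description)
        self.problems[problem.problem_id] = problem
        return problem

    def link_cause(self, incident: Incident, problem: Problem) -> None:
        # Incident Management records the suspected cause as a Problem,
        # then carries on restoring service to the user.
        incident.problem_id = problem.problem_id
        problem.incident_ids.append(incident.incident_id)

    def restore_service(self, incident: Incident) -> None:
        # Restoring service (e.g. via a workaround) does not remove the Problem.
        incident.service_restored = True
```

Note that a failed component with no user impact is simply a `Problem` with an empty `incident_ids` list, and restoring service to a user leaves the related `Problem` open until someone actually removes the cause.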
This splits the rock nicely.
- A user-focused team works on restoring the user to service.
- If there are workarounds, the Incident team are happy to get the user working again. If not, the Incident team will drive the priority of removing the problem.
- A technical-focused team works on helping the Incident team with diagnosis of an interruption, and doing all the work to determine and remove the cause of the interruption.
- All problems are recorded as Problems so our problem portfolio and problem stats are finally useful.
- If incidents are related, there must be a suspected cause to relate them, so incidents can be related to a mutual problem, not related to the cumbersome concept of a master or parent incident.
- The conflicting priorities of restoring service and identifying cause are not mixed within one process, team and accountability.
- Incident management can be measured on restoration of service.
- Problem management can be measured on elimination of cause.
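To make the last few points concrete, here is a minimal sketch of relating incidents through a shared Problem and of keeping the two measures separate. The record types, field names and the particular metrics (mean minutes to restore, fraction of problems removed) are my own assumptions for illustration, not anything ITIL prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IncidentRecord:
    incident_id: int
    problem_id: Optional[int]   # suspected cause, if one is known
    minutes_to_restore: float   # from user report to user working again


@dataclass
class ProblemRecord:
    problem_id: int
    removed: bool               # has the cause been eliminated?


def incidents_by_problem(incidents: List[IncidentRecord]) -> Dict[int, List[int]]:
    """Relate incidents through their shared Problem, not a parent incident."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for inc in incidents:
        if inc.problem_id is not None:
            groups[inc.problem_id].append(inc.incident_id)
    return dict(groups)


def mean_minutes_to_restore(incidents: List[IncidentRecord]) -> float:
    """Incident Management measure: how quickly users get service back."""
    if not incidents:
        return 0.0
    return sum(i.minutes_to_restore for i in incidents) / len(incidents)


def fraction_of_problems_removed(problems: List[ProblemRecord]) -> float:
    """Problem Management measure: how many causes have been eliminated."""
    if not problems:
        return 0.0
    return sum(1 for p in problems if p.removed) / len(problems)
```

Neither measure needs to look inside the other team's records beyond the `problem_id` link, which is the point of splitting the accountability.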
An analogy for my view of problem vs incident: the problem team are the Fire Service. The fire service has firemen, inspectors, and technical advisors to government. They don't have counsellors, emergency housing, or doctors. That's looking after the people: that's incident management. See also my "Cherry Valley" article on this subject.
From comments below:
That's not what ITIL says, but it is what I think it should say.
An incident is a user perceiving that they are not getting their agreed service levels.
Incident management is about getting them to feel once again that they are getting their agreed service levels.
Incident management is an outward facing process - a subset of request fulfilment - dedicated to providing maximum service to users.
Incident management is the responsibility of front-office, outward-facing customer-service (actually user-service) teams with their own tools around service desk, CRM, SLM, etc.
Period. Nothing else. One process, one purpose, one accountability, one set of goals and metrics: incident management. Don't confuse an incident with an interruption. Different entity.
There are inward-facing back-office technical teams whose role is to look after the components of the service. As part of that their job is to remove causes of service failure and therefore to restore the underlying service. Different people, different purpose, different goals and metrics, different tools. Ergo different process. Problem Management.
Split the rock.
I don't get how people don't get that. It is so clearly better. It is the right plane of cleavage between Incidents and Problems.
I'm proud of the article I did on incident management for ITSM Review as one of my best, and the follow-on article on Problem management. These use railroading as an illustration of the concepts.
- Interesting note: a problem is a cause of incidents. An incident is a user reporting an interruption. So poor user training is a problem because it causes incidents which are false reports of interruptions. Just a thought.
Then there is the right plane of cleavage between Incidents and Requests, which I have discussed elsewhere: there isn't one. We shouldn't try to separate them because an Incident is a category of Request. But that's a whole other debate...