ITIL V3 Incident Definition: Camels and Committees

ITIL defines an incident to be an impact on service or a failure of a CI that might impact service. I think that is clumsy. An incident is an impact on service. Period. A failure of a CI is something else.

In a comment, Aale Roos reminded us that ITIL V3 defines an Incident as

An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set.

Talk about camels and committees. ITIL V2 didn't have that second part about broken CIs. "Failure of a configuration item that has not yet impacted service is also an incident". No it isn't. It's a Fault (or some other name).

As Aale also reminded us, Incident Management is now defined as

Incident Management is the process for dealing with all incidents; this can include failures, questions or queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or automatically detected and reported by event monitoring tools.

No it isn't. They have thoroughly muddied the definition to include a wide variety of requests. Incident Management is the process of restoring normal service operation as quickly as possible and minimising the adverse impact on business operations, just as the very next paragraph (Service Operation 4.2.1) defines it. All that other crap is more general Request Management, including Faults. Most event monitoring tools inform us of Faults. Only the really smart ones know if a service has been impacted or not, unless the event is the detection of a service level target violation by an end-user-experience monitoring tool.

This isn't a camel - I'm not sure it is still a quadruped. Let Incident Management focus on what it is there for, restoration, and leave all the clutter out.

Thanks Aale for drawing our attention to that - I had missed the V3 bastardisation. (I don't agree with Aale that the authors didn't know V2. I just think there were too many cooks spoiling the camel ...er... whatever).

Recently in a comment I published my (humorous) list of classes/types of Requests. I realise now I didn't have "Fault".

Certainly, failure of a CI is something that must be responded to, but if it hasn't impacted a service yet it isn't an Incident.

Actually it is a Problem. If we are lucky enough to have found a busted CI without some Incident drawing our attention to it, then arguably we could just go ahead and open a Problem ticket and be done with it.

But sites might want all Requests/Tickets/Issues/Incoming to go through the same initial recording/gating process before spawning a Problem, in which case it is a new type of request, a Fault. So I will post a new list of Request types. I will give it a thread of its own.
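The rule argued for above can be written down in a few lines. This is only a minimal sketch of the post's position, not anything ITIL prescribes; the `Ticket` fields and the `classify` function are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """Hypothetical record for something arriving at the initial recording/gating step."""
    summary: str
    ci_failed: bool          # a configuration item has failed
    service_impacted: bool   # users see an interruption or reduced quality


def classify(ticket: Ticket) -> str:
    """Apply the post's rule: impact on service decides, not CI failure."""
    if ticket.service_impacted:
        return "Incident"
    if ticket.ci_failed:
        return "Fault"
    return "Other Request"


mirror_disk = Ticket("one disk of a mirror set failed", ci_failed=True, service_impacted=False)
mail_down = Ticket("email is unavailable", ci_failed=True, service_impacted=True)
```

Under this rule the mirrored-disk example from the V3 definition comes out as a Fault, where the V3 text would call it an Incident.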

Comments

Redundancy is something

If a service is supposed to be redundant, my expectation is that the redundancy is part of what is considered "normal" operation, not to mention I am likely paying more for such redundancy. A failure of a redundant component must be considered an incident. It would not be a Major Incident in most cases because the service is still available, but it certainly merits the visibility and tracking of an incident for multiple reasons.

If the redundant service is truly in "normal" operation, there would not be an elevated risk of a service outage. Quite literally I am paying for the overall reduction of risk of a service outage with a redundant service.

122,000 That's how many

122,000

That's how many angels can dance on the head of a pin by my reckoning.

Incident management is about restoration of service.

Incident management includes interaction with users.

Let's stick to basic principles and stop trying to make ITSM into something arcane.

A failed redundant device is clearly a Problem in the current ITIL taxonomy, as it will potentially cause an incident.

If we need a front end entity to track it before we determine it is a true Problem, then we need a Fault because by any common sense measure it isn't an Incident.

Why is this so hard?

Okay, I've followed this,

Okay, I've followed this, and other related threads for some time now, and reached the point where I need to say something.

I don't profess to understand anybody's motives for sustaining ITIL (and its ilk), but at the end of the day -- it's defunct. I did research this for a while -- but then rapidly lost interest --

-- and this, IMHO, represents the *fundamental* difference between now, and the old world of ITIL and similar. Why is so much time being spent trying to improve the steam engine, when the rest of the world -- quite simply -- has moved on?

It doesn't matter what maturity level you're at, how well you comply with standards X or Y. It's the consumers who tell you how well you're doing by voting with their feet.

Ask RIM. Ask Nokia. Anybody who thinks they will survive over the next 18 months is seriously deluding themselves.

The evidence to support all of this is out there in the wild. I don't need to justify my position, because the results (sales figures, % market share, etc.) are public knowledge. So in an age where the consumer can impact the share price of Organisation X via Youtube, for instance -- what possible relevance has old-style stuff like ITIL got? Nothing. There is no justification for it whatsoever.

I've been around this industry for longer than I care to remember. But we're talking the early days of Netcool (I was the 3rd Netcool user on the planet), and I've grown with (and out of) that whole world of -- what's that famous buzzphrase? -- people, process and technology.

Today, like it or not, is completely consumer-driven, and no amount of internal process is going to improve or change your relationship with your customers. I mean, at the end of the day, who cares if you're ISO27001 or whatever compliant? If you deliver on your promises, and your customer base like (+1, whatever) what you do, then what else matters?

There are outfits out there that get it (no names, no pack drill), and their ideas are already being adopted by globally known brands. How do they implement these new ideas / approaches / technologies? They simply turn them on, and watch what happens.

This is the only way to Get Things Done these days. It's the only way to avoid having dinosaur consultants crawling all over you with entirely inappropriate ideas for the 21st Century, derived from companies who only wish to preserve their old business models because they have no idea how to compete with what's already happening around them.

Discuss.

hmm

"no amount of internal process is going to improve or change your relationship with your customers"

Internal processes - IT or otherwise - are what make or break your relationship with your customers

Internal processes != complying with a standard

not far enough

Yes, it was a dumb thing to say, but you don't go far enough. One's relationship with one's customer(s) depends on several factors, one of which is compliance.
Others are performance, quality, fit for purpose, user satisfaction etc
Successful and continuing delivery of every single one depends on good internal practices (I hate the word processes)

Yep you're right, but... you're not

I've been in organizations that couldn't spell I T I L that somehow were quite functional. Doing an assessment of their capabilities, I made an interesting discovery. What they were doing, without knowing anything about ITIL, and which they had figured out either by vicarious learning, trial and error, or experience transferred from new hires, aligned quite nicely with the practices ITIL describes. If ITIL is so bad and passe, how do you think that might have happened?

Oh, they get things done! :-)

The central driver in each organization was variants on a single question: "How do we provide exceptional value to internal users, customers, and partners?"

Btw, all of this is within the last 18 months.

Hmmmmm.....

David

Buzz off

I'd love to discuss but

(a) wrong thread. Try here or here or here or here or...

(b) I'm tired of debating with the anti-ITIL crowd.
You think you can deliver good services to customers without defining them?
You think you can deliver good services to customers without having your shit together within the system?
You think you can manage a multi-provider environment without owners for multiple practices including Supplier, Risk and Service Levels?
You think there is no room for doing improvement in a dynamic environment? No room for change control?
You think you understand service management in general and ITIL in particular when you call it "dinosaur"?

Great. Buzz off then and do so and quit bugging us.

ITIL needs help

Skep - come on. I'll give you some of the critics would struggle to come up with any form of alternative - but ITIL does need help. Any 'framework' would.

For example, it does NOT explain how to define a service. Nor does it properly reflect how one is brought to market and rolled out. Not in terms generally used by product managers trained in this anyway. Can I cite unbundling, bundling and positioning statements as a start? All absent. Incident management may 'restore a service' (assuming we know what a service is), but it fails to address run of the mill complaints management. Its request fulfillment 'process' is at best light and does not explain how it should work with a service request or service catalog. Problem management is half assed, and change does not walk you through the development of a change schedule, management of change across multiple locations (useful for cloud), or how to assess risk using concepts and methods generally used by risk managers.

This said, it does contain a lot of useful guidance and does provide a solid framework. But it needs help to truly be adapted into a much wider 'service management' program. Like you, I'm tired of anti-ITIL folks - especially when they moan and groan without offering improvements or alternatives. It's not ITIL - it's these folks and others who regard ITIL as the only source of all knowledge.

As for service management - you know my view on that - ITIL is not a good source and it's certainly not definitive. There are more than 150 non-IT books I use to define the term and its source is product management/marketing. In fact ITIL now says that itself. So who is out there propagating the belief ITIL is service management? Those doing it are misrepresenting ITIL and frankly damaging the brand...

preaching to the choir

Hey, you know I agree with all that Ian. ITIL isn't perfect, needs plenty of improvement. Quit preaching to the choir :)

That's not the same as ITIL is useless or we don't need itsm.

Things that lack names?

Great post, great discussions ... allow me though to reiterate my preference for having participants identify themselves somehow :(.

I nearly replied to three or four of the sub-threads, but I'll summarise* my thoughts in one.

Regardless of what ITIL says or what the risk management community says, we have a bunch of concepts that can be split up to various levels ...

* Disruption of the current value of service to user (this is the undisputed part of ITIL's Incident)
* An abnormal condition in some part of the managed environment (infrastructure, applications, their configuration, ...) (I think this is close to "Known Error")
* An abnormal symptom not yet identified (fuzzily worded, but I think it's a concrete concept - close to "Problem")
* An identified risk of an abnormal condition emerging in the future (a "Problem"? or is this where ITIL fails to cover Risk? It doesn't properly match ITIL's definition "cause of one or more incidents")
* An identified risk to the future value of service to the user (a disputed part of "Incident"? a "Problem"?)

By abnormal I mean different from what it's supposed to be. This "supposed to be" concept crops up everywhere: a CMDB that doesn't cover it is useless. An SLA that doesn't cover it is useless. A monitoring tool that doesn't cover it is useless. And I think it's central here. Some "thing", some attribute of some CI, is not what it's supposed to be or is at risk of not being so.

If the CI is a user service, we have an Incident. Or a potential Incident. Some other CIs/attributes are almost certainly not what they are supposed to be (or we may not yet have learnt what they are supposed to be). We have to find them: root cause analysis.

If the CI is not a user service, we have something (now or soon) that may (or may not) impact user service (or other controls on the system). This is where we have too many concepts and not enough terms, and therefore we have confusion between incidents, events, problems, risks and "go away all you ITSM people, this is technical, we'll fix it".

* Yes, this was a summary

a great analysis

What a great analysis!!!

I think you missed one off your list Joe: the actual concrete classical Problem: an identified fault/break/mis-configuration/... that needs to be risk analysed, prioritised, and (usually) fixed.

Yes, there are all sorts of shades and gradations and variants. One has to be careful to not bag-and-tag too many of them, as I may arguably have done with Request. I don't think I have over-analysed there and I don't think you have here: the distinctions are useful. However when one comes up with lots of variants it always suggests to me that one is trying to break the stone along the wrong axis. A crystalline rock has a natural plane of fracture where it comes cleanly apart into two crisp pieces. When one finds that natural plane of fracture, a thorny issue comes apart into a few nice clean pieces.

I feel ITIL's distinction between Incident and Problem is one of those. For me and many others in the 90s, this was an ah-hah! lightbulb splitting of the two concepts that seems facile now but was novel then. Now that we mature and analyse more deeply, we can see that within Incident and within Problem there is more to it. But we aren't breaking Problem cleanly along a new fracture line, we are busting it up into fragments. Likewise with Request.

Am I making sense and am I right? Or is it that Incident, Problem and Request cannot be cleanly fractured further and can only be rendered into gravel? :)

I was hoping you would actually

Answer one of these using your taxonomy rather than continuing to use ITIL.

Since you didn't, I thought I'd give it a shot...

http://www.itskeptic.org/itil-v3-incident-definition-camels-and-committe...

This would be a Fault. From there perhaps it would go to "Problem" (why isn't Problem part of this taxonomy? Shouldn't it fall under "Support" next to Fault?) and then perhaps Change. Hopefully it would skip Incident altogether.

So, in this scenario - no matter what it would be under the "Support" bucket - so, we can say that it is a "Support" issue if: the service is impacted (Inc), the service *isn't yet* impacted and the CI failure has been detected.

Problems are either of known or unknown causes. Known causes are Known Errors. Unknown causes are...Problems? The Fault is assigned to Problem Management where it is either KE or Problem. So a definition of a Problem can be: An unknown cause of one or more Incidents or Faults. Known Errors being: A known cause of one or more Incidents or Faults.

Why don't we just call it Error Management and have Known Errors and Unknown Errors? Just a thought.
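To make the Known/Unknown Error idea concrete, here is a minimal sketch of a single record type whose label depends only on whether the cause is known. The `ErrorRecord` class, its fields and the ticket ids are all hypothetical, made up for illustration; nothing here comes from ITIL or from any tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ErrorRecord:
    """Hypothetical record owned by Problem (or 'Error') Management."""
    description: str
    root_cause: Optional[str] = None  # None means the cause is still unknown
    linked_tickets: List[str] = field(default_factory=list)  # Incident or Fault ids

    @property
    def kind(self) -> str:
        # Known cause -> Known Error; unknown cause -> Problem (Unknown Error)
        return "Known Error" if self.root_cause else "Problem"


record = ErrorRecord("nightly batch server keeps hanging")
record.linked_tickets.append("FAULT-1")   # detected before any service impact
kind_before = record.kind                  # cause not yet known

record.root_cause = "memory leak in scheduler agent"
record.linked_tickets.append("INC-7")     # a later Incident tied to the same record
kind_after = record.kind
```

The same record starts life as a Problem and becomes a Known Error once the cause is filled in, and both Faults and Incidents can link to it at either stage.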

So, from Problems/Unknown Errors or Known Errors we can move to Change Management and hopefully (with successful implementations) reduce Faults/Incidents.

I think I got that one alright.

http://www.itskeptic.org/itil-v3-incident-definition-camels-and-committe...

This one is again Fault. It is a detected CI that is "about" to cause an Incident. Follows like the above example.

http://www.itskeptic.org/itil-v3-incident-definition-camels-and-committe...

The bomb...I'd say...Fault! It speaks to the "imminent failure of a CI" (one can presume it is still Fault even if it is more than a single CI that is under threat).

It does get interesting here though. Would it naturally go from Fault to Unknown Error/Problem? Would we go from there to a BC plan? Or do you go from Fault to "Risk Management" to BC plan (or perhaps Problem to disengage the bomb). My gut says we go to Problem first. Well, actually, my gut says since I am in the same building as the DC that I don't quibble about details like this, and busy myself calling my wife as I quickly make my escape.

http://www.itskeptic.org/itil-v3-incident-definition-camels-and-committe...

1) Theory: This would be a Fault. It would flow to Problem and so on.
2) Real Life: Unauthorized change, Fault, Known Error (why try to restart it, unless this is a prescribed action?), Problem (it turned out the prescribed action didn't work...so now it is an Unknown Error).....then Friday we have Incidents tying to the already opened Problem.

http://www.itskeptic.org/itil-v3-incident-definition-camels-and-committe...

I would disagree with this statement. Using our handy dandy taxonomy - I'd say what is being described as a "Risk" is actually a "Fault" and Problems are (as described above) Errors with an unknown cause - not to be confused with Known Errors. Faults may be linked to Known Errors or Problems. To suggest that Faults (or Risks) can only be associated with Unknown Errors (Problems) is inaccurate.

http://www.itskeptic.org/itil-v3-incident-definition-camels-and-committe...

In order of the bullets...
* Incident (agree with Joe Pearson)
* I presume this has been detected and it is not causing an Incident. Therefore it defaults to Fault. If it is caused by a known error, it is of course linked to a KE.
* Like above, it is a Fault but this time linked to an unknown error - so linked to a Problem.
* How far into the future? What constitutes "imminent"? Could be Fault, probably then linked to a Problem. Or perhaps "Suggestion" that leads to Change? How are CSI items being captured? Through Suggestions? I could see maybe, this is a CSI item - Identified Risk: Hardware no longer supported in 6 months. Suggestion: Buy new hardware. Migrate to other existing hardware that is still supported. Move to the cloud!
* Suggestion? Proposal? Depending on the actual situation I suppose. If the value is degrading or will degrade at some point - what is your idea to "fix" that?

http://www.itskeptic.org/itil-v3-incident-definition-camels-and-committe...

It is a Fault. Leading to either Known or Unknown Error (Problem).

It seems like I've replaced the Incident Management bucket (http://www.itskeptic.org/itil-v3-incident-definition-camels-and-committe...) with the Fault Management bucket though.

Still, I think it is probably a worthwhile distinction to make.

Would you want your Faults to be higher than Incidents? Would you want most Incidents to be linked to KE and expect a higher % of Faults to be linked to Problems (Unknown Errors)? Would better Fault detection lead to fewer Incidents? Would this then link Fault and Event closely?

I think this is great stuff. Does anybody out there use anything like the IT Skeptic Taxonomy in real life?

Well, err hmm

"I think you missed one off your list Joe: the actual concrete classical Problem: an identified fault/break/mis-configuration/... that needs to be risk analysed, prioritised, and (usuall) fixed."

- This is pretty much what I meant by "An abnormal condition in some part of the managed environment". I was trying to use generic words and not reuse any of the terms we're trying to define. Unfortunately when you get too generic you lose all meaning!

On distinguishing concepts along natural fault-lines: yes, some distinctions are good, even great, and some not so much. My instinctive approach is to break everything up as much as possible (in a lab/dev environment of course!) and then see where unsuspected natural fault lines might have been. But don't drop the analysis if they don't seem obvious!

I think there are important, if not quite natural, fault lines not covered in the current terminology:
* affecting the users - breaching SLA (an incident) vs not e.g. due to being covered by a backup CI, not being in production, being within acceptable but not ideal parameters, ...
* affecting us now (an issue, in project mgt terms) vs potentially affecting us in the future (a risk)

These fault-lines may not be clear enough to earn the "best practice" label, but people designing ITSM capability that they hope to be robust had better at least give them some thought. (And stop assuming that ITIL has done all the work and is perfect grumble grumble.)

Should it matter how you subscribe to the definition?

This is an awesome thread! I can see both sides, but I personally subscribe to whatever makes sense for the business at hand.

I can also see how a faulty CI can be an incident: while it may not affect the services you provide to your customers, are you not a customer to a vendor, consuming their services/products/software?

The "root cause" of this debate is obviously between what is ITIL, and what is practical. It may be ideal to have more processes to handle service impacting versus non-service impacting, but I bet when push comes to shove- the wallet will win. It is cheaper to tool a single flexible process with an emphasis on service incidents than to have to tool two different fault management processes.

Many IT organizations don't have the cash to purchase a successfully implemented Service Level Monitoring integration solution, or an effective CMDB. Without great insight into what and where an impact is, risk management wins out.

The "if you're not sure, open a ticket" behavior will surely win, especially in the more fascist operations.

Rage Against the Integrated ITSM Machine

I started this as a simple reply and got a little carried away....I'm really not a radical. :)

The separation of Event and Request Fulfillment as separate processes in ITIL v3 was appropriate and needed in my opinion. However, further separation of 'non-service impacting' Incidents into a separate process (if that's what you were thinking) may not be such a hot idea.

Clearly separation of Requests made sense, and I was extremely pleased to see an Event Management process as well. In fact, I think more discussion of how to automate Event Management would benefit many clients but this may not play into the big gorillas' '5 year plans' for their integrated ITSM machines...

In most cases the number of events coming out of the infrastructure is very large....too large to deal with without some automation. How you automate this, and at what cost, is a pretty big deal. In the example of a failure of a CI that does not (yet) impact production, obviously this needs to be dealt with. The only question is one of priority, i.e., this may be a lower priority Incident than those that are service-impacting.

The bigger issue is diagnosing which layer of which component is the source of an anomaly. When hardware breaks, perhaps not a big deal; but in the increasingly virtual spaghetti of today's service infrastructures with weird performance-related anomalies that's a very big deal. You just don't see the major players focusing their 'integrated ITSM machines' on that (they'd rather focus on workflow). I suspect because they do not have a very good solution...

While I'm on this rant, I'll also state that the idea that we can effectively automate the Event Management process in a service-oriented manner without incorporating applications is another cop-out. You need to identify every layer of every component in an end-to-end service, learn the norms of all collected measurements at any point in time, and automatically isolate which layer of which component is the source of an issue (or Incident, or Event, or Problem, or whatever the hell you want to call it).
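As a toy illustration of "learn the norms, then isolate the layer", the sketch below scores each layer's current reading against its own history and reports the one that deviates most. It is a deliberately simplified assumption-laden example: the layer names, the sample figures, the `most_anomalous_layer` function and the z-score threshold of 3.0 are all invented here, and a real end-to-end service would need far more than one measurement per layer.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional


def most_anomalous_layer(
    history: Dict[str, List[float]],
    current: Dict[str, float],
    threshold: float = 3.0,
) -> Optional[str]:
    """Return the layer whose current reading deviates most from its own norm.

    Deviation is measured as a z-score; None is returned when no layer
    exceeds the (hypothetical) threshold.
    """
    worst_layer, worst_score = None, threshold
    for layer, samples in history.items():
        spread = stdev(samples)
        if spread == 0:
            continue  # no variation recorded, so a z-score is undefined
        score = abs(current[layer] - mean(samples)) / spread
        if score > worst_score:
            worst_layer, worst_score = layer, score
    return worst_layer


# Made-up response-time style samples per layer
history = {
    "network": [10.0, 11.0, 9.0, 10.0, 10.0],
    "database": [50.0, 52.0, 48.0, 51.0, 49.0],
    "application": [200.0, 210.0, 190.0, 205.0, 195.0],
}
current = {"network": 10.5, "database": 90.0, "application": 207.0}
suspect = most_anomalous_layer(history, current)
```

With these figures only the database reading sits far outside its norm, so that is the layer flagged; the other two stay within normal variation.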

And oh by the way, you'd better be able to easily give relevant stakeholders an intelligent view of what's happening. The days of the operations bridge lighting up like a christmas tree are still with us, but it is killing any opportunity to significantly change the tribal culture of most IT organizations and it certainly doesn't address key issues for the Service Desk and Incident Management any more than re-defining "what is an Incident".

Finally, some promote the idea that we can magically collect data from hundreds of sources and funnel them into an 'engine' to sort things out. Maybe for base level 'infrastructure', but what happens as you incorporate applications into the mix? You'd better consider that UP FRONT, since without an application I don't see how you will EVER get to ITIL's definition of a service, and certainly not business service management (which is the end game). People who think they can effectively segregate applications from infrastructure and achieve business service management, without some overall end-to-end and top-to-bottom perspective, will be disappointed.

I'm all for process and workflow automation. It IS important, but I think in many cases it may be the wrong place to be investing increasingly limited IT dollars. In a world where survival goes to the fittest, I believe that there should be much more discussion about automation of Event Management, with the skeptics among us keeping the Bull___ to a minimum. It holds an important key to successful ITSM adoption and cultural change.

How long will it take YOUR ITSM Suite vendor to reach this nirvana?

How long? Not long, cause what you reap is what you sow
- RATM

John M. Worthington
MyServiceMonitor, LLC

complexity rules again

Whenever I hear people talk about processes, and then using ITIL terms, a shiver runs down my spine...

[1] please explain the difference between Event Management and Monitoring
[2] please show me the process that describes "the Monitoring Process" in ITIL
[3] try looking at the "Event Management Process" in ITIL, say it's a process, and keep your eyes dry.... and your nose from growing

Everybody is talking about customer interfaces (incidents, faults, complaints, service requests, etc. etc.) and then they're imagining a process behind it.
Why don't we use our common sense here?
Why not first think of the processes and then determine - as a consequence - what the interfaces would have to look like?
It would solve 99% of the issues in this discussion.
ITIL cannot be implemented in practice. It's a nice and useful reference framework of things you find in practice, but you need an implementation framework to get it going.

Don't expect too much from ITIL

[1] please explain the difference between Event Management and Monitoring
Event Management is the process of managing events. Monitoring is the process for identifying events.

[2] please show me the process that describes "the Monitoring Process" in ITIL
It doesn’t in the same way it doesn’t show you how to auto discover CIs and identify unauthorised changes. One day it would be nice if ITIL got to the real world level but right now it is at the framework level.

[3] try looking at the "Event Management Process" in ITIL, say it's a process, and keep your eyes dry.... and your nose from growing
ITIL is a framework of processes; don’t look to it to tell you how to do your job, but more as guidance on how processes within an organisation interrelate and a standard language for what these processes are.

You are right you cannot implement ITIL but you can implement a solution that adheres to the ITIL Framework. The benefit is you have a structure that you do not have to reinvent and you have a common language that is understood across the industry.

hmmmm Martin: [1] let me be

hmmmm

Martin:
[1] let me be more specific. ITIL quote: "Reactive Monitoring is designed to request or trigger action following a certain type of event or failure. For example, server performance degradation may trigger a reboot, or a system failure will generate an incident. Reactive monitoring is not only used for exceptions. It can also be used as part of normal operations procedures, for example a batch job completes successfully, which prompts the scheduling system to submit the next batch job."

Again: can you explain the difference with Event Management? Or even show the process in Monitoring?

[2] you say that 'the framework level' is not related to 'the real world level' ? Stupid me! I always thought it was best practice....

[3] tell that to the millions of believers. And maybe you can explain how security management is a process? or capacity management? or financial management? or continuity management? or knowledge management? or ....... And then please explain to me how these processes interrelate. Maybe you can deliver that ITIL process model we've all been waiting for, and which is several years overdue now.

And if the ITIL structure cannot be implemented, then why should I "not want to reinvent it" ?
As a comfort: the piece on the common language is largely true.

hmmm

1] Are you asking generally or being specific to infrastructure monitoring? I have always seen it as the way I described. Event Management, manages events and when necessary feeds the incident management process. Monitoring is a process to monitor the state as it is expected to be and generates events when this is not the case. This can be at the infrastructure level, application level or service level.
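The split described in this point can be sketched as two separate functions: one that only compares observed state with expected state and emits events, and one that decides what to do with each event. This is a minimal sketch under my own assumptions; the function names, the CI names and the rule "events on a user-facing service feed Incident Management" are hypothetical and not taken from ITIL.

```python
from typing import Dict, List, Set


def monitor(expected: Dict[str, str], observed: Dict[str, str]) -> List[dict]:
    """Monitoring: compare observed state with expected state and emit events."""
    events = []
    for ci, want in expected.items():
        got = observed.get(ci, "unknown")
        if got != want:
            events.append({"ci": ci, "expected": want, "observed": got})
    return events


def manage_events(events: List[dict], service_cis: Set[str]) -> List[str]:
    """Event Management: decide per event; only some feed Incident Management."""
    actions = []
    for event in events:
        if event["ci"] in service_cis:
            actions.append(f"raise incident for {event['ci']}")
        else:
            actions.append(f"log event for {event['ci']}")
    return actions


expected = {"web-service": "up", "disk-2": "healthy", "batch-job": "completed"}
observed = {"web-service": "down", "disk-2": "failed", "batch-job": "completed"}

events = monitor(expected, observed)
actions = manage_events(events, service_cis={"web-service"})
```

Here the batch job matches its expected state and produces no event, the failed disk produces an event that is only logged, and the service outage is passed on to Incident Management.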

2] I didn’t say they were not related, as clearly the framework relates to the processes and activities within it. But it is only best practice for the framework for Service Management and does not prescribe the actual day to day activities or the procedures needed to operate the service or the shape of the organisation required to do this or the tools (CMDB is an exception). This was the bit I had in mind when I said “real world”.

3] This is my point, ITIL names a set of processes and describes some of the activities within each, it only gives high level interrelationships i.e. an incident can trigger a problem. For a complete solution you need to design the actual service that fits with the requirements for your organisation. The flows for each process in the service model will show all the interrelationships. Security Management may be very different in your organisation to someone else’s but both can follow the same ITIL framework.

Exactly

Good points.

The thing that worries me most is that this is still the heart of the matter. We need infrastructure that is stable and does not have incidents. There has to be a good monitoring process, incidents and problems have to be handled. V3 makes a total mess of this area with the new definitions and badly designed processes. For example Event management is clearly an activity of the Monitoring process.

Another annoying V3 error is in the Problem Management process. The V2 distinction between Problem (you don't know why it is not working) and Error (you know but do not have the time or money to fix it yet) is a good concept. These are two different things and they need to be handled separately. One error can cause several problems or one problem may be the result of several errors.

It was clearly not the intention to leave this Error control out, they just planned to change its name but it seems that accidents happen. Those of you who have the book available, search for known error sub-process. You will find a lot of references to it but there is no such sub-process or activity.

Aale

forget about the problem Management process

The PM process imho is a relict from ancient times when ITIL indeed was technology/infrastructure focused. Others might say that PM is the proof that ITIL still is infrastructure focused...
Why not forget about this PM process and exchange it with what has been developed in so many other disciplines? I mean Risk Management. Is there anyone who can explain the difference between "an ITIL Problem" and a Risk?
And have you any idea about the state of development of the field of Risk Management out there? There is plenty of totally useful material out there that exceeds the value of ITIL PM by a factor of 100+.
It's strange to see how even most people in this Skeptic blog platform think within the ITIL framework.... As if there is not a wealth of knowledge in other disciplines. And in case one cannot look over the border, try a peek at MOF: it uses Risk instead of Problem. This MOF information is totally free. Just like so many other valuable Risk Management sources. Unlike ITIL - where you have to pay a huge amount to get your hands on material that - in cases like this - is isolated and outdated. ITIL is a useful reference model with valuable guidance for practical problems (note the small "p"), but you need to pick from that, mix it with other guidance, and apply it in your own management system.

Difference between a Problem and a Risk

This is easy: a problem is something that already exists and is having an impact, so the probability is 100%. A risk is something that has not occurred but has a probability of occurring and causing an impact, so the probability is anything under 100%.
A key reason PM is so important is to eliminate the root cause and therefore ensure new incidents do not occur as a result of the root cause still lurking. ITIL focuses a lot on recording the known error from the problem but this really should be a secondary measure to speed up resolution of incidents where the root cause wasn't properly eliminated (still important). PM reduces incidents and in turn reduces problems.
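To make the 100% versus under-100% distinction concrete, here is a minimal sketch. The `Condition` record, the 1-to-5 impact scale and the `exposure` calculation are hypothetical illustrations, not anything ITIL or MOF prescribes.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    name: str
    probability: float  # chance that the impact occurs, from 0.0 to 1.0
    impact: int         # hypothetical scale: 1 (minor) to 5 (severe)


def classify(condition: Condition) -> str:
    """A condition that is already having its impact is a problem; anything less certain is a risk."""
    if not 0.0 <= condition.probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    return "problem" if condition.probability >= 1.0 else "risk"


def exposure(condition: Condition) -> float:
    """Simple probability-times-impact weighting, usable for ranking both risks and problems."""
    return condition.probability * condition.impact


lurking_root_cause = Condition("root cause still in place", probability=1.0, impact=3)
possible_failure = Condition("ageing disk array", probability=0.5, impact=4)
print(classify(lurking_root_cause), classify(possible_failure))
```

Seen this way, a problem is just the limiting case of a risk, which is why the same prioritisation can be applied to both.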

McEvoy - (I'm becoming a

McEvoy -

(I'm becoming a fan.)

Agreed. Risks represent future problems that have not yet resulted in impacts. Problems are risks that were not mitigated and thereby generate impact.

Great thread..... I also

Great thread.....

I also liked v2's differentiation between Problem (unknown) and a Known Error (known).

There is no monitoring process. Monitoring is a broad-based activity that occurs across all stages of the service lifecycle -- Event Management should establish the policy and procedures for proper implementation of the monitoring activities (specific to infrastructure monitoring, although we monitor budgets and other metrics too --- will have to check the Good Books about how Event Mgt comments on this... I don't remember right now. From what I recall Event Mgt was pretty infrastructure focused).

Really like the comparison with Risk; gets you outside the ITIL box nicely! Problems ARE risks, and part of Prob Mgt seems to be risk mgt. Of course, once we have a known error, and we determine the risk of living with it is unacceptable, then we have an Incident....

:)

John M. Worthington
MyServiceMonitor, LLC

Depends what you do

If you sell certification training you have to work within the framework. In consulting one has more freedom.

It is the infrastructure people who are interested in ITIL. Higher management uses Cobit, application people want CMMI etc.

I have been recommending MOF but there has been no reaction so far.

As an involuntary Vista user, I would say that it is still the infrastructure that is the problem. Yesterday I wasted a couple hours of working on a problem before I found out that it is a known Vista error.

Aale

The little itSMF handbook

The little itSMF handbook called "An Introductory Overview of ITIL® V3" (don't have the regular books with me at the moment) pg 30 says "An incident is an unplanned interruption to an IT service, or a reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident."

Who among us would NOT invoke IM if we learn (from event monitoring tools or some smart, on-the-ball human) that a runaway thread, increasing CPU utilization or any other failure or deteriorating condition is ABOUT to cause an outage?

Don't confuse Incident and Problem Management

I wouldn't. What you mean is "Who would not invoke Problem Management?". A number of commenters on this thread are confusing Incident and Problem Management. Incident Management is about restoring the service, not fixing broken CIs. Folk seem to think Incident Management is dealing with anything urgent. It is not. Stop putting everything urgent in the Inc Mgmt bucket.

If the service ain't impacted yet, it is not an Incident. It is a very serious and urgent Problem. To define the correct request procedure before a Problem is spawned, call it a Fault. The procedures for dealing with a Fault and an Incident are quite different. For example, end users need never know about a Fault. In some policies even business service owners need not.

Careful Skep

Hmmm, so let me ask the archetypal use case:

Imagine you have 3 servers in an active cluster. Server 1 goes down.

Is this an Incident, Fault or Problem?

trick question?

Easy: service is (I assume) not impacted, i.e. service levels are still being met. So it is a fault leading to a problem to be resolved. You speak as if this were a trick question.

trick question?

Well, this is a MAJOR incident in my house. Service Level not impacted, but it MIGHT put an end to our business if something else goes wrong.

In my ITIL books it is definitely not a problem, since a problem is "A cause of one or more Incidents". Chicken or egg? OK, maybe there is an underlying cause for this cluster server crash, but we don't care right now; we first want to resolve an incident.

you got me

Aha you got me Doc. I'm using the intuitive Problem definition rather than the literal ITIL one. You are right. Problem is defined in terms of Incident.

But I stick to my basic premise. Inc Mgt is about restoration of service. It is user focused. There is no service interruption, there is no impacted user. If Inc Mgmt is truly a process (see JVB's comments) then this is not in scope. If Inc Mgt is just a grab bucket of anything that makes us run around, then it is.

Here is the key flaw in your

Here is the key flaw in your reasoning (and what frustrates business units):

“…service is not impacted...”

Are you certain? And if you are correct, for how long?

A service only exists during the act of consumption. Since it is generally intangible, what matters is its impact as *perceived* by the customer, not the provider. This might sound like useless theory but it's key to understanding the problem. While managing infrastructure requires a focus on operational availability, managing services is centered on customer perceptions.

The challenge is to derive the operational expectations of this perception and manage accordingly. Your instinct was to judge the impact solely in discrete (and static) availability terms. Is the service up or down? (Actually, you’re probably just keying in on the availability of the application. Even worse.) In most instances, this is woefully inadequate.

For example, look through the lens of Warranty:
- Could a hacker have caused the server to go down? (Security)
- Does the loss of 1/3 computing cycles trigger congestion? (Capacity)
- Does the loss of an active node impact contingency assurances? (Continuity)

Each dynamic has the potential to affect Utility:
- Did the impact to security place business data at risk? (E.g. legal/regulatory compliance)
- Did the impact to capacity impact the service’s fitness for purpose? (E.g heavy end-of-quarter processing)
- Did the impact to continuity place business operations at risk? (E.g. Contract/Shareholder obligations)

Would not the customer consider these dynamics as something more than a “grab bucket” designation? When the coal-face set the impact criteria based on application availability, they are no longer managing services; they are managing infrastructure – and you know how that movie ends.

A fault is an abnormal condition that requires action to repair, whereas an error is a single event. A fault is usually indicated by failure to operate correctly or excessive errors. It can arise from a threshold violation, state change, or a receipt of event information. It is important, as rapidly as possible, to determine where the fault lies and remediate. Taking a fault-centric view, however, ignores Performance, the measure of *how well* something is working over time.*

Event management refines instrumentation data (fault, performance, etc.) into those that require further attention through the Incident process; workflow, decision making, and information flow. (JVB and I are on different planets when it comes to process definition criteria.)

While the line between instrumentation and event management can vary, the goal remains the same: create usable actionable information, preferably in the context of a service’s Utility and Warranty. A model found in leading IT shops.

Hence ITILv3’s definition of Incident. Failure of a CI that has *not yet* impacted service is best categorized as an incident.

*I'll take a moment to plug a vendor who understands this: Netuitive. Their use of multivariate regression models for operational monitoring focuses on the dynamics of fault and performance mgt, an interesting component of a predictive operations framework.

incident management is about restoration of service

I find your assumptions patronising. "Your instinct was to judge the impact solely in discrete (and static) availability terms". Says who? I'm perfectly well aware of the subtleties of availability, and about warranty etc. All I said was if the fault is not impacting service... You then took a simplistic turn that I hope I never implied. The very ITIL definition says we are discussing a condition that has not impacted the service. They don't say impacted availability. Or are you saying anything that we think *might* be impacting service in any way is an incident? If I follow that argument to its logical conclusion then any condition detected by event management must be treated as an incident until proven otherwise.

Do you or do you not accept that incident management is about restoration of service? When we dump all these other activities into it then we fall into the trap JVB talks about: we are no longer talking about a process, we are talking about a grab-bag of vaguely related activities and procedures. No wonder ITIL V3 is so muddy: committees have chucked everything into a small number of umbrella practices. They can't produce the ITIL V3 process model because it isn't processes.

- "Do you or do you not

- "Do you or do you not accept that incident management is about restoration of service?"

No, that's incomplete. The primary goal of the IM process is "to restore normal service operation as quickly as possible and minimize the adverse impact on business operations..." (SO, 4.2.1)

- "...are you saying anything that we think *might* be impacting service in any way is an incident?"

No. I'm saying there is a tendency to think too narrowly about "normal service operation".

For example:

"...if the fault is not impacting service..."

How are you making this evaluation?

the same way

How? Precisely the same way we make the evaluation of whether it IS impacting the service.

Why are we discussing "a tendency to think too narrowly about normal service operation"? I think you brought it up out of nowhere, because I don't think there was any evidence I did so. And I don't think trying to separate out problem solving crap from the core focus of Inc Mgmt has anything at all to do with such a tendency.

No matter HOW you define service operation (which includes provision of capacity and continuity mgmt etc...)
No matter HOW you detect and deduce service impact

If a service is impacted then it is an incident.

Inc Mgmt's job is then "to restore normal service operation as quickly as possible and minimize the adverse impact on business operations" of that incident.

If the service is not impacted (No matter HOW ...)

then Inc Mgmt has nothing to do.
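The two positions in this thread can be put side by side as triage rules. This is only a sketch: the `Report` record and the "fault" and "other" labels are hypothetical names for illustration, and neither function is taken from the ITIL books.

```python
from dataclasses import dataclass


@dataclass
class Report:
    description: str
    service_impacted: bool  # is any agreed service level currently being missed?
    ci_failed: bool         # has a configuration item failed?


def classify_restoration_only(report: Report) -> str:
    """The position argued above: only an impacted service makes an incident."""
    if report.service_impacted:
        return "incident"
    if report.ci_failed:
        return "fault"  # hypothetical label, handled outside Incident Management
    return "other"


def classify_itil_v3(report: Report) -> str:
    """The ITIL V3 wording: a failed CI counts even before service is impacted."""
    if report.service_impacted or report.ci_failed:
        return "incident"
    return "other"


mirror_disk = Report("one disk of a mirror set failed", service_impacted=False, ci_failed=True)
print(classify_restoration_only(mirror_disk))  # fault
print(classify_itil_v3(mirror_disk))           # incident
```

The two rules only disagree in the one case the whole debate is about: a failed CI with no service impact yet.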

So if I found a bomb at the

So if I found a bomb at the foot of a server, with a 5-minute timer, and then called it in to the service desk, it would be a Service Request.

It is not an incident because service is not yet impacted (insert pun here) -- at least for the next 5 minutes.

It would be logged as a fault (thus no need to raise an alert) and a Problem Record drafted with a Kepner/Tregoe analysis triggered. A workaround is then developed to keep any additional bombs from making their way into the data center.

At which point the outsourcing discussion takes on a new dimension.

Replace bomb with JCB...

...and you have a typical example we used to use in the classroom, even back in v1 days.

Yes you have a major incident that needs fixing, because if the bomb goes off or the JCB digs through the power cable then service will be disrupted to all users. Incident has both high impact and high urgency, hence high priority. I would probably default (and in some of my previous lives finding a bomb near a data centre was quite likely) to invoking the DR plan so that even if the bomb goes off there is no disruption to service. Is it a service request? Hmm, perhaps, if I have a tried and tested SOP for invoking and implementing the DR plan. My personal jury is out on that one.
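The "high impact and high urgency, hence high priority" step is usually done with a lookup matrix. A minimal sketch follows, assuming a three-level scale and a simple additive rule; the level names and the resulting 1-to-5 priority codes are illustrative assumptions, and any real matrix should come from your own SLAs.

```python
# Hypothetical three-level scale: 1 is the most severe.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Combine impact and urgency into a priority code from 1 (top) to 5 (lowest)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


print(priority("high", "high"))   # the bomb or the JCB: 1
print(priority("high", "low"))    # 3
print(priority("low", "low"))     # 5
```

Note that the matrix says nothing about which process handles the record; it only ranks it, which is the distinction the following comments argue over.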

Obviously my risk assessment, done as part of several capabilities my IT department has, will have identified it as a risk that needs mitigating, but now it has actually come to pass it becomes an issue.

But post event I will still do a separate problem review to find out what lessons have been learnt and to try and avoid it ever happening again, at all the sites I'm responsible for.

The old Microsoft source book for the help desk used issue in a different, but useful way, meaning a generic weakness such as poor HR procedures.

J

straightjacket mindset that urgent = incident

Visitor, you are completely locked into the straightjacket mindset that urgent = incident. NO IT DOESN'T. A bomb would be an urgent problem indeed, or as JVB says an urgent risk, or an urgent fadurkin' fault I don't care, BUT IT ISN'T AN INCIDENT.

Look, by categorising incoming requests we assign an appropriate process/procedure/whatever to each one. Incident management is customer-centric. It is about getting the customer's people back on the air with minimum impact.

An incident requires that we notify the business that their service is impacted. An incident requires that we refer to the SLA to see what performance is expected of us. Incident management calls in Level 1 to determine if there is a problem. Etc etc. If a service is not impacted, if we have proactively uncovered a problem before there is an impact, then none of that is required. It is an entirely different set of procedures to resolve a problem. An incident is managed by the service desk; problems usually are not.

Even when a failed CI DOES impact a service, it is not Inc Mgmt that fixes the failed CI. It is Problem Mgmt. Inc Mgmt just tries to minimise the grief for the customer and users until it is fixed. So what is the point of bringing Inc Mgmt into it if there is zero impact on customer or users?

Once again, the incident process/procedure is not the only way to deal with something urgent.

Insufficient

“Visitor you are completely locked into the straightjacket mindset that urgent = incident.”

No, the point was not that it was urgent. The point was that it was not normal service operations. This is the narrow tendency I refer to.

An existing impact to service is not required to invoke the need for an incident.

And, JVB, this is not an academic exercise. While an extreme case, it is one of many use cases taken from real world operations in industry. The line you and Skep have drawn around IM may be appealing conceptually, but it has shown itself to be insufficient in practice. It's been tried by me and many other practitioners. (And I have to ask, JVB, if this model you keep referring to has been validated in real world ops.)

Infrastructure operations for HP, IBM, EDS, GM and Toyota, for example, on the hook for billions of dollars of customer contracts or business operations, rely on the concept of “non-impactful Incident.”

Why? Because of non-normal scenarios (nowhere near as troublesome as a bomb) that have produced no impact, but cannot, should not, be treated as fault or problem. They require the invocation of an incident and all its attributes; business notification, customer centricity, SLA reference and L-1 support.

So I respectfully disagree.

can see where you are coming from

OK, I can see where you are coming from. There are indeed faults that one would want to let the business owner know about, and we do need to prepare for the potential incident by referring to the SLA. I struggle to see where Level 1 comes in, but I defer to your greater experience. I'd be shipping it direct to Level 2 as a Problem.

But there are no bulletins to go out, no users to be followed up, no multiple incidents to be linked, no need for call-centre operators to be briefed, no problem diagnosis, no workarounds .... a great deal of incident process is still irrelevant.

Whatever the current orthodoxy amongst the big boys, as represented by ITIL, I don't think that overrules asking the question: why doesn't incident management focus on restoration of service and leave other emergencies for other procedures? I think you cause no end of confusion by announcing to IT "we have a priority 1 incident here. No, put the phone down, no one is impacted yet".

of course if ITIL leaves no room for further improvement...

P.S. thanks for sticking with a long and robust debate - your input is much appreciated :)

This has been quite interesting discussion

Quite a refreshing discussion. I think that this shows that ITIL V2 and V3 are quite different at this basic operational level.

Here is how I see this Server scenario happening, first in theory and then in practice ;-)

1) Theory
On Thursday evening Operations notices a server down and reports to the Incident Manager, who decides that this is a major incident because, while the service is OK at the moment, it will not be OK on Friday morning. The IM assembles a Major Incident Team and alerts Problem Management. While the Major Incident Team starts preparing a workaround to ensure stable service, the PM team finds out that there has been a minor change where the server was upgraded, and decides to suggest an emergency change to revoke the upgrade. This is accepted and service is back to normal before midnight.

2) Real life
On Thursday afternoon Dave decided to do a minor upgrade on one of the servers before he started his long weekend. Assuming no trouble appeared, he could then upgrade the rest of the servers on Monday morning. He knows he should make a change ticket but who cares, everybody has started bypassing the processes. When Operations notices the server being down they restart it a couple of times but it goes down again. They try to contact the Operations Manager but the whole Management Team is having their monthly two-day Strategy Session and cannot be reached. The Function and Process managers run in circles but cannot decide who is in charge, so nobody does anything useful. On Friday morning the company launches an important marketing campaign but the servers crash and the campaign fails.

Aale

unbelievable

....how people still seem to be forced to reason within the system and refuse to look over the borders.... Haven't you learned by now that ITIL is an incomplete and inconsistent set of guidelines? Valuable only to those who have their own understanding of a real system? Doing ITIL by the book (in any of its versions) will only cost a lot of time and money and lead to little result.

I take a customer approach in this. For most consultants the opposite is the case: consultancy organizations often follow "the hour generator": they try to maximize the amount of hours (=money) spent at a customer site by maximizing the complexity of their solutions. ITIL is a very welcome instrument for these providers.

The bomb clearly is a THREAT to the infrastructure that we designed to deliver agreed services. As I proposed earlier, I would call that a risk, in this case a very clear risk. Risk management is simply the same as the "proactive problem mgt" described loosely in ITIL. Please let's be practical here and not lose ourselves in academic discussions. I've been an academic researcher for a decade and have some idea of the practicality of that approach.

Risks can be prioritized. This one obviously has a high priority (I presume serious impact) because it will most likely turn into an incident within minutes. So you do the necessary thing: you dismantle the bomb in (let's introduce a new term) "an emergency operations action", you invoke the continuity plan, you evacuate the data center, or whatever. Any of these actions runs along its normal procedures: (urgent) change, incident, operations request, or whatever you have in place in your procedures. Remember: the processes are just there to help you; they are not laws if they are not covering all conditions. Practice always needs to win over theory. Theory will never be able to cover all practical options, unless you follow very abstract descriptions - unlike "best practices".

I practice a model that would indeed completely cover this situation, in theory as well as in practice. It's not rocket science. You can easily come up with your own version if you spend some effort in looking over the borders.

Yes, it's a trick question;

Yes, it's a trick question; an example use case the authors had to work through. There are nuances to be revealed here that apply to many common situations. But I'll hold off on a response and see if anyone else has an opinion.

My opinion

I didn't see it as a trick question.
In pure ITIL this is an incident (failure of a configuration item that has not yet impacted service), but it could generate a problem if there isn't a recorded error (you used the word fault, which is the same in ITIL terms).

Keeping Pace with Complexity

As often happens with blogs, my rant probably got off of topic and onto my particular concerns. Sorry for the venting. Here's my best reply to your post...

[1] please explain the difference between Event Management and Monitoring

I think I understand where you're coming from, as ITIL v3 (Svc Ops) states, "Event Management is therefore the basis of Operational Monitoring and Control." The same pub goes on to say, "... monitoring is broader than Event Management. For example, monitoring tools will check the status of a device to ensure that it is operating within acceptable limits, even if that device is not generating events. Put more simply, Event Management works with occurrences that are specifically generated to be monitored. Monitoring tracks these occurrences, but it will also seek out conditions that do not generate events."

What I take from this is that your monitoring tools will (hopefully) isolate which event you need to focus on, but you will need an Event Management process to make sure that the monitoring tool is properly instrumented to meet the needs of the particular infrastructure/service. The traditional approach to monitoring, in technology silos, makes service-oriented Event Management pretty tough.
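A rough sketch of that split, for illustration only: the event dictionaries, the `kind` labels and the threshold values below are all hypothetical, not the interface of any real monitoring tool.

```python
# Hypothetical thresholds; real limits would come from your own service design.
LIMITS = {"cpu_percent": 90.0, "disk_percent": 85.0}


def filter_events(events, significant=("warning", "exception")):
    """Event Management side: act only on generated events that need attention."""
    return [event for event in events if event["kind"] in significant]


def poll_device(readings, limits=LIMITS):
    """Monitoring side: check status against limits even if the device raised no event."""
    return [metric for metric, value in readings.items()
            if metric in limits and value > limits[metric]]


events = [{"kind": "informational", "source": "backup"},
          {"kind": "exception", "source": "db01"}]
print(filter_events(events))                                      # only the db01 exception
print(poll_device({"cpu_percent": 95.0, "disk_percent": 50.0}))   # ['cpu_percent']
```

The first function only ever sees what a device chose to report; the second goes looking, which is the sense in which monitoring is broader than Event Management.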

[2] please show me the process that describes "the Monitoring Process" in ITIL

I don't remember seeing an ITIL process called 'monitoring' to be honest. If I said that I stand corrected!!

[3] try looking at the "Event Management Process" in ITIL, say it's a process, and keep your eyes dry.... and your nose from growing

I didn't say it was a process, ITIL v3 did.

As far as an implementation framework goes, all I'll say is I get frustrated when I see the traditional "bottoms up" CMDB heavy lifting; sometimes before a rudimentary catalog of services is even defined! I'm all for Change/Config and Svc Desk/Inc/Prob, but it just seems to me that with the right monitoring tool you can focus people on end-to-end services (even if they are not yet perfect) and service impacts, and begin to directly address the tribal culture of the IT organization.

It just seems that this should play a more important role in ITSM implementations, and the bigger ITSM Suite players do not seem to focus on this at all or do so in a way that dilutes the value of the monitor so much there is little value left.

Of course, in order to develop an implementation framework for a specific client, process first is always the rule (ITIL again). Then you can determine what tools, what processes, when, etc. Unfortunately for most customers it winds up being, let's go with vendor X.

What's wrong with doing a simple assessment, creating a simple service catalog, and starting with a monitoring tool to establish service-oriented Event Management? You could then lay the groundwork for an RFP to deal with the need for workflow automation if/when needed (and strike a better deal in the process).

Keeping pace with complexity demands more effective monitoring and event management.

John M. Worthington
MyServiceMonitor, LLC

generally speaking

Sorry John - this wasn't meant to go against what you said - it was just a general comment. Something I run into in my practice all the time. ITIL is quite unclear and inconsistent about what a process is. And as long as you don't really understand the "what", how would you be able to organize the "who" and the instruments you should use????
But then again - ITIL never claimed it would provide that insight in processes. It's just a set of practices - some of them are acceptable, some are not, some apply to your customer's environment, some not.
In the end, you'll need your own implementation framework, preferably easy to understand, cheap, fast, and effective. The recession should firmly enhance the chances of that approach...

PLEASE don't be sorry! You

PLEASE don't be sorry!

You have more water under the ITIL bridge than I do and I respect your opinion. I've been wrapping my head around ISO 15504 in the hope that in my lifetime we'll see the ISO 15504-8 PAM for ISO 20K. Should I hold my breath?

I certainly think the ISO15504 standard might be a good way to winnow out some of ITIL's process-related issues, no?

John M. Worthington
MyServiceMonitor, LLC

Camels and Committees

Nice try, Skep, but I think ITIL3 Inc. Management is actually done very well. Heck, it was ok in V2...

it is the definition I have a problem with

Doc,
I didn't say I don't like the process. It is the muddy definition I have a problem with, and specifically the "or could disrupt" bit (4.2.2).

ITIL defines an incident to be an impact on service or a .......

Not sure where you got your definition from - certainly not v2.0 or v3.0

You need to read the books

You need to read the books then. ITIL V3 Service Operation 4.2 (p46) or glossary (p234). Don't believe what they teach you in Foundation :)

I did, mine says....

An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet impacted Service is also an incident. For example Failure of one disk from a mirror set.

crossed wires

OK, I'm confused now. That is word for word the definition that I used...???

If you are referring to the Inc Mgmt definition, then that is the following paragraph on the page referenced (and also in the glossary).

We obviously have some crossed wires here...?
