We should create the problem record right up front in an incident

Submitted by skeptic on Mon, 2009-12-28 08:55

A BOKKED post three months ago drew a lot of attention. It was about the disconnect between Incident and Problem Management in ITIL V3 Service Operation. [See also the ITIL Wizard stirring the pot about Major Incidents] I've just discovered a response to that post which has popped my brain with its simplicity and clarity.

We shouldn't be opening problem records at the end of the incident resolution process, as implied by ITIL and assumed by most, including me. Why do we thrash around trying to diagnose root cause as part of the incident process? And why do we assume that the Major Incident process is based only on an incident record? We should create the problem record right up front. I got this point from Taruu (David Stucky):

Problems need to be logged anytime a previously unknown issue generates an incident. How can we tell whether an issue is previously unknown or not? Simple: by referencing the KEDB. If no match is located for a given set of incident symptoms, a problem ticket needs to be raised. Recall here that the KEDB does not only and literally catalog Known Errors, but problem reports and resolved problems as well.

The purpose of Initial Diagnosis is to forestall escalation and further handling of the incident if at all possible. Failing that, its purpose is to correctly escalate functionally. The primary tool used to prevent escalation, by enabling rapid resolution, is the KEDB. So, that check for a KEDB match also signals the most logical and efficient point at which the need for a new problem ticket might be logged.

Obvious really. Why doesn't ITIL say that?

You might not agree that a KEDB also includes problem reports and resolved problems. In which case initial incident diagnosis should include a match against both the KEDB and the problem database.

Either way, if this is an incident that we have not seen before, open a problem record right away. Don't wait for some arbitrary undefined point later on when you decide this might be something for the problem boys to look at.
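
To make the rule concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that symptoms are reduced to a set of keywords and that a "match" means every keyword on a stored record appears in the incident. The names `KnownRecord`, `find_match` and `initial_diagnosis`, the ticket numbering, and the matching rule itself are all hypothetical; neither ITIL nor Taruu specifies any of them.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_problem_ids = count(1)  # hypothetical ticket numbering


@dataclass
class KnownRecord:
    """An entry in the KEDB or the problem database (assumed shape)."""
    record_id: str
    symptoms: frozenset
    workaround: Optional[str] = None


def find_match(symptoms, *databases):
    """Return the first record whose symptoms all appear in the incident."""
    for db in databases:
        for record in db:
            if record.symptoms <= symptoms:
                return record
    return None


def initial_diagnosis(symptoms, kedb, problem_db):
    """Check the KEDB and the problem database; open a problem if nothing matches."""
    symptoms = frozenset(symptoms)
    match = find_match(symptoms, kedb, problem_db)
    if match is not None:
        return match, False           # seen before: link the incident to it
    new_problem = KnownRecord(f"PRB-{next(_problem_ids)}", symptoms)
    problem_db.append(new_problem)    # not seen before: open a problem right away
    return new_problem, True


kedb = [KnownRecord("KE-1", frozenset({"printer", "offline"}), "restart spooler")]
problem_db = []

rec1, new1 = initial_diagnosis({"printer", "offline", "floor3"}, kedb, problem_db)
rec2, new2 = initial_diagnosis({"vpn", "timeout"}, kedb, problem_db)
rec3, new3 = initial_diagnosis({"vpn", "timeout"}, kedb, problem_db)

print(rec1.record_id, new1)  # KE-1 False
print(rec2.record_id, new2)  # PRB-1 True
print(rec3.record_id, new3)  # PRB-1 False
```

The second VPN incident finds the problem record opened by the first one, which is the whole point: the problem record exists from the moment the mismatch is noticed, not from some later hand-over.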

It does require anyone working on the problem to refer back to the underlying incident to see what diagnostic information may have been found later, but it reflects reality: there should be two parallel activities going on, problem and incident, and so there should be two records from the start. Restoration efforts are tracked in the incident record; root cause fix efforts are tracked in the problem record. It makes perfect sense to me.

And if your response is "Of course. I knew that" then please tell me where it is written down. I've always had a fuzzier view that you create a problem record when you realise there is a problem. There also seems to be a view that a problem record is only created if an incident is closed without resolving the underlying problem, and it would seem that is what ITIL V3 says. This latest point from David makes it crisply logically clear that there is a point at which you realise there is a problem and it is early.

To hell with "adopt and adapt" or "everyone is different": this is an important fundamental concept that apparently has been around for ages and bloody well ought to be in the ITIL book.

P.S. It also implies that in the extended and heated LinkedIn argument recently, which we discussed here, everyone had it wrong. Crudely, there were two camps: (1) when a Major Incident is resolved, the root cause and its resolution should be documented in a problem record; (2) it is OK to document it all in the incident record (by implication because that is what we use all along). ITIL describes how Major Incident response should involve both Incident and Problem teams from the start. So there should be two records from the start too.

Published in The Skeptical Informer, December 2009, Volume 3, No. 10

Previous story: Should every major incident produce a problem record?
Next story: Merry Christmas, ITIL works!

Comments

Submitted by skeptic on Mon, 2009-12-28 21:06.

at end of incident model/script

Hank Marquis added one more (via twitter):
Also at end of incident model/script and no resolution=problem

Submitted by avallesalas on Tue, 2009-12-29 11:41.

Just my 5 cents

Let's act as a storyteller...

A long time ago in a city far, far away, I had a customer... and we started a small project that would help to set up an ITIL based ServiceDesk using an ITIL based ticketing system.

My mission was to have meetings with different managers and to help them to define the processes for Incident, Problem, Change and Configuration management.

My contact for Incident and Problem was the same person, so it was going to be quite easy to arrive at a consensus. When we started talking about "when to raise a problem", he was very strict: "Our incident team must apply the solutions that are documented, either by the vendors in their KBs (old times, without Google... he was talking about the TechNet CDs) or by our specialist teams (System managers, Application managers, etc.). So, if an incident has no previous documentation we will raise a problem, and in this way we will be sure that the specialists will document the solutions."

So he was talking about KEDB and incident model/script. Under this way of thinking, your specialist teams work on the solution just as they would if the incident had been escalated, and you also have a simple way to decide when to raise a problem. The specialists stay focused on solving problems and documenting them, so that they receive as few problems as possible.

I agree with Hank.

Antonio

Antonio Valle
G2, Gobierno y Gestión de TI
http://www.gedos.es

Submitted by skeptic on Tue, 2009-12-29 13:02.

When you run out of ideas, you have a problem.

..so you also agree with me Antonio. Getting to the end of the script is one specific instance of getting to the end of all our sources of possible known resolutions to the incident. When we run out of ideas, we have a problem.

Submitted by vinodka on Sat, 2010-01-02 08:17.

Functional escalation in Incident

hi Rob and others,

I used to take this interpretation from ITIL earlier regarding Incident and Problem.
If there is no 'known' cause and solution for an incident, hand it over to problem management for further investigation.
A flow diagram in ITIL V2 indicates this approach.

But then there is some confusion about this.
If that is the approach taken, then Incident management requires only a Level 1, and probably a Level 2, skill set to execute the IM process (as they are only applying known solutions).
But there is a functional escalation in Incident management where tickets are escalated L1-L2-L3 and probably even to the vendor and beyond.

So, my thought process was stuck on two possible approaches:

Approach 1: (as being discussed here)
a) If it is a known issue, the Incident is resolved fully using the 'known' solutions. No problem ticket is needed.
b) If it is not a known issue, raise a problem ticket for further investigation. Resolve the incident through a workaround (already known, or provided by problem management while they are working on the issue).

In such an approach, you do not require a multilevel functional escalation within Incident management - probably not beyond Level 2.

Also, one key assumption here is that you have two different sets of people working on Incident and Problem management.

Approach 2:

A multilevel incident functional escalation exists. The first objective is to restore the service. Until this is achieved, the incident goes through the functional escalation of Incident management. Once it is achieved, a problem ticket can be created as required, for further investigation and permanent resolution.
If you have a separate problem management team, then the problem ticket can be created earlier as well, even while Incident management is still going on with the objective of restoring the service.
If you don't have separate teams, then the same team carries out the incident management process until service restoration and then initiates problem management for permanent resolution.

I used to advocate Approach 2 more to my clients (in most of whose cases there were no separate teams).

The reason for the preference is that I used to feel that Approach 1 could put Problem management in some 'time pressure' (just like Incident mgmt) while Approach 2 might not.
I hope we reach a better clarity through the insights here.. A good discussion..

And.. Happy New Year to you all...

Vinod

Submitted by skeptic on Sat, 2010-01-02 09:13.

why would you wait a single minute longer

Yes, very good points Vinod. What fascinates me is that this clearly has NOT been thrashed out before, or if it has then ITIL is a big failure in communication.

Some companies apparently have such distinct incident and problem teams that they are in different buildings. I think that is a mistake. There are two activities going on here: incident resolution (read: service restoration) and problem removal. One is tracked in an incident ticket and one in a problem ticket. Of course the activities should go on concurrently: this problem is either causing many other incidents right now or potentially going to cause another one soon - there is of course a many-to-one relationship between incidents and problems. Incident is about putting out the fire, problem about finding the arsonist. Or in my favourite analogy, incident is bandaging the bite, problem is shooting the alligator.

So why would you wait a single minute longer to initiate the problem removal activity? But just as importantly, why on earth would the problem team work in isolation? There is a wealth of clues for them streaming in as the incident is worked on trying to restore it. And in most of the mere-mortal organisations I work with, there aren't enough resources to allow separate teams: they are the same folks.

The challenge is to try to keep both activities going at the same time with minimal interference. You can't just say "incident always takes precedence; problem has to wait until the service is back". The world is more subtle than that. I've also been in the situation where the Board said this must never happen again and it did. When it happens again, the top priority is getting the diagnostics to ensure there isn't a third time - careers depend on that more than they do on cutting a few hours off MTTR. Management have to make complex value calls constantly during ugly incidents. I think having two distinct tickets helps.

I still think there is a L1-L2-L3 escalation path for incidents. I've been on the phone to vendors asking "how the f*** do we get our system back up?". Service restoration still needs that technical depth.

Submitted by bunkermentality on Mon, 2010-01-04 14:51.

Viewing from the standpoint of a schema for a simple database

The comments here (and in the mammoth LinkedIn conversation) provide a surprising insight into the level of confusion, and almost religious zeal, about the relationship between Incident Management and Problem Management.

I absolutely agree with the line of reason which argues that all Incident Records must be linked to Problem Records, and this linkage must be established sooner, rather than later.

Does it help to clarify the argument by viewing it from the standpoint of a schema for a simple database? First an Incident Record is created. In database terms the objective is to link this to a known, or new Work-Around Record. When this is done, implying that the Incident is resolved, then the Incident Record can be closed. When a subsequent Incident Record is created (of the same type) then this must also be linked to the known Work-Around.

So how do we organise these known Work-Arounds? The obvious way is to link them to the Problem Record that describes the underlying problem that is causing the Incidents. Whether the root cause of the Problem is known (an ITIL Known-Error) or not is immaterial. Thus a Problem Record is linked to one-or-more Incident Records and also to one-or-more Work-Arounds.

The goal for Incident Management is to match Incident Records with Known-Errors as fast as possible. The goal for Problem Management is to determine the root cause, identify a “fix” and have this applied via Change Management and Release Management. Thereafter no further Incidents occur and the Problem Record and associated Work-Arounds become redundant.
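
For readers who prefer to see the schema rather than read about it, the following is one possible rendering using Python's built-in `sqlite3` module. The table and column names, and the choice to make `problem_id` mandatory on every incident, are assumptions made for this sketch; the comment above describes only the relationships, not a concrete design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE problem (
    id          INTEGER PRIMARY KEY,
    description TEXT NOT NULL,
    root_cause  TEXT              -- NULL until the root cause is known
);
CREATE TABLE incident (
    id         INTEGER PRIMARY KEY,
    summary    TEXT NOT NULL,
    problem_id INTEGER NOT NULL REFERENCES problem(id)   -- many incidents, one problem
);
CREATE TABLE workaround (
    id         INTEGER PRIMARY KEY,
    steps      TEXT NOT NULL,
    problem_id INTEGER NOT NULL REFERENCES problem(id)   -- many work-arounds, one problem
);
""")

conn.execute("INSERT INTO problem (id, description) VALUES (1, 'Print spooler hangs')")
conn.executemany(
    "INSERT INTO incident (summary, problem_id) VALUES (?, ?)",
    [("Floor 3 printer offline", 1), ("Finance cannot print", 1)],
)
conn.execute(
    "INSERT INTO workaround (steps, problem_id) VALUES ('Restart the spooler service', 1)"
)

# An incident reaches its work-arounds indirectly, through the shared problem.
rows = conn.execute("""
    SELECT i.summary, w.steps
    FROM incident AS i
    JOIN workaround AS w ON w.problem_id = i.problem_id
    ORDER BY i.id
""").fetchall()
for summary, steps in rows:
    print(summary, "->", steps)
```

Note that the `NOT NULL` on `incident.problem_id` bakes in the claim that every incident has a problem record, which is exactly the point the rest of the thread argues about.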

Arguments such as “which team does this?”, “is this L2 or L3?” etc. are kept to one side. Furthermore it also puts the Major Incident debate to rest, as the schema applies to all Incidents and Problems. Does this provide a simple way to address the confusion?

Paul

Submitted by skeptic on Mon, 2010-01-04 17:54.

The debate goes on

No. But thanks for trying.

I think everyone understands the schema. The debate rests on other issues

WHEN should a problem record be created?
Is there a problem record for every incident? What if the underlying problem was quickly and permanently resolved as part of restoring the service?
etc

Submitted by bunkermentality on Tue, 2010-01-05 11:00.

Everyone understands the schema (probably)

Your view is that everyone understands the schema. I hope you are correct. I don't recall seeing any schema-related discussions and so I'm not as optimistic as you.

Using the simple schema to address your questions.

Q: When should a problem record be created?
A: The Work-Around is linked to the Problem Record so the Problem Record MUST be created prior to the Work-Around being linked with the Incident Record, i.e. before the Incident is closed.

Q: Is there a problem record for every incident?
A: Yes there must be because the Work-Around is linked to the Problem Record.

Q: What if the underlying problem was quickly and permanently resolved as part of restoring the service?
A: If schema integrity is to be maintained then the Work-Around is always linked to a Problem Record. Integrity must be maintained to ensure that management information analyses are accurate.

Submitted by JamesFinister on Tue, 2010-01-05 12:16.

work around link to problem

An overall general purpose schema is just one of the things that I believe should have been delivered as part of ITIL a long time ago. But it is just one thing, it doesn't stand alone, it doesn't answer every question, and it has to be right.

I agree with Rob that a schema doesn't really help us here, in fact there is a danger of putting a horse behind a cart.

As to the schema you suggest:

The workaround (which is different from a fix) has an indirect link to a problem. We can apply a workaround without having an inkling of what the underlying problem/known error is, and there is no single right place in time to create a new problem record. I can successfully apply a workaround without having an understanding of the problem. Actually we can and do even fix something without going through formal problem management.

In theory all incidents have a cause, but in reality most incidents are linked to a predefined catch-all problem record such as "printer issue". Many incidents can have a workaround. The relationship between incidents and problems is a many-to-many one - via a join table obviously, because I would hate to encourage poor database design. If it was a one-to-one relationship you might question why we need separate records.
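
A small sketch of that many-to-many alternative, again with Python's `sqlite3` module. The table names, the `incident_problem` join table and the sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE incident (id INTEGER PRIMARY KEY, summary TEXT NOT NULL);
CREATE TABLE problem  (id INTEGER PRIMARY KEY, description TEXT NOT NULL);

-- The join table: one row per (incident, problem) link.
CREATE TABLE incident_problem (
    incident_id INTEGER NOT NULL REFERENCES incident(id),
    problem_id  INTEGER NOT NULL REFERENCES problem(id),
    PRIMARY KEY (incident_id, problem_id)
);

INSERT INTO incident VALUES (1, 'Cannot print'), (2, 'Print job garbled');
INSERT INTO problem  VALUES (10, 'Printer issue'), (11, 'Driver bug');
INSERT INTO incident_problem VALUES (1, 10), (2, 10), (2, 11);
""")

# How many problem records is each incident linked to?
problems_per_incident = dict(conn.execute("""
    SELECT incident_id, COUNT(*)
    FROM incident_problem
    GROUP BY incident_id
    ORDER BY incident_id
""").fetchall())
print(problems_per_incident)  # {1: 1, 2: 2}
```

With this shape an incident can exist with no problem link at all, or with several, so a workaround can be applied and recorded before anyone has decided what the underlying problem is.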

Submitted by bunkermentality on Tue, 2010-01-05 14:45.

Incident >- Problem -< Work-Arounds

I agree that the schema doesn’t stand alone, nor does it come first, as it complements the processes and role descriptions. I focussed on it in an attempt to bring a new angle of approach to an old challenge.

The schema structure I had in mind differs to the one suggested by James as mine has Incident Record >- Problem Record (one or more Incidents to one Problem relationship - excuse clumsy crows-foot symbol) plus Problem Record -< Work-Around (one Problem to one or more Work-Arounds relationship). This implies the following process.

1. Match a new Incident Record to an existing Problem Record. If a Problem Record does not exist then create a new one.
2. Apply one, or more, appropriate Work-Arounds linked to the Problem Record in an attempt to clear the Incident. If necessary, create further Work-Arounds and link these to the Problem Record.
3. In the Incident Record describe which Work-Arounds were attempted, and which one(s) succeeded. (Use schema links for efficiency here.)

Thus there is an indirect link between the Incident Record and the successful Work-Around(s).
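
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `signature` key used to match incidents to problems, and the `try_workaround` callback standing in for a human actually applying a Work-Around are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkAround:
    steps: str


@dataclass
class Problem:
    signature: str                                   # hypothetical matching key
    workarounds: list = field(default_factory=list)


@dataclass
class Incident:
    summary: str
    problem: Problem
    attempted: list = field(default_factory=list)
    succeeded: list = field(default_factory=list)


def log_incident(summary, signature, problems):
    """Step 1: match to an existing Problem Record, or create a new one."""
    problem = problems.get(signature)
    if problem is None:
        problem = problems[signature] = Problem(signature)
    return Incident(summary, problem)


def work_incident(incident, try_workaround, new_workaround=None):
    """Steps 2 and 3: try the linked Work-Arounds and record the outcome."""
    if new_workaround is not None:
        # A newly devised Work-Around is linked to the Problem, not the Incident.
        incident.problem.workarounds.append(new_workaround)
    for wa in incident.problem.workarounds:
        incident.attempted.append(wa)
        if try_workaround(wa):
            incident.succeeded.append(wa)
            break
    return bool(incident.succeeded)


problems = {}
first = log_incident("Floor 3 printer offline", "print-spooler", problems)
resolved_first = work_incident(first, lambda wa: True, WorkAround("Restart the spooler"))

second = log_incident("Finance cannot print", "print-spooler", problems)
resolved_second = work_incident(second, lambda wa: True)

print(len(problems), second.succeeded[0].steps)  # 1 Restart the spooler
```

The second incident never creates a Work-Around of its own; it inherits the one already linked to the shared Problem Record.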

In addition a proactive Problem Management process is needed to manage the Problem Records and Work-Arounds.

1. For each Problem Record review the volume and impact of related Incident Records and assess this against the costs of implementing a permanent fix. (An ROI case in effect.)
2. Management (the Service Manager?) must then decide on the merits of the return on investment case and approve or disapprove the cost of the fix.
3. Thereafter the support teams, via Change and Release Management, implement and apply the fix, thus clearing the root cause of the Problem and ensuring no further Incidents of the type occur again.
4. Whilst reviewing each Problem Record also perform duplication and efficiency tasks by removing duplicate Problem Records and duplicate and/or inefficient Work-Arounds.

In this way the schema structure supports both Incident Management and proactive Problem Management.
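
As a rough illustration of the ROI review in steps 1 and 2 of that proactive process, here is a Python sketch. Every figure in it is made up, and the payback-period rule and the twelve-month threshold are assumptions chosen for the example, not something the comment prescribes.

```python
from dataclasses import dataclass


@dataclass
class ProblemReview:
    name: str
    incidents_per_month: int
    cost_per_incident: float   # assumed business cost of one incident
    fix_cost: float            # assumed one-off cost of the permanent fix

    def payback_months(self):
        """Months until the avoided incident cost covers the cost of the fix."""
        monthly_cost = self.incidents_per_month * self.cost_per_incident
        return float("inf") if monthly_cost == 0 else self.fix_cost / monthly_cost


def rank_for_approval(reviews, max_payback_months=12):
    """Return the problems worth fixing, quickest payback first."""
    worthwhile = [r for r in reviews if r.payback_months() <= max_payback_months]
    return sorted(worthwhile, key=ProblemReview.payback_months)


# Invented sample data, for illustration only.
reviews = [
    ProblemReview("Print spooler hangs", 40, 50.0, 4000.0),
    ProblemReview("Legacy report timeout", 1, 100.0, 20000.0),
    ProblemReview("VPN drops", 10, 200.0, 6000.0),
]

for review in rank_for_approval(reviews):
    print(review.name, review.payback_months())
# Print spooler hangs 2.0
# VPN drops 3.0
```

The approval decision itself stays with management; a calculation like this only puts the candidate fixes in an order that can be argued about.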

Submitted by mbuzina on Mon, 2010-01-04 10:06.

The Problem is most often the organization or the process

Hi Rob, Hi Vinod,

-- a whole lot of text removed again --

Sometimes writing about something gets you thinking. I started writing in support of the L1-L2-L3 path for incidents, but while I did that, some ideas formed which need some more time. So let me use these famous words "I'll be back".

Marc

Submitted by skeptic on Sat, 2010-01-02 20:49.

points

James, I didn't say FINISH the problem resolution during the incident, I said start it. Still take the proper time and diligence to finish, but valuable insight and time will be lost if you don't start.

I too am in favour of having separate staff for incident L2-L3 and Problem but many organisations don't have the luxury. Also when you call up or call in a supplier they'll be giving incident and problem answers intermingled - both teams need to engage.

Vinod, there is great value in opening a ticket as soon as the task is known, not waiting until people are ready to service it. We need tickets in the queue so we can understand the pending workload and so we can prioritise the next task.

Submitted by JamesFinister on Sat, 2010-01-02 23:00.

Intermingledness

Rob,

I didn't say you did say FINISH problem removal during the incident, or soon after, but in many organisations there is a real pressure to do so, which is obviously counter-productive.

The intermingling of incident and problem activity is often inevitable when the same teams are involved, so we need to look at other approaches, such as "Friday afternoon is problem activity time".

Submitted by vinodka on Sun, 2010-01-03 04:00.

Not so easy to segregate and interface

Rob - I agree; I see that perspective and stand corrected. There is definitely a value in opening a ticket as soon as there is a need and that will also ensure you don't miss out on those later...

James,
I don't think it is that simple to allocate specific time for Incident and Problem management. What if Incidents of high business impact crop up during that time? (I am talking about situations where there are no separate teams.) You can probably do that for specific people in the team - an approach we have tried out, described below:

At a couple of places we assigned the problem ticket to the same person who worked on the Incident, once he had completed restoration (provided he had the skill set) - this helps in ensuring continuity of information flow from Incident to Problem. I am not sure this is possible in all cases. In another case, the support persons were allocated roles in Incident and Problem mgmt on a rotational basis, and they carried out the process accordingly. But in this case, unless you have enough Problem management work, you may not be able to dedicate a person to that.

These were all customized work-arounds that we applied to different situational needs - not sure if there is a framework of good practice for this. Could this be the reason it is not appearing in the ITIL documentation? ;-)

It is obviously ideal and easy if you have separate teams - but as Rob mentioned, most organizations don't have that luxury or can't afford it.

Vinod

Submitted by aroos on Sun, 2010-01-03 10:23.

Micromanagement

Vinod is quite right here. Different situations create different solutions. ITSM does not control corporate structures. For example, in business terms B-to-B is different from B-to-C, but it is possible to sell the same services to both using the same production capability. It makes business sense to split the company into two parts, one offering services to consumers and the other offering services to businesses, and both will want to have their own service desks. From a support point of view many of the incidents and problems are identical, but it is unrealistic to expect too much cooperation from the B and C business units. They are in competition for the same resources, and all decisions which change the structure have to be made at top level, which is usually impractical.

A lot of this discussion seems to me like a bad case of micromanagement. The great idea in ITIL was that there are two different processes to do when service goes down, IM and PM. How to organize this work depends on the situation and the nature of incidents, the type of staff available etc.

Aale

Submitted by JamesFinister on Sun, 2010-01-03 19:15.

A place for micromanagement

Aale,

I think we are hitting a couple of issues here.

The first is the point at which ITIL has to stop being prescriptive. Actually the one big improvement I would love to see in ITIL is a structure that made it easier to pick out the general objectives and principles of ITSM, both its design and its delivery, the elements that are essential to achieve success, and then the elements that are either contingent or more blue sky.

We can all agree we need problem management, we can probably all agree what problem management is there to achieve, but then the boundaries become less clear and the debates more interesting.

The second is that in ITSM we have to have a connection between the exciting big world of strategy and everyday hands-on operational management, where micromanagement does sadly often become a necessity even if the detail is outside the scope of ITIL guidance. I'm sure we all come across the situation where the biggest barrier to change is a small group of key employees who won't change their ways - a classic example perhaps being the technical team who focus on what is important or interesting to them, not what is important to the business.

That is part of what makes ITSM so much fun.

Submitted by aroos on Mon, 2010-01-04 07:08.

More examples of good practice

Very good points and Rob sums them up nicely "it's called management". Overall it is better to give goals, rules and guidelines but sometimes it has to be more detailed.

Maybe the ITIL books should have examples of good or "best" practices. The cases could show conflicting practices with explanations: "this works here because ... but it would not work if ....".

Aale

Submitted by vinodka on Sat, 2010-01-02 11:15.

No waiting or deprioritizing

I am also not for prioritizing Incidents over problems (at least not always). One can see more customers are dissatisfied with repeating incidents than Incidents taking longer time to resolve! I would definitely have the support team take one or two hours extra and fix it permanently- if I was the user/customer (of course unless the situation demands - which is captured through the 'Urgency' of the Incident ticket)!

The point I was putting across is -
If you have separate teams, then there is a value of opening a problem ticket at the initial stages itself.
If it is the same person/team handling both Incidents and Problems, then it is like one person having two tickets at the same time - then he has to do some prioritization there!
This is unless, you assume that the Incident ticket is handled by L1 first - at the same time L2 or L3 start working on Problem ticket. But this also can become messy and confusing - since after some time the Incident ticket comes through functional escalations to them.

There should not be any delay in creating and addressing a problem - my point is that a problem ticket should be created as soon as someone can start working on it! There is no real value in just opening a ticket and keeping it idle till the service restoration is done...

yes, the real challenge is having these processes going hand in hand with minimal interference and maximum interfacing!

Vinod

Submitted by JamesFinister on Sat, 2010-01-02 11:59.

Queuing and prioritisation

Vinod,

You've reminded me of an issue that I keep meaning to bottom out to my own satisfaction.

As we optimise ITSM according to ITIL we have multiple teams working on multiple activities. A support team might be doing incident fixing, contributing to a problem investigation and involved in the development of a change, whilst also implementing some standard requests.

All of those activities will have their own impact, urgency, and priority as well as differing resource requirements.

How do we merge them into a cohesive work flow for the teams ? Partly it is a question of simulations and modelling, but serious thinking about the business requirement has to take place.

Submitted by JamesFinister on Sat, 2010-01-02 10:16.

Heck

Purely a personal view of mine that I like the problem and incident teams to be separate but fully coordinated. Ian might comment on this one but I think that is also fairly consistent with non IT best practice. Reality might mean there are times when the PM team get drawn into incident resolution, either because they happen upon the fix whilst finding the cause, or simply because they are the people with the right knowledge - but I wouldn't aim for that as the norm. I also don't think you can expect ITIL to lay the law down about this one. Incidentally does anyone have a copy of ITIL InSITU handy? I'm not sure what that says about combining/dividing the teams.

Bear in mind as well that my top tip for employing a problem manager is to go for the candidate wearing the bow tie.

Why wait to initiate the problem removal activity? No reason at all, but keep the focus at this stage on the fix. If you think you've identified the causes whilst the incident is still on-going you are almost certainly missing something. What would the board prefer - We tell them we still don't know what the cause was and they need to mitigate a possible second occurrence, or we tell them we've removed the cause and then it happens again the next day?

As for escalation, well I think ITIL still oversimplifies it. There is a lot that should be going on at different levels, over different time periods for different reasons.

If you want to see what can go wrong then watch the first two rounds of a G2G3 simulation.

Submitted by ianclayton on Tue, 2009-12-29 22:39.

Having no problems defined implies undiscovered opportunity...

I see 2010 as the year of the continuous improvement program. Find a problem worth the effort of investigation, make it go away. Your comment Skep reminded me of a noticeable mantra from the world of Lean, "a lack of defined problems implies undiscovered waste", and an opportunity to improve. I know I've banged this drum before but problem management skills are desperately needed at the core of any improvement effort. ITIL V3 positioned this mission critical skillset as a process within Service operation instead of boldly merging it with the CSI discussions... why?

Submitted by aroos on Tue, 2009-12-29 12:57.

A good example

That is the key of the KCS model: searching is creating. You always search first, and if you do not find the solution, you need to create it. Very good principle if you can apply it.

Aale

Submitted by aroos on Mon, 2009-12-28 10:53.

KCS model

Taruu's model looks like the Knowledge Centered Support (KCS) model. The problem is that doing KCS and creating a good KEDB is not that easy. KCS can require a culture change.

"if this is an incident that we have not seen before, open a problem record right away" is bad advice. Who is "we"? In many cases somebody has seen the incident. Only in a pure KCS world does everybody have all the information easily available. This website is actually a good KEDB on ITIL, but searching it for a specific comment is not easy.

Parallel activities are a bad idea. Two teams might be doing simultaneous tests etc. and confuse each other. First you have the incident people doing their work to minimize the impact, and only after that do you let the problem people in.

And I disagree with the P.S. It is possible to run a calculated risk. You may be aware of a risk that may cause a major incident. No problem needs to be involved.

Aale

Submitted by JamesFinister on Mon, 2009-12-28 12:21.

Parallel activities

Aale,

I can't let that sweeping statement go without comment.

First of all I agree to the extent that we all know development and change horror stories of people working on different versions or releasing incompatible changes. In recent years I've also come across more and more examples of parallel attempts at cultural change that have ended up making things worse not better - for instance a lean team competing against an ITIL team who are competing with a SOX team.

There are also contextual situations where parallel activities can be counterproductive - in an under resourced or immature organisation it might mean that neither incident nor problem management are done well.

In an organisation that is well resourced and mature, however, doing the activity in parallel becomes the sensible thing to do, because there is clarity about the objectives and the interfaces. I have had a client who suffered from appalling availability issues thanks to a supplier. Very quickly senior management lost interest in asking "when will the service be restored?" because the saving grace was that the supplier's incident management was just about OK and the service was normally restored in a predictable time. What they began to ask, even whilst an incident was in progress, became "Why has this happened again, and how can we be sure it won't happen again?"

Submitted by aroos on Mon, 2009-12-28 13:24.

I see the point

James

Yes, it was a bit of a sweeping statement. Of course coordinated parallel activity can be a good idea, but problems may arise if it is un-coordinated. The case in my mind was a real life printer problem where three persons were trying to solve it without being aware of each other.

I suppose we agree that it would not be a good idea as a standard procedure to activate two processes simultaneously to try to solve an incident.

Submitted by skeptic on Mon, 2009-12-28 20:30.

simultaneous records

No, I still don't agree, because I didn't mean simultaneous independent processes, I said simultaneous records, one recording the restoration of service and one the underlying fix. Of course the efforts are one coordinated effort. Often the Level 2 and 3 incident people are the same as the problem people anyway in smaller orgs. In larger ones they ought to work as a team. How can an incident that is not familiar be quickly resolved without the help of problem experts?

An incident team doesn't hand over to a problem team like some kind of baton when they have exhausted their own ideas.

The whole clumsy concept of a Master Incident may even go away with this approach. All incidents are linked to one common problem which appears quickly enough for that approach to be useful.

Submitted by aroos on Tue, 2009-12-29 07:39.

You want a lot of problems?

Are you saying that 2nd level should always open a problem ticket when they start working on an incident? I have a customer who opens 12,000 tickets/month. SD and PM are situated in different business units, and SD has their own 2nd level for some areas but not all. Coordination is not so easy, the BUs are situated in different parts of the city etc.

I think the best approach is that problem management is only interested in preventing the recurrence of the incident. All that is done to fix the incident is incident management but it can include the PM team. In some cases the actions can be identical, fixing the incident might fix the problem too but if the business is waiting it is IM. IM stops when you have found a workaround or a fix. PM can continue with the same people after that if it is considered necessary.

BTW, I have been teaching ITIL for years and this is NOT what my material says. All these discussions have changed my view of what is best practice.

Aale

Submitted by skeptic on Tue, 2009-12-29 08:06.

crystal clear and common-sensical

No, I definitely did NOT say open a problem for every incident. I said open a problem when you suspect there is a problem.

Are all those tickets that you mentioned incidents, or are some requests? What percentage of incidents are not even incidents (i.e. PBCK, problem between chair and keyboard)? Of the remainder, what percentage match what we have seen before, either as an incident-model/script or an incident or a known error or a problem?
I would guesstimate that accounts for more than 95% of incidents.

For the remaining tiny number that don't match what we know, don't piss about trying to solve it in Level 2. There's a very high probability we have a new problem - get the problem team on it and create a problem record. If they are organisationally or physically separate, all the more reason to open a problem record early.

We'll have a problem record for all current problems being worked on by anyone - no problem analysis and fixing is hidden in incident records. And afterwards, we'll have a problem record for every problem. How else do you track problem stats? How else do you readily search all previous fixes? etc
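The triage described above can be sketched roughly as follows. This is only a minimal illustration in Python, not anything ITIL or a particular tool prescribes: the `Problem` record, the `triage` function and the "any shared symptom counts as a match" rule are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import count

_problem_ids = count(1)  # hypothetical source of problem record numbers


@dataclass
class Problem:
    id: int
    symptoms: frozenset
    incident_ids: list = field(default_factory=list)


def triage(incident_id, symptoms, known_problems):
    """Link the incident to a matching known record, or open a new problem.

    Returns (problem, created), where created is True only when no existing
    record matched and a new problem record was opened.
    """
    symptoms = frozenset(symptoms)
    for problem in known_problems:
        # Assumed matching rule: any shared symptom counts as a match.
        if problem.symptoms & symptoms:
            problem.incident_ids.append(incident_id)
            return problem, False
    # Nothing we have seen before: open the problem record right away.
    new_problem = Problem(next(_problem_ids), symptoms, [incident_id])
    known_problems.append(new_problem)
    return new_problem, True
```

In this sketch the great majority of incidents would fall into the first branch and simply be linked to an existing record; only the small unmatched remainder opens a new problem, and later incidents with the same symptoms attach to it.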

To me this is crystal clear and common-sensical.

(I hope Juan never reads this, else he'll have an apoplectic fit: I'm arguing 90% of what he said in the debate over on LinkedIn. I'm gonna hafta go eat humble pie)

Submitted by JamesFinister on Tue, 2009-12-29 09:36.

The clairvoyant problem manager

If we are honest, a good problem manager can probably foresee quite a few problems that will occur in the live environment, especially if they have taken an active interest in service transition and the testing regime. So why not pre-populate some problem records and manage them as we do risks and issues (after all, they segue into each other)?

"Engineer changes static IP address on printers/scanners"

"Vendor recommended patch not applied"

"Out of date drivers"

"Tested in an unrepresentative environment"

"Approved procedure not followed"

"Interdependencies not understood"

I like the philosophy of not having problem management hidden in incident records, but that isn't always easy to achieve, especially when fixing the incident and removing the problem are tightly coupled.

Anyone want to start a discussion about when fixing an incident turns into a change?

Submitted by aroos on Tue, 2009-12-29 09:52.

Agree with James

Fixing an incident is always IM, not PM. Remember that there can be complicated incidents that take knowledge and resources to fix. They are all problems to SD in the sense that SD does not know/understand the root cause, but they are just routine incidents to 3rd level.

The question about fixing turning to change is good.

Submitted by skeptic on Tue, 2009-12-29 12:59.

No longer shall I muddy the two

No, I think you guys are being sloppy about "fixing" (gawd I'm now in perfect alignment with Juan's position - how embarrassing). Incident management restores service. Problem management removes the underlying problem. I have seen the light, brothers. No longer shall I muddy the two. Either can be called "fixing" but they are not the same thing.

Submitted by aroos on Tue, 2009-12-29 14:16.

Intentional use of word

Fix can be temporary or permanent. I think you are being too dogmatic. IM may have to make a permanent fix to solve the incident. There may or may not be a reason to try to find the root-root cause but it depends. IM restores service, PM checks if it is possible to prevent it from happening again.

There are so many possible cases, it is impractical to make very definitive rules like: you always have to open a problem ticket.

Submitted by skeptic on Wed, 2009-12-30 05:50.

What are problem tickets for

It is dogmatic to say "you always have to open a problem ticket whenever you realise or have strong grounds for suspecting there is a new problem"? Really? What are problem tickets for then?

Submitted by aroos on Wed, 2009-12-30 08:11.

Juan's position is dogmatic

You said you agree that Juan was right in saying that you need to open a problem ticket always after a major incident. The word "always" is a bit dogmatic. A problem ticket is a decision to use resources to solve a problem. In my experience the problem solvers are a seriously scarce resource.

I think we are mixing two things here: opening a ticket and proposing a ticket. The ITIL model does not have the acceptance stage for problem tickets that it has for RFCs. I would rephrase your statement: SD and IM should always propose a problem when they think they see one. A proposal is a message to the problem manager that incident(s) #xxx may indicate a problem. Another route to wake up PM is to ask for their support in solving an incident.

Submitted by ianclayton on Wed, 2009-12-30 20:39.

A problem ticket does not commit you to take action

Aroos, Skep

The recording of a problem ticket does not commit the organization to take action. A problem can be recorded on the suspicion of anything that may be, or is likely to be, impacting a stakeholder. The first step is to make the case for action - to conduct an investigation by linking the problem to evidence and stating impact. As I said - step 1. It's very common for problems to stall at this stage on a 'problem queue'. The next step costs $ in the way of resources to perform cause analysis - not root cause - that's just one of 4 elements of cause analysis.

I'll return to my earlier comments somewhere - what qualifies as a 'major incident' needs general agreement - the criteria. It's typically impact focused. It should be embedded in any governance as it can also be based upon a need to be compliant with a regulation. I would remove the responsibility for defining the criteria from Incident management - they just get to record them. It is unclear in ITIL where this responsibility actually sits but I would suspect Problem has a say as they can count impact. When a major incident occurs I recommend a 'situation management team' be formed.... that spans continuity, incident and problem.... Other caveats. Never allow an incident management team to conduct root cause analysis - it just does not fit their mission. In fact root cause analysis is the LAST thing you attempt. See any mainstream cause analysis guidance - not ITIL please.

As for ISO20K - there is more in the Code part 2 than in the Specification Part 1. Part 1 is limited to two obscure references to major incidents... and basically requires you to have a procedure - quite laughable really as the procedure could be as simple as writing the event down on a yellow sticky!

Submitted by aroos on Thu, 2009-12-31 07:20.

Dummy problem record

Ian
Yes, you are quite right. Actually the Service Support book has a concept called a dummy problem record. My assumption has been that once a problem ticket has been recorded, it starts the process, but that can be averted with the dummy record. Not a very elegant solution, but it solves the issue.

As I had to dig out the book, I checked what V2 says about the matter. It states very clearly that problems should be recorded by the problem management process, i.e. not by Service Desk. It also says that a problem record must be opened after a major incident "for which a structural solution has to be found". This is exactly my position. You need to open the ticket after MI when it is needed, not always.

Rob, you have ignored one of my questions which I think is really critical when you say that we should open a problem ticket as soon as we suspect there is a problem. My question was: Who is we? (Hmm, bad English I suppose) What I mean is who decides that this is a suspected problem and opens the ticket. Many SD's suffer from high turnover rate. The new SD employee can see a lot more problems than the 9 month veteran (this is not a joke, in one SD interview one person told me that she is one of the veterans as she has been working there for 9 months). I know there should be a perfect Known Error database but so there should be a perfect CMDB too. In practice the rate of change can be so rapid that documentation lags behind or is old as soon as it has been published.

Aale

Submitted by JamesFinister on Thu, 2009-12-31 09:46.

Dummy records and decision making

In my mind I don't think of a problem record that isn't subject to further action as a dummy record. It is recording a real problem that we have decided not to action at this time - and of course we might reverse that decision in the light of later developments. It is a real record with status set to NFA or something of that kind.

There are at least two decisions needed: Is this a problem? and Shall we take action?

Before we go further: my view is based on a general approach I've adopted across ITSM for many years, which is that anyone should be able to bring something to someone's attention, but the decision making should ensure clear accountability. It is also biased by experiences that suggest to me that great service desk staff and managers do not make the best problem managers.

I think it is absolutely right that SD agents and managers should have a direct route to the problem management team and should be able to flag incidents up as being indicative of an underlying problem that needs PM attention. This should cater not just for major incidents but also, indeed especially, for those small niggling but frequent incidents that annoy the users but somehow remain under management's radar.

Other teams also need the ability to flag up issues - including developers, of course.

However I believe the formal decision that what we have here is a problem must rest with the PM team.

The PM team should also recommend whether further action is required, though they might not be the ultimate decision maker since resource requirements for major issues should be fed through the overall prioritisation committee.
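To make the two decisions and the accountability split concrete, here is a minimal Python sketch. The status names, the `ProblemRecord` fields and the `assess` function are hypothetical, invented for illustration; the only ideas taken from the discussion are that anyone may flag, the PM team makes the formal call, and a real problem can be parked as NFA and revisited later.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    FLAGGED = "flagged"              # raised by anyone, not yet assessed
    NOT_A_PROBLEM = "not a problem"  # decision 1 answered "no"
    NFA = "no further action"        # real problem, parked for now
    ACTIVE = "active"                # real problem, being worked on


@dataclass
class ProblemRecord:
    summary: str
    raised_by: str
    status: Status = Status.FLAGGED


def assess(record, decided_by_pm, is_problem, take_action=False):
    """Record the two decisions: is this a problem, and shall we act?"""
    if not decided_by_pm:
        # Anyone can flag, but the formal decision rests with the PM team.
        raise PermissionError("only the PM team makes the formal decision")
    if not is_problem:
        record.status = Status.NOT_A_PROBLEM
    else:
        record.status = Status.ACTIVE if take_action else Status.NFA
    return record
```

Calling `assess` again on an NFA record with `take_action=True` models reversing the decision in the light of later developments.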

Submitted by aroos on Thu, 2009-12-31 11:33.

Neither do I

James Rob
I am a Service manager and have been training and consulting ITIL for several years but I only noticed the dummy record concept today. I do like the idea of flagging suspect incidents.

My argument here has been against dogmatism. What I have told my customers has been this: you can either let SD open problems or submit incidents to PM. Both methods have their pros and cons. You will probably get more Problem Records if you let SD do it. That can be a good or bad thing.

Happy New Year to all

PS Rob. I just realized that we couldn't even agree what year it is now!

Aale

Submitted by skeptic on Thu, 2009-12-31 19:39.

more work?

Not all Service Desk tools allow parallel workflow on an incident which is required if you submit incidents to PM.

Also I suspect you will in fact get MORE work for the problem team this way, since SD do try to link multiple incidents to one related problem (although I am sure PM team will also need to merge problem records anyway)

Submitted by mbuzina on Sun, 2010-01-03 21:51.

A tool restriction from the skeptic ;-)

Hi Rob,

A pity I missed this interesting discussion up to now, but having some time with my family is not too bad either. A Happy New Year to you all.

Why do you throw in the tool limitation at this point? Are you trying to prove you are not dogmatic? Who cares by which communication device the two processes IM and PM communicate? They could use carrier pigeons if they like. The important thing is that a record of a potential problem is created as soon as someone discovers a potential problem. All other discussion here is either:
a) It is impractical because it will swamp PM with unneeded problem records
----> Either they are "generally" known errors which are not yet documented --> Great, now we have them documented, get the knowledge out of the head monopoly
----> Or they are duplicate error logs --> Great, have PM personnel improve the duplicates, they seem to be undiscoverable by "normal" personnel like the SD! If PM has to appoint one record as being the duplicate of another in order to close them we finally have a better quality measurement for known error documentation.
b) It is impractical because SD personnel have too many things on their minds.
----> They still have to think about whether they need a root cause analysis. Make a tool that makes it easy to create a "cloned" problem from an incident. Shouldn't be too hard.
c) It is not good practice because ITIL tells you to:
__1. Create the problem after the incident -> Be more skeptical ;-)
__2. log all PM records at problem management -> Be more skeptical & if you still believe that, use the pigeons or anything else to alert them.

These are the hardest parts. In all my consulting contracts I always told anyone who did not run away quickly enough that problems should be generated when they arise. It is a good additional thought that they probably arise at the same time as the incident does.

So overall, yes, creating a problem record early in the IM process can be useful. Do we need to appoint PM personnel to that immediately? How quickly do we need PM to respond to assess impact and priority for root cause analysis? How quickly should they scan for duplicates?

Submitted by skeptic on Sun, 2010-01-03 22:04.

management

You're right, the tool issue is irrelevant - whatever was I thinking :)

Agree with all you say. Your closing questions fit in with what Aale and James have been discussing about what is and isn't prescriptive or fixed. All that decision-making happens ad-hoc based on resourcing and load - it's called management

Sorry you missed it. There are about 110 comments on LinkedIn and 50 more here - you'll never catch up on them all.

Submitted by ITIL Master on Mon, 2010-01-04 02:58.

My head hurts

Sometimes I wonder why we bother to say ITIL is providing guidelines and not an instruction manual. In my experience Problem Management is as much cultural as it is a process. What do I mean by that? The Problem Management process states the activities we should follow and it is subject to reactive and systemic models. The reactive is of course in response to an Incident that has a high enough Priority (remember Impact is part of this). Sometimes we will find out what the root cause is as part of the diagnostic process (and that is a good thing) but we may not need to remove the root cause to restore the Service (memory leak, reboot server, Incident closed). That is why it may be important to create a reactive PM record where the decision is made on whether to pursue the RC. This decision will be made not just on Impact but on the risk of it recurring. I hate just talking about RC as it is just the last thing in a chain of events prior to the service error / failure. I like to see causal effects looked at to see if there is a better way to remove or minimise the chance of recurrence by addressing one of these causal events. I do not want the decision as to whether to create a permanent fix as part of my IM process, which you will have if you do not create an associated Problem record.

This leads to the cultural aspect of PM - everyone who plays a part in SM should be able to create a Problem record and should be encouraged to do so with the supporting evidence, wow, maybe even introduce incentives. This is where I like the link in to KCS where my resolver groups work with PM to identify Problems, work through on the ones that require permanent solutions and also publish workarounds for the others.

Incidentally, one of the biggest things holding most organisations back from doing real proactive PM is the lack of ability to data mine the information looking for patterns.

Oh and Ian, I noticed that early on you mentioned that IM needs PM's permission to restore a service. I am sure that you mistyped, as there are very few occasions when this is true: the overriding objective of restoring the service will nearly always beat the need to gather PM diagnostic info - especially if there has been no decision to proceed to RC analysis.

The last lesson of a Master is simplicity

Submitted by JamesFinister on Mon, 2010-01-04 07:14.

The perils of pre-emption

I'm possibly with Ian on the need for IM to ask PM permission to restore service, with lots of caveats around it.

Obviously it only applies when the incident has attracted problem management interest, and if it is the result of a major incident then problem management and incident will probably be tightly coordinated as part of the major incident team, and it can often be as simple as PM saying upfront whether or not they want to be asked before the service is restored.

What happens if you don't ask problem management? The worst case scenarios include

Vital diagnostic information being lost
Evidence being lost to support any ensuing contractual dispute
Evidence of the full extent of the damage done being lost - which can be a major issue in highly regulated industries
Incident management taking action which has consequences only PM are aware of
Incident management undertaking a fix PM already know won't work

Obviously the more mature your processes the less need there is for this permission to be heavy handed, but in the early days of adding PM to your repertoire it is a useful additional control.

Needless to say there are other areas where the eagerness of a single team to do things quickly, with the very best intentions, has unforeseen consequences. Putting in changes early comes instantly to mind.

Submitted by skeptic on Mon, 2010-01-04 10:13.

to restore service or to get the diagnostics

IM should no more ask PM permission than vice versa. There are conflicting goals here - that is one of the first things ITIL taught me. It is management's role to decide what is more important: to restore service or to get the diagnostics. If this is the third time it has hit and management tells IM to wait, then they wait. If no sales are being made and management tell PM to pull their heads in, they pull them in. It is these rare moments when managers actually earn their fat salaries.

Submitted by mbuzina on Mon, 2010-01-04 22:47.

And the working people still need to cover their aXX

I agree. But still the "working class" will get the blame if important evidence is destroyed. Think about a security incident caused by a hacker attack. You bring back the service because management told you so and afterwards you have to tell them "no, sorry we could have identified the bastard, but we erased the logs by doing a restore....".

So my proposal: Use modern technology to ease the evidence recording and still stick to a fixed restoration routine. Back up your virtual machines and keep copies of the SAN disks of your system before starting the restore. Ideally keep some separate HW available to restore the failed systems / machines / data to for inspection. Then PM can start poking at the smoking remains trying to find the black box while IM can put your passengers into a new plane heading towards their destination.

Submitted by JamesFinister on Mon, 2010-01-04 23:38.

Three environments

It is also a good idea to have another alternative environment where people can play around without impacting the live environment, but it has to reflect the live environment so that what works as a fix there will work in live. So we end up with a live environment, an operational test environment where changes can be made without impacting the live environment or losing forensic data, and a forensic environment to preserve evidence.

I know to some that sounds like overkill, but in safety critical, highly secure, or highly regulated systems you sometimes need to do such things.

Submitted by ITIL Master on Tue, 2010-01-05 02:46.

Still more I cannot believe

It should be in only the smallest number of cases (by exception) that IM asks PM for permission to restore a Service, and some of those few exceptions are discussed above. We seem to be having this discussion as if it were an abstract concept and totally missing the point of being business focused. The mere concept of IT going back on a regular basis to the business and saying we cannot restore a business service because we need to capture diagnostic information to determine if there is a root cause that we may or may not determine we need to fix at a later date is so far away from business and customer focus that I find it incredible.

The question on PM OLAs is an interesting one when you relate it back to 'who cares'. Has the Service been restored? If yes, take a breath, smoke them if you have them and off to the next crisis. If you were to put OLAs on PM what would they be: time to diagnose RC, time to remove RC? I work with a large multinational finance organisation that has an OLA for RC on every Problem, but nothing for removing the RC, which I have told them is crazy and explained why (I don't need to do that here, do I?). One of the driving factors in determining whether we even need to proceed to RC analysis is risk: what is the risk of this happening again? Then, in conjunction with Impact, prioritise which Problems need to be addressed first.

It really is not that hard with reactive PM - Mandatory for P1 Incidents, desirable for P2! For everything else it should be dealt with by Proactive PM (with some exceptions of course). How come nobody is talking about Proactive PM, as isn't this where we will start driving down the largest number of recurring Incidents? We are all talking about reactive PM as we know that in most organisations the IM and PM people are exactly the same people so there is a tight connection.

The last lesson of a Master is simplicity

Submitted by JamesFinister on Tue, 2010-01-05 09:15.

The first lesson of a master...

...is to listen to what people have been saying and the context in which they are saying it ;-)

The majority of posters here, if not on the LinkedIn group, are NOT talking about creating a problem record for every single incident, or at least certainly not on a one to one basis. We are talking about problem management only taking an active interest when the chronic or acute impact of the incidents is hurting the business. So we already have a by exception filter. We are mostly talking in this context about major incidents. If they are a regular occurrence then problem management isn't doing its job properly, so we already have a filter by exception.

That focus on impact on the business should make it clear that we haven't been discussing it as an abstract concept.

I suspect we might be differing substantially in our views of what type of person the problem manager should be. From very early on in my ITIL career I've put the emphasis on the management of problems, that is to say the problem manager making the call over whether the incident should be fixed is making decisions as a business-centric manager. My view of problem management is emphatically NOT that of the L3 techie trying to collect every last scrap of information, but of a manager weighing up the risks and acting as an important balance to the incident manager's desire to close the incident ASAP.

So you appear to be finding incredible a view none of us have. What we are saying is that ITIL, and this is really a very very basic ITIL 101 point, wants to break out of the constant break fix cycle that frustrates the business and give them an assurance that the service is dependable. And we aren't talking abstract theory: I've spent the last two years in just that situation and time after time the C-level message to the supplier was "Tell us what you've done to stop this happening again".

Incidentally I would say the greatest need for proactive PM is in stopping P1 incidents. You can afford to be reactive to the more minor ones, but then we are drifting into availability management.

Submitted by Visitor (not verified) on Wed, 2010-01-06 00:36.

"Shifting the Burden"

What you've described is a very common archetype found in organizations (complex systems).

http://www.systems-thinking.org/theWay/ssb/sb.htm

Submitted by JamesFinister on Wed, 2010-01-06 09:01.

Multiple applications

That probably isn't surprising since my systems experience pre-dates my ITIL experience, and systems thinking obviously influenced ITIL, and it really was pretty much the first thing we taught people on early v1 courses. Some readers here might remember the cookie factory story.

As an aside one of my concerns about the ITIL world is that it latches on to good ideas from external sources and then tries to develop them itself, rather than syncing with what the professionals in the domain are doing.

That Shifting the Burden model is applicable in the ITIL world in lots of areas, not just problem management. We fall into its trap in trying to change the behaviour of suppliers, in trying to keep customers happy and, from very painful experience, in capacity management.

Submitted by aroos on Tue, 2010-01-05 07:57.

Proactive PM is the point

Simplicity is good.

I have been trying to argue that all the things you do to get business back running are actually Incident Management, even if you need to find and fix the cause of the incident. Reactive Problem Management would be just another name for the same activity and that seems to be the cause of this confusion.

It becomes much simpler if you decide that problem management is always proactive: the goal is to prevent the recurrence of the incident. It is very likely that some people need to do both activities so it is important that they understand the different goals.

Aale

Submitted by mbuzina on Tue, 2010-01-05 15:06.

Simple is not enough

Why is preventing a recurrence "proactive"? Proactive means preventing occurrence. In your point of view you could still open the problem early on, so as not to lose vital information for your "pro (re) active" problem management. After everything is smooth again you can start investigating how to prevent this. If you documented your incident well, you may even have half a fault tree. But using that is dangerous: you may have missed the other half, which was not needed to restore the service.

Have a look at the sample FTA in my recent article http://buzina.wordpress.com/2010/01/05/incident-and-problem-management-r....

Submitted by mbuzina on Mon, 2010-01-04 08:22.

OLAs on PM

Hi James,

I still am a bit torn between having a deeper PM involvement and the requirements of IM. IM's focus is on restoring quickly and within dedicated time frames. Availability, Capacity and SLM design processes should make sure that a service is recoverable within a dedicated time frame. There is no design process for PM (maybe there should be), so with most of the clients I work for, PM is an activity that is done on an "if resources permit" basis. Some have allocated resources (some don't), but nowhere I have been are there enough resources to guarantee any kind of response times, which would be required in such a case. In common cases of off-shoring ITSM execution this really is an issue, since often PM staff is not off-shored, but not available enough.

What are your experiences in getting OLA like commitments from PM?

Submitted by JamesFinister on Mon, 2010-01-04 11:16.

PM design and specification

First of all I think we have to recognise that there is a crossover between PM and other processes/functions - CSI/quality included.

Secondly trying to design/specify problem management runs the danger that we get a framework for PM but not the substance. There have been other posts on this site about justifiably highly recommended reading in this area, most of it from a non IT perspective. And it is important to remember that just because ITIL doesn't include something doesn't mean it isn't available elsewhere.

At the heart of problem management is a good problem manager. It makes or breaks PM.

The shift from an incident focus to a problem focus is one of the most important cultural shifts in mature ITSM. As with much of ITIL though it is hard to build the business case because part of the implementation is the development of the metrics that would prove the case - but obviously that comes after the decision to do it. A useful starting point is to do a one off PM project to look at something that is causing the customer pain but is low on the IT team's agenda - printer issues for instance.

Fixing a problem once is obviously a much better use of resources than fixing the same incident multiple times.

Organising PM across a supply chain is a major topic in its own right, as is how you capture PM in OLAs and contracts. You can’t set a target for how long a problem will take in the same way that you can for an incident, and other things become important, such as the transparency of the process.

Another point about metrics for PM, which I frequently talk about at conferences on metrics, is “The Problem Manager’s Dilemma”. This is that the metrics that prove you have done a good job one year are actually counterproductive the next year, so cannot be used to measure year on year improvement. Take something simple like the number of problems identified per year. If that number goes up does it show that problem management is working well, or that it isn’t?

Submitted by mbuzina on Mon, 2010-01-04 22:42.

OLA/SLA for PM not on Resolving Problems

Of course fixing a problem is better resource usage than restoring X incidents. No one would argue with that, but working mostly with IT service providers where the client purchases a platform service (often including higher level services as well) I see contracts / SLAs that bind the provider to specific restoration times. This usually includes restoring to some previous point in time, which in turn destroys the logs needed for the problem resolution process (it can be even worse on security related issues, where they may be real evidence required to "fix the problem" in a court).

My usual approach is to define a fixed procedure for restoration of the service and have that run. This way I get back up & running within the time my client required. If evidence is important, I would add a backup of the current data / system. In virtualization this is usually an easy job. I would prefer to add an evidence recording step to the service restoration process script rather than involving problem management, who (in my experience) are not used to dealing with fixed timelines.

If you recommend asking PM for permission to restore (even a short "yes, go ahead" requires resource availability) you need to put an OLA in place to meet your targets on the total restoration process. This is quite similar to the execution of your DR / continuity plans. They also run a predefined script (much more detailed than a mere "process" ;-).

Your metrics remark is right. And not only does it make its own metrics look worse, it can also make other processes look worse (the old "easy incidents are what make IM & SD look good" vs. solving the problem of the most frequently occurring incident, for example). The same is true for your problem manager (even if the same holds for change-, service level- and a few other -management topics).
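As a rough sketch of what such a restoration script with an evidence recording step might look like (Python, purely illustrative: `run_restoration` and the two callables passed to it are hypothetical placeholders for whatever snapshot and restore tooling a site really uses):

```python
def run_restoration(capture_evidence, restore_service, evidence_required=True):
    """Run a fixed restoration procedure, capturing evidence first if needed.

    capture_evidence and restore_service are caller-supplied callables, for
    example "snapshot the VM" and "restore from the last good backup".
    Returns the list of steps performed, in order.
    """
    steps = []
    if evidence_required:
        # Preserve logs / disk state BEFORE the restore overwrites them.
        capture_evidence()
        steps.append("evidence captured")
    restore_service()
    steps.append("service restored")
    return steps
```

The design point is only the ordering: evidence capture is a fixed, scripted step ahead of the restore, so nobody has to wait for a "yes, go ahead" from PM while the restoration clock is running.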

Submitted by JamesFinister on Mon, 2010-01-04 23:25.

The bucket is the key

All measures have to be taken as part of an overall framework of measures. In reality there are multiple possible outcomes, some of which the customer will find acceptable, and some they will not. SLAs, OLAs and contracts rarely capture this. And of course those possible outcomes have a range of probabilities. In an esoteric frame of mind I might suggest that what problem management does is to alter the balance of those probabilities. The classic question "How many incidents did problem management prevent this year?" can perhaps only be answered in that way.

It helps improve both IM and PM if standard procedures are followed. Somewhere I have the pilot notes for a Cessna 150 and the laconic page for fire says pretty much "Land the aircraft and get out", but in commercial aircraft we know a more detailed standard check list is followed, part of the intent of which is to avoid wrongly identifying the cause of the alarm. Misidentification of the alarm has been a factor in turning several aviation incidents into major incidents.

Someone who has not yet been suggested as the decision maker about restoration is the Service Manager, a role early versions of ITIL tended to ignore because it didn't pigeon hole neatly into a process.

Submitted by skeptic on Mon, 2010-01-04 23:23.

Firemen's SLAs

"contracts / SLAs that bind the provider to specific restoration times"

Priority 1: If the entire factory catches fire we will extinguish the fire and rebuild the factory within 1 hour
Priority 2: If one room of the factory catches fire we will extinguish the fire and refurbish the room within 3 hours.
Priority 3: If a piece of equipment catches fire we will respond within 4 hours and extinguish within 1 working day
Priority 4: For waste-bin fires, trapped cats etc: we will resolve within 7 working days

Quote from real life: "Dammit the entire system has been down for hours. How much longer will it take you people to find the problem?"

Submitted by skeptic on Thu, 2009-12-31 08:06.

what i thought we didn't want

Let me turn that question around and ask: if problem records are only opened by problem management, what is the procedure to involve them so they can make the judgment? Transfer the incident to them? We are no further ahead. Send them an email or phone them? We just lost track of the work queue.

I have no problem with problem management making the judgment. That is what the problem record is for: to tell them what situations are out there awaiting their almighty consideration. If it is not a problem they categorise it under "C for crap" and move on.

If we DON'T open a problem record then yes indeed it is someone other than problem management making that decision, such as an inexperienced service desk operative, which is what I thought we didn't want.

I'd rather problem management get too many false positives than have service desk generate false negatives. Besides, all the false positives are an incentive for the problem experts to improve the knowledgebase.

Submitted by skeptic on Wed, 2009-12-30 11:10.

I'm as dogmatic

I'm as dogmatic as Juan now - I'm the newly converted zealot. No problem ticket when there is a (suspected) problem is no different to no incident ticket when there is a (suspected) incident - they are both wrong. Tickets are there to track reality so we can manage it as it happens and measure it afterwards. And the rules should be simple and clear-cut, especially during periods of high stress.

Submitted by aroos on Wed, 2009-12-30 13:56.

What's happening down under

Rob
Skeptic zealot, hmm. Is it very hot there?

;-) Aale

Submitted by JamesFinister on Wed, 2009-12-30 10:40.

Need for a ticket

Aale,

If a decision is taken not to pursue problem investigation there is also a decision to accept an unmitigated risk of the problem causing further incidents. This is a perfectly acceptable approach BUT the decision, and the rationale for the decision, needs to be recorded and the obvious place would be on a problem ticket that is then closed with no further action required. This parallels what we would do for an incident that turns out to be a non-event or can be closed by the desk - we would still want a record to be raised.
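To make the idea concrete, here is a minimal Python sketch of a problem ticket that is closed with "no further action" while still carrying the decision and its rationale. The `ProblemTicket` class, its field names and the `close_without_action` helper are all hypothetical, invented for illustration; they are not taken from ITIL or from any particular ticketing tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProblemTicket:
    """Hypothetical problem ticket; field names are illustrative only."""
    ticket_id: str
    summary: str
    status: str = "open"
    decision: Optional[str] = None
    rationale: Optional[str] = None
    decided_on: Optional[date] = None


def close_without_action(ticket: ProblemTicket, rationale: str,
                         decided_on: Optional[date] = None) -> ProblemTicket:
    """Close the ticket, recording that the risk was knowingly accepted."""
    if not rationale.strip():
        # The point of the ticket is the audit trail, so refuse an empty reason.
        raise ValueError("a rationale is required when accepting the risk")
    ticket.status = "closed"
    ticket.decision = "no further action (risk accepted)"
    ticket.rationale = rationale.strip()
    ticket.decided_on = decided_on or date.today()
    return ticket
```

The only design choice worth noting is that the helper refuses to close a ticket without a rationale, which is the sketch's way of forcing the decision to be recorded rather than implied.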

Submitted by aroos on Wed, 2009-12-30 12:15.

Makes sense

Yes, I started thinking about that after I wrote the last comment but had to go. Makes sense but I don't think ITIL describes it like that.

Submitted by JamesFinister on Tue, 2009-12-29 14:00.

Removing, not fixing

Rob,

I thought I was actually being very careful to talk about "fixing" the incident and "Removing" (for want of a better term) the problem. We can also mitigate a problem I suppose.

Submitted by ianclayton on Mon, 2009-12-28 21:05.

Remember - it starts as a problem...

Pre-ITIL, IBM was one of the outfits to get this right - documenting a problem-change cycle as part of the redbook series (if anyone has a copy of this sky blue 60 page gem please let me know - I lost mine to a hurricane!).

In the beginning we have a problem. It's recorded as an incident due to the ruminations of ITIL v1, to reflect the low impact or triviality of the event, and to forestall the effort (and cost) of a full court problem investigation. Note - in the old mainframe days most 'events' forced us to resort to in-depth diagnosis and investigation - recorded as a problem. The incident reaction was also propelled by the customer demand for service restoration caused by the increasing dependence on IT.

ITIL spent years clumsily explaining this with infamous definitions of an incident such as ... 'a problem for which the cause is unknown'. The plain fact is, an event is recorded as an incident to cheapen the cost of support - a reasonable default tactic. If it is later (or coincidentally) discovered or suspected to have some level of significant impact on one stakeholder or another - we go back to basics and record a problem record, to justify the investment of investigating and perhaps eliminating the causes.

As for hierarchy - any and all incidents are related to each other, and to the problem record as evidence of the impact. Rules as to whether the support staff should apply workarounds are invoked under management of the problem team. Note: ITIL forgets the tactic of 'containment'.

Although the 'processes' seem parallel they are integrated under an uber process - service support. As Skep says - there are incident, problem, and perhaps even change records that are related and networked. ITIL V3 has made the task of following the dots in the life of a customer impact event harder by distributing the guidance across three books (Design, Operation, and Transition), and leaving enough in the CSI book to impede the design of a continuous improvement program.

Submitted by JamesFinister on Tue, 2009-12-29 07:08.

Hidden in plain sight

After a couple of years teaching incident and problem management it struck me that a lot of people were mistaking the problem record for the problem itself, and didn't realise that the chronology of the extended incident life cycle is usually significantly different from the chronology of the records we use to manage that lifecycle. As Ian says, the problem comes first, but that doesn't mean the problem record comes first...unless we are doing proactive problem management of course ;-)

James

Submitted by skeptic on Tue, 2009-12-29 07:16.

Why not?

The problem record doesn't come first, but it should come much sooner after the incident record than ITIL advice or common practice has it. Why do a retrospective record? Why not start documenting the problem resolution process as soon as we initiate it? And we should initiate it as soon as we suspect a problem. And we should suspect a problem as soon as we fail to get a match with existing knowledge and/or we fall out the end of an incident model procedure.

The problem team are the ones who should prioritise and judge how real a problem is - no-one else. Get them involved early, and give them a record to manage their actions.

Why not?
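The rule argued for above (no match with existing knowledge means open a problem record straight away and link the incident to it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ProblemRecord`, `find_match` and `triage` are hypothetical names, symptoms are modelled as plain sets of strings, and the "match" is a naive subset test rather than whatever a real KEDB or problem database search would do.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List, Optional


@dataclass
class ProblemRecord:
    """Hypothetical problem/known-error record keyed by a set of symptoms."""
    problem_id: int
    symptoms: FrozenSet[str]
    status: str = "new"  # the problem team prioritises or dismisses it later
    incident_ids: List[str] = field(default_factory=list)


def find_match(symptoms: FrozenSet[str],
               known_records: List[ProblemRecord]) -> Optional[ProblemRecord]:
    """Return the first record whose symptoms all appear in the incident."""
    for record in known_records:
        if record.symptoms and record.symptoms <= symptoms:
            return record
    return None


def triage(incident_id: str, symptoms: Iterable[str],
           known_records: List[ProblemRecord]) -> ProblemRecord:
    """Link the incident to a matching record, or open a new one up front."""
    symptom_set = frozenset(symptoms)
    record = find_match(symptom_set, known_records)
    if record is None:
        # Not seen before: raise the problem record now, not retrospectively.
        record = ProblemRecord(problem_id=len(known_records) + 1,
                               symptoms=symptom_set)
        known_records.append(record)
    record.incident_ids.append(incident_id)
    return record
```

Used on an empty list, `triage("INC-1", {"login timeout", "db error"}, records)` opens a new record; a later incident showing the same symptoms is attached to that record instead of spawning another. False positives simply land in the problem team's queue with status "new" for them to judge.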

Submitted by JamesFinister on Mon, 2009-12-28 10:34.

IIRC....

....you might find that approach documented in the model answer to the old v1 managers' exam question on whether the objectives of problem management and incident management conflict with each other, which I think was openly published as a sample paper.

I've certainly never taken the view that you open the problem record at the end of an incident, not least because it contradicts the basic ITIL tenet that incidents don't turn into problems. In addition, if you don't open the problem record until then you lose the ability to collect diagnostic information that is not of any use in restoring the service, but is useful to find out what happened. To use the now ubiquitous analogy with air crash investigation - the accident investigators arrive on site before the bodies are moved.

You might also remember the diagram about the link between the development and live environments where problem and known error records are populated before the system goes live and therefore before any incidents have occurred. I must admit I've not checked to see if that diagram is still in v3.

James

Submitted by ianclayton on Mon, 2009-12-28 18:53.

Impact, Impact, Impact

IMHO... When and what procedure to follow regarding the creation of problem records is all about impact. It's the right of problem management to generate a problem hypothesis (I suspect or think something is going on...), or a problem record, at any point, regardless of whether it's at the beginning or end of an incident. The degree of impact upon either a customer or provider community should be documented as a policy and preferably linked to organizational objectives and results.

As for the KEDB - it's a bit of a distractor - as it represents key information that should be contained within a problem record. If a problem record is suitably formed the cause and solution statements are searchable, and this includes during matching of incidents. The fewer moving parts resulting from proper architecture is a good thing...

When determining the degree of impact, be sure to ASK affected parties, not assume. Experience has shown that the same event happening to the same person(s) on the same day of the week, and interrupting the same activity, can have a totally different impact....

Major Incident: What constitutes a major incident requires prior documentation and agreement, with the opportunity to add new ones as required. Again, as I am sure I have blogged elsewhere on this site, 'major' can include something as simple as a security related incident (as this could have a significant long term impact if not addressed quickly), or as complex as many related incidents, or even one incident with a large impact zone.... It's so much simpler when the impact is connected to activities and desired results.

As to why ITIL did not help us all by 'walking the process/procedure' - I could hypothesize but..... as we can all appreciate, it's healthy to test theory by exposing it to a real-life situation.

Submitted by skeptic on Mon, 2009-12-28 20:21.

the duty of all staff

Hi Ian

I'd add two things:
"It's the right of problem management to generate a problem hypothesis" ... and the duty of all staff to do so. I will blog on that soon.

"What constitutes a major incident requires prior documentation and agreement," yes that would be useful as guidance but let's not let people get into the mindset of following that as a rule. A Major Incident is exactly analogous to a State of Emergency. If aliens land on the White House lawn you do not want to be in the situation of "I'm sorry Mr President, that's not an emergency we know about". A Major Incident is declared by authorised management on any grounds. Later we can review whether it was justified.

Submitted by ianclayton on Mon, 2009-12-28 20:43.

The glue between incident, problem, continuity - the situation

Hi Skep

As always I was speaking with speed..... defining a problem hypothesis is a skill. There are rules to avoid raised fur and cat fights. This is what I teach (strange how I so often fail to practice it!). It's the duty of all staff to bring a potential issue to the attention of problem management - analogous to a 'suggestions box', and for evidence to be recorded in a consistent manner, allowing easier trend analysis. But it's problem management who develops the skill and then delegates its use as need be....

As for aliens on the WH lawn... agree. I work with organizations who now operate a 'Situation Management' protocol that is intertwined with incident and problem management due to the need to respect outside influences. What qualifies as a major incident or 'situation' is specific - it needs to be - driving a raising or lowering of the situation level, which in turn affects the sensitivity of service contracts and continuity plans.

For example, we can't predict when fog on the freeway might cause a crash, but we can suspect from prior experience. Our situation level and sensitivity to local issues might be influenced by the fact we might have the only medi-vac helicopter in the region. In house priorities and procedures may change due to outside influences - a situation, and the need to coordinate a host of activities with other entities (aliens?). Sophisticated - yes, but this reflects the integrated, wired and community organized world we now live in.

I use aspects of non-IT thinking all the time, as that world is changing at a pace consistent with technology; IT's is not. Leveraging universal incident management methods to address service related incidents - more on that in a new book due out in January.

So my words meant what you said - the criteria are governed, how they are activated depends on the situation and by default, as in continuity, we are able to leap and ask for forgiveness later.....

Submitted by ianclayton on Tue, 2009-12-29 18:00.

Problem Managers have crystal balls

James - excellent analogy. The problem manager sits all day in their tent surrounded by tarot cards, crystal balls and incident reports. They scour dashboards looking for potential problems. When they encounter an actual problem happening it's a small defeat, as they believe it's their duty to predict and avoid or mitigate these events.

As Skep suggests, a problem record can be opened at any time for any reason by the problem manager. The key here is to define the problem and its impact using specific language that avoids the scent of blame or a solution. It may often be in the form of a hypothetical statement. "I think this may be happening with this likely effect, but I have no evidence, yet".

The problem and impact statements should call out stakeholder interest and link to key performance targets (see USMBOK). The whole effect is to garner skin in the game from stakeholders and to justify the effort of cause analysis - which is time consuming and typically expensive. Meanwhile, with the permission of problem management, incident management can restore service using approved workarounds, and based upon agreed response times defined within agreements. Incident management can also follow containment procedures prior to applying workarounds.

Cause analysis starts not with root cause analysis (root causes are seldom found), but with checking that existing policies and procedures were being followed - control barrier analysis. Going in to cause analysis we may have a number of likely suspects - presumptive causes.... Related cause analysis sub tasks (change and task analysis) are conducted and a bunch of contributing causes are developed. Each requires a solution/fix/countermeasure.

This snippet and my explanation are from the class I am resurrecting in January 2010, which will begin to show how much is missing from ITIL when you compare it with methods used universally.....