Conducting a root cause investigation on a 3rd party

Submitted by Vespasian on Wed, 2012-01-18 12:44

Dear ITIL Wizard,

Our problem is this: We are being expected to conduct root cause on problems that impacted us but which are owned and maintained by 3rd parties who will also have their own Problem and Incident teams. This means spearheading investigation and exploring ways to remedy problems at another company! It's like sorting out your mate's wife's gambling addiction because it's making YOU have to fork out for John's beers whenever you go out! No, John should be telling YOU what he's doing to sort the problem out.

So, how does the below sound.

--We still raise a problem ticket to track the root cause investigation, but it is updated as and when we get information about Company B's own root cause investigation.

--If we need to conduct a Major Problem Review on an incident, we start one of our own and review how WE handled the incident before it was sent to Company B. Simultaneously we ask Company B to conduct one of their own into how they handled the incident; when they complete it they send it to us and we amalgamate the two in some way.

Your or anyone else's thoughts on this would be ace.

Cheers
Vespasian

Dear Vespasian

ITIL says you can do a root cause analysis on an incident without opening a problem record - this is more efficient and saves duplicating information. You don't track all problems, only those you choose to, usually when they have caused multiple incidents.

Also, since you have transferred risk and responsibility to a third party supplier, you are effectively "black-boxing" their domain. What they do about internal problems is their business, that is the whole point of outsourcing. Measure service levels and penalise if they are not met but don't go opening the box - if you do, you might as well have kept it in house.

Keep It Simple, Sir.
Good luck!
The ITIL Wizard

Comments

Submitted by skeptic on Wed, 2012-01-18 20:55.

a Real ITSM view

The DogmaITSM division of the Real IT Service Institute (RITSI) has weighed in with a Real ITSM view of this question.

Submitted by Real World ITIL (not verified) on Fri, 2012-01-20 20:24.

Us too

I also work in a similar situation where we are penalised on SLA for problem records and fixes made by 3rd party companies that are out of our control.

The only way we have been able to bring them in line is to stipulate in the contract to our customer that the issue is out of SLA when it is passed to the 3rd party and that the SLA we have is "aspirational" in those situations. We have also ensured that the quality we get from the 3rd parties is tied down to the contract between us and them.

Service credits anyone?

Submitted by skeptic on Fri, 2012-01-20 21:07.

Accountability for service levels

I guess we should give Vespasian a straight answer and not leave him at the mercy of the ITIL Wizard.

3rd party suppliers are not out of your control. You have sub-contracted them. You cannot - or should not - contract out of accountability for your suppliers by saying it is out of SLA.

I bought a wood-burner. The cap on the chimney broke in the wind. The wood-burner vendor tried to tell me I should take up my problem directly with the manufacturer of the cap. I told him in no uncertain (Anglo-Saxon) terms that my contract was with the vendor for an installed system. I eventually got a new - different - cap for free from the vendor. Trying to pass through accountability is a cop-out and your customers shouldn't stand for it either.

It is our responsibility as a service provider to ensure our underpinning contract is adequate to hold the third party accountable to us - not to our customer - in order that we can fulfil our SLA obligations. If we do not meet SLA because a third party let us down, we have failed our customer. The third party is only out of our control because we let them be. We're paying them.

If for some reason we cannot get adequate service levels - e.g. from a monopolistic telco provider - then we shouldn't be committing in the first place to service levels we can't deliver.

Something has to give but laying the mess in the customer's lap is not the answer.

Getting to Vespasian's actual question, I think he's on the money with the process:
* Open a problem record.
* The 3rd party supplier should be contractually bound to participate in our root cause analysis sessions, usually by phoning in.
* They should manage their own problem resolution within their organisation, but we only have limited interest in that, just as our customers have little interest in how we resolve problems. We just want to know it was resolved.

Later, as part of Supplier Management not Problem Management, we may call for a report from the 3rd party, but this is to decide if we want to continue doing business with them and what contractual changes may be required.
The Wizard is right. The whole point of outsourcing is to black-box the 3rd party service. If you are going to integrate the two problem processes in your organisations, why outsource them in the first place? Don't open the box.

ITIL is quite clear: stitch up the underlying contract in such a way that you can meet your own SLAs. If they don't align, it is a contractual issue for supplier management, not a performance or process issue for tech support.
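
The alignment check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `ServiceLevel` fields (an availability percentage and a maximum restore time) and the `sla_gaps` helper are hypothetical names, not ITIL terms and not taken from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    """Illustrative service-level terms; the field names are assumptions."""
    availability_pct: float   # e.g. 99.9 means 99.9% uptime committed
    max_restore_hours: float  # longest time allowed to restore service


def sla_gaps(customer_sla: ServiceLevel, underpinning: ServiceLevel) -> list[str]:
    """Return the ways the supplier contract falls short of the customer SLA."""
    gaps = []
    if underpinning.availability_pct < customer_sla.availability_pct:
        gaps.append(
            f"availability: supplier commits {underpinning.availability_pct}%, "
            f"we promised {customer_sla.availability_pct}%"
        )
    if underpinning.max_restore_hours > customer_sla.max_restore_hours:
        gaps.append(
            f"restore time: supplier commits {underpinning.max_restore_hours}h, "
            f"we promised {customer_sla.max_restore_hours}h"
        )
    return gaps


if __name__ == "__main__":
    promised = ServiceLevel(availability_pct=99.9, max_restore_hours=4)
    bought = ServiceLevel(availability_pct=99.5, max_restore_hours=8)
    for gap in sla_gaps(promised, bought):
        print("Contract gap:", gap)
```

An empty result only says the supplier's terms are no worse than ours. In practice the underpinning contract probably needs to be stricter than the customer SLA, to leave room for our own handling time before and after the hand-off.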

Submitted by MichaelC (not verified) on Wed, 2012-01-25 15:59.

+1, absolutely bang on.

I've heard of organisations that agree a nice big juicy SLA for a service with their customer, despite that being provided by a 3rd party and the supplier contract guaranteeing less than that to the organisation - sheer madness!

Submitted by skeptic on Wed, 2012-01-25 20:27.

I've seen that. Often. Imagine what it is going to be like when the non-IT business units are signing direct with suppliers.

Submitted by Visitor (not verified) on Thu, 2012-01-26 09:28.

yes, imagine that ...

Funny you should bring that up. Unfortunately, I don't have to imagine what it would be like; we actually see it quite a lot and it's causing a few headaches at the moment.

Submitted by MichaelC (not verified) on Wed, 2012-02-01 13:07.

I might've seen it. Currently. Couldn't possibly confirm that though :D

I daren't imagine the non-IT business units getting hold of Supplier Management!

As an aside, I've been involved in the IT world for 13 years (and thus share a decent level of skepticism) but am relatively new to ITIL/ITSM. Having recently found your blog I'm enjoying your writings, so may just hang around a bit :)

Submitted by Visitor (not verified) on Mon, 2012-02-06 05:58.

nice in theory

But I know of someone at Company A who buys $xM of IT service from Provider B. Provider B buys 10 times that dollar amount of various (non-IT) services from Company A. It is not a reciprocal deal, but the unwritten understanding is not to beat them up for crappy service quality, and to accept a Swiss cheese contract.

Taken out of this poor fellah's hands.

Submitted by Vespasian on Wed, 2012-02-08 15:55.

Thank you

Thanks very much to the IT Wizard, the skep and everyone else for your contributions.

I've got some questions in response to the post but don't want to go off topic, so I will ask them in other threads.

It sounds like I need to ruffle some feathers about the contractual obligations of 3rd parties that we have with management; I have but a little voice though.

Regards

Vespasian

Submitted by Vespasian on Mon, 2012-02-13 12:16.

report from 3rd party..

Hi Skeptic,

Just wondering what this report is that is requested by Supplier Management. Is it like a Major Problem Report, but just to do with the way the 3rd party handled the incident and problem? Does that report have a specific ITIL name?

Cheers

Submitted by skeptic on Mon, 2012-02-13 20:58.

please explain

It's a "please explain": what went wrong - including what went wrong with your procedures - and what you have done to ensure it won't happen again so we can have confidence to continue doing business with you.

With an important supplier I'd do it face to face - I have done so. They tried to blame a technical fault - IT people always do. Under interrogation we finally got them to the realisation that they had f***ed up for two reasons: their guy did lazy research, and they didn't give our SAN upgrade the priority it deserved because they hadn't even considered the impact of it failing, even though we went to great lengths to explain that. That is a very different explanation from "bad firmware from the manufacturer". It was a learning experience for our supplier, whose service was improved as a result.

We told them it was "part of our Major Problem Review" but that was over. It was part of our Supplier Management.

Submitted by Vespasian on Mon, 2012-02-13 09:41.

Root cause without a problem ticket

Thanks again for the response.

Regarding conducting root cause without a problem ticket: if ITIL says "Problem = unknown root cause of one or more existing/potential incidents", I'm a bit confused as to how we are allowed to conduct root cause on an Incident ticket. I didn't think we could use incident tickets to track root cause (and hence problems).

Are you saying that for low impact incidents you can do RCA and update the incident ticket with the progress and eventual outcome? On our ticketing system, the incident ticket can be put into a 'resolved' state, and then it can go into an 'RCA' state, but there has been a whole debate around strictly keeping RCA with problem, and hence do RCA on a separate problem ticket.

This might need a separate thread?

Submitted by skeptic on Mon, 2012-02-13 21:04.

a problem is a problem

It already has a separate thread :) In fact, several.

PERSONALLY I think a problem is a problem. You should open a Problem ticket for every underlying cause.

Other folk think "you can do RCA and update the incident ticket with the progress and eventual outcome" and only sometimes - under ill-defined circumstances - open a separate Problem ticket. And ITIL appears to back them up. I think it's rubbish. You might as well not have a separate Problem at all (yes Aale, I hear you).

Submitted by Vespasian on Tue, 2012-02-14 14:48.

You make sense. And I suppose ITIL want to keep things vague so they can clarify things later as a justification for a new version and make even more money from certs and exams eh?! hehe. We will use logic here on this one I think.

Thanks

Submitted by Wraith on Wed, 2012-02-15 03:03.

Every Incident record has a Problem record

It's possible to argue that incident and problem records are simple artefacts of the way in which any organisation has chosen to implement their incident and problem processes; nevertheless, there's no simple getting away from the reality that every (ITIL Definition) Incident has an underlying (ITIL Definition) Problem.

To clarify: I agree with the Skeptic. Every Incident record requires a corresponding Problem record.

Those who have difficulty determining what should be logged and when are confused and haven't separated the processes sufficiently to understand the points of interface between them. Incident records are managed by Incident management. Problem records are managed by Problem management.

Every time an incident is resolved successfully, a solution is applied. If that solution is a new one which the support staff have generated spontaneously, it requires analysis to determine if, for example, the solution is temporary (a workaround) or permanent. In other words, do we have a resolved problem or a known error? The solution may need to be examined for cost, practicality and potential unintended consequences.

The task of performing that analysis falls to problem management and the responsibility for maintaining that solution, determining the impact of the ongoing incidents which arise from this problem and justifying the resources necessary to investigate and implement a more permanent solution also rests with them.

Technical staff document their diagnostics and attempts to resolve the incident in the original record. The incident record is a record of the progression of the incident process. In other words, the management of the symptoms of the incident. The problem record is a record of the progression of the problem process. In other words the management of the underlying root cause of the incident.

Thus, Problem records are solely created, updated and maintained by the Problem Management team. That team is responsible for sifting through the original incident record to determine the salient facts which are fed into the analytical (Kepner-Tregoe etc.) stage. In extreme cases, the Problem management team may have to sift through dozens, hundreds or thousands of instances of these incidents to extract the data required.

Now, while every incident requires a corresponding Problem record, it does not require a corresponding UNIQUE problem record. Incidents which are clear repetitions of an earlier incident merely need to be linked to the original Problem record.

Once you shift your thinking from wading through a morass of incidents to simply linking Incident and Problem records together, the reporting which becomes available allows you to clearly demonstrate which Problems are causing you the most pain.

A few final points:

* Incidents without a Problem record cannot be closed. This enforces the following-

* Resolved incidents which have no corresponding problem record should be routed to Problem Management who will create a Known Error.

* Unresolvable incidents become open Problem records. If an incident cannot be resolved at all, this information is relevant to the urgency of the Problem. A problem record linked to 1,000 unresolved incidents which continue to impact customers will gain rapid attention as the "minutes of impact" metric climbs.
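
A minimal Python sketch of the linking rules above, for illustration only. The class and field names (`Problem`, `Incident`, `impact_minutes`, `rank_by_pain`) are assumptions made up for this example, not the schema of any real ticketing tool. It shows three things: repeat incidents linking to one shared Problem record, an incident refusing to close while it has no Problem linked, and a "minutes of impact" ranking of Problems.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Problem:
    problem_id: str
    summary: str
    # Back-references are excluded from repr/compare to avoid recursion.
    incidents: list["Incident"] = field(
        default_factory=list, repr=False, compare=False
    )

    def minutes_of_impact(self) -> int:
        """Total customer impact across every linked incident."""
        return sum(i.impact_minutes for i in self.incidents)


@dataclass
class Incident:
    incident_id: str
    impact_minutes: int
    closed: bool = False
    problem: Optional[Problem] = field(default=None, repr=False, compare=False)

    def link(self, problem: Problem) -> None:
        """Link to an existing Problem; repeat incidents share one record."""
        self.problem = problem
        problem.incidents.append(self)

    def close(self) -> None:
        """Refuse to close until a Problem record is linked."""
        if self.problem is None:
            raise ValueError(f"{self.incident_id}: no Problem record linked")
        self.closed = True


def rank_by_pain(problems: list[Problem]) -> list[Problem]:
    """Order problems by minutes of impact, worst first."""
    return sorted(problems, key=lambda p: p.minutes_of_impact(), reverse=True)


if __name__ == "__main__":
    san = Problem("P-1", "SAN firmware fault")
    login = Problem("P-2", "Login timeout")
    for n, minutes in enumerate([30, 45, 20]):
        Incident(f"I-{n}", minutes).link(san)
    Incident("I-9", 10).link(login)
    for p in rank_by_pain([login, san]):
        print(p.problem_id, p.summary, p.minutes_of_impact())
```

Whether the close rule should be a hard block, as here, or only a warning is exactly the judgment call this thread is arguing about.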

Submitted by skeptic on Wed, 2012-02-15 05:33.

a problem gone is a problem solved

I love your summation except that I never actually said "Every Incident record requires a corresponding Problem record." In theory, I suppose, yes. Pragmatically, no.

If the user is happy to close the incident and we have restored service with no idea what actually happened (something that happens many times a day) then the incident is closed, period. We may never know what happened.

If similar things happen often enough I trust the support staff to recognise a pattern and open a problem.

I think you are being too idealistic in setting hard and fast rules. Just like a GP saying "take 2 aspirin and call me back in the morning", the service desk may reboot or restart something and the issue goes away. Often the user has got themselves in a tangle, or there may not actually be anything wrong at all: the IT equivalent of hypochondria.

In the hurly burly of the usual understaffed service desk, a problem gone is a problem solved I say :)

Submitted by aroos on Wed, 2012-02-15 14:27.

Unlearn ITIL

All this is crazy ITILspeak. I translate it to plain language.

1 A customer has a problem with service X. Customer service solves it by any means as soon as possible.
2 The customer problem may have revealed a fault. It must be fixed asap.
3 There may be a risk that the customer problem may reoccur. That risk needs to be managed.

Three tickets if necessary. Different processes, probably different teams. All stages may need PROBLEM SOLVING but nobody needs Problem Management.
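
As a rough illustration of the three-ticket idea above, here is a small Python sketch. Everything in it is an assumption made for the example: the `TicketKind` names, the `tickets_for` helper and the team labels are hypothetical, not part of ITIL or of any tool mentioned in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class TicketKind(Enum):
    CUSTOMER_PROBLEM = "customer problem"  # restore service for the customer
    FAULT = "fault"                        # fix what is actually broken
    RISK = "risk"                          # manage the chance of recurrence


@dataclass
class Ticket:
    kind: TicketKind
    summary: str
    team: str  # hypothetical owning team, for illustration only


def tickets_for(summary: str, fault_found: bool, may_recur: bool) -> list[Ticket]:
    """Always raise the customer ticket; add fault and risk tickets if needed."""
    tickets = [Ticket(TicketKind.CUSTOMER_PROBLEM, summary, team="customer service")]
    if fault_found:
        tickets.append(
            Ticket(TicketKind.FAULT, f"Fix fault behind: {summary}", team="engineering")
        )
    if may_recur:
        tickets.append(
            Ticket(TicketKind.RISK, f"Manage recurrence of: {summary}", team="risk owner")
        )
    return tickets


if __name__ == "__main__":
    for t in tickets_for("Service X unavailable", fault_found=True, may_recur=True):
        print(t.kind.value, "->", t.team)
```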

If you got that, it was your first step in unlearning ITIL. Cheers, it's worth -5 APMG points and all free. Should I also create a free certificate?

Aale

Submitted by skeptic on Thu, 2012-02-16 00:41.

unlearning

Aale, you are free to use a different terminology than ITIL if you like but I don't see that as "unlearning".

Sure, ITIL is fuzzy about when to create a ~~problem~~ fault ticket from an ~~incident~~ problem, and sure, it is weak in systematically describing risk management, but it does recognise a risk register and the need to deal with risks, so I can't see anything you have said that is un-ITIL.

Submitted by aroos on Thu, 2012-02-16 08:50.

sorry, no points

Rob,
There are three concepts to ITIL's two. Customer problem and fault are different animals. In ITIL, they are just incidents. Fixing faults is not problem management. Try harder ;)

Submitted by skeptic on Thu, 2012-02-16 09:32.

inExpert

What the hell is (reactive) problem management if it isn't fixing faults? Perhaps the Finnish translation is different. Or maybe I'm an unITIL InExpert.

Submitted by aroos on Thu, 2012-02-16 16:45.

What is the problem

Fixing faults is usually a simple routine activity which must be done urgently. You think it makes sense to call it Problem management?

Submitted by skeptic on Thu, 2012-02-16 18:25.

exact match

Yes. The problem process exactly matches what you need to do to fix a fault.

Actually, you know, I'd prefer there were a Fault entity, and maybe a separate process. Unlike you, I don't see that as justification for abandoning ITIL. The Incident/Problem model is good enough.

Tracking problems as incidents is a big issue for service desks. It confuses front office with back office, tries to do 2 processes at once off the one entity, and screws the stats.

Submitted by aroos on Sat, 2012-02-18 09:31.

Good enough for what

Rob,
please print that previous comment, frame and hang it on your wall. It is classic. Let me rephrase it in English:

Tracking faults and customer problems in a single process is a big issue for service desks. It confuses front office with back office, tries to do 2 processes at once off the one entity, and screws the stats. Yes, we agree completely.

You use a seriously flawed model so that you would not have to abandon holy ITIL? Your comment is a proof of why we need to unlearn ITIL. As I said, it is not easy. Remember, you don't have to abandon everything, just cut the bad parts away.

Submitted by Wraith on Wed, 2012-02-15 20:08.

That's a tricky edge case which requires a bit of consideration.

In an ideal world, your service desk staff filter out issues which can be tracked to user misunderstanding and lodge them as "Queries" rather than incidents.

If the service desk reboots anything and solves the user's incident, that's fine - what you have there is a clear Known Error which has "reboot" as a workaround. You don't need to do anything with these except record them and tag them appropriately. If they hit critical mass, you investigate and perform root cause analysis. From my perspective "reboot" is usually an inadequate workaround because it occupies both service desk and user time. If I'm tracking these reboots, I can do something about them if they're proving particularly onerous.

Failing to capture these issues throws away data which - in hindsight - could prove to be useful or critical later on. Any root cause analysis is going to live or die on the quality of the data you feed it and knowing WHEN an issue first arises and which users first experienced it is damn useful. If the issue never recurs, then we simply don't care. The resulting Problem record can languish until the end of time without anybody giving a damn.

As long as you provide the Service Desk with tools capable of quickly matching user symptoms to Problem records, the impact on workflow is minimal and the Problem records themselves can provide the Service Desk with analytical steps, data capture requirements and solutions. Constructed correctly, this allows your KEDB to educate your Service Desk as part of their normal workflow.
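
To make that concrete, here is a very rough Python sketch of matching a caller's description against Known Error records and flagging a record for root cause analysis once it hits "critical mass". It is an assumption-laden toy: the keyword-overlap matching, the `KnownError` fields and the threshold of 10 are invented for illustration, and a real KEDB search would be far more capable.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownError:
    problem_id: str
    keywords: frozenset  # symptom words that identify this known error
    workaround: str
    hits: int = 0        # how many incidents have matched so far


def match_symptoms(description: str, kedb: list) -> Optional[KnownError]:
    """Return the entry sharing the most keywords with the description."""
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    best, best_score = None, 0
    for entry in kedb:
        score = len(words & entry.keywords)
        if score > best_score:
            best, best_score = entry, score
    if best is not None:
        best.hits += 1  # record the occurrence so the data is not thrown away
    return best


def needs_rca(entry: KnownError, critical_mass: int = 10) -> bool:
    """True once enough incidents have matched to justify investigation."""
    return entry.hits >= critical_mass


if __name__ == "__main__":
    kedb = [
        KnownError("P-7", frozenset({"outlook", "frozen"}), "Reboot the PC"),
        KnownError("P-8", frozenset({"printer", "jam"}), "Clear tray 2"),
    ]
    found = match_symptoms("Outlook is frozen again!", kedb)
    if found:
        print(found.problem_id, "workaround:", found.workaround)
```

The point of counting hits even for trivial workarounds like a reboot is the one made above: the count is what tells you when a cheap workaround has become expensive.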

> In the hurly burly of the usual understaffed service desk, a problem gone is a problem solved I say :)

I wish. Users only give the Service Desk a limited number of chances to solve their issue. If they can solve it themselves by rebooting, or if you fail to resolve their issue the first few times they call you, then they stop calling you and telling you there's a problem and start complaining to their peers and their management instead. Great for the Service Desk - terrible for the reputation of the IT organisation.

Submitted by skeptic on Thu, 2012-02-16 00:37.

in the real world

James correctly says one of the problems with this blog is that it is hard to tell when I have my tongue in my cheek, which is why I use the ":)" symbol. I was being facetious about "a problem gone is a problem solved I say".

Nevertheless, I never met a service desk that wasn't understaffed, except one miraculous one Chris and I visited. I'm well aware of and agree with all the theoretical considerations you cite. Meanwhile, in the real world there is only so much you can capture and deal with - that was my point.