Great paper on failure of complex systems

It is not often you read something that completely changes the way you look at IT. This paper, How Complex Systems Fail, rocked me. Reading it made me completely rethink ITSM, especially Root Cause Analysis, Major Incident Reviews, and Change Management.

It dates from 1998!! Richard Cook is a doctor, an MD. He seemingly knocked this paper off on his own; it is a whole four pages long, and he wrote it with medical systems in mind. But that doesn't matter: it is deeply profound in its insight into any complex system, and it applies head-on to our delivery and support of IT services.

"complex systems run as broken systems"

"Change introduces new forms of failure"

"Views of ‘cause’ limit the effectiveness of defenses against future events... likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly."

"Failure free operations require experience with failure."

Read this paper. And READ it: none of this 21st Century 10-second-attention-span scanning. READ IT HARD. Blow your service management mind.

Does this change any of your ideas of ITSM? Should any of these ideas be in ITIL?

(My apologies to whoever sent the link to me. This old brain has forgotten and LinkedIn makes it almost impossible to find your message again! Thanks. Remind me and I'll credit you)

[For more on this see this later blog post]

For those of you rendered so feeble-minded by the internet that you can no longer read, here's a video.

Comments

normal accident theory

See also "normal accident theory" (thanks Michael Krigsman) The related book is cited in a comment above.

A bloody case study from the building trades

Getting my basement remodeled. I was talking with my contractor Chris about workplace safety, and he told me that the two worst building-site accidents he'd ever seen were strangely similar and, I think, troublesome to understand in terms of a single root cause.

In both cases, a worker was cutting lumber with a circular saw. When cutting a large amount of lumber, the sawdust tends to accumulate and make the spring-driven blade guard stick; i.e., when the saw is removed from contact with the stock being cut, the blade is left unguarded and spinning.

In both cases, the worker stepped on a nail concealed in the sawdust, recoiled from the work, pulled the saw away (blade guard not deploying quickly due to accumulated sawdust), and was injured by the saw blade, in one case lifting his leg into it, and getting cut to the bone. (Errk...)

Root cause? Improperly designed blade guard? Lack of recovery of excess nails, i.e. proper work site cleaning & preparation? More frequent cleaning of excess material from blade? Improper footwear? All 4? And what would people from different cultures say?

Wonder what the OSHA reports said. I'm sure both accidents generated lots of paperwork around these questions.

Charles T. Betz
http://www.erp4it.com

instinct

My instinct is that where there are four fairly equal causes, that implies there is a deeper one they all point back to. Safety culture? Safety officer?...

Is culture a "cause"?

Can we characterize something as nebulous and overarching as culture as a "cause"? How could we falsify such a hypothesis?

And I am not sure that the lack of a safety officer should be seen as "cause." Sounds like a wasteful QA role; just like quality, safety should be embedded.

Perhaps we could say that the search for a single cause in the first incident would be misguided.

But perhaps the single cause of the second incident was failure to embrace the 4 lessons of the first.

All four causes should have been understood and addressed: better-designed saws, safety-conscious work procedures (e.g. training on the hazards of sawdust buildup and nails on the ground), and required heavy footwear.

In terms of common industrial practices, and what is practical to enforce, I'm homing in on footwear a little more than the others. Perhaps the potential to effectively mitigate a factor elevates it in terms of our attention - but strictly speaking, the ability to mitigate should not privilege a factor in a multi-causal situation.

Theory vs. practice?

Charles T. Betz
http://www.erp4it.com

Homer is the answer - remove him

'Safety culture' as a significant factor was exposed by an excellent episode of The Simpsons, in which Homer was promoted to Safety Officer of the nuclear power station in which he works. This had a dramatic effect and the safety record immediately improved. Needless to say it wasn't the imposition of Homer's new safety culture that improved things, but the fact that his promotion meant he was no longer on the 'shop floor' causing all the accidents. Homer was the root cause, and once he was removed the accident rate dropped.

The sequence of events described above does have something of the ring of a Homer-type accident. Perhaps the operator was the root cause. What are the chances he himself was responsible for the stray nail, for cleaning the machine, and for wearing proper footwear?

Instinct & safety

Here's an interesting one. The NTSB have done a study into the impact on safety of introducing "glass" cockpits into general aviation, compared to old-fashioned analogue instruments.

http://www.ntsb.gov/Pressrel/2010/100309.html

The number of accidents went down, which you might expect, but those accidents that did happen were more likely to be fatal.

Lots of factors need to be taken into account; for instance, a lower proportion of glass-cockpit aircraft are used for pilot training, which is when a lot of accidents happen.

Ackoff

Russell L. Ackoff has passed away at 90 years young, at 4 p.m. on Oct. 29, 2009.

He was one of the greatest organizational thinkers of the last 100 years. Condolences to the systems thinking community, his students, readers and colleagues.

Nice paper - thanks for

Nice paper - thanks for posting! Yes, it's very apposite to link this to IT. It also reminds me very much of the work of James Reason (1997) and his 'Swiss cheese' model of accidents and failures: i.e. there are multiple points of failure (holes) that all have to line up for failure to occur.
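As a rough illustration of that "line up" idea (a hypothetical sketch, not anything from Reason's book or Cook's paper, and the 10% per-layer hole probability is invented), a few lines of Python can simulate independent layers of defence and count how often a failure path gets all the way through:

    import random

    def holes_align(hole_prob, layers):
        """One trial: does every defensive layer happen to have its hole open?"""
        return all(random.random() < hole_prob for _ in range(layers))

    def accident_rate(hole_prob, layers, trials=100000):
        """Fraction of trials in which the holes line up right through."""
        return sum(holes_align(hole_prob, layers) for _ in range(trials)) / trials

    for layers in (1, 2, 3, 4):
        # with a 10% hole per layer, the aligned rate should come out near 0.1 ** layers
        print(layers, "layers:", accident_rate(0.10, layers))

Each added layer cuts the aligned-hole rate by roughly another factor of ten, which is the intuition behind defence in depth, and behind Cook's point that catastrophe requires multiple failures rather than one big one.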

Another recommendation

Bignell & Fortune's "Understanding Systems failure" Open University.

Dated case studies but very readable. I presume out of print now.

second-hand

A few available second-hand: Understanding Systems Failures

another recommendation

Here's another one. Check out Phil Simon's first book: Why New Systems Fail: Theory and Practice Collide. It addresses many of the same topics.

http://www.amazon.com/Why-New-Systems-Fail-Practice/dp/1438944241/ref=sr_1_1?ie=UTF8&s=books&qid=1257517726&sr=1-1

Excellent Paper

Very thought-provoking and well worth a read. The section "Hindsight biases post-accident assessments of human performance" made me think immediately of the blame-seeking postmortems that happen after failures in the child protection systems in the UK. A baby was the victim of unspeakable cruelty recently in Haringey, London, and afterwards the resulting enquiry castigated the professionals involved in the child's care. "It seems that practitioners “should have known” that the factors would “inevitably” lead to an accident." could be a quote from the inquiry report. I wonder how well informed the authors of such reports are about the difficulties of rigorous and effective analysis of system failures. I am doubtful.

Alex Jones

This is a great paper

Way back in the late 70s, when I was a young technician and thought I knew it all, I read a book that changed my thinking forever: SYSTEMANTICS: How Systems Really Work and How They Fail, by John Gall. (now titled The Systems Bible) http://en.wikipedia.org/wiki/Systemantics

This paper reinforces and expands on the lessons in that book. What a wonderful reminder that complexity itself creates unique problems.

Agreed - an excellent document

Often when we have a major incident, there is a single-point-of-failure elimination campaign, which is kind of humorous. During the almost two dozen years I have been involved in such things, very rarely is there such a thing: the elusive SPOF, the one thing we can fix to prevent *all* future failures, the silver bullet. With very few exceptions, major incidents involve at least five failures, in my experience. Trivial failures only cause trivial incidents.

I think these things, in the context of complex systems, are a bit like Black Swans (see the book by Nassim Nicholas Taleb). A Black Swan is by definition unpredictable. So, in theory, most service disruptions in complex systems are NOT Black Swans: they would be predictable if you had all of the data. But large corporate IT systems are too dynamic for anyone to have all of it, so reliable prediction is impossible.

Arguably, specific human behaviour is not predictable: will a given operator react in a predictable manner in the timeframe required? We may know what the documentation says, but we cannot account for all possible human errors.
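A hypothetical back-of-the-envelope simulation (the component count and fault probability below are invented, not measured from anything) makes the "at least five failures" point concrete: if latent faults come and go at random across an estate, something is broken somewhere on most days, but days on which five or more faults are active at once are rare.

    import random

    COMPONENTS = 100       # invented size of the estate
    FAULT_PROB = 0.01      # invented chance a component has an active latent fault today
    MAJOR_THRESHOLD = 5    # "major incidents involve at least five failures"
    DAYS = 10000

    some_fault, enough_for_major = 0, 0
    for _ in range(DAYS):
        active = sum(random.random() < FAULT_PROB for _ in range(COMPONENTS))
        if active > 0:
            some_fault += 1
        if active >= MAJOR_THRESHOLD:
            enough_for_major += 1

    print("days with at least one latent fault:", some_fault, "of", DAYS)
    print("days with enough coincident faults for a major incident:", enough_for_major, "of", DAYS)

It also hints at why an identical major incident is unlikely to recur: the particular combination of faults active on any given day keeps changing, which is exactly Cook's point about the pattern of latent failures changing constantly.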

CMDB is like building giant temples to the gods

It seems to me CMDB/CMS is part of the denial of IT's imperfection and unpredictability; it is a desperate attempt to get control over the uncontrollable. CMDB is like building giant temples to the gods to make the crops reliable.

Normal Accidents

People wanting to read more on complex systems and their failures will find the book "Normal Accidents: Living with High-Risk Technologies" by Charles Perrow of interest. Written in layman's terms, it is very interesting. ISBN 0-691-00412-9.

More Deming than Maslow

"safety cannot be purchased or manufactured..."
Replace 'safety' with 'quality' and you have a quote from Deming.
This was my first exposure to "hindsight bias" and I appreciated having some dialog on the role of people in accidents to support that term.

Cook's Book

Resilience Engineering
ISBN: 978-0-7546-4641-9
This book appears to contain a paper by Cook entitled "Resilience engineering: chronicling the emergence of confused consensus".
