Shit happens, or how I learned to love the incident
Complex systems are by definition broken. They will always break, and sometimes they will break even when everybody did what they were supposed to. Fixing the problem won't necessarily reduce the risk of another incident.
As a fitting complement to that paper, read Malcolm Gladwell's "Blowup".
(I read it in the excellent book What the Dog Saw: And Other Adventures. Gladwell writes mind candy, but he does it brilliantly.)
To summarise the main message I took from both:
In complex systems, shit happens. And all the change control and customer focus and automated systems in the world won't stop it happening.
Someone said on LinkedIn recently that problem management is about making incident management obsolete. Not only is such thinking wrong but it is also dangerous. We can't stop incidents, including major ones.
And fixing a problem may reduce the risk of that particular major incident recurring, but it does not necessarily reduce the odds of another major incident happening. We can't ever significantly reduce our need for support. We need to be ready and rehearsed to deal with major incidents because they come like earthquakes.
And when it is over there may well be nobody to blame.
This seems a reversal of some things I have said in the past about the need for change control. I said that "shit happens" is not an excuse any more. I still believe that. Just because some incidents will remain unpreventable doesn't mean that many others can't be prevented. Just because fixing a problem in one place means higher risks will be taken elsewhere doesn't mean we shouldn't fix the problems. And just because it is impossible to stop complex systems breaking doesn't mean that there isn't negligence behind some breakages.
We need to embrace incidents as part of normal operations, not treat them as aberrations that can be eliminated. Incidents aren't deviations from some idyllic norm: they are the norm.