Schrödinger’s reboot

I'm no expert on probability or statistics. In fact, I struggled to pass the paper at university. So this is a layman's view from observation, but I think I have a handle on a couple of principles. How's this?

Suppose you have a transaction you perform which is 95% successful. Maybe a client has an issue with their desktop and 95% of the time a reboot clears it. Is there any value in the service desk saying to the client “you have a 95% probability that this will fix it: please try rebooting”? If it doesn't work, they say “oh sorry, you were in the 5%”. This is of no value to the person. For any single instance of the transaction, there is no 95% probability. Either it works or it doesn't. The probability of it working is either 100% or 0%, and the only way to know which is to perform the experiment. It is Schrödinger’s reboot: only by observing will we know. That observation is the only useful information we have about any one execution of the procedure. If we know we are performing the procedure 100 times a week, then the 95% is useful. We know that for about 5 of those times we had better be ready with further action, so it gives us an idea of volumes. But for any one transaction, it is useless. Worse than useless, because it deludes us into expecting it to work. Not working is just as possible as working, before you try.
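
If you want to play with the numbers yourself, here is a rough Python sketch of the same point (the 95% fix rate and 100 tickets a week are just the figures from the example above, not real data): any one ticket simply comes out fixed or not fixed, but over a week's worth of tickets the rate does tell you roughly how many escalations to brace for.

```python
import random

FIX_RATE = 0.95          # the "95% of the time a reboot clears it" figure from the example
TICKETS_PER_WEEK = 100   # the hypothetical weekly volume from the example

random.seed(1)  # fixed seed so the sketch is repeatable

# One single ticket: the outcome is simply fixed or not fixed.
single_ticket_fixed = random.random() < FIX_RATE
print("This ticket:", "fixed" if single_ticket_fixed else "not fixed")

# A week's worth of tickets: here the 95% figure becomes useful,
# because it predicts roughly how many will need further action.
failures = sum(random.random() >= FIX_RATE for _ in range(TICKETS_PER_WEEK))
print(f"Out of {TICKETS_PER_WEEK} tickets, {failures} still needed further action")
```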

Even worse is when it is a one-off transaction that we have never executed before, such as migrating to a new datacenter or upgrading a product. When you calculate a probability of success for something you have never done before, you are only one step away from voodoo. The exercise of calculating the number is itself useful for ensuring we have thought of everything, but the probability of success is either 100% or 0%, and we will not know until we try. So when people say they have calculated a 98% probability of success for the data migration, we had damn well better be ready for it to fail, because effectively the probability of success or failure beforehand is 50/50. We have no more real information than that. If we don't build our systems to be resilient to failure, we have only ourselves to blame. Every time we change anything we gamble everything. The Knight Capital story makes this clear: they destroyed the organisation because of a simple error in a list of server names.

So the exercise of calculating probabilities is mildly interesting, but the resulting numbers are not. We should make our decisions based on the confidence of those involved, the potential impact, and most of all the resilience of our systems to withstand that impact. What are the chances?

Ok, you probability experts can @ me now, but that’s how I see it.
