The most important IT monitoring tools are those that measure the end user experience

It has always seemed to me that most IT monitoring and measuring tools are very self-serving. They look at the world from the internal IT silo perspective. In ITSM terms they are mildly interesting diagnostic tools for incident and problem resolution, but in terms of service level measurement the only really useful tools are the ones that measure the end user experience.

I know these tools exist - there are plenty of them, the ones that measure response times from the desktop, that treat IT as a black box service provider.

They measure response time and availability. How do you define availability? I think the best definition is that a service is unavailable when response times are greater than some agreed value. This includes the case where response times are infinite, which is too often the IT definition of a service not being available, but as far as a user is concerned a very slow response is generally as useless as none at all.
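As a rough illustration of that definition, here is a minimal Python sketch that counts a sample as available only when its response time is within the agreed value. The function name, the sample data and the 2-second threshold are all assumptions made for the example, not part of any particular tool.

```python
import math


def availability(response_times_s, threshold_s):
    """Return the fraction of samples answered within the agreed threshold.

    A sample of None or math.inf stands for a request that never completed,
    so "down" and "too slow" are counted the same way.
    """
    if not response_times_s:
        raise ValueError("no samples to measure")
    within = sum(
        1 for t in response_times_s if t is not None and t <= threshold_s
    )
    return within / len(response_times_s)


# Hypothetical measurements in seconds: one very slow response, two failures.
samples = [0.8, 1.2, 9.5, math.inf, 1.0, None, 0.9, 1.1]
print(f"{availability(samples, threshold_s=2.0):.1%}")  # prints 62.5%
```

Five of the eight samples come back within 2 seconds, so the service is reported as 62.5% available, even though it only failed outright twice.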

These are the only metrics that really map to what the user sees. All the network monitors and server probes and traffic agents and database monitors and snorage (sorry storage has never excited me) consoles... they are toys for the geeks but they don't tell us much about the service.

It is almost always impossible to consolidate their information up into a true depiction of the service. I've spoken before about the boundary problem: no matter how good your IT, there is usually something you can't measure adequately somewhere in the chain delivering the service. Even when it is possible, it's very hard.

True, the user experience tools aren't perfect either. It generally is not practical to put an agent on every desktop and measure every single user's experience. But I believe a representative sampling is enough.

Certainly users believe it. There is a sigh of relief when they finally see metrics that measure their world instead of the arcane insides of the IT beast.
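One way to pick a representative sample, sketched in Python under the assumption that "representative" means a few randomly chosen desktops at every site. The function name, site names and desktop names are hypothetical.

```python
import random


def pick_probe_desktops(desktops_by_site, per_site, seed=None):
    """Choose up to per_site desktops at random from each site.

    Sampling site by site keeps small branch offices from being
    crowded out by head office in the measurements.
    """
    rng = random.Random(seed)
    chosen = {}
    for site, desktops in desktops_by_site.items():
        count = min(per_site, len(desktops))
        chosen[site] = rng.sample(desktops, count)
    return chosen


# Hypothetical inventory of desktops per site.
inventory = {
    "head-office": ["ho-01", "ho-02", "ho-03", "ho-04"],
    "branch": ["br-01"],
}
print(pick_probe_desktops(inventory, per_site=2, seed=42))
```

Which dimensions to sample across (site, department, network segment) is a judgment call that depends on where your users actually sit.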

And yet when you listen to IT staff and vendors, these experience monitors don't seem to be seen as the most important IT monitoring tool there is. They are ancillary. Accessories. Plug-ins. Add-ons.

They are not. They are your lead monitoring tool. All the others are the add-ons: the drill-down tools that let you work out why the user experience is outside the bounds of the SLA.

The very first monitoring to put in is the end-user experience (OK OK closely followed by the event console). Quick wins in service level reporting. And clear guidance in prioritising problems.


Stop Herding Cats

Traditional monitoring solutions reflect traditional organizational structures, specializing in technical domains. Unfortunately, measuring end-user experience is only half the solution; when IT sucks it is critical to understand WHY as quickly as possible. This lack of monitoring intelligence results in what I refer to in a recent White Paper as The Event Management Gap.

The focus on workflow automation (Change, Incident, Request, etc.) and establishing cross-functional processes seems a bit like herding cats to me, especially when you consider that the basic monitoring infrastructure is so out of touch with the new paradigm we seek. The tribal culture of most IT organizations, along with the lack of real capability among the leading management players, contributes to this mess.

So, do you need to monitor end-user experience? Absolutely! Every business has vital business functions that are often expressed in terms of transactions -- whether they be desktop based, web based, queries, batch jobs, etc. -- so find out what these are and watch them.

More importantly, as your service infrastructure evolves to an n-tier, virtualized mess, make sure you have intelligence built into your monitor. There's nothing worse than knowing things are not performing well but not knowing why. Personally, I do not think the legacy vendors are the best place to look for this new breed of solution. Since some of the big gorillas have been mentioned I'll say I tend to favor a product from eG Innovations, but everyone should do their own homework.

So stop herding cats and consider a new monitoring paradigm. You may not find it in the usual place.

John M. Worthington
MyServiceMonitor, LLC

I couldn't agree more on

I couldn't agree more on the importance of end-user experience monitoring. I do not much agree that all the other tools are just add-ons. You hinted already that they help you (a great deal) to find the real issue. So the first ones tell you that there (objectively) is an issue; the latter should tell you why.

Also I do not agree much with your opinion about vendors not giving these tools what they deserve. Where I work we're a partner of HP, IBM and BMC. All of them have a product that does exactly this. Sometimes maybe they do not stress the importance of these enough - but it is not the company, rather it's about individual (marketing and pre-sales) people and their understanding of the technology and its possibilities to build a solution to current (and future) requirements...

At what price?

HP, IBM and BMC sell their products at a price that is not cost effective. As a result of the high price there are few customers.
The salesmen naturally do not sell or mention features to potential customers who won't be able to afford their product. My opinion is that the products are usually overpriced.

A cheap Solution

I find going out to customer sites and watching them use the system, and let off steam, is quite an effective low cost way of measuring the user experience at the desktop.

The customer detection system

There is a large measure of truth in what you say, James. If we go back to my old trusted friend, the expanded incident life cycle, which none of those expensive tools support out of the box, then you'll quickly realize that time to detection is not a big problem. Typically detection times are quick because your customers tell you, but this is the only part of the expanded incident life cycle that vendors deliver upon in their monitoring products.
The real big time wastage, and hence money, is in the diagnosis time. Something like SAA does help in this diagnosis and I'll provide a few examples which relate to workstation slow down. Infosec people usually think they are god's gift to IT. They arrogantly do things without process and stuff up operations. Invariably the ratio of self induced major incidents due to Infosec process failure is high.
* Workstation virus scans launched at peak times of 11 am.
* Patches installed first thing after a week-end.
* Vulnerability scanning over month end or over the WAN.
* DOS simulations, just because they feel like it.
When customers have issues, it is a tool like SAA that quickly provides a measure of diagnosis:
* Do all desktops have the issue?
* Is it the desktop or network?
* Is it application specific or generic?
This shortens the diagnosis time.
BTW: Is it just me that has to deal with a certain type of Infosec gene pool or are they bonkers the world over?
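To make the diagnostic questions above concrete, here is a minimal Python sketch that answers two of them (do all desktops have the issue, and is it application specific) from a set of response-time samples. It is not how SAA works; the function name, the tuple layout and the sample data are assumptions for illustration. Telling desktop from network would additionally need probes placed at different points along the path, which this sketch does not model.

```python
from collections import defaultdict


def triage(samples, threshold_s):
    """Summarise which desktops and applications are seeing slow responses.

    samples is a list of (desktop, application, response_time_s) tuples.
    """
    slow_by_desktop = defaultdict(list)
    slow_by_app = defaultdict(list)
    for desktop, app, response_time_s in samples:
        is_slow = response_time_s > threshold_s
        slow_by_desktop[desktop].append(is_slow)
        slow_by_app[app].append(is_slow)

    affected_desktops = sorted(d for d, f in slow_by_desktop.items() if any(f))
    affected_apps = sorted(a for a, f in slow_by_app.items() if any(f))
    return {
        "all_desktops_affected": len(affected_desktops) == len(slow_by_desktop),
        "application_specific": 0 < len(affected_apps) < len(slow_by_app),
        "affected_desktops": affected_desktops,
        "affected_apps": affected_apps,
    }


# Hypothetical samples: mail is fine everywhere, erp is slow everywhere.
samples = [
    ("pc1", "mail", 0.5), ("pc1", "erp", 6.0),
    ("pc2", "mail", 0.4), ("pc2", "erp", 7.2),
    ("pc3", "mail", 0.6), ("pc3", "erp", 5.5),
]
print(triage(samples, threshold_s=2.0))
```

With this data every desktop is affected but only one application is slow, which points away from the individual workstations and towards that application's own chain.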


I have been a fan of SAA (Cisco's Service Assurance Agent) for this very reason. Typically SAA can be deployed sufficiently close to the desktops to provide a reasonable representation. Like most things Cisco, SAA was acquired from someone else; in this case it was IBM, who used response time measurements in SNA in an end-to-end manner!
Somewhere the marketing department got hold of it and renamed it IP SLA. SLA is another one of those vendor-abused terms and has been misused in this context. Service Assurance remains a better description, while SLA is a misrepresentation.

measure the experienced behaviour

There's lots of them, including those that really DO sit on the desktop and measure the experienced behaviour, which to me is preferable, for perception as much as anything, or is that nonrepudiation? BMC has one, CA has one, I bet plenty of others so...

Wrong end of the telescope

This certainly rings true. And only yesterday I had an IT service provider giving me a figure for the number of people impacted by an incident based on their "monitoring the people who couldn't access the system." Oddly, that figure was very different from the one you got by subtracting the number of users who could access the system from our total user population.
