The up-time myth: Why up-time does not mean availability
When I came back from lunch yesterday and wanted to prepare for the next meeting, I noticed that I couldn't access my emails, my Microsoft Teams client was unable to connect to the outside world and accessing a website was beyond slow.
Just a few minutes later, our IT team reported an issue at our data centre provider. Our servers in the data centre were all up and running; we "just" could no longer access them. It quickly became apparent that a DDoS attack had largely disabled the provider's network.
After about 3 hours of downtime, which we bridged with "analogue activities" (we can implement the Clean Desk Policy in our office now!), we were able to access our resources again. The uptime of the IT infrastructure - an important KPI in many companies and IT departments - was unchanged after the failure, but the availability of applications and services suffered.
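To put numbers on that difference, here is a minimal sketch in Python. The 30-day reporting period and the names `Outage` and `uptime_and_availability` are my own assumptions for illustration; only the roughly three hours of unreachable servers comes from the incident above.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    host_down: bool  # True only if the server itself was off or rebooting


def percent_ok(period_minutes, bad_minutes):
    """Share of the period, in percent, that was not affected."""
    return 100.0 * (period_minutes - bad_minutes) / period_minutes


def uptime_and_availability(period_minutes, outages):
    # Uptime only counts the minutes in which the host itself was down.
    host_down = sum(o.minutes for o in outages if o.host_down)
    # Availability counts every minute in which users could not reach the service.
    unreachable = sum(o.minutes for o in outages)
    return (percent_ok(period_minutes, host_down),
            percent_ok(period_minutes, unreachable))


period = 30 * 24 * 60  # assumed 30-day reporting period, in minutes
uptime, availability = uptime_and_availability(
    period, [Outage(minutes=180, host_down=False)]  # the 3-hour network outage
)
print(f"uptime {uptime:.2f}%, availability {availability:.2f}%")
# prints: uptime 100.00%, availability 99.58%
```

The servers never stopped, so the uptime figure stays perfect, while the number that reflects what users experienced drops.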
100% Uptime Is Irrelevant Nowadays
Statements like this one are heard again and again these days. They come not from the admins or operators of data centres, but from application administrators. In times of high availability, distributed systems and container solutions, the administrator of a particular application no longer has to rely on a single piece of hardware. Much more important is that the service itself, i.e. the connected business process, is available and operational at all times.
The fabled 100% uptime is, and always has been, an unattainable objective. With high-availability solutions, an application or service can remain available even while hardware updates are installed, because it can be moved dynamically and without interruption to another hardware system. The physical component being updated, however, still requires a restart, which inevitably means downtime for that component (and with it < 100% uptime).
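The same idea can be shown as a small simulation. This is a sketch under simplified assumptions: two hypothetical hosts, ten equal time slots, and a service that counts as available whenever at least one host is up. Real failover has switch-over times that this ignores.

```python
def host_uptime(schedule):
    """Percentage of time slots in which a single host was up."""
    return 100.0 * sum(schedule) / len(schedule)


def service_availability(schedules):
    """Percentage of time slots in which at least one host was up."""
    slots = list(zip(*schedules))
    return 100.0 * sum(any(slot) for slot in slots) / len(slots)


# True = host up, False = host restarting for a hardware update.
# The two restarts are deliberately scheduled in different slots.
host_a = [True, True, False, True, True, True, True, True, True, True]
host_b = [True, True, True, True, True, True, False, True, True, True]

print(host_uptime(host_a), host_uptime(host_b))  # prints: 90.0 90.0
print(service_availability([host_a, host_b]))    # prints: 100.0
```

Neither host reaches 100% uptime, yet the service that runs on them does, as long as the maintenance windows never overlap.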
Effects on Monitoring
Besides the monitoring of hardware and components, the monitoring of complex, coherent business processes is becoming more and more important. The administrator of an email system may no longer need to know how many megabytes of RAM the hardware is currently using. For them, it is far more interesting whether the mailboxes are available, whether the clients can access the server fast enough, whether the POP and SMTP services are running, and whether the Active Directory connection is stable.
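A service-level check of this kind can be sketched with nothing but the Python standard library. The function names, the one-second latency threshold and the host name in the usage note are assumptions for illustration. A successful TCP connection is also only a rough stand-in for "the service is running", since it does not speak the protocol itself.

```python
import socket
import time

# Standard ports for the two mail protocols mentioned above.
MAIL_PORTS = {"SMTP": 25, "POP3": 110}


def check_tcp_service(host, port, timeout=3.0, max_latency=1.0):
    """Try to open a TCP connection and measure how long it takes."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            latency = time.monotonic() - start
    except OSError:
        # Refused, timed out, or the name did not resolve: not reachable.
        return {"reachable": False, "latency": None, "ok": False}
    # Reachable, but only "ok" if the connection was fast enough.
    return {"reachable": True, "latency": latency, "ok": latency <= max_latency}


def check_mail_services(host):
    """Run the TCP check against each mail port of the given host."""
    return {name: check_tcp_service(host, port)
            for name, port in MAIL_PORTS.items()}
```

Calling `check_mail_services("mail.example.com")` (a placeholder host name) returns one result per protocol. A server whose hardware metrics all look healthy can still come back with `"ok": False` here, and that is exactly the situation an uptime figure hides.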
This requires that the service processes are clearly defined and implemented as thoroughly and transparently as possible in the monitoring environment. Find out more about the PRTG Business Process Sensor.
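To illustrate what "clearly defined" can mean, here is a generic sketch of rolling individual component states up into one process status. It is not the PRTG implementation. The group names, the thresholds and the rule that the worst group determines the overall status are all assumptions I made for the example.

```python
from enum import IntEnum


class Status(IntEnum):
    UP = 0
    WARNING = 1
    DOWN = 2


def group_status(component_up, warn_below=100.0, down_below=50.0):
    """Status of one group from the share of its components that are up."""
    up_share = 100.0 * sum(component_up) / len(component_up)
    if up_share < down_below:
        return Status.DOWN
    if up_share < warn_below:
        return Status.WARNING
    return Status.UP


def process_status(groups):
    """The business process is only as healthy as its worst group."""
    return max(group_status(states) for states in groups.values())


email_process = {
    "mail servers": [True, False],      # one of two redundant nodes is down
    "directory service": [True],
    "smtp/pop endpoints": [True, True],
}
print(process_status(email_process).name)  # prints: WARNING
```

One failed node in a redundant pair degrades the process to a warning instead of an outage, which is much closer to what the users of the email system actually see.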
Have an Eye on Existing SLAs
For SLAs, for instance from data centre operators or web hosting providers, I recommend taking a close look at how exactly uptime is defined. Many providers define uptime purely as the availability of the hardware, not as the availability of services or processes. I can imagine that your contracts also offer room for improvement in this regard. Talk to your suppliers and minimize risks wherever it makes sense!
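When reading such a contract, it helps to translate the percentage into minutes. The following sketch does that. The SLA levels and the 30-day period are example values I chose, not figures from any real contract.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Downtime budget, in minutes, that an SLA percentage leaves per period."""
    return period_days * 24 * 60 * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} minutes per 30 days")
# prints:
# 99.0% -> 432.0 minutes per 30 days
# 99.9% -> 43.2 minutes per 30 days
# 99.99% -> 4.3 minutes per 30 days
```

A three-hour outage like the one described at the beginning would use up the monthly budget of a 99.9% SLA about four times over, but only if the SLA counts network reachability and not just running hardware.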