APAC firms need a clear MTTR strategy to combat costly downtime
Thu, 26th Oct 2023

More Asia Pacific (APAC) businesses than ever rely on software to run their operations, and they face a correspondingly growing threat from the impact of outages on their bottom lines.

New Relic's 2023 Observability Forecast reveals that APAC organisations experience more frequent high business impact outages than any other region, with 41% of respondents saying that they experience an outage once a week. When customer-facing applications and services are impacted by outages, every minute matters. However, more than half (54%) in APAC said it takes at least 30 minutes to detect these outages, while 64% said it takes an additional 30 minutes or more to resolve them.

While the research shows observability adoption is high and increasing in the region, APAC organisations continue to struggle with the significant cost of outages, spending US$500,000 or more per hour of downtime. The region also reports the highest median annual outage cost globally at US$19.07 million, more than double the figure in Europe and nearly 16 times that of North America.

Why MTTR matters
Developers and engineers often use observability to solve three key business and technical challenges: reducing downtime, reducing latency, and improving efficiency. Outage frequency, mean time to detection (MTTD), and mean time to resolution (MTTR) are common service-level metrics used in security and IT incident management.
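As a simple illustration of how these metrics can be calculated, the sketch below (in Python) computes MTTD and MTTR from a handful of made-up incident records. Definitions vary between teams; here MTTD is measured from the start of an incident to its detection, and MTTR from detection to resolution, mirroring the survey's framing of resolution time as additional to detection time. The timestamps and field names are purely illustrative.

    from datetime import datetime
    from statistics import mean

    # Made-up incident records: when each incident started, was detected, and was resolved.
    incidents = [
        {"started": "2023-10-01 09:00", "detected": "2023-10-01 09:35", "resolved": "2023-10-01 10:20"},
        {"started": "2023-10-08 14:10", "detected": "2023-10-08 14:40", "resolved": "2023-10-08 15:55"},
        {"started": "2023-10-15 02:05", "detected": "2023-10-15 02:50", "resolved": "2023-10-15 03:30"},
    ]

    def minutes_between(start: str, end: str) -> float:
        fmt = "%Y-%m-%d %H:%M"
        return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

    # MTTD: mean time from the start of an incident to its detection.
    mttd = mean(minutes_between(i["started"], i["detected"]) for i in incidents)
    # MTTR: mean time from detection of an incident to its resolution.
    mttr = mean(minutes_between(i["detected"], i["resolved"]) for i in incidents)

    print(f"MTTD: {mttd:.0f} minutes, MTTR: {mttr:.0f} minutes")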

The Observability Forecast found that APAC respondents were the least likely of any region to say MTTR had improved (61%). While respondents in Europe were the most likely to say they learned about interruptions through a single observability platform, more than half in APAC (53%) learned about outages from multiple monitoring tools, and a whopping 30% still relied on manual checks, tests, or complaints.

MTTR, or mean time to resolution, is one of the most widely used metrics in the systems reliability toolbox. Yet many developers and operations teams lack a clear vision of how to define MTTR, how to use it, and how to improve it in a consistent and sustainable way.

This lack of clarity, and the lag in MTTR improvements, threatens the bottom line by potentially disrupting the increasingly important digital customer experience, not to mention adding significant cost, risk, and complexity to the software development process.

Defining and applying MTTR
A progressive approach towards MTTR combines comprehensive instrumentation and monitoring, a robust and reliable incident-response process, and a team that understands how and why to use MTTR to maximise application availability and performance.

Incident response covers every point in a chain of events that begins with the discovery of an application or infrastructure performance issue and ends with learning as much as possible about how to prevent issues from happening again. A solid strategy for reducing MTTR ensures that teams can continue to improve even as the business and applications scale.
●        Begin by creating a robust incident-management action plan - At the most basic level, teams need a clear escalation policy that explains what to do if something breaks: who to call, how to document what is happening, and how to set things in motion to solve the problem (a simple escalation-policy sketch follows this list). The fluid approach that many organisations take allows responses to be shaped to the specific nature of each incident, and it depends on significant cross-functional collaboration and training to solve problems more efficiently.
●        Define and cross-train roles in the incident-management command structure - While this structure will typically have a centralised point of leadership that directs the response process during incidents, it is important to train the entire team on the different roles and functions to maximise the benefits of a fluid model. Cross-training and knowledge transfer enable team members to assume multiple incident-response roles and functions, and avoid situations in which one person is the only source of knowledge for a particular system or technology. If that person goes on vacation or abruptly leaves the organisation, critical systems can turn into black boxes that nobody on the team has the skills or the knowledge to fix.
●        Monitor, monitor, monitor - Getting proper visibility across applications and infrastructure will make or break any incident-response process (a basic alerting sketch also follows this list). New AIOps capabilities let on-call teams detect, diagnose, and resolve incidents faster by applying AI and machine learning (ML) to the data generated by software systems to predict likely problems, determine root causes, and drive automation to fix them.
●        Document and understand - As organisations develop incident-response procedures and establish monitoring and alerting practices, they should document everything in "runbooks": documentation that tells responders exactly what to do when a specific problem occurs. Reducing MTTR also depends on a strong incident follow-up procedure, in which the team investigates what happened, figures out how it happened, identifies the triggering event and likely causes, and strategises ways to prevent the problem from cropping up again.
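To make the escalation policy from the first point above concrete, here is a minimal sketch of how such a policy might be captured in code. The roles, contacts, and timeouts are assumptions for illustration only, not the configuration of any particular tool.

    from dataclasses import dataclass

    @dataclass
    class EscalationLevel:
        role: str                   # who to contact at this level
        contact: str                # how to reach them
        ack_timeout_minutes: int    # how long to wait for acknowledgement before escalating

    # Hypothetical three-level escalation policy.
    ESCALATION_POLICY = [
        EscalationLevel("On-call engineer", "oncall@example.com", 15),
        EscalationLevel("Team lead", "team-lead@example.com", 15),
        EscalationLevel("Incident commander", "ic@example.com", 30),
    ]

    def current_level(minutes_unacknowledged: int) -> EscalationLevel:
        """Return who should be paged given how long the incident has gone unacknowledged."""
        elapsed = 0
        for level in ESCALATION_POLICY:
            elapsed += level.ack_timeout_minutes
            if minutes_unacknowledged < elapsed:
                return level
        return ESCALATION_POLICY[-1]  # past every timeout: stay with the incident commander

    print(current_level(20).role)  # prints "Team lead"

Even a plain document or wiki page that captures the same information serves the purpose; what matters is that the chain of contacts and timeouts is explicit before an incident occurs.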
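Similarly, for the monitoring point above, the sketch below shows a deliberately simple threshold-based alert check over recent response-time samples. Real observability platforms and AIOps features are far more sophisticated; the window size and threshold here are illustrative assumptions.

    from statistics import mean

    WINDOW_SIZE = 5       # number of recent samples to average (assumed)
    THRESHOLD_MS = 500    # assumed service-level response-time threshold in milliseconds

    def should_alert(response_times_ms: list) -> bool:
        """Fire an alert when the rolling average of recent samples breaches the threshold."""
        window = response_times_ms[-WINDOW_SIZE:]
        return len(window) == WINDOW_SIZE and mean(window) > THRESHOLD_MS

    samples = [220, 240, 610, 680, 720, 750, 790]
    if should_alert(samples):
        print("Alert: average response time over the last 5 samples exceeds 500 ms")

In practice, alert policies like this feed directly into the escalation chain sketched above, which is what shortens the detection half of the MTTD and MTTR picture.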

While MTTR is important, it remains one of many metrics for measuring incident response. Resolving incidents systematically and efficiently requires putting tools in place that provide a continuous stream of real-time data, pairing them with carefully calibrated alert policies, and using them to support a robust incident-management process. Together, these practices drive continuous improvement in the organisation's efforts to reduce MTTR and deliver long-term, sustainable gains in application availability.