Organisations face rising costs of goods and services, driven by a potent combination of COVID-19 and the Great Resignation. This has squeezed the supply of tech talent and put pressure on employees working in lean teams.
Staffing shortages have hit site reliability engineers (SREs) particularly hard, since they are under constant pressure to ensure that digital assets perform at optimum levels 24/7. SREs are tasked with providing the best possible customer experience with limited resources, while business leaders demand responsive, error-free services as they compete for market share.
Unfortunately, manually tracking performance and incident data is difficult, time-consuming and, in turn, frustrating for both IT and the business. But by adopting automation through a programmatic approach, much of that manual intervention can become a thing of the past.
Under the SLM hood
SREs are key to understanding exactly how customers experience a product or service and tracking system performance and reliability through customers' eyes. Service level indicators (SLIs) and service level objectives (SLOs) are central to every SRE practice.
SRE teams will often set strict SLOs on customer-facing components within their applications that support the service level agreement (SLA) the business has made with customers. From here, the team can apply error budgets to understand how much tolerance they have to resolve issues while staying compliant with the SLOs and, therefore, the SLAs.
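The error budget arithmetic is straightforward. A minimal sketch in Python, where the 99.9% availability target and 30-day window are illustrative assumptions rather than figures from any specific SLA:

```python
# Illustrative error-budget arithmetic for an availability SLO.
# The target and window below are assumed values for the example.
slo_target = 0.999             # 99.9% availability SLO
window_minutes = 30 * 24 * 60  # rolling 30-day window

# The error budget is the fraction of the window allowed to be "bad".
error_budget_minutes = (1 - slo_target) * window_minutes
print(f"Allowed downtime per window: {error_budget_minutes:.1f} minutes")

# If 10 minutes of downtime have already occurred this window:
consumed = 10
remaining = error_budget_minutes - consumed
print(f"Budget remaining: {remaining:.1f} minutes "
      f"({remaining / error_budget_minutes:.0%})")
```

A 99.9% target over 30 days leaves roughly 43 minutes of tolerated downtime; the closer the remaining budget gets to zero, the less room the team has for risky changes.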
Service levels allow teams to express expectations through observability, creating an objective, data-driven view of service delivery across the entire organisation. At a glance, business leaders can use service levels to oversee compliance across multiple teams and business units, reflecting team and business performance as it relates to the customer experience.
To reduce the burden on engineers in manually tracking performance and incident data, programmatically tracked SLIs and SLOs are foundational to SRE practices.
Defining relevant indicators and objectives
SLIs need to be relevant to a delivered service and should be simple and easy to understand. When an SLI underperforms an SLO target over the measurement period, it signals a business impact such as excessive unavailability or a sub-optimal user experience.
SLIs often focus on user experience measures. Typical indicators include latency/response time, error rate/quality, availability and uptime. Indicators that are less relevant to service delivery include CPU/disk/memory consumption, cache hit rate and garbage collection time. These indicators do not directly correlate with user experience unless resource saturation is present.
The key to a useful SLI is to pick an indicator that is clearly and unambiguously related to service delivery, is simple to measure and most importantly, actionable.
Programmatic SLIs have three key characteristics: they're current, reflecting the state of a system in real-time; they're automated (they are measured and reported consistently by instrumentation, not by users); and lastly, they're useful, as they're selected based on what a system's user cares about.
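As a concrete sketch, a request-based availability SLI can be computed directly from instrumentation data rather than reported by users. The record shape and field names below ("status", "latency_ms") and the one-second tolerance are illustrative assumptions:

```python
# Minimal sketch of a programmatic, request-based availability SLI.
# Each record is assumed to come from instrumentation, not manual entry;
# the field names and sample values are illustrative.
requests = [
    {"status": 200, "latency_ms": 87},
    {"status": 200, "latency_ms": 340},
    {"status": 500, "latency_ms": 12},
    {"status": 200, "latency_ms": 1450},
]

# "Good" events: successful AND fast enough to satisfy the user.
LATENCY_THRESHOLD_MS = 1000  # assumed tolerance

good = sum(1 for r in requests
           if r["status"] < 500 and r["latency_ms"] <= LATENCY_THRESHOLD_MS)

sli = good / len(requests)  # fraction of good events
print(f"SLI: {sli:.2%}")    # 50.00% for the sample above
```

Because the SLI is defined as good events over total events, it is current (recomputed as new requests arrive), automated (derived from telemetry) and useful (it only counts requests the user would actually be satisfied by).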
With programmatic SLIs in place, engineering teams can easily automate tasks such as tracking the performance of service boundaries, following end-to-end user journeys and measuring whether reliability across teams falls within defined tolerances. They also reduce manual toil, because DevOps teams have a clear signal when something is happening that impacts users and, therefore, the business.
An important part of creating programmatic SLIs is identifying the capability of each system or service:
- A system is a collection of services and resources that exposes one or more capabilities to external customers (either end-users or other internal teams).
- A service is a runtime process (or a horizontally-scaled tier of processes) that makes up a subset of the system.
- A capability is a particular aspect of functionality exposed by a service to its users, phrased in plain-language terms.
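These three definitions can be captured as a small data model, which makes the system/service/capability inventory itself programmatic. A hedged sketch in Python, with all names and the example system invented for illustration:

```python
from dataclasses import dataclass, field

# A plain-language description of functionality exposed to users.
@dataclass
class Capability:
    name: str
    description: str

# A runtime process (or horizontally scaled tier of processes)
# that makes up a subset of the system.
@dataclass
class Service:
    name: str
    capabilities: list[Capability] = field(default_factory=list)

# A collection of services exposing capabilities to external customers.
@dataclass
class System:
    name: str
    services: list[Service] = field(default_factory=list)

# Hypothetical example: a checkout system with one payment service.
checkout = System(
    name="checkout",
    services=[Service(
        name="payment-api",
        capabilities=[Capability(
            "authorise-payment",
            "Authorise a card payment for an order",
        )],
    )],
)
```

Modelling the inventory this way means SLIs can later be attached per capability, rather than per host or per process.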
SLOs express the target objective that the SLIs must meet over a defined period of time.
SLOs should be easy for even non-technical stakeholders to understand. For example, for each SLI, create a baseline SLO using a statistic such as a percentile (e.g. the 99th percentile) that reflects the proportion of the population that must be satisfied by the SLI over a rolling one-week window.
In non-technical terms, this could be described as satisfying 99% of all user requests within the conditions defined by the SLI over the period. Importantly, when using statistics to characterise distributions, averages should be avoided: they fail to capture the extremes of skewed distributions, which are common, and so can hide poor service delivery experienced by a significant number of users.
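The point about averages is easy to demonstrate. The latency figures below are fabricated for illustration: most requests are fast, but a small tail is very slow.

```python
# Demonstrates why averages hide tail behaviour in skewed distributions.
# 100 fabricated request latencies (ms): 95 fast, 5 very slow.
latencies = [50] * 95 + [5000] * 5

mean = sum(latencies) / len(latencies)

# 99th percentile via the nearest-rank method.
ranked = sorted(latencies)
p99 = ranked[int(0.99 * len(ranked)) - 1]

print(f"mean = {mean:.1f} ms")  # 297.5 ms: looks acceptable
print(f"p99  = {p99} ms")       # 5000 ms: reveals the slow tail
```

The mean suggests a service comfortably under 300 ms, yet one in twenty users waits five seconds; a percentile-based SLO surfaces that tail, while an average-based one never would.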
SLOs reflect the entire population consuming a service over a period of time. If there are different cohorts with different SLAs attached to service delivery, separate SLOs should be defined that track and measure the cohorts independently.
SLOs are designed to guide behaviour within DevOps teams and ensure the customer remains front and centre in any activity that could risk non-compliance with SLAs. To achieve this in practice, teams' daily activities must be guided by the current state of their SLOs. When an SLO is trending in the wrong direction, teams should switch to activities that bring the SLO back in line; once it recovers, regular activities can resume.
At cloud-based payments player Zico, a Service Level Management feature that automates these tasks has been key to enabling its engineers to visualise and report on the company's service level indicators and objectives, as well as calculate error budgets. It breaks the process of defining an SLI and setting its targets into an easily understandable and repeatable process for the engineering teams.
Establishing SLIs and SLOs will result in a simpler and more responsive observability practice, tighter alignment with the business, and a faster path to improvement. To lighten the load on SREs, providing the right tools that can automatically configure and deliver meaningful SLIs and SLOs will be key.