What to Consider When Determining SLOs and SLIs

    Virtually every industry now uses data to improve its services, and software development is no exception. To keep customers happy, DevOps organizations constantly seek out metrics and develop strategies to derive insights and measure application reliability.

    For organizations that use modern application development and operations practices, ensuring system availability and reliability is the job of site reliability engineers (SREs). Working with the business, SREs use metrics to establish the benchmarks by which application reliability is defined. This process culminates in the determination of service-level objectives (SLOs) and service-level indicators (SLIs).

    What are SLOs and SLIs?

    Without defining the terms SLO and SLI, it’s very difficult to establish them for a particular system or service. So before delving into the process that SREs must go through to determine them, let’s address exactly what is meant when we say SLO and SLI, as well as what an organization is trying to accomplish when setting them for a system.

    In DevOps, SLO refers to a service-level objective. These objectives are a means of determining whether or not an application is operating at a level that's acceptable to the business. Everyone in a development organization understands that when a system is unavailable or unreliable, it cannot perform its function for the customer with any level of consistency.

    While it would be nice if services were able to operate perfectly, 100% of the time, it is typically not possible. Therefore, organizations develop SLOs as a benchmark for determining a target level of system reliability. This can include measurable objectives for anything from simple availability (e.g., the system must be accessible 99.9% of the time) to system performance (e.g., 99.5% of requests must complete within 500ms). Together, these objectives combine to provide an accurate depiction of system quality for the organization.
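
    To make an availability target concrete, it can help to translate the percentage into the amount of downtime it tolerates. The following is a minimal sketch of that arithmetic; the function name and the 30-day measurement window are assumptions chosen for illustration, not part of any standard.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO tolerates.

    Both the function name and the default 30-day window are illustrative
    assumptions; use whatever window the business has agreed on.
    """
    window_minutes = window_days * 24 * 60
    unavailable_fraction = (100.0 - slo_percent) / 100.0
    return window_minutes * unavailable_fraction


# The 99.9% target is the availability example from the text.
print(f"{allowed_downtime_minutes(99.9):.1f} minutes of downtime allowed")
```

    Under these assumptions, a 99.9% availability objective leaves roughly 43 minutes of downtime per 30 days, which is a useful sanity check on whether a target is consistently achievable.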

    SLOs are hard to monitor without indicators that provide the backing metrics and associated insights for measuring these objectives. These measurements are known as SLIs or “service-level indicators.” By tracking and evaluating SLIs (error rate, request latency, etc.), an organization can effectively monitor service availability, quality, and performance to ensure that the system is living up to the level of reliability promised by the SLOs.

    What to Consider When Determining SLOs

    The process of developing SLOs for a particular system requires that SREs keep several important factors in mind. Consider the following:

    • When SREs determine SLOs for a product, they should make sure that the objective is consistently achievable. If it is a struggle to meet this objective, the SLO will quickly become the worst enemy of the DevOps team.
    • When formulating an SLO, both business stakeholders and SREs should agree to use the lowest acceptable level of reliability in their definition. For instance, if the slowest acceptable response time for requests is 500ms, then this figure should be used as the benchmark. Any performance better than this level should be considered a bonus. In other words, it’s good practice for SREs to be mindful of overpromising and to avoid creating nightmare scenarios in which they're constantly struggling to meet unreasonable and unnecessary objectives.
    • Determining SLOs is an iterative process. Just as the system for which they are defined will evolve, so will the SLOs themselves. As systems mature, they generally become more reliable. Over time, functionality is refined, resulting in fewer errors and improved performance, thus raising the overall level of quality and reliability for the application.
    • As a system matures, the DevOps team must commit to maintaining its new level of quality and reliability. In short, continuous improvement is the goal, and those improvements will be reflected in changing expectations of reliability.

    Contextualizing Metrics and the Process for Defining SLIs

    The purpose of service-level indicators (SLIs) is to measure the level of reliability promised for the system. Therefore, when developing SLIs, SREs need to identify metrics that can be easily contextualized into statistics showing whether or not their service-level objectives (SLOs) are being met. In many cases, the same service metrics that have been around for years form the basis of the most useful SLIs. Consider the case of an SRE looking to determine SLIs for a web application. Service metrics related to traffic data, such as response codes and response times, can be used to produce the measurements needed to define and monitor SLOs.

    Keep in mind that simply tracking service metrics is not enough. The raw data needs context before it yields insights that help define and monitor SLOs.
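
    As a rough illustration of adding context, a list of raw response codes becomes an indicator once we decide which responses count as failures and express the result as a proportion of all requests. The sketch below assumes that only server errors (5xx) count against availability; that classification, the function name, and the sample data are all assumptions made for the example.

```python
from typing import Sequence


def availability_sli(status_codes: Sequence[int]) -> float:
    """Percentage of requests that did not end in a server error.

    Assumption: only 5xx responses count as failures. Client errors (4xx)
    are treated as successfully served, which may not suit every service.
    """
    if not status_codes:
        raise ValueError("no requests observed in this window")
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100 * (len(status_codes) - failures) / len(status_codes)


# Hypothetical measurement window of 10,000 requests.
codes = [200] * 9_990 + [404] * 5 + [500] * 3 + [503] * 2
print(f"availability SLI: {availability_sli(codes):.2f}%")
```

    The same raw data could yield a different indicator under a different definition of failure, which is why that definition should be agreed on alongside the SLO itself.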

    Imagine for a moment that an organization is looking to produce and measure an SLO related to web application latency. Collecting basic service metrics such as response time is a great place to begin, but randomly looking at individual response times or simply taking the average of all response times does not tell the whole story. As we know, some response times (hopefully many) will be fast, while intermittent issues with the application may cause others to be extremely slow. A handful of excessively slow outliers can skew the average well away from what most users experience.
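
    A small, made-up sample shows how quickly this skew appears. In the sketch below, the response times are hypothetical values chosen for illustration: 98 of 100 requests finish in 100ms, yet two very slow requests pull the average to 698ms.

```python
import statistics

# Hypothetical sample: 98 fast responses and 2 extremely slow ones (milliseconds).
response_times_ms = [100] * 98 + [30_000] * 2

mean_ms = statistics.mean(response_times_ms)      # pulled upward by the two outliers
median_ms = statistics.median(response_times_ms)  # reflects the typical request

print(f"mean:   {mean_ms} ms")
print(f"median: {median_ms} ms")
```

    Judged by the mean alone, this service would appear to miss a 500ms target even though 98% of its requests were five times faster than that.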

    Instead, when building an SLI for this purpose, it’s more useful to contextualize the response-time data in a manner that shows what the majority of end-users are experiencing. For example, consider the scenario where the response time for 99,500 requests fell beneath a 500ms threshold while another 500 requests resulted in a response time of over 500ms. This is a measurement that is more indicative of how reliable the system actually is. In 99.5% of cases, the system responded successfully in under 500ms. If this is an acceptable level of quality for the system, then an SLO may be defined as, “99.5% of requests must complete within 500ms,” and this procedure for measuring response-time data can be defined as a corresponding SLI.
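
    The measurement described above can be sketched in a few lines. The function names and the individual latency values are hypothetical; only the request counts, the 500ms threshold, and the 99.5% target come from the scenario in the text. The sketch also assumes that a request taking exactly 500ms counts as within the threshold.

```python
from typing import Sequence


def latency_sli(response_times_ms: Sequence[float], threshold_ms: float = 500) -> float:
    """Percentage of requests that completed within threshold_ms."""
    if not response_times_ms:
        raise ValueError("no requests observed in this window")
    within = sum(1 for t in response_times_ms if t <= threshold_ms)
    return 100 * within / len(response_times_ms)


def meets_slo(sli_percent: float, target_percent: float = 99.5) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= target_percent


# Recreate the scenario: 99,500 requests under the threshold and 500 over it.
# The specific latencies (120 ms and 900 ms) are placeholders.
observed = [120] * 99_500 + [900] * 500
sli = latency_sli(observed)
print(f"SLI: {sli}%  meets SLO: {meets_slo(sli)}")
```

    Keeping the threshold and the target as parameters makes it straightforward to revisit them as the system matures and expectations change.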

    While this is just a single example, its principles hold true in most cases when establishing SLIs. There will always be outlying data, but the most durable metrics do not let these outliers define the system itself as unreliable. Instead, it’s critical that SREs define SLI formulas that accurately depict the level of system reliability for the vast majority of end users.

    Managing Competing Priorities Through SRE Analysis

    The purpose of SLIs and SLOs is to ensure system reliability. Whether or not SLOs are being met tells an organization where its most pressing priorities lie in the days ahead. When SLIs indicate a consistent failure to meet the corresponding SLOs, the DevOps team will want to focus on improving availability and performance in order to begin meeting its reliability objectives.

    By contrast, if SLIs indicate that an application is meeting the objectives outlined by its SLOs, then the organization can focus on innovation. This can include adding value for the customer by creating new functionality, modifying existing functionality, or addressing any of the many other priorities the leadership team surely has in mind.