For organizations pursuing digital transformation, site reliability engineering (SRE) models have taken on increasingly strategic significance. In an SRE model, teams take a software engineering approach to operations. SRE teams focus on developing service level objectives that are closely aligned with the business’s most important key performance indicators. Through these approaches, teams can establish the metrics, processes, and capabilities needed to improve service levels and business results.
However, for all the promise of SRE, the organizations that have succeeded with it have tended to be digital natives with a wealth of engineering talent. For most mainstream enterprises, the success of SRE models is anything but assured. Further, the default tooling approaches that organizations currently take (either building capabilities in house or cobbling together disjointed point tools) introduce risk and are fundamentally counterproductive to SRE objectives. This paper examines the key tool requirements that are integral to supporting SRE models.
While the SRE model has been around for more than ten years, some enterprises are only now beginning to pursue this approach. Within these organizations, IT leaders are wrestling with ways to maximize agility, and the SRE model is emerging as a key enabler. While a lot has been written about SRE, teams are well served by focusing on the key principles below.
SRE models ultimately place a singular focus on what really matters: the customer experience. This must be integral to the approaches, metrics, and tactics teams employ. In an SRE model, teams manage the customer experience by establishing the following metrics:
By employing SLIs and SLOs to manage environments, teams can realize a number of advantages, including communicating expectations across teams more clearly, establishing better role clarity, and identifying and prioritizing efforts more effectively. These metrics also help teams define and communicate the tasks needed to prevent SLA breaches.
In this effort, teams need to determine which specific monitoring data will be used to inform SLIs. This requires identifying and capturing the so-called “golden signals.” There are four common categories of golden signals: latency, traffic, errors, and saturation. Teams can also establish more business-outcome-based KPIs, such as order processing time, download completion time, and so on.
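As a minimal sketch of how golden signals might feed SLIs, the Python below derives an availability SLI from the errors signal and a latency SLI from the latency signal, then compares each against an SLO target. The RequestSample type, the function names, the 300 ms threshold, and the target values are illustrative assumptions, not part of any specific monitoring product.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestSample:
    """One observed request (hypothetical shape of the monitoring data)."""
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request ended in an error


def availability_sli(samples: Sequence[RequestSample]) -> float:
    """Fraction of requests served without error (the 'errors' signal)."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(1 for s in samples if s.ok) / len(samples)


def latency_sli(samples: Sequence[RequestSample], threshold_ms: float) -> float:
    """Fraction of requests served within threshold_ms (the 'latency' signal)."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(1 for s in samples if s.latency_ms <= threshold_ms) / len(samples)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above its target."""
    return sli >= slo_target


if __name__ == "__main__":
    # Nine fast, successful requests and one slow, failed request.
    window = [RequestSample(120.0, True)] * 9 + [RequestSample(900.0, False)]
    print("availability SLI:", availability_sli(window))
    print("latency SLI:", latency_sli(window, threshold_ms=300.0))
    print("meets 95% SLO:", meets_slo(availability_sli(window), 0.95))
```

Expressing both SLIs as a ratio of good events to total events keeps them directly comparable with a percentage-style SLO target.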
In many ways, the SRE model represents a pragmatic bridge between the “fail-never” mindset of traditional IT operations and the “fail-fast” perspective embodied by DevOps. The concept of “error budgets” is integral in the SRE model. An error budget establishes a clear SLO-based metric for how unreliable a service can be.
Through these error budgets, teams establish an acceptable level of risk. Effectively, this can be viewed as operating within the gap between 100% availability and a lower level that is still acceptable from a customer experience standpoint.
Through error budgets, teams can establish concrete, measurable ways to balance between innovation velocity and reliability. As long as SLOs are met, release frequency can increase. Through this approach, teams can move from risk aversion to risk management. For example, rather than trying to build 100% availability, they can focus on establishing graceful failure handling.
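The error budget arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed implementation: the function names, the 30-day window, and the simple "release only while budget remains" rule are all hypothetical choices made for the example.

```python
def error_budget_fraction(slo_target: float) -> float:
    """The error budget is the gap between 100% and the SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the budget permits over the window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget_fraction(slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_release(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Assumed policy: keep shipping only while some budget is left."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0.0


if __name__ == "__main__":
    slo = 0.999  # a 99.9% availability objective
    print("allowed downtime (min / 30 days):", round(allowed_downtime_minutes(slo), 1))
    print("budget remaining:", round(budget_remaining(slo, 1_000_000, 250), 2))
    print("can release:", can_release(slo, 1_000_000, 250))
```

For a 99.9% SLO over 30 days, the budget works out to 0.1% of 43,200 minutes, or about 43.2 minutes of allowable unavailability. A team that has spent only a quarter of that budget can keep increasing release frequency, while a team that has exhausted it would shift effort toward reliability.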
When implementing SRE models, teams should focus on harnessing automation to reduce toil: the high-volume, low-complexity tasks that hold teams back. Toward this end, teams can institute automation across several key areas:
Because they were developed and first employed by so-called digital natives like Google or Facebook, SRE models either implicitly or explicitly assume that large teams of technology experts are available to apply engineering approaches to IT operations and application development. However, that’s not the reality for most of the mainstream enterprises that are now seeking to institute SRE models.
As teams pursue SRE initiatives, the tools in place can offer significant support or pose a massive detriment. Many organizations looking to adopt SRE models are employing either in-house developed open-source tools or loosely connected toolchains. Quite often, these decisions are made at the departmental level. This results in tool sprawl and, due to the ensuing heterogeneity, makes it harder for staff to find a single source of truth for problem solving. In addition, by introducing a multitude of tools, integrations, and administrative privileges that are difficult to monitor and manage, these approaches can pose significant security risks.
While some of the largest technology companies have the internal resources to make do-it-yourself or point-tool approaches work, that doesn’t mean it’s the optimal approach for most enterprises. On the contrary, these approaches can demand significant effort that derails SRE adoption, leaving teams that take too long to adopt SRE and realize too little value for their efforts.
The following sections look in more detail at key requirements that tools must address in order to effectively support SRE models.
It is vital that the golden signals and SLIs being monitored ultimately track what matters most: the user experience. Having this visibility is essential if teams are to manage their error budgets intelligently. However, in today’s environments, determining how to identify and track the right metrics is more easily said than done.
To effectively adopt SRE models, teams need to establish comprehensive coverage that delivers unified visibility of the entire enterprise ecosystem, whether teams are running legacy on-premises technologies, modern services and systems, or a mix of both. They need visibility that spans from mobile applications to networks and mainframes.
With the increasing prevalence of approaches like CI/CD, DevOps, containers, and microservices, environments continue to grow more dynamic, ephemeral, interrelated, and complex. In this type of environment, it’s difficult to apply traditional monitoring, virtually impossible to keep it consistently current, and challenging to get the outputs needed to understand performance.
Today, it’s no longer enough to monitor a monolithic computing stack or a discrete infrastructure element; it’s about making complex, modern ecosystems observable.
“Many people, on first introduction to the SRE concept, think it looks a lot like DevOps because it also focuses on silo breaking, automation, and efficiency. They’re not entirely mistaken.”
Source: Gartner, “How To Apply Google’s Site Reliability Engineering Approach To Your Infrastructure”
It is true that SRE and DevOps share many fundamental themes. In addition, both SRE and DevOps present a similar challenge: How do previously isolated teams begin to work together seamlessly? This requires a shift in workflows and cultures, and it presents an entirely new set of requirements for tools. If teams simply seek to connect disparate tools, silos will remain.
Ultimately, for SRE models to succeed, DevSecOps teams must have a holistic view of the stack—the frontend, backend, libraries, storage, kernels, and physical machine. The solution is to expand upon the concept of a data lake, and build a “digital river.” A digital river enables all teams across the software development lifecycle (SDLC) to gain role-specific views into a unified data model, so they can maximize the utility of data in solving problems and gaining insights.
As referenced above, employing automation to reduce toil is a core tenet of SRE models. Through automation, teams can take a software engineering approach to preventing an incident, such as an outage, rather than reacting to an issue after the fact. However, that doesn’t mean teams should work on automation efforts in an ad hoc, one-off fashion.
In many cases, teams have employed limited automation that is based on custom-developed scripts or APIs that are connected to domain-specific tools. These approaches create islands of automation, which present a number of challenges. First, with these approaches, teams can’t pragmatically automate complex workflows that span multiple technology platforms and domains. Second, these integrations don’t work well in most cases. For example, an alert from a server monitoring tool can trigger server-related remediation, while the actual issue may stem from a network device.
To combat these challenges, teams need platforms that provide scalable, flexible, and easy-to-use automation that can be aligned with complex, dynamic enterprise IT environments and rapidly evolving business requirements.
Security is paramount today, not only in the success of SRE initiatives, but in the success of the business. Seamless yet secure interactions now represent one of the most important facets of the customer’s digital experience. To establish the level of trust required, organizations must address security across every facet of the IT infrastructure, from administrative privileges deep within the enterprise to potential vulnerabilities in public APIs, websites, and applications.
As DevSecOps teams rush to accelerate their pipelines through SRE initiatives, security professionals consistently identify three major areas of risk:
At their best, SRE models can enhance security, improve compliance, and reduce the toil of repetitive security tasks by automating risk assessment and threat protection. But the reality for many organizations, especially those in which DevOps teams are isolated from their cybersecurity counterparts, is that SRE fails to adequately address security and itself creates risk via uncontrolled tool, process, and administrative privilege sprawl.
For a team’s SRE initiative to truly succeed at the trifecta of optimizing the user experience, reducing IT overhead, and limiting cybersecurity risks, their toolchain and data model must incorporate security events and be capable of automating risk assessment and threat protection actions—without themselves introducing new threat vectors.
For many enterprises, SRE models have the potential to deliver significant improvements in the customer experience. However, most teams won’t be able to realize this objective with siloed point tools and custom scripts. On the other hand, with the right platforms, teams can gain the advanced, comprehensive capabilities they need to adopt SRE models pragmatically and securely. As a result, customers can best position themselves to capitalize on the advantages of SRE models, gaining the intelligence they need to realize their business objectives, scale their digital transformation, and transform the customer experience.