Unlocking the Value of the SRE Model

    Capitalizing on the Opportunity to Transform the Customer Experience

    Executive Summary

    For organizations pursuing digital transformation, site reliability engineering (SRE) models have taken on increasing strategic significance. When employing SRE models, teams take a software engineering approach to operations. SRE teams focus on developing service level objectives (SLOs) that are closely aligned with the business’s most important key performance indicators. Through these approaches, teams can establish the metrics, processes, and capabilities needed to improve service levels and business results.

    However, for all the promise of SRE, the organizations that have succeeded with it have tended to be digital natives with a wealth of engineering talent. For most mainstream enterprises, the success of SRE models is anything but assured. Further, the default tooling approaches that organizations currently take (either building capabilities in house or cobbling together disjointed point tools) introduce risk and are fundamentally counterproductive to SRE objectives. This paper examines the key tool requirements that are integral to supporting SRE models.

    Introduction to the Key Principles of SRE Models

    While the SRE model has been around for more than ten years, some enterprises are only now beginning to pursue this approach. Within these organizations, IT leaders are wrestling with ways to maximize agility, and the SRE model is emerging as a key enabler. While a lot has been written about SRE, teams are well served by focusing on the key principles below.

    The Problem

    Given their legacy of being developed and employed by so-called digital natives like Google or Facebook, SRE models either implicitly or explicitly assume that large teams of technology experts are available to apply engineering approaches to IT operations and application development. However, that’s not the reality for most of the mainstream enterprises that are now seeking to institute SRE models.

    As teams pursue SRE initiatives, the tools in place can offer significant support or pose a massive detriment. Many organizations looking to adopt SRE models are employing either open-source tools developed in house or loosely connected toolchains. Quite often, these decisions are made at the departmental level. This results in tool sprawl and, due to the ensuing heterogeneity, makes it harder for staff to find a single source of truth for problem solving. In addition, by introducing a multitude of tools, integrations, and administrative privileges that are difficult to monitor and manage, these approaches can pose significant security risks.

    While some of the largest technology companies have the internal resources to make do-it-yourself or point-tool approaches work, that doesn’t mean these approaches are optimal for most enterprises. On the contrary, they can demand significant effort that derails SRE adoption, leaving teams taking too long to adopt SRE and realizing too little value for their efforts.

    The following sections look in more detail at key requirements that tools must address in order to effectively support SRE models.

    1. Establish Ecosystem Observability that’s Aligned with Business Services

    It is vital that the golden signals and service level indicators (SLIs) being monitored ultimately track what matters most: the user experience. Having this visibility is essential if teams are to manage their error budgets intelligently. However, in today’s environments, identifying and tracking the right metrics is easier said than done.
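    As a rough illustration of what managing an error budget involves, the sketch below derives a budget from a service level objective and a count of failed requests. The class name, field names, and sample figures are hypothetical and chosen only for illustration; the paper does not prescribe a formula, so this assumes the common request-based definition in which the budget is the share of requests permitted to fail under the SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative only)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests observed during the SLO window
    failed_requests: int  # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the number of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # Share of the budget still unspent; negative means the SLO is breached.
        allowed = self.allowed_failures
        if allowed == 0:
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / allowed


if __name__ == "__main__":
    # Hypothetical month: one million requests, 250 of them failed.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget remaining: {budget.remaining_fraction:.0%}")
```

    A team might use the remaining fraction to decide whether to keep shipping changes or to prioritize reliability work, but the thresholds for that decision are a policy choice rather than something this sketch determines.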

    To effectively adopt SRE models, teams need to establish comprehensive coverage that delivers unified visibility of the entire enterprise ecosystem, whether teams are running legacy on-premises technologies, modern services and systems, or a mix of both. They need visibility that spans from mobile applications to networks and mainframes.

    With the increasing prevalence of approaches like CI/CD, DevOps, containers, and microservices, environments continue to grow more dynamic, ephemeral, interrelated, and complex. In this type of environment, it’s difficult to apply traditional monitoring, virtually impossible to keep it consistently current, and challenging to get the outputs needed to understand performance.

    Today, it’s no longer enough to monitor a monolithic computing stack or a discrete infrastructure element; it’s about making complex, modern ecosystems observable.

    2. Establish Unified, AI-Driven Intelligence that Spans the Software Development Lifecycle

    “Many people, on first introduction to the SRE concept, think it looks a lot like DevOps because it also focuses on silo breaking, automation, and efficiency. They’re not entirely mistaken.”

    Source: Gartner, “How To Apply Google’s Site Reliability Engineering Approach To Your Infrastructure”

    It is true that SRE and DevOps share many fundamental themes. In addition, both SRE and DevOps present a similar challenge: How do previously isolated teams begin to work together seamlessly? This requires a shift in workflows and cultures, and it presents an entirely new set of requirements for tools. If teams simply seek to connect disparate tools, silos will remain.

    Ultimately, for SRE models to succeed, DevSecOps teams must have a holistic view of the stack: the frontend, backend, libraries, storage, kernels, and physical machines. The solution is to expand upon the concept of a data lake and build a “digital river.” A digital river enables all teams across the software development lifecycle (SDLC) to gain role-specific views into a unified data model, so they can maximize the utility of data in solving problems and gaining insights.

    3. Establish Comprehensive, Intelligent Automation

    As referenced above, employing automation to reduce toil is a core tenet of SRE models. Through automation, teams can take a software engineering approach to preventing an incident, such as an outage, rather than reacting to an issue after the fact. However, that doesn’t mean teams should work on automation efforts in an ad hoc, one-off fashion.

    In many cases, teams have employed limited automation that is based on custom-developed scripts or APIs that are connected to domain-specific tools. These approaches create islands of automation, which present a number of challenges. First, with these approaches, teams can’t pragmatically automate complex workflows that span multiple technology platforms and domains. Second, these integrations often act on incomplete context. For example, an alert from a server monitoring tool can trigger server-related remediation, while the actual issue may stem from a network device.
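    To make the server-versus-network example concrete, the sketch below shows one possible way automation could consult a cross-domain dependency map before choosing a remediation target. The topology, component names, and function names are all hypothetical, and the rule used (treat an alerting component as a probable root cause only if nothing it depends on is also alerting) is a deliberately simple assumption, not a method described in this paper.

```python
from typing import Dict, List, Set

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout-service": ["app-server-1"],
    "app-server-1": ["core-switch-a"],
    "core-switch-a": [],
}

# Hypothetical mapping from component to the domain that owns its remediation.
DOMAIN: Dict[str, str] = {
    "checkout-service": "application",
    "app-server-1": "server",
    "core-switch-a": "network",
}


def upstream(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every direct and transitive dependency of a component."""
    seen: Set[str] = set()
    stack = list(depends_on.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def probable_root_causes(alerting: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Keep only alerting components with no alerting dependency beneath them."""
    return {c for c in alerting if not (upstream(c, depends_on) & alerting)}


if __name__ == "__main__":
    # Both the server and the switch are alerting at the same time.
    alerts = {"app-server-1", "core-switch-a"}
    for component in sorted(probable_root_causes(alerts, DEPENDS_ON)):
        print(f"route remediation for {component} to the {DOMAIN[component]} domain")
```

    A script bound only to the server monitoring tool would see the server alert in isolation; the point of the sketch is that a shared view of dependencies lets the workflow defer server remediation when a network dependency is also unhealthy.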

    To combat these challenges, teams need platforms that provide scalable, flexible, and easy-to-use automation that can be aligned with complex, dynamic enterprise IT environments and rapidly evolving business requirements.

    4. Establish Frictionless Security Throughout the Software Development Lifecycle

    Security is paramount today, not only in the success of SRE initiatives, but in the success of the business. Seamless yet secure interactions now represent one of the most important facets of the customer’s digital experience. To establish the level of trust required, organizations must address security across every facet of the IT infrastructure, from administrative privileges deep within the enterprise to potential vulnerabilities in public APIs, websites, and applications.

    As DevSecOps teams rush to accelerate their pipelines through SRE initiatives, security professionals consistently identify three major areas of risk:

    • How do you ensure that appropriate levels of security are built into the digital experience—without adversely affecting customer convenience, developer velocity, or operational overhead?
    • How do you protect DevOps toolchains and processes, which have elevated access privileges that can easily be exploited?
    • How are you governing access to the overall SRE architecture and data to protect user trust and ensure a “least privilege” posture?
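    One narrow, illustrative way to approach the least-privilege question is to compare the permissions each toolchain identity has been granted with the permissions it has actually exercised. The sketch below assumes both sets are already available as plain data; the identity names, permission names, and function name are hypothetical, and a real audit would draw on an organization’s own access management and audit log sources.

```python
from typing import Dict, Set


def excess_privileges(
    granted: Dict[str, Set[str]],
    used: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Map each identity to the permissions it holds but has not exercised."""
    report: Dict[str, Set[str]] = {}
    for identity, permissions in granted.items():
        unused = permissions - used.get(identity, set())
        if unused:
            report[identity] = unused
    return report


if __name__ == "__main__":
    # Hypothetical grants for two toolchain identities.
    granted = {
        "ci-runner": {"deploy", "read-secrets", "delete-cluster"},
        "dashboard": {"read-metrics"},
    }
    # Hypothetical permissions observed in use during the review period.
    used = {
        "ci-runner": {"deploy", "read-secrets"},
        "dashboard": {"read-metrics"},
    }
    for identity, unused in excess_privileges(granted, used).items():
        print(f"{identity}: candidates for revocation -> {sorted(unused)}")
```

    Unused permissions are only candidates for review, since some privileges are legitimately held for rare events such as disaster recovery.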

    At their best, SRE models can enhance security, improve compliance, and reduce the toil of repetitive security tasks by automating risk assessment and threat protection. But the reality for many organizations, especially those in which DevOps teams are isolated from their cybersecurity counterparts, is that SRE fails to adequately address security and itself creates risk via uncontrolled tool, process, and administrative privilege sprawl.

    For a team’s SRE initiative to truly succeed at the trifecta of optimizing the user experience, reducing IT overhead, and limiting cybersecurity risks, their toolchain and data model must incorporate security events and be capable of automating risk assessment and threat protection actions—without themselves introducing new threat vectors.




    For many enterprises, SRE models have the potential to deliver significant improvements in the customer experience. However, most teams won’t be able to realize this objective with siloed point tools and custom scripts. With the right platforms, by contrast, teams can gain the advanced, comprehensive capabilities they need to adopt SRE models pragmatically and securely. As a result, organizations can position themselves to capitalize on the advantages of SRE models, gaining the intelligence they need to realize their business objectives, scale their digital transformation, and transform the customer experience.