The Scaled Agile Framework represents a strong model for bringing agile into the enterprise, but it has traditionally lacked an operational focus. By combining Site Reliability Engineering (SRE) models with the Scaled Agile Framework, teams can more fully harness the benefits of both approaches. In this blog post, I outline how SRE teams are evolving and present a model for how the SRE function can be integrated into the Scaled Agile Framework.
Introduction to SRE and Scaled Agile Framework Concepts
Established more than a decade ago, Site Reliability Engineering (SRE) has been gaining significant popularity recently. Through SRE approaches, teams can improve system and operational reliability across a range of dimensions, including availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning, and more. SRE teams use techniques such as service level objectives (SLOs) and error budgets (EBs) to quantify the risk tolerance for systems and services, as well as to balance the needs of velocity and system stability and reliability. SRE can also provide a prescriptive approach to implementing DevOps.
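To make the idea of an error budget concrete, here is a minimal Python sketch. It assumes a request-based availability SLO measured over a single window; the class and field names (ErrorBudget, slo_target, and so on) are hypothetical and chosen only for illustration, and real SRE teams may define their budgets differently (for example, in minutes of downtime).

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for a request-based availability SLO over one window.

    Assumption: the SLO is expressed as the fraction of requests that
    must succeed, e.g. 0.999 for "99.9% of requests succeed".
    """
    slo_target: float      # fraction of requests that should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"Allowed failures: {budget.allowed_failures:.0f}")
print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

With a 99.9% target and one million requests, the budget works out to roughly 1,000 failed requests, so 250 failures leave about three quarters of the budget for further change. This is the quantity that lets a team trade velocity against stability in explicit terms.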
The Scaled Agile Framework (SAFe) has long been popular for teams seeking to implement agile at the enterprise level. SAFe defines a DevOps model, but the framework does not formally include the SRE role. One reason for this is that the SRE role has traditionally been more operations focused, while SAFe is more centered on system development and delivery.
However, given the fact that the SRE role is rooted in software engineering, I expect that this function will increasingly shift left. In order to proactively ensure system reliability, SRE approaches will become integral parts of development.
SREs Shift Left
To understand how and why the SRE function will progressively shift left, consider what has already happened with the testing function. In the past, testing was considered a distinct phase in the software lifecycle. Teams tested the software well after developers were done with development. The function of testers was to find defects for developers to fix. This obviously created many problems, key among them being that software quality could not be ensured. Instead, quality could only be checked by testers, often inadequately.
With the recent evolution of continuous testing, testing shifted left, and a new breed of testers, called software development engineers in test (SDETs), emerged. Essentially, these are software developers with testing skills. SDETs were embedded within development teams, performing the testing function in conjunction with developers to improve the quality of software. Ultimately, this helps ensure software is built better in the first place, not just checked for bugs afterwards. SDETs also brought valuable software engineering skills around “as-code” test automation. This allowed for better automated tests and test processes, helping to reduce testing effort and elapsed time.
Strikingly, the genesis of the SRE function is similar to that of SDETs. SREs are essentially “software engineers in operations” who seek to promote greater agility in operations functions. Like SDETs, SREs bring valuable software engineering skills to the operations domain. SREs take an “as-code” approach to many operations functions, helping to minimize toil and bridge the gap between development and operations.
However, if teams limit the SRE function to the operations stage of the software lifecycle, they miss out on the opportunity to realize the full potential of SRE approaches. Rather than ensuring software reliability, they’ll simply be managing reliability better.
Therefore, the SRE function needs to shift left to better ensure software reliability. Like quality, reliability cannot be “checked in” to software late in the lifecycle; it has to be “built in.” In fact, if we consider that reliability is a non-functional quality metric, it follows that, like SDETs, SREs have to be embedded with development and release teams. (For more on this synergy, see my blog post on SDETs and SREs.)
As SREs shift left to ensure built-in reliability, some of their key activities include ensuring architecture and configuration quality, establishing early monitoring (see figure below), and participating in deployment and release engineering activities.
Employing a shift-left approach means that SREs play an integral role in the entire software lifecycle, from planning to operations. Some of the key functions they play in the different stages of the lifecycle are shown in the figure below.
In most initial implementations, SRE teams have focused on the reliability of services that are consumed by customers or users. The key goal was to keep sites, which often include multiple services, up and running well within the parameters of agreed-upon SLAs and SLOs.
However, a service is a composite of many application components (sometimes called service components) that are hosted on computing infrastructure, such as virtual machines, servers, and so on. The reliability of services in turn depends on the reliability of these application components.
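A short worked example shows how strongly component reliability constrains service reliability. The Python sketch below rests on two simplifying assumptions that are mine, not part of any SRE standard: the service needs every component to be available, and components fail independently. Under those assumptions, the service's availability is the product of its components' availabilities. The function name composite_availability is hypothetical.

```python
from math import prod
from typing import Sequence


def composite_availability(component_availabilities: Sequence[float]) -> float:
    """Availability of a service that requires every component to be up.

    Assumptions (illustrative only): components fail independently and
    the service is unavailable whenever any one component is unavailable.
    """
    for availability in component_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError(f"availability must be in [0, 1], got {availability}")
    return prod(component_availabilities)


# Three components, each meeting a 99.9% availability target.
service = composite_availability([0.999, 0.999, 0.999])
print(f"Service availability: {service:.4%}")
```

Three components at 99.9% each yield a service at roughly 99.7%, which already falls short of a 99.9% service-level target. Redundancy and graceful degradation change the arithmetic, but the basic point holds: a service-level SLO has to be backed by tighter targets on the components beneath it.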
To ensure service reliability, therefore, SREs need to ensure application component reliability. This means that as SREs shift left, they need to be deployed at different levels in the organization:
Within development teams to ensure application components are built with reliability.
Within release teams to ensure releases don’t disrupt system reliability.
At the enterprise level to support some of the key functions referenced in the figure above.
Different types of SREs can be allocated to different functions in the application lifecycle:
Enterprise reliability engineers. These are SREs that are responsible for the reliability of enterprise business services. They are also responsible for setting up and running SRE centers of enablement, and for developing best practices and SRE governance for the enterprise. For example, these engineers will be responsible for ensuring reliability in the enterprise architecture.
System reliability engineers. These are SREs that are responsible for the reliability of application systems. They focus on such activities as coordinating the release and launch of application components in a release train, establishing system architecture governance, tracking system-wide SLOs and error budgets, and so on.
Application reliability engineers. These are SREs that are embedded in development teams that are responsible for business-critical applications. These engineers provide day-to-day support at the application or component level, managing such efforts as setting up and tracking application-specific SLOs and error budgets, setting up application-component-level DevOps pipelines, and more.
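One way application-level and system-level error budgets could work together is as a release gate for a release train. The Python sketch below is a hypothetical policy, not something defined by SAFe or by any SRE tool: it assumes each application component reports the fraction of its error budget that remains, and it holds the release if any component is below a threshold. The function names and the 10% default threshold are assumptions made for illustration.

```python
from typing import Dict, List


def blocked_components(
    remaining_budget: Dict[str, float],
    minimum_remaining: float = 0.1,  # hypothetical threshold: 10% of budget left
) -> List[str]:
    """Return the components whose remaining error budget is below the threshold.

    remaining_budget maps a component name to the fraction of its error
    budget still unspent (1.0 = untouched, 0.0 or less = exhausted).
    """
    return sorted(
        name
        for name, remaining in remaining_budget.items()
        if remaining < minimum_remaining
    )


def release_allowed(
    remaining_budget: Dict[str, float],
    minimum_remaining: float = 0.1,
) -> bool:
    """The release train proceeds only if no component is blocked."""
    return not blocked_components(remaining_budget, minimum_remaining)


# Example data (made up): budgets tracked per application component.
budgets = {"checkout": 0.40, "search": 0.05, "login": -0.20}
print("Release allowed:", release_allowed(budgets))
print("Blocked by:", blocked_components(budgets))
```

In this sketch, application reliability engineers would own the per-component numbers, while a system reliability engineer would own the gate and its threshold. A real policy would likely be more nuanced, for example by weighting components by business criticality.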
SREs Meet SAFe
SAFe principles describe agility at different levels in the organization, from development teams up to the enterprise. However, the framework is largely focused on development and delivery. SREs, by contrast, embody the principles of “agile operations.”
As a result, I see a great deal of synergy between SAFe organizational agility levels and the SRE functional areas described above. The following figure shows how I see the different types of SREs mapping into the SAFe model.
Application reliability engineers clearly map into (and would be embedded into) SAFe agile teams, where actual application development takes place. System reliability engineers map into the SAFe system team since they are responsible for release train activities associated with multiple applications and components. Finally, enterprise reliability engineers map into the SAFe enterprise solution delivery function, which is responsible for the architecture and serviceability of enterprise systems.
As I mentioned before, SAFe is somewhat light on the operations aspects of software systems. By embedding SREs in these teams, organizations can better incorporate optimized operational functions into their SAFe implementations.
As adoption of the SRE model becomes more prevalent, I expect to see SREs playing a greater role, both across the software lifecycle and at different levels of the service stack. An integration between SRE and SAFe models provides a complete framework for enterprise agility that incorporates both agile development and operations capabilities.
Shamim is a thought leader in DevOps, Continuous Delivery, Continuous Testing and Application Life-cycle Management (ALM). He has more than 15 years of experience in large-scale application design and development, software product development and R&D, application quality assurance and testing, organizational quality management, IT consulting, and practice management. Shamim is currently the CTO for the DevOps business unit at Broadcom, where he is responsible for innovating DevOps solutions using Broadcom's industry-leading technologies.