Site Reliability Engineering (SRE) has been gaining popularity recently as a way to help improve system reliability and establish a prescriptive approach to implementing DevOps.
SRE teams use techniques such as service level objectives (SLOs) and error budgets to quantify the risk tolerance for systems and services, and to balance velocity against system stability and reliability.
Similarly, testers, and in particular software development engineers in test (SDETs, also called SETs), play a key role in balancing velocity with overall system quality. We believe these two approaches can be combined through closer collaboration among developers, testers, SDETs, and SREs, with each group adopting the others’ practices.
In this blog post, we explore the synergies between SREs and SDETs and how both roles can work with development teams to balance velocity and quality.
Error Budgets Overview
An error budget is the amount of unreliability that an SLO permits: for example, a 99.9% availability SLO leaves an error budget of 0.1%. Error budgets allow SREs to balance the needs of velocity and stability. As long as there is sufficient room in the error budget, teams prioritize new feature development and frequent deployments. However, as the error budget is exhausted, teams slow down (or stop) new feature development and deployment, and focus more on system hardening and testing.
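To make the mechanics concrete, here is a minimal sketch of an error budget calculation for a request-based SLO. The class name, the function name, the policy labels ("ship", "slow", "freeze"), and the 25% slow-down threshold are all assumptions made for illustration; they are not taken from any particular SRE tool, and real teams set their own thresholds.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO allows to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0 or below means it is exhausted.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_policy(budget: ErrorBudget, slow_below: float = 0.25) -> str:
    """Map the remaining budget to a velocity decision (assumed thresholds)."""
    remaining = budget.remaining_fraction
    if remaining <= 0:
        return "freeze"   # budget exhausted: focus on hardening and testing
    if remaining < slow_below:
        return "slow"     # budget nearly gone: reduce deployment frequency
    return "ship"         # enough budget: prioritize features and deployments


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"remaining: {budget.remaining_fraction:.0%}, policy: {release_policy(budget)}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 400 failures leave about 60% of the budget and the sketch keeps shipping.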
Synergies Between SREs and Testers
The error budget approach used by SREs is analogous to how testers use overall application quality and release risk to modulate velocity. Since reliability is only one of many overall quality metrics, the error budget approach is in fact a more specific approach to modulating velocity based on quality. Therefore, this approach can (and should) be synergized with overall QA modulation.
Before we discuss the unified approach, let’s first discuss the synergies between the roles of SREs and testers—specifically SDETs. Both have their roots in software development, and therefore share much in common.
Some of the key points of commonality include the following:
Software engineering approach: Both SREs and SDETs are software engineers. They bring core software engineering approaches to their domain and work closely with the application development team. These include practices such as “everything-as-code” (such as configurations or tests), version control of assets, and white-box focus.
Shift left: Both SREs and SDETs help to shift left their respective disciplines early in the lifecycle to ensure reliability/quality is built in. This includes architecture quality, configuration quality, and early monitoring (see figure below).
Overlap on shift-right activities: Increasingly, testers are also practicing shift-right approaches, which overlap with SRE activities. These include chaos engineering, canary testing, A/B testing, and extraction of insights from operational data.
Toil reduction through automation: Both SREs and SDETs actively pursue automation efforts in their domain to remove waste and reduce cycle time and errors.
Technical debt ownership: An error budget can be viewed as a form of technical debt for software. As with technical debt, its consumption is used to trigger decisions on hardening and velocity.
Velocity with safety: Finally, both roles help to balance the needs of velocity with reliability and quality. As we have discussed, reliability is a subset of overall system quality.
Synergizing Error Budgets and Release Quality to Modulate Velocity
Testers and QA professionals use a variety of techniques and measures to assess release quality and risk. These include things like code quality, batch size, functional and non-functional requirements coverage (through tests), defect detection and removal, user and customer experience, compliance, supportability, technical debt, and so on.
Various approaches exist to quantify release risk based on the measures from the above techniques. Organizations make business decisions to proceed with releases despite risk. However, deficiencies in each of these measures add up to the quality debt of an application. As quality debt increases, the risk of releasing software progressively increases. At some point, the risk threshold is crossed, and releases are slowed or halted to allow time for remediation or hardening.
Clearly, this is analogous to how SREs use error budgets. Therefore, it makes sense to use them in a combined, synergistic manner.
In this unified approach, the velocity is modulated by a combination of error budget and release risk. This provides a more holistic view of balancing velocity and quality and essentially subsumes reliability as a measure of overall quality.
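As the summary below notes, there is no established model for this yet, so the following is only a sketch of how such a combined gate might look. It assumes each quality measure has already been normalized to a deficiency score between 0 (no concern) and 1 (worst case), that release risk is a weighted average of those scores, and that the remaining error budget is expressed as a fraction. The function names, measure names, weights, and thresholds are all hypothetical.

```python
from __future__ import annotations


def quality_debt(deficiencies: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-measure deficiency scores, each assumed in [0, 1]."""
    total_weight = sum(weights[name] for name in deficiencies)
    weighted_sum = sum(deficiencies[name] * weights[name] for name in deficiencies)
    return weighted_sum / total_weight


def velocity_decision(
    error_budget_remaining: float,   # fraction of the SRE error budget left
    release_risk: float,             # quality debt score in [0, 1]
    risk_threshold: float = 0.5,     # assumed: halt at or above this risk
    low_budget: float = 0.25,        # assumed: slow down below this budget
) -> str:
    """Modulate velocity using both reliability and overall release quality."""
    # Either signal alone can stop releases.
    if error_budget_remaining <= 0 or release_risk >= risk_threshold:
        return "halt"
    # Either signal approaching its limit slows releases.
    if error_budget_remaining < low_budget or release_risk >= 0.8 * risk_threshold:
        return "slow"
    return "proceed"


if __name__ == "__main__":
    # Hypothetical deficiency scores for a release candidate.
    deficiencies = {"code_quality": 0.2, "test_coverage_gap": 0.4, "open_defects": 0.1}
    weights = {"code_quality": 1.0, "test_coverage_gap": 2.0, "open_defects": 1.0}
    risk = quality_debt(deficiencies, weights)
    print(f"release risk: {risk:.3f}, decision: {velocity_decision(0.6, risk)}")
```

Treating the two signals with an "either one can halt" rule is a deliberately conservative design choice; a team could just as reasonably fold the error budget into the weighted score as one more quality measure.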
Summary and Looking Forward
Hopefully, this article has provided readers with some insight into the synergies between SREs and testers and how an integrated approach can be used for modulating velocity. We don’t yet have well-defined models for “release risk budgets”; however, we can define them along the lines of SRE error budgets.
In addition, if you’re interested in learning more on the topic of SREs and development teams, see my post on mapping SRE functions to the Scaled Agile Framework.
Shamim is a thought leader in DevOps, Continuous Delivery, Continuous Testing, and Application Life-cycle Management (ALM). He has more than 15 years of experience in large-scale application design and development, software product development and R&D, application quality assurance and testing, organizational quality management, IT consulting, and practice management. Shamim is currently the CTO for the DevOps business unit at Broadcom, where he is responsible for innovating DevOps solutions using Broadcom's industry-leading technologies.