If you have never heard of Site Reliability Engineering (SRE), you probably don’t know about operations “toil.” However, according to a 2019 Gartner DevOps survey, “41% of respondents have already adopted certain elements of SRE, and an additional 42% plan to implement SRE practices by YE20.”1 So you will very likely become familiar with the concept of operations toil soon, along with the challenges of adopting an SRE approach. It is just a matter of time.
Site Reliability Engineering is not new; it is based on practices that Google started to put in place even before DevOps became mainstream. When employing SRE models, teams take a software engineering approach to IT operations. SRE is aimed at handling a bigger volume of changes faster, and accepting the risk induced by change. That explains why DevOps and SRE approaches work well together, and why SRE is sometimes considered an extension to DevOps (even if SRE is older than DevOps).
DevOps | SRE |
Focus on continuous delivery | Focus on service management |
Bridge organizational silos | Leverage tooling across teams |
Accept failure as “normal” | Accept risk on service levels |
Implement iterative changes | Implement “atomic” changes |
Automate delivery toolchain | Automate standard operating procedures |
Optimize time to market (TTM) | Optimize mean time to repair (MTTR) |
As teams pursue SRE initiatives, the tools in place can either offer significant benefit or pose a massive challenge. The reality is that many organizations looking to adopt SRE models rely on loosely connected toolchains. After introducing a multitude of tools, and confronting the resulting heterogeneity, staff find it harder to manage the infrastructure efficiently and to resolve problems quickly. As a consequence, SRE adopters face two notable challenges: reducing operations toil and improving mean time to repair (MTTR).
Because of the need to handle more changes, faster, while protecting service levels, many organizations view SRE as a monumental undertaking. As a result, they either stall initiatives or try to implement too many changes too quickly, often failing to deliver enough value to justify sustaining the effort.
Enhancing infrastructure monitoring with intelligent recommendations and automated remediation capabilities can help organizations create more resilient production environments, streamlining their SRE initiatives. By leveraging site reliability automation capabilities, teams can seamlessly integrate root cause alarms with remedial workflows. This contextual awareness enables SRE teams to easily automate standard operating procedures that can be reused across environments. While contextual automation contributes to reducing operations toil, it also enables teams to deal with a bigger volume of events.
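To make the idea of tying root cause alarms to remedial workflows concrete, here is a minimal Python sketch. It is not any product's API: the `Alarm` type, the root cause labels, the workflow functions, and the `RUNBOOKS` table are all hypothetical names chosen for illustration, and the workflows only return a description instead of touching real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Alarm:
    """A root cause alarm (hypothetical shape, for illustration only)."""
    root_cause: str  # e.g. "disk_full"
    host: str


def clear_temp_files(alarm: Alarm) -> str:
    # Placeholder for a real standard operating procedure.
    return f"cleared temp files on {alarm.host}"


def restart_service(alarm: Alarm) -> str:
    # Placeholder for a real standard operating procedure.
    return f"restarted service on {alarm.host}"


# One reusable table maps each known root cause to its workflow,
# so the same procedures can be applied across environments.
RUNBOOKS: Dict[str, Callable[[Alarm], str]] = {
    "disk_full": clear_temp_files,
    "service_down": restart_service,
}


def remediate(alarm: Alarm) -> str:
    """Run the workflow registered for the alarm's root cause, if any."""
    workflow = RUNBOOKS.get(alarm.root_cause)
    if workflow is None:
        # Unknown root causes are left to a human rather than guessed at.
        return f"no automated workflow for {alarm.root_cause}; escalate to on-call"
    return workflow(alarm)


if __name__ == "__main__":
    print(remediate(Alarm(root_cause="disk_full", host="web-01")))
```

Keeping the mapping in a single table is one way to make a procedure reusable: adding a new automated response means registering one more entry, while anything unrecognized still reaches a person.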
In parallel, advanced site reliability automation offerings can provide a recommendation engine that leverages cross-domain insight to help staff choose the most effective course of action for issue remediation. Machine learning algorithms can be used to rank remediation workflows by how successful they have been in a given context. This continuous learning helps teams resolve more issues faster and reduces MTTR.
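As a rough illustration of ranking workflows by past success in a given context, the Python sketch below keeps simple counts and scores each workflow with a smoothed success rate. This is a deliberately simple stand-in, not a machine learning model and not how any particular recommendation engine works; `WorkflowRanker`, the context labels, and the workflow names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class WorkflowRanker:
    """Ranks remediation workflows by smoothed success rate per context."""

    def __init__(self) -> None:
        # (context, workflow) -> [successes, attempts]
        self._stats: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])

    def record(self, context: str, workflow: str, succeeded: bool) -> None:
        """Store the outcome of one remediation attempt."""
        stats = self._stats[(context, workflow)]
        stats[1] += 1
        if succeeded:
            stats[0] += 1

    def rank(self, context: str) -> List[Tuple[str, float]]:
        """Return (workflow, score) pairs for a context, best first."""
        scored = []
        for (ctx, workflow), (successes, attempts) in self._stats.items():
            if ctx == context:
                # Laplace smoothing: a workflow tried once and succeeding
                # does not automatically outrank a well-proven one.
                scored.append((workflow, (successes + 1) / (attempts + 2)))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    ranker = WorkflowRanker()
    for _ in range(3):
        ranker.record("disk_full", "clear_temp_files", True)
    ranker.record("disk_full", "expand_volume", True)
    ranker.record("disk_full", "expand_volume", False)
    print(ranker.rank("disk_full"))
```

Every recorded outcome updates the ranking, which is the "continuous learning" loop in its simplest form; a production engine would presumably use richer context and a real model.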
Site reliability automation addresses two major challenges of SRE teams by efficiently reducing operations toil and improving MTTR. Ultimately, site reliability automation empowers DevOps initiatives by aligning infrastructure management with the pace of modern continuous delivery. In the very near future, SRE will become mainstream; automation will make decisions and ensure consistent operations. For all these reasons, now’s probably a good time to start reviewing your automation strategies.
Yann has several decades of experience in the software industry, from development to operations to marketing of enterprise solutions. He helps Broadcom deliver market-leading solutions with a focus on Automation, DevOps, and Big Data.
bizops.com is sponsored by Broadcom, a leading provider of solutions that empower teams to maximize the value of BizOps approaches.