If you have never heard of Site Reliability Engineering (SRE), you probably don’t know about operations “toil.” However, according to a 2019 Gartner DevOps survey, “41% of respondents have already adopted certain elements of SRE, and an additional 42% plan to implement SRE practices by YE20.” [1] So you will no doubt become familiar with the concept of operations toil very soon, along with the challenges of adopting an SRE approach. It is just a matter of time.
What is Site Reliability Engineering?
Site Reliability Engineering is not new; it is based on practices that Google started to put in place even before DevOps became mainstream. When employing SRE models, teams take a software engineering approach to IT operations. SRE aims to handle a larger volume of changes faster while accepting the risk that change introduces. That explains why DevOps and SRE approaches work well together, and why SRE is sometimes considered an extension of DevOps (even though SRE is older than DevOps).
DevOps | SRE
Focus on continuous delivery | Focus on service management
Bridge organizational silos | Leverage tooling across teams
Accept failure as “normal” | Accept risk on service levels
Implement iterative changes | Implement “atomic” changes
Automate delivery toolchain | Automate standard operating procedures
Optimize time to market (TTM) | Optimize mean time to repair (MTTR)
The Main Challenges of Site Reliability Engineering
As teams pursue SRE initiatives, the tools in place can offer significant benefit or pose a massive challenge. The reality is that many organizations looking to adopt SRE models rely on loosely connected toolchains. After introducing a multitude of tools, and confronting the resulting heterogeneity, staff find it harder to manage the infrastructure efficiently and to solve problems quickly. As a consequence, SRE adopters face two notable challenges:
Reducing operations toil. In Google’s definition, toil is not just “work I don’t like to do.” It is the kind of manual, repetitive, and mundane work that provides little value to operations. Reducing toil is essential if you want to handle a faster pace and a greater volume of changes.
Reducing mean time to repair (MTTR). MTTR measures how long it takes operational teams to fix a problem, whether through a workaround, a rollback, or another action. Reducing MTTR has a significant impact on the overall customer experience. It is also critical for SRE teams to minimize MTTR: because they accept the risk of change, they need to recover fast to protect the service levels the business expects.
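To make the MTTR metric concrete, here is a minimal sketch in Python. It assumes MTTR is computed as the average time between detecting a problem and restoring service; the incident timestamps and the function name mean_time_to_repair are hypothetical and serve only as an illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, restored_at).
incidents = [
    (datetime(2020, 3, 1, 9, 0), datetime(2020, 3, 1, 9, 45)),     # 45 minutes
    (datetime(2020, 3, 4, 14, 10), datetime(2020, 3, 4, 16, 10)),  # 120 minutes
    (datetime(2020, 3, 9, 22, 30), datetime(2020, 3, 9, 22, 45)),  # 15 minutes
]


def mean_time_to_repair(records):
    """Return the average (restored - detected) duration across incidents."""
    if not records:
        raise ValueError("at least one incident is required")
    total = sum(
        (restored - detected for detected, restored in records),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(records)


print(mean_time_to_repair(incidents))  # prints 1:00:00
```

In this sample, one slow two-hour repair pulls the average up to a full hour, which is why shortening the longest recoveries tends to have the biggest effect on MTTR.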
Because of the need to handle more changes, faster, while protecting service levels, many organizations view SRE as a monumental undertaking. As a result, they either stall initiatives or try to implement too many changes too quickly, often failing to deliver enough value to justify sustaining the effort.
How Site Reliability Automation Can Help
Enhancing infrastructure monitoring with intelligent recommendations and automated remediation capabilities can help organizations create more resilient production environments, streamlining their SRE initiatives. By leveraging site reliability automation capabilities, teams can seamlessly integrate root cause alarms with remedial workflows. This contextual awareness enables SRE teams to easily automate standard operating procedures that can be reused across environments. While contextual automation contributes to reducing operations toil, it also enables teams to deal with a bigger volume of events.
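As a rough illustration of linking root cause alarms to remedial workflows, the sketch below maps an alarm's root cause to a reusable procedure. Everything in it is an assumption made for illustration: the Alarm type, the root cause names, the RUNBOOKS table, and the two workflow functions do not come from any specific product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Alarm:
    root_cause: str  # hypothetical label such as "disk_full"
    host: str


# Hypothetical standard operating procedures, written once and reused.
def clean_temp_files(alarm: Alarm) -> str:
    return f"cleaned temp files on {alarm.host}"


def restart_service(alarm: Alarm) -> str:
    return f"restarted service on {alarm.host}"


# Root cause -> remedial workflow. The same table can serve any environment.
RUNBOOKS: Dict[str, Callable[[Alarm], str]] = {
    "disk_full": clean_temp_files,
    "service_hung": restart_service,
}


def remediate(alarm: Alarm) -> Optional[str]:
    """Run the workflow registered for the alarm's root cause, if any."""
    workflow = RUNBOOKS.get(alarm.root_cause)
    if workflow is None:
        return None  # no automated procedure: leave it to a human operator
    return workflow(alarm)


print(remediate(Alarm("disk_full", "web-01")))
```

Keeping the mapping in one table is a deliberate choice in this sketch: adding a new automated procedure means adding one entry, and alarms without an entry still fall through to staff.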
In parallel, advanced site reliability automation offerings can provide a recommendation engine that leverages cross-domain insight to help staff choose the most effective course of action for issue remediation. Machine learning algorithms can be used to rank remediation workflows by how successful they have been in a given context. That continuous learning helps teams resolve more issues faster and reduces MTTR.
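The ranking idea can be sketched with something much simpler than machine learning. The example below is not a recommendation engine from any product; it swaps in a plain smoothed success rate per context, and the WorkflowRanker class, the context labels, and the workflow names are all hypothetical.

```python
from collections import defaultdict


class WorkflowRanker:
    """Rank remediation workflows by their past success rate in a context."""

    def __init__(self):
        # (context, workflow) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, context, workflow, succeeded):
        """Store the outcome of one remediation attempt."""
        stats = self._stats[(context, workflow)]
        stats[1] += 1
        if succeeded:
            stats[0] += 1

    def rank(self, context):
        """Return workflows tried in this context, best score first."""
        scored = []
        for (ctx, workflow), (successes, attempts) in self._stats.items():
            if ctx == context:
                # Laplace smoothing so a single lucky run does not dominate.
                score = (successes + 1) / (attempts + 2)
                scored.append((score, workflow))
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [workflow for _, workflow in scored]


ranker = WorkflowRanker()
for outcome in (True, True, True):
    ranker.record("disk_full", "clean_temp_files", outcome)
for outcome in (True, False):
    ranker.record("disk_full", "expand_volume", outcome)

print(ranker.rank("disk_full"))  # prints ['clean_temp_files', 'expand_volume']
```

Each recorded outcome updates the scores, so the ordering keeps adjusting as more remediations run, which is the continuous learning loop in its simplest form.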
Site reliability automation addresses two major challenges of SRE teams by efficiently reducing operations toil and improving MTTR. Ultimately, site reliability automation empowers DevOps initiatives by aligning infrastructure management with the pace of modern continuous delivery. In the very near future, SRE will become mainstream; automation will make decisions and ensure consistent operations. For all these reasons, now’s probably a good time to start reviewing your automation strategies.
Yann has several decades of experience in the software industry, from development to operations to marketing of enterprise solutions. He helps Broadcom deliver market-leading solutions with a focus on Automation, DevOps, and Big Data.
1. Gartner, “Analyst Report: DevOps Teams Must Use Site Reliability Engineering to Maximize Customer Value,” January 2020, ID: G00448020, Analyst(s): George Spafford, Manjunath Bhat, URL: https://www.gartner.com/en/documents/3979405