Many Site Reliability Engineering (SRE) teams invest a lot of time establishing and tracking metrics, yet they lack visibility into the high-level SRE metrics that actually translate to success. This blog post looks at why many SRE metrics fall short and explains how to establish and track the SRE metrics that matter.
The Evolution of IT Operations and the Promise of SRE Metrics
Over the past 40 years, the IT operations discipline has matured dramatically. Within most enterprises, the role of IT teams has evolved from simply provisioning technology to playing an integral role in the business’ value chain.
By employing Site Reliability Engineering (SRE) approaches, teams seek to take this maturation further. Through SRE, teams take a software engineering approach to IT operations. A key component of SRE approaches is the process of taking service level indicators (SLIs) and using them to establish service level objectives (SLOs) that are closely aligned with the business’ most important objectives. Through well-designed and executed SRE initiatives, teams can establish the metrics, processes, and capabilities needed to improve service levels and business results.
SLOs help teams decide what engineering work to prioritize by setting a target level of reliability for the services delivered to customers. In principle, when service levels fall below the threshold an SLO defines, you can expect customers to start complaining. Once teams reach consensus on how SLOs should be created, they can adopt an error-budget-based approach, in which the gap between the SLO target and 100% is the amount of unreliability the team can afford to spend, and implement their SRE initiatives safely.
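To make the error budget idea concrete, here is a minimal sketch in Python. It assumes an availability SLO measured over a rolling window. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not values from any specific SRE practice.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the minutes of unavailability an availability SLO tolerates.

    slo_target is a fraction, e.g. 0.999 for "three nines" (hypothetical value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    # The error budget is whatever is left between the target and 100%.
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Return the unspent error budget in minutes (negative means the SLO is breached)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999, window_days=30)
    print(f"Budget for a 99.9% SLO over 30 days: {budget:.1f} minutes")
    print(f"Remaining after 30 minutes of downtime: {budget_remaining(0.999, 30, 30.0):.1f} minutes")
```

A negative remaining budget is one simple signal that reliability work should take priority over new feature releases.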
SLOs offer a framework for discussing system behavior with greater clarity, and determining appropriate remediation steps when failures occur. However, these metrics can also create confusion when teams are trying to evaluate the overall performance of their SRE initiatives.
The Elephant Parable: Why the Part Isn’t the Whole
We’ve all heard the ancient parable of the blind men and the elephant. The story recounts how a group of blind men encounter an elephant for the first time. Each feels a different, specific part of the body, one touching the tusk, one touching the leg, one touching the tail, and so on. Based on their experiences, they come away with widely varying perceptions and conclusions, and each is convinced the others have it wrong.
The parable is an apt metaphor for how teams perceive metrics. If you use individual SLOs to try to gauge the overall performance of your SRE practice, you’ll be focusing on a small subset of a bigger reality. When different teams look at different SLOs, they come away with widely divergent conclusions.
So what’s the answer? Should teams be resigned to treating SRE endeavors as a black box, inscrutable to others, and isolated from any business governance? Fortunately, there is a way to assess performance after all, and there are some proven strategies that can be applied to inform your approach.
The Toyota Story
Long before SRE concepts, or even Google itself, existed, Toyota created the Toyota Production System (TPS), an approach that shares many similarities with SRE. Both companies talk a lot about waste, even if Toyota calls it muda and Google calls it toil.
Both TPS and SRE aim to improve the way customer demands are met, and both can be related to lean manufacturing.
Within the discipline of lean manufacturing, teams employ the overall equipment effectiveness (OEE) metric, which is widely regarded as the gold standard for measuring manufacturing productivity. Measuring OEE gives you essential insights into how to improve your manufacturing process, and these insights apply to almost any production process, whether you’re making cars or software. OEE is a single metric for identifying waste and improving the productivity of manufacturing equipment. It is based on three factors, availability (A), performance (P), and quality (Q), each expressed as a ratio between 0 and 1, and is calculated as follows:
OEE = A × P × Q
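As a worked example of the formula above, the following Python sketch multiplies three ratios into one OEE score. The input values (99.9% availability, 95% performance, 90% quality) are hypothetical numbers chosen for illustration, not measurements from a real system.

```python
def oee(availability: float, performance: float, quality: float) -> float:
    """Compute OEE = A x P x Q, with each factor given as a ratio in [0, 1]."""
    for name, value in (
        ("availability", availability),
        ("performance", performance),
        ("quality", quality),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return availability * performance * quality


if __name__ == "__main__":
    # Hypothetical factor values for a single service.
    score = oee(availability=0.999, performance=0.95, quality=0.90)
    print(f"OEE: {score:.1%}")  # prints "OEE: 85.4%"
```

Even though every individual factor is 90% or better, the combined score is noticeably lower than any of them, which is the point of a composite metric.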
Applying OEE to Your SRE Metrics
Availability, performance, and quality are closely aligned with the core tenets of SRE models. Think, too, about the primary responsibilities of SREs, who are tasked with responding to outages and acting as consultants who continuously look for opportunities for improvement. Your SRE teams have probably already defined SLOs related to service availability, application performance, and release defects. These key performance indicators (KPIs) map naturally to A, P, and Q as defined in OEE calculations.
OEE represents a KPI for an entire process, which makes it well suited to tracking the performance of SRE teams and initiatives holistically. Because OEE is the product of its three factors (equivalently, the cube of their geometric mean), a shortfall in any single factor pulls the overall score down far more than it would pull down a simple average. That sensitivity is helpful given that SLO variations are typically expected in the range of just a few hundredths of a percent.
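The sensitivity described above can be illustrated by comparing the OEE product with a plain arithmetic average of the same factors. This is a rough sketch with made-up factor values. It also checks that the product equals the cube of the geometric mean of the three factors.

```python
import math
import statistics


def oee(factors):
    """OEE as the product of its factors (A, P, Q), each a ratio in [0, 1]."""
    return math.prod(factors)


# Hypothetical scenarios: all factors healthy vs. one factor degraded.
healthy = (0.9999, 0.9999, 0.9999)
degraded = (0.9999, 0.9999, 0.80)

if __name__ == "__main__":
    for label, factors in (("healthy", healthy), ("degraded", degraded)):
        average = statistics.fmean(factors)
        product = oee(factors)
        # The product is the cube of the geometric mean of three factors.
        geo_mean = statistics.geometric_mean(factors)
        print(
            f"{label:9s} average={average:.4f} "
            f"OEE={product:.4f} geometric_mean^3={geo_mean ** 3:.4f}"
        )
```

In the degraded scenario the arithmetic average stays above 93%, while the OEE drops to roughly 80%, so the weak factor cannot hide behind the strong ones.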
Of course, when a KPI like OEE shows such sensitivity, the results can initially be shocking, even demoralizing, for teams that are used to presenting 99.999% service quality indicators. But the perspective is different when you are aiming for performance management and improvement. Areas where waste (or toil) can be eliminated become more apparent because the OEE indicator gives an objective picture of the situation. An OEE value in the range of 60% to 80% is not unusual and simply means that an organization has room to improve its overall performance.
Another useful aspect is that OEE can also be applied at the level of resource bottlenecks. In these scenarios, a direct result of the theory of constraints applies: when a specific resource determines the performance of an entire process, improving other resources yields little benefit until that constraint is addressed. This simplifies the tasks of determining the critical path for optimizations and identifying wasted resources.
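One way to picture the bottleneck idea is to compute an effective capacity for each stage of a delivery process and pick the lowest. The sketch below is only an illustration under simplifying assumptions: the stages run in series, the stage names and numbers are invented, and the Stage class and find_bottleneck function are hypothetical helpers rather than part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """A hypothetical process stage with its OEE factors and nominal capacity."""
    name: str
    availability: float
    performance: float
    quality: float
    capacity_per_hour: float  # nominal units of work per hour

    @property
    def oee(self) -> float:
        return self.availability * self.performance * self.quality

    @property
    def effective_capacity(self) -> float:
        # Nominal capacity scaled down by the stage's OEE.
        return self.capacity_per_hour * self.oee


def find_bottleneck(stages):
    """Return the stage with the lowest effective capacity (the constraint)."""
    return min(stages, key=lambda stage: stage.effective_capacity)


if __name__ == "__main__":
    # Invented example values for a three-stage pipeline.
    stages = [
        Stage("build", 0.99, 0.90, 0.98, capacity_per_hour=20),
        Stage("test", 0.95, 0.70, 0.90, capacity_per_hour=15),
        Stage("deploy", 0.999, 0.95, 0.97, capacity_per_hour=12),
    ]
    for stage in stages:
        print(f"{stage.name:7s} OEE={stage.oee:.3f} effective={stage.effective_capacity:.2f}/h")
    print(f"Bottleneck: {find_bottleneck(stages).name}")
```

In this made-up pipeline the deploy stage has the lowest nominal capacity, yet the test stage is the actual constraint once OEE is taken into account, which is where optimization effort would pay off first.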
Establishing True Business Alignment Through BizOps
In virtually every conversation about setting performance indicators and dashboards, choosing the right KPIs, the ones that really help understand the underlying reality, is the proverbial “elephant in the room.” Stakeholders all know the elephant is there, but nobody wants to talk about it.
The issue is even more profound than simply getting IT and business on the same page. The real difficulty lies in getting everyone to think about how technology initiatives tie directly to the things that matter to the organization as a whole. For that, you need to effectively track, understand, and leverage the data that delivers meaningful insights for IT operations and the business. That’s why BizOps is becoming so important.
By providing heightened visibility, BizOps can help make SRE initiatives even more valuable and better ensure stakeholders truly understand that value. By employing BizOps approaches and technologies, teams can establish uniform performance dashboards that combine systems availability, application performance, and application release quality measurements.
When you want people to contribute their best, stay constructive, and continuously improve performance, you have to focus on the right KPIs. While SLOs are critically important, they don’t serve teams well in gauging success at the initiative level. To boost their efficacy, teams can employ OEE and BizOps to establish SRE metrics that truly track how initiatives are delivering business outcomes.
Yann has several decades of experience in the software industry, from development to operations to marketing of enterprise solutions. He helps Broadcom deliver market-leading solutions with a focus on Automation, DevOps, and Big Data.