What does the reliability of electrical grids have to do with application reliability? Quite a bit, actually. The recent disaster created by the failure of the Texas power grid reveals some significant similarities between these domains. This blog outlines some of the parallels I have noticed between failures in these two areas and offers lessons that can be applied to each discipline.
This experience was personal for me. Our home in the Dallas area was without power for the better part of four days. We had four busted pipes (accompanied by flooding and damage), and were left without any running water for over a week.
As I reflected on this disaster, the parallels between this situation and massive software failures came to mind. While I am no expert in power grid management, the similarities between the two quickly became apparent.
The Texas Power Grid Failure and Software Failures Share Many Similarities
We keep learning more about why the grid failure occurred, so our knowledge of this situation is still evolving. However, based on what we have learned so far, several takeaways have emerged that are highly relevant to both power grids and application reliability. The following sections include a few of my observations.
Advanced Planning is Key to Reliability
As the adage goes, “Nobody plans to fail; we simply fail to plan.” The leaders at the Electric Reliability Council of Texas (ERCOT), the organization that oversees the Texas power grid, had more than sufficient advance warning about winter-related vulnerabilities but chose not to act on it. As a result, there was no spare capacity to tap into. Not only did the organization’s leadership fail to act on these warnings, but they also failed to provide sufficient advance notice to their power subscribers.
Similarly, software teams need to do careful planning around how to ensure application reliability. This needs to occur from multiple perspectives, such as understanding software usage and load conditions, system architecture and security, capacity planning, and scalability analysis. To address these requirements, many organizations rely on site reliability engineers (SREs), who spend time planning out key metrics, such as SLOs, SLIs, and error budgets, and developing plans for mitigation in the event of an outage. In addition, the use of cloud computing technologies allows teams to establish autoscaling of many types of systems, helping to ensure availability and scalability.
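To make the error budget idea concrete, here is a minimal illustrative sketch (the numbers and function name are hypothetical, not from any particular SRE toolchain) showing how an availability SLO translates into an explicit downtime budget that a team can plan against:

```python
# Hypothetical sketch: deriving an error budget from an SLO target.
# A 99.9% availability SLO over a 30-day window leaves a small,
# explicit allowance for downtime.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

budget = error_budget_minutes(0.999)  # a 99.9% availability SLO
print(f"Allowed downtime: {budget:.1f} minutes per 30 days")
```

The point of the calculation is that reliability planning becomes quantitative: once the budget is spent, the team knows to prioritize stability work over new features.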
Don’t Lose Sight of Customer Experience
Probably the biggest lapse on the part of the leadership of ERCOT and the power grid providers was their insensitivity to customer experience. They did not provide sufficient advance warnings about the blackouts, they provided misleading information, they refused to take any accountability for these failures, and they didn’t show any empathy for the hardship these issues were causing their customers.
I personally signed up for outage notifications from my power provider and found that 99% of their service notifications (such as communicating when power would be down and when it would be restored) were inaccurate. Rather than helping us plan, these communications only added to our frustration and helplessness. Not only that, many power subscribers have started receiving massive bills, despite not having power for nearly a week. That’s like rubbing salt in the wound.
For any other business, this type of insensitivity to customer experience would probably be a death knell. ERCOT, being an autocratic state entity, will likely survive. For all other enterprises, however, customer experience should be at the heart of all software reliability initiatives. Customers deserve accountability, timely and accurate notifications about downtime and issues, support with mitigation activities, and at the very least, empathy. As they say, a little empathy goes a long way toward satisfying customers.
Chaos Engineering and Testing Are a Necessity
The failure of the Texas power grid was a classic example of chaos. Compared to software, it is clearly much more difficult to plan for and deal with such chaos in the domain of physical power systems and equipment.
However, for software systems, chaos engineering is an essential part of ensuring application reliability, and this event drives home the point. Companies like Netflix, which are committed to the customer experience, excel at chaos engineering practices, so they can better understand the impact of failures on their ability to deliver services and establish adequate mitigation plans. Technologies such as service virtualization make it easier for us to simulate failures or abnormal conditions in dependent software components.
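As a rough illustration of the idea (all names here are hypothetical, not from Netflix’s or any vendor’s tooling), a chaos experiment can be as simple as wrapping a call to a dependency so that a configurable fraction of calls fail, letting the team observe how the rest of the system degrades:

```python
# Minimal chaos-injection sketch. The wrapper randomly raises an
# error to simulate a failing downstream dependency.
import random

def with_chaos(call, failure_rate: float, rng=random.random):
    """Wrap a dependency call so a fraction of invocations fail."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("chaos experiment: injected failure")
        return call(*args, **kwargs)
    return wrapped

def fetch_recommendations(user_id):
    # Stand-in for a real call to a downstream service.
    return ["item-1", "item-2"]

# During an experiment, 20% of recommendation calls will fail.
flaky_fetch = with_chaos(fetch_recommendations, failure_rate=0.2)
```

Running traffic through the wrapped call quickly exposes whether callers time out gracefully, fall back to defaults, or cascade the failure onward.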
Need to Localize Failures and Outages
Due to its nature and structure, the entire Texas power grid succumbed to blackouts, even though different parts of the state experienced very different levels of severity in terms of weather. For example, north Texas experienced more time at below-freezing temperatures than southern parts of the state. Not only that, most of the power sources, including natural gas, coal-fired plants, and wind turbines, failed in unison.
In software systems, this is exactly the type of failure mode we work to avoid, so that failures stay as localized and isolated as possible. For example, microservices architectures lend themselves to graceful degradation and failure isolation.
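One common isolation pattern is the circuit breaker. The sketch below is a simplified, hypothetical version (it omits the half-open recovery state a production implementation would include) showing how a service can fail fast and degrade gracefully instead of letting one sick dependency drag down everything behind it:

```python
# Simplified circuit-breaker sketch for failure isolation.

class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, fn, fallback):
        if self.open:
            return fallback()       # fail fast; stop hammering a sick service
        try:
            result = fn()
            self.failures = 0       # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            return fallback()       # degrade gracefully instead of cascading

breaker = CircuitBreaker(max_failures=2)
```

The contrast with the grid is the point: instead of every component failing in unison, the breaker confines the damage to one dependency while the caller keeps serving a degraded response.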
Multi-Stakeholder Collaboration is Key
The Texas power grid meltdown clearly demonstrated the lack of collaboration and coordination between various stakeholders, including ERCOT, the power generators, the power distributors, state regulators and policy makers, and subscribers.
In software domains, a fundamental lack of collaboration between different stakeholders (such as product owners, developers, testers, SREs, and operations engineers) would clearly be a mess. It is imperative that we foster collaboration and coordination using DevOps and SRE principles.
We Need Effective Interoperability Between Systems
This is a corollary to the point about collaboration above. The Texas power grid is generally isolated from other major power grids in the US. As a result, Texas was unable to leverage excess power capacity from other states during the crisis. Many states in the northern US, which are more accustomed to such severe winter weather, did not suffer outages, and may have been able to assist the Texas power grid if power sharing arrangements were in place.
In the domain of software systems, seamless interoperability between components is absolutely needed for efficiency and effectiveness. For example, modern systems extensively leverage API-based interoperability to enable data sharing between applications. Software load balancing is another solution that allows us to shift workloads from one part of a system to another.
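As a toy illustration of workload shifting (the backend names are hypothetical), a round-robin load balancer is the software analogue of interconnected grids sharing capacity, spreading requests across whatever healthy capacity is available:

```python
# Toy round-robin load balancer sketch.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, backends):
        self._pool = cycle(backends)  # rotate through backends endlessly

    def next_backend(self) -> str:
        return next(self._pool)

lb = RoundRobinBalancer(["api-1", "api-2", "api-3"])
print([lb.next_backend() for _ in range(4)])  # → ['api-1', 'api-2', 'api-3', 'api-1']
```

Real load balancers add health checks and weighting, but the principle is the same: because the components interoperate, no single node has to absorb the full load alone.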
Summary and Key Takeaways
This blog captures the remarkable parallels between reliability in the domains of power distribution and software systems. For me, the key similarities come down to the need to do advance capacity planning and to prepare for chaos.
As we digitize more and more of our physical systems, the parallels will grow even more striking. We’re likely to see greater intertwining of reliability approaches and technologies from different disciplines. For example, organizations can better ensure the reliability of water supply systems by using digital sensors embedded in pipes. Would extensive use of such sensors have enabled people to prevent the bursting of water supply lines in this most recent storm?
As I continue to monitor the Texas power grid situation, I see more striking takeaways for other aspects of software delivery, especially DevOps. Maybe there will be another blog on that in the days to come. Until then, my friends, stay warm and stay safe.
Shamim is a thought leader in DevOps, Continuous Delivery, Continuous Testing, and Application Lifecycle Management (ALM). He has more than 15 years of experience in large-scale application design and development, software product development and R&D, application quality assurance and testing, organizational quality management, IT consulting, and practice management. Shamim is currently the CTO of the DevOps business unit at Broadcom, where he is responsible for innovating DevOps solutions using Broadcom's industry-leading technologies.