AIOps: Establishing a Data-Driven Model that Helps SREs Predict the Future

    As a kid I grew up reading a lot of science fiction. My forbearing parents let me check out the maximum number of books the library would allow each week (30; I still remember that number). And each week I would go back for more. Given this constant consumption of augury, you would think something I read would have prepared me for the future we now face within the operations space.

    While there are definitely some inklings in the science fiction canon about computer systems constructed at such scale that they would be hard for humans to understand, there is precious little attention paid to what it would take to operate them in production. Welcome to my world (and your reality, too, I bet).

    The good news is that two approaches are emerging that can help, and they’re more science than fiction. The first is the engineering discipline known as Site Reliability Engineering (SRE), which aims to take a software engineering approach to operations. The second, AIOps, short for artificial intelligence in operations, is an approach for applying a class of advanced algorithms to the massive corpus of operational data we are now accumulating as part of the ordinary day-to-day activity of running all of our systems and services.

    One goal of SRE is to construct a set of operational practices that allows us to navigate the tricky path between a desired feature velocity (iterating on the software as fast as possible to provide the features a business needs to deliver to its customer base) and a desired level of operational stability (keeping the system available for those customers). This is trickier than it sounds for at least three reasons:

    • There are often completely different sets of people working on these problems.
    • They have very different incentives around the work.
    • Communication between these groups is often, shall we say, a little dicey.

    SRE, like many other engineering disciplines, is a data-driven approach. It uses data to help create productive conversations and streamline decision making between these different groups.

    AIOps similarly tries to use operational data to provide a big win for an organization. It attempts to address this key question: “We have all of this data on the operational status and performance of our infrastructure. What can we learn from it?”

    Can the record of the past help us understand how things are working in the present or even help predict the future? Is there information in the data I already have that might provide some insight into how my systems are behaving? For example:

    • Is this just a spike in traffic or an indication my systems are about to experience a tailspin into failure?
    • Are there any difficult-to-see patterns in the load in my system that could help me optimally provision my resources so I don’t pay more than I need to?
    • Have we ever seen an outage like the one we are experiencing? If so, how did we deal with it last time?
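
    To make the first of these questions a little more concrete, here is a minimal sketch of one of the simplest statistical checks: a rolling z-score over per-minute request counts. The function name, the window size, the threshold, and the sample data are all illustrative assumptions, not taken from any particular AIOps product. Real tooling goes well beyond this, and a check like this only tells you that a data point is unusual, not whether it is a harmless spike or the start of a failure.

```python
from statistics import mean, stdev


def flag_spikes(requests_per_min, window=5, threshold=3.0):
    """Return the indices of points that deviate from the mean of the
    preceding `window` points by more than `threshold` standard deviations.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    flagged = []
    for i in range(window, len(requests_per_min)):
        history = requests_per_min[i - window:i]
        mu = mean(history)
        sigma = stdev(history)  # sample standard deviation of the window
        if sigma == 0:
            # A perfectly flat window gives no scale to compare against; skip it.
            continue
        z = (requests_per_min[i] - mu) / sigma
        if abs(z) > threshold:
            flagged.append(i)
    return flagged


# Made-up traffic with one obvious jump at index 6.
traffic = [100, 102, 98, 101, 99, 100, 250, 101, 99]
print(flag_spikes(traffic))
```

    Comparing each point only against the window that precedes it keeps the check cheap, but it also means one large spike inflates the baseline for the next few points, which is one reason production systems tend to use more robust methods.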

    Some of this is real today; some of it is easily imagined. There are definitely limits on what AIOps can offer our operations practices, but we surely haven’t taken it to its full potential yet. Ultimately, with SRE and AIOps, your teams can automate incident response and start to bring a little of your future into your present.