AI for IT Operations: Secrets to Success Beyond Great Math

    Once upon a time we had visibility across IT Infrastructure. We had physical data centers and lovingly nurtured our servers and networks. Of course, the applications under our control became increasingly complicated, but we could always get under the hood when things went wrong.

    But consider this: Thanks to all things cloud, most folks entering the tech workforce today will never get to see a physical server, play with a patch panel, or configure a router. They’ll never need to acquire that sysadmin “sixth-sense” knowledge of what’s needed to keep systems up and running.

    So, what’s needed to fill the void? Well, two things: data and analytics.

    There’s no shortage of data or big data in IT operations. Acquire a new cloud service or dip your toes into serverless computing and IoT and you get even more data—more sensor data, logs, and metrics to supplement the existing overabundance of application component maps, clickstreams, and capacity information.

    But what’s missing from this glut of data are analytics and artificial intelligence (AI) for IT operations (AIOps for short). It’s tragic that organizations rich in recorded information lack the ability to derive knowledge and insights from this information. It is kind of like owning the highest-grade, gold-bearing ore, but not having the tools to extract it—or worse, not even realizing you have the gold at all.

    Most organizations understand there’s “gold in them thar hills,” and are employing methods to mine it. In the last few years, we’ve seen fantastic strides in data gathering and instrumentation, with new monitoring tools appearing almost as fast as each new tech and data source. So, as organizations sign up for a new cloud service, there always seems to be another monitoring widget or open-source dashboard to go with it—and those get added to the other 20+ dashboards already in place. That’s like ordering a burger and being offered free fries. Immediate visual gratification yes, but it’ll only add maintenance pounds in the long term.

    This isn’t to say that these new tools aren’t useful. Visualizing data is a great starting point for any analytics journey. The problem, however, is that these offerings present information in a narrow observational range. Plus, by sharply increasing the number of metrics under watch, they can increase the likelihood that something critical will be missed.

    It’s not surprising then that some organizations sauce their fries with automated alerting. At its crudest, this involves setting static baselines with binary pass/fail conditions. That is fine for predictable legacy systems, but in more fluid, modern applications it only increases noise levels and false positives. Or worse, it produces false negatives: those “everything looked hunky dory” moments just before the major web application fell off the cliff. Plus, these mechanisms only analyze one variable at a time (they’re univariate in math speak), so they cannot always deliver a complete picture of performance.
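
    To make the false-negative case concrete, here is a minimal sketch comparing a static pass/fail threshold with a simple rolling baseline. The latency samples, the 200 ms limit, the window size, and the function names are all made up for illustration, and a rolling z-score is only one of many possible baselining techniques.

```python
from statistics import mean, stdev


def static_alerts(values, threshold):
    """Indices where a value exceeds a fixed pass/fail threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def rolling_alerts(values, window=5, k=3.0):
    """Indices where a value sits more than k standard deviations
    away from the mean of the preceding `window` samples."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts


# Synthetic latency samples in milliseconds (illustrative only).
latency = [100, 102, 98, 101, 99, 100, 180, 101, 99, 100]

print(static_alerts(latency, threshold=200))  # [] : the spike stays under the fixed limit
print(rolling_alerts(latency))                # [6] : the spike stands out against recent history
```

    Both checks are still univariate, so the rolling version reduces one weakness of static baselines without addressing the single-variable limitation.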

    To address this and other issues, many IT operations teams are turning to math and data science, which have been used to great effect in many facets of a business, ranging from fraud and credit risk to marketing and retail pricing. Surely, if a team of archeologists can apply some neat math to find lost cities in the Middle East, IT operations teams can apply similar thinking. After all, you have data on tap and, unlike archeologists, you don’t have to decipher ancient clay tablets and scrolls.

    AI for IT Operations: Key Impediments to Success

    So why has IT operations lagged behind in the application of analytics? Here are three factors that impede success:

    1. Cursing the Math

    It’s tempting to get sucked into data science and start applying algorithms in a piecemeal fashion. When things go wrong, it’s easy to blame the math. However, it’s never the math that fails, but the incorrect application of the math within the IT operations context. This can be especially troubling when machine learning is used suboptimally: a lack of actionable alerting and workflows means critical events are missed. Further, the system can start teaching itself that a prolonged period of poor customer experience is the new normal.
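
    The “new normal” problem can be sketched with a self-updating baseline. The example below assumes an exponentially weighted moving average (EWMA) as the learned baseline and a hypothetical `freeze_on_alert` guard; the data, parameters, and names are illustrative, not a description of any particular product.

```python
def detect(values, alpha=0.3, tolerance=0.5, freeze_on_alert=False):
    """Flag values more than `tolerance` (as a fraction) above an EWMA baseline.

    With freeze_on_alert=True, flagged samples are not fed back into the
    baseline, so a sustained degradation cannot be absorbed as normal.
    """
    baseline = values[0]
    flags = []
    for v in values:
        anomalous = v > baseline * (1 + tolerance)
        flags.append(anomalous)
        if not (anomalous and freeze_on_alert):
            # The baseline keeps learning from this sample.
            baseline = alpha * v + (1 - alpha) * baseline
    return flags


# Synthetic latency: healthy at 100 ms, then a prolonged degradation at 300 ms.
latency = [100] * 5 + [300] * 10

naive = detect(latency)
guarded = detect(latency, freeze_on_alert=True)

print(sum(naive), sum(guarded))  # 2 10
```

    The naive detector alerts twice and then goes quiet while customers are still suffering. The guard has its own cost: a legitimate, permanent shift is never learned without someone acknowledging it, which is one reason alerting needs to be tied to a workflow.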

    2. Narrow Scope

    Many organizations start an analytics program by looking at single metrics using standard deviations and probability models: basically, if a metric falls outside a range, there’s an alert. This seems reasonable, but what if 100 metrics are each being captured every minute over a twenty-four-hour period? That nets out to 144,000 observations and probably 1,000+ alerts (assuming events are triggered at the second standard deviation), too many for anyone to reasonably process.
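
    The arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch that assumes every observation is independent and normally distributed and that alerts fire on either side of the mean; real metrics rarely behave this neatly, so treat the result as an order of magnitude.

```python
import math

metrics = 100
samples_per_metric = 24 * 60                  # one sample per minute for a day
observations = metrics * samples_per_metric   # 144,000

# Probability that a normally distributed value falls more than
# two standard deviations from the mean (both tails combined).
tail = 1 - math.erf(2 / math.sqrt(2))         # about 0.0455

expected_alerts = observations * tail

print(observations)
print(round(tail, 4))
print(round(expected_alerts))                 # roughly 6,500 alerts per day
```

    Under these assumptions the daily count lands in the thousands, comfortably above the 1,000+ figure; alerting on the upper tail only would halve it.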

    Furthermore, these models fail to accommodate multi-metric associations. One example is the non-linear relationship between CPU utilization across a container cluster and application latency, where a small increase in utilization can massively impact responsiveness. Nuances like these suggest new approaches are needed in data modeling (vs. simplistic tagging), where metrics can be automatically collected, aggregated, and correlated into groups. This is foundational and extremely beneficial, since algorithms gain context across previously siloed information and can be applied more powerfully.
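
    As a rough illustration of correlating metrics into groups, the sketch below links any two metrics whose absolute Pearson correlation reaches a threshold and merges linked metrics with a small union-find. The metric names, sample values, and the 0.8 threshold are assumptions made for the example. Pearson correlation only measures linear association, so a rank-based measure or a fitted model would be a better match for strongly non-linear relationships.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def group_metrics(series, threshold=0.8):
    """Group metric names whose pairwise |correlation| >= threshold."""
    parent = {name: name for name in series}

    def find(name):
        # Follow parent links to the group's representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(series, 2):
        if abs(pearson(series[a], series[b])) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in series:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Synthetic samples (illustrative only): latency climbs steeply as CPU rises.
series = {
    "cluster_cpu_pct": [40, 45, 50, 60, 70, 85, 90],
    "app_latency_ms": [100, 105, 110, 130, 170, 320, 480],
    "disk_free_gb": [500, 498, 501, 499, 500, 502, 499],
}

print(group_metrics(series))
# [['app_latency_ms', 'cluster_cpu_pct'], ['disk_free_gb']]
```

    Once metrics are grouped this way, an anomaly check can be applied per group rather than per metric, which is where the added context comes from.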

    3. Limited Range

    The complexity of modern applications and customer engagement means there’ll be many unexpected conditions—the unknown unknowns. What becomes critical therefore is the ability to capture and group elements and analyze across the entire application stack, without which unanticipated events will tax even the best algorithms.

    Take for example a case in which an organization has experienced declining revenue capture from a web or mobile application. The initial response could be to correlate revenue with a series of performance metrics, such as response times, page load times, API and back-end calls, and so on. But what if the actual cause is an unintentional blacklisting of outbound customer emails by a service provider? Unless that metric along with other indicators across the tech stack is captured and correlated, there’ll be longer recovery times and unnecessary finger pointing across teams.

    Analytics–Human Disconnect

    Analytics can fall short if it can’t help staff (or systems) make decisions as to what to do when an anomalous pattern or condition is detected. Key then is a workflow-driven approach in which the system not only collects, analyzes, and contextualizes, but also injects decision-making processes, such as self-healing, into the model. To do this the best systems will be those that leverage existing knowledge and learnings of staff and solution providers who have significant experience working in complex IT environments.

    Today, cloud and modern tech stacks are the only way to conduct business at scale. While we might bemoan the loss of visibility, what we can win back is far more beneficial: analytical insights based on the digital experience of our customers. Doing this requires great math, but also demands fresh approaches to its application within the context of business goals, underpinning technologies, and the people and processes needed to support them.

    By analyzing increased data volumes and solving more complex problems, AIOps equips teams to speed delivery, gain efficiencies, and deliver a superior user experience.