Once upon a time we had visibility across IT Infrastructure. We had physical data centers and lovingly nurtured our servers and networks. Of course, the applications under our control became increasingly complicated, but we could always get under the hood when things went wrong.
But consider this: because so much now runs in the cloud, most people entering the tech workforce today will never see a physical server, play with a patch panel, or configure a router. They’ll never acquire that sysadmin “sixth sense” for what it takes to keep systems up and running. So, what’s needed to fill the void? Well, two things: data and analytics.
There’s no shortage of data or big data in IT operations. Acquire a new cloud service or dip your toes into serverless computing and IoT and you get even more data—more sensor data, logs, and metrics to supplement the existing overabundance of application component maps, clickstreams and capacity information.
But what’s missing from this glut of data are analytics and artificial intelligence (AI) for IT operations (AIOps for short). It’s tragic that organizations rich in recorded information lack the ability to derive knowledge and insights from it. It’s kind of like owning the highest-grade, gold-bearing ore but not having the tools to extract it—or worse, not even realizing you have the gold at all. Most organizations understand there’s “gold in them thar hills” and are employing methods to mine it. In the last few years, we’ve seen fantastic strides in data gathering and instrumentation, with new monitoring tools appearing almost as fast as each new technology and data source.
So, as organizations sign up for a new cloud service, there always seems to be another monitoring widget or open-source dashboard to go with it—and those get added to the other 20+ dashboards already in place. That’s like ordering a burger and being offered free fries. Immediate visual gratification yes, but it’ll only add maintenance pounds in the long term.
This isn’t to say that these new tools aren’t useful. Visualizing data is a great starting point for any analytics journey.
The problem, however, is that these offerings present information in a narrow observational range. Plus, by dramatically increasing the number of metrics under watch, they can increase the likelihood that something critical will be missed.
It’s not surprising, then, that some organizations sauce their fries with automated alerting. At its crudest, this involves setting static baselines with binary pass/fail conditions. That’s fine for predictable legacy systems, but it only increases noise levels and false positives in more fluid, modern applications. Or worse, false negatives—those “everything looked hunky dory” moments just before the major web application fell off the cliff. Plus, these mechanisms only analyze one variable at a time (they’re univariate, in math speak), so they cannot always deliver a complete picture of performance.
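To make that limitation concrete, here is a minimal sketch of a static, binary pass/fail check of the kind described above; the metric names and thresholds are invented for illustration and aren’t taken from any particular monitoring tool:

```python
# Minimal sketch of static, univariate threshold alerting.
# Metric names and thresholds are illustrative, not from any specific tool.

STATIC_THRESHOLDS = {
    "cpu_utilization_pct": 85.0,
    "error_rate_pct": 2.0,
    "response_time_ms": 500.0,
}

def check_metric(name: str, value: float) -> bool:
    """Return True if the metric breaches its fixed baseline (binary pass/fail)."""
    threshold = STATIC_THRESHOLDS.get(name)
    return threshold is not None and value > threshold

# Each metric is judged in isolation: readings just under their thresholds
# ("everything looks fine") can still precede a failure driven by several
# metrics drifting together -- the false negative described above.
sample = {"cpu_utilization_pct": 84.0, "error_rate_pct": 1.9, "response_time_ms": 490.0}
alerts = [name for name, value in sample.items() if check_metric(name, value)]
print(alerts)  # [] -- no alerts, even though every metric is near its limit
```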
To address this and other issues, many IT operations teams are turning to math and data science, which have been used to great effect across the business, from fraud and credit risk to marketing and retail pricing. Surely, if a team of archeologists can apply some neat math to find lost cities in the Middle East, IT operations teams can apply similar thinking. After all, you have data on tap, and unlike archeologists, you don’t have to decipher ancient clay tablets and scrolls.
So why has IT operations lagged behind in the application of analytics? Here are three factors that impede success:
It’s tempting to get sucked into data science and start applying algorithms in a piecemeal fashion. When things go wrong, it’s easy to blame the math. However, it’s never the math that fails, but the incorrect application of the math within the IT operations context. This can be especially troubling when machine learning is used sub-optimally: a lack of actionable alerting and workflows means critical events are missed. Worse, the system can start teaching itself that a prolonged period of poor customer experience is the new normal.
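As a hypothetical illustration of that last failure mode, the sketch below shows a naive self-updating baseline quietly absorbing a sustained latency regression; the rolling window, tolerance, and data are all invented for the example:

```python
# Sketch of a naive self-updating baseline "learning" a degraded state as normal.
# Window size, tolerance, and data are invented for illustration.
from collections import deque

class RollingBaseline:
    def __init__(self, window: int = 60):
        self.values = deque(maxlen=window)

    def update(self, value: float) -> None:
        self.values.append(value)

    def mean(self) -> float:
        return sum(self.values) / len(self.values)

    def is_anomalous(self, value: float, tolerance: float = 0.5) -> bool:
        # Flag values more than `tolerance` (50%) above the rolling mean.
        return value > self.mean() * (1 + tolerance)

baseline = RollingBaseline(window=60)

# Healthy latency (~200 ms), then a sustained regression to ~600 ms.
for latency in [200.0] * 60:
    baseline.update(latency)

print(baseline.is_anomalous(600.0))  # True: the regression is caught at first...

for latency in [600.0] * 60:
    baseline.update(latency)         # ...but with no actionable workflow, the
                                     # window fills with degraded readings,
print(baseline.is_anomalous(600.0))  # False: 600 ms is now the "new normal".
```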
Many organizations start an analytics program by looking at single metrics using standard deviations and normal probability models—basically, if a metric falls outside a range, there’s an alert. This seems reasonable, but what if 100 metrics are each captured every minute over a 24-hour period? That nets out to 144,000 observations and probably 1,000+ alerts (assuming events are triggered at the second standard deviation)—too many for anyone to reasonably process.
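The arithmetic behind those numbers is easy to sketch; the 2-sigma alert rate below assumes roughly normal, independent observations, which real operational metrics rarely are:

```python
# Back-of-the-envelope arithmetic for per-metric standard-deviation alerting.
# Figures follow the example above; the 2-sigma alert rate assumes roughly
# normally distributed, independent observations (rarely true in practice).

metrics = 100
samples_per_hour = 60        # one observation per metric per minute
hours = 24

observations = metrics * samples_per_hour * hours
print(observations)          # 144000 observations per day

# Under a normal distribution, roughly 4.6% of values land beyond 2 standard
# deviations (two-sided), so even a "quiet" day produces a flood of alerts.
two_sigma_rate = 0.046
print(int(observations * two_sigma_rate))  # ~6600 potential alerts per day
```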
Furthermore, these models fail to accommodate multi-metric associations. Consider, for example, the non-linear relationship between CPU utilization across a container cluster and application latency, where a small increase can massively impact responsiveness. Nuances like these suggest new approaches are needed in data modeling (versus simplistic tagging), where metrics can be automatically collected, aggregated, and correlated into groups. This is foundational and extremely beneficial, since algorithms gain context across previously siloed information and can be applied more powerfully.
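As a rough sketch of the idea (not any particular product’s implementation), related metrics can be grouped automatically by how strongly their time series move together; the synthetic data, the quadratic CPU-to-latency relationship, and the 0.8 correlation cutoff are arbitrary choices for illustration:

```python
# Sketch: automatically grouping metrics whose time series move together, so
# downstream algorithms see related signals (e.g. container CPU and application
# latency) as one context instead of isolated streams. Data is synthetic; the
# CPU-to-latency relationship is modeled as quadratic purely for illustration.
import numpy as np

rng = np.random.default_rng(42)
minutes = 240

cpu = 40 + 30 * np.sin(np.linspace(0, 6, minutes)) + rng.normal(0, 2, minutes)
latency = 100 + 0.05 * cpu**2 + rng.normal(0, 10, minutes)   # non-linear in CPU
queue_depth = 5 + 0.2 * latency + rng.normal(0, 3, minutes)  # follows latency
disk_free = 900 + rng.normal(0, 5, minutes)                  # unrelated metric

series = {"cpu": cpu, "latency": latency, "queue_depth": queue_depth, "disk_free": disk_free}
names = list(series)
corr = np.corrcoef(np.vstack([series[n] for n in names]))

# Greedy grouping: a metric joins a group if it correlates strongly (|r| >= 0.8)
# with any member already in it; otherwise it starts a new group.
groups = []
for i, name in enumerate(names):
    for group in groups:
        if any(abs(corr[i, names.index(other)]) >= 0.8 for other in group):
            group.add(name)
            break
    else:
        groups.append({name})

print(groups)  # expected: [{'cpu', 'latency', 'queue_depth'}, {'disk_free'}]
```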
The complexity of modern applications and customer engagement means there’ll be many unexpected conditions — the unknown unknowns. What becomes critical, therefore, is the ability to capture, group, and analyze elements across the entire application stack; without this, unanticipated events will tax even the best algorithms. Take, for example, a case where an organization has experienced declining revenue capture from a web or mobile application. The initial response could be to correlate revenue with a series of performance metrics such as response times, page load times, and API and back-end calls. But what if the actual cause is an unintentional blacklisting of outbound customer emails by a service provider? Unless that metric is captured and correlated along with other indicators across the tech stack, there’ll be longer recovery times and unnecessary finger-pointing across teams.
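A toy version of that scenario shows why breadth of capture matters; all of the series and metric names below are synthetic and purely illustrative:

```python
# Sketch: revenue correlated against the "usual" performance metrics and
# against one cross-stack signal (email bounce rate). All series are synthetic.
import numpy as np

rng = np.random.default_rng(7)
days = 90

# The provider starts blocking outbound email two thirds of the way through.
bounce_rate = np.concatenate([rng.normal(1, 0.2, 60), rng.normal(12, 1.0, 30)])
revenue = 100_000 - 4_000 * bounce_rate + rng.normal(0, 2_000, days)
page_load_ms = rng.normal(900, 60, days)      # healthy throughout
api_error_pct = rng.normal(0.5, 0.1, days)    # healthy throughout

candidates = {"page_load_ms": page_load_ms, "api_error_pct": api_error_pct,
              "email_bounce_rate": bounce_rate}

for name, series in candidates.items():
    r = np.corrcoef(revenue, series)[0, 1]
    print(f"{name:>18}: r = {r:+.2f}")
# Only email_bounce_rate shows a strong (negative) correlation with revenue;
# if it is never captured, the front-end metrics alone point nowhere.
```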
Analytics can fall short if it can’t help staff (or systems) decide what to do when an anomalous pattern or condition is detected. Key, then, is a workflow-driven approach where the system not only collects, analyzes, and contextualizes, but also injects decision-making processes such as self-healing into the model. To do this, the best systems will be those that leverage the existing knowledge and learnings of staff and solution providers with significant experience working in complex IT environments.

Today, cloud and modern tech stacks are the only way to conduct business at scale. While we might bemoan the loss of visibility, what we can win back is far more beneficial: analytical insights based on the digital experience of our customers. Doing this requires great math, but it also demands fresh approaches to its application in the context of business goals, underpinning technologies, and the people and processes needed to support them. By analyzing increased data volumes and solving more complex problems, AIOps equips teams to speed delivery, gain efficiencies, and deliver a superior user experience.
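To ground the workflow idea mentioned above, here is a minimal sketch of routing detected anomalies to decisions rather than just alerts; the group names, severities, and remediation actions are hypothetical placeholders, not a description of any specific product:

```python
# Sketch of a workflow-driven response: an anomaly is not just reported but
# routed to a decision -- notify, open a ticket, or trigger a self-healing
# action. Rule names and actions are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Anomaly:
    metric_group: str      # e.g. the correlated group the metric belongs to
    severity: str          # "warning" | "critical"
    description: str

def restart_unhealthy_pods(anomaly: Anomaly) -> None:
    print(f"[self-heal] restarting pods for {anomaly.metric_group}")

def page_on_call(anomaly: Anomaly) -> None:
    print(f"[page] {anomaly.severity}: {anomaly.description}")

def open_ticket(anomaly: Anomaly) -> None:
    print(f"[ticket] tracking issue for {anomaly.metric_group}")

# Encoded operational knowledge: which conditions warrant which response.
PLAYBOOK: dict[tuple[str, str], Callable[[Anomaly], None]] = {
    ("container-cpu-latency", "critical"): restart_unhealthy_pods,
    ("container-cpu-latency", "warning"): open_ticket,
    ("revenue-funnel", "critical"): page_on_call,
}

def handle(anomaly: Anomaly) -> None:
    action = PLAYBOOK.get((anomaly.metric_group, anomaly.severity), open_ticket)
    action(anomaly)

handle(Anomaly("container-cpu-latency", "critical", "latency breach across cluster"))
handle(Anomaly("revenue-funnel", "critical", "checkout conversions dropping"))
```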
Formerly the chief architect of DX APM and now the chief architect of Broadcom AIOps, Erhan Giral has been working in the monitoring space for over twelve years and holds various patents around operational intelligence and application monitoring. Before joining Broadcom, he worked on various graph visualization and analytics frameworks and SDKs, serving diverse verticals such as network intelligence, bioinformatics, and finance.