See What Really Matters: Why You Need Observability vs. Monitoring

    For decades, monitoring has been a critical component of effective business service delivery. Today, however, monitoring isn’t enough. Now, teams need observability—a way to track, manage, and optimize the performance and availability of critical business services. This post examines why observability, rather than monitoring alone, is now so vital, and it identifies the key requirements observability platforms must address.

    Background: Monitoring Has a Long History

    Since the deployment of mainframes decades ago, enterprise IT operations (ITOps) teams have been tasked with monitoring. In the decades since, the technologies, environments, and approaches employed have continued to evolve. Open systems were introduced, and these subsequently gave way to the adoption of internet-based approaches.

    Across all the technology transitions that enterprises have progressed through, monitoring has remained a critical discipline. Throughout this time, teams typically relied on purpose-built tools to monitor specific elements, such as web servers, storage systems, or routers.

    At a high level, this domain-specific, component-level monitoring detected a system’s state and health. Ultimately, these approaches can be viewed as “keeping the lights on.” Fortune 500 enterprises have been spending hundreds of millions of dollars just to keep the lights on.

    The Problem: Monitoring Worked, and Then It Didn’t

    In recent years, challenges with traditional monitoring approaches arose. As enterprises moved to cloud-based environments, the nature of applications evolved dramatically, and the number of applications in use began to proliferate rapidly.

    As environments continued to grow more diverse and complex, so did monitoring approaches. For example, clear boundaries separated the monitoring tools used for applications, infrastructure, and networks, and teams often ended up employing a number of different tools in each domain. Over time, as new technologies were deployed, new monitoring tools were implemented to manage them.

    Today, environments are highly distributed, highly reliant upon networks, and increasingly composed of multiple clouds. By 2025, an estimated 85% of enterprises will have adopted a cloud-first principle.1 Further, driven by increased adoption of cloud-native services, microservices, software-defined components and networks, and more, environments keep getting more dynamic and complex. With the increasingly widespread use of microservices and cloud-native architectures, the management of applications and infrastructure is starting to converge.

    Now, it’s normal for ITOps teams to be tasked with managing ephemeral environments, such as containers, with resources in virtually constant motion. In fact, by 2022, it is estimated that 75% of global organizations will be running containerized applications in production.2

    Environments typically feature millions of components and diverse mixes of technologies, standards, platforms, and more. In these modern environments, traditional component-centric monitoring isn’t enough. Simply keeping the lights on doesn’t cut it anymore.

    With traditional, domain-centric monitoring approaches, teams can track the performance of specific elements, but they lack visibility into the broader picture. Fundamentally, they’re missing the most critical element: the quality of the service levels end users are actually experiencing. Since ITOps teams are increasingly tasked to do more with less, they now have to pivot from monitoring metrics to observing behavior, especially those behaviors that have an impact on critical business services.

    Further, with traditional monitoring approaches, teams are being overwhelmed by monitoring “noise”: component-level alerts and data that make it impossible to distinguish between the items to ignore and the alerts that really matter. One report indicated that 31% of alarms are false, and the average mean time to resolution is currently 4.5 hours.3 Further, when you consider that every minute of downtime costs businesses $5,600, teams can’t stick with the status quo.4 Contending with this inefficiency and noise isn’t tenable in the long term—particularly as teams continue to be tasked with supporting more systems, applications, and users, while making do with existing staffing and resource levels.

    In short, monitoring is absolutely necessary, but it is not sufficient for teams looking to observe application and infrastructure behavior in highly dynamic modern environments.

    Objectives: Service Visibility and Business Alignment

    As ITOps teams look to adapt to these new realities, they need to realize two key objectives:

    • Service and end-user visibility. For virtually every organization in any industry, ensuring optimal performance and availability of digital services plays an increasingly integral role in whether the business ultimately meets its most critical objectives. For example, in a retail business, the performance of online commerce and mobile applications will have a direct impact on revenues. It’s now more vital than ever for teams to be able to track and optimize performance of business services. Fundamentally, what’s critical is what the customer experience is like, not whether specific components are up or down. To establish effective governance, teams need to be able to track, manage, and optimize key performance indicators, such as service level objectives (SLOs) and service level indicators (SLIs).
    • Business alignment. In addition, it’s vital to tie service level measurements back to the business. Not all services should be treated alike, particularly given the fact that IT is continually being asked to do more with less, and sift through high volumes of incidents. The reality is that some services are higher priority than others. When teams have to weigh among competing issues, efforts, and investments, their prioritization needs to be driven by business impact. Having visibility into business KPIs is therefore vital to managing ongoing operations, plans, and remediation efforts.
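
    To make the SLO/SLI concept above concrete, here is a minimal sketch of how a team might compute an availability SLI and track the remaining error budget against an SLO target. The function names, counts, and 99.9% target are illustrative assumptions, not part of any particular platform:

```python
# Illustrative sketch: computing an availability SLI and checking it against
# an SLO via an error budget. All names and numbers are hypothetical.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests

def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    if allowed_failure == 0:
        return 0.0 if actual_failure > 0 else 1.0
    return 1.0 - actual_failure / allowed_failure

sli = availability_sli(successful_requests=998_700, total_requests=1_000_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {budget:.1%}")
# Here the SLI (99.87%) falls short of the 99.9% target, so the budget goes negative.
```

    Framing service levels as an error budget, rather than a raw uptime number, is what lets teams prioritize: a service that has burned its budget warrants attention before one that hasn’t.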

    The Need for Observability vs. Monitoring

    To get the visibility they need to meet their most critical imperatives, today’s ITOps teams need observability, not just monitoring. Where monitoring tracks the health of specific components, observability addresses the health of business services.

    In the example of a retail application, a user’s transaction may rely on a web server, a microservices-based application running in a cloud, a number of distributed network components, a backend ERP system running in the corporate data center, and more. Teams need observability, first, to ensure they’re immediately aware if service levels are suboptimal, and, second, to quickly identify where within this complex digital supply chain the issue arose.

    Ultimately, observability is about connection, not merely collection. In the following sections, we discuss the requirements for achieving true observability.

    Five Requirements for Establishing Effective Observability

    The sections below offer an overview of some of the key requirements for achieving highly effective observability in an enterprise environment.

    #1. Avoid Blind Spots

    In today’s hybrid environments, data is constantly crossing various domains, moving beyond the enterprise perimeter into edge networks and across any number of cloud environments. This cross-domain movement creates blind spots for teams relying solely on siloed toolsets. Consequently, it is vital that teams gain the visibility needed to eliminate these blind spots, and ensure they can track data across its lifecycle.

    #2. Monitor Topologies and Interdependencies in Real-Time

    To put the pieces together, topology data is extremely important. But today’s cloud-native environments and software-defined components are inherently dynamic, making topologies extremely difficult to create, manage, and keep current. Observability solutions therefore have to be extremely nimble in their handling of topologies.

    Platforms need to be able to establish and deliver topology visibility, and to ensure it stays current as environments change. Platforms must employ modeling to track how various components relate, and the business services that are reliant upon them. These models must be established along multiple layers. Topologies should address physical systems as well as logical ones, depicting how data flows through the environment.
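
    One way to picture the modeling described above is as a dependency graph. The following sketch, with entirely hypothetical service names, shows how walking such a graph in reverse reveals which business services a failing component can impact:

```python
# Illustrative sketch: a minimal service-topology model. Edges point from a
# service to the components it depends on; reversing them tells us which
# services are affected when a given component fails. Names are hypothetical.
from collections import defaultdict

dependencies = {
    "checkout-service": ["payment-api", "inventory-db"],
    "payment-api": ["payment-db", "fraud-check"],
    "search-service": ["search-index"],
}

# Build reverse edges: component -> services that depend on it.
dependents = defaultdict(list)
for service, deps in dependencies.items():
    for dep in deps:
        dependents[dep].append(service)

def impacted_by(component: str) -> set:
    """All services transitively affected if `component` fails."""
    impacted, stack = set(), [component]
    while stack:
        node = stack.pop()
        for parent in dependents.get(node, []):
            if parent not in impacted:
                impacted.add(parent)
                stack.append(parent)
    return impacted

print(sorted(impacted_by("payment-db")))  # ['checkout-service', 'payment-api']
```

    In a real platform this graph must be discovered and refreshed continuously, since in dynamic environments the edges change as containers and software-defined components come and go.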

    #3. Handle Massive Data Demands

    Teams need to be able to accommodate the volume, variety, and velocity of data that modern environments generate, while ensuring data veracity. Following is more information on each of these requirements:

    • Volume. Observability solutions must be able to accommodate massive scale, harnessing petabytes of data collected at hundreds of thousands of endpoints, while managing the increasing data volumes associated with microservices, virtualized environments, and more.
    • Variety. To establish true observability, it’s vital to accommodate all the distinct elements that business services rely upon. In today’s environments, that means accommodating diverse types of data. Teams need to be able to leverage structured data, alarms, metrics, and topology information, as well as unstructured logs and traces, and ingest it into big data platforms.
    • Velocity. To keep pace with highly dynamic environments, teams need platforms that can provide the scalability and responsiveness required to deliver real-time observability.
    • Veracity. To make data actionable, platforms need mechanisms that ensure data is semantically valid, current, and accurate.
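
    The variety and veracity requirements above often translate into a normalization step at ingestion: mapping diverse telemetry into a common event shape while rejecting malformed or stale records. A minimal sketch, with illustrative field names of my own choosing:

```python
# Illustrative sketch: normalizing raw telemetry (metrics, logs, alarms) into
# one common event shape, with a simple veracity check that rejects records
# lacking a timestamp or older than one hour. Field names are hypothetical.
from datetime import datetime, timedelta, timezone
from typing import Optional

def normalize(record: dict, kind: str) -> Optional[dict]:
    """Map a raw record into a common event; return None if it fails validation."""
    ts = record.get("timestamp")
    if not isinstance(ts, datetime):
        return None  # malformed: no usable timestamp
    if datetime.now(timezone.utc) - ts > timedelta(hours=1):
        return None  # stale data fails the veracity check
    return {
        "kind": kind,  # e.g. "metric", "log", "alarm"
        "ts": ts,
        "source": record.get("source", "unknown"),
        "body": {k: v for k, v in record.items() if k not in ("timestamp", "source")},
    }

raw_metric = {"timestamp": datetime.now(timezone.utc), "source": "web-01", "cpu": 0.92}
event = normalize(raw_metric, "metric")
```

    Pushing validation to the edge of the pipeline like this keeps downstream correlation and analytics from being skewed by bad or outdated data.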

    #4. Correlate and Unify Intelligence

    Collecting discrete data points is only the beginning. It’s critical that ITOps teams can establish unified visibility of all the distributed elements and cloud environments a business service relies upon. Toward this end, it’s vital that data from disparate, siloed systems can be correlated, connected, and unified.

    To realize the full potential of observability, platforms need to support the convergence of data, not only from across the ITOps domain, but also from the security operations arena. By harnessing this observability, teams can leverage monitoring data at the component level and then apply correlation to measure the performance of services. Finally, while platforms must be able to aggregate data from diverse sources, it’s critical that they do so while retaining contextual relationships.
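
    The simplest form of the correlation described above is grouping component-level alerts that share context, here a service tag, within a time window, so that many alerts collapse into a few service-level incidents. A sketch with hypothetical alert data:

```python
# Illustrative sketch: collapsing component-level alerts into service-level
# incidents by correlating on a shared service tag within a time window.
# The alert records and five-minute window are hypothetical.
from datetime import datetime, timedelta

alerts = [
    {"ts": datetime(2024, 1, 1, 12, 0), "component": "web-01", "service": "checkout"},
    {"ts": datetime(2024, 1, 1, 12, 2), "component": "db-03", "service": "checkout"},
    {"ts": datetime(2024, 1, 1, 14, 30), "component": "cache-02", "service": "search"},
]

def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts that share a service and arrive within `window` of each other."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        for incident in incidents:
            if (incident["service"] == alert["service"]
                    and alert["ts"] - incident["last_ts"] <= window):
                incident["alerts"].append(alert)
                incident["last_ts"] = alert["ts"]
                break
        else:
            incidents.append({"service": alert["service"],
                              "last_ts": alert["ts"], "alerts": [alert]})
    return incidents

incidents = correlate(alerts)
print(len(incidents))  # three alerts collapse into two service-level incidents
```

    Even this toy version shows the payoff: the operator reasons about two service incidents instead of three raw component alerts, and the grouping preserves which components are implicated in each.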

    #5. Gain Actionable Insights Through Artificial Intelligence (AI) and Machine Learning

    Given the massive data volumes being generated, individuals relying upon manual analysis can’t possibly keep on top of it all, let alone mine maximum intelligence from it. Today, AI and machine learning therefore represent increasingly critical capabilities.

    Through these capabilities, platforms can deliver fast root cause analysis and predictive analytics. By leveraging machine learning, teams can correlate data from different domains. For example, a platform may correlate application metric data with network log data to provide insights for performance optimization. AI and machine learning can provide the real-time intelligence that forms the basis of automated, closed-loop remediation, and ultimately the establishment of self-healing operations.
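
    As a minimal stand-in for the statistical baselining such platforms apply before correlation and root cause analysis, the sketch below flags anomalous latency samples with a rolling z-score. The sample data, window size, and threshold are illustrative assumptions:

```python
# Illustrative sketch: flagging anomalous metric samples with a rolling
# z-score, a simple stand-in for the baselining step in ML-driven analysis.
from statistics import mean, stdev

def anomalies(samples, window=10, threshold=3.0):
    """Return indexes of samples more than `threshold` std devs from the rolling baseline."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

latencies_ms = [101, 99, 102, 100, 98, 101, 103, 99, 100, 102, 480, 101]
print(anomalies(latencies_ms))  # [10] -- the 480 ms spike stands out from the baseline
```

    Production systems replace this fixed threshold with learned, seasonally aware baselines, but the principle is the same: deviation from expected behavior, not a static limit, is what triggers attention.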

    Finally, with these capabilities, teams can leverage business-driven, service-level analytics to gain the insights needed to optimize ongoing operations, refine planning, and maximize return on investments.

    Conclusion

    Today’s digital services are too critical to leave to guesswork or chance. In complex, dynamic environments, it’s vital to move beyond monitoring and establish true observability. By harnessing observability platforms that leverage data from across the IT ecosystem, and applying unified, AI-driven intelligence, teams can become equipped to ensure critical business services deliver the optimized availability and responsiveness required.

    1. Gartner, “Predicts 2021: Building on Cloud Computing as the New Normal,” December 14, 2020, ID: G00735465, Analyst(s): Yefim Natis, David Smith, Sid Nag, Gregor Petri, David Cearley, Michael Warrilow, Henrique Cecci

    2. Gartner, “Top Emerging Trends in Cloud-Native Infrastructure,” May 28, 2019, ID: G00385619, Analyst(s): Arun Chandrasekaran, Wataru Katsurashima

    3. IDG Research

    4. Gartner, “The Cost of Downtime,” Andrew Lerner, July 16, 2014, URL: https://blogs.gartner.com/andrew-lerner/2014/07/16/the-cost-of-downtime/