Retail Digital Transformation: Why Observability is Key

    For today’s retailers, opportunities and challenges seem to keep getting bigger. Those that successfully pursue retail digital transformation will be best positioned to contend with both. This post examines how operations teams can establish the observability that can be instrumental in supporting retail digital transformation.

    The Rising Stakes for Digital Agility

    As the world navigates the COVID-19 pandemic and its massive implications, businesses in a number of industries have thrived, including financial services and high tech. At the same time, many of those in travel, hospitality, and retail are hanging on to survive.

    Clearly, however, whether an organization is thriving or not is predicated on more than just the industry they’re operating in. No matter the industry, the pandemic has placed an increased premium on digital agility. Over and over, we’re seeing that the businesses that have been leading the way in digital transformation have been able to adapt much more quickly to the sudden shifts in challenges and opportunities that have been presented in recent months. Others, who have had digital transformation on the back burner or lower on the priority list, are now having their hands forced.

    I have particular empathy for those in the retail industry. In a prior role, I built monitoring and service assurance solutions for a large global retailer. When I look back at my time in retail, there are two things in particular that I cherish:

    1. The focus and collective commitment toward hitting the holiday sales numbers, which play a make-or-break role in the company’s entire year.
    2. The core reality of retail businesses: People buy from those they trust.

    We were a highly collaborative team. By delivering an IT infrastructure that offered five-nines availability, optimally performing applications, and frictionless, rewarding customer experiences, our team could play an indispensable role in building customer trust. Further, by establishing holistic observability, we could play a big role in retail digital transformation.

    Today, a global retailer may have a massive presence, including 3,500 to 7,000 stores, hundreds of thousands of point-of-sale terminals, hundreds of depots, and hundreds of warehouses. They have operations that need to deliver continuous, 24/7 support to customers making purchases via Web and mobile apps, contend with the spikes associated with large-scale promotional events, and more. In the midst of all these demands, these teams also are facing intense pressure to reduce costs, and these financial pressures have only intensified in recent months.

    IT operations teams need to meet these imperatives while navigating ever-more complex environments:

    • In-store, data centers, and a blend of cloud environments.
    • Monolithic and microservices architectures.
    • Modern platforms and mainframes.
    • Fragmented pockets of automation.
    • And a dizzying array of monitoring and management tools to control these environments.

    The Key Building Blocks to Observability

    Within retail, there have been traditional problems in contending with this complexity. The point tool approaches used to manage these environments only offered a fragmented view; teams couldn’t gain the unified visibility that service owners needed to really understand, and take steps to optimize, the service levels delivered to customers.

    When operations teams were notified about issues, whether downtime, slow performance, or errors, it was easy for staff not to take the issues seriously—they were missing the critical answer to the “so what?” question. There was no visibility into how these specific technology issues would translate to issues for customers or the business. Fundamentally, what IT tools were generating was data; what the business really needed was actionable information.

    To deliver the intelligence required, teams need to move to establish observability. Observability represents the move to advance IT monitoring, and it is vital to enabling retail IT teams to meet their charters. Through observability, teams can gain better insights into digital business services, so they can deliver higher quality, more innovative customer experiences. In the following sections, I’ve outlined some key building blocks that help establish the observability needed to support retail digital transformation.

    Foundation: Simpler, Better, and Cheaper

    Any technology component added must produce an outcome that aligns with one or more of these themes: simpler, better, or cheaper. Invest in technology with a long-term, strategic perspective. Look beyond features and functionality and evaluate the scale and repeatability that can be delivered across the organization.

    Processes: Establish Optimized Metrics and Approaches

    Align OLAs and SLAs with Key Business Objectives

    Bring transparency to operational level agreements (OLAs). Establishing OLAs that business stakeholders buy into is a critical first step. Then it’s important to have milestone checks against these OLAs, and to keep tracking them for every cycle, whether daily, weekly, or monthly. This is the best way to provide a sense of accountability to everyone who contributes to the bigger organizational goal, such as hitting the holiday sales numbers.

    There can be many approaches to do this, depending on the people, processes, and technologies in place. At a high level, it’s important to roll up data into dashboards, which can use color-coded indicators (red, amber, and green) to reveal whether the OLA is being met, and to provide the status for each task in the retail function.
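
    As a rough illustration of the roll-up described above, the following Python sketch derives a red, amber, or green status for each OLA milestone and then reports the worst status per retail function. The `Milestone` fields, the 30-minute warning window, and the example function names are assumptions made for illustration, not features of any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Milestone:
    """One OLA checkpoint for a retail function (all names are hypothetical)."""
    retail_function: str
    task: str
    due: datetime                         # time the OLA says the task must finish by
    completed: Optional[datetime] = None  # None while the task is still running


def rag_status(m: Milestone, now: datetime,
               warn_window: timedelta = timedelta(minutes=30)) -> str:
    """Return 'green', 'amber', or 'red' for a single milestone."""
    if m.completed is not None:
        return "green" if m.completed <= m.due else "red"
    if now > m.due:
        return "red"                      # still running and already late
    if m.due - now <= warn_window:
        return "amber"                    # still running, deadline is close
    return "green"


def roll_up(milestones: list[Milestone], now: datetime) -> dict[str, str]:
    """Worst status per retail function: red beats amber beats green."""
    severity = {"green": 0, "amber": 1, "red": 2}
    summary: dict[str, str] = {}
    for m in milestones:
        status = rag_status(m, now)
        current = summary.get(m.retail_function, "green")
        summary[m.retail_function] = max(current, status, key=severity.get)
    return summary
```

    A dashboard could call `roll_up` once per cycle (daily, weekly, or monthly) and color each retail function by the returned status. Taking the worst status per function is one possible design choice; a team might prefer to show every task individually.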

    Another example could be adding a layer of OLA/SLA management above a digital automation platform, such as workload automation, process automation, and so on. In summary, be sure to align OLAs with key retail functions, such as payment authorization and settlement, customer loyalty programs, depot operations, advance shipment notifications, human resource attendance management, checkout, back office, encryption and settlement, and more.

    Tap into Tribal Knowledge

    Before a team member submits a half-baked ticket or attempts an erroneous approach to remediation, establish a way to leverage the power of existing tribal knowledge. Employ a push versus a pull model. Attach fixes that have been learned over the years to alerts, so operations staff can take charge before having to engage a development expert. Push insights into workflows, so all team members can be better informed and more efficient.
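
    The push model can be sketched in a few lines of Python: before an alert reaches an operator, it is matched against a small knowledge base and any known fixes are attached to it. The `RUNBOOK` contents, the phrase-matching rule, and the host names are hypothetical; a real deployment would draw on the team’s own knowledge articles and alert format.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str          # e.g. a host or application name
    message: str
    known_fixes: list[str] = field(default_factory=list)


# Hypothetical knowledge base: a lowercase phrase seen in alert text,
# mapped to remediation steps the team has learned over the years.
RUNBOOK = {
    "queue depth": ["Restart the consumer service",
                    "Check the downstream payment gateway"],
    "disk full": ["Purge rotated logs older than 7 days"],
}


def enrich(alert: Alert, runbook: dict[str, list[str]] = RUNBOOK) -> Alert:
    """Push matching fixes onto the alert before it reaches an operator."""
    text = alert.message.lower()
    for phrase, fixes in runbook.items():
        if phrase in text:
            alert.known_fixes.extend(fixes)
    return alert
```

    With this in place, operations staff see the suggested fixes alongside the alert itself, rather than having to search for them or escalate to a development expert first.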

    Evolve Self-Healing

    Try to gamify IT service management. For example, for incidents that arise over the month, run a contest for support tribes to provide knowledge articles and tally points for each team based on submission accuracy and applicability. This aggregated knowledge can set the stage to embark on an automated self-healing journey. Your investment in a versatile automation platform is key to making this move successfully.

    Champion Your SREs

    Equip teams with intelligence, which can promote factual collaboration between development and operations. In the process, you can free up developers to innovate for the business, instead of firefighting a flood of incidents. Invest in modern application performance management solutions, which enable operations to identify the root cause of an issue, down to the exact line of underperforming code. Based on an intelligent application map, they can immediately identify the developer who owns that line of code and engage with them.

    Optimize Monitoring and Alerting Governance

    It’s important to establish effective business service views. Toward that end, look to refine in-house standards and templates for a wide range of elements and environments, including event viewers, system logs, infrastructure, logical partitions, public cloud, applications, middleware, networks, knowledge bases, and automation logs.

    Procedures: Gain Observability of Modern Environments

    Establish End-to-End Observability

    Initially, leverage as much data as possible coming from existing monitoring and automation tools. Employ a domain-agnostic AIOps solution to consolidate insights. In subsequent phases, run strategic rationalization for monitoring and automation tools. For example, in the monitoring arena, rationalize four pillars of monitoring around digital experience, application performance management, infrastructure management, and network operations. Within the automation area, invest in an automation platform that has breadth and depth, enabling automation across development and operations and all possible technology domains.

    Address Modern Networks

    Modern networks present a number of emerging challenges. So-called “microbursts” can temporarily overtax switch buffers and cause packet loss or backpressure. However, these events do not last long enough to be detected via SNMP or port statistics. To address this challenge, your teams need modern network operations capabilities. This requires high-resolution visibility based on data streamed directly from the switch silicon, and the ability to handle that data at big-data scale.
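
    A small worked example shows why interval-based polling can miss a microburst. The Python sketch below assumes a hypothetical 10 Gbps port carrying a steady 10% load, with a single 5-millisecond burst at line rate inside a 60-second polling interval. The link speed, load, and timings are illustrative numbers, not measurements.

```python
def average_utilization(samples_bps: list[float], link_bps: float) -> float:
    """Utilization seen by a poller that only knows the interval average."""
    return sum(samples_bps) / len(samples_bps) / link_bps


def peak_utilization(samples_bps: list[float], link_bps: float) -> float:
    """Utilization seen with per-sample (here, per-millisecond) visibility."""
    return max(samples_bps) / link_bps


# Hypothetical 10 Gbps port, one sample per millisecond over a 60-second interval.
LINK_BPS = 10e9
samples = [1e9] * 60_000              # steady 10% background load
samples[30_000:30_005] = [10e9] * 5   # a 5 ms burst at line rate

avg = average_utilization(samples, LINK_BPS)   # roughly 0.1001, looks healthy
peak = peak_utilization(samples, LINK_BPS)     # 1.0, the port was saturated
```

    Averaged over the interval, the port appears to run at about 10% utilization, even though it was fully saturated for 5 milliseconds. Only the high-resolution samples reveal the burst.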

    Another observability challenge comes from the rising adoption of software-defined networking (SDN). Modern network operations tools need to support the broad range of SDN and software-defined wide area networking (SD-WAN) standards and technologies.

    Support Modern Applications

    To address their agility demands, many organizations are adopting microservices and other modern application architectures. The challenge is that legacy monitoring approaches don’t suffice in this new paradigm. With an interconnected mesh of services, many of which are third party or open source, it becomes far more challenging to understand system performance based on external outputs. Technologies for distributed tracing, such as OpenTracing, OpenCensus, and OpenTelemetry, have continued to evolve. AIOps and monitoring platforms must support these technologies in some fashion.
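
    To make the idea of distributed tracing more concrete, here is a minimal Python sketch. It does not use the OpenTelemetry API; the `Span` class is a hypothetical stand-in that borrows only the general concept of spans linked by a trace ID, a span ID, and a parent span ID. Given the spans of one request, it estimates how much time each service spent in its own code, under the simplifying assumption that child spans of the same parent do not overlap.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None for the root span of the request
    service: str
    start_ms: int
    end_ms: int


def self_time_by_service(spans: list[Span]) -> dict[str, int]:
    """Time each service spent in its own code, excluding calls it made.

    Assumes all spans belong to one trace and that child spans of the
    same parent run one after another (no overlap).
    """
    child_time: dict[str, int] = defaultdict(int)
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] += s.end_ms - s.start_ms

    totals: dict[str, int] = defaultdict(int)
    for s in spans:
        totals[s.service] += (s.end_ms - s.start_ms) - child_time[s.span_id]
    return dict(totals)
```

    For a checkout request that passes through several services, a breakdown like this points to the service where the time actually went, which is hard to infer from external outputs alone.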

    Understand the Customer Experience

    In the physical brick-and-mortar paradigm, retailers want to position important goods at the consumer’s eye level. In a similar way, when a customer uses a retailer’s e-commerce web or mobile app, it is important to ensure the customer is seeing the relevant content at the right time. Most importantly, when a customer clicks the “buy” button, it better work. To track the user experience, get sophisticated digital experience monitoring capabilities that deliver analytical details for code, crashes, design, and more.

    Intelligence: Establish Intuitive Business Service Dashboards

    Teams need to know the “so what?” They need to understand why their work in addressing issues and optimizing services is so important. Establishing dashboards that depict, via a simple color-coded view, whether business services are running well, running poorly, or down is critical to providing this perspective.

    To establish this visibility, teams need to connect technology to business services. This can include using configuration management databases (CMDB) or dynamic service views. Following are some key pointers for establishing this visibility:

    • Build service analytics that depict health, risk, and other KPIs, so teams can understand the potential impacts of underperforming IT components and mitigate them beforehand.
    • Track services that span complex environments. Today, services rely on complex topologies composed of various sources, business functions, and technology layers, from application to infrastructure. Teams need to be able to connect OLA status of milestone-automated workloads as well. Be inclusive and creative to build dashboards to track them all.
    • Establish standards around alerting. An acknowledged critical alert is very different from one that’s unacknowledged. It is critical that teams agree on some working protocol to distinguish and promote alerts based on their impact on business service degradation.
    • Incorporate internet of things (IoT) data. A retailer may have thousands of trucks on the road moving goods between warehouses, depots, and stores. These vehicles are prone to various risks, whether due to weather conditions, driver fatigue, or any number of unexpected events. Leverage the power of IoT sensors, internet, GPS, big data, Kafka streaming, and workload automation to build risk profile dashboards. In this way, leaders can constantly track and minimize risk.
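
    One possible working protocol for the alerting pointer above is sketched below in Python. It connects each technology component to the business services it supports, then ranks alerts by severity, business impact, and whether anyone has acknowledged them. The `SERVICE_MAP`, the `SERVICE_WEIGHT` values, and the scoring rule are all assumptions made for illustration; a real mapping would typically come from a CMDB or a dynamic service view, and each team would agree on its own weights.

```python
from dataclasses import dataclass

# Hypothetical mapping from technology components to the business services
# they support, e.g. exported from a CMDB or a dynamic service view.
SERVICE_MAP = {
    "pos-gateway": ["checkout"],
    "loyalty-db": ["customer loyalty", "checkout"],
    "hr-portal": ["attendance management"],
}

# Hypothetical weights: how much each business service matters to the business.
SERVICE_WEIGHT = {"checkout": 10, "customer loyalty": 5, "attendance management": 1}


@dataclass
class Alert:
    component: str
    severity: str        # "critical", "major", or "minor"
    acknowledged: bool


def priority(alert: Alert) -> int:
    """Higher score means the alert should be handled sooner."""
    base = {"critical": 3, "major": 2, "minor": 1}[alert.severity]
    impact = sum(SERVICE_WEIGHT.get(service, 0)
                 for service in SERVICE_MAP.get(alert.component, []))
    score = base * impact
    # An unacknowledged alert has no owner yet, so promote it.
    return score if alert.acknowledged else score * 2


def triage(alerts: list[Alert]) -> list[Alert]:
    """Return alerts ordered from most to least urgent."""
    return sorted(alerts, key=priority, reverse=True)
```

    Under these example weights, an unacknowledged critical alert on a checkout component ranks above a critical alert on an internal HR system, which reflects the difference in business service impact.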

    Analytics: Every Ticket Tells a Story

    Keep the quantity and quality of tickets in check. If teams have ticket counts that are too high, establish continual service improvement (CSI) monitoring. Track performance and look to identify steps for improvement, without shaming any team members. Perhaps suboptimal thresholds are causing high volumes of alerts, which can lead to staff missing important alerts, dealing with ticket fatigue, and so on. Invest in an AI and machine learning-enabled alarm analytics solution, which can cluster alerts, identify patterns, and reduce alarm noise.
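
    The noise-reduction idea can be illustrated with a deliberately simple, rule-based Python sketch. This is not machine learning; it is a stand-in that groups alerts from the same host and check when they arrive close together, so a repeating alarm produces one group instead of many tickets. The `Alert` fields and the 300-second window are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: int   # seconds since an arbitrary starting point
    host: str
    check: str       # e.g. "cpu", "disk", "latency"


def group_alerts(alerts: list[Alert], window_s: int = 300) -> list[list[Alert]]:
    """Group alerts that share host and check and arrive within window_s
    seconds of the previous alert in the same group."""
    groups: list[list[Alert]] = []
    latest: dict[tuple[str, str], list[Alert]] = {}   # open group per (host, check)
    for a in sorted(alerts, key=lambda x: x.timestamp):
        key = (a.host, a.check)
        group = latest.get(key)
        if group is not None and a.timestamp - group[-1].timestamp <= window_s:
            group.append(a)          # same incident, still ongoing
        else:
            group = [a]              # first alert of a new incident
            latest[key] = group
            groups.append(group)
    return groups
```

    Comparing the number of raw alerts with the number of groups gives a rough measure of how much noise a poorly tuned threshold is generating, which can feed the CSI review described above.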


    For retailers, challenges and opportunities continue to arise with increasing speed. Those that can successfully pursue retail digital transformation can establish the agility that fosters business success. Through harnessing advanced AIOps and automation capabilities to promote enhanced observability, operations teams can be better equipped to support this transformation, while delivering the optimized service levels required.