1_Cheng23_AIOps_Survey (2)

pdf

School

Concordia University *

*We aren’t endorsed by this school

Course

691

Subject

Information Systems

Date

Oct 30, 2023

Type

pdf

Pages

34

Uploaded by BaronSandpiperMaster927

Report
1 AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges Qian Cheng *† , Doyen Sahoo * , Amrita Saha, Wenzhuo Yang, Chenghao Liu, Gerald Woo, Manpreet Singh, Silvio Saverese, and Steven C. H. Hoi Salesforce AI Abstract —Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There are a wide variety of problems to address, and multiple use-cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively under explored topics, especially those that could significantly benefit from advances in AI literature. We also provide insights into the trends in this field, and what are the key investment opportunities. Index Terms —AIOps, Artificial Intelligence, IT Operations, Machine Learning, Anomaly Detection, Root-cause Analysis, Failure Prediction, Resource Management I. I NTRODUCTION Modern software has been evolving rapidly during the era of digital transformation. New infrastructure, techniques and design patterns - such as cloud computing, Software-as-a- Service (SaaS), microservices, DevOps, etc. have been devel- oped to boost software development. Managing and operating the infrastructure of such modern software is now facing new challenges. For example, when traditional software transits to SaaS, instead of handing over the installation package to the user, the software company now needs to provide 24/7 software access to all the subscription based users. Besides developing and testing, service management and operations are now the new set of duties of SaaS companies. Meanwhile, traditional software development separates functionalities of the entire software lifecycle. Coding, testing, deployment and operations are usually owned by different groups. Each of these groups requires different sets of skills. However, agile development and DevOps start to obfuscate the boundaries between each process and DevOps engineers are required to take E2E responsibilities. Balancing development and opera- tions for a DevOps team become critical to the whole team’s productivity. * Equal Contribution Work done when author was with Salesforce AI Software services need to guarantee service level agree- ments (SLAs) to the customers, and often set internal Service Level Objectives (SLOs). Meeting SLAs and SLOs is one of the top priority for CIOs to choose the right service providers[1]. Unexpected service downtime can impact avail- ability goals and cause significant financial and trust issues. For example, AWS experienced a major service outage in December 2021, causing multiple first and third party websites and heavily used services to experience downtime [2]. 
IT Operations plays a key role in the success of modern software companies and as a result multiple concepts have been introduced, such as IT service management (ITSM) specifically for SaaS, and IT operations management (ITOM) for general IT infrastructure. These concepts focus on different aspects IT operations but the underlying workflow is very similar. Life cycle of Software systems can be separated into several main stages, including planning, development/coding, building, testing, deployment, maintenance/operations, moni- toring, etc. [3]. The operation part of DevOps can be further broken down into four major stages: observe, detect, engage and act, shown in Figure 1. Observing stage includes tasks like collecting different telemetry data (metrics, logs, traces, etc.), indexing and querying and visualizing the collected telemetries. Time-to-observe (TTO) is a metric to measure the performance of the observing stage. Detection stage includes tasks like detecting incidents, predicting failures, finding cor- related events, etc. whose performance is typically measured as the Time-to-detect (TTD) (in addition to precision/recall). Engaging stage includes tasks like issue triaging, localiza- tion, root-cause analysis, etc., and the performance is often measured by Time-to-triage (TTT). Acting stage includes immediate remediation actions such as reboot the server, scale-up / scale-out resources, rollback to previous versions, etc. Time-to-resolve (TTR) is the key metric measured for the acting stage. Unlike software development and release, where we have comparatively mature continuous integration and continuous delivery (CI/CD) pipelines, many of the post- release operations are often done manually. Such manual operational processes face several challenges: Manual operations struggle to scale. The capacity of manual operations is limited by the size of the DevOps team and the team size can only increase linearly. When the software usage is at growing stage, the throughput and workloads may grow exponentially, both in scale and complexity. It is difficult for DevOps team to grow at the arXiv:2304.04661v1 [cs.LG] 10 Apr 2023
2 Fig. 1. Common DevOps life cycles[3] and ops breakdown. Ops can comprise four stages: observe, detect, engage and act. Each of the stages has a corresponding measure: time-to-observe, time-to-detect, time-to-triage and time-to-resolve. same pace to handle the increasing amount of operational workload. Manual operations is hard to standardize. It is very hard to keep the same high standard across the entire DevOps team given the diversity of team members (e.g. skill level, familiarity with the service, tenure, etc.). It takes significant amount of time and effort to grow an operational domain expert who can effectively handle incidents. Unexpected attrition of these experts could significantly hurt the operational efficiency of a DevOps team. Manual operations are error-prone. It is very common that human operation error causes major incidents. Even for the most reliable cloud service providers, major incidents have been caused by human error in recent years. Given these challenges, fully-automated operations pipelines powered by AI capabilities becomes a promising approach to achieve the SLA and SLO goals. AIOps, an acronym of AI for IT Operations, was coined by Gartner at 2016. According to Gartner Glossary, ”AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination”[4]. In order to achieve fully- automated IT Operations, investment in AIOps technolgies is imperative. AIOps is the key to achieve high availability, scalability and operational efficiency . For example, AIOps can use AI models can automatically analyze large volumes of telemetry data to detect and diagnose incidents much faster, and much more consistently than humans, which can help achieve ambitious targets such as 99.99 availability. AIOps can dynamically scale its capabilities with growth demands and use AI for automated incident and resource management, thereby reducing the burden of hiring and training domain experts to meet growth requirements. Moreover, automation through AIOps helps save valuable developer time, and avoid fatigue. AIOps, as an emerging AI technology, appeared on the trending chart of Gartner Hyper Cycle for Artificial Intelligence in 2017 [5], along with other popular topics such as deep reinforcement learning, nature-language generation and artificial general intelligence. As of 2022, enterprise AIOps solutions have witnessed increased adoption by many companies’ IT infrastructure. The AIOps market size is predicted to be $11.02B by end of 2023 with cumulative annual growth rate (CAGR) of 34%. AIOps comprises a set of complex problems. Transforming from manual to automated operations using AIOps is not a one-step effort. Based on the adoption level of AI techniques, we break down AIOps maturity into four different levels based on the adoption of AIOps capabilities as shown in Figure 2. Fig. 2. AIOps Transformation. Different maturity levels based on adoption of AI techniques: Manual Ops, human-centric AIOps, machine-centric AIOps, fully-automated AIOps. Manual Ops. At this maturity level, DevOps follows tra- ditional best practices and all processes are setup manually. There is no AI or ML models. This is the baseline to compare with in AIOps transformation. Human-centric . At this level, operations are done mainly in manual process and AI techniques are adopted to replace sub- procedures in the workflow, and mainly act as assistants. 
For example, instead of glass watching for incident alerts, DevOps or SREs can set dynamic alerting threshold based on anomaly detection models. Similarly, the root cause analysis process requires watching multiple dashboards to draw insights, and AI can help automatically obtain those insights. Machine-centric . At this level, all major components (mon- itoring, detecting, engaging and acting) of the E2E operation process are empowered by more complex AI techniques.
3 Humans are mostly hands-free but need to participate in the human-in-the-loop process to help fine-tune and improve the AI systems performance. For example, DevOps / SREs operate and manage the AI platform to guarantee training and inference pipelines functioning well, and domain experts need to provide feedback or labels for AI-made decisions to improve performance. Fully-automated . At this level, AIOps platform achieves full automation with minimum or zero human intervention. With the help of fully-automated AIOps platforms, the current CI/CD (continuous integration and continuous deployment) pipelines can be further extended to CI/CD/CM/CC (continu- ous integration, continuous deployment, continuous monitor- ing and continuous correction) pipelines. Different software systems, and companies may be at dif- ferent levels of AIOps maturity, and their priorities and goals may differ with regard to specific AIOps capabilities to be adopted. Setting up the right goals is important for the success of AIOps applications. We foresee the trend of shifting from manual operation all the way to fully-automated AIOps in the future, with more and more complex AI techniques being used to address challenging problems. In order to enable the community to adopt AIOps capabilities faster, in this paper, we present a comprehensive survey on the various AIOps problems and tasks and the solutions developed by the community to address them. II. C ONTRIBUTION OF T HIS S URVEY Increasing number of research studies and industrial prod- ucts in the AIOps domain have recently emerged to address a variety of problems. Sabharwal et al. published a book ”Hands- on AIOps” to discuss practical AIOps and implementation [6]. Several AIOps literature reviews are also accessible [7] [8] to help audiences better understand this domain. However, there are very limited efforts to provide a holistic view to deeply connect AIOps with latest AI techniques. Most of the AI related literature reviews are still topic-based, such as deep learning anomaly detection [9] [10], failure management, root-cause analysis [11], etc. There is still limited effort to provide a holistic view about AIOps, covering the status in both academia and industry. We prepare this survey to address this gap, and focus more on AI techniques used in AIOps. Except for the monitoring stage, where most of the tasks focus on telemetry data collection and management, AIOps covers the other three stages where the tasks focus more on analytics. In our survey, we group AIOps tasks based on which operational stage they can contribute to, shown in Figure 3. Incident Detection. Incident detection tasks contribute to detection stage. The goal of these tasks are reducing mean- time-to-detect (MTTD). In our survey we cover time series incident detection (Section IV-A), log incident detection (Sec- tion IV-B), trace and multimodal incident detection (Section IV-C). Failure Prediction. Failure prediction also contributes to detection stage. The goal of failure prediction is to predict the potential issue before it actually happens so actions can be taken in advance to minimize impact. Failure prediction also contributes to reducing mean-time-to-detect (MTTD). In our survey we cover metric failure prediction (Section V-A) and log failure prediction (Section V-B). There are very limited efforts in literature that perform traces and multimodal failure prediction. Root-cause Analysis. 
Root-cause analysis tasks contributes to multiple operational stages, including triaging, acting and even support more efficient long-term issue fixing and reso- lution. Helping as an immediate response to an incident, the goal is to minimize time to triage (MTTT), and simultaneously contribute to reduction on reducing Mean Time to Resolve (MTTR). An added benefit is also reduction in human toil. We further breakdown root-cause analysis into time-series RCA (Section VI-B), logs RCA (Section VI-B) and traces and multimodal RCA (Section VI-C). Automated Actions. Automated actions contribute to acting stage, where the main goal is to reduce mean-time-to-resolve (MTTR), as well as long-term issue fix and resolution. In this survey we discuss about a series of methods for auto- remediation (Section VII-A), auto-scaling (Section VII-B) and resource management (Section VII-C). III. D ATA FOR AIO PS Before we dive into the problem settings, it is important to understand the data available to perform AIOps tasks. Modern software systems generate tremendously large volumes of observability metrics. The data volume keeps growing expo- nentially with digital transformation [12]. The increase in the volume of data stored in large unstructured Data lake systems makes it very difficult for DevOps teams to consume the new information and fix consumers’ problems efficiently [13]. Successful products and platforms are now built to address the monitoring and logging problems. Observability platforms, e.g. Splunk, AWS Cloudwatch, are now supporting emitting, storing and querying large scale telemetry data. Similar to other AI domains, observability data is critical to AIOps. Unfortunately there are limited public datasets in this domain and many successful AIOps research efforts are done with self-owned production data, which usually are not available publicly. In this section, we describe major telemetry data type including metrics, logs, traces and other records, and present a collection of public datasets for each data type. A. Metrics Metrics are numerical data measured over time which provide a snapshot of the system behavior. Metrics can rep- resent a broad range of information, broadly classified into compute metrics and service metrics. Compute metrics (e.g. CPU utilization, memory usage, disk I/O) are an indicator of the health status of compute nodes (servers, virtual machines, pods). They are collected at the system level using tools such as Slurm [14] for usage statistics from jobs and nodes, and the Lustre parallel distributed file system for I/O information. Service metrics (e.g. request count, page visits, number of errors) measure the quality and level of service of customer facing applications. Aggregate statistics of such numerical data
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
4 Fig. 3. AIOps Tasks. In this survey we discuss a series of AIOps tasks, categorized by which operational stages these tasks contribute to, and the observability data type it takes. also fall under the category of metrics, providing a more coarse-grained view of system behavior. Metrics are constantly generated by all components of the cloud platform life cycle, making it one of the most ubiquitous forms of AIOps data. Cloud platforms and supercomputer clusters can generate petabytes of metrics data, making it a challenge to store and analyze, but at the same time, brings immense observability to the health of the entire IT operation. Being numerical time-series data, metrics are simple to interpret and easy to analyze, allowing for simple threshold- based rules to be acted upon. At the same time, they contain sufficiently rich information to be used to power more complex AI based alerting and actions. The major challenge in leveraging insights from metrics data arises due to their diverse nature. Metrics data can exhibit a variety of patterns, such as cyclical patterns (repeating patterns hourly, daily, weekly, etc.), sparse and intermittent spikes, and noisy signals. The characteristics of the metrics ultimately depend on the underlying service or job. In Table I, we briefly describe the datasets and benchmarks of metrics data. Metrics data have been used in studies char- acterizing the workloads of cloud data centers, as well as the various AIOps tasks of incident detection, root cause analysis, failure prediction, and various planning and optimization tasks like auto-scaling and VM pre-provisioning. B. Logs Software logs are specifically designed by the software developers in order to record any type of runtime information about processes executing within a system - thus making them an ubiquitous part of any modern system or software maintenance. Once the system is live and throughout its life- cycle, it continuously emits huge volumes of such logging data which naturally contain a lot of rich dynamic runtime information relevant to IT Operations and Incident Man- agement of the system. Consequently in AI driven IT-Ops pipelines, automated log based analysis plays an important role in Incident Management - specifically in tasks like Incident Detection and Causation and Failure Prediction, as have been Fig. 4. GPU utilization metrics from the MIT Supercloud Dataset exhibiting various patterns (cyclical, sparse and intermittant, noisy). studied by multiple literature surveys in the past [15], [16], [17], [18], [19], [20], [21], [22], [23]. In most of the practical cases, especially in industrial settings, the volume of the logs can go upto an order of petabytes of loglines per week. Also because of the nature of log content, log data dumps are much more heavier in size in comparison to time series telemetry data. This requires special handling of logs observability data in form of data streams, - where today, there are various services like Splunk, Datadog, LogStash, NewRelic, Loggly, Logz.io etc employed to efficiently store and access the log stream and also visualize, analyze and query past log data using specialized structured query language. Nature of Log Data. Typically these logs consist of semi- structured data i.e. a combination of structured and unstruc- tured data. Amongst the typical types of unstructured data there can be natural language tokens, programming language constructs (e.g. 
method names) and the structured part can consist of quantitative or categorical telemetry or observability metrics data, which are printed in runtime by various logging statements embedded in the source-code or sometimes gener- ated automatically via loggers or logging agents. Depending on the kind of service the logs are dumped from, there can be
5 a diverse types of logging data with heterogeneous form and content. For example, logs can be originating from distributed systems (e.g. hadoop or spark), operating systems (windows or linux) or in complex supercomputer systems or can be dumped at hardware level (e.g. switch logs) or middle-ware level (like servers e.g. Apache logs) or by specific applications (e.g. Health App). Typically each logline comprises of a fixed part which is the template that had been designed by the developer and some variable part or parameters which capture some runtime information about the system. Complexities of Log Data. Thus, apart from being one of the most generic and hence crucial data-sources in IT Ops, logs are one of the most complex forms of observability data due to their open-ended form and level of granularity at which they contain system runtime information. In cloud computing context, logs are the source of truth for cloud users to the underlying servers that running their applications since cloud providers don’t grant full access to their users of the servers and platforms. Also, being designed by developers, logs are immediately affected by any changes in the source- code or logging statements by developers. This results in non-stationarity in the logging vocabulary or even the entire structure or template underlying the logs. Log Observability Tasks. Log observability typically in- volves different tasks like anomaly detection over logs during incident detection (Section IV-B), root cause analysis over logs (Section VI-B) and log based failure prediction (Section V-B). Datasets and Benchmarks. Out of the different log ob- servability tasks, log based anomaly detection is one of the most objective tasks and hence most of the publicly released benchmark datasets have been designed around anomaly de- tection. In Table B, we give a comprehensive description about the different public benchmark datasets that have been used in the literature for anomaly detection tasks. Out of these, datasets Switch and subsets of HPC and BGL have also been redesigned to serve failure prediction task. On the other hand there are no public benchmarks on log based RCA tasks, which has been typically evaluated on private enterprise data. C. Traces Trace data are usually presented as semi-structured logs, with identifiers to reconstruct the topological maps of the applications and network flows of target requests. For example, when user uses Google search, a typical trace graph of this user request looks like in Figure 6. Traces are composed system events (spans) that tracks the entire progress of a request or execution. A span is a sequence of semi-structured event logs. Tracing data makes it possible to put different data modality into the same context. Requests travel through multiple services / applications and each application may have totally different behavior. Trace records usually contains two required parts: timestamps and span id. By using the timestamps and span id, we can easily reconstruct the trace graph from trace logs. Fig. 5. An example of Log Data generated in IT Operations Fig. 6. An snapshot of trace graph of user requests when using Google Search. Trace analysis requires reliable tracing systems. Trace col- lection systems such as ReTrace [24] can help achieve fast and inexpensive trace collections. Trace collectors are usually code agnostic and can emit different levels of performance trace data back to the trace stores in near real-time. 
Early summarization is also involved in the trace collection process to help generate fine-grained events [25]. Although trace collection is common for system observ- ability, it is still challenging to acquire high quality trace data to train AI models. As far as we know, there are very few public trace datasets with high quality labels. Also the only few existing public trace datasets like [26] are not widely adopted in AIOps research. Instead, most AIOps related trace analysis research use self-owned production or simulation trace data,
6 which are generally not available publicly. D. Other Data Besides the machine generated observability data like met- rics, logs, traces, etc., there are other types of operational data that could be used in AIOps. Human activity records is part of these valuable data. Ticketing systems are used for DevOps/SREs to communicate and efficiently resolve the issues. This process generates large amount of human activity records. The human activity data contains rich knowledge and learnings about solutions to existing issues, which can be used to resolve similar issues in the future. User feedback data is also very important to improve AIOps system performance. Unlike the issue tickets where human needs to put lots of context information to describe and discuss the issue, user feedback can be as simple as one click to confirm if the alert is good or bad. Collecting real-time user feedback of a running system and designing human-in-the- loop workflows are also very significant for success of AIOps solutions. Although many companies collects these types of data and use them to improve their operation workflows, there are still very limited published research discussing how to systematically incorporate these other types of operational data in AIOps solutions. This brings challenges as well as opportunities to make further improvements in AIOps domain. Next, we discuss the key AIOps Tasks - Incident Detec- tion, Failure Prediction, Root Cause Analysis, and Automated Actions, and systematically review the key contributions in literature in these areas. IV. I NCIDENT D ETECTION Incident detection employs a variety of anomaly detec- tion techniques. Anomaly detection is to detect abnormali- ties, outliers or generally events that not normal. In AIOps context, anomaly detection is widely adopted in detecting any types of abnormal system behaviors. To detect such anomalies, the detectors need to utilize different telemetry data, such as metrics, logs, traces. Thus, anomaly detection can be further broken down to handling one or more specific telemetry data sources, including metric anomaly detection, log anomaly detection, trace anomaly detection. Moreover, multi-modal anomaly detection techniques can be employed if multiple telemetry data sources are involved in the detec- tion process. In recent years, deep learning based anomaly detection techniques [9] are also widely discussed and can be utilized for anomaly detection in AIOps. Another way to distinguish anomaly detection techniques is depending on different application use cases, such as detecting service health issues, detecting networking issues, detecting security issues, fraud transactions, etc. Usually these variety of techniques are derived from same set of base detection algorithms and localized to handle specific tasks. From technical perspective, detecting anomalies from different telemetry data sources are better aligned with the AI technology definitions, such as, metric are usually time-series, logs are text / natural language, traces are event sequences/graphs, etc. In this article, we discuss anomaly detection by different telemetry data sources. A. Metrics based Incident Detection Problem Definition To ensure the reliability of services, billions of metrics are constantly monitored and collected at equal-space timestamp [27]. Therefore, it is straightforward to organize metrics as time series data for subsequent analysis. 
Metric based incident detection, which aims to find the anomalous behaviors of monitored metrics that significantly deviate from the other observations, is vital for operators to timely detect software failures and trigger failure diagnosis to mitigate loss. The most basic form of incident detection on metrics is the rule-based method which sets up an alert when a metric breaches a certain threshold. Such an approach is only able to capture incidents which are defined by the metric exceeding the threshold, and is unable to detect more complex incidents. The rule- based method to detect incidents on metrics are generally too naive, and only able to account for the most simple of incidents. They are also sensitive to the threshold, producing too many false positives when the threshold is too low, and false negatives when the threshold is too high. Due to the open- ended nature of incidents, increasingly complex architectures of systems, and increasing size of these systems and number of metrics, manual monitoring and rule-based methods are no longer sufficient. Thus, more advanced metric-based incident detection methods that leveraging AI capability is urgent. As metrics are a form of time series data, and incidents are expressed as an abnormal occurrence in the data, metric incident detection is most often formulated as a time series anomaly detection problem [28], [29], [30]. In the following, we focus on the AIOps setting and categorize it based on several key criteria: (i) learning paradigm, (ii) dimensionality, (iii) system, and (iv) streaming updates. We further summa- rize a list of time series anomaly detection methods with a comparison over these criteria in Table IV. Learning Setting a) Label Accessibility: One natural way to formulate the anomaly detection problem, is as the supervised binary classification problem, to classify whether a given obser- vation is an anomaly or not [31], [32]. Formulating it as such has the benefit of being able to apply any supervised learning method, which has been intensely studied in the past decades [33]. However, due to the difficulty in obtaining labelled data for metrics incident detection [34] and labels of anomalies are prone to error [35], unsupervised approaches, which do not require labels to build anomaly detectors, are generally preferred and more widespread. Particularly, unsu- pervised anomaly detection methods can be roughly catego- rized into density-based methods, clustering-based methods, and reconstruction-based methods [28], [29], [30]. Density- based methods compute local density and local connectivity for outlier decision. Clustering-based methods formulate the anomaly score as the distance to cluster center. Reconstruction- based methods explicitly model the generative process of the
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
7 data and measure the anomaly score with the reconstruction error. While methods in metric anomaly detection are generally unsupervised, there are cases where there is some access to labels. In such situations, semi-supervised, domain adaptation, and active learning paradigms come into play. The semi- supervised paradigm [36], [37], [38] enables unsupervised models to leverage information from sparsely available posi- tive labels [39]. Domain adaptation [40] relies on a labelled source dataset, while the target dataset is unlabeled, with the goal of transferring a model trained on the source dataset, to perform anomaly detection on the target. b) Streaming Update: Since metrics are collected in large volume every minute, the model is used online to detect anomalies. It is very common that temporal patterns of metrics change overtime. The ability to perform timely model updates when receiving new incoming data is an important criteria. On the one hand, conventional models can handle the data stream via retraining the whole model periodically [31], [41], [32], [38]. However, this strategy could be computationally expensive, and bring extra non-trivial questions, such as, how often should this retraining be performed. On the other hand, some methods [42], [43] have efficient updating mechanisms inbuilt, and are naturally able to adapt to these new incoming data streams. It can also support active learning paradigm [41], which allows models to interactively query users for labels on data points for which it is uncertain about, and subsequently update the model with the new labels. c) Dimensionality: Each metric of monitoring data forms a univariate time series, and thus a service usually contains multiple metrics, each of which describes a different part or attribute of a complex entity, constituting a multivariate time series. The conventional solution is to build univariate time series anomaly detection for each metric. However, for a complex system, it ignores the intrinsic interactions among each metric and cannot well represent the system’s overall status. Naively combining the anomaly detection results of each univariate time series performs poorly for multivariate anomaly detection method [44], since it cannot model the inter-dependencies among metrics for a service. Model A wide range of machine learning models can be used for time series anomaly detection, broadly classified as deep learning models, tree-based models, and statistical models. Deep learning models [45], [36], [46], [47], [38], [48], [49], [50] leverage the success and power deep neural networks to learn representations of the time series data. These represen- tations of time series data contain rich semantic information of the underlying metric, and can be used as a reconstruction- based, unsupervised method. Tree-based methods leverage a tree structure as a density-based, unsupervised method [42]. Statistical models [51] rely on classical statistical tests, which are considered a reconstruction-based method. Industrial Practices Building a system which can handle the large amounts of metric data generated in real cloud IT operations is often an issue. This is because the metric data in real-world scenarios is quite diverse and the definition of anomaly may vary in different scenarios. Moreover, almost all time series anomaly detection systems require to handle a large amount of metrics in parallel with low-latency [32]. Thus, works which propose a system to handle the infrastructure are highlighted here. 
EGADS [41] is a system by Yahoo!, scaling up to millions of data points per second, and focuses on optimizing real-time processing. It comprises a batch time series modelling module, an online anomaly detection module, and an alerting module. It leverages a variety of unsupervised methods for anomaly detection, and an optional active learning component for filtering alerts. [52] is a system by Microsoft, which includes three major components, a data ingestion, experimentation, and online compute platform. They propose an efficient deep learning anomaly detector to achieve high accuracy and high efficiency at the same time. [32] is a system by Alibaba group, comprising data ingestion, offline training, online service, and visualization and alarms modules. They propose a robust anomaly detector by using time series decomposition, and thus can easily handle time series with different characteristics, such as different seasonal length, different types of trends, etc. [38] is a system by Tencent, comprising of a offline model training component and online serving component, which employs active learning to update the online model via a small number of uncertain samples. Challenges Lack of labels The main challenge of metric anomaly detection is the lack of ground truth anomaly labels [53], [44]. Due to the open-ended nature and complexity of incidents in server architectures, it is difficult to define what an anomaly is. Thus, building labelled datasets is an extremely labor and resource intensive exercise, one which requires the effort of domain experts to identify anomalies from time series data. Furthermore, manual labelling could lead to labelling errors as there is no unified and formal definition of an anomaly, leading to subjective judgements on ground truth labels [35]. Real-time inference A typical cloud infrastructure could collect millions of data points in a second, requiring near real- time inference to detect anomalies. Metric anomaly detection systems need to be scalable and efficient [54], [53], optionally supporting model retraining, leading to immense compute, memory, and I/O loads. The increasing complexity of anomaly detection models with the rising popularity of deep learning methods [55] add a further strain on these systems due to the additional computational cost these larger models bring about. Non-stationarity of metric streams The temporal patterns of metric data streams typically change over time as they are generated from non-stationary environments [56]. The evo- lution of these patterns is often caused by exogenous factors which are not observable. One such example is that the growth in the popularity of a service would cause customer metrics (e.g. request count) to drift upwards over time. Ignoring these factors would cause a deterioration in the anomaly detector’s performance. One solution is to continuously update the model with the recent data [57], but this strategy requires carefully balancing of the cost and model robustness with respect to the updating frequency. Public benchmarks While there exists benchmarks for general anomaly detection methods and time series anomaly detection methods [33], [58], there is still a lack of benchmark- ing for metric incident detection in AIOps domain. Given the
8 wide and diverse nature of time series data, they often exhibit a mixture of different types of anomaly depends on specific domain, making it challenging to understand the pros and cons of algorithms [58]. Furthermore, existing datasets have been criticised to be trivial and mislabelled [59]. Future Trends Active learning/human-in-the-loop To address the prob- lem of lacking of labels, a more intelligent way is to integrate human knowledge and experience with minimum cost. As special agents, humans have rich prior knowledge [60]. If the incident detection framework can encourage the machine learning model to engage with learning operation expert wis- dom and knowledge, it would help deal with scarce and noise label issue. The use of active learning to update online model in [38] is a typical example to incorporate human effort in the annotation task. There are certainly large research scope for incorporating human effort in other data processing step, like feature extraction. Moreover, the human effort can also be integrated in the machine learning model training and inference phase. Streaming updates Due to the non-stationarity of metric streams, keeping the anomaly detector updated is of utmost importance. Alongside the increasingly complex models and need for cost-effectiveness, we will see a move towards methods with the built-in capability of efficient streaming updates. With the great success of deep learning methods in time series anomaly detection tasks [30]. Online deep learning is an increasingly popular topic [61], and we may start to see a transference of techniques into metric anomaly detection for time-series in the near future. Intrinsic anomaly detection Current research works on time series anomaly detection do not distinguish the cause or the type of anomaly, which is critical for the subsequent mitigation steps in AIOps. For example, even anomaly are suc- cessfully detected, which is caused by extrinsic environment, the operator is unable to mitigate its negative effect. Intro- duced in [50], [48], intrinsic anomaly detection considers the functional dependency structure between the monitored metric, and the environment. This setting considers changes in the environment, possibly leveraging information that may not be available in the regular (extrinsic) setting. For example, when scaling up/down the resources serving an application (perhaps due to autoscaling rules), we will observe a drop/increase in CPU metric. While this may be considered as an anomaly in the extrinsic setting, it is in fact not an incident and accordingly, is not an anomaly in the intrinsic setting. B. Logs based Incident Detection Problem Definition Software and system logging data is one of the most popular ways of recording and tracking runtime information about all ongoing processes within a system, to any arbitrary level of granularity. Overall, a large distributed system can have massive volume of heterogenous logs dumped by its different services or microservices, each having time-stamped text messages following their own unstructured or semi- structured or structured format. Throughout various kinds of IT Operations these logs have been widely used by relia- bility and performance engineers as well as core developers in order to understand the system’s internal status and to facilitate monitoring, administering, and troubleshooting [15], [16], [17], [18], [19], [20], [21], [22], [62]. 
More, specifically, in the AIOps pipeline, one of the foremost tasks that log analysis can cater to is log based Incident Detection. This is typically achieved through anomaly detection over logs which aims to detect the anomalous loglines or sequences of loglines that indicate possible occurrence of an incident, from the humungous amounts of software logging data dumps generated by the system. Log based anomaly detection is generally applied once an incident has been detected based on monitoring of KPI metrics, as a more fine-grained incident detection or failure diagnosis step in order to detect which service or micro-service or which software module of the system execution is behaving anomalously. Task Complexity Diversity of Log Anomaly Patterns : There are very diverse kinds of incidents in AIOps which can result in different kinds of anomaly patterns in the log data - either manifesting in the log template (i.e. the constant part of the log line) or the log parameters (i.e. the variable part of the log line containing dynamic information). These are i) keywords - appearance of keywords in log lines bearing domain-specific semantics of failure or incident or abnormality in the system (e.g. out of memory or crash) ii) template count - where a sudden increase or decrease of log templates or log event types is indicative of anomaly iii) template sequence - where some significant deviation from the normal order of task execution is indicative of anomaly iv) variable value - some variables associated with some log templates or events can have physical meaning (e.g. time cost) which could be extracted out and aggregated into a structured time series on which standard anomaly detection techniques can be applied. v) variable distribution - for some categorical or numerical variables, a deviation from the standard distribution of the variable can be indicative of an anomaly vi) time interval - some performance issues may not be explicitly observed in the logline themselves but in the time interval between specific log events. Need for AI : Given the humongous nature of the logs, it is often infeasible for even domain experts to manually go through the logs to detect the anomalous loglines. Addi- tionally, as described above, depending on the nature of the incident there can be diverse types of anomaly patterns in the logs, which can manifest as anomalous key words (like ”errors” or ”exception”) in the log templates or the volume of specific event logs or distribution over log variables or the time interval between two log specific event logs. However, even for a domain expert it is not possible to come up with rules to detect these anomalous patterns, and even when they can, they would likely not be robust to diverse incident types and changing nature of log lines as the software functionalities change over time. Hence, this makes a compelling case for
9 employing data-driven models and machine intelligence to mine and analyze this complex data-source to serve the end goals of incident detection. Log Analysis Workflow for Incident Detection In order to handle the complex nature of the data, typically a series of steps need to be followed to meaningfully analyze logs to detect incidents. Starting with the raw log data or data streams, the log analysis workflow first does some preprocess- ing of the logs to make them amenable to ML models. This is typically followed by log parsing which extracts a loose structure from the semi-structured data and then grouping and partitioning the log lines into log sequences in order to model the sequence characteristics of the data. After this, the logs or log sequences are represented as a machine-readable matrix on which various log analysis tasks can be performed - like clustering and summarizing the huge log dumps into a few key log patterns for easy visualization or for detecting anomalous log patterns that can be indicative of an incident. Figure 7 provides an outline of the different steps in the log analysis wokflow. While some of these steps are more of engineering challenges, others are more AI-driven and some even employ a combination of machine learning and domain knowledge rules. i) Log Preprocessing: This step typically involves cus- tomised filtering of specific regular expression patterns (like IP addresses or memory locations) that are deemed irrelevant for the actual log analysis. Other preprocessing steps like tokenization requires specialized handling of different wording styles and patterns arising due to the hybrid nature of logs consisting of both natural language and programming language constructs. For example a log line can contain a mix of text strings from source-code data having snake-case and camelCase tokens along with white-spaced tokens in natural language. ii) Log Parsing: To enable downstream processing, unstruc- tured log messages first need to be parsed into a structured event template (i.e. constant part that was actually designed by the developers) and parameters (i.e. variable part which contain the dynamic runtime information). Figure 8 provides one such example of parsing a single log line. In literature there have been heuristic methods for parsing as well as AI- driven methods which include traditional ML and also more recent neural models. The heuristic methods like Drain [63], IPLoM [64] and AEL [65] exploit known inductive bias on log structure while Spell [66] uses Longest common subsequence algorithm to dynamically extract log patterms. Out of these, Drain and Spell are most popular, as they scale well to industrial standards. Amongst the traditional ML methods, there are i) Clustering based methods like LogCluster [67], LKE [68], LogSig [69], SHISO [70], LenMa [71], LogMine [72] which assume that log message types coincide in similar groups ii) Frequent pattern mining and item-set mining meth- ods SLCT [73], LFA [74] to extract common message types iii) Evolutionary optimization approaches like MoLFI [75]. On the other hand, recent neural methods include [76] - Neural Transformer based models which use self-supervised Masked Language Modeling to learn log parsing vii) UniParser [77] - an unified parser for heterogenous log data with a learnable similarity module to generalize to diverse logs across different systems. 
There are yet another class of log analysis methods [78], [79] which aim at parsing free techniques, in order to avoid the computational overhead of parsing and the errors cascading from erroneous parses, especially due to the lack of robustness of the parsing methods. iii) Log Partitioning: After parsing the next step is to partition the log data into groups, based on some semantics where each group represents a finite chunk of log lines or log sequences. The main purpose behind this is to decompose the original log dump typically consisting of millions of log lines into logical chunks, so as to enable explicit modeling on these chunks and allow the models to capture anomaly patterns over sequences of log templates or log parameter values or both. Log partitioning can be of different kinds [20], [80] - Fixed or Sliding window based partitions, where the length of window is determined by length of log sequence or a period of time, and Identifier based partitions where logs are partitioned based on some identifier (e.g. the session or process they originate from). Figure 9 illustrates these different choices of log grouping and partitioning. A log event is eventually deemed to be anomalous or not, either at the level of a log line or a log partition. iv) Log Representation: After log partitioning, the next step is to represent each partition in a machine-readable way (e.g. a vector or a matrix) by extracting features from them. This can be done in various ways [81], [80]- either by extracting specific handcrafted features using domain knowledge or through ii) sequential representation which converts each partition to an ordered sequence of log event ids ii) quantitative represen- tation which uses count vectors, weighted by the term and inverse document frequency information of the log events iii) semantic representation captures the linguistic meaning from the sequence of language tokens in the log events and learns a high-dimensional embedding vector for each token in the dataset. The nature of log representation chosen has direct consequence in terms of which patterns of anomalies they can support - for example, for capturing keyword based anomalies, semantic representation might be key, while for anomalies related to template count and variable distribution, quantitative representations are possibly more appropriate. The semantic embedding vectors themselves can be either obtained using pretrained neural language models like GloVe, FastText, pretrained Transformer like BERT, RoBERTa etc or learnt using a trainable embedding layer as part of the target task. v) Log Analysis tasks for Incident Detection: Once the logs are represented in some compact machine-interpretable form which can be easily ingested by AI models, a pipeline of log analysis tasks can be performed on it - starting with Log compression techniques using Clustering and Summarization, followed by Log based Anomaly Detection. In turn, anomaly detection can further enable downstream tasks in Incident
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
10 Fig. 7. Steps of the Log Analysis Workflow for Incident Detection Fig. 8. Example of Log Parsing Fig. 9. Different types of log partitioning Management like Failure Prediction and Root Cause Analysis. In this section we discuss only the first two log analysis tasks which are pertinent to incident detection and leave failure prediction and RCA for the subsequent sections. v.1) Log Compression through Clustering & Summariza- tion: This is a practical first-step towards analyzing the huge volumes of log data is Log Compression through various clustering and summarization techniques. The objective of this analysis serves two purposes - Firstly, this step can independently help the site reliability engineers and service owners during incident management by providing a practical and intuitive way of visualizing these massive volumes of complex unstructured raw log data. Secondly, the output of log clustering can directly be leveraged in some of the log based anomaly detection methods. Amongst the various techniques of log clustering, [82], [67], [83] employ hierarchical clustering and can support online settings by constructing and retrieving from knowledge base of representative log clusters. [84], [85] use frequent pattern matching with dimension reduction techniques like PCA and locally sensitive hashing with online and streaming support. [86], [64], [87] uses efficient iterative or incremental clustering and partitioning techniques that support online and streaming logs and can also handle clustering of rare log instances. Another area of existing literature [88], [89], [90], [91] focus on log compression through summarization - where, for example, [88] uses heuristics like log event ids and timings to summarize and [89], [21] does openIE based triple extraction using semantic information and domain knowledge and rules to generate summaries, while [90], [91] use sequence clustering using linguistic rules or through grouping common event sequences. v.2) Log Anomaly Detection: Perhaps the most common use of log analysis is for log based anomaly detection where a wide variety of models have been employed in both research and industrial settings. These models are categorized based on various factors i) the learning setting - supervised, semi- supervised or unsupervised: While the semi-supervised models assume partial knowledge of labels or access to few anomalous instances, unsupervised ones train on normal log data and detect anomaly based on their prediction confidence. ii) the type of Model - Neural or traditional statistical non-neural models iii) the kinds of log representations used iv) Whether to use log parsing or parser free methods v) If using parsing, then whether to encode only the log template part or both template and parameter representations iv) Whether to restrict modeling of anomalies at the level of individual log lines or to support sequential modeling of anomaly detection over log sequences. The nature of log representation employed and the kind of modeling used - both of these factors influence what type of anomaly patterns can be detected - for example keyword and variable value based anomalies are captured by semantic representation of log lines, while template count and vari- able distribution based anomaly patterns are more explicitly modeled through quantitative representations of log events. Similarly template sequence and time-interval based anomalies need sequential modeling algorithms which can handle log sequences. 
Below we briefly summarize the body of literature dedicated to these two types of models - Statistical and Neural; and In Table III we provide a comparison of a more comprehensive list of existing anomaly detection algorithms and systems. Statistical Models are the more traditional machine learning models which draw inference from various statistics under- lying the training data. In the literature there have been various statistical ML models employed for this task under
11 different training settings. Amongst the supervised methods, [92], [93], [94] using traditional learning strategies of Lin- ear Regression, SVM, Decision Trees, Isolation Forest with handcrafted features extracted from the entire logline. Most of these model the data at the level of individual log-lines and cannot not explicitly capture sequence level anomalies. There are also unsupervised methods like ii) dimension reduction techniques like Principal Component Analysis (PCA) [84] iii) clustering and drawing correlations between log events and metric data as in [67], [82], [95], [80]. There are also unsupervised pattern mining methods which include mining invariant patterns from singular value decomposition [96] and mining frequent patterns from execution flow and control flow graphs [97], [98], [99], [68]. Apart from these there are also systems which employ a rule engine built using domain knowledge and an ensemble of different ML models to cater to different incident types [20] and also heuristic methods for doing contrast analysis between normal and incident- indicating abnormal logs [100]. Neural Models, on the other hand are a more recent class of machine learning models which use artificial neural networks and have proven remarkably successful across numerous AI applications. They are particularly powerful in encoding and representing the complex semantics underlying in a way that is meaningful for the predictive task. One class of unsuper- vised neural models use reconstruction based self-supervised techniques to learn the token or line level representation, which includes i) Autoencoder models [101], [102] ii) more powerful self-attention based Transformer models [103] iv) specific pretrained Transformers like BERT language model [104], [105], [21]. Another offshoot of reconstruction based models is those using generative adversarial or GAN paradigm of training for e.g. [106], [107] using LSTM or Transformer based encoding. The other types of unsupervised models are forecasting based, which learn to predict the next log token or next log line in a self-supervised way - for e.g i) Recurrent Neural Network based models like LSTM [108], [109], [110], [18], [111] and GRU [104] or their attention based counterparts [81], [112], [113] ii) Convolutional Neural Network (CNN) based models [114] or more complex models which use Graph Neural Network to represent log event data [115], [116]. Both reconstruction and forecasting based models are capable of handling sequence level anomalies, it depends on the nature of training (i.e. whether representations are learnt at log line or token level) and the capacity of model to handle long sequences (e.g. amongst the above, Autoencoder models are the most basic ones). Most of these models follow the practical setup of unsu- pervised training, where they train only non-anomalous log data. However, other works have also focused on supervised training of LSTM, CNN and Transformer models [111], [114], [78], [117], over anomalous and normal labeled data. On the other hand, [104], [110] use weak supervision based on heuristic assumptions for e.g. logs from external systems are considered anomalous. Most of the neural models use semantic token representations, some with pretrained fixed or trainable embeddings, initialized with GloVe, fastText or pretrained transformer based models, BERT, GPT, XLM etc. vi) Log Model Deployment: The final step in the log analysis workflow is deployment of these models in the actual industrial settings. 
It involves i) a training step, typically over offline log data dump, with or without some supervision labels collected from domain experts ii) online inference step, which often needs to handle practical challenges like non- stationary streaming data i.e. where the data distribution is not independently and identically distributed throughout the time. For tackling this, some of the more traditional statistical methods like [103], [95], [82], [84] support online streaming update while some other works can also adapt to evolving log data by incrementally building a knowledge base or memory or out-of-domain vocabulary [101]. On the other hand most of the unsupervised models support syncopated batched online training, allowing the model to continually adapt to changing data distributions and to be deployed on high throughput streaming data sources. However for some of the more advanced neural models, the online updation might be too computationally expensive even for regular batched updates. Apart from these, there have also been specific work on other challenges related to model deployment in practical settings like transfer learning across logs from different do- mains or applications [110], [103], [18], [18], [118] under semi-supervised settings using only supervision from source systems. Other works focus on evaluating model robustness and generalization (i.e. how well the model adapts to) to unstable log data due to continuous logging modifications throughout software evolutions and updates [109], [111], [104]. They achieve these by adopting domain adversarial paradigms during training [18], [18] or using counterfactual explanations [118] or multi-task settings [21] over various log analysis tasks. Challenges & Future Trends Collecting supervision labels: Like most AIOps tasks, collecting large-scale supervision labels for training or even evaluation of log analysis problems is very challenging and impractical as it involves significant amount of manual inter- vention and domain knowledge. For log anomaly detection, the goal being quite objective, label collection is still possible to enable atleast a reliable evaluation. Whereas, for other log analysis tasks like clustering and summarization, collecting supervision labels from domain experts is often not even possible as the goal is quite subjective and hence these tasks are typically evaluated through the downstream log analysis or RCA task. Imbalanced class problem: One of the key challenges of anomaly detection tasks, is the class imbalance, stemming from the fact that anomalous data is inherently extremely rare in occurrence. Additionally, various systems may show different kinds of data skewness owing to the diverse kinds of anomalies listed above. This poses a technical challenge both during model training with highly skewed data as well as choice of evaluation metrics, as Precision, Recall and F- Score may not perform satisfactorily. Further at inference, thresholding over the anomaly score gets particularly chal-
12 lenging for unsupervised models. While for benchmarking purposes, evaluation metrics like AUROC (Area under ROC curve) can suffice, but for practical deployment of these models require either careful calibrations of anomaly scores or manual tuning or heuristic means for setting the threshold. This being quite sensitive to the application at hand, also poses realistic challenges when generalizing to heterogenous logs from different systems. Handling large volume of data: Another challenge in log analysis tasks is handling the huge volumes of logs, where most large-scale cloud-based systems can generate petabytes of logs each day or week. This calls for log processing algorithms, that are not only effective but also lightweight enough to be very fast and efficient. Handling non-stationary log data: Along with humon- gous volume, the natural and most practical setting of logs analysis is an online streaming setting, involving non- stationary data distribution - with heterogenous log streams coming from different inter-connected micro-services, and the software logging data itself evolving over time as developers naturally keep evolving software in the agile cloud devel- opment environment. This requires efficient online update schemes for the learning algorithms and specialized effort towards building robust models and evaluating their robustness towards unstable or evolving log data. Handling noisy data: Annotating log data being ex- tremely challenging even for domain experts, supervised and semi-supervised models need to handle this noise during training, while for unsupervised models, it can heavily mislead evaluation. Even though it affects a small fraction of logs, the extreme class imbalance aggrevates this problem. Another related challenge is that of errors compounding and cascading from each of the processing steps in the log analysis workflow when performing the downstream tasks like anomaly detec- tion. Realistic public benchmark datasets for anomaly detec- tion: Amongst the publicly available log anomaly detection datasets, only a limited few contain anomaly labels. Most of those benchmarks have been excessively used in the literature and hence do not have much scope of furthering research. Infact, their biggest limitation is that they fail to showcase the diverse nature of incidents that typically arise in real- world deployment. Often very simple handcrafted rules prove to be quite successful in solving anomaly detection tasks on these datasets. Also, the original scale of these datasets are several orders of magnitude smaller than the real-world use-cases and hence not fit for showcasing the challenges of online or streaming settings. Further, the volume of unique patterns collapses significantly after the typical log processing steps to remove irrelevant patterns from the data. On the other hand, a vast majority of the literature is backed up by empirical analysis and evaluation on internal proprietary data, which cannot guarantee reproducibility. This calls for more realistic public benchmark datasets that can expose the real-world challenges of aiops-in-the-wild and also do a fair benchmarking across contemporary log analysis models. 
Public benchmarks for parsing, clustering, summariza- tion: Most of the log parsing, clustering and summarization literature only uses a very small subset of data from some of the public log datasets, where the oracle parsing is available, or in-house log datasets from industrial applications where they compare with oracle parsing methods that are unscalable in practice. This also makes fair comparison and standardized benchmarking difficult for these tasks. Better log language models: Some of the recent advances in neural NLP models like transformer based language models BERT, GPT has proved quite promising for representing logs in natural language style and enabling various log analysis tasks. However there is more scope of improvement in building neural language models that can appropriately encode the semi-structured logs composed of fixed template and variable parameters without depending on an external parser. Incorporating Domain Knowledge: While existing log anomaly detection systems are entirely rule-based or auto- mated, given the complex nature of incidents and the di- verse varieties of anomalies, a more practical approach would involve incorporating domain knowledge into these models either in a static form or dynamically, following a human- in-the-loop feedback mechanism. For example, in a complex system generating humungous amounts of logs, which kinds of incidents are more severe and which types of logs are more crucial to monitor for which kind of incidents. Or even at the level of loglines, domain knowledge can help understand the real-world semantics or physical significance of some of the parameters or variables mentioned in the logs. These aspects are often hard for the ML system to gauge on its own especially in the practical unsupervised settings. Unified models for heterogenous logs: Most of the log analysis models are highly sensitive towards the nature of log preprocessing or grouping, needing customized preprocessing for each type of application logs. This alludes towards the need for unified models with more generalizable preprocessing layers that can handle heterogenous kinds of log data and also different types of log analysis tasks. While [21] was one of the first works to explore this direction, there is certainly more research scope for building practically applicable models for log analysis. C. Traces and Multimodal Incident Detection Problem Definition Traces are semi-structured event logs with span information about the topological structure of the service graph. Trace anomaly detection relies on finding abnormal paths on the topological graph at given moments, as well as discovering abnormal information directly from trace event log text. There are multiple ways to process trace data. Traces usually have timestamps and associated sequential information so it can be covered into time-series data. Traces are also stored as trace event logs, containing rich text information. Moreover, traces store topological information which can be used to reconstruct the service graphs that represents the relation
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
13 among components of the systems. From the data perspective, traces can easily been turned into multiple data modalities. Thus, we combines trace-based anomaly detection with multi- modal anomaly detection to discuss in this section. Recently, we can see with the help of multi-modal deep learning technologies, trace anomaly detection can combine different levels of information relayed by trace data and learn more comprehensive anomaly detection models [119][120]. Empirical Approaches Traces draw more attention in microservice system archi- tectures since the topological structure becomes very complex and dynamic. Trace anomaly detection started from practical usages for large scale system debugging [121]. Empirical trace anomaly detection and RCA started with constructing trace graphs and identifying abnormal structures on the constructed graph. Constructing the trace graph from trace data is usually very time consuming, an offline component is designed to train and construct such trace graph. Apart from , to adapt to the usage requirements to detect and locate issues in large scale systems, trace anomaly detection and RCA algorithms usually also have an online part to support real-time service. For example, Cai et al. . released their study of a real-time trace-level diagnosis system, which is adopted by Alibaba datacenters. This is one of the very few studies to deal with real large distributed systems [122]. Most empirical trace anomaly detection work follow the offline and online design pattern to construct their graph mod- els. In the offline modeling, unsupervised or semi-supervised techniques are utilized to construct the trace entity graphs, very similar to techniques in process discovery and mining domain. For example, PageRank has been used to construct web graphs in one of the early web graph anomaly detection works [123]. After constructing the trace entity graphs, a variety of techniques can be used to detect anomalies. One common way is to compare the current graph pattern to normal graph patterns. If the current graph pattern significantly deviates from the normal patterns, report anomalous traces. An alternative approach is using data mining and statistical learning techniques to run dynamic analysis without construct- ing the offline trace graph. Chen et al. proposed Pinpoint [124], a framework for root cause analysis that using coarse- grained tagging data of real client requests at real-time when these requests traverse through the system, with data mining techniques. Pinpoint discovers the correlation between success / failure status of these requests and fault components. The entire approach processes the traces on-the-fly and does not leverage any static dependency graph models. Deep Learning Based Approaches In recent years, deep learning techniques started to be employed in trace anomaly detection and RCA. Also with the help of deep learning frameworks, combining general trace graph information and the detailed information inside of each trace event to train multimodal learning models become possible. Long-short term memory (LSTM) network [125] is a very popular neural network model in early trace and multimodal anomaly detection. LSTM is a special type of recurrent neural network (RNN) and has been proved to success in lots of other domains. In AIOps, LSTM is also commonly used in metric and log anomaly detection applications. Trace data is a natural fit with RNNs, majorly in two ways: 1) The topological order of traces can be modeled as event sequences. 
These event sequences can easily be transformed into model inputs of RNNs. 2) Trace events usually have text data that conveys rich information. The raw text, including both the structured and unstructured parts, can be transformed into vectors via standard tokenization and embedding techniques, and feed the RNN as model inputs. Such deep learning model architectures can be extended to support multimodal input, such as combining trace event vector with numerical time series values [119]. To better leverage the topological information of traces, graph neural networks have also been introduced in trace anomaly detection. Zhang et al. developed DeepTraLog, a trace anomaly detection technique that employs Gated graph neural networks [120]. DeepTraLog targets to solve anomaly detection problems for complex microservice systems where service entity relationships are not easy to obtain. Moreover, the constructed graph by GGNN training can also be used to localize the issue, providing additional root-cause analysis capability. Limitations Trace data became increasingly attractive as more applica- tions transitioned from monolithic to microservice architec- ture. There are several challenges in machine learning based trace anomaly detection. Data quality. As far as we know, there are multiple trace collection platforms and the trace data format and quality are inconsistent across these platforms, especially in the pro- duction environment. To use these trace data for analysis, researchers and developers have to spend significant time and effort to clean and reform the data to feed machine learning models. Difficult to acquire labels. It is very difficult to acquire labels for production data. For a given incident, labeling the corresponding trace requires identifying the incident occurring time and location, as well as the root cause which may be located in totally different time and location. Obtaining such full labels for thousands of incidents is extremely difficult. Thus, most of the existing trace analysis research still use synthetic data to evaluate the model performance. This brings more doubts whether the proposed solution can solve problems in real production. No sufficient multimodal and graph learning models. Trace data are complex. Current trace analysis simplifies trace data into event sequences or time-series numerical values, even in the multimodal settings. However, these existing model architectures did not fully leverage all information of trace data in one place. Graph-based learning can potentially be a solution but discussions of this topic are still very limited. Offline model training. The deep learning models in existing research relies on offline model training, partially because model training is usually very time consuming and
14 contradicts with the goal of real-time serving. However, offline model training brings static dependencies to a dynamic system. Such dependencies may cause additional performance issues. Future Trends Unified trace data Recently, OpenTelemetry leads the effort to unify observability telemetry data, including metrics, logs, traces, etc., across different platforms. This effort can bring huge benefits to future trace analysis. With more unified data models, AI researchers can more easily acquire necessary data to train better models. The trained model can also be easily plug-and-play by other parties, which can further boost model quality improvements. Unified engine for detection and RCA Trace graph contains rich information about the system at a given time. With the help of trace data, incident detection and root cause localization can be done within one step, instead of the current two consecutive steps. Existing work has demonstrated that by simply examining the constructed graph, the detection model can reveal sufficient information to locate the root causes [120]. Unified models for multimodal telemetry data Trace data analysis brings the opportunities for researchers to create a holistic view of multiple telemetry data modality since traces can be converted into text sequence data and time-series data. The learnings can be extended to include logs or metrics from different sources. Eventually we can expect unified learning models that can consume multimodal telemetry data for incident detection and RCA. Online Learning Modern systems are dynamic and ever- changing. Current two-step solution relies on offline model training and online serving or inference. Any system evolution between two offline training cycles could cause potential issues and damage model performance. Thus, supporting online learning is critical to guarantee high performance in real production environments. V. F AILURE P REDICTION Incident Detection and Root-Cause Analysis of Incidents are more reactive measures towards mitigating the effects of any incident and improving service availability once the incident has already occurred. On the other hand, there are other proactive actions that can be taken to predict if any potential incident can happen in the immediate future and prevent it from happening. Failures in software systems are such kind of highly disruptive incidents that often start by showing symptoms of deviation from the normal routine behavior of the required system functions and typically result in failure to meet the service level agreement. Failure prediction is one such proactive task in Incident Management, whose objective is to continuously monitor the system health by analyzing the different types of system data (KPI metrics, logging and trace data) and generate early warnings to prevent failures from occurring. Consequently, in order to handle the different kinds of telemetry data sources, the task of predicting failures can be tailored to metric based and log based failure prediction. We describe these two in details in this section. A. Metrics based Failure Prediction Metric data are usually fruitful in monitoring system. It is straightforward to directly leverage them to predict the occurrence of the incident in advance. As such, some proactive actions can be taken to prevent it from happening instead of reducing the time for detection. 
Generally, it can be formulated as the imbalanced binary classification problem if failure labels are available, and formulated as the time series forecasting problem if the normal range of monitored metrics are defined in advance. In general, failure prediction [126] usually adopts machine learning algorithms to learn the characteristics of historical failure data, build a failure prediction model, and then deploy the model to predict the likelihood of a failure in the future. Methods General Failure Prediction: Recently, there are increasing efforts on considering general failure incident prediction with the failure signals from the whole monitoring system. [127] collected alerting signals across the whole system and dis- covered the dependence relationships among alerting signals, then the gradient boosting tree based model was adopted to learn failure patterns. [128] proposed an effective feature engineering process to deal with complex alert data. It used multi-instance learning and handle noisy alerts, and inter- pretable analysis to generate an interpretable prediction result to facilitate the understanding and handling of incidents. Specific Type Failure Prediction: In contrast, some works In contrast, [127] and [128] aim to proactively predict various specific types of failures. [129] extracted statistical and textual features from historical switch logs and applied random forest to predict switch failures in data center networks. [130] collected data from SMART [131] and system-level signals, and proposed a hybrid of LSTM and random forest model for node failure prediction in cloud service system. [132] developed a disk error prediction method via a cost-sensitive ranking models. These methods target at the specific type of failure prediction, and thus are limited in practice. Challenges and Future Trends While conventional supervised learning for classification or regression problems can be used to handle failure prediction, it needs to overcome the following main challenges. First, datasets are usually very imbalanced due to the limited number of failure cases. This poses a significant challenge to the prediction model to achieve high precision and high recall simultaneously. Second, the raw signals are usually noisy, not all information before incident is helpful. How to extract omen features/patterns and filter out noises are critical to the prediction performance. Third, it is common for a typical system to generate a large volume of signals per minute, leading to the challenge to update prediction model in the streaming way and handle the large-scale data with lim- ited computation resources. Fourth, post-processing of failure prediction is very important for failure management system to improve availability. For example, providing interpretable failure prediction can facilitate engineers to take appropriate action for it.
15 B. Logs based Incident Detection Like Incident Detection and Root Cause Analysis, Failure Prediction is also an extremely complex task, especially in enterprise level systems which comprise of many distributed but inter-connected components, services and micro-services interacting with each other asynchronously. One of the main complexities of the task is to be able to do early detection of signals alluding towards a major disruption, even while the system might be showing only slight or manageable deviations from its usual behavior. Because of this nature of the problem, often monitoring the KPI metrics alone may not suffice for early detection, as many of these metrics might register a late reaction to a developing issue or may not be fine-grained enough to capture the early signals of an incident. System and software logs, on the other hand, being an all- pervasive part of systems data continuously capture rich and very detailed runtime information that are often pertinent to detecting possible future failures. Thus various proactive log based analysis have been applied in different industrial applications as a continuous monitoring task and have proved to be quite effective for a more fine- grained failure prediction and localizing the source of the potential failure. It involves analyzing the sequences of events in the log data and possibly even correlating them with other data sources like metrics in order to detect anomalous event patterns that indicate towards a developing incident. This is typically achieved in literature by employing supervised or semi-supervised machine learning models to predict future failure likelihood by learning and modeling the characteristics of historical failure data. In some cases these models can also be additionally powered by domain knowledge about the intricate relationships between the systems. While this task has not been explored as popularly as Log Anomaly Detection and Root Cause Analysis and there are fewer public datasets and benchmark data, software and systems maintainance logging data still plays a very important role in predicting potential future failures. In literature, generally the failure prediction task over log data has been employed in broadly two types of systems - homogenous and heterogenous. Failure Prediction in Homogenous Systems In homogenous systems, like high-performance computing systems or large-scale supercomputers, this entails prediction of independent failures, where most systems leverage sequen- tial information to predict failure of a single component. Time-Series Modeling : Amongst homogenous systems, [133], [134] extract system health indicating features from structured logs and modeled this as time series based anomaly forecasting problem. Similarly [135] extracts specific patterns during critical events through feature engineering and build a supervised binary classifier to predict failures. [136] converts unstructured logs into templates through parsing and apply feature extraction and time-series modeling to predict surge, frequency and seasonality patterns of anomalies. 
Supervised Classifiers Some of the older works predict failures in a supervised classification setting using tradi- tional machine learning models like support vector machines, nearest-neighbor or rule-based classifiers [137], [93], [138], or ensemble of classifiers [93] or hidden semi-markov model based classifier [139] over features handcrafted from log event sequences or over random indexing based log encoding while [140], [141] uses deep recurrent neural models like LSTM over semantic representations of logs. [142] predict and diagnose failures through first failure identification and causality based filtering to combine correlated events for filtering through association rule-mining method. Failure Prediction in Heterogenous Systems In heterogenous systems, like large-scale cloud services, es- pecially in distributed micro-service environment, outages can be caused by heterogenous components. Most popular meth- ods utilize knowledge about the relationship and dependency between the system components, in order to predict failures. Amongst such systems, [143] constructed a Bayesian network to identify conditional dependence between alerting signals extracted from system logs and past outages in offline setting and used gradient boosting trees to predict future outages in the online setting. [144] uses a ranking model combining temporal features from LSTM hidden states and spatial features from Random Forest to rank relationships between failure indicating alerts and outages. [145] trains trace-level and micro-service level prediction models over handcrafted features extracted from trace logs to detect three common types of micro-service failures. VI. R OOT C AUSE A NALYSIS Root-cause Analysis (RCA) is the process to conduct a series of actions to discover the root causes of an incident. RCA in DevOps focuses on building the standard process workflow to handle incidents more systematically. Without AI, RCA is more about creating rules that any DevOps member can follow to solve repeated incidents. However, it is not scalable to create separate rules and process workflow for each type of repeated incident when the systems are large and complex. AI models are capable to process high volume of input data and learn representations from existing incidents and how they are handled, without humans to define every single details of the workflow. Thus, AI-based RCA has huge potential to reform how root cause can be discovered. In this section, we discuss a series of AI-based RCA topics, separeted by the input data modality: metric-based, log-based, trace-based and multimodal RCA. A. Metric-based RCA Problem Definition With the rapidly growing adoption of microservices ar- chitectures, multi-service applications become the standard paradigm in real-world IT applications. A multi-service ap- plication usually contains hundreds of interacting services, making it harder to detect service failures and identify the root causes. Root cause analysis (RCA) methods leverage the KPI metrics monitored on those services to determine the root causes when a system failure is detected, helping engineers and
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
16 SREs in the troubleshooting process * . The key idea behind RCA with KPI metrics is to analyze the relationships or dependencies between these metrics and then utilize these relationships to identify root causes when an anomaly occurs. Typically, there are two types of approaches: 1) identifying the anomalous metrics in parallel with the observed anomaly via metric data analysis, and 2) discovering a topology/causal graph that represent the causal relationships between the services and then identifying root causes based on it. Metric Data Analysis When an anomaly is detected in a multi-service application, the services whose KPI metrics are anomalous can possibly be the root causes. The first approach directly analyzes these KPI metrics to determine root causes based on the assumption that significant changes in one or multiple KPI metrics happen when an anomaly occurs. Therefore, the key is to identify whether a KPI metric has pattern or magnitude changes in a look-back window or snapshot of a given size at the anomalous timestamp. Nguyen et al. [146], [147] propose two similar RCA meth- ods by analyzing low-level system metrics, e.g., CPU, memory and network statistics. Both methods first detect abnormal behaviors for each component via a change point detection algorithm when a performance anomaly is detected, and then determine the root causes based on the propagation patterns obtained by sorting all critical change points in a chronological order. Because a real-world multi-service application usually have hundreds of KPI metrics, the change point detection algorithm must be efficient and robust. [146] provides an algo- rithm by combining cumulative sum charts and bootstrapping to detect change points. To identify the critical change point from the change points discovered by this algorithm, they use a separation level metric to measure the change magnitude for each change point and extract the critical change point whose separation level value is an outlier. Since the earliest anomalies may have propagated from their corresponding services to other services, the root causes are then determined by sorting the critical change points in a chronological order. To further improve root cause pinpointing accuracy, [147] develops a new fault localization method by considering both propagation patterns and service component dependencies. Instead of change point detection, Shan et al. [148] devel- oped a low-cost RCA method called -Diagnosis to detect root causes of small-window long-tail latency for web services. - Diagnosis assumes that the root cause metrics of an abnormal service have significantly changes between the abnormal and normal periods. It applies the two-sample test algorithm and - statistics for measuring similarity of time series to identify root causes. In the two-sample test, one sample (normal sample) is drawn from the snapshot during the normal period while the other sample (anomaly sample) is drawn during the anomalous period. If the difference between the anomaly sample and the normal sample are statistically significant, the corresponding metrics of the samples are potential root causes. * A good survey for anomaly detection and RCA in cloud applications [22] Topology or Causal Graph-based Analysis The advantage of metric data analysis methods is the ability of handling millions of metrics. But most of them don’t consider the dependencies between services in an ap- plication. 
The second type of RCA approaches leverages such dependencies, which usually involves two steps, i.e., constructing topology/causal graphs given the KPI metrics and domain knowledge, and extracting anomalous subgraphs or paths given the observed anomalies. Such graphs can either be reconstructed from the topology (domain knowledge) of a certain application ([149], [150], [151], [152]) or automatically estimated from the metrics via causal discovery techniques ([153], [154], [155], [156], [157], [158], [159]). To identify the root causes of the observed anomalies, random walk (e.g., [160], [156], [153]), page-rank (e.g., [150]) or other techniques can be applied over the discovered topology/causal graphs. When the service graphs (the relationships between the services) or the call graphs (the communications among the services) are available, the topology graph of a multi-service application can be reconstructed automatically, e.g., [149], [150]. But such domain knowledge is usually unavailable or partially available especially when investigating the relation- ships between the KPI metrics instead of API calls. Therefore, given the observed metrics, causal discovery techniques, e.g., [161], [162], [163] play a significant role in constructing the causal graph describing the causal relationships between these metrics. The most popular causal discovery algorithm applied in RCA is the well-known PC-algorithm [161] due to its simplicity and explainability. It starts from a complete undirected graph and eliminates edges between the metrics via conditional independence test. The orientations of the edges are then determined by finding V-structures followed by orientation propagation. Some variants of the PC-algorithm [164], [165], [166] can also be applied based on different data properties. Given the discovered causal graph, the possible root causes of the observed anomalies can be determined by random walk. A random walk on a graph is a random process that begins at some node, and randomly moves to another node at each time step. The probability of moving from one node to another is defined in the the transition probability matrix. Random walk for RCA is based on the assumption that a metric that is more correlated with the anomalous KPI metrics is more likely to be the root cause. Each random walk starts from one anomalous node corresponding to an anomalous metric, then the nodes visited the most frequently are the most likely to be the root causes. The key of random walk approaches is to determine the transition probability matrix. Typically, there are three steps for computing the transition probability matrix, i.e., forward step (probability of walking from a node to one of its parents), backward step (probability of walking from a node to one of its children) and self step (probability of staying in the current node). For example, [153], [158], [159], [150] computes these probabilities based on the correlation of each metric with the detected anomalous metrics during the anomaly period. But correlation based random walk may not accurately localize root cause [156]. Therefore, [156] proposes to use the partial correlations instead of correlations to compute the transition
17 probabilities, which can remove the effect of the confounders of two metrics. Besides random walk, other causal graph analysis tech- niques can also be applied. For example, [157], [155] find root causes for the observed anomalies by recursively visiting all the metrics that are affected by the anomalies, e.g., if the parents of an affected metric are not affected by the anomalies, this metric is considered a possible root cause. [167] adopts a search algorithm based on a breadth-first search (BFS) algorithm to find root causes. The search starts from one anomalous KPI metric and extracts all possible paths outgoing from this metric in the causal graph. These paths are then sorted based on the path length and the sum of the weights associated to the edges in the path. The last nodes in the top paths are considered as the root causes. [168] considers counterfactuals for root cause analysis based on the causal graph, i.e., given a functional causal model, it finds the root cause of a detected anomaly by computing the contribution of each noise term to the anomaly score, where the contributions are symmetrized using the concept of Shapley values. Limitations Data Issues For a multi-service application with hundreds of KPI metrics monitored on each service, it is very chal- lenging to determine which metrics are crucial for identifying root causes. The collected data usually doesn’t describe the whole picture of the system architecture, e.g., missing some important metrics. These missing metrics may be the causal parents of other metrics, which violates the assumption of PC algorithms that no latent confounders exist. Besides, due to noises, non-stationarity and nonlinear relationships in real- world KPI metrics, recovering accurate causal graphs becomes even harder. Lack of Domain Knowledge The domain knowledge about the monitored application, e.g., service graphs and call graphs, is valuable to improve RCA performance. But for a complex multi-service application, even developers may not fully un- derstand the meanings or the relationships of all the monitored metrics. Therefore, the domain knowledge provided by experts is usually partially known, and sometimes conflicts with the knowledge discovered from the observed data. Causal Discovery Issues The RCA methods based on causal graph analysis leverage causal discovery techniques to recover the causal relationships between KPI metrics. All these techniques have certain assumptions on data properties which may not be satisfied with real-world data, so the discovered causal graph always contains errors, e.g., incorrect links or orientations. In recent years, many causal discovery methods have been proposed with different assumptions and characteristics, so that it is difficult to choose the most suitable one given the observed data. Human in the Loop After DevOps or SRE teams receive the root causes identified by a certain RCA method, they will do further analysis and provide feedback about whether these root causes make sense. Most RCA methods cannot leverage such feedback to improve RCA performance, or provide explanations why the identified root causes are incorrect. Lack of Benchmarks Different from incident detection problems, we lack benchmarks to evaluate RCA performance, e.g., few public datasets with groundtruth root causes are available, and most previous works use private internal datasets for evaluation. 
Although some multi-service application de- mos/simulators can be utilized to generate synthetic datasets for RCA evaluation, the complexity of these demo applications is much lower than real-world applications, so that such evalu- ation may not reflect the real performance in practice. The lack of public real-world benchmarks hampers the development of new RCA approaches. Future Trends RCA Benchmarks Benchmarks for evaluating the per- formance of RCA methods are crucial for both real-world applications and academic research. The benchmarks can either be a collection of real-world datasets with groundtruth root causes or some simulators whose architectures are close to real-world applications. Constructing such large-scale real- world benchmarks is essential for boosting novel ideas or approaches in RCA. Combining Causal Discovery and Domain Knowledge The domain knowledge provided by experts are valuable to improve causal discovery accuracy, e.g., providing required or forbidden causal links between metrics. But sometimes such domain knowledge introduces more issues when recovering causal graphs, e.g., conflicts with data properties or conditional independence tests, introducing cycles in the graph. How to combine causal discovery and expert domain knowledge in a principled manner is an interesting research topic. Putting Human in the Loop Integrating human interactions into RCA approaches is important for real-world applications. For instance, the causal graph can be built in an iterative way, i.e., an initial causal graph is reconstructed by a certain causal discovery algorithm, and then users examine this graph and provide domain knowledge constraints (e.g., which relation- ships are incorrect or missing) for the algorithm to revise the graph. The RCA reports with detailed analysis about incidents created by DevOps or SRE teams are valuable to improve RCA performance. How to utilize these reports to improve RCA performance is another importance research topic. B. Log-based RCA Problem Definition Triaging and root cause analysis is one of the most complex and critical phases in the Incident Management life cycle. Given the nature of the problem which is to investigate into the origin or the root cause of an incident, simply analyzing the end KPI metrics often do not suffice. Especially in a micro- service application setting or distributed cloud environment with hundreds of services interacting with each other, RCA and failure diagnosis is particularly challenging. In order to localize the root cause in such complex environments, engi- neers, SREs and service owners typically need to investigate into core system data. Logs are one such ubiquitous forms of systems data containing rich runtime information. Hence one of the ultimate objectives of log analysis tasks is to enable triaging of incident and localization of root cause to diagnose faults and failures.
18 Starting with heterogenous log data from different sources and microservices in the system, typical log-based aiops workflows first have a layer of log processing and analysis, involving log parsing, clustering, summarization and anomaly detection. The log analysis and anomaly detection can then cater to a causal inference layer that analyses the relationships and dependencies between log events and possibly detected anomalous events. These signals extracted from logs within or across different services can be further correlated with other observability data like metrics, traces etc in order to detect the root cause of an incident. Typically this involves constructing a causal graph or mining a knowledge graph over the log events and correlating them with the KPI metrics or with other forms of system data like traces or service call graphs. Through these, the objective is to analyze the relationships and dependencies between them in order to eventually identify the possible root causes of an anomaly. Unlike the more concrete problems like log anomaly detection, log based root cause analysis is a much more open-ended task. Subsequently most of the literature on log based RCA has been focused on industrial applications deployed in real-world and evaluated with internal benchmark data gathered from in-house domain experts. Typical types of Log RCA methods In literature, the task of log based root cause analysis have been explored through various kinds of approaches. While some of the works build a knowledge graph and knowledge and leverage data mining based solutions, others follow funda- mental principles from Causal Machine learning or and causal knowledge mining. Other than these, there are also log based RCA systems using traditional machine learning models which use feature engineering or correlational analysis or supervised classifier to detect the root cause. Handcrafted features based methods: [169] uses hand- crafted feature engineering and probabilistic estimation of specific types of root causes tailored for Spark logs. [170] uses frequent item-set mining and association rule mining on feature groups for structured logs. Correlation based Methods: [171], [172] localizes root cause based on correlation analysis using mutual information between anomaly scores obtained from logs and monitored metrics. Similarly [173] use PCA, ICA based correlation analysis to capture relationships between logs and consequent failures. [84], [174] uses PCA to detect abnormal system call sequences which it maps to application functions through frequent pattern mining.[175] uses LSTM based sequential modeling of log templates identified through pattern matching over clusters of similar logs, in order to predict failures. Supervised Classifier based Methods: [176] does auto- mated detection of exception logs and comparison of new error patterns with normal cloud behaviours on OpenStack by learning supervised classifiers over statistical and neural rep- resentations of historical failure logs. [177] employs statistical technique on the data distribution to identify the fine-grained category of a performance problem and fast matrix recovery RPCA to identify the root cause. [178], [179] uses KNN or its supervised versions to identify loglines that led to a failure. 
Knowledge Mining based Methods: [180], [181] takes a different approach of summarizing log events into an entity- relation knowledge graph by extracting custom entities and relationships from log lines and mining temporal and proce- dural dependencies between them from the overall log dump. While this gives a more structured representation of the log summary, it is also an intuitive way of aggregating knowledge from logs, it is also a way to bridge the knowledge gap developer community who creates the log data and the site reliability engineers who typically consume the log data when investigating incidents. However, eventually the end goal of constructing this knowledge graph representation of logs is to facilitate RCA. While these works do provide use-cases like case-studies on RCA for this vision, but they leave ample scope of research towards a more concrete usage of this kind of knowledge mining in RCA. Knowledge Graph based Methods: Amongst knowledge graph based methods, [182] diagnoses and triages performance failure issues in an online fashion by continuously building a knowledge base out of rules extracted from a random forest constructed over log data using heuristics and domain knowl- edge. [151] constructs a system graph from the combination of KPI metrics and log data. Based on the detected anomalies from these data sources, it extracts anomalous subgraphs from it and compares them with the normal system graph to detect the root cause. Other works mine normal log patterns [183] or time-weighted control flow graphs [99] from normal exe- cutions and on estimates divergences from them to executions during ongoing failures to suggest root causes. [184], [185], [186] mines execution sequences or user actions [187] either from normal and manually injected failures or from good or bad performing systems, in a knowledge base and utilizes the assumption that similar faults generate similar failures to match and diagnose type of failure. Most of these knowledge based approaches incrementally expand their knowledge or rules to cater to newer incident types over time. Causal Graph based Methods: [188] uses a multivariate time-series modeling over logs by representing them as error event count. This work then infers its causal relationship with KPI error rate using a pagerank style centrality detection in order to identify the top root causes. [167] constructs a knowledge graph over operation and maintenance entities extracted from logs, metrics, traces and system dependency graphs and mines causal relations using PC algorithm to detect root causes of incidents. [189] uses a Knowledge informed Hierarchical Bayesian Network over features extracted from metric and log based anomaly detection to infer the root causes. [190] constructs dynamic causality graph over events extracted from logs, metrics and service dependency graphs. [191] similarly constructs a causal dependency graph over log events by clustering and mining similar events and use it to infer the process in which the failure occurs. Also, on a related domain of network analysis, [192], [193], [194] mines causes of network events through causal analysis on network logs by modeling the parsed log template counts as a multivariate time series. [195], [156] use causality inference on KPI metrics and service call graphs to localize
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
19 root causes in microservice systems and one of the future research directions is to also incorporate unstructured logs to such causal analysis. Challenges & Future Trends Collecting supervision la- bels Being a complex and open-ended task, it is challenging and requires a lot of domain expertise and manual effort to col- lect supervision labels for root cause analysis. While a small scale supervision can still be availed for evaluation purposes, reaching the scale required for training these models is simply not practical. At the same time, because of the complex nature of the problem, completely unsupervised models often perform quite poorly. Data quality: The workflow of RCA over hetero- geneous unstructured log data typically involves various dif- ferent analysis layers, preprocessing, parsing, partitioning and anomaly detection. This results in compounding and cascading of errors (both labeling errors as well as model prediction errors) from these components, needing the noisy data to be handled in the RCA task. In addition to this, the extremely challenging nature of RCA labeling task further increases the possibility of noisy data. Imbalanced class problem: RCA on huge voluminous logs poses an additional problem of extreme class imbalance - where out of millions of log lines or log templates, a very sparse few instances might be related to the true root cause. Generalizability of models: Most of the exist- ing literature on RCA tailors their approach very specifically towards their own application and cannot be easily adopted even by other similar systems. This alludes towards need for more generalizable architectures for modeling the RCA task which in turn needs more robust generalizable log analysis models that can handle hetergenous kinds of log data coming from different systems. Continual learning framework: One of the challenging aspects of RCA in the distributed cloud setting is the agile environment, leading to new kinds of incidents and evolving causation factors. This kind of non- stationary learning setting poses non-trivial challenges for RCA but is indeed a crucial aspect of all practical industrial applications. Human-in-the-loop framework: While neither completely supervised or unsupervised settings is practical for this task, there is need for supporting human-in-the-loop framework which can incorporate feedbacks from domain experts to improve the system, especially in the agile settings where causation factors can evolve over time. Realistic public benchmarks: Majority of the literature in this area is focused on industrial applications with in-house evaluation setting. In some cases, they curate their internal testbed by injecting failures or faults or anomalies in their internal simulation environment (for e.g. injecting CPU, memory, network and Disk anomalies in Spark platforms) or in popular testing settings (like Grid5000 testbed or open-source microservice applications based on online shopping platform or train ticket booking or open source cloud operating system OpenStack). Other works evaluate by deploying their solution in real- world setting in their in-house cloud-native application, for e.g. on IBM Bluemix platform, or for Facebook applications or over hundreds of real production services at big data cloud computing platforms like Alibaba or thousands of services at e-commerce enterprises like eBay. 
One of the striking limitations in this regard is the lack of any reproducible open- source public benchmark for evaluating log based RCA in practical industrial settings. This can hinder more open ended research and fair evaluation of new models for tackling this challenging task. C. Trace-based and Multimodal RCA Problem Definition. Ideally, RCA for a complex system needs to leverage all kind of available data, including machine generated telemetry data and human activity records, to find potential root causes of an issue. In this section we discuss trace-based RCA together with multi-modal RCA. We also include studies about RCA based on human records such as incident reports. Ultimately, the RCA engine should aim to process any data types and discover the right root causes. RCA on Trace Data In previous section (Section IV-C) we discussed trace can be treated as multimodal data for anomaly detection. Similar to trace anomaly detection, trace root cause analysis also lever- ages the topological structure of the service map. Instead of detecting abnormal traces or paths, trace RCA usually started after issues were detected. Trace RCA techniques help ease troubleshooting processes of engineers and SREs. And trace RCA can be triggered in a more ad-hoc way instead of running continuously. This differentiates the potential techniques to be adopted from trace anomaly detection. Trace Entity Graph. From the technical point of view, trace RCA and trace anomaly detection share similar perspectives. To our best knowledge, there are not too many existing works talking about trace RCA alone. Instead, trace RCA serves as an additional feature or side benefit for trace anomaly detection in either empirical approaches [121] [196] or deep learning approaches [120] [197]. In trace anomaly detection, the constructed trace entity graph (TEG) after offline training provides a clean relationship between each component in the application systems. Thus, besides anomaly detection, [122] implemented a real-time RCA algorithm that discovers the deepest root of the issues via relative importance analysis after comparing the current abnormal trace pattern with normal trace patterns. Their experiment in the production environment demonstrated this RCA algorithm can achieve higher precision and recall compared to naive fixed threshold methods. The effectiveness of leverage trace entity graph for root cause analysis is also proven in deep learning based trace anomaly detection approaches. Liu et al. [198] proposed a multimodal LSTM model for trace anomaly detection. Then the RCA algorithm can check every anomalous trace with the model training traces and discover root cause by localizing the next called microservice which is not in the normal call paths. This algorithm performs well for both synthetic dataset and produc- tion datasets of four large production services, according to the evaluation of this work.
20 Online Learning. An alternative approach is using data mining and statistical learning techniques to run dynamic analysis without constructing the offline trace graph. Tra- ditional trace management systems usually provides basic analytical capabilities to diagnose issues and discover root causes [199]. Such analysis can be performed online without costly model training process. Chen et al. proposed Pinpoint [124], a framework for root cause analysis that using coarse- grained tagging data of real client requests at real-time when these requests traverse through the system, with data mining techniques. Pinpoint discovers the correlation between success / failure status of these requests and fault components. The entire approach processes the traces on-the-fly and does not leverage any static dependency graph models. Another related area is using trouble-shooting guide data, where [200] rec- ommends troubleshooting guide based on semantic similarity with incident description while [201] focuses on automation of troubleshooting guides to execution workflows, as a way to remediate the incident. RCA on Incident Reports Another notable direction in AIOps literature has been mining useful knowledge from domain-expert curated data (incident report, incident investigation data, bug report etc) towards enabling the final goals of root cause analysis and automated remediation of incidents. This is an open ended task which can serve various purposes - structuring and parsing unstructured or semi-structured data and extracting targeted information or topics from them (using topic modeling or in- formation extraction) and mining and aggregating knowledge into a structured form. The end-goal of these tasks is majorly root cause analysis, while some are also focused on recommending remediation to mitigate the incident. Especially since in most cloud- based settings, there is an increasing number of incidents that occur repeatedly over time showing similar symptoms and having similar root causes. This makes mining and curating knowledge from various data sources, very crucial, in order to be consumed by data-driven AI models or by domain experts for better knowledge reuse. Causality Graph. [202] extracts and mines causality graph from historical incident data and uses human-in-the-loop su- pervision and feedback to further refine the causality graph. [203] constructs an anomaly correlation graph, FacGraph using a distributed frequent pattern mining algorithm. [204] recom- mends appropriate healing actions by adapting remediations retrieved from similar historical incidents. Though the end task involves remediation recommendation, the system still needs to understand the nature of incident and root cause in order to retrieve meaningful past incidents. Knowledge Mining. [205], [206] mines knowledge graph from named entity and relations extracted from incident re- ports using LSTM based CRF models. [207] extracts symp- toms, root causes and remediations from past incident inves- tigations and builds a neural search and knowledge graph to facilitate a retrieval based root cause and remediation recommendation for recurring incidents. Future Trends More Efficient Trace Platform. Currently there are very limited studies in trace related topics. A fundamental challenge is about the trace platforms.There are bottlenecks in collection, storage, query and management of trace data. Traces are usually at a much larger scale than logs and metrics. 
Future Trends

More Efficient Trace Platforms. There are currently very limited studies on trace-related topics. A fundamental challenge lies in the trace platforms themselves: there are bottlenecks in the collection, storage, query and management of trace data, and traces are usually at a much larger scale than logs and metrics. Collecting, storing and retrieving trace data more efficiently is therefore critical to the success of trace root cause analysis.

Online Learning. Compared to trace anomaly detection, online learning plays an even more important role for trace RCA, especially in large cloud systems. An RCA tool usually needs to analyze evidence on the fly and correlate the most suspicious evidence with the ongoing incident, which makes the task highly time sensitive. For example, a trace entity graph (TEG) can enable accurate trace RCA, but only under the assumption that the TEG reflects the current state of the system. If offline training is the only way to obtain the TEG, the performance of such approaches in real-world production environments is always questionable. Using online learning to keep the TEG up to date is therefore a much better way to guarantee high performance in this setting.

Causality Graphs on Multimodal Telemetry. The most valuable information conveyed by trace data is the complex topological order of large systems. Without traces, causal analysis for system operations has to rely on temporal and geometrical correlations to infer causal relationships, and in practice very few existing causal inference methods can be adopted in real-world systems. With traces, however, it is straightforward to obtain the ground truth of how requests flow through the entire system. We therefore believe that much higher quality causal graphs can be obtained if they are learned from multimodal telemetry data.

Complete Knowledge Graphs of Systems. So far, knowledge mining has mostly been attempted on a single data type. To reflect the full picture of a complex system, however, AI models need to mine knowledge from all kinds of data, including metrics, logs, traces, incident reports and other system activity records, and then construct a knowledge graph containing complete system information.
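To make the discussion of trace-derived causality and knowledge graphs concrete, the following is a minimal sketch of aggregating parent-child trace spans into a service dependency graph, which can seed a causal graph over services. It is an illustration under assumed span fields (trace_id, span_id, parent_id, service), not an algorithm from the cited works.

```python
# Minimal sketch of deriving a service dependency graph from trace spans
# (illustrative only). Assumption: each span is a dict with trace_id, span_id,
# parent_id (None for root spans) and service fields.

from collections import defaultdict

spans = [
    {"trace_id": "t1", "span_id": "s1", "parent_id": None, "service": "api"},
    {"trace_id": "t1", "span_id": "s2", "parent_id": "s1", "service": "auth"},
    {"trace_id": "t1", "span_id": "s3", "parent_id": "s1", "service": "db"},
]

def build_service_graph(spans):
    """Return a mapping from caller service to the set of callee services,
    aggregated over all traces."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    graph = defaultdict(set)
    for s in spans:
        parent = by_id.get((s["trace_id"], s["parent_id"]))
        if parent is not None:
            graph[parent["service"]].add(s["service"])
    return dict(graph)

print(build_service_graph(spans))  # {'api': {'auth', 'db'}}
```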
VII. AUTOMATED ACTIONS

While the incident detection and RCA capabilities of AIOps provide information about ongoing issues, taking the right actions is the step that actually solves the problem. Without automated actions, human operators are still needed for every single ops task, so automated actions are critical for building fully automated, end-to-end AIOps systems. Automated actions contribute in both the short term and the long term: 1) short-term remediation: immediate actions that quickly remediate the issue, including server rebooting, live migration, automated scaling, etc.; and 2) longer-term resolutions: actions or guidance for tasks such as code bug fixing, software updating, hardware build-out and resource allocation optimization. In this section, we discuss three common types of automated actions: automated remediation, auto-scaling and resource management.

A. Automated Remediation

Problem Definition. Beyond continuously monitoring the IT infrastructure, detecting issues and discovering root causes, remediating issues with minimal or even no human intervention is the path towards the next generation of fully automated AIOps. Automated issue remediation (auto-remediation) takes a series of actions to resolve issues by leveraging known information, existing workflows and domain knowledge. Auto-remediation is already adopted in many IT operation scenarios, including cloud computing, edge computing, SaaS, etc. Traditional auto-remediation processes rely on well-defined policies and rules to determine which workflow to use for a given issue, whereas machine learning driven auto-remediation uses ML models to decide the best action workflow to mitigate or resolve the issue. ML-based auto-remediation is exceptionally useful in large-scale cloud or edge-computing systems, where it is impossible to manually create workflows for every issue category.

Existing Work

End-to-end auto-remediation solutions usually contain three main components: anomaly or issue detection, root cause analysis, and a remediation engine [208]. Successful auto-remediation therefore relies heavily on the quality of anomaly detection and root cause analysis, which we have already discussed in the sections above. In addition, the remediation engine should be able to learn from the analysis results, make decisions and execute them.

Knowledge learning. The relevant knowledge falls into several categories. The anomaly detection and root cause analysis results for the specific issue contribute the majority of the learnable knowledge [208]; the remediation engine uses this information to locate and categorize the issue. Human activity records of past issues (such as tickets and bug-fixing logs) are also important for the remediation engine to learn the full picture of how issues were handled historically. In Sections VI-A, VI-B and VI-C we discussed mining knowledge graphs from system metrics, logs and human-in-the-loop records; a high-quality knowledge graph that clearly describes the relationships among system components is a valuable input to the remediation engine.

Decision making and execution. Levy et al. [209] proposed Narya, a system that handles failure remediation for running virtual machines in cloud systems. For an issue where a host is predicted to fail, the remediation engine must decide the best action to take from options such as live migration, soft reboot, service healing, etc. The decision on which action to take is made via A/B testing and reinforcement learning. By adopting machine learning in their remediation engine, they report significant savings in virtual machine interruptions compared to the previous static strategies.
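The following toy sketch illustrates the general idea of learning-based remediation action selection with a simple epsilon-greedy bandit per failure context. It is not the Narya system from [209]; the action names, contexts and reward signal are illustrative assumptions.

```python
# Toy sketch of learning-based remediation action selection (epsilon-greedy bandit
# per failure context). Illustrative only; not the Narya system from [209].

import random
from collections import defaultdict

ACTIONS = ["live_migration", "soft_reboot", "service_healing"]  # assumed action set

class RemediationPolicy:
    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        # (context, action) -> [total_reward, count]
        self.stats = defaultdict(lambda: [0.0, 0])

    def choose(self, context: str) -> str:
        if random.random() < self.epsilon:          # explore occasionally
            return random.choice(ACTIONS)
        def avg(action):                             # otherwise exploit best observed action
            total, n = self.stats[(context, action)]
            return total / n if n else 0.0
        return max(ACTIONS, key=avg)

    def update(self, context: str, action: str, reward: float) -> None:
        """Reward could be, e.g., 1.0 if the VM avoided an interruption, else 0.0."""
        entry = self.stats[(context, action)]
        entry[0] += reward
        entry[1] += 1

policy = RemediationPolicy()
action = policy.choose(context="predicted_disk_failure")
policy.update("predicted_disk_failure", action, reward=1.0)
```

Real remediation engines add safety rails (allowed-action policies, rollout guards, human approval for risky actions) on top of any learned decision component.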
Future Trends

Auto-remediation research and development is still at a very early stage. Existing work focuses either on an intermediate step, such as constructing a causal graph for a given scenario, or on an end-to-end auto-remediation solution for a very specific use case such as virtual machine interruptions. Below are a few topics that could significantly improve the quality of auto-remediation systems.

System Integration. There is still no unified platform that can perform all of the issue analysis, learn the contextual knowledge, make decisions and execute the actions.

Learning to generate and update knowledge graphs. The quality of auto-remediation decision making strongly depends on domain knowledge, and currently humans collect most of it. In the future, it will be valuable to explore approaches that learn and maintain knowledge graphs of the systems in a more reliable way.

AI-driven decision making and execution. Currently most decision making and action execution is rule-based or based on statistical learning. With more powerful AI techniques, the remediation engine will be able to consume richer information and make more complex decisions.

B. Auto-scaling

Problem Definition. Cloud native technologies are becoming the de facto standard for building scalable applications in public or private clouds, enabling loosely coupled systems that are resilient, manageable, and observable (see the CNCF charter: https://github.com/cncf/foundation/blob/main/charter.md). Cloud systems such as GCP and AWS provide users with on-demand resources including CPU, storage, memory and databases. Users need to specify limits on these resources to provision for the workloads of their applications. If a service in an application exceeds the limit of a particular resource, end-users will experience request delays or timeouts, and system operators will then request a larger limit for this resource to avoid degraded performance. But when hundreds of services are running, such generous limits result in massive resource wastage. Auto-scaling aims to resolve this issue without human intervention: it enables dynamic provisioning of resources to applications based on workload behavior patterns, minimizing resource wastage without loss of quality of service (QoS) for end-users. Auto-scaling approaches can be categorized into two types: reactive auto-scaling and proactive (or predictive) auto-scaling.

Reactive auto-scaling. Reactive auto-scaling monitors the services in an application and brings capacity up and down in reaction to changes in workload. It is very effective and supported by most cloud platforms, but it has one potential disadvantage: it does not scale up resources until the workload has already increased, so there is a short period during which the workload is higher but the extra capacity is not yet available. End-users can therefore experience response delays during this period. Proactive auto-scaling aims to solve this problem by predicting future workloads from historical data. In this paper, we mainly discuss proactive auto-scaling algorithms based on machine learning. A minimal sketch of a reactive, threshold-based policy is shown below for contrast.
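The sketch below shows the reactive, threshold-based policy described above. It is an illustration rather than any specific cloud provider's implementation; the CPU thresholds and replica bounds are assumptions.

```python
# Minimal sketch of a reactive, threshold-based auto-scaling policy (illustrative;
# not a specific cloud provider's implementation). Thresholds and bounds are
# assumed values.

from dataclasses import dataclass

@dataclass
class ReactiveScaler:
    min_replicas: int = 2
    max_replicas: int = 20
    scale_up_cpu: float = 0.75    # average CPU utilization above which we add capacity
    scale_down_cpu: float = 0.30  # average CPU utilization below which we remove capacity

    def decide(self, current_replicas: int, avg_cpu_util: float) -> int:
        """Return the desired replica count given the observed average CPU utilization."""
        if avg_cpu_util > self.scale_up_cpu:
            return min(current_replicas + 1, self.max_replicas)
        if avg_cpu_util < self.scale_down_cpu:
            return max(current_replicas - 1, self.min_replicas)
        return current_replicas

scaler = ReactiveScaler()
print(scaler.decide(current_replicas=4, avg_cpu_util=0.82))  # -> 5
```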
Proactive Auto-scaling. Typically, proactive auto-scaling involves three steps: predicting workloads, estimating capacities, and scaling out. Machine learning techniques are usually applied to predict future workloads and estimate suitable capacities for the monitored services, and adjustments are then made accordingly to avoid degraded performance.

One type of proactive auto-scaling approach applies regression models (e.g., ARIMA [210], SARIMA [211], MLP, LSTM [212]). Given the historical metrics of a monitored service, these approaches train a regression model to learn the workload behavior patterns. For example, [213] investigated the ARIMA model for workload prediction and showed that it improves resource utilization efficiency with minimal impact on QoS. [214] applied a time-window MLP to predict phases in containers with different types of workloads and proposed a predictive vertical auto-scaling policy to resize containers. [215] also leveraged neural networks (especially MLPs) for workload prediction and compared this approach with traditional machine learning models such as linear regression and k-nearest neighbors. [216] applied a bidirectional LSTM to predict HTTP request workloads and showed that the Bi-LSTM works better than LSTM and ARIMA on the tested use cases. These approaches require accurate forecasts to avoid over- or under-allocation of resources, yet it is hard to develop a robust forecasting-based approach because of noise and sudden spikes in user requests. A minimal sketch of this forecast-then-provision pattern is given below.

The other type is based on reinforcement learning (RL), which treats auto-scaling as an automatic control problem whose goal is to learn an optimal auto-scaling policy, i.e., the best resource provisioning action under each observed state. [217] presents an exhaustive survey of reinforcement learning-based auto-scaling approaches and compares them against a set of proposed taxonomies; this survey is well worth reading for developers or researchers interested in this direction. Although RL looks promising for auto-scaling, many issues remain to be resolved. For example, model-based methods require a perfect model of the environment, and the learned policies cannot adapt to changes in the environment, while model-free methods have very poor initial performance and slow convergence, so they would incur high cost if applied directly in real-world cloud platforms.
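The following is a minimal sketch of the forecast-then-provision pattern. The cited works use models such as ARIMA or (Bi-)LSTM; here a simple exponentially weighted moving average stands in as the forecaster, and the per-replica capacity and headroom factor are illustrative assumptions.

```python
# Minimal sketch of forecast-then-provision proactive auto-scaling (illustrative;
# real systems use forecasters such as ARIMA or LSTM). Capacity and headroom
# figures are assumed.

import math
from typing import List

def ewma_forecast(history: List[float], alpha: float = 0.5) -> float:
    """One-step-ahead forecast via an exponentially weighted moving average."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def plan_capacity(history_rps: List[float],
                  per_replica_rps: float = 100.0,
                  headroom: float = 1.2,
                  min_replicas: int = 2) -> int:
    """Estimate how many replicas to provision for the next interval."""
    predicted = ewma_forecast(history_rps)
    needed = math.ceil(predicted * headroom / per_replica_rps)
    return max(needed, min_replicas)

# Example: requests-per-second history for the monitored service.
print(plan_capacity([220, 260, 310, 400, 520]))  # -> 6 replicas
```

The same forecast-then-provision pattern also underlies many of the ML-based resource management approaches discussed next.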
C. Resource Management

Problem Definition. Resource management is another important topic in cloud computing, covering resource provisioning, allocation and scheduling, e.g., workload estimation, task scheduling, energy optimization, etc. Even small provisioning inefficiencies, such as selecting the wrong resources for a task, can affect quality of service (QoS) and thus lead to significant monetary costs. The goal of resource management is therefore to provision the right amount of resources for tasks, improving QoS, mitigating workload imbalance, and avoiding service level agreement violations. Because multiple tenants share storage and computation resources on cloud platforms, resource management is a difficult task that involves dynamically allocating resources and scheduling tenants' tasks. Resource provisioning can be determined in a reactive manner, e.g., by manually creating static rules based on domain knowledge, but as with auto-scaling, reactive approaches lead to response delays and excessive overheads. To resolve this issue, ML-based approaches to resource management have gained much attention recently.

ML-based Resource Management

Many ML-based resource management approaches have been developed in recent years. Due to space limitations we will not discuss them in detail; we refer readers interested in this research topic to the following review papers: [218], [219], [220], [221], [222]. Most of these approaches apply ML techniques to forecast future resource consumption and then perform resource provisioning or scheduling based on the forecasting results. For instance, [223] uses random forest and XGBoost to predict VM behaviors, including maximum deployment sizes and workloads. [224] proposes a linear regression based approach to predict the resource utilization of VMs from their historical data and then leverages the prediction results to reduce energy consumption. [225] applies gradient boosting models for temperature prediction, on top of which a dynamic scheduling algorithm is developed to minimize the peak temperature of hosts. [226] proposes an RL-based, workload-specific scheduling algorithm to minimize average task completion time.

The accuracy of the ML model is the key factor affecting the efficiency of a resource management system. Applying more sophisticated traditional ML models, or even deep learning models, to improve prediction accuracy is a promising research direction. Besides accuracy, the time complexity of model prediction is another important factor: if an ML model is overly complicated, it cannot handle real-time requests for resource allocation and scheduling. How to trade off accuracy against time complexity needs to be explored further.

VIII. FUTURE OF AIOPS

A. Common AI Challenges for AIOps

We have discussed challenges and future trends in each task section with respect to how AI techniques are employed. In summary, several challenges are common across AIOps tasks.

Data Quality. All AIOps tasks face data quality issues. Most real-world AIOps data are extremely imbalanced, because incidents occur only occasionally, and they are also very noisy. Significant effort is needed for data cleaning and pre-processing before the data can be used to train ML models.

Lack of Labels. It is extremely difficult to acquire sufficient high-quality labels. Providing them requires domain experts who are deeply familiar with system operations to evaluate incidents, root causes and service graphs; this is extremely time consuming and requires specific expertise, so it cannot be handled by general crowd-sourcing approaches such as Mechanical Turk.
Non-stationarity and heterogeneity. Systems are ever-changing, so AIOps faces a non-stationary problem space, and AI models in this domain need mechanisms to deal with this non-stationary nature. Meanwhile, AIOps data are heterogeneous: the same kind of telemetry can exhibit a variety of underlying behaviors. For example, CPU utilization patterns can be completely different when the resources host different applications. Discovering these hidden states and handling heterogeneity is therefore very important for AIOps solutions to succeed.

Lack of Public Benchmarks. Even though the AIOps research community is growing rapidly, there are still very few public datasets on which researchers can benchmark and evaluate their results. Operational data are highly sensitive, so existing research is done either with simulated data or with enterprise production data that can hardly be shared with other groups and organizations.

Human-in-the-loop. Human feedback is very important for building AIOps solutions, but currently most of it is collected in an ad-hoc fashion, which is inefficient. There is a lack of human-in-the-loop studies in the AIOps domain on automating feedback collection and using that feedback to improve model performance.

B. Opportunities and Future Trends

Our literature review shows that current AIOps research still focuses mostly on infrastructure and tooling. AI technologies are being successfully applied in incident detection and RCA, and some solutions have been adopted by large distributed systems such as AWS and Alibaba Cloud, while AIOps process standardization and full automation are still at a very early stage. Based on this evidence, we foresee the following promising topics for AIOps in the next few years.

High Quality AIOps Infrastructure and Tooling. Some successful AIOps platforms and tools have been developed in recent years, but there are still many opportunities where AI can enhance the efficiency of IT operations. AI itself is also advancing rapidly, with new techniques invented and successfully applied in other domains, and the digital transformation trend brings new challenges to traditional IT operations and DevOps. Together these create tremendous demand for high-quality AI tooling, including monitoring, detection, RCA, prediction and automation.

AIOps Standardization. While building the infrastructure and tooling, AIOps practitioners are also gaining a better understanding of the full picture of the domain. AIOps modules can be identified and extracted from traditional processes to form their own standards. With clear goals and measures, it becomes possible to standardize AIOps systems, as has been done in domains such as recommendation systems or NLP. With such standardization, it will be much easier to experiment with a large variety of AI techniques to improve AIOps performance.

Human-centric to Machine-centric AIOps. In human-centric AIOps, human processes still play critical roles throughout the AIOps ecosystem, and AI modules help humans make better decisions and execute them. In machine-centric mode, AIOps systems require minimal human intervention and can remain in a human-free state for most of their lifetime: they continuously monitor the IT infrastructure, detect and analyze issues, and find the right paths to drive fixes. At this stage, engineers focus primarily on development tasks rather than operations.

IX. CONCLUSION

Digital transformation creates tremendous demand for computing resources.
This trend drives strong growth of large-scale IT infrastructure, such as cloud computing, edge computing, search engines, etc. Since it was proposed by Gartner in 2016, AIOps has been emerging rapidly and now draws attention from large enterprises and organizations. As the scale of IT infrastructure grows to a level where human operation can no longer keep up, AIOps becomes the only promising way to guarantee high availability of these gigantic IT infrastructures. AIOps covers different stages of the software lifecycle, including development, testing, deployment and maintenance. A range of AI techniques is now applied in AIOps applications, including anomaly detection, root cause analysis, failure prediction, automated actions and resource management. However, the AIOps industry as a whole is still at a very early stage, in which AI plays only a supporting role in helping humans conduct operation workflows. We foresee the trend shifting from human-centric operations to AI-centric operations in the near future; during this shift, the development of AIOps techniques will also move from building tools to creating human-free, end-to-end solutions. In this survey, we found that most current AIOps work focuses on detection and root cause analysis, while research on automation is still very limited, and the AI techniques used in AIOps are mainly traditional machine learning and statistical models.

ACKNOWLEDGMENT

We want to thank all participants who took the time to accomplish this survey. Their knowledge and experience of AI fundamentals were invaluable to our study. We are also grateful to our colleagues at the Salesforce AI Research Lab and collaborators from other organizations for their helpful feedback and support.

APPENDIX A
TERMINOLOGY

DevOps: Modern software development requires not only higher development quality but also higher operations quality. DevOps, a set of best practices that combines the development (Dev) and operations (Ops) processes, was created to achieve high-quality software development and post-release management [3].

Application Performance Monitoring (APM): Application performance monitoring is the practice of tracking key software application performance using monitoring software
and telemetry data [227]. APM is used to guarantee high system availability, optimize service performance and improve user experience. Originally, APM was mostly adopted for websites, mobile apps and other similar online business applications; however, as more and more traditional software is transformed to leverage cloud-based, highly distributed systems, APM is now widely used for a much larger variety of software applications and backends.

Observability: Observability is the ability to measure the internal states of a system by examining its outputs [228]. A system is "observable" if its current state can be estimated using only the information from its outputs. Observability data include metrics, logs, traces and other system-generated information.

Cloud Intelligence: The artificial intelligence features that improve cloud applications.

MLOps: MLOps stands for machine learning operations, the full process lifecycle of deploying machine learning models to production.

Site Reliability Engineering (SRE): The engineering discipline that bridges the gap between software development and operations.

Cloud Computing: Cloud computing is a technique, and a business model, that builds highly scalable distributed computer systems and rents computing resources, e.g., hosts, platforms and apps, to tenants to generate revenue. There are three main categories of cloud computing: infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS).

IT Service Management (ITSM): ITSM refers to all processes and activities to design, create, deliver, and support IT services for customers.

IT Operations Management (ITOM): ITOM overlaps with ITSM, focusing more on the operations side of IT services and infrastructure.
APPENDIX B
TABLES

TABLE I: POPULAR PUBLIC DATASETS FOR METRICS OBSERVABILITY

Name | Description | Tasks
Azure Public Dataset | These datasets contain a representative subset of first-party Azure virtual machine workloads from a geographical region. | Workload characterization, VM pre-provisioning, workload prediction
Google Cluster Data | 30 continuous days of information from Google Borg cells. | Workload characterization, workload prediction
Alibaba Cluster Trace | Cluster traces of real production servers from Alibaba Group. | Workload characterization, workload prediction
MIT Supercloud Dataset | Combination of high-level data (e.g. Slurm Workload Manager scheduler data) and low-level job-specific time series data. | Workload characterization
Numenta Anomaly Benchmark (realAWSCloudwatch) | AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes. | Incident detection
Yahoo S5 (A1) | The A1 benchmark contains real Yahoo! web traffic metrics. | Incident detection
Server Machine Dataset | A 5-week-long dataset collected from a large Internet company containing metrics like CPU load, network usage, memory usage, etc. | Incident detection
KPI Anomaly Detection Dataset | A large-scale real-world KPI anomaly detection dataset, covering various KPI patterns and anomaly patterns. This dataset is collected from five large Internet companies (Sougo, eBay, Baidu, Tencent, and Ali). | Incident detection

TABLE II: POPULAR PUBLIC DATASETS FOR LOG OBSERVABILITY
(✓ = anomaly labels available, ✗ = no anomaly labels, - = not reported)

Dataset | Description | Time-span | Data Size | # Logs | Anomaly Labels | # Anomalies | # Log Templates
Distributed system logs
HDFS | Hadoop distributed file system log | 38.7 hours | 1.47 GB | 11,175,629 | ✓ | 16,838 (blocks) | 30
HDFS | Hadoop distributed file system log | N.A. | 16.06 GB | 71,118,073 | ✗ | - | -
Hadoop | Hadoop map-reduce job log | N.A. | 48.61 MB | 394,308 | ✓ | - | 298
Spark | Spark job log | N.A. | 2.75 GB | 33,236,604 | ✗ | - | 456
Zookeeper | ZooKeeper service log | 26.7 days | 9.95 MB | 74,380 | ✗ | - | 95
OpenStack | OpenStack infrastructure log | N.A. | 58.61 MB | 207,820 | ✓ | 503 | 51
Supercomputer logs
BGL | Blue Gene/L supercomputer log | 214.7 days | 708.76 MB | 4,747,963 | ✓ | 348,460 | 619
HPC | High performance cluster log | N.A. | 32 MB | 433,489 | ✗ | - | 104
Thunderbird | Thunderbird supercomputer log | 244 days | 29.6 GB | 211,212,192 | ✓ | 3,248,239 | 4040
Operating system logs
Windows | Windows event log | 226.7 days | 16.09 GB | 114,608,388 | ✗ | - | 4833
Linux | Linux system log | 263.9 days | 2.25 MB | 25,567 | ✗ | - | 488
Mac | Mac OS log | 7 days | 16.09 MB | 117,283 | ✗ | - | 2214
Mobile system logs
Android | Android framework log | N.A. | 183.37 MB | 1,555,005 | ✗ | - | 76,923
Health App | Health app log | 10.5 days | 22.44 MB | 253,395 | ✗ | - | 220
Server application logs
Apache | Apache server error logs | 263.9 days | 4.9 MB | 56,481 | ✗ | - | 44
OpenSSH | OpenSSH server logs | 28.4 days | 70.02 MB | 655,146 | ✗ | - | 62
Standalone software logs
Proxifier | Proxifier software logs | N.A. | 2.42 MB | 21,329 | ✗ | - | 9
Hardware logs
Switch | Switch hardware failures | 2 years | - | 29,174,680 | ✓ | 2,204 | -
TABLE III: COMPARISON OF EXISTING LOG ANOMALY DETECTION MODELS
(✓ = yes, ✗ = no)

Reference | Learning Setting | Type of Model | Log Representation | Log Tokens | Parsing | Sequence Modeling
[92], [93], [94] | Supervised | Linear Regression, SVM, Decision Tree | handcrafted feature | log template | ✓ | ✗
[84] | Unsupervised | Principal Component Analysis (PCA) | quantitative | log template | ✓ | ✓
[67], [82], [95], [80] | Unsupervised | Clustering and correlation between logs and metrics | sequential, quantitative | log template | ✓ | ✗
[96] | Unsupervised | Mining invariants using singular value decomposition | quantitative, sequential | log template | ✓ | ✗
[97], [98], [99], [68] | Unsupervised | Frequent pattern mining from execution flow and control flow graph mining | quantitative, sequential | log template | ✓ | ✗
[20], [100] | Unsupervised | Rule engine over ensembles and heuristic contrast analysis over anomaly characteristics | sequential (with tf-idf weights) | log template | ✓ | ✗
[101] | Supervised | Autoencoder for log-specific word2vec | semantic (trainable embedding) | log template | ✓ | ✓
[102] | Unsupervised | Autoencoder with Isolation Forest | semantic (trainable embedding) | all tokens | ✗ | ✗
[114] | Supervised | Convolutional Neural Network | semantic (trainable embedding) | log template | ✓ | ✓
[108] | Unsupervised | Attention-based LSTM | sequential, quantitative, semantic (GloVe embedding) | log template, log parameter | ✓ | ✓
[81] | Unsupervised | Attention-based LSTM | quantitative and semantic (GloVe embedding) | log template | ✓ | ✓
[111] | Supervised | Attention-based LSTM | semantic (fastText embedding with tf-idf weights) | log template | ✓ | ✓
[104] | Semi-supervised | Attention-based GRU with clustering | semantic (fastText embedding with tf-idf weights) | log template | ✓ | ✓
[112] | Unsupervised | Attention-based Bi-LSTM | semantic (trainable embedding) | all tokens | ✗ | ✓
[109] | Unsupervised | Bi-LSTM | semantic (token embedding from BERT, GPT, XLM) | all tokens | ✗ | ✓
[113] | Unsupervised | Attention-based Bi-LSTM | semantic (BERT token embedding) | log template | ✓ | ✓
[110] | Semi-supervised | LSTM, trained with supervision from source systems | semantic (GloVe embedding) | log template | ✓ | ✓
[18] | Unsupervised | LSTM with domain adversarial training | semantic (GloVe embedding) | all tokens | ✗ | ✓
[118], [18] | Unsupervised | LSTM with Deep Support Vector Data Description | semantic (trainable embedding) | log template | ✓ | ✓
[115] | Supervised | Graph Neural Network | semantic (BERT token embedding) | log template | ✓ | ✓
[116] | Semi-supervised | Graph Neural Network | semantic (BERT token embedding) | log template | ✓ | ✓
[103], [229], [230], [231] | Unsupervised | Self-attention Transformer | semantic (trainable embedding) | all tokens | ✗ | ✓
[78] | Supervised | Self-attention Transformer | semantic (trainable embedding) | all tokens | ✗ | ✓
[117] | Supervised | Hierarchical Transformer | semantic (trainable GloVe embedding) | log template, log parameter | ✓ | ✓
[104], [105] | Unsupervised | BERT language model | semantic (BERT token embedding) | all tokens | ✗ | ✓
[21] | Unsupervised | Unified BERT for various log analysis tasks | semantic (BERT token embedding) | all tokens | ✗ | ✓
[232] | Unsupervised | Contrastive adversarial model | semantic (BERT and VAE based embedding) and quantitative | log template | ✓ | ✓
[106], [107], [233] | Unsupervised | LSTM / Transformer based GAN (generative adversarial) | semantic (trainable embedding) | log template | ✓ | ✓

Log Tokens refers to the tokens from the log line used in the log representations. The Parsing and Sequence Modeling columns respectively indicate whether a model requires log parsing and whether it supports modeling of log sequences.
TABLE IV: COMPARISON OF EXISTING METRIC ANOMALY DETECTION MODELS
(✓ = yes, ✗ = no)

Reference | Label Accessibility | Machine Learning Model | Dimensionality | Infrastructure | Streaming Updates
[31] | Supervised | Tree | Univariate | ✗ | ✓ (retraining)
[41] | Active | - | Univariate | ✓ | ✓ (retraining)
[42] | Unsupervised | Tree | Multivariate | ✗ | ✓
[43] | Unsupervised | Statistical | Univariate | ✗ | ✓
[51] | Unsupervised | Statistical | Univariate | ✗ | ✗
[37] | Semi-supervised | Tree | Univariate | ✗ | ✓
[36] | Unsupervised, Semi-supervised | Deep Learning | Univariate | ✗ | ✗
[52] | Unsupervised | Deep Learning | Univariate | ✓ | ✗
[40] | Domain Adaptation, Active | Tree | Univariate | ✗ | ✗
[46] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[49] | Unsupervised | Deep Learning | Univariate | ✗ | ✗
[45] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[32] | Supervised | Deep Learning | Univariate | ✓ | ✓ (retraining)
[47] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[48] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[50] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[38] | Semi-supervised, Active | Deep Learning | Multivariate | ✓ | ✓ (retraining)

TABLE V: COMPARISON OF EXISTING TRACE AND MULTIMODAL ANOMALY DETECTION AND RCA MODELS
(✓ = yes, ✗ = no)

Reference | Topic | Deep Learning Adoption | Method
[124] | Trace RCA | ✗ | Clustering
[121] | Trace RCA | ✗ | Heuristic
[234] | Trace RCA | ✗ | Multi-input differential summarization
[197] | Trace RCA | ✗ | Random forest, k-NN
[122] | Trace RCA | ✗ | Heuristic
[235] | Trace Anomaly Detection | ✗ | Graph model
[198] | Multimodal Anomaly Detection | ✓ | Deep Bayesian Networks
[236] | Trace Representation | ✓ | Tree-based RNN
[196] | Trace Anomaly Detection | ✗ | Heuristic
[120] | Multimodal Anomaly Detection | ✓ | GGNN and SVDD

TABLE VI: COMPARISON OF SEVERAL EXISTING METRIC RCA APPROACHES

Reference | Metric or Graph Analysis | Root Cause Score
[147] | Change points | Chronological order
[146] | Change points | Chronological order
[148] | Two-sample test | Correlation
[149] | Call graphs | Cluster similarity
[150] | Service graph | PageRank
[151] | Service graph | Graph similarity
[152] | Service graph | Hierarchical HMM
[153] | PC algorithm | Random walk
[154] | ITOA-PI | PageRank
[155] | Service graph and PC | Causal inference
[156] | PC algorithm | Random walk
[157] | Service graph and PC | Causal inference
[158] | PC algorithm | Random walk
[159] | PC algorithm | Random walk
[237] | Service graph | Causal inference
[168] | Service graph | Contribution-based
28 R EFERENCES [1] T. Olavsrud, “How to choose your cloud service provider,” 2012. [Online]. Available: https://www2.cio.com.au/article/416752/ how choose your cloud service provider/ [2] “Summary of the amazon s3 service disruption in the northern virginia (us-east-1) region,” 2021. [Online]. Available: https://aws. amazon.com/message/41926/ [3] S. Gunja, “What is devops? unpacking the purpose and importance of an it cultural revolution,” 2021. [Online]. Available: https: //www.dynatrace.com/news/blog/what-is-devops/ [4] Gartner, “Aiops (artificial intelligence for it operations).” [On- line]. Available: https://www.gartner.com/en/information-technology/ glossary/aiops-artificial-intelligence-operations [5] S. Siddique, “The road to enterprise artificial intelligence: A case studies driven exploration,” Ph.D. dissertation, 05 2018. [6] N. Sabharwal, Hands-on AIOps . Springer, 2022. [7] Y. Dang, Q. Lin, and P. Huang, “Aiops: Real-world challenges and re- search innovations,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion) , 2019, pp. 4–5. [8] L. Rijal, R. Colomo-Palacios, and M. S´anchez-Gord´on, “Aiops: A multivocal literature review,” Artificial Intelligence for Cloud and Edge Computing , pp. 31–50, 2022. [9] R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” arXiv preprint arXiv:1901.03407 , 2019. [10] L. Akoglu, H. Tong, and D. Koutra, “Graph based anomaly detection and description: a survey,” Data mining and knowledge discovery , vol. 29, no. 3, pp. 626–688, 2015. [11] J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro)service-based cloud applications: A survey,” 2021. [Online]. Available: https://arxiv.org/abs/2105.12378 [12] V. Davidovski, “Exponential innovation through digital transfor- mation,” in Proceedings of the 3rd International Conference on Applications in Information Technology , ser. ICAIT’2018. New York, NY, USA: Association for Computing Machinery, 2018, p. 3–5. [Online]. Available: https://doi.org/10.1145/3274856.3274858 [13] D. S. Battina, “Ai and devops in information technology and its future in the united states,” INTERNATIONAL JOURNAL OF CREATIVE RESEARCH THOUGHTS (IJCRT), ISSN , pp. 2320–2882, 2021. [14] A. B. Yoo, M. A. Jette, and M. Grondona, “Slurm: Simple linux utility for resource management,” in Workshop on job scheduling strategies for parallel processing . Springer, 2003, pp. 44–60. [15] J. Zhaoxue, L. Tong, Z. Zhenguo, G. Jingguo, Y. Junling, and L. Liangxiong, “A survey on log research of aiops: Methods and trends,” Mob. Netw. Appl. , vol. 26, no. 6, p. 2353–2364, dec 2021. [Online]. Available: https://doi.org/10.1007/s11036-021-01832-3 [16] S. He, P. He, Z. Chen, T. Yang, Y. Su, and M. R. Lyu, “A survey on automated log analysis for reliability engineering,” ACM Comput. Surv. , vol. 54, no. 6, jul 2021. [Online]. Available: https://doi.org/10.1145/3460345 [17] P. Notaro, J. Cardoso, and M. Gerndt, “A survey of aiops methods for failure management,” ACM Trans. Intell. Syst. Technol. , vol. 12, no. 6, nov 2021. [Online]. Available: https://doi.org/10.1145/3483424 [18] X. Han and S. Yuan, “Unsupervised cross-system log anomaly detection via domain adaptation,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management , ser. CIKM ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 3068–3072. [Online]. Available: https://doi.org/ 10.1145/3459637.3482209 [19] V.-H. Le and H. 
Zhang, “Log-based anomaly detection with deep learning: How far are we?” in Proceedings of the 44th International Conference on Software Engineering , ser. ICSE ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 1356–1367. [Online]. Available: https://doi.org/10.1145/3510003.3510155 [20] N. Zhao, H. Wang, Z. Li, X. Peng, G. Wang, Z. Pan, Y. Wu, Z. Feng, X. Wen, W. Zhang, K. Sui, and D. Pei, “An empirical investigation of practical log anomaly detection for online service systems,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , ser. ESEC/FSE 2021. New York, NY, USA: Association for Computing Machinery, 2021, p. 1404–1415. [Online]. Available: https://doi.org/10.1145/3468264.3473933 [21] Y. Zhu, W. Meng, Y. Liu, S. Zhang, T. Han, S. Tao, and D. Pei, “Unilog: Deploy one model and specialize it for all log analysis tasks,” CoRR , vol. abs/2112.03159, 2021. [Online]. Available: https://arxiv.org/abs/2112.03159 [22] J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,” ACM Comput. Surv. , vol. 55, no. 3, feb 2022. [Online]. Available: https://doi.org/10.1145/3501297 [23] L. Korzeniowski and K. Goczyla, “Landscape of automated log anal- ysis: A systematic literature review and mapping study,” IEEE Access , vol. 10, pp. 21 892–21 913, 2022. [24] M. Sheldon and G. V. B. Weissman, “Retrace: Collecting execution trace with virtual machine deterministic replay,” in Proceedings of the Third Annual Workshop on Modeling, Benchmarking and Simulation (MoBS 2007) . Citeseer, 2007. [25] R. Fonseca, G. Porter, R. H. Katz, and S. Shenker, “ { X-Trace } : A pervasive network tracing framework,” in 4th USENIX Symposium on Networked Systems Design & Implementation (NSDI 07) , 2007. [26] J. Zhou, Z. Chen, J. Wang, Z. Zheng, and M. R. Lyu, “Trace bench: An open data set for trace-oriented monitoring,” in 2014 IEEE 6th International Conference on Cloud Computing Technology and Science . IEEE, 2014, pp. 519–526. [27] S. Zhang, C. Zhao, Y. Sui, Y. Su, Y. Sun, Y. Zhang, D. Pei, and Y. Wang, “Robust KPI anomaly detection for large-scale software services with partial labels,” in 32nd IEEE International Symposium on Software Reliability Engineering, ISSRE 2021, Wuhan, China, October 25-28, 2021 , Z. Jin, X. Li, J. Xiang, L. Mariani, T. Liu, X. Yu, and N. Ivaki, Eds. IEEE, 2021, pp. 103–114. [Online]. Available: https://doi.org/10.1109/ISSRE52982.2021.00023 [28] M. Braei and S. Wagner, “Anomaly detection in univariate time-series: A survey on the state-of-the-art,” ArXiv , vol. abs/2004.00433, 2020. [29] A. Bl´azquez-Garc´ ıa, A. Conde, U. Mori, and J. A. Lozano, “A review on outlier/anomaly detection in time series data,” ACM Computing Surveys (CSUR) , vol. 54, no. 3, pp. 1–33, 2021. [30] K. Choi, J. Yi, C. Park, and S. Yoon, “Deep learning for anomaly detection in time-series data: review, analysis, and guidelines,” IEEE Access , 2021. [31] D. Liu, Y. Zhao, H. Xu, Y. Sun, D. Pei, J. Luo, X. Jing, and M. Feng, “Opprentice: Towards practical and automatic anomaly de- tection through machine learning,” in Proceedings of the 2015 internet measurement conference , 2015, pp. 211–224. [32] J. Gao, X. Song, Q. Wen, P. Wang, L. Sun, and H. Xu, “Robusttad: Robust time series anomaly detection via decomposition and convolu- tional neural networks,” arXiv preprint arXiv:2002.09545 , 2020. [33] S. Han, X. Hu, H. Huang, M. Jiang, and Y. 
Zhao, “ADBench: Anomaly detection benchmark,” in Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2022. [Online]. Available: https://openreview.net/forum?id=foA SFQ9zo0 [34] Z. Li, N. Zhao, S. Zhang, Y. Sun, P. Chen, X. Wen, M. Ma, and D. Pei, “Constructing large-scale real-world benchmark datasets for aiops,” arXiv preprint arXiv:2208.03938 , 2022. [35] R. Wu and E. J. Keogh, “Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress,” CoRR , vol. abs/2009.13807, 2020. [Online]. Available: https://arxiv. org/abs/2009.13807 [36] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng et al. , “Unsupervised anomaly detection via variational auto- encoder for seasonal kpis in web applications,” in Proceedings of the 2018 world wide web conference , 2018, pp. 187–196. [37] J. Bu, Y. Liu, S. Zhang, W. Meng, Q. Liu, X. Zhu, and D. Pei, “Rapid deployment of anomaly detection models for large number of emerging kpi streams,” in 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC) . IEEE, 2018, pp. 1–8. [38] T. Huang, P. Chen, and R. Li, “A semi-supervised vae based active anomaly detection framework in multivariate time series for online systems,” in Proceedings of the ACM Web Conference 2022 , 2022, pp. 1797–1806. [39] X.-L. Li and B. Liu, “Learning from positive and unlabeled examples with different data distributions,” in European conference on machine learning . Springer, 2005, pp. 218–229. [40] X. Zhang, J. Kim, Q. Lin, K. Lim, S. O. Kanaujia, Y. Xu, K. Jamieson, A. Albarghouthi, S. Qin, M. J. Freedman et al. , “Cross-dataset time series anomaly detection for cloud systems,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19) , 2019, pp. 1063–1076. [41] N. Laptev, S. Amizadeh, and I. Flint, “Generic and scalable framework for automated time-series anomaly detection,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining , 2015, pp. 1939–1947. [42] S. Guha, N. Mishra, G. Roy, and O. Schrijvers, “Robust random cut forest based anomaly detection on streams,” in International conference on machine learning . PMLR, 2016, pp. 2712–2721.
29 [43] S. Ahmad, A. Lavin, S. Purdy, and Z. Agha, “Unsupervised real-time anomaly detection for streaming data,” Neurocomputing , vol. 262, pp. 134–147, 2017. [44] Z. Li, Y. Zhao, J. Han, Y. Su, R. Jiao, X. Wen, and D. Pei, “Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding,” in KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021 , F. Zhu, B. C. Ooi, and C. Miao, Eds. ACM, 2021, pp. 3220–3230. [Online]. Available: https://doi.org/10.1145/3447548.3467075 [45] J. Audibert, P. Michiardi, F. Guyard, S. Marti, and M. A. Zuluaga, “Usad: Unsupervised anomaly detection on multivariate time series,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , 2020, pp. 3395–3404. [46] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei, “Robust anomaly detection for multivariate time series through stochastic recurrent neural network,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining , 2019, pp. 2828– 2837. [47] Z. Li, Y. Zhao, J. Han, Y. Su, R. Jiao, X. Wen, and D. Pei, “Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining , 2021, pp. 3220–3230. [48] W. Yang, K. Zhang, and S. C. Hoi, “Causality-based multivariate time series anomaly detection,” arXiv preprint arXiv:2206.15033 , 2022. [49] F. Ayed, L. Stella, T. Januschowski, and J. Gasthaus, “Anomaly detection at scale: The case for deep distributional time series models,” in International Conference on Service-Oriented Computing . Springer, 2020, pp. 97–109. [50] S. Rabanser, T. Januschowski, K. Rasul, O. Borchert, R. Kurle, J. Gasthaus, M. Bohlke-Schneider, N. Papernot, and V. Flunkert, “In- trinsic anomaly detection for multi-variate time series,” arXiv preprint arXiv:2206.14342 , 2022. [51] J. Hochenbaum, O. S. Vallis, and A. Kejariwal, “Automatic anomaly detection in the cloud via statistical learning,” arXiv preprint arXiv:1704.07706 , 2017. [52] H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, J. Tong, and Q. Zhang, “Time-series anomaly detection service at microsoft,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining , 2019, pp. 3009– 3017. [53] J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,” ACM Comput. Surv. , vol. 55, no. 3, pp. 59:1–59:39, 2023. [Online]. Available: https://doi.org/10.1145/3501297 [54] J. Bu, Y. Liu, S. Zhang, W. Meng, Q. Liu, X. Zhu, and D. Pei, “Rapid deployment of anomaly detection models for large number of emerging KPI streams,” in 37th IEEE International Performance Computing and Communications Conference, IPCCC 2018, Orlando, FL, USA, November 17-19, 2018 . IEEE, 2018, pp. 1–8. [Online]. Available: https://doi.org/10.1109/PCCC.2018.8711315 [55] Z. Z. Darban, G. I. Webb, S. Pan, C. C. Aggarwal, and M. Salehi, “Deep learning for time series anomaly detection: A survey,” CoRR , vol. abs/2211.05244, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2211.05244 [56] B. Huang, K. Zhang, M. Gong, and C. 
Glymour, “Causal discovery and forecasting in nonstationary environments with state-space models,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA , ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 2901–2910. [Online]. Available: http://proceedings.mlr.press/v97/huang19g.html [57] Q. Pham, C. Liu, D. Sahoo, and S. C. H. Hoi, “Learning fast and slow for online time series forecasting,” CoRR , vol. abs/2202.11672, 2022. [Online]. Available: https://arxiv.org/abs/2202.11672 [58] K. Lai, D. Zha, J. Xu, Y. Zhao, G. Wang, and X. Hu, “Revisiting time series outlier detection: Definitions and benchmarks,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual , J. Vanschoren and S. Yeung, Eds., 2021. [Online]. Available: https://datasets-benchmarks-proceedings.neurips.cc/paper/ 2021/hash/ec5decca5ed3d6b8079e2e7e7bacc9f2-Abstract-round1.html [59] R. Wu and E. Keogh, “Current time series anomaly detection bench- marks are flawed and are creating the illusion of progress,” IEEE Transactions on Knowledge and Data Engineering , 2021. [60] X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, and L. He, “A survey of human-in-the-loop for machine learning,” Future Gener. Comput. Syst. , vol. 135, pp. 364–381, 2022. [Online]. Available: https://doi.org/10.1016/j.future.2022.05.014 [61] D. Sahoo, Q. Pham, J. Lu, and S. C. H. Hoi, “Online deep learning: Learning deep neural networks on the fly,” CoRR , vol. abs/1711.03705, 2017. [Online]. Available: http://arxiv.org/abs/1711.03705 [62] Z. Chen, J. Liu, W. Gu, Y. Su, and M. R. Lyu, “Experience report: Deep learning-based system log analysis for anomaly detection,” 2021. [Online]. Available: https://arxiv.org/abs/2107.05908 [63] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in 2017 IEEE International Conference on Web Services (ICWS) , 2017, pp. 33–40. [64] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios, “Clustering event logs using iterative partitioning,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , ser. KDD ’09. New York, NY, USA: Association for Computing Machinery, 2009, p. 1255–1264. [Online]. Available: https://doi.org/10.1145/1557019.1557154 [65] Z. M. Jiang, A. E. Hassan, P. Flora, and G. Hamann, “Abstracting execution logs to execution events for enterprise applications (short paper),” in 2008 The Eighth International Conference on Quality Software , 2008, pp. 181–186. [66] M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in 2016 IEEE 16th International Conference on Data Mining (ICDM) , 2016, pp. 859–864. [67] R. Vaarandi and M. Pihelgas, “Logcluster - A data clustering and pattern mining algorithm for event logs,” in 11th International Conference on Network and Service Management, CNSM 2015, Barcelona, Spain, November 9-13, 2015 , M. Tortonesi, J. Sch¨onw¨alder, E. R. M. Madeira, C. Schmitt, and J. Serrat, Eds. IEEE Computer Society, 2015, pp. 1–7. [Online]. Available: https: //doi.org/10.1109/CNSM.2015.7367331 [68] Q. Fu, J.-G. Lou, Y. Wang, and J. Li, “Execution anomaly detection in distributed systems through unstructured log analysis,” in 2009 Ninth IEEE International Conference on Data Mining , 2009, pp. 149–158. [69] L. Tang, T. Li, and C.-S. 
Perng, “Logsig: Generating system events from raw textual logs,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management , ser. CIKM ’11. New York, NY, USA: Association for Computing Machinery, 2011, p. 785–794. [Online]. Available: https://doi.org/10.1145/2063576.2063690 [70] M. Mizutani, “Incremental mining of system log format,” in 2013 IEEE International Conference on Services Computing , 2013, pp. 595–602. [71] K. Shima, “Length matters: Clustering system log messages using length of words,” CoRR , vol. abs/1611.03213, 2016. [Online]. Available: http://arxiv.org/abs/1611.03213 [72] H. Hamooni, B. Debnath, J. Xu, H. Zhang, G. Jiang, and A. Mueen, “Logmine: Fast pattern recognition for log analytics,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management , ser. CIKM ’16. New York, NY, USA: Association for Computing Machinery, 2016, p. 1573–1582. [Online]. Available: https://doi.org/10.1145/2983323.2983358 [73] R. Vaarandi, “A data clustering algorithm for mining patterns from event logs,” in Proceedings of the 3rd IEEE Workshop on IP Operations & Management (IPOM 2003) (IEEE Cat. No.03EX764) , 2003, pp. 119– 126. [74] M. Nagappan and M. A. Vouk, “Abstracting log lines to log event types for mining software system logs,” in 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010) , 2010, pp. 114–117. [75] S. Messaoudi, A. Panichella, D. Bianculli, L. Briand, and R. Sasnauskas, “A search-based approach for accurate identification of log message formats,” in Proceedings of the 26th Conference on Program Comprehension , ser. ICPC ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 167–177. [Online]. Available: https://doi.org/10.1145/3196321.3196340 [76] S. Nedelkoski, J. Bogatinovski, A. Acker, J. Cardoso, and O. Kao, “Self-supervised log parsing,” in Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track , Y. Dong, D. Mladeni´c, and C. Saunders, Eds. Cham: Springer International Publishing, 2021, pp. 122–138. [77] Y. Liu, X. Zhang, S. He, H. Zhang, L. Li, Y. Kang, Y. Xu, M. Ma, Q. Lin, Y. Dang, S. Rajmohan, and D. Zhang, “Uniparser: A unified log parser for heterogeneous log data,” in Proceedings of the ACM Web Conference 2022 , ser. WWW ’22. New York, NY, USA: