1_Cheng23_AIOps_Survey (2)

pdf

School

Concordia University *

*We aren’t endorsed by this school

Course

691

Subject

Information Systems

Date

Oct 30, 2023

Type

pdf

Pages

34

Uploaded by BaronSandpiperMaster927

Report
1 AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges Qian Cheng *† , Doyen Sahoo * , Amrita Saha, Wenzhuo Yang, Chenghao Liu, Gerald Woo, Manpreet Singh, Silvio Saverese, and Steven C. H. Hoi Salesforce AI Abstract —Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There are a wide variety of problems to address, and multiple use-cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively under explored topics, especially those that could significantly benefit from advances in AI literature. We also provide insights into the trends in this field, and what are the key investment opportunities. Index Terms —AIOps, Artificial Intelligence, IT Operations, Machine Learning, Anomaly Detection, Root-cause Analysis, Failure Prediction, Resource Management I. I NTRODUCTION Modern software has been evolving rapidly during the era of digital transformation. New infrastructure, techniques and design patterns - such as cloud computing, Software-as-a- Service (SaaS), microservices, DevOps, etc. have been devel- oped to boost software development. Managing and operating the infrastructure of such modern software is now facing new challenges. For example, when traditional software transits to SaaS, instead of handing over the installation package to the user, the software company now needs to provide 24/7 software access to all the subscription based users. Besides developing and testing, service management and operations are now the new set of duties of SaaS companies. Meanwhile, traditional software development separates functionalities of the entire software lifecycle. Coding, testing, deployment and operations are usually owned by different groups. Each of these groups requires different sets of skills. However, agile development and DevOps start to obfuscate the boundaries between each process and DevOps engineers are required to take E2E responsibilities. Balancing development and opera- tions for a DevOps team become critical to the whole team’s productivity. * Equal Contribution Work done when author was with Salesforce AI Software services need to guarantee service level agree- ments (SLAs) to the customers, and often set internal Service Level Objectives (SLOs). Meeting SLAs and SLOs is one of the top priority for CIOs to choose the right service providers[1]. Unexpected service downtime can impact avail- ability goals and cause significant financial and trust issues. For example, AWS experienced a major service outage in December 2021, causing multiple first and third party websites and heavily used services to experience downtime [2]. 
IT Operations plays a key role in the success of modern software companies and as a result multiple concepts have been introduced, such as IT service management (ITSM) specifically for SaaS, and IT operations management (ITOM) for general IT infrastructure. These concepts focus on different aspects IT operations but the underlying workflow is very similar. Life cycle of Software systems can be separated into several main stages, including planning, development/coding, building, testing, deployment, maintenance/operations, moni- toring, etc. [3]. The operation part of DevOps can be further broken down into four major stages: observe, detect, engage and act, shown in Figure 1. Observing stage includes tasks like collecting different telemetry data (metrics, logs, traces, etc.), indexing and querying and visualizing the collected telemetries. Time-to-observe (TTO) is a metric to measure the performance of the observing stage. Detection stage includes tasks like detecting incidents, predicting failures, finding cor- related events, etc. whose performance is typically measured as the Time-to-detect (TTD) (in addition to precision/recall). Engaging stage includes tasks like issue triaging, localiza- tion, root-cause analysis, etc., and the performance is often measured by Time-to-triage (TTT). Acting stage includes immediate remediation actions such as reboot the server, scale-up / scale-out resources, rollback to previous versions, etc. Time-to-resolve (TTR) is the key metric measured for the acting stage. Unlike software development and release, where we have comparatively mature continuous integration and continuous delivery (CI/CD) pipelines, many of the post- release operations are often done manually. Such manual operational processes face several challenges: Manual operations struggle to scale. The capacity of manual operations is limited by the size of the DevOps team and the team size can only increase linearly. When the software usage is at growing stage, the throughput and workloads may grow exponentially, both in scale and complexity. It is difficult for DevOps team to grow at the arXiv:2304.04661v1 [cs.LG] 10 Apr 2023
2 Fig. 1. Common DevOps life cycles[3] and ops breakdown. Ops can comprise four stages: observe, detect, engage and act. Each of the stages has a corresponding measure: time-to-observe, time-to-detect, time-to-triage and time-to-resolve. same pace to handle the increasing amount of operational workload. Manual operations is hard to standardize. It is very hard to keep the same high standard across the entire DevOps team given the diversity of team members (e.g. skill level, familiarity with the service, tenure, etc.). It takes significant amount of time and effort to grow an operational domain expert who can effectively handle incidents. Unexpected attrition of these experts could significantly hurt the operational efficiency of a DevOps team. Manual operations are error-prone. It is very common that human operation error causes major incidents. Even for the most reliable cloud service providers, major incidents have been caused by human error in recent years. Given these challenges, fully-automated operations pipelines powered by AI capabilities becomes a promising approach to achieve the SLA and SLO goals. AIOps, an acronym of AI for IT Operations, was coined by Gartner at 2016. According to Gartner Glossary, ”AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination”[4]. In order to achieve fully- automated IT Operations, investment in AIOps technolgies is imperative. AIOps is the key to achieve high availability, scalability and operational efficiency . For example, AIOps can use AI models can automatically analyze large volumes of telemetry data to detect and diagnose incidents much faster, and much more consistently than humans, which can help achieve ambitious targets such as 99.99 availability. AIOps can dynamically scale its capabilities with growth demands and use AI for automated incident and resource management, thereby reducing the burden of hiring and training domain experts to meet growth requirements. Moreover, automation through AIOps helps save valuable developer time, and avoid fatigue. AIOps, as an emerging AI technology, appeared on the trending chart of Gartner Hyper Cycle for Artificial Intelligence in 2017 [5], along with other popular topics such as deep reinforcement learning, nature-language generation and artificial general intelligence. As of 2022, enterprise AIOps solutions have witnessed increased adoption by many companies’ IT infrastructure. The AIOps market size is predicted to be $11.02B by end of 2023 with cumulative annual growth rate (CAGR) of 34%. AIOps comprises a set of complex problems. Transforming from manual to automated operations using AIOps is not a one-step effort. Based on the adoption level of AI techniques, we break down AIOps maturity into four different levels based on the adoption of AIOps capabilities as shown in Figure 2. Fig. 2. AIOps Transformation. Different maturity levels based on adoption of AI techniques: Manual Ops, human-centric AIOps, machine-centric AIOps, fully-automated AIOps. Manual Ops. At this maturity level, DevOps follows tra- ditional best practices and all processes are setup manually. There is no AI or ML models. This is the baseline to compare with in AIOps transformation. Human-centric . At this level, operations are done mainly in manual process and AI techniques are adopted to replace sub- procedures in the workflow, and mainly act as assistants. 
For example, instead of glass watching for incident alerts, DevOps or SREs can set dynamic alerting threshold based on anomaly detection models. Similarly, the root cause analysis process requires watching multiple dashboards to draw insights, and AI can help automatically obtain those insights. Machine-centric . At this level, all major components (mon- itoring, detecting, engaging and acting) of the E2E operation process are empowered by more complex AI techniques.
3 Humans are mostly hands-free but need to participate in the human-in-the-loop process to help fine-tune and improve the AI systems performance. For example, DevOps / SREs operate and manage the AI platform to guarantee training and inference pipelines functioning well, and domain experts need to provide feedback or labels for AI-made decisions to improve performance. Fully-automated . At this level, AIOps platform achieves full automation with minimum or zero human intervention. With the help of fully-automated AIOps platforms, the current CI/CD (continuous integration and continuous deployment) pipelines can be further extended to CI/CD/CM/CC (continu- ous integration, continuous deployment, continuous monitor- ing and continuous correction) pipelines. Different software systems, and companies may be at dif- ferent levels of AIOps maturity, and their priorities and goals may differ with regard to specific AIOps capabilities to be adopted. Setting up the right goals is important for the success of AIOps applications. We foresee the trend of shifting from manual operation all the way to fully-automated AIOps in the future, with more and more complex AI techniques being used to address challenging problems. In order to enable the community to adopt AIOps capabilities faster, in this paper, we present a comprehensive survey on the various AIOps problems and tasks and the solutions developed by the community to address them. II. C ONTRIBUTION OF T HIS S URVEY Increasing number of research studies and industrial prod- ucts in the AIOps domain have recently emerged to address a variety of problems. Sabharwal et al. published a book ”Hands- on AIOps” to discuss practical AIOps and implementation [6]. Several AIOps literature reviews are also accessible [7] [8] to help audiences better understand this domain. However, there are very limited efforts to provide a holistic view to deeply connect AIOps with latest AI techniques. Most of the AI related literature reviews are still topic-based, such as deep learning anomaly detection [9] [10], failure management, root-cause analysis [11], etc. There is still limited effort to provide a holistic view about AIOps, covering the status in both academia and industry. We prepare this survey to address this gap, and focus more on AI techniques used in AIOps. Except for the monitoring stage, where most of the tasks focus on telemetry data collection and management, AIOps covers the other three stages where the tasks focus more on analytics. In our survey, we group AIOps tasks based on which operational stage they can contribute to, shown in Figure 3. Incident Detection. Incident detection tasks contribute to detection stage. The goal of these tasks are reducing mean- time-to-detect (MTTD). In our survey we cover time series incident detection (Section IV-A), log incident detection (Sec- tion IV-B), trace and multimodal incident detection (Section IV-C). Failure Prediction. Failure prediction also contributes to detection stage. The goal of failure prediction is to predict the potential issue before it actually happens so actions can be taken in advance to minimize impact. Failure prediction also contributes to reducing mean-time-to-detect (MTTD). In our survey we cover metric failure prediction (Section V-A) and log failure prediction (Section V-B). There are very limited efforts in literature that perform traces and multimodal failure prediction. Root-cause Analysis. 
Root-cause analysis tasks contributes to multiple operational stages, including triaging, acting and even support more efficient long-term issue fixing and reso- lution. Helping as an immediate response to an incident, the goal is to minimize time to triage (MTTT), and simultaneously contribute to reduction on reducing Mean Time to Resolve (MTTR). An added benefit is also reduction in human toil. We further breakdown root-cause analysis into time-series RCA (Section VI-B), logs RCA (Section VI-B) and traces and multimodal RCA (Section VI-C). Automated Actions. Automated actions contribute to acting stage, where the main goal is to reduce mean-time-to-resolve (MTTR), as well as long-term issue fix and resolution. In this survey we discuss about a series of methods for auto- remediation (Section VII-A), auto-scaling (Section VII-B) and resource management (Section VII-C). III. D ATA FOR AIO PS Before we dive into the problem settings, it is important to understand the data available to perform AIOps tasks. Modern software systems generate tremendously large volumes of observability metrics. The data volume keeps growing expo- nentially with digital transformation [12]. The increase in the volume of data stored in large unstructured Data lake systems makes it very difficult for DevOps teams to consume the new information and fix consumers’ problems efficiently [13]. Successful products and platforms are now built to address the monitoring and logging problems. Observability platforms, e.g. Splunk, AWS Cloudwatch, are now supporting emitting, storing and querying large scale telemetry data. Similar to other AI domains, observability data is critical to AIOps. Unfortunately there are limited public datasets in this domain and many successful AIOps research efforts are done with self-owned production data, which usually are not available publicly. In this section, we describe major telemetry data type including metrics, logs, traces and other records, and present a collection of public datasets for each data type. A. Metrics Metrics are numerical data measured over time which provide a snapshot of the system behavior. Metrics can rep- resent a broad range of information, broadly classified into compute metrics and service metrics. Compute metrics (e.g. CPU utilization, memory usage, disk I/O) are an indicator of the health status of compute nodes (servers, virtual machines, pods). They are collected at the system level using tools such as Slurm [14] for usage statistics from jobs and nodes, and the Lustre parallel distributed file system for I/O information. Service metrics (e.g. request count, page visits, number of errors) measure the quality and level of service of customer facing applications. Aggregate statistics of such numerical data
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
4 Fig. 3. AIOps Tasks. In this survey we discuss a series of AIOps tasks, categorized by which operational stages these tasks contribute to, and the observability data type it takes. also fall under the category of metrics, providing a more coarse-grained view of system behavior. Metrics are constantly generated by all components of the cloud platform life cycle, making it one of the most ubiquitous forms of AIOps data. Cloud platforms and supercomputer clusters can generate petabytes of metrics data, making it a challenge to store and analyze, but at the same time, brings immense observability to the health of the entire IT operation. Being numerical time-series data, metrics are simple to interpret and easy to analyze, allowing for simple threshold- based rules to be acted upon. At the same time, they contain sufficiently rich information to be used to power more complex AI based alerting and actions. The major challenge in leveraging insights from metrics data arises due to their diverse nature. Metrics data can exhibit a variety of patterns, such as cyclical patterns (repeating patterns hourly, daily, weekly, etc.), sparse and intermittent spikes, and noisy signals. The characteristics of the metrics ultimately depend on the underlying service or job. In Table I, we briefly describe the datasets and benchmarks of metrics data. Metrics data have been used in studies char- acterizing the workloads of cloud data centers, as well as the various AIOps tasks of incident detection, root cause analysis, failure prediction, and various planning and optimization tasks like auto-scaling and VM pre-provisioning. B. Logs Software logs are specifically designed by the software developers in order to record any type of runtime information about processes executing within a system - thus making them an ubiquitous part of any modern system or software maintenance. Once the system is live and throughout its life- cycle, it continuously emits huge volumes of such logging data which naturally contain a lot of rich dynamic runtime information relevant to IT Operations and Incident Man- agement of the system. Consequently in AI driven IT-Ops pipelines, automated log based analysis plays an important role in Incident Management - specifically in tasks like Incident Detection and Causation and Failure Prediction, as have been Fig. 4. GPU utilization metrics from the MIT Supercloud Dataset exhibiting various patterns (cyclical, sparse and intermittant, noisy). studied by multiple literature surveys in the past [15], [16], [17], [18], [19], [20], [21], [22], [23]. In most of the practical cases, especially in industrial settings, the volume of the logs can go upto an order of petabytes of loglines per week. Also because of the nature of log content, log data dumps are much more heavier in size in comparison to time series telemetry data. This requires special handling of logs observability data in form of data streams, - where today, there are various services like Splunk, Datadog, LogStash, NewRelic, Loggly, Logz.io etc employed to efficiently store and access the log stream and also visualize, analyze and query past log data using specialized structured query language. Nature of Log Data. Typically these logs consist of semi- structured data i.e. a combination of structured and unstruc- tured data. Amongst the typical types of unstructured data there can be natural language tokens, programming language constructs (e.g. 
method names) and the structured part can consist of quantitative or categorical telemetry or observability metrics data, which are printed in runtime by various logging statements embedded in the source-code or sometimes gener- ated automatically via loggers or logging agents. Depending on the kind of service the logs are dumped from, there can be
5 a diverse types of logging data with heterogeneous form and content. For example, logs can be originating from distributed systems (e.g. hadoop or spark), operating systems (windows or linux) or in complex supercomputer systems or can be dumped at hardware level (e.g. switch logs) or middle-ware level (like servers e.g. Apache logs) or by specific applications (e.g. Health App). Typically each logline comprises of a fixed part which is the template that had been designed by the developer and some variable part or parameters which capture some runtime information about the system. Complexities of Log Data. Thus, apart from being one of the most generic and hence crucial data-sources in IT Ops, logs are one of the most complex forms of observability data due to their open-ended form and level of granularity at which they contain system runtime information. In cloud computing context, logs are the source of truth for cloud users to the underlying servers that running their applications since cloud providers don’t grant full access to their users of the servers and platforms. Also, being designed by developers, logs are immediately affected by any changes in the source- code or logging statements by developers. This results in non-stationarity in the logging vocabulary or even the entire structure or template underlying the logs. Log Observability Tasks. Log observability typically in- volves different tasks like anomaly detection over logs during incident detection (Section IV-B), root cause analysis over logs (Section VI-B) and log based failure prediction (Section V-B). Datasets and Benchmarks. Out of the different log ob- servability tasks, log based anomaly detection is one of the most objective tasks and hence most of the publicly released benchmark datasets have been designed around anomaly de- tection. In Table B, we give a comprehensive description about the different public benchmark datasets that have been used in the literature for anomaly detection tasks. Out of these, datasets Switch and subsets of HPC and BGL have also been redesigned to serve failure prediction task. On the other hand there are no public benchmarks on log based RCA tasks, which has been typically evaluated on private enterprise data. C. Traces Trace data are usually presented as semi-structured logs, with identifiers to reconstruct the topological maps of the applications and network flows of target requests. For example, when user uses Google search, a typical trace graph of this user request looks like in Figure 6. Traces are composed system events (spans) that tracks the entire progress of a request or execution. A span is a sequence of semi-structured event logs. Tracing data makes it possible to put different data modality into the same context. Requests travel through multiple services / applications and each application may have totally different behavior. Trace records usually contains two required parts: timestamps and span id. By using the timestamps and span id, we can easily reconstruct the trace graph from trace logs. Fig. 5. An example of Log Data generated in IT Operations Fig. 6. An snapshot of trace graph of user requests when using Google Search. Trace analysis requires reliable tracing systems. Trace col- lection systems such as ReTrace [24] can help achieve fast and inexpensive trace collections. Trace collectors are usually code agnostic and can emit different levels of performance trace data back to the trace stores in near real-time. 
Early summarization is also involved in the trace collection process to help generate fine-grained events [25]. Although trace collection is common for system observ- ability, it is still challenging to acquire high quality trace data to train AI models. As far as we know, there are very few public trace datasets with high quality labels. Also the only few existing public trace datasets like [26] are not widely adopted in AIOps research. Instead, most AIOps related trace analysis research use self-owned production or simulation trace data,
6 which are generally not available publicly. D. Other Data Besides the machine generated observability data like met- rics, logs, traces, etc., there are other types of operational data that could be used in AIOps. Human activity records is part of these valuable data. Ticketing systems are used for DevOps/SREs to communicate and efficiently resolve the issues. This process generates large amount of human activity records. The human activity data contains rich knowledge and learnings about solutions to existing issues, which can be used to resolve similar issues in the future. User feedback data is also very important to improve AIOps system performance. Unlike the issue tickets where human needs to put lots of context information to describe and discuss the issue, user feedback can be as simple as one click to confirm if the alert is good or bad. Collecting real-time user feedback of a running system and designing human-in-the- loop workflows are also very significant for success of AIOps solutions. Although many companies collects these types of data and use them to improve their operation workflows, there are still very limited published research discussing how to systematically incorporate these other types of operational data in AIOps solutions. This brings challenges as well as opportunities to make further improvements in AIOps domain. Next, we discuss the key AIOps Tasks - Incident Detec- tion, Failure Prediction, Root Cause Analysis, and Automated Actions, and systematically review the key contributions in literature in these areas. IV. I NCIDENT D ETECTION Incident detection employs a variety of anomaly detec- tion techniques. Anomaly detection is to detect abnormali- ties, outliers or generally events that not normal. In AIOps context, anomaly detection is widely adopted in detecting any types of abnormal system behaviors. To detect such anomalies, the detectors need to utilize different telemetry data, such as metrics, logs, traces. Thus, anomaly detection can be further broken down to handling one or more specific telemetry data sources, including metric anomaly detection, log anomaly detection, trace anomaly detection. Moreover, multi-modal anomaly detection techniques can be employed if multiple telemetry data sources are involved in the detec- tion process. In recent years, deep learning based anomaly detection techniques [9] are also widely discussed and can be utilized for anomaly detection in AIOps. Another way to distinguish anomaly detection techniques is depending on different application use cases, such as detecting service health issues, detecting networking issues, detecting security issues, fraud transactions, etc. Usually these variety of techniques are derived from same set of base detection algorithms and localized to handle specific tasks. From technical perspective, detecting anomalies from different telemetry data sources are better aligned with the AI technology definitions, such as, metric are usually time-series, logs are text / natural language, traces are event sequences/graphs, etc. In this article, we discuss anomaly detection by different telemetry data sources. A. Metrics based Incident Detection Problem Definition To ensure the reliability of services, billions of metrics are constantly monitored and collected at equal-space timestamp [27]. Therefore, it is straightforward to organize metrics as time series data for subsequent analysis. 
Metric based incident detection, which aims to find the anomalous behaviors of monitored metrics that significantly deviate from the other observations, is vital for operators to timely detect software failures and trigger failure diagnosis to mitigate loss. The most basic form of incident detection on metrics is the rule-based method which sets up an alert when a metric breaches a certain threshold. Such an approach is only able to capture incidents which are defined by the metric exceeding the threshold, and is unable to detect more complex incidents. The rule- based method to detect incidents on metrics are generally too naive, and only able to account for the most simple of incidents. They are also sensitive to the threshold, producing too many false positives when the threshold is too low, and false negatives when the threshold is too high. Due to the open- ended nature of incidents, increasingly complex architectures of systems, and increasing size of these systems and number of metrics, manual monitoring and rule-based methods are no longer sufficient. Thus, more advanced metric-based incident detection methods that leveraging AI capability is urgent. As metrics are a form of time series data, and incidents are expressed as an abnormal occurrence in the data, metric incident detection is most often formulated as a time series anomaly detection problem [28], [29], [30]. In the following, we focus on the AIOps setting and categorize it based on several key criteria: (i) learning paradigm, (ii) dimensionality, (iii) system, and (iv) streaming updates. We further summa- rize a list of time series anomaly detection methods with a comparison over these criteria in Table IV. Learning Setting a) Label Accessibility: One natural way to formulate the anomaly detection problem, is as the supervised binary classification problem, to classify whether a given obser- vation is an anomaly or not [31], [32]. Formulating it as such has the benefit of being able to apply any supervised learning method, which has been intensely studied in the past decades [33]. However, due to the difficulty in obtaining labelled data for metrics incident detection [34] and labels of anomalies are prone to error [35], unsupervised approaches, which do not require labels to build anomaly detectors, are generally preferred and more widespread. Particularly, unsu- pervised anomaly detection methods can be roughly catego- rized into density-based methods, clustering-based methods, and reconstruction-based methods [28], [29], [30]. Density- based methods compute local density and local connectivity for outlier decision. Clustering-based methods formulate the anomaly score as the distance to cluster center. Reconstruction- based methods explicitly model the generative process of the
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
7 data and measure the anomaly score with the reconstruction error. While methods in metric anomaly detection are generally unsupervised, there are cases where there is some access to labels. In such situations, semi-supervised, domain adaptation, and active learning paradigms come into play. The semi- supervised paradigm [36], [37], [38] enables unsupervised models to leverage information from sparsely available posi- tive labels [39]. Domain adaptation [40] relies on a labelled source dataset, while the target dataset is unlabeled, with the goal of transferring a model trained on the source dataset, to perform anomaly detection on the target. b) Streaming Update: Since metrics are collected in large volume every minute, the model is used online to detect anomalies. It is very common that temporal patterns of metrics change overtime. The ability to perform timely model updates when receiving new incoming data is an important criteria. On the one hand, conventional models can handle the data stream via retraining the whole model periodically [31], [41], [32], [38]. However, this strategy could be computationally expensive, and bring extra non-trivial questions, such as, how often should this retraining be performed. On the other hand, some methods [42], [43] have efficient updating mechanisms inbuilt, and are naturally able to adapt to these new incoming data streams. It can also support active learning paradigm [41], which allows models to interactively query users for labels on data points for which it is uncertain about, and subsequently update the model with the new labels. c) Dimensionality: Each metric of monitoring data forms a univariate time series, and thus a service usually contains multiple metrics, each of which describes a different part or attribute of a complex entity, constituting a multivariate time series. The conventional solution is to build univariate time series anomaly detection for each metric. However, for a complex system, it ignores the intrinsic interactions among each metric and cannot well represent the system’s overall status. Naively combining the anomaly detection results of each univariate time series performs poorly for multivariate anomaly detection method [44], since it cannot model the inter-dependencies among metrics for a service. Model A wide range of machine learning models can be used for time series anomaly detection, broadly classified as deep learning models, tree-based models, and statistical models. Deep learning models [45], [36], [46], [47], [38], [48], [49], [50] leverage the success and power deep neural networks to learn representations of the time series data. These represen- tations of time series data contain rich semantic information of the underlying metric, and can be used as a reconstruction- based, unsupervised method. Tree-based methods leverage a tree structure as a density-based, unsupervised method [42]. Statistical models [51] rely on classical statistical tests, which are considered a reconstruction-based method. Industrial Practices Building a system which can handle the large amounts of metric data generated in real cloud IT operations is often an issue. This is because the metric data in real-world scenarios is quite diverse and the definition of anomaly may vary in different scenarios. Moreover, almost all time series anomaly detection systems require to handle a large amount of metrics in parallel with low-latency [32]. Thus, works which propose a system to handle the infrastructure are highlighted here. 
EGADS [41] is a system by Yahoo!, scaling up to millions of data points per second, and focuses on optimizing real-time processing. It comprises a batch time series modelling module, an online anomaly detection module, and an alerting module. It leverages a variety of unsupervised methods for anomaly detection, and an optional active learning component for filtering alerts. [52] is a system by Microsoft, which includes three major components, a data ingestion, experimentation, and online compute platform. They propose an efficient deep learning anomaly detector to achieve high accuracy and high efficiency at the same time. [32] is a system by Alibaba group, comprising data ingestion, offline training, online service, and visualization and alarms modules. They propose a robust anomaly detector by using time series decomposition, and thus can easily handle time series with different characteristics, such as different seasonal length, different types of trends, etc. [38] is a system by Tencent, comprising of a offline model training component and online serving component, which employs active learning to update the online model via a small number of uncertain samples. Challenges Lack of labels The main challenge of metric anomaly detection is the lack of ground truth anomaly labels [53], [44]. Due to the open-ended nature and complexity of incidents in server architectures, it is difficult to define what an anomaly is. Thus, building labelled datasets is an extremely labor and resource intensive exercise, one which requires the effort of domain experts to identify anomalies from time series data. Furthermore, manual labelling could lead to labelling errors as there is no unified and formal definition of an anomaly, leading to subjective judgements on ground truth labels [35]. Real-time inference A typical cloud infrastructure could collect millions of data points in a second, requiring near real- time inference to detect anomalies. Metric anomaly detection systems need to be scalable and efficient [54], [53], optionally supporting model retraining, leading to immense compute, memory, and I/O loads. The increasing complexity of anomaly detection models with the rising popularity of deep learning methods [55] add a further strain on these systems due to the additional computational cost these larger models bring about. Non-stationarity of metric streams The temporal patterns of metric data streams typically change over time as they are generated from non-stationary environments [56]. The evo- lution of these patterns is often caused by exogenous factors which are not observable. One such example is that the growth in the popularity of a service would cause customer metrics (e.g. request count) to drift upwards over time. Ignoring these factors would cause a deterioration in the anomaly detector’s performance. One solution is to continuously update the model with the recent data [57], but this strategy requires carefully balancing of the cost and model robustness with respect to the updating frequency. Public benchmarks While there exists benchmarks for general anomaly detection methods and time series anomaly detection methods [33], [58], there is still a lack of benchmark- ing for metric incident detection in AIOps domain. Given the
8 wide and diverse nature of time series data, they often exhibit a mixture of different types of anomaly depends on specific domain, making it challenging to understand the pros and cons of algorithms [58]. Furthermore, existing datasets have been criticised to be trivial and mislabelled [59]. Future Trends Active learning/human-in-the-loop To address the prob- lem of lacking of labels, a more intelligent way is to integrate human knowledge and experience with minimum cost. As special agents, humans have rich prior knowledge [60]. If the incident detection framework can encourage the machine learning model to engage with learning operation expert wis- dom and knowledge, it would help deal with scarce and noise label issue. The use of active learning to update online model in [38] is a typical example to incorporate human effort in the annotation task. There are certainly large research scope for incorporating human effort in other data processing step, like feature extraction. Moreover, the human effort can also be integrated in the machine learning model training and inference phase. Streaming updates Due to the non-stationarity of metric streams, keeping the anomaly detector updated is of utmost importance. Alongside the increasingly complex models and need for cost-effectiveness, we will see a move towards methods with the built-in capability of efficient streaming updates. With the great success of deep learning methods in time series anomaly detection tasks [30]. Online deep learning is an increasingly popular topic [61], and we may start to see a transference of techniques into metric anomaly detection for time-series in the near future. Intrinsic anomaly detection Current research works on time series anomaly detection do not distinguish the cause or the type of anomaly, which is critical for the subsequent mitigation steps in AIOps. For example, even anomaly are suc- cessfully detected, which is caused by extrinsic environment, the operator is unable to mitigate its negative effect. Intro- duced in [50], [48], intrinsic anomaly detection considers the functional dependency structure between the monitored metric, and the environment. This setting considers changes in the environment, possibly leveraging information that may not be available in the regular (extrinsic) setting. For example, when scaling up/down the resources serving an application (perhaps due to autoscaling rules), we will observe a drop/increase in CPU metric. While this may be considered as an anomaly in the extrinsic setting, it is in fact not an incident and accordingly, is not an anomaly in the intrinsic setting. B. Logs based Incident Detection Problem Definition Software and system logging data is one of the most popular ways of recording and tracking runtime information about all ongoing processes within a system, to any arbitrary level of granularity. Overall, a large distributed system can have massive volume of heterogenous logs dumped by its different services or microservices, each having time-stamped text messages following their own unstructured or semi- structured or structured format. Throughout various kinds of IT Operations these logs have been widely used by relia- bility and performance engineers as well as core developers in order to understand the system’s internal status and to facilitate monitoring, administering, and troubleshooting [15], [16], [17], [18], [19], [20], [21], [22], [62]. 
More, specifically, in the AIOps pipeline, one of the foremost tasks that log analysis can cater to is log based Incident Detection. This is typically achieved through anomaly detection over logs which aims to detect the anomalous loglines or sequences of loglines that indicate possible occurrence of an incident, from the humungous amounts of software logging data dumps generated by the system. Log based anomaly detection is generally applied once an incident has been detected based on monitoring of KPI metrics, as a more fine-grained incident detection or failure diagnosis step in order to detect which service or micro-service or which software module of the system execution is behaving anomalously. Task Complexity Diversity of Log Anomaly Patterns : There are very diverse kinds of incidents in AIOps which can result in different kinds of anomaly patterns in the log data - either manifesting in the log template (i.e. the constant part of the log line) or the log parameters (i.e. the variable part of the log line containing dynamic information). These are i) keywords - appearance of keywords in log lines bearing domain-specific semantics of failure or incident or abnormality in the system (e.g. out of memory or crash) ii) template count - where a sudden increase or decrease of log templates or log event types is indicative of anomaly iii) template sequence - where some significant deviation from the normal order of task execution is indicative of anomaly iv) variable value - some variables associated with some log templates or events can have physical meaning (e.g. time cost) which could be extracted out and aggregated into a structured time series on which standard anomaly detection techniques can be applied. v) variable distribution - for some categorical or numerical variables, a deviation from the standard distribution of the variable can be indicative of an anomaly vi) time interval - some performance issues may not be explicitly observed in the logline themselves but in the time interval between specific log events. Need for AI : Given the humongous nature of the logs, it is often infeasible for even domain experts to manually go through the logs to detect the anomalous loglines. Addi- tionally, as described above, depending on the nature of the incident there can be diverse types of anomaly patterns in the logs, which can manifest as anomalous key words (like ”errors” or ”exception”) in the log templates or the volume of specific event logs or distribution over log variables or the time interval between two log specific event logs. However, even for a domain expert it is not possible to come up with rules to detect these anomalous patterns, and even when they can, they would likely not be robust to diverse incident types and changing nature of log lines as the software functionalities change over time. Hence, this makes a compelling case for
9 employing data-driven models and machine intelligence to mine and analyze this complex data-source to serve the end goals of incident detection. Log Analysis Workflow for Incident Detection In order to handle the complex nature of the data, typically a series of steps need to be followed to meaningfully analyze logs to detect incidents. Starting with the raw log data or data streams, the log analysis workflow first does some preprocess- ing of the logs to make them amenable to ML models. This is typically followed by log parsing which extracts a loose structure from the semi-structured data and then grouping and partitioning the log lines into log sequences in order to model the sequence characteristics of the data. After this, the logs or log sequences are represented as a machine-readable matrix on which various log analysis tasks can be performed - like clustering and summarizing the huge log dumps into a few key log patterns for easy visualization or for detecting anomalous log patterns that can be indicative of an incident. Figure 7 provides an outline of the different steps in the log analysis wokflow. While some of these steps are more of engineering challenges, others are more AI-driven and some even employ a combination of machine learning and domain knowledge rules. i) Log Preprocessing: This step typically involves cus- tomised filtering of specific regular expression patterns (like IP addresses or memory locations) that are deemed irrelevant for the actual log analysis. Other preprocessing steps like tokenization requires specialized handling of different wording styles and patterns arising due to the hybrid nature of logs consisting of both natural language and programming language constructs. For example a log line can contain a mix of text strings from source-code data having snake-case and camelCase tokens along with white-spaced tokens in natural language. ii) Log Parsing: To enable downstream processing, unstruc- tured log messages first need to be parsed into a structured event template (i.e. constant part that was actually designed by the developers) and parameters (i.e. variable part which contain the dynamic runtime information). Figure 8 provides one such example of parsing a single log line. In literature there have been heuristic methods for parsing as well as AI- driven methods which include traditional ML and also more recent neural models. The heuristic methods like Drain [63], IPLoM [64] and AEL [65] exploit known inductive bias on log structure while Spell [66] uses Longest common subsequence algorithm to dynamically extract log patterms. Out of these, Drain and Spell are most popular, as they scale well to industrial standards. Amongst the traditional ML methods, there are i) Clustering based methods like LogCluster [67], LKE [68], LogSig [69], SHISO [70], LenMa [71], LogMine [72] which assume that log message types coincide in similar groups ii) Frequent pattern mining and item-set mining meth- ods SLCT [73], LFA [74] to extract common message types iii) Evolutionary optimization approaches like MoLFI [75]. On the other hand, recent neural methods include [76] - Neural Transformer based models which use self-supervised Masked Language Modeling to learn log parsing vii) UniParser [77] - an unified parser for heterogenous log data with a learnable similarity module to generalize to diverse logs across different systems. 
There are yet another class of log analysis methods [78], [79] which aim at parsing free techniques, in order to avoid the computational overhead of parsing and the errors cascading from erroneous parses, especially due to the lack of robustness of the parsing methods. iii) Log Partitioning: After parsing the next step is to partition the log data into groups, based on some semantics where each group represents a finite chunk of log lines or log sequences. The main purpose behind this is to decompose the original log dump typically consisting of millions of log lines into logical chunks, so as to enable explicit modeling on these chunks and allow the models to capture anomaly patterns over sequences of log templates or log parameter values or both. Log partitioning can be of different kinds [20], [80] - Fixed or Sliding window based partitions, where the length of window is determined by length of log sequence or a period of time, and Identifier based partitions where logs are partitioned based on some identifier (e.g. the session or process they originate from). Figure 9 illustrates these different choices of log grouping and partitioning. A log event is eventually deemed to be anomalous or not, either at the level of a log line or a log partition. iv) Log Representation: After log partitioning, the next step is to represent each partition in a machine-readable way (e.g. a vector or a matrix) by extracting features from them. This can be done in various ways [81], [80]- either by extracting specific handcrafted features using domain knowledge or through ii) sequential representation which converts each partition to an ordered sequence of log event ids ii) quantitative represen- tation which uses count vectors, weighted by the term and inverse document frequency information of the log events iii) semantic representation captures the linguistic meaning from the sequence of language tokens in the log events and learns a high-dimensional embedding vector for each token in the dataset. The nature of log representation chosen has direct consequence in terms of which patterns of anomalies they can support - for example, for capturing keyword based anomalies, semantic representation might be key, while for anomalies related to template count and variable distribution, quantitative representations are possibly more appropriate. The semantic embedding vectors themselves can be either obtained using pretrained neural language models like GloVe, FastText, pretrained Transformer like BERT, RoBERTa etc or learnt using a trainable embedding layer as part of the target task. v) Log Analysis tasks for Incident Detection: Once the logs are represented in some compact machine-interpretable form which can be easily ingested by AI models, a pipeline of log analysis tasks can be performed on it - starting with Log compression techniques using Clustering and Summarization, followed by Log based Anomaly Detection. In turn, anomaly detection can further enable downstream tasks in Incident
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
10 Fig. 7. Steps of the Log Analysis Workflow for Incident Detection Fig. 8. Example of Log Parsing Fig. 9. Different types of log partitioning Management like Failure Prediction and Root Cause Analysis. In this section we discuss only the first two log analysis tasks which are pertinent to incident detection and leave failure prediction and RCA for the subsequent sections. v.1) Log Compression through Clustering & Summariza- tion: This is a practical first-step towards analyzing the huge volumes of log data is Log Compression through various clustering and summarization techniques. The objective of this analysis serves two purposes - Firstly, this step can independently help the site reliability engineers and service owners during incident management by providing a practical and intuitive way of visualizing these massive volumes of complex unstructured raw log data. Secondly, the output of log clustering can directly be leveraged in some of the log based anomaly detection methods. Amongst the various techniques of log clustering, [82], [67], [83] employ hierarchical clustering and can support online settings by constructing and retrieving from knowledge base of representative log clusters. [84], [85] use frequent pattern matching with dimension reduction techniques like PCA and locally sensitive hashing with online and streaming support. [86], [64], [87] uses efficient iterative or incremental clustering and partitioning techniques that support online and streaming logs and can also handle clustering of rare log instances. Another area of existing literature [88], [89], [90], [91] focus on log compression through summarization - where, for example, [88] uses heuristics like log event ids and timings to summarize and [89], [21] does openIE based triple extraction using semantic information and domain knowledge and rules to generate summaries, while [90], [91] use sequence clustering using linguistic rules or through grouping common event sequences. v.2) Log Anomaly Detection: Perhaps the most common use of log analysis is for log based anomaly detection where a wide variety of models have been employed in both research and industrial settings. These models are categorized based on various factors i) the learning setting - supervised, semi- supervised or unsupervised: While the semi-supervised models assume partial knowledge of labels or access to few anomalous instances, unsupervised ones train on normal log data and detect anomaly based on their prediction confidence. ii) the type of Model - Neural or traditional statistical non-neural models iii) the kinds of log representations used iv) Whether to use log parsing or parser free methods v) If using parsing, then whether to encode only the log template part or both template and parameter representations iv) Whether to restrict modeling of anomalies at the level of individual log lines or to support sequential modeling of anomaly detection over log sequences. The nature of log representation employed and the kind of modeling used - both of these factors influence what type of anomaly patterns can be detected - for example keyword and variable value based anomalies are captured by semantic representation of log lines, while template count and vari- able distribution based anomaly patterns are more explicitly modeled through quantitative representations of log events. Similarly template sequence and time-interval based anomalies need sequential modeling algorithms which can handle log sequences. 
Below we briefly summarize the body of literature dedicated to these two types of models - Statistical and Neural; and In Table III we provide a comparison of a more comprehensive list of existing anomaly detection algorithms and systems. Statistical Models are the more traditional machine learning models which draw inference from various statistics under- lying the training data. In the literature there have been various statistical ML models employed for this task under
11 different training settings. Amongst the supervised methods, [92], [93], [94] using traditional learning strategies of Lin- ear Regression, SVM, Decision Trees, Isolation Forest with handcrafted features extracted from the entire logline. Most of these model the data at the level of individual log-lines and cannot not explicitly capture sequence level anomalies. There are also unsupervised methods like ii) dimension reduction techniques like Principal Component Analysis (PCA) [84] iii) clustering and drawing correlations between log events and metric data as in [67], [82], [95], [80]. There are also unsupervised pattern mining methods which include mining invariant patterns from singular value decomposition [96] and mining frequent patterns from execution flow and control flow graphs [97], [98], [99], [68]. Apart from these there are also systems which employ a rule engine built using domain knowledge and an ensemble of different ML models to cater to different incident types [20] and also heuristic methods for doing contrast analysis between normal and incident- indicating abnormal logs [100]. Neural Models, on the other hand are a more recent class of machine learning models which use artificial neural networks and have proven remarkably successful across numerous AI applications. They are particularly powerful in encoding and representing the complex semantics underlying in a way that is meaningful for the predictive task. One class of unsuper- vised neural models use reconstruction based self-supervised techniques to learn the token or line level representation, which includes i) Autoencoder models [101], [102] ii) more powerful self-attention based Transformer models [103] iv) specific pretrained Transformers like BERT language model [104], [105], [21]. Another offshoot of reconstruction based models is those using generative adversarial or GAN paradigm of training for e.g. [106], [107] using LSTM or Transformer based encoding. The other types of unsupervised models are forecasting based, which learn to predict the next log token or next log line in a self-supervised way - for e.g i) Recurrent Neural Network based models like LSTM [108], [109], [110], [18], [111] and GRU [104] or their attention based counterparts [81], [112], [113] ii) Convolutional Neural Network (CNN) based models [114] or more complex models which use Graph Neural Network to represent log event data [115], [116]. Both reconstruction and forecasting based models are capable of handling sequence level anomalies, it depends on the nature of training (i.e. whether representations are learnt at log line or token level) and the capacity of model to handle long sequences (e.g. amongst the above, Autoencoder models are the most basic ones). Most of these models follow the practical setup of unsu- pervised training, where they train only non-anomalous log data. However, other works have also focused on supervised training of LSTM, CNN and Transformer models [111], [114], [78], [117], over anomalous and normal labeled data. On the other hand, [104], [110] use weak supervision based on heuristic assumptions for e.g. logs from external systems are considered anomalous. Most of the neural models use semantic token representations, some with pretrained fixed or trainable embeddings, initialized with GloVe, fastText or pretrained transformer based models, BERT, GPT, XLM etc. vi) Log Model Deployment: The final step in the log analysis workflow is deployment of these models in the actual industrial settings. 
It involves i) a training step, typically over offline log data dump, with or without some supervision labels collected from domain experts ii) online inference step, which often needs to handle practical challenges like non- stationary streaming data i.e. where the data distribution is not independently and identically distributed throughout the time. For tackling this, some of the more traditional statistical methods like [103], [95], [82], [84] support online streaming update while some other works can also adapt to evolving log data by incrementally building a knowledge base or memory or out-of-domain vocabulary [101]. On the other hand most of the unsupervised models support syncopated batched online training, allowing the model to continually adapt to changing data distributions and to be deployed on high throughput streaming data sources. However for some of the more advanced neural models, the online updation might be too computationally expensive even for regular batched updates. Apart from these, there have also been specific work on other challenges related to model deployment in practical settings like transfer learning across logs from different do- mains or applications [110], [103], [18], [18], [118] under semi-supervised settings using only supervision from source systems. Other works focus on evaluating model robustness and generalization (i.e. how well the model adapts to) to unstable log data due to continuous logging modifications throughout software evolutions and updates [109], [111], [104]. They achieve these by adopting domain adversarial paradigms during training [18], [18] or using counterfactual explanations [118] or multi-task settings [21] over various log analysis tasks. Challenges & Future Trends Collecting supervision labels: Like most AIOps tasks, collecting large-scale supervision labels for training or even evaluation of log analysis problems is very challenging and impractical as it involves significant amount of manual inter- vention and domain knowledge. For log anomaly detection, the goal being quite objective, label collection is still possible to enable atleast a reliable evaluation. Whereas, for other log analysis tasks like clustering and summarization, collecting supervision labels from domain experts is often not even possible as the goal is quite subjective and hence these tasks are typically evaluated through the downstream log analysis or RCA task. Imbalanced class problem: One of the key challenges of anomaly detection tasks, is the class imbalance, stemming from the fact that anomalous data is inherently extremely rare in occurrence. Additionally, various systems may show different kinds of data skewness owing to the diverse kinds of anomalies listed above. This poses a technical challenge both during model training with highly skewed data as well as choice of evaluation metrics, as Precision, Recall and F- Score may not perform satisfactorily. Further at inference, thresholding over the anomaly score gets particularly chal-
12 lenging for unsupervised models. While for benchmarking purposes, evaluation metrics like AUROC (Area under ROC curve) can suffice, but for practical deployment of these models require either careful calibrations of anomaly scores or manual tuning or heuristic means for setting the threshold. This being quite sensitive to the application at hand, also poses realistic challenges when generalizing to heterogenous logs from different systems. Handling large volume of data: Another challenge in log analysis tasks is handling the huge volumes of logs, where most large-scale cloud-based systems can generate petabytes of logs each day or week. This calls for log processing algorithms, that are not only effective but also lightweight enough to be very fast and efficient. Handling non-stationary log data: Along with humon- gous volume, the natural and most practical setting of logs analysis is an online streaming setting, involving non- stationary data distribution - with heterogenous log streams coming from different inter-connected micro-services, and the software logging data itself evolving over time as developers naturally keep evolving software in the agile cloud devel- opment environment. This requires efficient online update schemes for the learning algorithms and specialized effort towards building robust models and evaluating their robustness towards unstable or evolving log data. Handling noisy data: Annotating log data being ex- tremely challenging even for domain experts, supervised and semi-supervised models need to handle this noise during training, while for unsupervised models, it can heavily mislead evaluation. Even though it affects a small fraction of logs, the extreme class imbalance aggrevates this problem. Another related challenge is that of errors compounding and cascading from each of the processing steps in the log analysis workflow when performing the downstream tasks like anomaly detec- tion. Realistic public benchmark datasets for anomaly detec- tion: Amongst the publicly available log anomaly detection datasets, only a limited few contain anomaly labels. Most of those benchmarks have been excessively used in the literature and hence do not have much scope of furthering research. Infact, their biggest limitation is that they fail to showcase the diverse nature of incidents that typically arise in real- world deployment. Often very simple handcrafted rules prove to be quite successful in solving anomaly detection tasks on these datasets. Also, the original scale of these datasets are several orders of magnitude smaller than the real-world use-cases and hence not fit for showcasing the challenges of online or streaming settings. Further, the volume of unique patterns collapses significantly after the typical log processing steps to remove irrelevant patterns from the data. On the other hand, a vast majority of the literature is backed up by empirical analysis and evaluation on internal proprietary data, which cannot guarantee reproducibility. This calls for more realistic public benchmark datasets that can expose the real-world challenges of aiops-in-the-wild and also do a fair benchmarking across contemporary log analysis models. 
Public benchmarks for parsing, clustering, summariza- tion: Most of the log parsing, clustering and summarization literature only uses a very small subset of data from some of the public log datasets, where the oracle parsing is available, or in-house log datasets from industrial applications where they compare with oracle parsing methods that are unscalable in practice. This also makes fair comparison and standardized benchmarking difficult for these tasks. Better log language models: Some of the recent advances in neural NLP models like transformer based language models BERT, GPT has proved quite promising for representing logs in natural language style and enabling various log analysis tasks. However there is more scope of improvement in building neural language models that can appropriately encode the semi-structured logs composed of fixed template and variable parameters without depending on an external parser. Incorporating Domain Knowledge: While existing log anomaly detection systems are entirely rule-based or auto- mated, given the complex nature of incidents and the di- verse varieties of anomalies, a more practical approach would involve incorporating domain knowledge into these models either in a static form or dynamically, following a human- in-the-loop feedback mechanism. For example, in a complex system generating humungous amounts of logs, which kinds of incidents are more severe and which types of logs are more crucial to monitor for which kind of incidents. Or even at the level of loglines, domain knowledge can help understand the real-world semantics or physical significance of some of the parameters or variables mentioned in the logs. These aspects are often hard for the ML system to gauge on its own especially in the practical unsupervised settings. Unified models for heterogenous logs: Most of the log analysis models are highly sensitive towards the nature of log preprocessing or grouping, needing customized preprocessing for each type of application logs. This alludes towards the need for unified models with more generalizable preprocessing layers that can handle heterogenous kinds of log data and also different types of log analysis tasks. While [21] was one of the first works to explore this direction, there is certainly more research scope for building practically applicable models for log analysis. C. Traces and Multimodal Incident Detection Problem Definition Traces are semi-structured event logs with span information about the topological structure of the service graph. Trace anomaly detection relies on finding abnormal paths on the topological graph at given moments, as well as discovering abnormal information directly from trace event log text. There are multiple ways to process trace data. Traces usually have timestamps and associated sequential information so it can be covered into time-series data. Traces are also stored as trace event logs, containing rich text information. Moreover, traces store topological information which can be used to reconstruct the service graphs that represents the relation
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
13 among components of the systems. From the data perspective, traces can easily been turned into multiple data modalities. Thus, we combines trace-based anomaly detection with multi- modal anomaly detection to discuss in this section. Recently, we can see with the help of multi-modal deep learning technologies, trace anomaly detection can combine different levels of information relayed by trace data and learn more comprehensive anomaly detection models [119][120]. Empirical Approaches Traces draw more attention in microservice system archi- tectures since the topological structure becomes very complex and dynamic. Trace anomaly detection started from practical usages for large scale system debugging [121]. Empirical trace anomaly detection and RCA started with constructing trace graphs and identifying abnormal structures on the constructed graph. Constructing the trace graph from trace data is usually very time consuming, an offline component is designed to train and construct such trace graph. Apart from , to adapt to the usage requirements to detect and locate issues in large scale systems, trace anomaly detection and RCA algorithms usually also have an online part to support real-time service. For example, Cai et al. . released their study of a real-time trace-level diagnosis system, which is adopted by Alibaba datacenters. This is one of the very few studies to deal with real large distributed systems [122]. Most empirical trace anomaly detection work follow the offline and online design pattern to construct their graph mod- els. In the offline modeling, unsupervised or semi-supervised techniques are utilized to construct the trace entity graphs, very similar to techniques in process discovery and mining domain. For example, PageRank has been used to construct web graphs in one of the early web graph anomaly detection works [123]. After constructing the trace entity graphs, a variety of techniques can be used to detect anomalies. One common way is to compare the current graph pattern to normal graph patterns. If the current graph pattern significantly deviates from the normal patterns, report anomalous traces. An alternative approach is using data mining and statistical learning techniques to run dynamic analysis without construct- ing the offline trace graph. Chen et al. proposed Pinpoint [124], a framework for root cause analysis that using coarse- grained tagging data of real client requests at real-time when these requests traverse through the system, with data mining techniques. Pinpoint discovers the correlation between success / failure status of these requests and fault components. The entire approach processes the traces on-the-fly and does not leverage any static dependency graph models. Deep Learning Based Approaches In recent years, deep learning techniques started to be employed in trace anomaly detection and RCA. Also with the help of deep learning frameworks, combining general trace graph information and the detailed information inside of each trace event to train multimodal learning models become possible. Long-short term memory (LSTM) network [125] is a very popular neural network model in early trace and multimodal anomaly detection. LSTM is a special type of recurrent neural network (RNN) and has been proved to success in lots of other domains. In AIOps, LSTM is also commonly used in metric and log anomaly detection applications. Trace data is a natural fit with RNNs, majorly in two ways: 1) The topological order of traces can be modeled as event sequences. 
These event sequences can easily be transformed into model inputs of RNNs. 2) Trace events usually have text data that conveys rich information. The raw text, including both the structured and unstructured parts, can be transformed into vectors via standard tokenization and embedding techniques, and feed the RNN as model inputs. Such deep learning model architectures can be extended to support multimodal input, such as combining trace event vector with numerical time series values [119]. To better leverage the topological information of traces, graph neural networks have also been introduced in trace anomaly detection. Zhang et al. developed DeepTraLog, a trace anomaly detection technique that employs Gated graph neural networks [120]. DeepTraLog targets to solve anomaly detection problems for complex microservice systems where service entity relationships are not easy to obtain. Moreover, the constructed graph by GGNN training can also be used to localize the issue, providing additional root-cause analysis capability. Limitations Trace data became increasingly attractive as more applica- tions transitioned from monolithic to microservice architec- ture. There are several challenges in machine learning based trace anomaly detection. Data quality. As far as we know, there are multiple trace collection platforms and the trace data format and quality are inconsistent across these platforms, especially in the pro- duction environment. To use these trace data for analysis, researchers and developers have to spend significant time and effort to clean and reform the data to feed machine learning models. Difficult to acquire labels. It is very difficult to acquire labels for production data. For a given incident, labeling the corresponding trace requires identifying the incident occurring time and location, as well as the root cause which may be located in totally different time and location. Obtaining such full labels for thousands of incidents is extremely difficult. Thus, most of the existing trace analysis research still use synthetic data to evaluate the model performance. This brings more doubts whether the proposed solution can solve problems in real production. No sufficient multimodal and graph learning models. Trace data are complex. Current trace analysis simplifies trace data into event sequences or time-series numerical values, even in the multimodal settings. However, these existing model architectures did not fully leverage all information of trace data in one place. Graph-based learning can potentially be a solution but discussions of this topic are still very limited. Offline model training. The deep learning models in existing research relies on offline model training, partially because model training is usually very time consuming and
14 contradicts with the goal of real-time serving. However, offline model training brings static dependencies to a dynamic system. Such dependencies may cause additional performance issues. Future Trends Unified trace data Recently, OpenTelemetry leads the effort to unify observability telemetry data, including metrics, logs, traces, etc., across different platforms. This effort can bring huge benefits to future trace analysis. With more unified data models, AI researchers can more easily acquire necessary data to train better models. The trained model can also be easily plug-and-play by other parties, which can further boost model quality improvements. Unified engine for detection and RCA Trace graph contains rich information about the system at a given time. With the help of trace data, incident detection and root cause localization can be done within one step, instead of the current two consecutive steps. Existing work has demonstrated that by simply examining the constructed graph, the detection model can reveal sufficient information to locate the root causes [120]. Unified models for multimodal telemetry data Trace data analysis brings the opportunities for researchers to create a holistic view of multiple telemetry data modality since traces can be converted into text sequence data and time-series data. The learnings can be extended to include logs or metrics from different sources. Eventually we can expect unified learning models that can consume multimodal telemetry data for incident detection and RCA. Online Learning Modern systems are dynamic and ever- changing. Current two-step solution relies on offline model training and online serving or inference. Any system evolution between two offline training cycles could cause potential issues and damage model performance. Thus, supporting online learning is critical to guarantee high performance in real production environments. V. F AILURE P REDICTION Incident Detection and Root-Cause Analysis of Incidents are more reactive measures towards mitigating the effects of any incident and improving service availability once the incident has already occurred. On the other hand, there are other proactive actions that can be taken to predict if any potential incident can happen in the immediate future and prevent it from happening. Failures in software systems are such kind of highly disruptive incidents that often start by showing symptoms of deviation from the normal routine behavior of the required system functions and typically result in failure to meet the service level agreement. Failure prediction is one such proactive task in Incident Management, whose objective is to continuously monitor the system health by analyzing the different types of system data (KPI metrics, logging and trace data) and generate early warnings to prevent failures from occurring. Consequently, in order to handle the different kinds of telemetry data sources, the task of predicting failures can be tailored to metric based and log based failure prediction. We describe these two in details in this section. A. Metrics based Failure Prediction Metric data are usually fruitful in monitoring system. It is straightforward to directly leverage them to predict the occurrence of the incident in advance. As such, some proactive actions can be taken to prevent it from happening instead of reducing the time for detection. 
Generally, it can be formulated as the imbalanced binary classification problem if failure labels are available, and formulated as the time series forecasting problem if the normal range of monitored metrics are defined in advance. In general, failure prediction [126] usually adopts machine learning algorithms to learn the characteristics of historical failure data, build a failure prediction model, and then deploy the model to predict the likelihood of a failure in the future. Methods General Failure Prediction: Recently, there are increasing efforts on considering general failure incident prediction with the failure signals from the whole monitoring system. [127] collected alerting signals across the whole system and dis- covered the dependence relationships among alerting signals, then the gradient boosting tree based model was adopted to learn failure patterns. [128] proposed an effective feature engineering process to deal with complex alert data. It used multi-instance learning and handle noisy alerts, and inter- pretable analysis to generate an interpretable prediction result to facilitate the understanding and handling of incidents. Specific Type Failure Prediction: In contrast, some works In contrast, [127] and [128] aim to proactively predict various specific types of failures. [129] extracted statistical and textual features from historical switch logs and applied random forest to predict switch failures in data center networks. [130] collected data from SMART [131] and system-level signals, and proposed a hybrid of LSTM and random forest model for node failure prediction in cloud service system. [132] developed a disk error prediction method via a cost-sensitive ranking models. These methods target at the specific type of failure prediction, and thus are limited in practice. Challenges and Future Trends While conventional supervised learning for classification or regression problems can be used to handle failure prediction, it needs to overcome the following main challenges. First, datasets are usually very imbalanced due to the limited number of failure cases. This poses a significant challenge to the prediction model to achieve high precision and high recall simultaneously. Second, the raw signals are usually noisy, not all information before incident is helpful. How to extract omen features/patterns and filter out noises are critical to the prediction performance. Third, it is common for a typical system to generate a large volume of signals per minute, leading to the challenge to update prediction model in the streaming way and handle the large-scale data with lim- ited computation resources. Fourth, post-processing of failure prediction is very important for failure management system to improve availability. For example, providing interpretable failure prediction can facilitate engineers to take appropriate action for it.
15 B. Logs based Incident Detection Like Incident Detection and Root Cause Analysis, Failure Prediction is also an extremely complex task, especially in enterprise level systems which comprise of many distributed but inter-connected components, services and micro-services interacting with each other asynchronously. One of the main complexities of the task is to be able to do early detection of signals alluding towards a major disruption, even while the system might be showing only slight or manageable deviations from its usual behavior. Because of this nature of the problem, often monitoring the KPI metrics alone may not suffice for early detection, as many of these metrics might register a late reaction to a developing issue or may not be fine-grained enough to capture the early signals of an incident. System and software logs, on the other hand, being an all- pervasive part of systems data continuously capture rich and very detailed runtime information that are often pertinent to detecting possible future failures. Thus various proactive log based analysis have been applied in different industrial applications as a continuous monitoring task and have proved to be quite effective for a more fine- grained failure prediction and localizing the source of the potential failure. It involves analyzing the sequences of events in the log data and possibly even correlating them with other data sources like metrics in order to detect anomalous event patterns that indicate towards a developing incident. This is typically achieved in literature by employing supervised or semi-supervised machine learning models to predict future failure likelihood by learning and modeling the characteristics of historical failure data. In some cases these models can also be additionally powered by domain knowledge about the intricate relationships between the systems. While this task has not been explored as popularly as Log Anomaly Detection and Root Cause Analysis and there are fewer public datasets and benchmark data, software and systems maintainance logging data still plays a very important role in predicting potential future failures. In literature, generally the failure prediction task over log data has been employed in broadly two types of systems - homogenous and heterogenous. Failure Prediction in Homogenous Systems In homogenous systems, like high-performance computing systems or large-scale supercomputers, this entails prediction of independent failures, where most systems leverage sequen- tial information to predict failure of a single component. Time-Series Modeling : Amongst homogenous systems, [133], [134] extract system health indicating features from structured logs and modeled this as time series based anomaly forecasting problem. Similarly [135] extracts specific patterns during critical events through feature engineering and build a supervised binary classifier to predict failures. [136] converts unstructured logs into templates through parsing and apply feature extraction and time-series modeling to predict surge, frequency and seasonality patterns of anomalies. 
Supervised Classifiers Some of the older works predict failures in a supervised classification setting using tradi- tional machine learning models like support vector machines, nearest-neighbor or rule-based classifiers [137], [93], [138], or ensemble of classifiers [93] or hidden semi-markov model based classifier [139] over features handcrafted from log event sequences or over random indexing based log encoding while [140], [141] uses deep recurrent neural models like LSTM over semantic representations of logs. [142] predict and diagnose failures through first failure identification and causality based filtering to combine correlated events for filtering through association rule-mining method. Failure Prediction in Heterogenous Systems In heterogenous systems, like large-scale cloud services, es- pecially in distributed micro-service environment, outages can be caused by heterogenous components. Most popular meth- ods utilize knowledge about the relationship and dependency between the system components, in order to predict failures. Amongst such systems, [143] constructed a Bayesian network to identify conditional dependence between alerting signals extracted from system logs and past outages in offline setting and used gradient boosting trees to predict future outages in the online setting. [144] uses a ranking model combining temporal features from LSTM hidden states and spatial features from Random Forest to rank relationships between failure indicating alerts and outages. [145] trains trace-level and micro-service level prediction models over handcrafted features extracted from trace logs to detect three common types of micro-service failures. VI. R OOT C AUSE A NALYSIS Root-cause Analysis (RCA) is the process to conduct a series of actions to discover the root causes of an incident. RCA in DevOps focuses on building the standard process workflow to handle incidents more systematically. Without AI, RCA is more about creating rules that any DevOps member can follow to solve repeated incidents. However, it is not scalable to create separate rules and process workflow for each type of repeated incident when the systems are large and complex. AI models are capable to process high volume of input data and learn representations from existing incidents and how they are handled, without humans to define every single details of the workflow. Thus, AI-based RCA has huge potential to reform how root cause can be discovered. In this section, we discuss a series of AI-based RCA topics, separeted by the input data modality: metric-based, log-based, trace-based and multimodal RCA. A. Metric-based RCA Problem Definition With the rapidly growing adoption of microservices ar- chitectures, multi-service applications become the standard paradigm in real-world IT applications. A multi-service ap- plication usually contains hundreds of interacting services, making it harder to detect service failures and identify the root causes. Root cause analysis (RCA) methods leverage the KPI metrics monitored on those services to determine the root causes when a system failure is detected, helping engineers and
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
16 SREs in the troubleshooting process * . The key idea behind RCA with KPI metrics is to analyze the relationships or dependencies between these metrics and then utilize these relationships to identify root causes when an anomaly occurs. Typically, there are two types of approaches: 1) identifying the anomalous metrics in parallel with the observed anomaly via metric data analysis, and 2) discovering a topology/causal graph that represent the causal relationships between the services and then identifying root causes based on it. Metric Data Analysis When an anomaly is detected in a multi-service application, the services whose KPI metrics are anomalous can possibly be the root causes. The first approach directly analyzes these KPI metrics to determine root causes based on the assumption that significant changes in one or multiple KPI metrics happen when an anomaly occurs. Therefore, the key is to identify whether a KPI metric has pattern or magnitude changes in a look-back window or snapshot of a given size at the anomalous timestamp. Nguyen et al. [146], [147] propose two similar RCA meth- ods by analyzing low-level system metrics, e.g., CPU, memory and network statistics. Both methods first detect abnormal behaviors for each component via a change point detection algorithm when a performance anomaly is detected, and then determine the root causes based on the propagation patterns obtained by sorting all critical change points in a chronological order. Because a real-world multi-service application usually have hundreds of KPI metrics, the change point detection algorithm must be efficient and robust. [146] provides an algo- rithm by combining cumulative sum charts and bootstrapping to detect change points. To identify the critical change point from the change points discovered by this algorithm, they use a separation level metric to measure the change magnitude for each change point and extract the critical change point whose separation level value is an outlier. Since the earliest anomalies may have propagated from their corresponding services to other services, the root causes are then determined by sorting the critical change points in a chronological order. To further improve root cause pinpointing accuracy, [147] develops a new fault localization method by considering both propagation patterns and service component dependencies. Instead of change point detection, Shan et al. [148] devel- oped a low-cost RCA method called -Diagnosis to detect root causes of small-window long-tail latency for web services. - Diagnosis assumes that the root cause metrics of an abnormal service have significantly changes between the abnormal and normal periods. It applies the two-sample test algorithm and - statistics for measuring similarity of time series to identify root causes. In the two-sample test, one sample (normal sample) is drawn from the snapshot during the normal period while the other sample (anomaly sample) is drawn during the anomalous period. If the difference between the anomaly sample and the normal sample are statistically significant, the corresponding metrics of the samples are potential root causes. * A good survey for anomaly detection and RCA in cloud applications [22] Topology or Causal Graph-based Analysis The advantage of metric data analysis methods is the ability of handling millions of metrics. But most of them don’t consider the dependencies between services in an ap- plication. 
The second type of RCA approaches leverages such dependencies, which usually involves two steps, i.e., constructing topology/causal graphs given the KPI metrics and domain knowledge, and extracting anomalous subgraphs or paths given the observed anomalies. Such graphs can either be reconstructed from the topology (domain knowledge) of a certain application ([149], [150], [151], [152]) or automatically estimated from the metrics via causal discovery techniques ([153], [154], [155], [156], [157], [158], [159]). To identify the root causes of the observed anomalies, random walk (e.g., [160], [156], [153]), page-rank (e.g., [150]) or other techniques can be applied over the discovered topology/causal graphs. When the service graphs (the relationships between the services) or the call graphs (the communications among the services) are available, the topology graph of a multi-service application can be reconstructed automatically, e.g., [149], [150]. But such domain knowledge is usually unavailable or partially available especially when investigating the relation- ships between the KPI metrics instead of API calls. Therefore, given the observed metrics, causal discovery techniques, e.g., [161], [162], [163] play a significant role in constructing the causal graph describing the causal relationships between these metrics. The most popular causal discovery algorithm applied in RCA is the well-known PC-algorithm [161] due to its simplicity and explainability. It starts from a complete undirected graph and eliminates edges between the metrics via conditional independence test. The orientations of the edges are then determined by finding V-structures followed by orientation propagation. Some variants of the PC-algorithm [164], [165], [166] can also be applied based on different data properties. Given the discovered causal graph, the possible root causes of the observed anomalies can be determined by random walk. A random walk on a graph is a random process that begins at some node, and randomly moves to another node at each time step. The probability of moving from one node to another is defined in the the transition probability matrix. Random walk for RCA is based on the assumption that a metric that is more correlated with the anomalous KPI metrics is more likely to be the root cause. Each random walk starts from one anomalous node corresponding to an anomalous metric, then the nodes visited the most frequently are the most likely to be the root causes. The key of random walk approaches is to determine the transition probability matrix. Typically, there are three steps for computing the transition probability matrix, i.e., forward step (probability of walking from a node to one of its parents), backward step (probability of walking from a node to one of its children) and self step (probability of staying in the current node). For example, [153], [158], [159], [150] computes these probabilities based on the correlation of each metric with the detected anomalous metrics during the anomaly period. But correlation based random walk may not accurately localize root cause [156]. Therefore, [156] proposes to use the partial correlations instead of correlations to compute the transition
17 probabilities, which can remove the effect of the confounders of two metrics. Besides random walk, other causal graph analysis tech- niques can also be applied. For example, [157], [155] find root causes for the observed anomalies by recursively visiting all the metrics that are affected by the anomalies, e.g., if the parents of an affected metric are not affected by the anomalies, this metric is considered a possible root cause. [167] adopts a search algorithm based on a breadth-first search (BFS) algorithm to find root causes. The search starts from one anomalous KPI metric and extracts all possible paths outgoing from this metric in the causal graph. These paths are then sorted based on the path length and the sum of the weights associated to the edges in the path. The last nodes in the top paths are considered as the root causes. [168] considers counterfactuals for root cause analysis based on the causal graph, i.e., given a functional causal model, it finds the root cause of a detected anomaly by computing the contribution of each noise term to the anomaly score, where the contributions are symmetrized using the concept of Shapley values. Limitations Data Issues For a multi-service application with hundreds of KPI metrics monitored on each service, it is very chal- lenging to determine which metrics are crucial for identifying root causes. The collected data usually doesn’t describe the whole picture of the system architecture, e.g., missing some important metrics. These missing metrics may be the causal parents of other metrics, which violates the assumption of PC algorithms that no latent confounders exist. Besides, due to noises, non-stationarity and nonlinear relationships in real- world KPI metrics, recovering accurate causal graphs becomes even harder. Lack of Domain Knowledge The domain knowledge about the monitored application, e.g., service graphs and call graphs, is valuable to improve RCA performance. But for a complex multi-service application, even developers may not fully un- derstand the meanings or the relationships of all the monitored metrics. Therefore, the domain knowledge provided by experts is usually partially known, and sometimes conflicts with the knowledge discovered from the observed data. Causal Discovery Issues The RCA methods based on causal graph analysis leverage causal discovery techniques to recover the causal relationships between KPI metrics. All these techniques have certain assumptions on data properties which may not be satisfied with real-world data, so the discovered causal graph always contains errors, e.g., incorrect links or orientations. In recent years, many causal discovery methods have been proposed with different assumptions and characteristics, so that it is difficult to choose the most suitable one given the observed data. Human in the Loop After DevOps or SRE teams receive the root causes identified by a certain RCA method, they will do further analysis and provide feedback about whether these root causes make sense. Most RCA methods cannot leverage such feedback to improve RCA performance, or provide explanations why the identified root causes are incorrect. Lack of Benchmarks Different from incident detection problems, we lack benchmarks to evaluate RCA performance, e.g., few public datasets with groundtruth root causes are available, and most previous works use private internal datasets for evaluation. 
Although some multi-service application de- mos/simulators can be utilized to generate synthetic datasets for RCA evaluation, the complexity of these demo applications is much lower than real-world applications, so that such evalu- ation may not reflect the real performance in practice. The lack of public real-world benchmarks hampers the development of new RCA approaches. Future Trends RCA Benchmarks Benchmarks for evaluating the per- formance of RCA methods are crucial for both real-world applications and academic research. The benchmarks can either be a collection of real-world datasets with groundtruth root causes or some simulators whose architectures are close to real-world applications. Constructing such large-scale real- world benchmarks is essential for boosting novel ideas or approaches in RCA. Combining Causal Discovery and Domain Knowledge The domain knowledge provided by experts are valuable to improve causal discovery accuracy, e.g., providing required or forbidden causal links between metrics. But sometimes such domain knowledge introduces more issues when recovering causal graphs, e.g., conflicts with data properties or conditional independence tests, introducing cycles in the graph. How to combine causal discovery and expert domain knowledge in a principled manner is an interesting research topic. Putting Human in the Loop Integrating human interactions into RCA approaches is important for real-world applications. For instance, the causal graph can be built in an iterative way, i.e., an initial causal graph is reconstructed by a certain causal discovery algorithm, and then users examine this graph and provide domain knowledge constraints (e.g., which relation- ships are incorrect or missing) for the algorithm to revise the graph. The RCA reports with detailed analysis about incidents created by DevOps or SRE teams are valuable to improve RCA performance. How to utilize these reports to improve RCA performance is another importance research topic. B. Log-based RCA Problem Definition Triaging and root cause analysis is one of the most complex and critical phases in the Incident Management life cycle. Given the nature of the problem which is to investigate into the origin or the root cause of an incident, simply analyzing the end KPI metrics often do not suffice. Especially in a micro- service application setting or distributed cloud environment with hundreds of services interacting with each other, RCA and failure diagnosis is particularly challenging. In order to localize the root cause in such complex environments, engi- neers, SREs and service owners typically need to investigate into core system data. Logs are one such ubiquitous forms of systems data containing rich runtime information. Hence one of the ultimate objectives of log analysis tasks is to enable triaging of incident and localization of root cause to diagnose faults and failures.
18 Starting with heterogenous log data from different sources and microservices in the system, typical log-based aiops workflows first have a layer of log processing and analysis, involving log parsing, clustering, summarization and anomaly detection. The log analysis and anomaly detection can then cater to a causal inference layer that analyses the relationships and dependencies between log events and possibly detected anomalous events. These signals extracted from logs within or across different services can be further correlated with other observability data like metrics, traces etc in order to detect the root cause of an incident. Typically this involves constructing a causal graph or mining a knowledge graph over the log events and correlating them with the KPI metrics or with other forms of system data like traces or service call graphs. Through these, the objective is to analyze the relationships and dependencies between them in order to eventually identify the possible root causes of an anomaly. Unlike the more concrete problems like log anomaly detection, log based root cause analysis is a much more open-ended task. Subsequently most of the literature on log based RCA has been focused on industrial applications deployed in real-world and evaluated with internal benchmark data gathered from in-house domain experts. Typical types of Log RCA methods In literature, the task of log based root cause analysis have been explored through various kinds of approaches. While some of the works build a knowledge graph and knowledge and leverage data mining based solutions, others follow funda- mental principles from Causal Machine learning or and causal knowledge mining. Other than these, there are also log based RCA systems using traditional machine learning models which use feature engineering or correlational analysis or supervised classifier to detect the root cause. Handcrafted features based methods: [169] uses hand- crafted feature engineering and probabilistic estimation of specific types of root causes tailored for Spark logs. [170] uses frequent item-set mining and association rule mining on feature groups for structured logs. Correlation based Methods: [171], [172] localizes root cause based on correlation analysis using mutual information between anomaly scores obtained from logs and monitored metrics. Similarly [173] use PCA, ICA based correlation analysis to capture relationships between logs and consequent failures. [84], [174] uses PCA to detect abnormal system call sequences which it maps to application functions through frequent pattern mining.[175] uses LSTM based sequential modeling of log templates identified through pattern matching over clusters of similar logs, in order to predict failures. Supervised Classifier based Methods: [176] does auto- mated detection of exception logs and comparison of new error patterns with normal cloud behaviours on OpenStack by learning supervised classifiers over statistical and neural rep- resentations of historical failure logs. [177] employs statistical technique on the data distribution to identify the fine-grained category of a performance problem and fast matrix recovery RPCA to identify the root cause. [178], [179] uses KNN or its supervised versions to identify loglines that led to a failure. 
Knowledge Mining based Methods: [180], [181] takes a different approach of summarizing log events into an entity- relation knowledge graph by extracting custom entities and relationships from log lines and mining temporal and proce- dural dependencies between them from the overall log dump. While this gives a more structured representation of the log summary, it is also an intuitive way of aggregating knowledge from logs, it is also a way to bridge the knowledge gap developer community who creates the log data and the site reliability engineers who typically consume the log data when investigating incidents. However, eventually the end goal of constructing this knowledge graph representation of logs is to facilitate RCA. While these works do provide use-cases like case-studies on RCA for this vision, but they leave ample scope of research towards a more concrete usage of this kind of knowledge mining in RCA. Knowledge Graph based Methods: Amongst knowledge graph based methods, [182] diagnoses and triages performance failure issues in an online fashion by continuously building a knowledge base out of rules extracted from a random forest constructed over log data using heuristics and domain knowl- edge. [151] constructs a system graph from the combination of KPI metrics and log data. Based on the detected anomalies from these data sources, it extracts anomalous subgraphs from it and compares them with the normal system graph to detect the root cause. Other works mine normal log patterns [183] or time-weighted control flow graphs [99] from normal exe- cutions and on estimates divergences from them to executions during ongoing failures to suggest root causes. [184], [185], [186] mines execution sequences or user actions [187] either from normal and manually injected failures or from good or bad performing systems, in a knowledge base and utilizes the assumption that similar faults generate similar failures to match and diagnose type of failure. Most of these knowledge based approaches incrementally expand their knowledge or rules to cater to newer incident types over time. Causal Graph based Methods: [188] uses a multivariate time-series modeling over logs by representing them as error event count. This work then infers its causal relationship with KPI error rate using a pagerank style centrality detection in order to identify the top root causes. [167] constructs a knowledge graph over operation and maintenance entities extracted from logs, metrics, traces and system dependency graphs and mines causal relations using PC algorithm to detect root causes of incidents. [189] uses a Knowledge informed Hierarchical Bayesian Network over features extracted from metric and log based anomaly detection to infer the root causes. [190] constructs dynamic causality graph over events extracted from logs, metrics and service dependency graphs. [191] similarly constructs a causal dependency graph over log events by clustering and mining similar events and use it to infer the process in which the failure occurs. Also, on a related domain of network analysis, [192], [193], [194] mines causes of network events through causal analysis on network logs by modeling the parsed log template counts as a multivariate time series. [195], [156] use causality inference on KPI metrics and service call graphs to localize
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
19 root causes in microservice systems and one of the future research directions is to also incorporate unstructured logs to such causal analysis. Challenges & Future Trends Collecting supervision la- bels Being a complex and open-ended task, it is challenging and requires a lot of domain expertise and manual effort to col- lect supervision labels for root cause analysis. While a small scale supervision can still be availed for evaluation purposes, reaching the scale required for training these models is simply not practical. At the same time, because of the complex nature of the problem, completely unsupervised models often perform quite poorly. Data quality: The workflow of RCA over hetero- geneous unstructured log data typically involves various dif- ferent analysis layers, preprocessing, parsing, partitioning and anomaly detection. This results in compounding and cascading of errors (both labeling errors as well as model prediction errors) from these components, needing the noisy data to be handled in the RCA task. In addition to this, the extremely challenging nature of RCA labeling task further increases the possibility of noisy data. Imbalanced class problem: RCA on huge voluminous logs poses an additional problem of extreme class imbalance - where out of millions of log lines or log templates, a very sparse few instances might be related to the true root cause. Generalizability of models: Most of the exist- ing literature on RCA tailors their approach very specifically towards their own application and cannot be easily adopted even by other similar systems. This alludes towards need for more generalizable architectures for modeling the RCA task which in turn needs more robust generalizable log analysis models that can handle hetergenous kinds of log data coming from different systems. Continual learning framework: One of the challenging aspects of RCA in the distributed cloud setting is the agile environment, leading to new kinds of incidents and evolving causation factors. This kind of non- stationary learning setting poses non-trivial challenges for RCA but is indeed a crucial aspect of all practical industrial applications. Human-in-the-loop framework: While neither completely supervised or unsupervised settings is practical for this task, there is need for supporting human-in-the-loop framework which can incorporate feedbacks from domain experts to improve the system, especially in the agile settings where causation factors can evolve over time. Realistic public benchmarks: Majority of the literature in this area is focused on industrial applications with in-house evaluation setting. In some cases, they curate their internal testbed by injecting failures or faults or anomalies in their internal simulation environment (for e.g. injecting CPU, memory, network and Disk anomalies in Spark platforms) or in popular testing settings (like Grid5000 testbed or open-source microservice applications based on online shopping platform or train ticket booking or open source cloud operating system OpenStack). Other works evaluate by deploying their solution in real- world setting in their in-house cloud-native application, for e.g. on IBM Bluemix platform, or for Facebook applications or over hundreds of real production services at big data cloud computing platforms like Alibaba or thousands of services at e-commerce enterprises like eBay. 
One of the striking limitations in this regard is the lack of any reproducible open- source public benchmark for evaluating log based RCA in practical industrial settings. This can hinder more open ended research and fair evaluation of new models for tackling this challenging task. C. Trace-based and Multimodal RCA Problem Definition. Ideally, RCA for a complex system needs to leverage all kind of available data, including machine generated telemetry data and human activity records, to find potential root causes of an issue. In this section we discuss trace-based RCA together with multi-modal RCA. We also include studies about RCA based on human records such as incident reports. Ultimately, the RCA engine should aim to process any data types and discover the right root causes. RCA on Trace Data In previous section (Section IV-C) we discussed trace can be treated as multimodal data for anomaly detection. Similar to trace anomaly detection, trace root cause analysis also lever- ages the topological structure of the service map. Instead of detecting abnormal traces or paths, trace RCA usually started after issues were detected. Trace RCA techniques help ease troubleshooting processes of engineers and SREs. And trace RCA can be triggered in a more ad-hoc way instead of running continuously. This differentiates the potential techniques to be adopted from trace anomaly detection. Trace Entity Graph. From the technical point of view, trace RCA and trace anomaly detection share similar perspectives. To our best knowledge, there are not too many existing works talking about trace RCA alone. Instead, trace RCA serves as an additional feature or side benefit for trace anomaly detection in either empirical approaches [121] [196] or deep learning approaches [120] [197]. In trace anomaly detection, the constructed trace entity graph (TEG) after offline training provides a clean relationship between each component in the application systems. Thus, besides anomaly detection, [122] implemented a real-time RCA algorithm that discovers the deepest root of the issues via relative importance analysis after comparing the current abnormal trace pattern with normal trace patterns. Their experiment in the production environment demonstrated this RCA algorithm can achieve higher precision and recall compared to naive fixed threshold methods. The effectiveness of leverage trace entity graph for root cause analysis is also proven in deep learning based trace anomaly detection approaches. Liu et al. [198] proposed a multimodal LSTM model for trace anomaly detection. Then the RCA algorithm can check every anomalous trace with the model training traces and discover root cause by localizing the next called microservice which is not in the normal call paths. This algorithm performs well for both synthetic dataset and produc- tion datasets of four large production services, according to the evaluation of this work.
20 Online Learning. An alternative approach is using data mining and statistical learning techniques to run dynamic analysis without constructing the offline trace graph. Tra- ditional trace management systems usually provides basic analytical capabilities to diagnose issues and discover root causes [199]. Such analysis can be performed online without costly model training process. Chen et al. proposed Pinpoint [124], a framework for root cause analysis that using coarse- grained tagging data of real client requests at real-time when these requests traverse through the system, with data mining techniques. Pinpoint discovers the correlation between success / failure status of these requests and fault components. The entire approach processes the traces on-the-fly and does not leverage any static dependency graph models. Another related area is using trouble-shooting guide data, where [200] rec- ommends troubleshooting guide based on semantic similarity with incident description while [201] focuses on automation of troubleshooting guides to execution workflows, as a way to remediate the incident. RCA on Incident Reports Another notable direction in AIOps literature has been mining useful knowledge from domain-expert curated data (incident report, incident investigation data, bug report etc) towards enabling the final goals of root cause analysis and automated remediation of incidents. This is an open ended task which can serve various purposes - structuring and parsing unstructured or semi-structured data and extracting targeted information or topics from them (using topic modeling or in- formation extraction) and mining and aggregating knowledge into a structured form. The end-goal of these tasks is majorly root cause analysis, while some are also focused on recommending remediation to mitigate the incident. Especially since in most cloud- based settings, there is an increasing number of incidents that occur repeatedly over time showing similar symptoms and having similar root causes. This makes mining and curating knowledge from various data sources, very crucial, in order to be consumed by data-driven AI models or by domain experts for better knowledge reuse. Causality Graph. [202] extracts and mines causality graph from historical incident data and uses human-in-the-loop su- pervision and feedback to further refine the causality graph. [203] constructs an anomaly correlation graph, FacGraph using a distributed frequent pattern mining algorithm. [204] recom- mends appropriate healing actions by adapting remediations retrieved from similar historical incidents. Though the end task involves remediation recommendation, the system still needs to understand the nature of incident and root cause in order to retrieve meaningful past incidents. Knowledge Mining. [205], [206] mines knowledge graph from named entity and relations extracted from incident re- ports using LSTM based CRF models. [207] extracts symp- toms, root causes and remediations from past incident inves- tigations and builds a neural search and knowledge graph to facilitate a retrieval based root cause and remediation recommendation for recurring incidents. Future Trends More Efficient Trace Platform. Currently there are very limited studies in trace related topics. A fundamental challenge is about the trace platforms.There are bottlenecks in collection, storage, query and management of trace data. Traces are usually at a much larger scale than logs and metrics. 
Future Trends

More Efficient Trace Platforms. There are currently very limited studies on trace-related topics. A fundamental challenge lies in the trace platforms themselves: there are bottlenecks in the collection, storage, query and management of trace data, and traces are usually at a much larger scale than logs and metrics. Collecting, storing and retrieving trace data more efficiently is therefore critical to the success of trace root cause analysis.

Online Learning. Compared to trace anomaly detection, online learning plays an even more important role for trace RCA, especially in large cloud systems. An RCA tool usually needs to analyze evidence on the fly and correlate the most suspicious evidence with the ongoing incident, which makes the task highly time sensitive. For example, a trace entity graph (TEG) can enable accurate trace RCA, but only under the assumption that the TEG reflects the current state of the system. If offline training is the only way to obtain the TEG, the performance of such approaches in real-world production environments is always questionable. Using online learning to keep the TEG up to date is therefore a much better way to guarantee high performance in this setting.

Causality Graphs on Multimodal Telemetry. The most valuable information conveyed by trace data is the complex topological order of large systems. Without traces, causal analysis for system operations has to rely on temporal and geometrical correlations to infer causal relationships, and in practice very few existing causal inference methods can be adopted in real-world systems. With traces, however, it is straightforward to obtain the ground truth of how requests flow through the entire system. We therefore believe that much higher quality causal graphs can be obtained if they are learned from multimodal telemetry data.

Complete Knowledge Graphs of Systems. So far, knowledge mining has mostly been attempted on a single data type. To reflect the full picture of a complex system, however, AI models need to mine knowledge from all kinds of data, including metrics, logs, traces, incident reports and other system activity records, and then construct a knowledge graph containing complete system information.
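To make the discussion of trace-derived causality and knowledge graphs concrete, the following is a minimal sketch of aggregating parent-child trace spans into a service dependency graph, which can seed a causal graph over services. It is an illustration under assumed span fields (trace_id, span_id, parent_id, service), not an algorithm from the cited works.

```python
# Minimal sketch of deriving a service dependency graph from trace spans
# (illustrative only). Assumption: each span is a dict with trace_id, span_id,
# parent_id (None for root spans) and service fields.

from collections import defaultdict

spans = [
    {"trace_id": "t1", "span_id": "s1", "parent_id": None, "service": "api"},
    {"trace_id": "t1", "span_id": "s2", "parent_id": "s1", "service": "auth"},
    {"trace_id": "t1", "span_id": "s3", "parent_id": "s1", "service": "db"},
]

def build_service_graph(spans):
    """Return a mapping from caller service to the set of callee services,
    aggregated over all traces."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    graph = defaultdict(set)
    for s in spans:
        parent = by_id.get((s["trace_id"], s["parent_id"]))
        if parent is not None:
            graph[parent["service"]].add(s["service"])
    return dict(graph)

print(build_service_graph(spans))  # {'api': {'auth', 'db'}}
```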
VII. AUTOMATED ACTIONS

While the incident detection and RCA capabilities of AIOps provide information about ongoing issues, taking the right actions is the step that actually solves the problem. Without automated actions, human operators are still needed for every single ops task, so automated actions are critical for building fully automated, end-to-end AIOps systems. Automated actions contribute in both the short term and the long term: 1) short-term remediation: immediate actions that quickly remediate the issue, including server rebooting, live migration, automated scaling, etc.; and 2) longer-term resolutions: actions or guidance for tasks such as code bug fixing, software updating, hardware build-out and resource allocation optimization. In this section, we discuss three common types of automated actions: automated remediation, auto-scaling and resource management.

A. Automated Remediation

Problem Definition. Beyond continuously monitoring the IT infrastructure, detecting issues and discovering root causes, remediating issues with minimal or even no human intervention is the path towards the next generation of fully automated AIOps. Automated issue remediation (auto-remediation) takes a series of actions to resolve issues by leveraging known information, existing workflows and domain knowledge. Auto-remediation is already adopted in many IT operation scenarios, including cloud computing, edge computing, SaaS, etc. Traditional auto-remediation processes rely on well-defined policies and rules to determine which workflow to use for a given issue, whereas machine learning driven auto-remediation uses ML models to decide the best action workflow to mitigate or resolve the issue. ML-based auto-remediation is exceptionally useful in large-scale cloud or edge-computing systems, where it is impossible to manually create workflows for every issue category.

Existing Work

End-to-end auto-remediation solutions usually contain three main components: anomaly or issue detection, root cause analysis, and a remediation engine [208]. Successful auto-remediation therefore relies heavily on the quality of anomaly detection and root cause analysis, which we have already discussed in the sections above. In addition, the remediation engine should be able to learn from the analysis results, make decisions and execute them.

Knowledge learning. The relevant knowledge falls into several categories. The anomaly detection and root cause analysis results for the specific issue contribute the majority of the learnable knowledge [208]; the remediation engine uses this information to locate and categorize the issue. Human activity records of past issues (such as tickets and bug-fixing logs) are also important for the remediation engine to learn the full picture of how issues were handled historically. In Sections VI-A, VI-B and VI-C we discussed mining knowledge graphs from system metrics, logs and human-in-the-loop records; a high-quality knowledge graph that clearly describes the relationships among system components is a valuable input to the remediation engine.

Decision making and execution. Levy et al. [209] proposed Narya, a system that handles failure remediation for running virtual machines in cloud systems. For an issue where a host is predicted to fail, the remediation engine must decide the best action to take from options such as live migration, soft reboot, service healing, etc. The decision on which action to take is made via A/B testing and reinforcement learning. By adopting machine learning in their remediation engine, they report significant savings in virtual machine interruptions compared to the previous static strategies.
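The following toy sketch illustrates the general idea of learning-based remediation action selection with a simple epsilon-greedy bandit per failure context. It is not the Narya system from [209]; the action names, contexts and reward signal are illustrative assumptions.

```python
# Toy sketch of learning-based remediation action selection (epsilon-greedy bandit
# per failure context). Illustrative only; not the Narya system from [209].

import random
from collections import defaultdict

ACTIONS = ["live_migration", "soft_reboot", "service_healing"]  # assumed action set

class RemediationPolicy:
    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        # (context, action) -> [total_reward, count]
        self.stats = defaultdict(lambda: [0.0, 0])

    def choose(self, context: str) -> str:
        if random.random() < self.epsilon:          # explore occasionally
            return random.choice(ACTIONS)
        def avg(action):                             # otherwise exploit best observed action
            total, n = self.stats[(context, action)]
            return total / n if n else 0.0
        return max(ACTIONS, key=avg)

    def update(self, context: str, action: str, reward: float) -> None:
        """Reward could be, e.g., 1.0 if the VM avoided an interruption, else 0.0."""
        entry = self.stats[(context, action)]
        entry[0] += reward
        entry[1] += 1

policy = RemediationPolicy()
action = policy.choose(context="predicted_disk_failure")
policy.update("predicted_disk_failure", action, reward=1.0)
```

Real remediation engines add safety rails (allowed-action policies, rollout guards, human approval for risky actions) on top of any learned decision component.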
Future Trends

Auto-remediation research and development is still at a very early stage. Existing work focuses either on an intermediate step, such as constructing a causal graph for a given scenario, or on an end-to-end auto-remediation solution for a very specific use case such as virtual machine interruptions. Below are a few topics that could significantly improve the quality of auto-remediation systems.

System Integration. There is still no unified platform that can perform all of the issue analysis, learn the contextual knowledge, make decisions and execute the actions.

Learning to generate and update knowledge graphs. The quality of auto-remediation decision making strongly depends on domain knowledge, and currently humans collect most of it. In the future, it will be valuable to explore approaches that learn and maintain knowledge graphs of the systems in a more reliable way.

AI-driven decision making and execution. Currently most decision making and action execution is rule-based or based on statistical learning. With more powerful AI techniques, the remediation engine will be able to consume richer information and make more complex decisions.

B. Auto-scaling

Problem Definition. Cloud native technologies are becoming the de facto standard for building scalable applications in public or private clouds, enabling loosely coupled systems that are resilient, manageable, and observable (see the CNCF charter: https://github.com/cncf/foundation/blob/main/charter.md). Cloud systems such as GCP and AWS provide users with on-demand resources including CPU, storage, memory and databases. Users need to specify limits on these resources to provision for the workloads of their applications. If a service in an application exceeds the limit of a particular resource, end-users will experience request delays or timeouts, and system operators will then request a larger limit for this resource to avoid degraded performance. But when hundreds of services are running, such generous limits result in massive resource wastage. Auto-scaling aims to resolve this issue without human intervention: it enables dynamic provisioning of resources to applications based on workload behavior patterns, minimizing resource wastage without loss of quality of service (QoS) for end-users. Auto-scaling approaches can be categorized into two types: reactive auto-scaling and proactive (or predictive) auto-scaling.

Reactive auto-scaling. Reactive auto-scaling monitors the services in an application and brings capacity up and down in reaction to changes in workload. It is very effective and supported by most cloud platforms, but it has one potential disadvantage: it does not scale up resources until the workload has already increased, so there is a short period during which the workload is higher but the extra capacity is not yet available. End-users can therefore experience response delays during this period. Proactive auto-scaling aims to solve this problem by predicting future workloads from historical data. In this paper, we mainly discuss proactive auto-scaling algorithms based on machine learning. A minimal sketch of a reactive, threshold-based policy is shown below for contrast.
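The sketch below shows the reactive, threshold-based policy described above. It is an illustration rather than any specific cloud provider's implementation; the CPU thresholds and replica bounds are assumptions.

```python
# Minimal sketch of a reactive, threshold-based auto-scaling policy (illustrative;
# not a specific cloud provider's implementation). Thresholds and bounds are
# assumed values.

from dataclasses import dataclass

@dataclass
class ReactiveScaler:
    min_replicas: int = 2
    max_replicas: int = 20
    scale_up_cpu: float = 0.75    # average CPU utilization above which we add capacity
    scale_down_cpu: float = 0.30  # average CPU utilization below which we remove capacity

    def decide(self, current_replicas: int, avg_cpu_util: float) -> int:
        """Return the desired replica count given the observed average CPU utilization."""
        if avg_cpu_util > self.scale_up_cpu:
            return min(current_replicas + 1, self.max_replicas)
        if avg_cpu_util < self.scale_down_cpu:
            return max(current_replicas - 1, self.min_replicas)
        return current_replicas

scaler = ReactiveScaler()
print(scaler.decide(current_replicas=4, avg_cpu_util=0.82))  # -> 5
```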
Proactive Auto-scaling. Typically, proactive auto-scaling involves three steps: predicting workloads, estimating capacities, and scaling out. Machine learning techniques are usually applied to predict future workloads and estimate suitable capacities for the monitored services, and adjustments are then made accordingly to avoid degraded performance.

One type of proactive auto-scaling approach applies regression models (e.g., ARIMA [210], SARIMA [211], MLP, LSTM [212]). Given the historical metrics of a monitored service, these approaches train a regression model to learn the workload behavior patterns. For example, [213] investigated the ARIMA model for workload prediction and showed that it improves resource utilization efficiency with minimal impact on QoS. [214] applied a time-window MLP to predict phases in containers with different types of workloads and proposed a predictive vertical auto-scaling policy to resize containers. [215] also leveraged neural networks (especially MLPs) for workload prediction and compared this approach with traditional machine learning models such as linear regression and k-nearest neighbors. [216] applied a bidirectional LSTM to predict HTTP request workloads and showed that the Bi-LSTM works better than LSTM and ARIMA on the tested use cases. These approaches require accurate forecasts to avoid over- or under-allocation of resources, yet it is hard to develop a robust forecasting-based approach because of noise and sudden spikes in user requests. A minimal sketch of this forecast-then-provision pattern is given below.

The other type is based on reinforcement learning (RL), which treats auto-scaling as an automatic control problem whose goal is to learn an optimal auto-scaling policy, i.e., the best resource provisioning action under each observed state. [217] presents an exhaustive survey of reinforcement learning-based auto-scaling approaches and compares them against a set of proposed taxonomies; this survey is well worth reading for developers or researchers interested in this direction. Although RL looks promising for auto-scaling, many issues remain to be resolved. For example, model-based methods require a perfect model of the environment, and the learned policies cannot adapt to changes in the environment, while model-free methods have very poor initial performance and slow convergence, so they would incur high cost if applied directly in real-world cloud platforms.
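The following is a minimal sketch of the forecast-then-provision pattern. The cited works use models such as ARIMA or (Bi-)LSTM; here a simple exponentially weighted moving average stands in as the forecaster, and the per-replica capacity and headroom factor are illustrative assumptions.

```python
# Minimal sketch of forecast-then-provision proactive auto-scaling (illustrative;
# real systems use forecasters such as ARIMA or LSTM). Capacity and headroom
# figures are assumed.

import math
from typing import List

def ewma_forecast(history: List[float], alpha: float = 0.5) -> float:
    """One-step-ahead forecast via an exponentially weighted moving average."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def plan_capacity(history_rps: List[float],
                  per_replica_rps: float = 100.0,
                  headroom: float = 1.2,
                  min_replicas: int = 2) -> int:
    """Estimate how many replicas to provision for the next interval."""
    predicted = ewma_forecast(history_rps)
    needed = math.ceil(predicted * headroom / per_replica_rps)
    return max(needed, min_replicas)

# Example: requests-per-second history for the monitored service.
print(plan_capacity([220, 260, 310, 400, 520]))  # -> 6 replicas
```

The same forecast-then-provision pattern also underlies many of the ML-based resource management approaches discussed next.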
C. Resource Management

Problem Definition. Resource management is another important topic in cloud computing, covering resource provisioning, allocation and scheduling, e.g., workload estimation, task scheduling, energy optimization, etc. Even small provisioning inefficiencies, such as selecting the wrong resources for a task, can affect quality of service (QoS) and thus lead to significant monetary costs. The goal of resource management is therefore to provision the right amount of resources for tasks, improving QoS, mitigating workload imbalance, and avoiding service level agreement violations. Because multiple tenants share storage and computation resources on cloud platforms, resource management is a difficult task that involves dynamically allocating resources and scheduling tenants' tasks. Resource provisioning can be determined in a reactive manner, e.g., by manually creating static rules based on domain knowledge, but as with auto-scaling, reactive approaches lead to response delays and excessive overheads. To resolve this issue, ML-based approaches to resource management have gained much attention recently.

ML-based Resource Management

Many ML-based resource management approaches have been developed in recent years. Due to space limitations we will not discuss them in detail; we refer readers interested in this research topic to the following review papers: [218], [219], [220], [221], [222]. Most of these approaches apply ML techniques to forecast future resource consumption and then perform resource provisioning or scheduling based on the forecasting results. For instance, [223] uses random forest and XGBoost to predict VM behaviors, including maximum deployment sizes and workloads. [224] proposes a linear regression based approach to predict the resource utilization of VMs from their historical data and then leverages the prediction results to reduce energy consumption. [225] applies gradient boosting models for temperature prediction, on top of which a dynamic scheduling algorithm is developed to minimize the peak temperature of hosts. [226] proposes an RL-based, workload-specific scheduling algorithm to minimize average task completion time.

The accuracy of the ML model is the key factor affecting the efficiency of a resource management system. Applying more sophisticated traditional ML models, or even deep learning models, to improve prediction accuracy is a promising research direction. Besides accuracy, the time complexity of model prediction is another important factor: if an ML model is overly complicated, it cannot handle real-time requests for resource allocation and scheduling. How to trade off accuracy against time complexity needs to be explored further.

VIII. FUTURE OF AIOPS

A. Common AI Challenges for AIOps

We have discussed challenges and future trends in each task section with respect to how AI techniques are employed. In summary, several challenges are common across AIOps tasks.

Data Quality. All AIOps tasks face data quality issues. Most real-world AIOps data are extremely imbalanced, because incidents occur only occasionally, and they are also very noisy. Significant effort is needed for data cleaning and pre-processing before the data can be used to train ML models.

Lack of Labels. It is extremely difficult to acquire sufficient high-quality labels. Providing them requires domain experts who are deeply familiar with system operations to evaluate incidents, root causes and service graphs; this is extremely time consuming and requires specific expertise, so it cannot be handled by general crowd-sourcing approaches such as Mechanical Turk.
Non-stationarity and heterogeneity. Systems are ever-changing, so AIOps faces a non-stationary problem space, and AI models in this domain need mechanisms to deal with this non-stationary nature. Meanwhile, AIOps data are heterogeneous: the same kind of telemetry can exhibit a variety of underlying behaviors. For example, CPU utilization patterns can be completely different when the resources host different applications. Discovering these hidden states and handling heterogeneity is therefore very important for AIOps solutions to succeed.

Lack of Public Benchmarks. Even though the AIOps research community is growing rapidly, there are still very few public datasets on which researchers can benchmark and evaluate their results. Operational data are highly sensitive, so existing research is done either with simulated data or with enterprise production data that can hardly be shared with other groups and organizations.

Human-in-the-loop. Human feedback is very important for building AIOps solutions, but currently most of it is collected in an ad-hoc fashion, which is inefficient. There is a lack of human-in-the-loop studies in the AIOps domain on automating feedback collection and using that feedback to improve model performance.

B. Opportunities and Future Trends

Our literature review shows that current AIOps research still focuses mostly on infrastructure and tooling. AI technologies are being successfully applied in incident detection and RCA, and some solutions have been adopted by large distributed systems such as AWS and Alibaba Cloud, while AIOps process standardization and full automation are still at a very early stage. Based on this evidence, we foresee the following promising topics for AIOps in the next few years.

High Quality AIOps Infrastructure and Tooling. Some successful AIOps platforms and tools have been developed in recent years, but there are still many opportunities where AI can enhance the efficiency of IT operations. AI itself is also advancing rapidly, with new techniques invented and successfully applied in other domains, and the digital transformation trend brings new challenges to traditional IT operations and DevOps. Together these create tremendous demand for high-quality AI tooling, including monitoring, detection, RCA, prediction and automation.

AIOps Standardization. While building the infrastructure and tooling, AIOps practitioners are also gaining a better understanding of the full picture of the domain. AIOps modules can be identified and extracted from traditional processes to form their own standards. With clear goals and measures, it becomes possible to standardize AIOps systems, as has been done in domains such as recommendation systems or NLP. With such standardization, it will be much easier to experiment with a large variety of AI techniques to improve AIOps performance.

Human-centric to Machine-centric AIOps. In human-centric AIOps, human processes still play critical roles throughout the AIOps ecosystem, and AI modules help humans make better decisions and execute them. In machine-centric mode, AIOps systems require minimal human intervention and can remain in a human-free state for most of their lifetime: they continuously monitor the IT infrastructure, detect and analyze issues, and find the right paths to drive fixes. At this stage, engineers focus primarily on development tasks rather than operations.

IX. CONCLUSION

Digital transformation creates tremendous demand for computing resources.
This trend drives strong growth of large-scale IT infrastructure, such as cloud computing, edge computing, search engines, etc. Since it was proposed by Gartner in 2016, AIOps has been emerging rapidly and now draws attention from large enterprises and organizations. As the scale of IT infrastructure grows to a level where human operation can no longer keep up, AIOps becomes the only promising way to guarantee high availability of these gigantic IT infrastructures. AIOps covers different stages of the software lifecycle, including development, testing, deployment and maintenance. A range of AI techniques is now applied in AIOps applications, including anomaly detection, root cause analysis, failure prediction, automated actions and resource management. However, the AIOps industry as a whole is still at a very early stage, in which AI plays only a supporting role in helping humans conduct operation workflows. We foresee the trend shifting from human-centric operations to AI-centric operations in the near future; during this shift, the development of AIOps techniques will also move from building tools to creating human-free, end-to-end solutions. In this survey, we found that most current AIOps work focuses on detection and root cause analysis, while research on automation is still very limited, and the AI techniques used in AIOps are mainly traditional machine learning and statistical models.

ACKNOWLEDGMENT

We want to thank all participants who took the time to accomplish this survey. Their knowledge and experience of AI fundamentals were invaluable to our study. We are also grateful to our colleagues at the Salesforce AI Research Lab and collaborators from other organizations for their helpful feedback and support.

APPENDIX A
TERMINOLOGY

DevOps: Modern software development requires not only higher development quality but also higher operations quality. DevOps, a set of best practices that combines the development (Dev) and operations (Ops) processes, was created to achieve high-quality software development and post-release management [3].

Application Performance Monitoring (APM): Application performance monitoring is the practice of tracking key software application performance using monitoring software
and telemetry data [227]. APM is used to guarantee high system availability, optimize service performance and improve user experience. Originally, APM was mostly adopted for websites, mobile apps and other similar online business applications; however, as more and more traditional software is transformed to leverage cloud-based, highly distributed systems, APM is now widely used for a much larger variety of software applications and backends.

Observability: Observability is the ability to measure the internal states of a system by examining its outputs [228]. A system is "observable" if its current state can be estimated using only the information from its outputs. Observability data include metrics, logs, traces and other system-generated information.

Cloud Intelligence: The artificial intelligence features that improve cloud applications.

MLOps: MLOps stands for machine learning operations, the full process lifecycle of deploying machine learning models to production.

Site Reliability Engineering (SRE): The engineering discipline that bridges the gap between software development and operations.

Cloud Computing: Cloud computing is a technique, and a business model, that builds highly scalable distributed computer systems and rents computing resources, e.g., hosts, platforms and apps, to tenants to generate revenue. There are three main categories of cloud computing: infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS).

IT Service Management (ITSM): ITSM refers to all processes and activities to design, create, deliver, and support IT services for customers.

IT Operations Management (ITOM): ITOM overlaps with ITSM, focusing more on the operations side of IT services and infrastructure.
APPENDIX B
TABLES

TABLE I: POPULAR PUBLIC DATASETS FOR METRICS OBSERVABILITY

Name | Description | Tasks
Azure Public Dataset | These datasets contain a representative subset of first-party Azure virtual machine workloads from a geographical region. | Workload characterization, VM pre-provisioning, workload prediction
Google Cluster Data | 30 continuous days of information from Google Borg cells. | Workload characterization, workload prediction
Alibaba Cluster Trace | Cluster traces of real production servers from Alibaba Group. | Workload characterization, workload prediction
MIT Supercloud Dataset | Combination of high-level data (e.g. Slurm Workload Manager scheduler data) and low-level job-specific time series data. | Workload characterization
Numenta Anomaly Benchmark (realAWSCloudwatch) | AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes. | Incident detection
Yahoo S5 (A1) | The A1 benchmark contains real Yahoo! web traffic metrics. | Incident detection
Server Machine Dataset | A 5-week-long dataset collected from a large Internet company containing metrics like CPU load, network usage, memory usage, etc. | Incident detection
KPI Anomaly Detection Dataset | A large-scale real-world KPI anomaly detection dataset, covering various KPI patterns and anomaly patterns. This dataset is collected from five large Internet companies (Sougo, eBay, Baidu, Tencent, and Ali). | Incident detection

TABLE II: POPULAR PUBLIC DATASETS FOR LOG OBSERVABILITY
(✓ = anomaly labels available, ✗ = no anomaly labels, - = not reported)

Dataset | Description | Time-span | Data Size | # Logs | Anomaly Labels | # Anomalies | # Log Templates
Distributed system logs
HDFS | Hadoop distributed file system log | 38.7 hours | 1.47 GB | 11,175,629 | ✓ | 16,838 (blocks) | 30
HDFS | Hadoop distributed file system log | N.A. | 16.06 GB | 71,118,073 | ✗ | - | -
Hadoop | Hadoop map-reduce job log | N.A. | 48.61 MB | 394,308 | ✓ | - | 298
Spark | Spark job log | N.A. | 2.75 GB | 33,236,604 | ✗ | - | 456
Zookeeper | ZooKeeper service log | 26.7 days | 9.95 MB | 74,380 | ✗ | - | 95
OpenStack | OpenStack infrastructure log | N.A. | 58.61 MB | 207,820 | ✓ | 503 | 51
Supercomputer logs
BGL | Blue Gene/L supercomputer log | 214.7 days | 708.76 MB | 4,747,963 | ✓ | 348,460 | 619
HPC | High performance cluster log | N.A. | 32 MB | 433,489 | ✗ | - | 104
Thunderbird | Thunderbird supercomputer log | 244 days | 29.6 GB | 211,212,192 | ✓ | 3,248,239 | 4040
Operating system logs
Windows | Windows event log | 226.7 days | 16.09 GB | 114,608,388 | ✗ | - | 4833
Linux | Linux system log | 263.9 days | 2.25 MB | 25,567 | ✗ | - | 488
Mac | Mac OS log | 7 days | 16.09 MB | 117,283 | ✗ | - | 2214
Mobile system logs
Android | Android framework log | N.A. | 183.37 MB | 1,555,005 | ✗ | - | 76,923
Health App | Health app log | 10.5 days | 22.44 MB | 253,395 | ✗ | - | 220
Server application logs
Apache | Apache server error logs | 263.9 days | 4.9 MB | 56,481 | ✗ | - | 44
OpenSSH | OpenSSH server logs | 28.4 days | 70.02 MB | 655,146 | ✗ | - | 62
Standalone software logs
Proxifier | Proxifier software logs | N.A. | 2.42 MB | 21,329 | ✗ | - | 9
Hardware logs
Switch | Switch hardware failures | 2 years | - | 29,174,680 | ✓ | 2,204 | -
TABLE III: COMPARISON OF EXISTING LOG ANOMALY DETECTION MODELS
(✓ = yes, ✗ = no)

Reference | Learning Setting | Type of Model | Log Representation | Log Tokens | Parsing | Sequence Modeling
[92], [93], [94] | Supervised | Linear Regression, SVM, Decision Tree | handcrafted feature | log template | ✓ | ✗
[84] | Unsupervised | Principal Component Analysis (PCA) | quantitative | log template | ✓ | ✓
[67], [82], [95], [80] | Unsupervised | Clustering and correlation between logs and metrics | sequential, quantitative | log template | ✓ | ✗
[96] | Unsupervised | Mining invariants using singular value decomposition | quantitative, sequential | log template | ✓ | ✗
[97], [98], [99], [68] | Unsupervised | Frequent pattern mining from execution flow and control flow graph mining | quantitative, sequential | log template | ✓ | ✗
[20], [100] | Unsupervised | Rule engine over ensembles and heuristic contrast analysis over anomaly characteristics | sequential (with tf-idf weights) | log template | ✓ | ✗
[101] | Supervised | Autoencoder for log-specific word2vec | semantic (trainable embedding) | log template | ✓ | ✓
[102] | Unsupervised | Autoencoder with Isolation Forest | semantic (trainable embedding) | all tokens | ✗ | ✗
[114] | Supervised | Convolutional Neural Network | semantic (trainable embedding) | log template | ✓ | ✓
[108] | Unsupervised | Attention-based LSTM | sequential, quantitative, semantic (GloVe embedding) | log template, log parameter | ✓ | ✓
[81] | Unsupervised | Attention-based LSTM | quantitative and semantic (GloVe embedding) | log template | ✓ | ✓
[111] | Supervised | Attention-based LSTM | semantic (fastText embedding with tf-idf weights) | log template | ✓ | ✓
[104] | Semi-supervised | Attention-based GRU with clustering | semantic (fastText embedding with tf-idf weights) | log template | ✓ | ✓
[112] | Unsupervised | Attention-based Bi-LSTM | semantic (trainable embedding) | all tokens | ✗ | ✓
[109] | Unsupervised | Bi-LSTM | semantic (token embedding from BERT, GPT, XLM) | all tokens | ✗ | ✓
[113] | Unsupervised | Attention-based Bi-LSTM | semantic (BERT token embedding) | log template | ✓ | ✓
[110] | Semi-supervised | LSTM, trained with supervision from source systems | semantic (GloVe embedding) | log template | ✓ | ✓
[18] | Unsupervised | LSTM with domain adversarial training | semantic (GloVe embedding) | all tokens | ✗ | ✓
[118], [18] | Unsupervised | LSTM with Deep Support Vector Data Description | semantic (trainable embedding) | log template | ✓ | ✓
[115] | Supervised | Graph Neural Network | semantic (BERT token embedding) | log template | ✓ | ✓
[116] | Semi-supervised | Graph Neural Network | semantic (BERT token embedding) | log template | ✓ | ✓
[103], [229], [230], [231] | Unsupervised | Self-attention Transformer | semantic (trainable embedding) | all tokens | ✗ | ✓
[78] | Supervised | Self-attention Transformer | semantic (trainable embedding) | all tokens | ✗ | ✓
[117] | Supervised | Hierarchical Transformer | semantic (trainable GloVe embedding) | log template, log parameter | ✓ | ✓
[104], [105] | Unsupervised | BERT language model | semantic (BERT token embedding) | all tokens | ✗ | ✓
[21] | Unsupervised | Unified BERT for various log analysis tasks | semantic (BERT token embedding) | all tokens | ✗ | ✓
[232] | Unsupervised | Contrastive adversarial model | semantic (BERT and VAE based embedding) and quantitative | log template | ✓ | ✓
[106], [107], [233] | Unsupervised | LSTM / Transformer based GAN (generative adversarial) | semantic (trainable embedding) | log template | ✓ | ✓

Log Tokens refers to the tokens from the log line used in the log representations. The Parsing and Sequence Modeling columns respectively indicate whether a model requires log parsing and whether it supports modeling of log sequences.
TABLE IV: COMPARISON OF EXISTING METRIC ANOMALY DETECTION MODELS
(✓ = yes, ✗ = no)

Reference | Label Accessibility | Machine Learning Model | Dimensionality | Infrastructure | Streaming Updates
[31] | Supervised | Tree | Univariate | ✗ | ✓ (retraining)
[41] | Active | - | Univariate | ✓ | ✓ (retraining)
[42] | Unsupervised | Tree | Multivariate | ✗ | ✓
[43] | Unsupervised | Statistical | Univariate | ✗ | ✓
[51] | Unsupervised | Statistical | Univariate | ✗ | ✗
[37] | Semi-supervised | Tree | Univariate | ✗ | ✓
[36] | Unsupervised, Semi-supervised | Deep Learning | Univariate | ✗ | ✗
[52] | Unsupervised | Deep Learning | Univariate | ✓ | ✗
[40] | Domain Adaptation, Active | Tree | Univariate | ✗ | ✗
[46] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[49] | Unsupervised | Deep Learning | Univariate | ✗ | ✗
[45] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[32] | Supervised | Deep Learning | Univariate | ✓ | ✓ (retraining)
[47] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[48] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[50] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[38] | Semi-supervised, Active | Deep Learning | Multivariate | ✓ | ✓ (retraining)

TABLE V: COMPARISON OF EXISTING TRACE AND MULTIMODAL ANOMALY DETECTION AND RCA MODELS
(✓ = yes, ✗ = no)

Reference | Topic | Deep Learning Adoption | Method
[124] | Trace RCA | ✗ | Clustering
[121] | Trace RCA | ✗ | Heuristic
[234] | Trace RCA | ✗ | Multi-input differential summarization
[197] | Trace RCA | ✗ | Random forest, k-NN
[122] | Trace RCA | ✗ | Heuristic
[235] | Trace Anomaly Detection | ✗ | Graph model
[198] | Multimodal Anomaly Detection | ✓ | Deep Bayesian Networks
[236] | Trace Representation | ✓ | Tree-based RNN
[196] | Trace Anomaly Detection | ✗ | Heuristic
[120] | Multimodal Anomaly Detection | ✓ | GGNN and SVDD

TABLE VI: COMPARISON OF SEVERAL EXISTING METRIC RCA APPROACHES

Reference | Metric or Graph Analysis | Root Cause Score
[147] | Change points | Chronological order
[146] | Change points | Chronological order
[148] | Two-sample test | Correlation
[149] | Call graphs | Cluster similarity
[150] | Service graph | PageRank
[151] | Service graph | Graph similarity
[152] | Service graph | Hierarchical HMM
[153] | PC algorithm | Random walk
[154] | ITOA-PI | PageRank
[155] | Service graph and PC | Causal inference
[156] | PC algorithm | Random walk
[157] | Service graph and PC | Causal inference
[158] | PC algorithm | Random walk
[159] | PC algorithm | Random walk
[237] | Service graph | Causal inference
[168] | Service graph | Contribution-based
28 R EFERENCES [1] T. Olavsrud, “How to choose your cloud service provider,” 2012. [Online]. Available: https://www2.cio.com.au/article/416752/ how choose your cloud service provider/ [2] “Summary of the amazon s3 service disruption in the northern virginia (us-east-1) region,” 2021. [Online]. Available: https://aws. amazon.com/message/41926/ [3] S. Gunja, “What is devops? unpacking the purpose and importance of an it cultural revolution,” 2021. [Online]. Available: https: //www.dynatrace.com/news/blog/what-is-devops/ [4] Gartner, “Aiops (artificial intelligence for it operations).” [On- line]. Available: https://www.gartner.com/en/information-technology/ glossary/aiops-artificial-intelligence-operations [5] S. Siddique, “The road to enterprise artificial intelligence: A case studies driven exploration,” Ph.D. dissertation, 05 2018. [6] N. Sabharwal, Hands-on AIOps . Springer, 2022. [7] Y. Dang, Q. Lin, and P. Huang, “Aiops: Real-world challenges and re- search innovations,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion) , 2019, pp. 4–5. [8] L. Rijal, R. Colomo-Palacios, and M. S´anchez-Gord´on, “Aiops: A multivocal literature review,” Artificial Intelligence for Cloud and Edge Computing , pp. 31–50, 2022. [9] R. Chalapathy and S. Chawla, “Deep learning for anomaly detection: A survey,” arXiv preprint arXiv:1901.03407 , 2019. [10] L. Akoglu, H. Tong, and D. Koutra, “Graph based anomaly detection and description: a survey,” Data mining and knowledge discovery , vol. 29, no. 3, pp. 626–688, 2015. [11] J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro)service-based cloud applications: A survey,” 2021. [Online]. Available: https://arxiv.org/abs/2105.12378 [12] V. Davidovski, “Exponential innovation through digital transfor- mation,” in Proceedings of the 3rd International Conference on Applications in Information Technology , ser. ICAIT’2018. New York, NY, USA: Association for Computing Machinery, 2018, p. 3–5. [Online]. Available: https://doi.org/10.1145/3274856.3274858 [13] D. S. Battina, “Ai and devops in information technology and its future in the united states,” INTERNATIONAL JOURNAL OF CREATIVE RESEARCH THOUGHTS (IJCRT), ISSN , pp. 2320–2882, 2021. [14] A. B. Yoo, M. A. Jette, and M. Grondona, “Slurm: Simple linux utility for resource management,” in Workshop on job scheduling strategies for parallel processing . Springer, 2003, pp. 44–60. [15] J. Zhaoxue, L. Tong, Z. Zhenguo, G. Jingguo, Y. Junling, and L. Liangxiong, “A survey on log research of aiops: Methods and trends,” Mob. Netw. Appl. , vol. 26, no. 6, p. 2353–2364, dec 2021. [Online]. Available: https://doi.org/10.1007/s11036-021-01832-3 [16] S. He, P. He, Z. Chen, T. Yang, Y. Su, and M. R. Lyu, “A survey on automated log analysis for reliability engineering,” ACM Comput. Surv. , vol. 54, no. 6, jul 2021. [Online]. Available: https://doi.org/10.1145/3460345 [17] P. Notaro, J. Cardoso, and M. Gerndt, “A survey of aiops methods for failure management,” ACM Trans. Intell. Syst. Technol. , vol. 12, no. 6, nov 2021. [Online]. Available: https://doi.org/10.1145/3483424 [18] X. Han and S. Yuan, “Unsupervised cross-system log anomaly detection via domain adaptation,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management , ser. CIKM ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 3068–3072. [Online]. Available: https://doi.org/ 10.1145/3459637.3482209 [19] V.-H. Le and H. 
Zhang, “Log-based anomaly detection with deep learning: How far are we?” in Proceedings of the 44th International Conference on Software Engineering , ser. ICSE ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 1356–1367. [Online]. Available: https://doi.org/10.1145/3510003.3510155 [20] N. Zhao, H. Wang, Z. Li, X. Peng, G. Wang, Z. Pan, Y. Wu, Z. Feng, X. Wen, W. Zhang, K. Sui, and D. Pei, “An empirical investigation of practical log anomaly detection for online service systems,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , ser. ESEC/FSE 2021. New York, NY, USA: Association for Computing Machinery, 2021, p. 1404–1415. [Online]. Available: https://doi.org/10.1145/3468264.3473933 [21] Y. Zhu, W. Meng, Y. Liu, S. Zhang, T. Han, S. Tao, and D. Pei, “Unilog: Deploy one model and specialize it for all log analysis tasks,” CoRR , vol. abs/2112.03159, 2021. [Online]. Available: https://arxiv.org/abs/2112.03159 [22] J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,” ACM Comput. Surv. , vol. 55, no. 3, feb 2022. [Online]. Available: https://doi.org/10.1145/3501297 [23] L. Korzeniowski and K. Goczyla, “Landscape of automated log anal- ysis: A systematic literature review and mapping study,” IEEE Access , vol. 10, pp. 21 892–21 913, 2022. [24] M. Sheldon and G. V. B. Weissman, “Retrace: Collecting execution trace with virtual machine deterministic replay,” in Proceedings of the Third Annual Workshop on Modeling, Benchmarking and Simulation (MoBS 2007) . Citeseer, 2007. [25] R. Fonseca, G. Porter, R. H. Katz, and S. Shenker, “ { X-Trace } : A pervasive network tracing framework,” in 4th USENIX Symposium on Networked Systems Design & Implementation (NSDI 07) , 2007. [26] J. Zhou, Z. Chen, J. Wang, Z. Zheng, and M. R. Lyu, “Trace bench: An open data set for trace-oriented monitoring,” in 2014 IEEE 6th International Conference on Cloud Computing Technology and Science . IEEE, 2014, pp. 519–526. [27] S. Zhang, C. Zhao, Y. Sui, Y. Su, Y. Sun, Y. Zhang, D. Pei, and Y. Wang, “Robust KPI anomaly detection for large-scale software services with partial labels,” in 32nd IEEE International Symposium on Software Reliability Engineering, ISSRE 2021, Wuhan, China, October 25-28, 2021 , Z. Jin, X. Li, J. Xiang, L. Mariani, T. Liu, X. Yu, and N. Ivaki, Eds. IEEE, 2021, pp. 103–114. [Online]. Available: https://doi.org/10.1109/ISSRE52982.2021.00023 [28] M. Braei and S. Wagner, “Anomaly detection in univariate time-series: A survey on the state-of-the-art,” ArXiv , vol. abs/2004.00433, 2020. [29] A. Bl´azquez-Garc´ ıa, A. Conde, U. Mori, and J. A. Lozano, “A review on outlier/anomaly detection in time series data,” ACM Computing Surveys (CSUR) , vol. 54, no. 3, pp. 1–33, 2021. [30] K. Choi, J. Yi, C. Park, and S. Yoon, “Deep learning for anomaly detection in time-series data: review, analysis, and guidelines,” IEEE Access , 2021. [31] D. Liu, Y. Zhao, H. Xu, Y. Sun, D. Pei, J. Luo, X. Jing, and M. Feng, “Opprentice: Towards practical and automatic anomaly de- tection through machine learning,” in Proceedings of the 2015 internet measurement conference , 2015, pp. 211–224. [32] J. Gao, X. Song, Q. Wen, P. Wang, L. Sun, and H. Xu, “Robusttad: Robust time series anomaly detection via decomposition and convolu- tional neural networks,” arXiv preprint arXiv:2002.09545 , 2020. [33] S. Han, X. Hu, H. Huang, M. Jiang, and Y. 
Zhao, “ADBench: Anomaly detection benchmark,” in Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2022. [Online]. Available: https://openreview.net/forum?id=foA SFQ9zo0 [34] Z. Li, N. Zhao, S. Zhang, Y. Sun, P. Chen, X. Wen, M. Ma, and D. Pei, “Constructing large-scale real-world benchmark datasets for aiops,” arXiv preprint arXiv:2208.03938 , 2022. [35] R. Wu and E. J. Keogh, “Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress,” CoRR , vol. abs/2009.13807, 2020. [Online]. Available: https://arxiv. org/abs/2009.13807 [36] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng et al. , “Unsupervised anomaly detection via variational auto- encoder for seasonal kpis in web applications,” in Proceedings of the 2018 world wide web conference , 2018, pp. 187–196. [37] J. Bu, Y. Liu, S. Zhang, W. Meng, Q. Liu, X. Zhu, and D. Pei, “Rapid deployment of anomaly detection models for large number of emerging kpi streams,” in 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC) . IEEE, 2018, pp. 1–8. [38] T. Huang, P. Chen, and R. Li, “A semi-supervised vae based active anomaly detection framework in multivariate time series for online systems,” in Proceedings of the ACM Web Conference 2022 , 2022, pp. 1797–1806. [39] X.-L. Li and B. Liu, “Learning from positive and unlabeled examples with different data distributions,” in European conference on machine learning . Springer, 2005, pp. 218–229. [40] X. Zhang, J. Kim, Q. Lin, K. Lim, S. O. Kanaujia, Y. Xu, K. Jamieson, A. Albarghouthi, S. Qin, M. J. Freedman et al. , “Cross-dataset time series anomaly detection for cloud systems,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19) , 2019, pp. 1063–1076. [41] N. Laptev, S. Amizadeh, and I. Flint, “Generic and scalable framework for automated time-series anomaly detection,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining , 2015, pp. 1939–1947. [42] S. Guha, N. Mishra, G. Roy, and O. Schrijvers, “Robust random cut forest based anomaly detection on streams,” in International conference on machine learning . PMLR, 2016, pp. 2712–2721.
29 [43] S. Ahmad, A. Lavin, S. Purdy, and Z. Agha, “Unsupervised real-time anomaly detection for streaming data,” Neurocomputing , vol. 262, pp. 134–147, 2017. [44] Z. Li, Y. Zhao, J. Han, Y. Su, R. Jiao, X. Wen, and D. Pei, “Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding,” in KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021 , F. Zhu, B. C. Ooi, and C. Miao, Eds. ACM, 2021, pp. 3220–3230. [Online]. Available: https://doi.org/10.1145/3447548.3467075 [45] J. Audibert, P. Michiardi, F. Guyard, S. Marti, and M. A. Zuluaga, “Usad: Unsupervised anomaly detection on multivariate time series,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , 2020, pp. 3395–3404. [46] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei, “Robust anomaly detection for multivariate time series through stochastic recurrent neural network,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining , 2019, pp. 2828– 2837. [47] Z. Li, Y. Zhao, J. Han, Y. Su, R. Jiao, X. Wen, and D. Pei, “Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining , 2021, pp. 3220–3230. [48] W. Yang, K. Zhang, and S. C. Hoi, “Causality-based multivariate time series anomaly detection,” arXiv preprint arXiv:2206.15033 , 2022. [49] F. Ayed, L. Stella, T. Januschowski, and J. Gasthaus, “Anomaly detection at scale: The case for deep distributional time series models,” in International Conference on Service-Oriented Computing . Springer, 2020, pp. 97–109. [50] S. Rabanser, T. Januschowski, K. Rasul, O. Borchert, R. Kurle, J. Gasthaus, M. Bohlke-Schneider, N. Papernot, and V. Flunkert, “In- trinsic anomaly detection for multi-variate time series,” arXiv preprint arXiv:2206.14342 , 2022. [51] J. Hochenbaum, O. S. Vallis, and A. Kejariwal, “Automatic anomaly detection in the cloud via statistical learning,” arXiv preprint arXiv:1704.07706 , 2017. [52] H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, J. Tong, and Q. Zhang, “Time-series anomaly detection service at microsoft,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining , 2019, pp. 3009– 3017. [53] J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,” ACM Comput. Surv. , vol. 55, no. 3, pp. 59:1–59:39, 2023. [Online]. Available: https://doi.org/10.1145/3501297 [54] J. Bu, Y. Liu, S. Zhang, W. Meng, Q. Liu, X. Zhu, and D. Pei, “Rapid deployment of anomaly detection models for large number of emerging KPI streams,” in 37th IEEE International Performance Computing and Communications Conference, IPCCC 2018, Orlando, FL, USA, November 17-19, 2018 . IEEE, 2018, pp. 1–8. [Online]. Available: https://doi.org/10.1109/PCCC.2018.8711315 [55] Z. Z. Darban, G. I. Webb, S. Pan, C. C. Aggarwal, and M. Salehi, “Deep learning for time series anomaly detection: A survey,” CoRR , vol. abs/2211.05244, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2211.05244 [56] B. Huang, K. Zhang, M. Gong, and C. 
Glymour, “Causal discovery and forecasting in nonstationary environments with state-space models,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA , ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 2901–2910. [Online]. Available: http://proceedings.mlr.press/v97/huang19g.html [57] Q. Pham, C. Liu, D. Sahoo, and S. C. H. Hoi, “Learning fast and slow for online time series forecasting,” CoRR , vol. abs/2202.11672, 2022. [Online]. Available: https://arxiv.org/abs/2202.11672 [58] K. Lai, D. Zha, J. Xu, Y. Zhao, G. Wang, and X. Hu, “Revisiting time series outlier detection: Definitions and benchmarks,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual , J. Vanschoren and S. Yeung, Eds., 2021. [Online]. Available: https://datasets-benchmarks-proceedings.neurips.cc/paper/ 2021/hash/ec5decca5ed3d6b8079e2e7e7bacc9f2-Abstract-round1.html [59] R. Wu and E. Keogh, “Current time series anomaly detection bench- marks are flawed and are creating the illusion of progress,” IEEE Transactions on Knowledge and Data Engineering , 2021. [60] X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, and L. He, “A survey of human-in-the-loop for machine learning,” Future Gener. Comput. Syst. , vol. 135, pp. 364–381, 2022. [Online]. Available: https://doi.org/10.1016/j.future.2022.05.014 [61] D. Sahoo, Q. Pham, J. Lu, and S. C. H. Hoi, “Online deep learning: Learning deep neural networks on the fly,” CoRR , vol. abs/1711.03705, 2017. [Online]. Available: http://arxiv.org/abs/1711.03705 [62] Z. Chen, J. Liu, W. Gu, Y. Su, and M. R. Lyu, “Experience report: Deep learning-based system log analysis for anomaly detection,” 2021. [Online]. Available: https://arxiv.org/abs/2107.05908 [63] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in 2017 IEEE International Conference on Web Services (ICWS) , 2017, pp. 33–40. [64] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios, “Clustering event logs using iterative partitioning,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , ser. KDD ’09. New York, NY, USA: Association for Computing Machinery, 2009, p. 1255–1264. [Online]. Available: https://doi.org/10.1145/1557019.1557154 [65] Z. M. Jiang, A. E. Hassan, P. Flora, and G. Hamann, “Abstracting execution logs to execution events for enterprise applications (short paper),” in 2008 The Eighth International Conference on Quality Software , 2008, pp. 181–186. [66] M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in 2016 IEEE 16th International Conference on Data Mining (ICDM) , 2016, pp. 859–864. [67] R. Vaarandi and M. Pihelgas, “Logcluster - A data clustering and pattern mining algorithm for event logs,” in 11th International Conference on Network and Service Management, CNSM 2015, Barcelona, Spain, November 9-13, 2015 , M. Tortonesi, J. Sch¨onw¨alder, E. R. M. Madeira, C. Schmitt, and J. Serrat, Eds. IEEE Computer Society, 2015, pp. 1–7. [Online]. Available: https: //doi.org/10.1109/CNSM.2015.7367331 [68] Q. Fu, J.-G. Lou, Y. Wang, and J. Li, “Execution anomaly detection in distributed systems through unstructured log analysis,” in 2009 Ninth IEEE International Conference on Data Mining , 2009, pp. 149–158. [69] L. Tang, T. Li, and C.-S. 
Perng, “Logsig: Generating system events from raw textual logs,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management , ser. CIKM ’11. New York, NY, USA: Association for Computing Machinery, 2011, p. 785–794. [Online]. Available: https://doi.org/10.1145/2063576.2063690 [70] M. Mizutani, “Incremental mining of system log format,” in 2013 IEEE International Conference on Services Computing , 2013, pp. 595–602. [71] K. Shima, “Length matters: Clustering system log messages using length of words,” CoRR , vol. abs/1611.03213, 2016. [Online]. Available: http://arxiv.org/abs/1611.03213 [72] H. Hamooni, B. Debnath, J. Xu, H. Zhang, G. Jiang, and A. Mueen, “Logmine: Fast pattern recognition for log analytics,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management , ser. CIKM ’16. New York, NY, USA: Association for Computing Machinery, 2016, p. 1573–1582. [Online]. Available: https://doi.org/10.1145/2983323.2983358 [73] R. Vaarandi, “A data clustering algorithm for mining patterns from event logs,” in Proceedings of the 3rd IEEE Workshop on IP Operations & Management (IPOM 2003) (IEEE Cat. No.03EX764) , 2003, pp. 119– 126. [74] M. Nagappan and M. A. Vouk, “Abstracting log lines to log event types for mining software system logs,” in 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010) , 2010, pp. 114–117. [75] S. Messaoudi, A. Panichella, D. Bianculli, L. Briand, and R. Sasnauskas, “A search-based approach for accurate identification of log message formats,” in Proceedings of the 26th Conference on Program Comprehension , ser. ICPC ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 167–177. [Online]. Available: https://doi.org/10.1145/3196321.3196340 [76] S. Nedelkoski, J. Bogatinovski, A. Acker, J. Cardoso, and O. Kao, “Self-supervised log parsing,” in Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track , Y. Dong, D. Mladeni´c, and C. Saunders, Eds. Cham: Springer International Publishing, 2021, pp. 122–138. [77] Y. Liu, X. Zhang, S. He, H. Zhang, L. Li, Y. Kang, Y. Xu, M. Ma, Q. Lin, Y. Dang, S. Rajmohan, and D. Zhang, “Uniparser: A unified log parser for heterogeneous log data,” in Proceedings of the ACM Web Conference 2022 , ser. WWW ’22. New York, NY, USA: