AI for IT Operations (AIOps) on Cloud Platforms:
Reviews, Opportunities and Challenges
Qian Cheng*†, Doyen Sahoo*, Amrita Saha, Wenzhuo Yang, Chenghao Liu, Gerald Woo,
Manpreet Singh, Silvio Savarese, and Steven C. H. Hoi
Salesforce AI
Abstract—Artificial Intelligence for IT Operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There are a wide variety of problems to address, and multiple use-cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends, challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively underexplored topics, especially those that could significantly benefit from advances in the AI literature. Finally, we provide insights into the trends in this field and the key investment opportunities.
Index Terms—AIOps, Artificial Intelligence, IT Operations, Machine Learning, Anomaly Detection, Root-cause Analysis, Failure Prediction, Resource Management
I. INTRODUCTION
Modern software has been evolving rapidly during the era
of digital transformation. New infrastructure, techniques and
design patterns - such as cloud computing, Software-as-a-
Service (SaaS), microservices, DevOps, etc. have been devel-
oped to boost software development. Managing and operating
the infrastructure of such modern software is now facing new
challenges. For example, when traditional software transitions to SaaS, instead of handing over an installation package to the user, the software company now needs to provide 24/7 software access to all subscription-based users. Besides developing and testing, service management and operations are now additional duties of SaaS companies. Meanwhile, traditional software development separates the functions of the entire software lifecycle: coding, testing, deployment and operations are usually owned by different groups, each requiring a different set of skills. However, agile development and DevOps have started to blur the boundaries between these processes, and DevOps engineers are required to take end-to-end (E2E) responsibility. Balancing development and operations thus becomes critical to the whole team's productivity.
* Equal Contribution
† Work done when the author was with Salesforce AI
Software services need to guarantee service level agreements (SLAs) to customers, and often set internal Service Level Objectives (SLOs). Meeting SLAs and SLOs is one of the top priorities for CIOs when choosing the right service providers [1]. Unexpected service downtime can impact availability goals and cause significant financial and trust issues. For example, AWS experienced a major service outage in December 2021, causing multiple first- and third-party websites and heavily used services to experience downtime [2].
IT Operations plays a key role in the success of modern software companies, and as a result multiple concepts have been introduced, such as IT service management (ITSM) specifically for SaaS, and IT operations management (ITOM) for general IT infrastructure. These concepts focus on different aspects of IT operations, but the underlying workflow is very similar. The life cycle of software systems can be separated into several main stages, including planning, development/coding, building, testing, deployment, maintenance/operations, monitoring, etc. [3]. The operations part of DevOps can be further broken down into four major stages: observe, detect, engage and act, shown in Figure 1. The observing stage includes tasks like collecting different telemetry data (metrics, logs, traces, etc.), and indexing, querying and visualizing the collected telemetry. Time-to-observe (TTO) is the metric used to measure the performance of the observing stage. The detection stage includes tasks like detecting incidents, predicting failures, finding correlated events, etc., whose performance is typically measured as time-to-detect (TTD) (in addition to precision/recall). The engaging stage includes tasks like issue triaging, localization, root-cause analysis, etc., and its performance is often measured by time-to-triage (TTT). The acting stage includes immediate remediation actions such as rebooting the server, scaling up / scaling out resources, rolling back to previous versions, etc. Time-to-resolve (TTR) is the key metric for the acting stage. Unlike software development and release, where we have comparatively mature continuous integration and continuous delivery (CI/CD) pipelines, many of the post-release operations are still done manually. Such manual operational processes face several challenges:
• Manual operations struggle to scale. The capacity of manual operations is limited by the size of the DevOps team, and team size can only grow linearly. When software usage is in a growth stage, throughput and workloads may grow exponentially, in both scale and complexity. It is difficult for a DevOps team to grow at the same pace to handle the increasing operational workload.
Fig. 1. Common DevOps life cycles [3] and ops breakdown. Ops can comprise four stages: observe, detect, engage and act. Each of the stages has a corresponding measure: time-to-observe, time-to-detect, time-to-triage and time-to-resolve.
• Manual operations are hard to standardize. It is very hard to maintain the same high standard across the entire DevOps team given the diversity of team members (e.g. skill level, familiarity with the service, tenure, etc.). It takes a significant amount of time and effort to grow an operational domain expert who can effectively handle incidents. Unexpected attrition of these experts can significantly hurt the operational efficiency of a DevOps team.
• Manual operations are error-prone. It is very common for human operational errors to cause major incidents. Even for the most reliable cloud service providers, major incidents have been caused by human error in recent years.
Given these challenges, fully-automated operations pipelines powered by AI capabilities become a promising approach to achieve the SLA and SLO goals. AIOps, an acronym for AI for IT Operations, was coined by Gartner in 2016. According to the Gartner Glossary, "AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination" [4]. In order to achieve fully-automated IT Operations, investment in AIOps technologies is imperative. AIOps is the key to achieving high availability, scalability and operational efficiency. For example, AIOps can use AI models to automatically analyze large volumes of telemetry data to detect and diagnose incidents much faster, and much more consistently, than humans, which can help achieve ambitious targets such as 99.99% availability. AIOps can dynamically scale its capabilities with growing demand and use AI for automated incident and resource management, thereby reducing the burden of hiring and training domain experts to meet growth requirements. Moreover, automation through AIOps helps save valuable developer time and avoid fatigue. AIOps, as an emerging AI technology, appeared on the trending chart of the Gartner Hype Cycle for Artificial Intelligence in 2017 [5], along with other popular topics such as deep reinforcement learning, natural-language generation and artificial general intelligence. As of 2022, enterprise AIOps solutions have witnessed increased adoption across many companies' IT infrastructure. The AIOps market size is predicted to reach $11.02B by the end of 2023, with a compound annual growth rate (CAGR) of 34%.
AIOps comprises a set of complex problems. Transforming from manual to automated operations using AIOps is not a one-step effort. Based on the level of adoption of AI techniques and AIOps capabilities, we break down AIOps maturity into four levels, as shown in Figure 2.
Fig. 2. AIOps Transformation. Different maturity levels based on adoption of
AI techniques: Manual Ops, human-centric AIOps, machine-centric AIOps,
fully-automated AIOps.
Manual Ops. At this maturity level, DevOps follows traditional best practices and all processes are set up manually. There are no AI or ML models. This is the baseline against which the AIOps transformation is compared.
Human-centric. At this level, operations are still performed mainly through manual processes, and AI techniques are adopted to replace sub-procedures in the workflow, mainly acting as assistants. For example, instead of glass-watching for incident alerts, DevOps or SREs can set dynamic alerting thresholds based on anomaly detection models. Similarly, the root cause analysis process requires watching multiple dashboards to draw insights, and AI can help automatically obtain those insights.
Machine-centric. At this level, all major components (monitoring, detecting, engaging and acting) of the E2E operations process are empowered by more complex AI techniques. Humans are mostly hands-free but need to participate in the human-in-the-loop process to help fine-tune and improve the AI systems' performance. For example, DevOps / SREs operate and manage the AI platform to guarantee that training and inference pipelines function well, and domain experts need to provide feedback or labels for AI-made decisions to improve performance.
Fully-automated. At this level, the AIOps platform achieves full automation with minimal or zero human intervention. With the help of fully-automated AIOps platforms, the current CI/CD (continuous integration and continuous deployment) pipelines can be further extended to CI/CD/CM/CC (continuous integration, continuous deployment, continuous monitoring and continuous correction) pipelines.
Different software systems and companies may be at different levels of AIOps maturity, and their priorities and goals may differ with regard to the specific AIOps capabilities to be adopted. Setting the right goals is important for the success of AIOps applications. We foresee a trend of shifting from manual operations all the way to fully-automated AIOps in the future, with more and more complex AI techniques being used to address challenging problems. In order to enable the community to adopt AIOps capabilities faster, in this paper we present a comprehensive survey of the various AIOps problems and tasks and the solutions developed by the community to address them.
II. CONTRIBUTION OF THIS SURVEY
An increasing number of research studies and industrial products in the AIOps domain have recently emerged to address a variety of problems. Sabharwal et al. published the book "Hands-on AIOps" to discuss practical AIOps and implementation [6]. Several AIOps literature reviews are also available [7], [8] to help audiences better understand this domain. However, there have been very limited efforts to provide a holistic view that deeply connects AIOps with the latest AI techniques. Most AI-related literature reviews are still topic-based, such as deep learning anomaly detection [9], [10], failure management, or root-cause analysis [11], and there is still limited work offering a holistic view of AIOps that covers the status in both academia and industry. We prepared this survey to address this gap, focusing on the AI techniques used in AIOps.
Except for the monitoring stage, where most of the tasks focus on telemetry data collection and management, AIOps covers the other three stages, where the tasks focus more on analytics. In our survey, we group AIOps tasks based on which operational stage they contribute to, as shown in Figure 3.
Incident Detection. Incident detection tasks contribute to the detection stage. The goal of these tasks is to reduce mean-time-to-detect (MTTD). In our survey we cover time series incident detection (Section IV-A), log incident detection (Section IV-B), and trace and multimodal incident detection (Section IV-C).
Failure Prediction. Failure prediction also contributes to the detection stage. The goal of failure prediction is to predict potential issues before they actually happen, so that actions can be taken in advance to minimize impact. Failure prediction also contributes to reducing mean-time-to-detect (MTTD). In our survey we cover metric failure prediction (Section V-A) and log failure prediction (Section V-B). There are very limited efforts in the literature on trace and multimodal failure prediction.
Root-cause Analysis. Root-cause analysis tasks contribute to multiple operational stages, including triaging and acting, and even support more efficient long-term issue fixing and resolution. As an immediate response to an incident, the goal is to minimize time-to-triage (MTTT), and simultaneously contribute to reducing mean-time-to-resolve (MTTR). An added benefit is also a reduction in human toil. We further break down root-cause analysis into time-series RCA (Section VI-A), log RCA (Section VI-B), and trace and multimodal RCA (Section VI-C).
Automated Actions. Automated actions contribute to the acting stage, where the main goal is to reduce mean-time-to-resolve (MTTR), as well as to support long-term issue fixing and resolution. In this survey we discuss a series of methods for auto-remediation (Section VII-A), auto-scaling (Section VII-B) and resource management (Section VII-C).
III. DATA FOR AIOPS
Before we dive into the problem settings, it is important to understand the data available to perform AIOps tasks. Modern software systems generate tremendously large volumes of observability metrics. The data volume keeps growing exponentially with digital transformation [12]. The increase in the volume of data stored in large unstructured data lake systems makes it very difficult for DevOps teams to consume the new information and fix consumers' problems efficiently [13]. Successful products and platforms are now built to address the monitoring and logging problems. Observability platforms, e.g. Splunk and AWS CloudWatch, now support emitting, storing and querying large-scale telemetry data.
Similar to other AI domains, observability data is critical to AIOps. Unfortunately, there are limited public datasets in this domain, and many successful AIOps research efforts are done with self-owned production data, which is usually not available publicly. In this section, we describe the major telemetry data types, including metrics, logs, traces and other records, and present a collection of public datasets for each data type.
A. Metrics
Metrics are numerical data measured over time which
provide a snapshot of the system behavior. Metrics can rep-
resent a broad range of information, broadly classified into
compute metrics and service metrics. Compute metrics (e.g.
CPU utilization, memory usage, disk I/O) are an indicator of
the health status of compute nodes (servers, virtual machines,
pods). They are collected at the system level using tools such
as Slurm [14] for usage statistics from jobs and nodes, and
the Lustre parallel distributed file system for I/O information.
Service metrics (e.g. request count, page visits, number of
errors) measure the quality and level of service of customer-facing applications. Aggregate statistics of such numerical data also fall under the category of metrics, providing a more coarse-grained view of system behavior.
Fig. 3. AIOps Tasks. In this survey we discuss a series of AIOps tasks, categorized by which operational stages these tasks contribute to, and the observability data type they take.
Metrics are constantly generated by all components of the cloud platform life cycle, making them one of the most ubiquitous forms of AIOps data. Cloud platforms and supercomputer clusters can generate petabytes of metrics data, which makes storage and analysis a challenge but at the same time brings immense observability into the health of the entire IT operation. Being numerical time-series data, metrics are simple to interpret and easy to analyze, allowing simple threshold-based rules to be acted upon. At the same time, they contain sufficiently rich information to power more complex AI-based alerting and actions.
The major challenge in leveraging insights from metrics data
arises due to their diverse nature. Metrics data can exhibit a
variety of patterns, such as cyclical patterns (repeating patterns
hourly, daily, weekly, etc.), sparse and intermittent spikes, and
noisy signals. The characteristics of the metrics ultimately
depend on the underlying service or job.
In Table I, we briefly describe the datasets and benchmarks
of metrics data. Metrics data have been used in studies char-
acterizing the workloads of cloud data centers, as well as the
various AIOps tasks of incident detection, root cause analysis,
failure prediction, and various planning and optimization tasks
like auto-scaling and VM pre-provisioning.
B. Logs
Software logs are specifically designed by software developers to record runtime information about processes executing within a system, making them a ubiquitous part of any modern system or software maintenance. Once the system is live, it continuously emits huge volumes of logging data throughout its life cycle, which naturally contains rich dynamic runtime information relevant to IT Operations and Incident Management. Consequently, in AI-driven IT-Ops pipelines, automated log-based analysis plays an important role in Incident Management, specifically in tasks like Incident Detection, Causation and Failure Prediction, as studied by multiple literature surveys in the past [15], [16], [17], [18], [19], [20], [21], [22], [23].
Fig. 4. GPU utilization metrics from the MIT Supercloud Dataset exhibiting various patterns (cyclical, sparse and intermittent, noisy).
In most practical cases, especially in industrial settings, the volume of logs can reach the order of petabytes of loglines per week. Also, because of the nature of log content, log data dumps are much heavier in size compared to time series telemetry data. This requires special handling of log observability data in the form of data streams; today, various services like Splunk, Datadog, LogStash, NewRelic, Loggly, Logz.io, etc. are employed to efficiently store and access the log stream, and also to visualize, analyze and query past log data using specialized structured query languages.
Nature of Log Data. Typically these logs consist of semi-structured data, i.e. a combination of structured and unstructured data. The unstructured part can contain natural language tokens and programming language constructs (e.g. method names), while the structured part can consist of quantitative or categorical telemetry or observability metrics data, printed at runtime by various logging statements embedded in the source code or sometimes generated automatically via loggers or logging agents. Depending on the kind of service the logs are dumped from, there can be diverse types of logging data with heterogeneous form and content. For example, logs can originate from distributed systems (e.g. Hadoop or Spark), operating systems (Windows or Linux), complex supercomputer systems, or can be dumped at the hardware level (e.g. switch logs), at the middleware level (e.g. Apache server logs) or by specific applications (e.g. Health App). Typically each logline comprises a fixed part, which is the template designed by the developer, and some variable part or parameters which capture runtime information about the system.
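To make this concrete, the following minimal sketch separates a single log line into its constant template and variable parameters. The sample log line and the regular expressions are purely illustrative assumptions, not the approach of any specific parser surveyed here (dedicated log parsers such as Drain are discussed later in Section IV-B).

import re

# Hypothetical semi-structured log line, for illustration only.
log_line = "2023-04-10 12:31:05 INFO Deleted block blk_3587508140051953248 of size 67108864 from /mnt/data"

# Illustrative masking rules for variable fields (block ids, paths, numbers).
patterns = [
    (r"blk_-?\d+", "<BLOCK_ID>"),
    (r"(?:/[\w.-]+)+", "<PATH>"),
    (r"\b\d+\b", "<NUM>"),
]

template = log_line
parameters = []
for regex, placeholder in patterns:
    parameters.extend(re.findall(regex, template))  # collect the variable values
    template = re.sub(regex, placeholder, template)  # replace them with slots

print("template  :", template)
print("parameters:", parameters)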
Complexities of Log Data. Thus, apart from being one of the most generic and hence crucial data sources in IT Ops, logs are one of the most complex forms of observability data due to their open-ended form and the level of granularity at which they contain system runtime information. In the cloud computing context, logs are the source of truth for cloud users about the underlying servers running their applications, since cloud providers don't grant users full access to the servers and platforms. Also, being designed by developers, logs are immediately affected by any changes developers make to the source code or logging statements. This results in non-stationarity in the logging vocabulary or even in the entire structure or template underlying the logs.
Log Observability Tasks.
Log observability typically in-
volves different tasks like anomaly detection over logs during
incident detection (Section IV-B), root cause analysis over logs
(Section VI-B) and log based failure prediction (Section V-B).
Datasets and Benchmarks.
Out of the different log ob-
servability tasks, log based anomaly detection is one of the
most objective tasks and hence most of the publicly released
benchmark datasets have been designed around anomaly de-
tection. In Table B, we give a comprehensive description of the different public benchmark datasets that have been used in the literature for anomaly detection tasks. Of these, the Switch dataset and subsets of HPC and BGL have also been redesigned to serve the failure prediction task. On the other hand, there are no public benchmarks for log based RCA tasks, which have typically been evaluated on private enterprise data.
C. Traces
Trace data are usually presented as semi-structured logs, with identifiers that allow reconstructing the topological maps of the applications and the network flows of target requests. For example, when a user uses Google Search, a typical trace graph of this user request looks like Figure 6. Traces are composed of system events (spans) that track the entire progress of a request or execution. A span is a sequence of semi-structured event logs. Tracing data makes it possible to put different data modalities into the same context. Requests travel through multiple services / applications, and each application may have totally different behavior. Trace records usually contain two required parts: timestamps and span ids. Using the timestamps and span ids, we can easily reconstruct the trace graph from trace logs.
Fig. 5.
An example of Log Data generated in IT Operations
Fig. 6. A snapshot of the trace graph of user requests when using Google Search.
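As a minimal illustration of this reconstruction, the sketch below rebuilds a call tree from a handful of spans. The span records, field names (span_id, parent_id, ts) and service names are invented for this example and do not correspond to any particular tracing system.

from collections import defaultdict

# Hypothetical span records; field names and values are assumptions for illustration.
spans = [
    {"span_id": "a", "parent_id": None, "ts": 0.00, "service": "frontend"},
    {"span_id": "b", "parent_id": "a",  "ts": 0.01, "service": "search-api"},
    {"span_id": "c", "parent_id": "b",  "ts": 0.02, "service": "index-lookup"},
    {"span_id": "d", "parent_id": "b",  "ts": 0.03, "service": "ads"},
]

# Rebuild the trace graph: parent span id -> child spans ordered by timestamp.
children = defaultdict(list)
for span in spans:
    children[span["parent_id"]].append(span)
for child_list in children.values():
    child_list.sort(key=lambda s: s["ts"])

def print_tree(parent_id=None, depth=0):
    for span in children[parent_id]:
        print("  " * depth + f"{span['service']} ({span['span_id']})")
        print_tree(span["span_id"], depth + 1)

print_tree()  # prints the reconstructed call tree rooted at the frontend span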
Trace analysis requires reliable tracing systems. Trace col-
lection systems such as ReTrace [24] can help achieve fast
and inexpensive trace collections. Trace collectors are usually
code agnostic and can emit different levels of performance
trace data back to the trace stores in near real-time. Early
summarization is also involved in the trace collection process
to help generate fine-grained events [25].
Although trace collection is common for system observ-
ability, it is still challenging to acquire high quality trace data
to train AI models. As far as we know, there are very few
public trace datasets with high quality labels. Moreover, the few existing public trace datasets, e.g. [26], are not widely adopted in AIOps research. Instead, most AIOps-related trace analysis research uses self-owned production or simulated trace data, which are generally not available publicly.
D. Other Data
Besides machine-generated observability data like metrics, logs, traces, etc., there are other types of operational data that could be used in AIOps.
Human activity records are part of this valuable data. Ticketing systems are used by DevOps/SREs to communicate and efficiently resolve issues. This process generates large amounts of human activity records. This human activity data contains rich knowledge and learnings about solutions to existing issues, which can be used to resolve similar issues in the future.
User feedback data is also very important for improving AIOps system performance. Unlike issue tickets, where a human needs to provide lots of context to describe and discuss the issue, user feedback can be as simple as one click to confirm whether an alert is good or bad. Collecting real-time user feedback on a running system and designing human-in-the-loop workflows are also very important for the success of AIOps solutions.
Although many companies collect these types of data and use them to improve their operational workflows, there is still very limited published research discussing how to systematically incorporate these other types of operational data into AIOps solutions. This brings challenges as well as opportunities for further improvements in the AIOps domain.
Next, we discuss the key AIOps tasks - Incident Detection, Failure Prediction, Root Cause Analysis, and Automated Actions - and systematically review the key contributions in the literature in these areas.
IV. INCIDENT DETECTION
Incident detection employs a variety of anomaly detection techniques. Anomaly detection aims to detect abnormalities, outliers, or generally events that are not normal. In the AIOps context, anomaly detection is widely adopted to detect any type of abnormal system behavior. To detect such anomalies, detectors need to utilize different telemetry data, such as metrics, logs and traces. Thus, anomaly detection can be further broken down by the specific telemetry data source it handles, including metric anomaly detection, log anomaly detection and trace anomaly detection. Moreover, multimodal anomaly detection techniques can be employed if multiple telemetry data sources are involved in the detection process. In recent years, deep learning based anomaly detection techniques [9] have also been widely discussed and can be utilized for anomaly detection in AIOps. Another way to distinguish anomaly detection techniques is by application use case, such as detecting service health issues, networking issues, security issues, fraudulent transactions, etc. Usually these varied techniques are derived from the same set of base detection algorithms and localized to handle specific tasks. From a technical perspective, detecting anomalies from different telemetry data sources aligns better with AI technology definitions: metrics are usually time series, logs are text / natural language, traces are event sequences/graphs, etc. In this article, we discuss anomaly detection by telemetry data source.
A. Metrics based Incident Detection
Problem Definition
To ensure the reliability of services, billions of metrics are constantly monitored and collected at equally-spaced timestamps [27]. Therefore, it is straightforward to organize metrics as time series data for subsequent analysis. Metric based incident detection, which aims to find anomalous behaviors of monitored metrics that significantly deviate from the other observations, is vital for operators to detect software failures in a timely manner and trigger failure diagnosis to mitigate losses. The most basic form of incident detection on metrics is the rule-based method, which sets up an alert when a metric breaches a certain threshold. Such an approach is only able to capture incidents which are defined by the metric exceeding the threshold, and is unable to detect more complex incidents. Rule-based methods for detecting incidents on metrics are generally too naive, able to account only for the simplest of incidents. They are also sensitive to the threshold, producing too many false positives when the threshold is too low, and false negatives when the threshold is too high. Due to the open-ended nature of incidents, the increasingly complex architectures of systems, and the increasing size of these systems and number of metrics, manual monitoring and rule-based methods are no longer sufficient. Thus, more advanced metric-based incident detection methods that leverage AI capabilities are urgently needed.
As metrics are a form of time series data, and incidents
are expressed as an abnormal occurrence in the data, metric
incident detection is most often formulated as a time series
anomaly detection problem [28], [29], [30]. In the following,
we focus on the AIOps setting and categorize it based on
several key criteria: (i) learning paradigm, (ii) dimensionality,
(iii) system, and (iv) streaming updates. We further summa-
rize a list of time series anomaly detection methods with a
comparison over these criteria in Table IV.
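To illustrate why static thresholds fall short and what a simple adaptive alternative looks like, the sketch below compares a fixed threshold with a rolling z-score rule on a synthetic metric. Both the data and the rule are illustrative assumptions, not taken from any of the surveyed methods.

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1440)                        # one day of per-minute samples
metric = 50 + 20 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 2, t.size)
metric[900] += 35                          # inject a spike-style incident

# Rule-based detection: a single static threshold misses the low-baseline spike.
static_alerts = np.where(metric > 90)[0]

# Adaptive detection: rolling z-score over a trailing window.
window = 60
alerts = []
for i in range(window, t.size):
    hist = metric[i - window:i]
    z = (metric[i] - hist.mean()) / (hist.std() + 1e-8)
    if abs(z) > 4:                         # illustrative sensitivity
        alerts.append(i)

print("static threshold alerts:", static_alerts.tolist())
print("rolling z-score alerts :", alerts)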
Learning Setting
a) Label Accessibility:
One natural way to formulate the anomaly detection problem is as a supervised binary classification problem: classifying whether a given observation is an anomaly or not [31], [32]. Formulating it as such has the benefit of being able to apply any supervised learning method, which has been intensely studied in the past decades [33]. However, due to the difficulty of obtaining labelled data for metric incident detection [34] and the fact that anomaly labels are prone to error [35], unsupervised approaches, which do not require labels to build anomaly detectors, are generally preferred and more widespread. In particular, unsupervised anomaly detection methods can be roughly categorized into density-based methods, clustering-based methods, and reconstruction-based methods [28], [29], [30]. Density-based methods compute local density and local connectivity for outlier decisions. Clustering-based methods formulate the anomaly score as the distance to a cluster center. Reconstruction-based methods explicitly model the generative process of the data and measure the anomaly score with the reconstruction error. While methods in metric anomaly detection are generally unsupervised, there are cases where there is some access to labels. In such situations, semi-supervised, domain adaptation, and active learning paradigms come into play. The semi-supervised paradigm [36], [37], [38] enables unsupervised models to leverage information from sparsely available positive labels [39]. Domain adaptation [40] relies on a labelled source dataset, while the target dataset is unlabeled, with the goal of transferring a model trained on the source dataset to perform anomaly detection on the target.
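As an example of the unsupervised setting, the sketch below builds a simple density/tree-based detector with scikit-learn's IsolationForest over synthetic metric windows. The features, data and hyperparameters are illustrative assumptions rather than a reproduction of any cited system.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Illustrative feature matrix: each row is a short window of a metric,
# summarized by simple statistics (mean, std, max). No labels are used.
normal = rng.normal(50, 5, size=(1000, 10))
anomalous = rng.normal(90, 5, size=(5, 10))
windows = np.vstack([normal, anomalous])
features = np.column_stack([windows.mean(1), windows.std(1), windows.max(1)])

detector = IsolationForest(contamination=0.01, random_state=0).fit(features)
labels = detector.predict(features)        # +1 for inliers, -1 for outliers

print("flagged window indices:", np.where(labels == -1)[0])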
b) Streaming Update: Since metrics are collected in large volumes every minute, the model is used online to detect anomalies. It is very common for the temporal patterns of metrics to change over time. The ability to perform timely model updates when receiving new incoming data is therefore an important criterion. On the one hand, conventional models can handle the data stream by retraining the whole model periodically [31], [41], [32], [38]. However, this strategy can be computationally expensive, and brings extra non-trivial questions, such as how often the retraining should be performed. On the other hand, some methods [42], [43] have efficient built-in updating mechanisms, and are naturally able to adapt to new incoming data streams. This can also support the active learning paradigm [41], which allows models to interactively query users for labels on data points they are uncertain about, and subsequently update the model with the new labels.
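As a toy example of a detector with built-in streaming updates, the following sketch maintains exponentially weighted mean and variance estimates and updates them in O(1) per observation. The update rule and thresholds are illustrative assumptions and are not drawn from the cited methods [42], [43].

class StreamingZScoreDetector:
    """Toy streaming detector: exponentially weighted mean/variance, O(1) per update."""

    def __init__(self, alpha=0.05, threshold=4.0):
        self.alpha = alpha          # how quickly the model forgets old behavior
        self.threshold = threshold  # illustrative anomaly sensitivity
        self.mean = None
        self.var = 1.0

    def update(self, x):
        if self.mean is None:       # initialize on the first observation
            self.mean = x
            return False
        z = (x - self.mean) / (self.var ** 0.5 + 1e-8)
        is_anomaly = abs(z) > self.threshold
        # Update the running statistics regardless, so the model tracks drift.
        diff = x - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return is_anomaly

detector = StreamingZScoreDetector()
stream = [50, 51, 49, 50, 52, 50, 95, 51, 50]   # synthetic metric stream with one spike
flags = [detector.update(x) for x in stream]
print(flags)   # the spike at value 95 is flagged once the statistics have warmed up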
c) Dimensionality: Each monitored metric forms a univariate time series, while a service usually contains multiple metrics, each of which describes a different part or attribute of a complex entity, together constituting a multivariate time series. The conventional solution is to build a univariate time series anomaly detector for each metric. However, for a complex system, this ignores the intrinsic interactions among metrics and cannot fully represent the system's overall status. Naively combining the anomaly detection results of each univariate time series performs poorly as a multivariate anomaly detection method [44], since it cannot model the inter-dependencies among the metrics of a service.
Model
A wide range of machine learning models can be used for time series anomaly detection, broadly classified as deep learning models, tree-based models, and statistical models. Deep learning models [45], [36], [46], [47], [38], [48], [49], [50] leverage the success and power of deep neural networks to learn representations of the time series data. These representations contain rich semantic information about the underlying metric, and can be used in a reconstruction-based, unsupervised method. Tree-based methods leverage a tree structure as a density-based, unsupervised method [42]. Statistical models [51] rely on classical statistical tests, and are considered a reconstruction-based method.
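To illustrate the reconstruction-based idea in its simplest form, the sketch below scores multivariate metric points by their PCA reconstruction error, a classical stand-in for the deep reconstruction models cited above. The synthetic data and the threshold are assumptions made for this example.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic multivariate metrics: two correlated signals (e.g. CPU and request count).
n = 500
base = rng.normal(0, 1, n)
X = np.column_stack([base, 0.8 * base + rng.normal(0, 0.2, n)])
X[450] = [3.0, -3.0]   # a point that breaks the normal correlation structure

# Fit a low-dimensional PCA model of "normal" behavior, then score by reconstruction error.
pca = PCA(n_components=1).fit(X)
reconstruction = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - reconstruction, axis=1)

threshold = errors.mean() + 4 * errors.std()   # illustrative cut-off
print("anomalous indices:", np.where(errors > threshold)[0])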
Industrial Practices
Building a system which can handle
the large amounts of metric data generated in real cloud IT
operations is often an issue. This is because the metric data
in real-world scenarios is quite diverse and the definition of
anomaly may vary in different scenarios. Moreover, almost all time series anomaly detection systems are required to handle a large number of metrics in parallel with low latency [32]. Thus,
works which propose a system to handle the infrastructure
are highlighted here. EGADS [41] is a system by Yahoo!,
scaling up to millions of data points per second, and focuses
on optimizing real-time processing. It comprises a batch time
series modelling module, an online anomaly detection module,
and an alerting module. It leverages a variety of unsupervised
methods for anomaly detection, and an optional active learning
component for filtering alerts. [52] is a system by Microsoft,
which includes three major components, a data ingestion,
experimentation, and online compute platform. They propose
an efficient deep learning anomaly detector to achieve high
accuracy and high efficiency at the same time. [32] is a
system by Alibaba group, comprising data ingestion, offline
training, online service, and visualization and alarms modules.
They propose a robust anomaly detector by using time series
decomposition, and thus can easily handle time series with
different characteristics, such as different seasonal length,
different types of trends, etc. [38] is a system by Tencent, comprising an offline model training component and an online serving component, which employs active learning to update the online model via a small number of uncertain samples.
Challenges
Lack of labels
The main challenge of metric anomaly
detection is the lack of ground truth anomaly labels [53], [44].
Due to the open-ended nature and complexity of incidents in
server architectures, it is difficult to define what an anomaly
is. Thus, building labelled datasets is an extremely labor and
resource intensive exercise, one which requires the effort of
domain experts to identify anomalies from time series data.
Furthermore, manual labelling could lead to labelling errors
as there is no unified and formal definition of an anomaly,
leading to subjective judgements on ground truth labels [35].
Real-time inference
A typical cloud infrastructure could
collect millions of data points in a second, requiring near real-
time inference to detect anomalies. Metric anomaly detection
systems need to be scalable and efficient [54], [53], optionally
supporting model retraining, leading to immense compute,
memory, and I/O loads. The increasing complexity of anomaly detection models with the rising popularity of deep learning methods [55] adds further strain on these systems due to the additional computational cost these larger models bring.
Non-stationarity of metric streams
The temporal patterns
of metric data streams typically change over time as they are
generated from non-stationary environments [56]. The evo-
lution of these patterns is often caused by exogenous factors
which are not observable. One such example is that the growth
in the popularity of a service would cause customer metrics
(e.g. request count) to drift upwards over time. Ignoring these
factors would cause a deterioration in the anomaly detector’s
performance. One solution is to continuously update the model with recent data [57], but this strategy requires carefully balancing the cost and model robustness with respect to the update frequency.
Public benchmarks
While there exist benchmarks for general anomaly detection methods and time series anomaly detection methods [33], [58], there is still a lack of benchmarking for metric incident detection in the AIOps domain. Given the wide and diverse nature of time series data, they often exhibit a mixture of different anomaly types depending on the specific domain, making it challenging to understand the pros and cons of algorithms [58]. Furthermore, existing datasets have been criticised as trivial and mislabelled [59].
Future Trends
Active learning/human-in-the-loop
To address the problem of lacking labels, a more intelligent way is to integrate human knowledge and experience at minimal cost. As special agents, humans have rich prior knowledge [60]. If the incident detection framework can encourage the machine learning model to learn from operational expert wisdom and knowledge, it would help deal with the scarce and noisy label issue. The use of active learning to update the online model in [38] is a typical example of incorporating human effort into the annotation task. There is certainly large research scope for incorporating human effort into other data processing steps, such as feature extraction. Moreover, human effort can also be integrated into the machine learning model training and inference phases.
Streaming updates
Due to the non-stationarity of metric streams, keeping the anomaly detector updated is of utmost importance. Alongside increasingly complex models and the need for cost-effectiveness, we will see a move towards methods with a built-in capability for efficient streaming updates. With the great success of deep learning methods in time series anomaly detection tasks [30], online deep learning is an increasingly popular topic [61], and we may start to see a transfer of these techniques into metric anomaly detection for time series in the near future.
Intrinsic anomaly detection
Current research works on
time series anomaly detection do not distinguish the cause
or the type of anomaly, which is critical for the subsequent mitigation steps in AIOps. For example, even if an anomaly is successfully detected, if it is caused by the extrinsic environment, the operator is unable to mitigate its negative effect. Introduced in [50], [48], intrinsic anomaly detection considers the functional dependency structure between the monitored metric and the environment. This setting considers changes in the
environment, possibly leveraging information that may not be
available in the regular (extrinsic) setting. For example, when
scaling up/down the resources serving an application (perhaps
due to autoscaling rules), we will observe a drop/increase in
CPU metric. While this may be considered as an anomaly
in the extrinsic setting, it is in fact not an incident and
accordingly, is not an anomaly in the intrinsic setting.
B. Logs based Incident Detection
Problem Definition
Software and system logging data is one of the most popular ways of recording and tracking runtime information about all ongoing processes within a system, at any arbitrary level of granularity. Overall, a large distributed system can have massive volumes of heterogeneous logs dumped by its different services or microservices, each having time-stamped text messages following their own unstructured, semi-structured or structured format. Throughout various kinds of IT Operations, these logs have been widely used by reliability and performance engineers as well as core developers in order to understand a system's internal status and to facilitate monitoring, administering, and troubleshooting [15], [16], [17], [18], [19], [20], [21], [22], [62]. More specifically, in the AIOps pipeline, one of the foremost tasks that log analysis can cater to is log based Incident Detection. This is typically achieved through anomaly detection over logs, which aims to detect the anomalous loglines or sequences of loglines that indicate the possible occurrence of an incident, from the enormous amounts of software logging data dumps generated by the system. Log based anomaly detection is generally applied once an incident has been detected based on monitoring of KPI metrics, as a more fine-grained incident detection or failure diagnosis step, in order to detect which service or microservice, or which software module of the system execution, is behaving anomalously.
Task Complexity
Diversity of Log Anomaly Patterns
: There are very diverse
kinds of incidents in AIOps which can result in different kinds
of anomaly patterns in the log data - either manifesting in the
log template (i.e. the constant part of the log line) or the log
parameters (i.e. the variable part of the log line containing
dynamic information). These are: i) keywords - the appearance of keywords in log lines bearing domain-specific semantics of failure, incident or abnormality in the system (e.g. out of memory or crash); ii) template count - where a sudden increase or decrease of log templates or log event types is indicative of an anomaly; iii) template sequence - where some significant deviation from the normal order of task execution is indicative of an anomaly; iv) variable value - some variables associated with some log templates or events can have physical meaning (e.g. time cost) which can be extracted and aggregated into a structured time series on which standard anomaly detection techniques can be applied; v) variable distribution - for some categorical or numerical variables, a deviation from the standard distribution of the variable can be indicative of an anomaly; vi) time interval - some performance issues may not be explicitly observed in the loglines themselves but in the time interval between specific log events.
Need for AI: Given the enormous volume of logs, it is often infeasible even for domain experts to manually go through the logs to detect the anomalous loglines. Additionally, as described above, depending on the nature of the incident there can be diverse types of anomaly patterns in the logs, which can manifest as anomalous keywords (like "error" or "exception") in the log templates, in the volume of specific event logs, in the distribution over log variables, or in the time interval between two specific log events. However, even for a domain expert it is not possible to come up with rules to detect all of these anomalous patterns, and even when they can, the rules would likely not be robust to diverse incident types and the changing nature of log lines as software functionalities change over time. Hence, this makes a compelling case for employing data-driven models and machine intelligence to mine and analyze this complex data source to serve the end goal of incident detection.
Log Analysis Workflow for Incident Detection
In order to handle the complex nature of the data, typically
a series of steps need to be followed to meaningfully analyze
logs to detect incidents. Starting with the raw log data or data
streams, the log analysis workflow first does some preprocess-
ing of the logs to make them amenable to ML models. This
is typically followed by log parsing which extracts a loose
structure from the semi-structured data and then grouping and
partitioning the log lines into log sequences in order to model
the sequence characteristics of the data. After this, the logs or
log sequences are represented as a machine-readable matrix
on which various log analysis tasks can be performed - like
clustering and summarizing the huge log dumps into a few key
log patterns for easy visualization or for detecting anomalous log patterns that can be indicative of an incident. Figure 7 provides an outline of the different steps in the log analysis workflow. While some of these steps are primarily engineering challenges, others are more AI-driven, and some even employ a combination of machine learning and domain-knowledge rules.
i) Log Preprocessing: This step typically involves customised filtering of specific regular expression patterns (like IP addresses or memory locations) that are deemed irrelevant for the actual log analysis. Other preprocessing steps like tokenization require specialized handling of different wording styles and patterns arising from the hybrid nature of logs, which consist of both natural language and programming language constructs. For example, a log line can contain a mix of text strings from source code, with snake_case and camelCase tokens, along with white-space-separated natural language tokens.
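A minimal preprocessing sketch along these lines might mask volatile fields and split mixed-style identifiers before parsing; the regular expressions and the sample line below are invented purely for illustration.

import re

def preprocess(line):
    # Mask fields deemed irrelevant for analysis (illustrative patterns).
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)   # IPv4 addresses
    line = re.sub(r"0x[0-9a-fA-F]+", "<ADDR>", line)              # memory locations
    # Split camelCase identifiers and break snake_case on underscores.
    line = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", line)
    line = line.replace("_", " ")
    return line.split()

print(preprocess("ConnectionManager retry_count=3 failed for 10.0.0.12 at 0x7ffdcafe"))
# ['Connection', 'Manager', 'retry', 'count=3', 'failed', 'for', '<IP>', 'at', '<ADDR>']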
ii) Log Parsing:
To enable downstream processing, unstruc-
tured log messages first need to be parsed into a structured
event template (i.e. constant part that was actually designed
by the developers) and parameters (i.e. variable part which
contain the dynamic runtime information). Figure 8 provides
one such example of parsing a single log line. In literature
there have been heuristic methods for parsing as well as AI-
driven methods which include traditional ML and also more
recent neural models. The heuristic methods like Drain [63],
IPLoM [64] and AEL [65] exploit known inductive bias on log
structure, while Spell [66] uses the longest common subsequence algorithm to dynamically extract log patterns. Of these,
Drain and Spell are most popular, as they scale well to
industrial standards. Amongst the traditional ML methods,
there are i) Clustering based methods like LogCluster [67],
LKE [68], LogSig [69], SHISO [70], LenMa [71], LogMine
[72] which assume that log message types coincide in similar
groups ii) Frequent pattern mining and item-set mining meth-
ods SLCT [73], LFA [74] to extract common message types
iii) Evolutionary optimization approaches like MoLFI [75]. On
the other hand, recent neural methods include [76], a neural Transformer based model which uses self-supervised Masked Language Modeling to learn log parsing, and UniParser [77], a unified parser for heterogeneous log data with a learnable similarity module that generalizes to diverse logs across different systems. There is yet another class of log analysis methods [78], [79] which aim at parsing-free techniques, in order to avoid the computational overhead of parsing and the errors cascading from erroneous parses, especially given the lack of robustness of parsing methods.
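As a highly simplified illustration of what a parser produces, the sketch below collapses log lines with the same token structure into a template with "<*>" parameter slots. The grouping heuristic and the sample logs are invented for this sketch; it is not the Drain or Spell algorithm.

from collections import defaultdict

# Invented sample log lines for illustration only.
logs = [
    "Connection from 10.0.0.1 closed after 120 ms",
    "Connection from 10.0.0.7 closed after 98 ms",
    "Disk /dev/sda1 usage at 91 percent",
    "Disk /dev/sdb1 usage at 77 percent",
]

def looks_variable(token):
    # Naive heuristic: tokens containing digits are treated as parameters.
    return any(ch.isdigit() for ch in token)

templates = defaultdict(list)
for line in logs:
    tokens = line.split()
    template = " ".join("<*>" if looks_variable(t) else t for t in tokens)
    params = [t for t in tokens if looks_variable(t)]
    templates[template].append(params)

for template, params in templates.items():
    print(template, "->", params)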
iii) Log Partitioning: After parsing, the next step is to partition the log data into groups, based on some semantics,
where each group represents a finite chunk of log lines or
log sequences. The main purpose behind this is to decompose
the original log dump typically consisting of millions of log
lines into logical chunks, so as to enable explicit modeling
on these chunks and allow the models to capture anomaly
patterns over sequences of log templates or log parameter
values or both. Log partitioning can be of different kinds [20], [80]: fixed or sliding window based partitions, where the length of the window is determined by the length of the log sequence or a period of time, and identifier based partitions, where logs are partitioned based on some identifier (e.g. the session or process
they originate from). Figure 9 illustrates these different choices
of log grouping and partitioning. A log event is eventually
deemed to be anomalous or not, either at the level of a log
line or a log partition.
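For example, a minimal sketch of fixed-window and identifier-based partitioning over already-parsed events might look as follows; the event tuples, field meanings and 60-second window size are illustrative assumptions.

from collections import defaultdict

# Illustrative parsed log events: (timestamp in seconds, session identifier, event template id).
events = [(0, "s1", "E1"), (2, "s2", "E3"), (5, "s1", "E2"),
          (61, "s1", "E4"), (65, "s2", "E3"), (130, "s1", "E1")]

# Fixed-window partitioning: group events into non-overlapping 60-second windows.
fixed_windows = defaultdict(list)
for ts, session, event in events:
    fixed_windows[ts // 60].append(event)

# Identifier-based partitioning: group events by the session they originate from.
session_partitions = defaultdict(list)
for ts, session, event in events:
    session_partitions[session].append(event)

print(dict(fixed_windows))        # {0: ['E1', 'E3', 'E2'], 1: ['E4', 'E3'], 2: ['E1']}
print(dict(session_partitions))   # {'s1': ['E1', 'E2', 'E4', 'E1'], 's2': ['E3', 'E3']}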
iv) Log Representation: After log partitioning, the next step is to represent each partition in a machine-readable way (e.g. a vector or a matrix) by extracting features from it. This can be done in various ways [81], [80]: either by extracting specific handcrafted features using domain knowledge, or through i) sequential representation, which converts each partition to an ordered sequence of log event ids; ii) quantitative representation, which uses count vectors weighted by the term and inverse document frequency information of the log events; or iii) semantic representation, which captures the linguistic meaning from the sequence of language tokens in the log events and learns a high-dimensional embedding vector for each token in the dataset. The nature of the log representation chosen has direct consequences for which patterns of anomalies can be supported - for example, for capturing keyword based anomalies, semantic representation might be key, while for anomalies related to template count and variable distribution, quantitative representations are possibly more appropriate. The semantic embedding vectors themselves can either be obtained using pretrained neural language models like GloVe, FastText, or pretrained Transformers like BERT, RoBERTa, etc., or learnt using a trainable embedding layer as part of the target task.
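To make the quantitative representation concrete, the sketch below turns log partitions (rendered as sequences of event ids, invented here) into count vectors and TF-IDF weighted vectors using scikit-learn; a semantic representation would instead embed the tokens with a pretrained language model.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Each partition is rendered as a space-separated sequence of log event ids (invented here).
partitions = [
    "E1 E2 E1 E3",       # a normal-looking window
    "E1 E2 E1 E3",       # another normal window
    "E5 E5 E5 E5 E1",    # a window dominated by a rare event type
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(partitions)          # quantitative (count) representation

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(partitions)           # down-weights ubiquitous events like E1

print(count_vec.get_feature_names_out())              # e.g. ['e1' 'e2' 'e3' 'e5']
print(counts.toarray())
print(tfidf.toarray().round(2))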
v) Log Analysis tasks for Incident Detection:
Once the logs
are represented in some compact machine-interpretable form
which can be easily ingested by AI models, a pipeline of
log analysis tasks can be performed on it - starting with Log
compression techniques using Clustering and Summarization,
followed by Log based Anomaly Detection. In turn, anomaly
detection can further enable downstream tasks in Incident Management like Failure Prediction and Root Cause Analysis.
Fig. 7. Steps of the Log Analysis Workflow for Incident Detection
Fig. 8. Example of Log Parsing
Fig. 9. Different types of log partitioning
In this section we discuss only the first two log analysis tasks
which are pertinent to incident detection and leave failure
prediction and RCA for the subsequent sections.
v.1) Log Compression through Clustering & Summarization: A practical first step towards analyzing the huge volumes of log data is log compression through various clustering and summarization techniques. This analysis serves two purposes. Firstly, this step can independently help site reliability engineers and service owners during incident management by providing a practical and intuitive way of visualizing these massive volumes of complex unstructured raw log data. Secondly, the output of log clustering can directly be leveraged in some of the log based anomaly detection methods.
Amongst the various techniques for log clustering, [82], [67], [83] employ hierarchical clustering and can support online settings by constructing and retrieving from a knowledge base of representative log clusters. [84], [85] use frequent pattern matching with dimension reduction techniques like PCA and locality sensitive hashing, with online and streaming support. [86], [64], [87] use efficient iterative or incremental clustering and partitioning techniques that support online and streaming logs and can also handle clustering of rare log instances. Another area of existing literature [88], [89], [90], [91] focuses on log compression through summarization - where, for example, [88] uses heuristics like log event ids and timings to summarize, [89], [21] perform OpenIE based triple extraction using semantic information and domain knowledge and rules to generate summaries, while [90], [91] use sequence clustering with linguistic rules or by grouping common event sequences.
v.2) Log Anomaly Detection: Perhaps the most common use of log analysis is log based anomaly detection, where a wide variety of models have been employed in both research and industrial settings. These models are categorized based on various factors: i) the learning setting - supervised, semi-supervised or unsupervised: while semi-supervised models assume partial knowledge of labels or access to a few anomalous instances, unsupervised ones train on normal log data and detect anomalies based on their prediction confidence; ii) the type of model - neural or traditional statistical (non-neural) models; iii) the kind of log representation used; iv) whether to use log parsing or parser-free methods; v) if using parsing, whether to encode only the log template part or both template and parameter representations; vi) whether to restrict modeling of anomalies to the level of individual log lines or to support sequential modeling of anomaly detection over log sequences.
The nature of log representation employed and the kind
of modeling used - both of these factors influence what type
of anomaly patterns can be detected - for example keyword
and variable value based anomalies are captured by semantic
representation of log lines, while template count and vari-
able distribution based anomaly patterns are more explicitly
modeled through quantitative representations of log events.
Similarly template sequence and time-interval based anomalies
need sequential modeling algorithms which can handle log
sequences.
Below we briefly summarize the body of literature dedicated to these two types of models - statistical and neural; and in Table III we provide a comparison of a more comprehensive list of existing anomaly detection algorithms and systems.
Statistical Models are the more traditional machine learning models which draw inference from various statistics underlying the training data. In the literature, various statistical ML models have been employed for this task under different training settings. Amongst the supervised methods,
[92], [93], [94] use traditional learning strategies such as Linear Regression, SVM, Decision Trees and Isolation Forest, with handcrafted features extracted from the entire logline. Most of these model the data at the level of individual loglines and cannot explicitly capture sequence-level anomalies. There are also unsupervised methods, such as i) dimension reduction techniques like Principal Component Analysis (PCA) [84] and ii) clustering and drawing correlations between log events and metric data as in [67], [82], [95], [80]. There are also unsupervised pattern mining methods, which include mining invariant patterns via singular value decomposition [96] and mining frequent patterns from execution flow and control flow graphs [97], [98], [99], [68]. Apart from these, there are also systems which employ a rule engine built using domain knowledge and an ensemble of different ML models to cater to different incident types [20], as well as heuristic methods for contrast analysis between normal and incident-indicating abnormal logs [100].
Neural Models, on the other hand are a more recent class of
machine learning models which use artificial neural networks
and have proven remarkably successful across numerous AI
applications. They are particularly powerful in encoding and
representing the complex semantics underlying in a way that
is meaningful for the predictive task. One class of unsuper-
vised neural models use reconstruction based self-supervised
techniques to learn the token or line level representation,
which includes i) Autoencoder models [101], [102] ii) more
powerful self-attention based Transformer models [103] iv)
specific pretrained Transformers like BERT language model
[104], [105], [21]. Another offshoot of reconstruction based
models is those using generative adversarial or GAN paradigm
of training for e.g. [106], [107] using LSTM or Transformer
based encoding. The other types of unsupervised models
are forecasting based, which learn to predict the next log
token or next log line in a self-supervised way - for e.g i)
Recurrent Neural Network based models like LSTM [108],
[109], [110], [18], [111] and GRU [104] or their attention
based counterparts [81], [112], [113] ii) Convolutional Neural
Network (CNN) based models [114] or more complex models
which use Graph Neural Network to represent log event data
[115], [116]. Both reconstruction and forecasting based models are capable of handling sequence level anomalies; how well they do so depends on the nature of training (i.e. whether representations are learnt at the log line or token level) and on the model's capacity to handle long sequences (amongst the above, Autoencoder models are the most basic in this respect).
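To make the forecasting-based formulation concrete, the following hedged sketch predicts the next log template id from a window of preceding ids and flags a line whose template falls outside the top-k predictions, loosely following the setup of forecasting models such as [108]; the vocabulary size, window length and top-k rule are illustrative assumptions.

```python
# Hedged sketch of a forecasting-based detector that predicts the next log
# template id from a window of previous ones (loosely in the spirit of [108]);
# all hyperparameters and the top-k rule are illustrative assumptions.
import torch
import torch.nn as nn

class NextTemplateLSTM(nn.Module):
    def __init__(self, num_templates: int, emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(num_templates, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_templates)

    def forward(self, window):            # window: (batch, seq_len) of template ids
        out, _ = self.lstm(self.emb(window))
        return self.head(out[:, -1, :])   # logits over the next template id

def is_anomalous(model, window, next_id, top_k: int = 5) -> bool:
    # A log line is flagged if its template is not among the top-k predictions.
    with torch.no_grad():
        logits = model(window.unsqueeze(0))
        topk = torch.topk(logits, top_k, dim=-1).indices.squeeze(0)
    return int(next_id) not in topk.tolist()
```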
Most of these models follow the practical setup of unsupervised training, where they train only on non-anomalous log data. However, other works have also focused on supervised
training of LSTM, CNN and Transformer models [111], [114],
[78], [117], over anomalous and normal labeled data. On
the other hand, [104], [110] use weak supervision based on
heuristic assumptions for e.g. logs from external systems are
considered anomalous. Most of the neural models use semantic token representations, some with pretrained embeddings (fixed or trainable) initialized from GloVe, fastText or transformer based models such as BERT, GPT and XLM.
vi) Log Model Deployment:
The final step in the log
analysis workflow is deployment of these models in the actual
industrial settings. It involves i) a training step, typically
over offline log data dump, with or without some supervision
labels collected from domain experts ii) online inference step,
which often needs to handle practical challenges like non-stationary streaming data, i.e. where the data distribution is not independently and identically distributed over time. To tackle this, some of the more traditional statistical
methods like [103], [95], [82], [84] support online streaming
update while some other works can also adapt to evolving
log data by incrementally building a knowledge base or
memory or out-of-domain vocabulary [101]. On the other hand, most of the unsupervised models support periodic batched online training, allowing the model to continually adapt to changing data distributions and to be deployed on high throughput streaming data sources. However, for some of the more advanced neural models, online updating might be too computationally expensive even for regular batched updates.
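The sketch below illustrates what such batched online updating can look like in its simplest form: per-template running statistics are refreshed after every mini-batch of streaming counts, so the detector follows a drifting distribution. It is purely illustrative and not a reproduction of any cited method; the decay factor and z-score threshold are assumptions.

```python
# Minimal sketch of batched online updating: running per-template statistics
# are refreshed after every mini-batch of streaming log counts, so the
# detector tracks a drifting distribution. Purely illustrative; the decay
# factor and threshold are assumptions.
import numpy as np

class StreamingCountDetector:
    def __init__(self, num_templates: int, decay: float = 0.95, z_thresh: float = 4.0):
        self.mean = np.zeros(num_templates)
        self.var = np.ones(num_templates)
        self.decay, self.z_thresh = decay, z_thresh

    def score_and_update(self, batch: np.ndarray) -> np.ndarray:
        """batch: (batch_size, num_templates) counts per time window."""
        z = np.abs(batch - self.mean) / np.sqrt(self.var + 1e-8)
        anomalous = (z > self.z_thresh).any(axis=1)      # score before updating
        # Exponentially decayed update keeps the model adapted to recent data.
        self.mean = self.decay * self.mean + (1 - self.decay) * batch.mean(axis=0)
        self.var = self.decay * self.var + (1 - self.decay) * batch.var(axis=0)
        return anomalous
```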
Apart from these, there has also been specific work on other challenges related to model deployment in practical settings, such as transfer learning across logs from different domains or applications [110], [103], [18], [118] under semi-supervised settings using only supervision from the source systems. Other works focus on evaluating model robustness and generalization (i.e. how well the model adapts) to unstable log data arising from continuous logging modifications throughout software evolutions and updates [109], [111], [104]. They achieve this by adopting domain adversarial paradigms during training [18], or using counterfactual explanations [118] or multi-task settings [21] over various log analysis tasks.
Challenges & Future Trends
Collecting supervision labels:
Like most AIOps tasks,
collecting large-scale supervision labels for training or even
evaluation of log analysis problems is very challenging and impractical as it involves a significant amount of manual intervention and domain knowledge. For log anomaly detection, where the goal is quite objective, label collection is still possible, enabling at least a reliable evaluation. Whereas, for other log
analysis tasks like clustering and summarization, collecting
supervision labels from domain experts is often not even
possible as the goal is quite subjective and hence these tasks
are typically evaluated through the downstream log analysis
or RCA task.
Imbalanced class problem:
One of the key challenges
of anomaly detection tasks is the class imbalance, stemming
from the fact that anomalous data is inherently extremely
rare in occurrence. Additionally, various systems may show
different kinds of data skewness owing to the diverse kinds
of anomalies listed above. This poses a technical challenge both during model training with highly skewed data and in the choice of evaluation metrics, since Precision, Recall and F-Score may not be very informative. Further, at inference, thresholding over the anomaly score gets particularly challenging for unsupervised models. While for benchmarking purposes evaluation metrics like AUROC (Area under the ROC curve) can suffice, practical deployment of these models requires either careful calibration of anomaly scores, manual tuning or heuristic means of setting the threshold. This threshold being quite sensitive to the application at hand also poses realistic challenges when generalizing to heterogenous logs from different systems.
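A common practical workaround is to calibrate the threshold on a held-out stream against an alert budget, as in the hedged sketch below; the budget value and the availability of a few labels for sanity-checking the operating point are assumptions.

```python
# Hedged sketch of calibrating an anomaly-score threshold from a held-out
# validation stream: pick the quantile that meets a target alert budget and,
# if a few labels exist, report precision/recall at that operating point.
# The budget and the availability of labels are assumptions.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def calibrate_threshold(val_scores: np.ndarray, alert_budget: float = 0.001) -> float:
    # Alert on at most ~0.1% of windows (tunable per application).
    return float(np.quantile(val_scores, 1.0 - alert_budget))

def report_operating_point(scores, labels, threshold):
    preds = (scores >= threshold).astype(int)
    return (precision_score(labels, preds, zero_division=0),
            recall_score(labels, preds, zero_division=0))
```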
Handling large volume of data:
Another challenge in log
analysis tasks is handling the huge volumes of logs, where
most large-scale cloud-based systems can generate petabytes
of logs each day or week. This calls for log processing algorithms that are not only effective but also lightweight, fast and efficient.
Handling non-stationary log data:
Along with the humongous volume, the natural and most practical setting of log analysis is an online streaming setting involving a non-stationary data distribution - with heterogenous log streams coming from different inter-connected micro-services, and the software logging data itself evolving over time as developers naturally keep evolving software in the agile cloud development environment. This requires efficient online update schemes for the learning algorithms and specialized effort towards building robust models and evaluating their robustness to unstable or evolving log data.
Handling noisy data:
Annotating log data is extremely challenging even for domain experts, so labels are often noisy. Supervised and semi-supervised models need to handle this noise during training, while for unsupervised models it can heavily mislead evaluation. Even though noise affects only a small fraction of logs, the extreme class imbalance aggravates this problem. Another related challenge is that of errors compounding and cascading from each of the processing steps in the log analysis workflow when performing downstream tasks like anomaly detection.
Realistic public benchmark datasets for anomaly detec-
tion:
Amongst the publicly available log anomaly detection
datasets, only a limited few contain anomaly labels. Most of
those benchmarks have been excessively used in the literature
and hence do not have much scope of furthering research.
In fact, their biggest limitation is that they fail to showcase the diverse nature of incidents that typically arise in real-world deployments. Often very simple handcrafted rules prove to be quite successful in solving anomaly detection tasks on these datasets. Also, the original scale of these datasets is several orders of magnitude smaller than real-world use-cases and hence not fit for showcasing the challenges of
online or streaming settings. Further, the volume of unique
patterns collapses significantly after the typical log processing
steps to remove irrelevant patterns from the data. On the
other hand, a vast majority of the literature is backed up
by empirical analysis and evaluation on internal proprietary
data, which cannot guarantee reproducibility. This calls for more realistic public benchmark datasets that can expose the real-world challenges of AIOps in the wild and also enable fair benchmarking across contemporary log analysis models.
Public benchmarks for parsing, clustering, summariza-
tion:
Most of the log parsing, clustering and summarization
literature only uses a very small subset of data from some of
the public log datasets, where the oracle parsing is available,
or in-house log datasets from industrial applications where
they compare with oracle parsing methods that are unscalable
in practice. This also makes fair comparison and standardized
benchmarking difficult for these tasks.
Better log language models:
Some of the recent advances in neural NLP, such as the transformer based language models BERT and GPT, have proved quite promising for representing logs in natural language style and enabling various log analysis
tasks. However there is more scope of improvement in building
neural language models that can appropriately encode the
semi-structured logs composed of fixed template and variable
parameters without depending on an external parser.
Incorporating Domain Knowledge:
While existing log
anomaly detection systems are entirely rule-based or auto-
mated, given the complex nature of incidents and the di-
verse varieties of anomalies, a more practical approach would
involve incorporating domain knowledge into these models
either in a static form or dynamically, following a human-
in-the-loop feedback mechanism. For example, in a complex system generating humongous amounts of logs, domain knowledge can indicate which kinds of incidents are more severe and which types of logs are more crucial to monitor for each kind of incident. Or even at
the level of loglines, domain knowledge can help understand
the real-world semantics or physical significance of some
of the parameters or variables mentioned in the logs. These
aspects are often hard for the ML system to gauge on its own
especially in the practical unsupervised settings.
Unified models for heterogenous logs:
Most of the log
analysis models are highly sensitive towards the nature of log
preprocessing or grouping, needing customized preprocessing
for each type of application logs. This alludes towards the
need for unified models with more generalizable preprocessing
layers that can handle heterogenous kinds of log data and also
different types of log analysis tasks. While [21] was one of
the first works to explore this direction, there is certainly more
research scope for building practically applicable models for
log analysis.
C. Traces and Multimodal Incident Detection
Problem Definition
Traces are semi-structured event logs with span information
about the topological structure of the service graph. Trace
anomaly detection relies on finding abnormal paths on the
topological graph at given moments, as well as discovering
abnormal information directly from trace event log text. There
are multiple ways to process trace data. Traces usually have timestamps and associated sequential information, so they can be converted into time-series data. Traces are also stored as trace event logs, containing rich text information. Moreover, traces store topological information which can be used to reconstruct the service graphs that represent the relations
among components of the system. From the data perspective, traces can easily be turned into multiple data modalities. Thus, we discuss trace-based anomaly detection together with multi-modal anomaly detection in this section. Recently, with the help of multi-modal deep learning technologies, trace anomaly detection can combine the different levels of information relayed by trace data and learn more comprehensive anomaly detection models [119], [120].
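The following minimal sketch shows how a batch of spans might be turned into the two views mentioned above, per-trace call sequences and a service dependency graph; the span schema used here (trace_id, span_id, parent_id, service, start_ts) is an illustrative assumption and not a specific tracing format.

```python
# Minimal sketch of turning a batch of trace spans into (i) per-trace call
# sequences and (ii) a service dependency graph. The span schema is an
# illustrative assumption, not a specific tracing format.
from collections import defaultdict
import networkx as nx

def build_views(spans):
    sequences, graph = defaultdict(list), nx.DiGraph()
    by_id = {s["span_id"]: s for s in spans}
    for s in sorted(spans, key=lambda s: s["start_ts"]):
        sequences[s["trace_id"]].append(s["service"])      # sequence view
        parent = by_id.get(s.get("parent_id"))
        if parent:                                          # topology view
            graph.add_edge(parent["service"], s["service"])
    return dict(sequences), graph
```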
Empirical Approaches
Traces draw more attention in microservice system archi-
tectures since the topological structure becomes very complex
and dynamic. Trace anomaly detection started from practical
usages for large scale system debugging [121]. Empirical trace
anomaly detection and RCA started with constructing trace
graphs and identifying abnormal structures on the constructed
graph. Constructing the trace graph from trace data is usually very time consuming, so an offline component is designed to train and construct such a trace graph. Apart from the offline component, to meet the requirement of detecting and locating issues in large scale systems, trace anomaly detection and RCA algorithms usually also have an online part to support real-time serving.
For example, Cai et al. released their study of a real-time trace-level diagnosis system, which is adopted by Alibaba
datacenters. This is one of the very few studies to deal with
real large distributed systems [122].
Most empirical trace anomaly detection works follow the offline and online design pattern to construct their graph models. In the offline modeling, unsupervised or semi-supervised
techniques are utilized to construct the trace entity graphs,
very similar to techniques in process discovery and mining
domain. For example, PageRank has been used to construct
web graphs in one of the early web graph anomaly detection
works [123]. After constructing the trace entity graphs, a
variety of techniques can be used to detect anomalies. One
common way is to compare the current graph pattern with normal graph patterns; if the current graph pattern significantly deviates from the normal patterns, the corresponding traces are reported as anomalous.
An alternative approach is using data mining and statistical
learning techniques to run dynamic analysis without construct-
ing the offline trace graph. Chen et al. proposed Pinpoint [124], a framework for root cause analysis that uses coarse-grained tagging data of real client requests, collected in real time as these requests traverse through the system, together with data mining techniques. Pinpoint discovers the correlation between success
/ failure status of these requests and fault components. The
entire approach processes the traces on-the-fly and does not
leverage any static dependency graph models.
Deep Learning Based Approaches
In recent years, deep learning techniques started to be
employed in trace anomaly detection and RCA. Also with
the help of deep learning frameworks, combining general trace graph information with the detailed information inside each trace event to train multimodal learning models becomes possible.
The Long Short-Term Memory (LSTM) network [125] is a very popular neural network model in early trace and multimodal anomaly detection. LSTM is a special type of recurrent neural network (RNN) and has proven successful in many other domains. In AIOps, LSTM is also commonly used in
metric and log anomaly detection applications. Trace data is a natural fit for RNNs, mainly in two ways: 1) The topological order of traces can be modeled as event sequences, which can easily be transformed into RNN model inputs. 2) Trace events usually carry text data that conveys rich information. The raw text, including both the structured and unstructured parts, can be transformed into vectors via standard tokenization and embedding techniques and fed to the RNN as model inputs. Such deep learning model architectures can be extended to support multimodal input, such as combining trace event vectors with numerical time series values [119].
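A hedged sketch of such a multimodal architecture is given below: each trace event contributes an embedded event id and a small vector of numeric features that are concatenated before a shared LSTM, loosely in the spirit of [119]; all dimensions and the final sigmoid scoring head are illustrative assumptions.

```python
# Hedged sketch of a multimodal sequence model: embedded event ids are
# concatenated with per-event numeric features (e.g. latency) before a shared
# LSTM, loosely in the spirit of [119]. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalTraceLSTM(nn.Module):
    def __init__(self, num_events: int, num_metrics: int, emb: int = 32, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(num_events, emb)
        self.lstm = nn.LSTM(emb + num_metrics, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, event_ids, metrics):
        # event_ids: (batch, seq); metrics: (batch, seq, num_metrics)
        x = torch.cat([self.emb(event_ids), metrics], dim=-1)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.score(out[:, -1, :]))  # trace-level anomaly score
```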
To better leverage the topological information of traces,
graph neural networks have also been introduced in trace
anomaly detection. Zhang et al. developed DeepTraLog, a trace anomaly detection technique that employs gated graph neural networks [120]. DeepTraLog targets anomaly detection problems for complex microservice systems where service entity relationships are not easy to obtain. Moreover,
the constructed graph by GGNN training can also be used
to localize the issue, providing additional root-cause analysis
capability.
Limitations
Trace data has become increasingly attractive as more applications transition from monolithic to microservice architectures. There are several challenges in machine learning based
trace anomaly detection.
Data quality.
As far as we know, there are multiple trace
collection platforms and the trace data format and quality
are inconsistent across these platforms, especially in the pro-
duction environment. To use these trace data for analysis,
researchers and developers have to spend significant time and
effort to clean and reform the data to feed machine learning
models.
Difficult to acquire labels.
It is very difficult to acquire
labels for production data. For a given incident, labeling the
corresponding trace requires identifying the incident occurring
time and location, as well as the root cause which may be
located in totally different time and location. Obtaining such
full labels for thousands of incidents is extremely difficult.
Thus, most of the existing trace analysis research still uses synthetic data to evaluate model performance, which raises doubts about whether the proposed solutions can solve problems in real production.
Insufficient multimodal and graph learning models.
Trace data are complex. Current trace analysis simplifies trace data into event sequences or time-series numerical values, even in multimodal settings. However, these existing model architectures do not fully leverage all the information in trace data in one place. Graph-based learning can potentially be a
solution but discussions of this topic are still very limited.
Offline model training.
The deep learning models in existing research rely on offline model training, partially because model training is usually very time consuming and conflicts with the goal of real-time serving. However, offline model training brings static dependencies into a dynamic system. Such dependencies may cause additional performance issues.
Future Trends
Unified trace data
Recently, OpenTelemetry has been leading the effort to unify observability telemetry data, including metrics, logs, traces, etc., across different platforms. This effort can bring huge benefits to future trace analysis. With more unified data models, AI researchers can more easily acquire the necessary data to train better models. Trained models can also easily be plugged in and reused by other parties, which can further boost model quality improvements.
Unified engine for detection and RCA
The trace graph contains rich information about the system at a given time.
With the help of trace data, incident detection and root cause
localization can be done within one step, instead of the current
two consecutive steps. Existing work has demonstrated that by
simply examining the constructed graph, the detection model
can reveal sufficient information to locate the root causes
[120].
Unified models for multimodal telemetry data
Trace data analysis gives researchers the opportunity to create a holistic view across multiple telemetry data modalities, since traces can be converted into text sequence data and time-series data.
The learnings can be extended to include logs or metrics
from different sources. Eventually we can expect unified
learning models that can consume multimodal telemetry data
for incident detection and RCA.
Online Learning
Modern systems are dynamic and ever-changing. The current two-step solution relies on offline model training and online serving or inference. Any system evolution between two offline training cycles could cause issues and degrade model performance. Thus, supporting online
learning is critical to guarantee high performance in real
production environments.
V. FAILURE PREDICTION
Incident Detection and Root-Cause Analysis of Incidents are
more reactive measures towards mitigating the effects of any
incident and improving service availability once the incident
has already occurred. On the other hand, proactive actions can be taken to predict whether a potential incident may happen in the immediate future and to prevent it from happening. Failures in software systems are a kind of highly disruptive incident that often starts by showing symptoms of deviation from the normal routine behavior of the required system functions and typically results in failure to meet the service level agreement. Failure prediction is one
such proactive task in Incident Management, whose objective
is to continuously monitor the system health by analyzing the
different types of system data (KPI metrics, logging and trace
data) and generate early warnings to prevent failures from
occurring. Consequently, in order to handle the different kinds
of telemetry data sources, the task of predicting failures can
be tailored to metric based and log based failure prediction.
We describe these two in detail in this section.
A. Metrics based Failure Prediction
Metric data are usually abundant in monitoring systems, and it is straightforward to directly leverage them to predict the occurrence of an incident in advance. As such, proactive actions can be taken to prevent the incident from happening rather than merely reducing the time to detect it. Generally, the task can be formulated as an imbalanced binary classification problem if failure labels are available, or as a time series forecasting problem if the normal ranges of the monitored metrics are defined in advance. In general, failure prediction [126] usually adopts
in advance. In general, failure prediction [126] usually adopts
machine learning algorithms to learn the characteristics of
historical failure data, build a failure prediction model, and
then deploy the model to predict the likelihood of a failure in
the future.
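The following minimal sketch casts metric-based failure prediction as an imbalanced binary classification problem with class-weighted training and PR-AUC evaluation; the windowed feature construction, lead-time labeling and model choice are illustrative assumptions rather than any specific published pipeline.

```python
# Minimal sketch of metric-based failure prediction as imbalanced binary
# classification: windowed metric features, class-weighted training and
# PR-AUC evaluation. The feature construction and lead time are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def train_failure_predictor(features: np.ndarray, failed_soon: np.ndarray):
    """features: (num_windows, num_features); failed_soon: 1 if a failure
    occurs within the chosen lead time after the window, else 0."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, failed_soon, test_size=0.3, stratify=failed_soon, random_state=0)
    clf = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0).fit(X_tr, y_tr)
    pr_auc = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
    return clf, pr_auc
```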
Methods
General Failure Prediction:
Recently, there have been increasing efforts to consider general failure incident prediction using failure signals from the whole monitoring system. [127] collected alerting signals across the whole system and discovered the dependence relationships among them, then adopted a gradient boosting tree based model to learn failure patterns. [128] proposed an effective feature engineering process to deal with complex alert data; it used multi-instance learning to handle noisy alerts, and interpretable analysis to generate an interpretable prediction result to facilitate the understanding and handling of incidents.
Specific Type Failure Prediction:
In contrast to the general failure prediction of [127] and [128], some works aim to proactively predict specific types of failures. [129] extracted statistical and textual
features from historical switch logs and applied random forest
to predict switch failures in data center networks. [130]
collected data from SMART [131] and system-level signals,
and proposed a hybrid LSTM and random forest model for node failure prediction in cloud service systems. [132] developed a disk error prediction method via a cost-sensitive ranking model. These methods target specific types of failures and are thus limited in scope in practice.
Challenges and Future Trends
While conventional supervised learning for classification or
regression problems can be used to handle failure prediction,
it needs to overcome the following main challenges. First,
datasets are usually very imbalanced due to the limited number
of failure cases. This poses a significant challenge to the
prediction model to achieve high precision and high recall
simultaneously. Second, the raw signals are usually noisy, and not all information before an incident is helpful. How to extract precursor features/patterns and filter out noise is critical to prediction performance. Third, it is common for a typical system to generate a large volume of signals per minute, leading to the challenge of updating the prediction model in a streaming fashion and handling large-scale data with limited computation resources. Fourth, post-processing of failure predictions is very important for the failure management system to improve availability. For example, providing interpretable failure predictions can help engineers take appropriate actions.
B. Log-based Failure Prediction
Like Incident Detection and Root Cause Analysis, Failure Prediction is also an extremely complex task, especially in enterprise level systems which comprise many distributed but inter-connected components, services and micro-services
interacting with each other asynchronously. One of the main
complexities of the task is to be able to do early detection
of signals alluding towards a major disruption, even while
the system might be showing only slight or manageable
deviations from its usual behavior. Because of this nature of
the problem, often monitoring the KPI metrics alone may not
suffice for early detection, as many of these metrics might
register a late reaction to a developing issue or may not be
fine-grained enough to capture the early signals of an incident.
System and software logs, on the other hand, being an all-
pervasive part of systems data continuously capture rich and
very detailed runtime information that are often pertinent to
detecting possible future failures.
Thus various proactive log based analyses have been applied in different industrial applications as a continuous monitoring task and have proved to be quite effective for more fine-grained failure prediction and localizing the source of the
potential failure. It involves analyzing the sequences of events
in the log data and possibly even correlating them with other
data sources like metrics in order to detect anomalous event
patterns that indicate towards a developing incident. This is
typically achieved in literature by employing supervised or
semi-supervised machine learning models to predict future
failure likelihood by learning and modeling the characteristics
of historical failure data. In some cases these models can
also be additionally powered by domain knowledge about the
intricate relationships between the systems. While this task has not been explored as extensively as Log Anomaly Detection and Root Cause Analysis, and there are fewer public datasets and benchmarks, software and systems maintenance logging data still plays a very important role in predicting potential future failures. In the literature, the failure prediction
task over log data has been employed in broadly two types of
systems - homogenous and heterogenous.
Failure Prediction in Homogenous Systems
In homogenous systems, like high-performance computing
systems or large-scale supercomputers, this entails prediction
of independent failures, where most systems leverage sequen-
tial information to predict failure of a single component.
Time-Series Modeling: Amongst homogenous systems, [133], [134] extract system-health-indicating features from structured logs and model this as a time series based anomaly
forecasting problem. Similarly [135] extracts specific patterns
during critical events through feature engineering and build a
supervised binary classifier to predict failures. [136] converts unstructured logs into templates through parsing and applies feature extraction and time-series modeling to predict surge, frequency and seasonality patterns of anomalies.
Supervised Classifiers: Some of the older works predict failures in a supervised classification setting using tradi-
failures in a supervised classification setting using tradi-
tional machine learning models like support vector machines,
nearest-neighbor or rule-based classifiers [137], [93], [138],
or ensemble of classifiers [93] or hidden semi-markov model
based classifier [139] over features handcrafted from log event
sequences or over random indexing based log encoding while
[140], [141] use deep recurrent neural models like LSTM over semantic representations of logs. [142] predicts and diagnoses failures by first identifying failures and then applying causality based filtering, which combines correlated events through an association rule-mining method.
Failure Prediction in Heterogenous Systems
In heterogenous systems, like large-scale cloud services, es-
pecially in distributed micro-service environment, outages can
be caused by heterogenous components. Most popular meth-
ods utilize knowledge about the relationship and dependency
between the system components, in order to predict failures.
Amongst such systems, [143] constructed a Bayesian network
to identify conditional dependence between alerting signals
extracted from system logs and past outages in an offline setting
and used gradient boosting trees to predict future outages in the
online setting. [144] uses a ranking model combining temporal
features from LSTM hidden states and spatial features from
Random Forest to rank relationships between failure indicating
alerts and outages. [145] trains trace-level and micro-service
level prediction models over handcrafted features extracted
from trace logs to detect three common types of micro-service
failures.
VI. ROOT CAUSE ANALYSIS
Root-cause Analysis (RCA) is the process to conduct a
series of actions to discover the root causes of an incident.
RCA in DevOps focuses on building the standard process
workflow to handle incidents more systematically. Without AI,
RCA is more about creating rules that any DevOps member
can follow to solve repeated incidents. However, it is not
scalable to create separate rules and process workflow for
each type of repeated incident when the systems are large
and complex. AI models are capable of processing high volumes of input data and learning representations from existing incidents and how they were handled, without requiring humans to define every single detail of the workflow. Thus, AI-based RCA has huge
potential to reform how root cause can be discovered.
In this section, we discuss a series of AI-based RCA topics,
separated by the input data modality: metric-based, log-based,
trace-based and multimodal RCA.
A. Metric-based RCA
Problem Definition
With the rapidly growing adoption of microservices ar-
chitectures, multi-service applications have become the standard paradigm in real-world IT applications. A multi-service ap-
plication usually contains hundreds of interacting services,
making it harder to detect service failures and identify the
root causes. Root cause analysis (RCA) methods leverage the
KPI metrics monitored on those services to determine the root
causes when a system failure is detected, helping engineers and
SREs in the troubleshooting process*. The key idea behind
RCA with KPI metrics is to analyze the relationships or
dependencies between these metrics and then utilize these
relationships to identify root causes when an anomaly occurs.
Typically, there are two types of approaches: 1) identifying
the anomalous metrics in parallel with the observed anomaly
via metric data analysis, and 2) discovering a topology/causal
graph that represents the causal relationships between the
services and then identifying root causes based on it.
Metric Data Analysis
When an anomaly is detected in a multi-service application,
the services whose KPI metrics are anomalous can possibly
be the root causes. The first approach directly analyzes these
KPI metrics to determine root causes based on the assumption
that significant changes in one or multiple KPI metrics happen
when an anomaly occurs. Therefore, the key is to identify
whether a KPI metric has pattern or magnitude changes in a
look-back window or snapshot of a given size at the anomalous
timestamp.
Nguyen
et al.
[146], [147] propose two similar RCA meth-
ods by analyzing low-level system metrics, e.g., CPU, memory
and network statistics. Both methods first detect abnormal
behaviors for each component via a change point detection
algorithm when a performance anomaly is detected, and then
determine the root causes based on the propagation patterns
obtained by sorting all critical change points in a chronological
order. Because a real-world multi-service application usually
has hundreds of KPI metrics, the change point detection
algorithm must be efficient and robust. [146] provides an algo-
rithm by combining cumulative sum charts and bootstrapping
to detect change points. To identify the critical change point
from the change points discovered by this algorithm, they use
a separation level metric to measure the change magnitude for
each change point and extract the critical change point whose
separation level value is an outlier. Since the earliest anomalies
may have propagated from their corresponding services to
other services, the root causes are then determined by sorting
the critical change points in a chronological order. To further
improve root cause pinpointing accuracy, [147] develops a
new fault localization method by considering both propagation
patterns and service component dependencies.
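A hedged sketch of this style of analysis is shown below: a simple CUSUM-like statistic locates a change point per KPI metric, and candidate services are then ranked by their earliest change; this is a simplified stand-in for the cumulative-sum-and-bootstrapping procedure of [146], with the statistic and ranking rule chosen for illustration.

```python
# Hedged sketch of metric-analysis-style RCA in the spirit of [146]: detect a
# change point per KPI metric with a simple CUSUM statistic, then rank
# candidate services chronologically by their earliest change. Both the CUSUM
# form and the ranking rule are simplified illustrative assumptions.
import numpy as np

def cusum_change_point(series: np.ndarray, drift: float = 0.5):
    z = (series - series.mean()) / (series.std() + 1e-8)
    pos, cp, best = 0.0, None, 0.0
    for t, v in enumerate(z):
        pos = max(0.0, pos + v - drift)
        if pos > best:
            best, cp = pos, t
    return cp, best                      # rough change index and its magnitude

def rank_by_earliest_change(metrics_by_service: dict):
    """metrics_by_service: {service: {metric_name: np.ndarray}} -> ranked services."""
    candidates = []
    for service, metrics in metrics_by_service.items():
        changes = [cusum_change_point(s) for s in metrics.values()]
        changes = [(t, m) for t, m in changes if t is not None]
        if changes:
            earliest = min(t for t, _ in changes)
            candidates.append((earliest, service))
    return [svc for _, svc in sorted(candidates)]   # earliest change first
```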
Instead of change point detection, Shan et al. [148] developed a low-cost RCA method called ε-Diagnosis to detect root causes of small-window long-tail latency for web services. ε-Diagnosis assumes that the root cause metrics of an abnormal service show significant changes between the abnormal and normal periods. It applies a two-sample test algorithm and ε-statistics for measuring the similarity of time series to identify root causes. In the two-sample test, one sample (the normal sample) is drawn from the snapshot during the normal period while the other sample (the anomaly sample) is drawn during the anomalous period. If the difference between the anomaly sample and the normal sample is statistically significant, the corresponding metrics of the samples are potential root causes.
*A good survey of anomaly detection and RCA in cloud applications is [22].
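The sketch below illustrates the two-sample-test idea: each metric's samples from the normal and anomalous windows are compared, and the significantly divergent metrics are kept as candidates. A Kolmogorov-Smirnov test is used here as an illustrative substitute for the ε-statistics of [148], and the significance level is an assumption.

```python
# Minimal sketch of the two-sample-test idea behind ε-Diagnosis [148], using a
# KS test as an illustrative substitute for the paper's ε-statistics.
from scipy.stats import ks_2samp

def candidate_root_cause_metrics(normal_window: dict, abnormal_window: dict,
                                 alpha: float = 0.01):
    """Both arguments map metric name -> 1-D array of samples."""
    candidates = []
    for name, normal_values in normal_window.items():
        stat, p_value = ks_2samp(normal_values, abnormal_window[name])
        if p_value < alpha:              # distributions differ significantly
            candidates.append((name, stat))
    return sorted(candidates, key=lambda x: -x[1])   # most divergent metrics first
```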
Topology or Causal Graph-based Analysis
The advantage of metric data analysis methods is their ability to handle millions of metrics. But most of them do not consider the dependencies between services in an application. The second type of RCA approach leverages
such dependencies, which usually involves two steps, i.e.,
constructing topology/causal graphs given the KPI metrics and
domain knowledge, and extracting anomalous subgraphs or
paths given the observed anomalies. Such graphs can either
be reconstructed from the topology (domain knowledge) of a
certain application ([149], [150], [151], [152]) or automatically
estimated from the metrics via causal discovery techniques
([153], [154], [155], [156], [157], [158], [159]). To identify
the root causes of the observed anomalies, random walk (e.g.,
[160], [156], [153]), page-rank (e.g., [150]) or other techniques
can be applied over the discovered topology/causal graphs.
When the service graphs (the relationships between the
services) or the call graphs (the communications among the
services) are available, the topology graph of a multi-service
application can be reconstructed automatically, e.g., [149],
[150]. But such domain knowledge is usually unavailable or
partially available especially when investigating the relation-
ships between the KPI metrics instead of API calls. Therefore,
given the observed metrics, causal discovery techniques, e.g.,
[161], [162], [163] play a significant role in constructing
the causal graph describing the causal relationships between
these metrics. The most popular causal discovery algorithm
applied in RCA is the well-known PC-algorithm [161] due
to its simplicity and explainability. It starts from a complete
undirected graph and eliminates edges between the metrics
via conditional independence test. The orientations of the
edges are then determined by finding V-structures followed
by orientation propagation. Some variants of the PC-algorithm
[164], [165], [166] can also be applied based on different data
properties.
Given the discovered causal graph, the possible root causes
of the observed anomalies can be determined by random walk.
A random walk on a graph is a random process that begins at
some node, and randomly moves to another node at each time
step. The probability of moving from one node to another is
defined in the transition probability matrix. Random walk
for RCA is based on the assumption that a metric that is more
correlated with the anomalous KPI metrics is more likely to be
the root cause. Each random walk starts from one anomalous
node corresponding to an anomalous metric, then the nodes
visited the most frequently are the most likely to be the root
causes. The key of random walk approaches is to determine the
transition probability matrix. Typically, there are three steps
for computing the transition probability matrix, i.e., forward
step (probability of walking from a node to one of its parents),
backward step (probability of walking from a node to one of
its children) and self step (probability of staying in the current
node). For example, [153], [158], [159], [150] compute these
probabilities based on the correlation of each metric with the
detected anomalous metrics during the anomaly period. But
correlation based random walk may not accurately localize
root cause [156]. Therefore, [156] proposes to use the partial
correlations instead of correlations to compute the transition
probabilities, which can remove the effect of the confounders
of two metrics.
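The following hedged sketch implements a correlation-weighted random walk with the forward, backward and self steps described above; the specific weighting (correlation for parents, a damped correlation for children) and the visit-count ranking are illustrative simplifications of methods such as [153].

```python
# Hedged sketch of a correlation-weighted random walk over a discovered causal
# graph, following the forward/backward/self-step construction described in
# the text; weights and the damping factor rho are illustrative assumptions.
import numpy as np
import networkx as nx

def random_walk_rca(graph: nx.DiGraph, corr: dict, start: str,
                    steps: int = 10000, rho: float = 0.1, seed: int = 0):
    """graph: causal graph over metrics; corr[m]: |correlation| of metric m
    with the anomalous KPI; start: the anomalous KPI node."""
    rng, visits, node = np.random.default_rng(seed), {n: 0 for n in graph}, start
    for _ in range(steps):
        parents = list(graph.predecessors(node))         # forward step candidates
        children = list(graph.successors(node))          # backward step candidates
        moves = [(p, corr.get(p, 0.0)) for p in parents]
        moves += [(c, rho * corr.get(c, 0.0)) for c in children]
        moves += [(node, corr.get(node, 0.0))]           # self step
        nodes, weights = zip(*moves)
        weights = np.asarray(weights) + 1e-8
        node = rng.choice(nodes, p=weights / weights.sum())
        visits[node] += 1
    return sorted(visits, key=visits.get, reverse=True)  # most-visited nodes first
```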
Besides random walk, other causal graph analysis tech-
niques can also be applied. For example, [157], [155] find
root causes for the observed anomalies by recursively visiting
all the metrics that are affected by the anomalies, e.g., if
the parents of an affected metric are not affected by the
anomalies, this metric is considered a possible root cause.
[167] adopts a breadth-first search (BFS) based algorithm to find root causes. The search starts from one
anomalous KPI metric and extracts all possible paths outgoing
from this metric in the causal graph. These paths are then
sorted based on the path length and the sum of the weights
associated to the edges in the path. The last nodes in the
top paths are considered as the root causes. [168] considers
counterfactuals for root cause analysis based on the causal
graph, i.e., given a functional causal model, it finds the root
cause of a detected anomaly by computing the contribution of
each noise term to the anomaly score, where the contributions
are symmetrized using the concept of Shapley values.
Limitations
Data Issues
For a multi-service application with hundreds
of KPI metrics monitored on each service, it is very chal-
lenging to determine which metrics are crucial for identifying
root causes. The collected data usually doesn’t describe the
whole picture of the system architecture, e.g., missing some
important metrics. These missing metrics may be the causal
parents of other metrics, which violates the assumption of
PC algorithms that no latent confounders exist. Besides, due
to noises, non-stationarity and nonlinear relationships in real-
world KPI metrics, recovering accurate causal graphs becomes
even harder.
Lack of Domain Knowledge
The domain knowledge about
the monitored application, e.g., service graphs and call graphs,
is valuable to improve RCA performance. But for a complex
multi-service application, even developers may not fully un-
derstand the meanings or the relationships of all the monitored
metrics. Therefore, the domain knowledge provided by experts
is usually partially known, and sometimes conflicts with the
knowledge discovered from the observed data.
Causal Discovery Issues
The RCA methods based on
causal graph analysis leverage causal discovery techniques
to recover the causal relationships between KPI metrics. All
these techniques have certain assumptions on data properties
which may not be satisfied with real-world data, so the
discovered causal graph always contains errors, e.g., incorrect
links or orientations. In recent years, many causal discovery
methods have been proposed with different assumptions and
characteristics, so that it is difficult to choose the most suitable
one given the observed data.
Human in the Loop
After DevOps or SRE teams receive
the root causes identified by a certain RCA method, they will
do further analysis and provide feedback about whether these
root causes make sense. Most RCA methods cannot leverage
such feedback to improve RCA performance, or provide
explanations why the identified root causes are incorrect.
Lack of Benchmarks
Different from incident detection
problems, we lack benchmarks to evaluate RCA performance,
e.g., few public datasets with groundtruth root causes are
available, and most previous works use private internal datasets
for evaluation. Although some multi-service application de-
mos/simulators can be utilized to generate synthetic datasets
for RCA evaluation, the complexity of these demo applications
is much lower than real-world applications, so that such evalu-
ation may not reflect the real performance in practice. The lack
of public real-world benchmarks hampers the development of
new RCA approaches.
Future Trends
RCA Benchmarks
Benchmarks for evaluating the per-
formance of RCA methods are crucial for both real-world
applications and academic research. The benchmarks can
either be a collection of real-world datasets with groundtruth
root causes or some simulators whose architectures are close
to real-world applications. Constructing such large-scale real-
world benchmarks is essential for boosting novel ideas or
approaches in RCA.
Combining Causal Discovery and Domain Knowledge
The domain knowledge provided by experts is valuable to
improve causal discovery accuracy, e.g., providing required or
forbidden causal links between metrics. But sometimes such
domain knowledge introduces more issues when recovering
causal graphs, e.g., conflicts with data properties or conditional
independence tests, introducing cycles in the graph. How to
combine causal discovery and expert domain knowledge in a
principled manner is an interesting research topic.
Putting Human in the Loop
Integrating human interactions
into RCA approaches is important for real-world applications.
For instance, the causal graph can be built in an iterative way,
i.e., an initial causal graph is reconstructed by a certain causal
discovery algorithm, and then users examine this graph and
provide domain knowledge constraints (e.g., which relation-
ships are incorrect or missing) for the algorithm to revise the
graph. The RCA reports with detailed analysis about incidents
created by DevOps or SRE teams are valuable to improve
RCA performance. How to utilize these reports to improve
RCA performance is another important research topic.
B. Log-based RCA
Problem Definition
Triaging and root cause analysis is one of the most complex
and critical phases in the Incident Management life cycle.
Given the nature of the problem, which is to investigate the origin or root cause of an incident, simply analyzing the end KPI metrics often does not suffice. Especially in a micro-
service application setting or distributed cloud environment
with hundreds of services interacting with each other, RCA
and failure diagnosis is particularly challenging. In order to
localize the root cause in such complex environments, engi-
neers, SREs and service owners typically need to investigate core system data. Logs are one such ubiquitous form of systems data containing rich runtime information. Hence, one of the ultimate objectives of log analysis tasks is to enable triaging of incidents and localization of root causes to diagnose faults and failures.
Starting with heterogenous log data from different sources
and microservices in the system, typical log-based AIOps
workflows first have a layer of log processing and analysis,
involving log parsing, clustering, summarization and anomaly
detection. The log analysis and anomaly detection can then
cater to a causal inference layer that analyses the relationships
and dependencies between log events and possibly detected
anomalous events. These signals extracted from logs within or
across different services can be further correlated with other
observability data like metrics, traces etc in order to detect the
root cause of an incident. Typically this involves constructing a
causal graph or mining a knowledge graph over the log events
and correlating them with the KPI metrics or with other forms
of system data like traces or service call graphs. Through these,
the objective is to analyze the relationships and dependencies
between them in order to eventually identify the possible root
causes of an anomaly. Unlike the more concrete problems like
log anomaly detection, log based root cause analysis is a much
more open-ended task. Subsequently most of the literature on
log based RCA has been focused on industrial applications
deployed in real-world and evaluated with internal benchmark
data gathered from in-house domain experts.
Typical types of Log RCA methods
In the literature, the task of log based root cause analysis has been explored through various kinds of approaches. While some of the works build a knowledge graph or knowledge base and leverage data mining based solutions, others follow fundamental principles from causal machine learning and causal knowledge mining. Other than these, there are also log based RCA systems using traditional machine learning models which rely on feature engineering, correlational analysis or supervised classifiers to detect the root cause.
Handcrafted features based methods:
[169] uses hand-
crafted feature engineering and probabilistic estimation of
specific types of root causes tailored for Spark logs. [170]
uses frequent item-set mining and association rule mining on
feature groups for structured logs.
Correlation based Methods:
[171], [172] localizes root
cause based on correlation analysis using mutual information
between anomaly scores obtained from logs and monitored
metrics. Similarly [173] use PCA, ICA based correlation
analysis to capture relationships between logs and consequent
failures. [84], [174] use PCA to detect abnormal system call sequences, which they map to application functions through frequent pattern mining. [175] uses LSTM based sequential modeling of log templates, identified through pattern matching over clusters of similar logs, in order to predict failures.
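As a minimal illustration of this correlation-based style [171], [172], the sketch below ranks components by the mutual information between their log anomaly-score series and the KPI series during the incident; it assumes the series have already been aligned on a common time grid.

```python
# Minimal sketch of correlation-style log RCA in the spirit of [171], [172]:
# rank components by the mutual information between their log anomaly-score
# series and the KPI series. Time alignment is assumed to be done already.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def rank_components_by_mi(kpi: np.ndarray, log_scores_by_component: dict):
    """kpi: (T,) KPI series; log_scores_by_component: {component: (T,) scores}."""
    ranked = []
    for component, scores in log_scores_by_component.items():
        mi = mutual_info_regression(scores.reshape(-1, 1), kpi, random_state=0)[0]
        ranked.append((component, float(mi)))
    return sorted(ranked, key=lambda x: -x[1])   # most related components first
```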
Supervised Classifier based Methods:
[176] does auto-
mated detection of exception logs and comparison of new
error patterns with normal cloud behaviours on OpenStack by
learning supervised classifiers over statistical and neural rep-
resentations of historical failure logs. [177] employs a statistical technique on the data distribution to identify the fine-grained
category of a performance problem and fast matrix recovery
RPCA to identify the root cause. [178], [179] use KNN or its supervised versions to identify the loglines that led to a failure.
Knowledge Mining based Methods:
[180], [181] take a
different approach of summarizing log events into an entity-
relation knowledge graph by extracting custom entities and
relationships from log lines and mining temporal and proce-
dural dependencies between them from the overall log dump.
This gives a more structured representation of the log summary and an intuitive way of aggregating knowledge from logs; it is also a way to bridge the knowledge gap between the developer community who creates the log data and the site reliability engineers who typically consume the log data when investigating incidents. Eventually, the end goal of constructing this knowledge graph representation of logs is to facilitate RCA. While these works do provide use-cases like case studies on RCA for this vision, they leave ample scope for research towards a more concrete usage of this kind of knowledge mining in RCA.
Knowledge Graph based Methods:
Amongst knowledge
graph based methods, [182] diagnoses and triages performance
failure issues in an online fashion by continuously building a
knowledge base out of rules extracted from a random forest
constructed over log data using heuristics and domain knowl-
edge. [151] constructs a system graph from the combination
of KPI metrics and log data. Based on the detected anomalies
from these data sources, it extracts anomalous subgraphs from
it and compares them with the normal system graph to detect
the root cause. Other works mine normal log patterns [183] or time-weighted control flow graphs [99] from normal executions and estimate the divergence of executions during ongoing failures from them to suggest root causes. [184], [185],
[186] mine execution sequences or user actions [187], either from normal runs and manually injected failures or from well and poorly performing systems, into a knowledge base and utilize the assumption that similar faults generate similar failures to match and diagnose the type of failure. Most of these knowledge
based approaches incrementally expand their knowledge or
rules to cater to newer incident types over time.
Causal Graph based Methods:
[188] uses multivariate time-series modeling over logs by representing them as error event counts. This work then infers their causal relationship with the KPI error rate using PageRank-style centrality detection in order to identify the top root causes. [167] constructs
a knowledge graph over operation and maintenance entities
extracted from logs, metrics, traces and system dependency
graphs and mines causal relations using PC algorithm to detect
root causes of incidents. [189] uses a Knowledge informed
Hierarchical Bayesian Network over features extracted from
metric and log based anomaly detection to infer the root
causes. [190] constructs dynamic causality graph over events
extracted from logs, metrics and service dependency graphs.
[191] similarly constructs a causal dependency graph over log
events by clustering and mining similar events and uses it to
infer the process in which the failure occurs.
Also, on a related domain of network analysis, [192],
[193], [194] mines causes of network events through causal
analysis on network logs by modeling the parsed log template
counts as a multivariate time series. [195], [156] use causality
inference on KPI metrics and service call graphs to localize
root causes in microservice systems and one of the future
research directions is to also incorporate unstructured logs into
such causal analysis.
Challenges & Future Trends
Collecting supervision labels:
Being a complex and open-ended task, it is challenging
and requires a lot of domain expertise and manual effort to col-
lect supervision labels for root cause analysis. While a small
scale supervision can still be availed for evaluation purposes,
reaching the scale required for training these models is simply
not practical. At the same time, because of the complex nature
of the problem, completely unsupervised models often perform
quite poorly.
Data quality:
The workflow of RCA over hetero-
geneous unstructured log data typically involves various dif-
ferent analysis layers, preprocessing, parsing, partitioning and
anomaly detection. This results in compounding and cascading
of errors (both labeling errors as well as model prediction
errors) from these components, needing the noisy data to be
handled in the RCA task. In addition to this, the extremely
challenging nature of RCA labeling task further increases the
possibility of noisy data.
Imbalanced class problem:
RCA on
voluminous logs poses an additional problem of extreme
class imbalance - where out of millions of log lines or log
templates, a very sparse few instances might be related to the
true root cause.
Generalizability of models:
Most of the exist-
ing literature on RCA tailors its approach very specifically towards its own application and cannot be easily adopted even by other similar systems. This points towards the need for more generalizable architectures for modeling the RCA task, which in turn needs more robust, generalizable log analysis models that can handle heterogenous kinds of log data coming
from different systems.
Continual learning framework:
One
of the challenging aspects of RCA in the distributed cloud
setting is the agile environment, leading to new kinds of
incidents and evolving causation factors. This kind of non-
stationary learning setting poses non-trivial challenges for
RCA but is indeed a crucial aspect of all practical industrial
applications.
Human-in-the-loop framework:
While neither completely supervised nor completely unsupervised settings are practical for this task, there is a need to support a human-in-the-loop framework which can incorporate feedback from domain experts to improve the system, especially in the agile settings
where causation factors can evolve over time.
Realistic public benchmarks:
The majority of the literature in this area is focused
on industrial applications with in-house evaluation setting. In
some cases, they curate their internal testbed by injecting
failures or faults or anomalies in their internal simulation
environment (for e.g. injecting CPU, memory, network and
Disk anomalies in Spark platforms) or in popular testing
settings (like Grid5000 testbed or open-source microservice
applications based on online shopping platform or train ticket
booking or open source cloud operating system OpenStack).
Other works evaluate by deploying their solution in real-
world setting in their in-house cloud-native application, for
e.g. on IBM Bluemix platform, or for Facebook applications
or over hundreds of real production services at big data cloud
computing platforms like Alibaba or thousands of services
at e-commerce enterprises like eBay. One of the striking
limitations in this regard is the lack of any reproducible open-
source public benchmark for evaluating log based RCA in
practical industrial settings. This can hinder more open ended
research and fair evaluation of new models for tackling this
challenging task.
C. Trace-based and Multimodal RCA
Problem Definition.
Ideally, RCA for a complex system
needs to leverage all kinds of available data, including machine
generated telemetry data and human activity records, to find
potential root causes of an issue. In this section we discuss
trace-based RCA together with multi-modal RCA. We also
include studies about RCA based on human records such as
incident reports. Ultimately, the RCA engine should aim to
process any data types and discover the right root causes.
RCA on Trace Data
In a previous section (Section IV-C) we discussed how trace data can be treated as multimodal data for anomaly detection. Similar to trace anomaly detection, trace root cause analysis also leverages the topological structure of the service map. However, instead of detecting abnormal traces or paths, trace RCA usually starts after issues are detected. Trace RCA techniques help ease the troubleshooting processes of engineers and SREs, and trace RCA can be triggered in a more ad-hoc way instead of running continuously. This differentiates the techniques that can be adopted from those used for trace anomaly detection.
Trace Entity Graph.
From the technical point of view, trace
RCA and trace anomaly detection share similar perspectives.
To the best of our knowledge, there are few existing works addressing trace RCA alone. Instead, trace RCA serves
as an additional feature or side benefit for trace anomaly
detection in either empirical approaches [121] [196] or deep
learning approaches [120] [197]. In trace anomaly detection,
the constructed trace entity graph (TEG) after offline training
provides a clean relationship between each component in the
application systems. Thus, besides anomaly detection, [122]
implemented a real-time RCA algorithm that discovers the
deepest root of the issues via relative importance analysis
after comparing the current abnormal trace pattern with normal
trace patterns. Their experiment in the production environment
demonstrated this RCA algorithm can achieve higher precision
and recall compared to naive fixed threshold methods. The effectiveness of leveraging the trace entity graph for root cause analysis is also proven in deep learning based trace anomaly detection approaches. Liu
et al.
[198] proposed a multimodal
LSTM model for trace anomaly detection. The RCA algorithm then checks every anomalous trace against the traces used for model training and discovers the root cause by localizing the next called microservice which is not in the normal call paths. According to the evaluation in this work, the algorithm performs well on both synthetic datasets and production datasets of four large production services.
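The hedged sketch below captures the call-path comparison idea used by [198]: walk the anomalous trace's call sequence and report the first hop that never appears in the call paths collected from normal traces; representing paths as ordered service lists is an illustrative simplification.

```python
# Hedged sketch of the call-path comparison idea of [198]: report the first
# hop in an anomalous trace that never occurs in normal call paths. The
# path representation is an illustrative simplification.
def localize_by_call_path(anomalous_path, normal_paths):
    """anomalous_path: list of services in call order; normal_paths: iterable
    of such lists observed during normal operation."""
    normal_edges = {(p[i], p[i + 1]) for p in normal_paths for i in range(len(p) - 1)}
    for caller, callee in zip(anomalous_path, anomalous_path[1:]):
        if (caller, callee) not in normal_edges:
            return callee        # first microservice outside the normal call paths
    return None                  # path structure looks normal
```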
Online Learning.
An alternative approach is using data
mining and statistical learning techniques to run dynamic
analysis without constructing the offline trace graph. Tra-
ditional trace management systems usually provide basic analytical capabilities to diagnose issues and discover root causes [199]. Such analysis can be performed online without a costly model training process. Pinpoint [124] by Chen et al., introduced in Section IV-C, also falls in this category: it uses data mining over coarse-grained tagging data of real client requests, collected in real time as the requests traverse the system, to discover the correlation between the success or failure status of these requests and the faulty components, processing the traces on-the-fly without any static dependency graph models. Another related area is using troubleshooting guide data, where [200] recommends troubleshooting guides based on semantic similarity with the incident description, while [201] focuses on automating troubleshooting guides into execution workflows as a way to remediate the incident.
RCA on Incident Reports
Another notable direction in the AIOps literature has been
mining useful knowledge from domain-expert curated data
(incident reports, incident investigation data, bug reports, etc.)
towards the final goals of root cause analysis and automated
remediation of incidents. This is an open-ended task which can
serve various purposes: structuring and parsing unstructured or
semi-structured data, extracting targeted information or topics
from it (using topic modeling or information extraction), and
mining and aggregating knowledge into a structured form.
The end goal of these tasks is mostly root cause analysis,
while some also focus on recommending remediations to mitigate
the incident. In most cloud-based settings, an increasing number
of incidents occur repeatedly over time, showing similar symptoms
and having similar root causes. This makes mining and curating
knowledge from various data sources crucial, so that it can be
consumed by data-driven AI models or by domain experts for
better knowledge reuse.
Causality Graph.
[202] extracts and mines a causality graph
from historical incident data and uses human-in-the-loop
supervision and feedback to further refine the causality graph.
[203] constructs an anomaly correlation graph, FacGraph, using
a distributed frequent pattern mining algorithm. [204] recommends
appropriate healing actions by adapting remediations
retrieved from similar historical incidents. Though the end task
involves remediation recommendation, the system still needs
to understand the nature of the incident and its root cause in
order to retrieve meaningful past incidents.
Knowledge Mining.
[205], [206] mine knowledge graphs
from named entities and relations extracted from incident reports
using LSTM-based CRF models. [207] extracts symptoms,
root causes and remediations from past incident investigations
and builds a neural search engine and knowledge graph
to facilitate retrieval-based root cause and remediation
recommendation for recurring incidents.
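To make the retrieval step concrete, the following minimal sketch recommends the root cause and remediation of the most similar past incident using TF-IDF cosine similarity. It stands in for, and is much simpler than, the neural search described in [207]; the incident texts and fields are illustrative.

```python
# Minimal sketch of retrieval-based recommendation for recurring incidents,
# using TF-IDF similarity instead of the neural search described in [207].
# Incident records below are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_incidents = [
    {"summary": "checkout latency spike after cache node eviction",
     "root_cause": "cache node eviction",
     "remediation": "warm cache, add replicas"},
    {"summary": "login failures due to expired TLS certificate",
     "root_cause": "expired certificate",
     "remediation": "rotate certificate"},
]

def recommend(new_summary: str, top_k: int = 1):
    """Return the top_k most similar past incidents with similarity scores."""
    vec = TfidfVectorizer()
    past_matrix = vec.fit_transform([inc["summary"] for inc in past_incidents])
    query_vec = vec.transform([new_summary])
    sims = cosine_similarity(query_vec, past_matrix).ravel()
    ranked = sims.argsort()[::-1][:top_k]
    return [(past_incidents[i], float(sims[i])) for i in ranked]

if __name__ == "__main__":
    for inc, score in recommend("payment checkout page slow, cache misses rising"):
        print(f"{score:.2f}", inc["root_cause"], "->", inc["remediation"])
```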
Future Trends
More Efficient Trace Platform.
Currently there are very
limited studies on trace-related topics. A fundamental challenge
lies in the trace platforms: there are bottlenecks in the collection,
storage, query and management of trace data. Traces are
usually at a much larger scale than logs and metrics. How
to collect, store and retrieve trace data more efficiently is
therefore critical to the success of trace root cause analysis.
Online Learning.
Compared to trace anomaly detection,
online learning plays a more important role for trace RCA,
especially for large cloud systems. An RCA tool usually needs
to analyze evidence on the fly and correlate the most
suspicious evidence with the ongoing incident, which is
very time sensitive. For example, we know that a trace entity graph
(TEG) can achieve accurate trace RCA, but the precondition
is that the TEG reflects the current state of the system. If
offline training is the only way to obtain the TEG, the performance
of such approaches in real-world production environments is
always questionable. Thus, using online learning to obtain the
TEG is a much better way to guarantee high performance in
this situation.
Causality Graphs on Multimodal Telemetries.
The most
precious information conveyed by trace data is the complex
topological order of large systems. Without traces, causal analysis
for system operations relies on temporal and spatial
correlations to infer causal relationships, and practically very
few existing causal inference techniques can be adopted in real-world
systems. With traces, however, it is very convenient to obtain
the ground truth of how requests flow through the entire
system. Thus, we believe higher quality causal graphs will
be much easier to achieve if they can be learned from multimodal
telemetry data.
Complete Knowledge Graph of Systems.
Currently, knowledge mining has been attempted for single data
types. However, to reflect the full picture of a complex system, AI
models need to mine knowledge from all kinds of data
types, including metrics, logs, traces, incident reports and other
system activity records, and then construct a knowledge graph with
complete system information.
VII. AUTOMATED ACTIONS
While both the incident detection and RCA capabilities of
AIOps help provide information about ongoing issues, taking
the right actions is the step that actually solves the problem.
Without automation to take actions, human operators will
still be needed in every single ops task. Thus, automated
actions are critical to building fully-automated end-to-end AIOps
systems. Automated actions contribute to both short-term
and longer-term actions: 1) short-term remediation:
immediate actions to quickly remediate the issue, including
server rebooting, live migration, automated scaling, etc.; and
2) longer-term resolutions: actions or guidance for tasks such
as code bug fixing, software updating, hardware build-out and
resource allocation optimization. In this section, we discuss three
common types of automated actions: automated remediation,
auto-scaling and resource management.
A. Automated Remediation
Problem Definition
Besides continuously monitoring the IT infrastructure, detecting
issues and discovering root causes, remediating issues
with minimal, or even no, human intervention is the path
towards the next generation of fully automated AIOps. Automated
issue remediation (auto-remediation) is taking a series
of actions to resolve issues by leveraging known information,
existing workflows and domain knowledge. Auto-remediation
is a concept already adopted in many IT operation scenarios,
including cloud computing, edge computing, SaaS, etc.
Traditional auto-remediation processes rely on a variety
of well-defined policies and rules to decide which workflows
to use for a given issue, whereas machine learning driven
auto-remediation utilizes machine learning models to
decide the best action workflows to mitigate or resolve the
issue. ML-based auto-remediation is exceptionally useful in
large scale cloud systems or edge-computing systems where
it is impossible to manually create workflows for all issue
categories.
Existing Work
End-to-end auto-remediation solutions usually contain three
main components: anomaly or issue detection, root cause analysis
and a remediation engine [208]. This means successful auto-remediation
solutions rely heavily on the quality of anomaly detection
and root cause analysis, which we have already discussed
in the sections above. In addition, the remediation engine should
be able to learn from the analysis results, make decisions and
execute them.
Knowledge learning.
The knowledge here refers to a variety
of categories. The anomaly detection and root cause analysis
results for a specific issue contribute the majority of the learnable
knowledge [208]. The remediation engine uses this information
to locate and categorize the issue. Besides, the human activity
records (such as tickets and bug fixing logs) of past issues are also
significant for the remediation engine to learn the full picture of how
issues were handled in the past. In Sections VI-A, VI-B and VI-C
we discussed mining knowledge graphs from system
metrics, logs and human-in-the-loop records. A high-quality
knowledge graph that clearly describes the relationships among
system components greatly facilitates this knowledge learning.
Decision making and execution. Levy et al. [209]
proposed Narya, a system to handle failure remediation for
running virtual machines in cloud systems. For a given issue
where a host is predicted to fail, the remediation engine
needs to decide the best action to take from a few
options such as live migration, soft reboot, service healing, etc.
The decision on which action to take is made via A/B testing
and reinforcement learning. By adopting machine learning in
their remediation engine, they observed significant savings in
virtual machine interruptions compared to the previous static
strategies.
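As a simplified illustration of learning which remediation action to take, the sketch below uses an epsilon-greedy multi-armed bandit whose reward is whether the chosen action avoided a VM interruption. This is a minimal sketch under assumed action names and a simulated reward signal, not Narya's implementation.

```python
# Minimal sketch (not Narya's implementation): choose a remediation action with
# an epsilon-greedy multi-armed bandit, where the reward is 1 if the action
# avoided a VM interruption. Action names and reward signal are illustrative.

import random
from collections import defaultdict

ACTIONS = ["live_migration", "soft_reboot", "service_healing", "no_op"]

class RemediationBandit:
    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # times each action was taken
        self.values = defaultdict(float)  # running mean reward per action

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)                  # explore
        return max(ACTIONS, key=lambda a: self.values[a])  # exploit

    def update(self, action: str, reward: float) -> None:
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n

if __name__ == "__main__":
    bandit = RemediationBandit()
    # Simulated feedback loop: pretend live migration works 90% of the time.
    success_rate = {"live_migration": 0.9, "soft_reboot": 0.6,
                    "service_healing": 0.5, "no_op": 0.1}
    for _ in range(1000):
        action = bandit.choose()
        reward = 1.0 if random.random() < success_rate[action] else 0.0
        bandit.update(action, reward)
    print(max(ACTIONS, key=lambda a: bandit.values[a]))  # likely "live_migration"
```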
Future Trends
Auto-remediation research and development is still at a very
early stage. Existing work mainly focuses either on an intermediate
step, such as constructing a causal graph for a given
scenario, or on an end-to-end auto-remediation solution for a very
specific use case such as virtual machine interruptions. Below
are a few topics that can significantly improve the quality of
auto-remediation systems.
System Integration. There is still no unified platform
that can perform all the issue analysis, learn the contextual
knowledge, make decisions and execute the actions.
Learn to generate and update knowledge graphs. The quality
of auto-remediation decision making strongly depends on
domain knowledge. Currently, humans collect most of the
domain knowledge. In the future, it is valuable to explore
approaches that learn and maintain knowledge graphs of the
systems in a more reliable way.
AI driven decision making and execution. Currently, most
decision making and action execution is rule-based or
based on statistical learning. With more powerful AI techniques,
the remediation engine can consume rich information and
make more complex decisions.
B. Auto-scaling
Problem Definition
Cloud native technologies are becoming the de facto
standard for building scalable applications in public or private
clouds, enabling loosely coupled systems that are resilient,
manageable, and observable†. Cloud systems such as GCP
and AWS provide users with on-demand resources including CPU,
storage, memory and databases. Users need to specify a limit
on these resources to provision for the workloads of their
applications. If a service in an application exceeds the limit of
a particular resource, end-users will experience request delays
or timeouts, so system operators will request a larger
limit of this resource to avoid degraded performance. But
if hundreds of services are running, such large limits result
in massive resource wastage. Auto-scaling aims to resolve
this issue without human intervention by enabling dynamic
provisioning of resources to applications based on workload
behavior patterns, to minimize resource wastage without loss
of quality of service (QoS) to end-users.
† https://github.com/cncf/foundation/blob/main/charter.md
Auto-scaling approaches can be categorized into two types:
reactive auto-scaling and proactive (or predictive) auto-scaling.
Reactive auto-scaling monitors the services in an application
and brings them up and down in reaction to changes in
workloads.
Reactive auto-scaling. Reactive auto-scaling is very effective
and supported by most cloud platforms. But it has one
potential disadvantage: it will not scale up resources until
workloads increase, so there is a short period in which
more capacity is not yet available but workloads have already
become higher. Therefore, end-users can experience response
delays in this period. Proactive auto-scaling aims to solve this
problem by predicting future workloads based on historical
data. In this paper, we mainly discuss proactive auto-scaling
algorithms based on machine learning.
Proactive Auto-scaling.
Typically, proactive auto-scaling
involves three steps, i.e., predicting workloads, estimating
capacities and scaling out. Machine learning techniques are
usually applied to predict future workloads and estimate suitable
capacities for the monitored services, and then adjustments
can be made accordingly to avoid degraded performance.
One type of proactive auto-scaling approach applies regression
models (e.g., ARIMA [210], SARIMA [211], MLP,
LSTM [212]). Given the historical metrics of a monitored
service, this type of approach trains a regression
model to learn the workload behavior patterns. For example,
[213] investigated the ARIMA model for workload prediction
and showed that the model improves efficiency in resource
utilization with minimal impact on QoS. [214] applied a time-window
MLP to predict phases in containers with different
types of workloads and proposed a predictive vertical auto-scaling
policy to resize containers. [215] also leveraged neural
networks (especially MLPs) for workload prediction and compared
this approach with traditional machine learning models,
e.g., linear regression and K-nearest neighbors. [216] applied a
bidirectional LSTM to predict the number of HTTP workloads
and showed that the Bi-LSTM works better than LSTM and
ARIMA on the tested use cases. These approaches require
accurate forecasting results to avoid over- or under-allocation
of resources, yet it is hard to develop a robust forecasting-based
approach due to noise and sudden spikes in user requests.
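To illustrate the forecasting-based workflow, the sketch below fits an ARIMA model on historical request rates, forecasts the next window, and sizes capacity with a safety headroom. The per-replica throughput, headroom factor, and synthetic workload are illustrative assumptions rather than values from the cited works.

```python
# Minimal sketch of forecasting-based proactive scaling: fit an ARIMA model on
# historical request rates, forecast the next window, and size capacity with a
# safety headroom. The capacity-per-replica and headroom values are illustrative.

import math
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

REQS_PER_REPLICA = 100.0   # assumed throughput of one replica (req/s)
HEADROOM = 1.2             # provision 20% above the forecast peak

def recommend_replicas(history: np.ndarray, horizon: int = 12) -> int:
    """Forecast the next `horizon` points and return the replica count needed."""
    model = ARIMA(history, order=(2, 1, 2)).fit()
    forecast = model.forecast(steps=horizon)
    peak = float(np.max(forecast))
    return max(1, math.ceil(peak * HEADROOM / REQS_PER_REPLICA))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic daily-seasonal request rate sampled every 5 minutes.
    t = np.arange(2000)
    history = 300 + 150 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 20, t.size)
    print("recommended replicas:", recommend_replicas(history))
```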
The other type is based on reinforcement learning (RL), which
treats auto-scaling as an automatic control problem whose
goal is to learn an optimal auto-scaling policy for the best
resource provisioning action under each observed state. [217]
presents an exhaustive survey of reinforcement learning-based
auto-scaling approaches and compares them based on a set of
proposed taxonomies. This survey is well worth reading for
developers or researchers interested in this direction.
Although RL looks promising for auto-scaling, there are many
issues that need to be resolved. For example, model-based methods
require a perfect model of the environment and the learned
policies cannot adapt to changes in the environment, while
model-free methods have very poor initial performance and
slow convergence, so they introduce high cost if they
are applied in real-world cloud platforms.
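The sketch below shows the basic shape of such an RL formulation using tabular Q-learning. The state discretization, action set, reward shaping, and simulated environment are illustrative assumptions, not taken from any specific cited system.

```python
# Minimal sketch of RL-based auto-scaling with tabular Q-learning. The state
# discretization (CPU utilization bucket), action set, and reward shaping are
# illustrative assumptions.

import random
from collections import defaultdict

ACTIONS = [-1, 0, 1]             # remove a replica, do nothing, add a replica
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def state_of(cpu_util: float) -> int:
    """Discretize average CPU utilization into 10% buckets."""
    return min(9, int(cpu_util * 10))

def reward_of(cpu_util: float, replicas: int) -> float:
    """Penalize SLO risk (high utilization) and waste (many replicas)."""
    slo_penalty = 10.0 if cpu_util > 0.9 else 0.0
    return -slo_penalty - 0.1 * replicas

Q = defaultdict(float)           # (state, action) -> value

def choose_action(state: int) -> int:
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state: int, action: int, reward: float, next_state: int) -> None:
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

if __name__ == "__main__":
    replicas, demand = 3, 2.0    # demand expressed in "replicas worth" of load
    for _ in range(5000):
        cpu = min(1.0, demand / replicas)
        s = state_of(cpu)
        a = choose_action(s)
        replicas = max(1, replicas + a)
        next_cpu = min(1.0, demand / replicas)
        update(s, a, reward_of(next_cpu, replicas), state_of(next_cpu))
    print("replica count reached under the learned policy:", replicas)
```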
C. Resource Management
Problem Definition
Resource management is another important topic in cloud
computing, which includes resource provisioning, allocation
and scheduling, e.g., workload estimation, task scheduling,
energy optimization, etc. Even small provisioning inefficiencies,
such as selecting the wrong resources for a task, can
affect quality of service (QoS) and thus lead to significant
monetary costs. Therefore, the goal of resource management is
to provision the right amount of resources for tasks to improve
QoS, mitigate imbalanced workloads, and avoid service level
agreement violations.
Because multiple tenants share storage and computation
resources on cloud platforms, resource management is
a difficult task that involves dynamically allocating resources
and scheduling tenants' tasks. How to provision resources can
be determined in a reactive manner, e.g., by manually creating
static rules based on domain knowledge. But, similar to auto-scaling,
reactive approaches result in response delays and excessive
overheads. To resolve this issue, ML-based approaches
for resource management have gained much attention recently.
ML-based Resource Management
Many ML-based resource management approaches have
been developed in recent years. Due to space limitations, we
will not discuss them in detail. We recommend that readers
interested in this research topic read the following
review papers: [218], [219], [220], [221], [222]. Most of these
approaches apply ML techniques to forecast future resource
consumption and then perform resource provisioning or scheduling
based on the forecasting results. For instance, [223] uses random
forest and XGBoost to predict VM behaviors, including
maximum deployment sizes and workloads. [224] proposes
a linear regression based approach to predict the resource
utilization of VMs based on their historical data, and then
leverages the prediction results to reduce energy consumption.
[225] applies gradient boosting models for temperature prediction,
based on which a dynamic scheduling algorithm is
developed to minimize the peak temperature of hosts. [226]
proposes an RL-based workload-specific scheduling algorithm
to minimize average task completion time.
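As a simple illustration of the forecast-then-provision pattern, the sketch below predicts each host's near-future CPU utilization with a random forest over lagged samples and places the next task on the host with the lowest predicted load. The feature set and placement rule are illustrative, not reproduced from the cited works.

```python
# Minimal sketch of forecast-then-provision: predict each host's near-future CPU
# utilization with a random forest over lagged features, then place a new task on
# the host with the lowest predicted load. Features and placement rule are
# illustrative.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

LAGS = 6  # use the last 6 utilization samples as features

def make_training_set(series: np.ndarray):
    """Build (lagged window -> next value) pairs from one host's history."""
    X = np.stack([series[i:i + LAGS] for i in range(len(series) - LAGS)])
    y = series[LAGS:]
    return X, y

def pick_host(histories: dict) -> str:
    """Train one model per host and return the host with the lowest predicted CPU."""
    predictions = {}
    for host, series in histories.items():
        X, y = make_training_set(series)
        model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
        predictions[host] = float(model.predict(series[-LAGS:].reshape(1, -1))[0])
    return min(predictions, key=predictions.get)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    hosts = {f"host-{i}": np.clip(0.5 + 0.2 * rng.standard_normal(200), 0, 1)
             for i in range(3)}
    print("schedule next task on:", pick_host(hosts))
```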
The accuracy of the ML model is the key factor that affects
the efficiency of a resource management system. Applying
more sophisticated traditional ML models, or even deep learning
models, to improve prediction accuracy is a promising
research direction. Besides accuracy, the time complexity of
model prediction is another important factor that needs to be
considered. If an ML model is overly complicated, it cannot handle
real-time requests for resource allocation and scheduling. How
to trade off accuracy against time complexity needs to be
explored further.
VIII. FUTURE OF AIOPS
A. Common AI Challenges for AIOps
We have discussed the challenges and future trends in each
task section according to how AI techniques are employed. In
summary, there are some common challenges across different
AIOps tasks.
Data Quality.
For all AIOps tasks there are data quality
issues. Most real-world AIOps data are extremely imbalanced,
because incidents occur only occasionally. In addition,
most real-world AIOps data are very noisy. Significant
effort is needed in data cleaning and pre-processing before
the data can be used to train ML models.
Lack of Labels.
It’s extremely difficult to acquire quality
labels sufficiently. We need a lot of domain experts who are
very familiar with system operations to evaluate incidents,
root-causes and service graphs, in order to provide high-quality
labels. This is extremely time consuming and require specific
expertise, which cannot be handled by general crowd sourcing
approaches like Mechanical Turk.
Non-stationarity and heterogeneity.
Systems are ever-changing,
so AIOps faces a non-stationary problem space, and the
AI models in this domain need mechanisms to deal
with this non-stationary nature. Meanwhile, AIOps data are
heterogeneous, meaning the same telemetry data can exhibit a
variety of underlying behaviors. For example, CPU utilization
patterns can be totally different when the resources are used to
host different applications. Thus, discovering hidden states
and handling heterogeneity are very important for AIOps solutions
to succeed.
Lack of Public Benchmarking.
Even though the AIOps research
community is growing rapidly, there is still a very
limited number of public datasets for researchers to benchmark
and evaluate their results. Operational data are highly sensitive,
so existing research is done either with simulated data or
with enterprise production data that can hardly be shared with
other groups and organizations.
Human-in-the-loop.
Human feedback is very important for
building AIOps solutions. Currently, most human feedback
is collected in an ad-hoc fashion, which is inefficient. There
is a lack of human-in-the-loop studies in the AIOps domain that
automate feedback collection and utilize the feedback to
improve model performance.
B. Opportunities and Future Trends
Our literature review of existing AIOps work shows that
current AIOps research still focuses more on infrastructure and
tooling. We see AI technologies being successfully applied
in incident detection and RCA applications, and some of these
solutions have been adopted by large distributed systems such as
AWS and Alibaba Cloud, while AIOps process standardization
and full automation are still at very early stages. With this
evidence, we can foresee the promising topics of AIOps in
the next few years.
High Quality AIOps Infrastructure and Tooling
Some successful AIOps platforms and tools have been
developed in recent years, but there are still opportunities
where AI can help enhance the efficiency of IT operations.
AI itself is also growing rapidly, and new AI technologies are
being invented and successfully applied in other domains. The
digital transformation trend also brings challenges to traditional IT
operations and DevOps. This creates tremendous needs for
high-quality AI tooling, including monitoring, detection, RCA,
prediction and automation.
AIOps Standardization
While building the infrastructure and tooling, AIOps experts
also gain a better understanding of the full picture of the entire domain.
AIOps modules can be identified and extracted from traditional
processes to form their own standards. With clear goals and
measures, it becomes possible to standardize AIOps systems,
just as has been done in domains like recommendation
systems or NLP. With such standardization, it will be much
easier to experiment with a large variety of AI techniques to
improve AIOps performance.
Human-centric to Machine-centric AIOps
Human-centric AIOps means human processes still play
critical roles in the AIOps ecosystem, and AI modules
help humans make better decisions and execute them. In
machine-centric mode, by contrast, AIOps systems require minimal
human intervention and can operate in a human-free state for most
of their lifetime: they continuously monitor the IT
infrastructure, detect and analyze issues, and find the right
paths to drive fixes. In this stage, engineers focus primarily on
development tasks rather than operations.
IX. CONCLUSION
Digital transformation creates tremendous needs for computing
resources. This trend has boosted strong growth of large scale
IT infrastructure, such as cloud computing, edge computing,
search engines, etc. Since it was proposed by Gartner in 2016, AIOps
has been emerging rapidly and now draws attention from large
enterprises and organizations. As the scale of IT infrastructure
grows to a level where human operation cannot catch up,
AIOps becomes the only promising solution to guarantee high
availability of these gigantic IT infrastructures. AIOps covers
different stages of the software lifecycle, including development,
testing, deployment and maintenance.
Different AI techniques are now applied in AIOps applications,
including anomaly detection, root-cause analysis, failure
prediction, automated actions and resource management.
However, the entire AIOps industry is still at a very early
stage where AI only plays a supporting role to help humans
conduct operation workflows. We foresee the trend shifting
from human-centric operations to AI-centric operations in
the near future. During this shift, the development of AIOps
techniques will also transition from building tools to creating
human-free end-to-end solutions.
In this survey, we observed that most of the current
AIOps outcomes focus on detection and root cause analysis,
while research on automation is still very limited. The
AI techniques used in AIOps are mainly traditional machine
learning and statistical models.
ACKNOWLEDGMENT
We want to thank all participants who took the time to
complete this survey. Their knowledge of and experience with
AI fundamentals were invaluable to our study. We are also
grateful to our colleagues at the Salesforce AI Research Lab
and collaborators from other organizations for their helpful
feedback and support.
APPENDIX A
TERMINOLOGY
DevOps: Modern software development requires not only
higher development quality but also higher operations quality.
DevOps, a set of best practices that combines the development
(Dev) and operations (Ops) processes, was created to
achieve high-quality software development and post-release
management [3].
Application Performance Monitoring (APM): Application
performance monitoring is the practice of tracking key
software application performance using monitoring software
and telemetry data [227]. APM is used to guarantee high
system availability, optimize service performance and improve
user experience. Originally, APM was mostly adopted for
websites, mobile apps and other similar online business applications.
However, with more and more traditional software
transforming to leverage cloud-based, highly distributed systems,
APM is now widely used for a larger variety of software
applications and backends.
Observability: Observability is the ability to measure the
internal states of a system by examining its outputs [228]. A
system is "observable" if its current state can be estimated
using only the information from its outputs. Observability
data include metrics, logs, traces and other system generated
information.
Cloud Intelligence: The artificial intelligence features that
improve cloud applications.
MLOps:
MLOps stands for machine learning operations.
MLOps is the full process life cycle of deploying machine
learning models to production.
Site Reliability Engineering (SRE): The type of engineering
that bridges the gap between software development and
operations.
Cloud Computing: Cloud computing is a technique, and
a business model, that builds highly scalable distributed
computer systems and lends computing resources, e.g. hosts,
platforms and apps, to tenants to generate revenue. There are three
main categories of cloud computing: infrastructure as a service
(IaaS), platform as a service (PaaS) and software as a service
(SaaS).
IT Service Management (ITSM): ITSM refers to all
processes and activities to design, create, deliver, and support
IT services for customers.
IT Operations Management (ITOM): ITOM overlaps
with ITSM, focusing more on the operations side of IT services
and infrastructure.
APPENDIX B
TABLES
TABLE I
TABLE OF POPULAR PUBLIC DATASETS FOR METRICS OBSERVABILITY

Name | Description | Tasks
Azure Public Dataset | These datasets contain a representative subset of first-party Azure virtual machine workloads from a geographical region. | Workload characterization, VM pre-provisioning, Workload prediction
Google Cluster Data | 30 continuous days of information from Google Borg cells. | Workload characterization, Workload prediction
Alibaba Cluster Trace | Cluster traces of real production servers from Alibaba Group. | Workload characterization, Workload prediction
MIT Supercloud Dataset | Combination of high-level data (e.g. Slurm Workload Manager scheduler data) and low-level job-specific time series data. | Workload characterization
Numenta Anomaly Benchmark (realAWSCloudwatch) | AWS server metrics as collected by the AmazonCloudwatch service. Example metrics include CPU Utilization, Network Bytes In, and Disk Read Bytes. | Incident detection
Yahoo S5 (A1) | The A1 benchmark contains real Yahoo! web traffic metrics. | Incident detection
Server Machine Dataset | A 5-week-long dataset collected from a large Internet company containing metrics like CPU load, network usage, memory usage, etc. | Incident detection
KPI Anomaly Detection Dataset | A large-scale real-world KPI anomaly detection dataset, covering various KPI patterns and anomaly patterns. This dataset is collected from five large Internet companies (Sougo, eBay, Baidu, Tencent, and Ali). | Incident detection
TABLE II
TABLE OF POPULAR PUBLIC DATASETS FOR LOG OBSERVABILITY

Dataset | Description | Time-span | Data Size | # Logs | Anomaly Labels | # Anomalies | # Log Templates
Distributed system logs:
HDFS | Hadoop distributed file system log | 38.7 hours | 1.47 GB | 11,175,629 | ✓ | 16,838 (blocks) | 30
HDFS | Hadoop distributed file system log | N.A. | 16.06 GB | 71,118,073 | ✗ | - | -
Hadoop | Hadoop map-reduce job log | N.A. | 48.61 MB | 394,308 | ✓ | - | 298
Spark | Spark job log | N.A. | 2.75 GB | 33,236,604 | ✗ | - | 456
Zookeeper | ZooKeeper service log | 26.7 days | 9.95 MB | 74,380 | ✗ | - | 95
OpenStack | OpenStack infrastructure log | N.A. | 58.61 MB | 207,820 | ✓ | 503 | 51
Supercomputer logs:
BGL | Blue Gene/L supercomputer log | 214.7 days | 708.76 MB | 4,747,963 | ✓ | 348,460 | 619
HPC | High performance cluster log | N.A. | 32 MB | 433,489 | ✗ | - | 104
Thunderbird | Thunderbird supercomputer log | 244 days | 29.6 GB | 211,212,192 | ✓ | 3,248,239 | 4040
Operating system logs:
Windows | Windows event log | 226.7 days | 16.09 GB | 114,608,388 | ✗ | - | 4833
Linux | Linux system log | 263.9 days | 2.25 MB | 25,567 | ✗ | - | 488
Mac | Mac OS log | 7 days | 16.09 MB | 117,283 | ✗ | - | 2214
Mobile system logs:
Android | Android framework log | N.A. | 183.37 MB | 1,555,005 | ✗ | - | 76,923
Health App | Health app log | 10.5 days | 22.44 MB | 253,395 | ✗ | - | 220
Server application logs:
Apache | Apache server error logs | 263.9 days | 4.9 MB | 56,481 | ✗ | - | 44
OpenSSH | OpenSSH server logs | 28.4 days | 70.02 MB | 655,146 | ✗ | - | 62
Standalone software logs:
Proxifier | Proxifier software logs | N.A. | 2.42 MB | 21,329 | ✗ | - | 9
Hardware logs:
Switch | Switch hardware failures | 2 years | - | 29,174,680 | ✓ | 2,204 | -
TABLE III
COMPARISON OF EXISTING LOG ANOMALY DETECTION MODELS

Reference | Learning Setting | Type of Model | Log Representation | Log Tokens | Parsing | Sequence Modeling
[92], [93], [94] | Supervised | Linear Regression, SVM, Decision Tree | handcrafted feature | log template | ✓ | ✗
[84] | Unsupervised | Principal Component Analysis (PCA) | quantitative | log template | ✓ | ✓
[67], [82], [95], [80] | Unsupervised | Clustering and correlation between logs and metrics | sequential, quantitative | log template | ✓ | ✗
[96] | Unsupervised | Mining invariants using singular value decomposition | quantitative, sequential | log template | ✓ | ✗
[97], [98], [99], [68] | Unsupervised | Frequent pattern mining from execution flow and control flow graph mining | quantitative, sequential | log template | ✓ | ✗
[20], [100] | Unsupervised | Rule engine over ensembles and heuristic contrast analysis over anomaly characteristics | sequential (with tf-idf weights) | log template | ✓ | ✗
[101] | Supervised | Autoencoder for log-specific word2vec | semantic (trainable embedding) | log template | ✓ | ✓
[102] | Unsupervised | Autoencoder w/ Isolation Forest | semantic (trainable embedding) | all tokens | ✗ | ✗
[114] | Supervised | Convolutional Neural Network | semantic (trainable embedding) | log template | ✓ | ✓
[108] | Unsupervised | Attention based LSTM | sequential, quantitative, semantic (GloVe embedding) | log template, log parameter | ✓ | ✓
[81] | Unsupervised | Attention based LSTM | quantitative and semantic (GloVe embedding) | log template | ✓ | ✓
[111] | Supervised | Attention based LSTM | semantic (fastText embedding with tf-idf weights) | log template | ✓ | ✓
[104] | Semi-supervised | Attention based GRU with clustering | semantic (fastText embedding with tf-idf weights) | log template | ✓ | ✓
[112] | Unsupervised | Attention based Bi-LSTM | semantic (trainable embedding) | all tokens | ✗ | ✓
[109] | Unsupervised | Bi-LSTM | semantic (token embedding from BERT, GPT, XLM) | all tokens | ✗ | ✓
[113] | Unsupervised | Attention based Bi-LSTM | semantic (BERT token embedding) | log template | ✓ | ✓
[110] | Semi-supervised | LSTM, trained with supervision from source systems | semantic (GloVe embedding) | log template | ✓ | ✓
[18] | Unsupervised | LSTM with domain adversarial training | semantic (GloVe embedding) | all tokens | ✗ | ✓
[118], [18] | Unsupervised | LSTM with Deep Support Vector Data Description | semantic (trainable embedding) | log template | ✓ | ✓
[115] | Supervised | Graph Neural Network | semantic (BERT token embedding) | log template | ✓ | ✓
[116] | Semi-supervised | Graph Neural Network | semantic (BERT token embedding) | log template | ✓ | ✓
[103], [229], [230], [231] | Unsupervised | Self-Attention Transformer | semantic (trainable embedding) | all tokens | ✗ | ✓
[78] | Supervised | Self-Attention Transformer | semantic (trainable embedding) | all tokens | ✗ | ✓
[117] | Supervised | Hierarchical Transformer | semantic (trainable GloVe embedding) | log template, log parameter | ✓ | ✓
[104], [105] | Unsupervised | BERT Language Model | semantic (BERT token embedding) | all tokens | ✗ | ✓
[21] | Unsupervised | Unified BERT on various log analysis tasks | semantic (BERT token embedding) | all tokens | ✗ | ✓
[232] | Unsupervised | Contrastive Adversarial model | semantic (BERT and VAE based embedding) and quantitative | log template | ✓ | ✓
[106], [107], [233] | Unsupervised | LSTM, Transformer based GAN (Generative Adversarial) | semantic (trainable embedding) | log template | ✓ | ✓

Log Tokens refers to the tokens from the log line used in the log representations. Parsing and Sequence Modeling respectively indicate whether these models need log parsing and whether they support modeling log sequences.
TABLE IV
COMPARISON OF EXISTING METRIC ANOMALY DETECTION MODELS

Reference | Label Accessibility | Machine Learning Model | Dimensionality | Infrastructure | Streaming Updates
[31] | Supervised | Tree | Univariate | ✗ | ✓ (Retraining)
[41] | Active | - | Univariate | ✓ | ✓ (Retraining)
[42] | Unsupervised | Tree | Multivariate | ✗ | ✓
[43] | Unsupervised | Statistical | Univariate | ✗ | ✓
[51] | Unsupervised | Statistical | Univariate | ✗ | ✗
[37] | Semi-supervised | Tree | Univariate | ✗ | ✓
[36] | Unsupervised, Semi-supervised | Deep Learning | Univariate | ✗ | ✗
[52] | Unsupervised | Deep Learning | Univariate | ✓ | ✗
[40] | Domain Adaptation, Active | Tree | Univariate | ✗ | ✗
[46] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[49] | Unsupervised | Deep Learning | Univariate | ✗ | ✗
[45] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[32] | Supervised | Deep Learning | Univariate | ✓ | ✓ (Retraining)
[47] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[48] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[50] | Unsupervised | Deep Learning | Multivariate | ✗ | ✗
[38] | Semi-supervised, Active | Deep Learning | Multivariate | ✓ | ✓ (Retraining)
TABLE V
COMPARISON OF EXISTING TRACE AND MULTIMODAL ANOMALY DETECTION AND RCA MODELS

Reference | Topic | Deep Learning Adoption | Method
[124] | Trace RCA | ✗ | Clustering
[121] | Trace RCA | ✗ | Heuristic
[234] | Trace RCA | ✗ | Multi-input Differential Summarization
[197] | Trace RCA | ✗ | Random forest, k-NN
[122] | Trace RCA | ✗ | Heuristic
[235] | Trace Anomaly Detection | ✗ | Graph model
[198] | Multimodal Anomaly Detection | ✓ | Deep Bayesian Networks
[236] | Trace Representation | ✓ | Tree-based RNN
[196] | Trace Anomaly Detection | ✗ | Heuristic
[120] | Multimodal Anomaly Detection | ✓ | GGNN and SVDD
TABLE VI
COMPARISON OF SEVERAL EXISTING METRIC RCA APPROACHES

Reference | Metric or Graph Analysis | Root Cause Score
[147] | Change points | Chronological order
[146] | Change points | Chronological order
[148] | Two-sample test | Correlation
[149] | Call graphs | Cluster similarity
[150] | Service graph | PageRank
[151] | Service graph | Graph similarity
[152] | Service graph | Hierarchical HMM
[153] | PC algorithm | Random walk
[154] | ITOA-PI | PageRank
[155] | Service graph and PC | Causal inference
[156] | PC algorithm | Random walk
[157] | Service graph and PC | Causal inference
[158] | PC algorithm | Random walk
[159] | PC algorithm | Random walk
[237] | Service graph | Causal inference
[168] | Service graph | Contribution-based
REFERENCES
[1] T.
Olavsrud,
“How
to
choose
your
cloud
service
provider,”
2012.
[Online].
Available:
https://www2.cio.com.au/article/416752/
how
choose
your
cloud
service
provider/
[2] “Summary
of
the
amazon
s3
service
disruption
in
the
northern
virginia (us-east-1) region,” 2021. [Online]. Available: https://aws.
amazon.com/message/41926/
[3] S. Gunja, “What is devops? unpacking the purpose and importance
of
an
it
cultural
revolution,”
2021.
[Online].
Available:
https:
//www.dynatrace.com/news/blog/what-is-devops/
[4] Gartner,
“Aiops
(artificial
intelligence
for
it
operations).”
[On-
line].
Available:
https://www.gartner.com/en/information-technology/
glossary/aiops-artificial-intelligence-operations
[5] S. Siddique, “The road to enterprise artificial intelligence: A case
studies driven exploration,” Ph.D. dissertation, 05 2018.
[6] N. Sabharwal,
Hands-on AIOps
.
Springer, 2022.
[7] Y. Dang, Q. Lin, and P. Huang, “Aiops: Real-world challenges and re-
search innovations,” in
2019 IEEE/ACM 41st International Conference
on Software Engineering: Companion Proceedings (ICSE-Companion)
,
2019, pp. 4–5.
[8] L. Rijal, R. Colomo-Palacios, and M. S´anchez-Gord´on, “Aiops: A
multivocal literature review,”
Artificial Intelligence for Cloud and Edge
Computing
, pp. 31–50, 2022.
[9] R. Chalapathy and S. Chawla, “Deep learning for anomaly detection:
A survey,”
arXiv preprint arXiv:1901.03407
, 2019.
[10] L. Akoglu, H. Tong, and D. Koutra, “Graph based anomaly detection
and description: a survey,”
Data mining and knowledge discovery
,
vol. 29, no. 3, pp. 626–688, 2015.
[11] J. Soldani and A. Brogi, “Anomaly detection and failure root cause
analysis in (micro)service-based cloud applications: A survey,” 2021.
[Online]. Available: https://arxiv.org/abs/2105.12378
[12] V.
Davidovski,
“Exponential
innovation
through
digital
transfor-
mation,”
in
Proceedings
of
the
3rd
International
Conference
on
Applications in Information Technology
, ser. ICAIT’2018.
New York,
NY,
USA:
Association
for
Computing
Machinery,
2018,
p.
3–5.
[Online]. Available: https://doi.org/10.1145/3274856.3274858
[13] D. S. Battina, “Ai and devops in information technology and its future
in the united states,”
INTERNATIONAL JOURNAL OF CREATIVE
RESEARCH THOUGHTS (IJCRT), ISSN
, pp. 2320–2882, 2021.
[14] A. B. Yoo, M. A. Jette, and M. Grondona, “Slurm: Simple linux utility
for resource management,” in
Workshop on job scheduling strategies
for parallel processing
.
Springer, 2003, pp. 44–60.
[15] J.
Zhaoxue,
L.
Tong,
Z.
Zhenguo,
G.
Jingguo,
Y.
Junling,
and
L. Liangxiong, “A survey on log research of aiops: Methods and
trends,”
Mob. Netw. Appl.
, vol. 26, no. 6, p. 2353–2364, dec 2021.
[Online]. Available: https://doi.org/10.1007/s11036-021-01832-3
[16] S.
He,
P.
He,
Z.
Chen,
T.
Yang,
Y.
Su,
and
M.
R.
Lyu,
“A survey on automated log analysis for reliability engineering,”
ACM Comput. Surv.
, vol. 54, no. 6, jul 2021. [Online]. Available:
https://doi.org/10.1145/3460345
[17] P. Notaro, J. Cardoso, and M. Gerndt, “A survey of aiops methods
for failure management,”
ACM Trans. Intell. Syst. Technol.
, vol. 12,
no. 6, nov 2021. [Online]. Available: https://doi.org/10.1145/3483424
[18] X.
Han
and
S.
Yuan,
“Unsupervised
cross-system
log
anomaly
detection via domain adaptation,” in
Proceedings of the 30th ACM
International Conference on Information & Knowledge Management
,
ser. CIKM ’21.
New York, NY, USA: Association for Computing
Machinery, 2021, p. 3068–3072. [Online]. Available: https://doi.org/
10.1145/3459637.3482209
[19] V.-H. Le and H. Zhang, “Log-based anomaly detection with deep
learning: How far are we?” in
Proceedings of the 44th International
Conference on Software Engineering
, ser. ICSE ’22.
New York, NY,
USA: Association for Computing Machinery, 2022, p. 1356–1367.
[Online]. Available: https://doi.org/10.1145/3510003.3510155
[20] N. Zhao, H. Wang, Z. Li, X. Peng, G. Wang, Z. Pan, Y. Wu,
Z. Feng, X. Wen, W. Zhang, K. Sui, and D. Pei, “An empirical
investigation of practical log anomaly detection for online service
systems,” in
Proceedings of the 29th ACM Joint Meeting on European
Software Engineering Conference and Symposium on the Foundations
of Software Engineering
, ser. ESEC/FSE 2021.
New York, NY, USA:
Association for Computing Machinery, 2021, p. 1404–1415. [Online].
Available: https://doi.org/10.1145/3468264.3473933
[21] Y.
Zhu,
W.
Meng,
Y.
Liu,
S.
Zhang,
T.
Han,
S.
Tao,
and
D. Pei, “Unilog: Deploy one model and specialize it for all log
analysis tasks,”
CoRR
, vol. abs/2112.03159, 2021. [Online]. Available:
https://arxiv.org/abs/2112.03159
[22] J. Soldani and A. Brogi, “Anomaly detection and failure root cause
analysis
in
(micro)
service-based
cloud
applications:
A
survey,”
ACM Comput. Surv.
, vol. 55, no. 3, feb 2022. [Online]. Available:
https://doi.org/10.1145/3501297
[23] L. Korzeniowski and K. Goczyla, “Landscape of automated log anal-
ysis: A systematic literature review and mapping study,”
IEEE Access
,
vol. 10, pp. 21 892–21 913, 2022.
[24] M. Sheldon and G. V. B. Weissman, “Retrace: Collecting execution
trace with virtual machine deterministic replay,” in
Proceedings of the
Third Annual Workshop on Modeling, Benchmarking and Simulation
(MoBS 2007)
.
Citeseer, 2007.
[25] R. Fonseca, G. Porter, R. H. Katz, and S. Shenker, “
{
X-Trace
}
: A
pervasive network tracing framework,” in
4th USENIX Symposium on
Networked Systems Design & Implementation (NSDI 07)
, 2007.
[26] J. Zhou, Z. Chen, J. Wang, Z. Zheng, and M. R. Lyu, “Trace bench:
An open data set for trace-oriented monitoring,” in
2014 IEEE 6th
International Conference on Cloud Computing Technology and Science
.
IEEE, 2014, pp. 519–526.
[27] S. Zhang, C. Zhao, Y. Sui, Y. Su, Y. Sun, Y. Zhang, D. Pei, and
Y. Wang, “Robust KPI anomaly detection for large-scale software
services with partial labels,” in
32nd IEEE International Symposium
on Software Reliability Engineering, ISSRE 2021, Wuhan, China,
October 25-28, 2021
, Z. Jin, X. Li, J. Xiang, L. Mariani, T. Liu,
X. Yu, and N. Ivaki, Eds.
IEEE, 2021, pp. 103–114. [Online].
Available: https://doi.org/10.1109/ISSRE52982.2021.00023
[28] M. Braei and S. Wagner, “Anomaly detection in univariate time-series:
A survey on the state-of-the-art,”
ArXiv
, vol. abs/2004.00433, 2020.
[29] A. Bl´azquez-Garc´
ıa, A. Conde, U. Mori, and J. A. Lozano, “A review
on outlier/anomaly detection in time series data,”
ACM Computing
Surveys (CSUR)
, vol. 54, no. 3, pp. 1–33, 2021.
[30] K. Choi, J. Yi, C. Park, and S. Yoon, “Deep learning for anomaly
detection in time-series data: review, analysis, and guidelines,”
IEEE
Access
, 2021.
[31] D. Liu, Y. Zhao, H. Xu, Y. Sun, D. Pei, J. Luo, X. Jing, and
M. Feng, “Opprentice: Towards practical and automatic anomaly de-
tection through machine learning,” in
Proceedings of the 2015 internet
measurement conference
, 2015, pp. 211–224.
[32] J. Gao, X. Song, Q. Wen, P. Wang, L. Sun, and H. Xu, “Robusttad:
Robust time series anomaly detection via decomposition and convolu-
tional neural networks,”
arXiv preprint arXiv:2002.09545
, 2020.
[33] S. Han, X. Hu, H. Huang, M. Jiang, and Y. Zhao, “ADBench:
Anomaly detection benchmark,” in
Thirty-sixth Conference on Neural
Information Processing Systems Datasets and Benchmarks Track
, 2022.
[Online]. Available: https://openreview.net/forum?id=foA
SFQ9zo0
[34] Z. Li, N. Zhao, S. Zhang, Y. Sun, P. Chen, X. Wen, M. Ma, and
D. Pei, “Constructing large-scale real-world benchmark datasets for
aiops,”
arXiv preprint arXiv:2208.03938
, 2022.
[35] R. Wu and E. J. Keogh, “Current time series anomaly detection
benchmarks are flawed and are creating the illusion of progress,”
CoRR
, vol. abs/2009.13807, 2020. [Online]. Available: https://arxiv.
org/abs/2009.13807
[36] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei,
Y. Feng
et al.
, “Unsupervised anomaly detection via variational auto-
encoder for seasonal kpis in web applications,” in
Proceedings of the
2018 world wide web conference
, 2018, pp. 187–196.
[37] J. Bu, Y. Liu, S. Zhang, W. Meng, Q. Liu, X. Zhu, and D. Pei, “Rapid
deployment of anomaly detection models for large number of emerging
kpi streams,” in
2018 IEEE 37th International Performance Computing
and Communications Conference (IPCCC)
.
IEEE, 2018, pp. 1–8.
[38] T. Huang, P. Chen, and R. Li, “A semi-supervised vae based active
anomaly detection framework in multivariate time series for online
systems,” in
Proceedings of the ACM Web Conference 2022
, 2022, pp.
1797–1806.
[39] X.-L. Li and B. Liu, “Learning from positive and unlabeled examples
with different data distributions,” in
European conference on machine
learning
.
Springer, 2005, pp. 218–229.
[40] X. Zhang, J. Kim, Q. Lin, K. Lim, S. O. Kanaujia, Y. Xu, K. Jamieson,
A. Albarghouthi, S. Qin, M. J. Freedman
et al.
, “Cross-dataset time
series anomaly detection for cloud systems,” in
2019 USENIX Annual
Technical Conference (USENIX ATC 19)
, 2019, pp. 1063–1076.
[41] N. Laptev, S. Amizadeh, and I. Flint, “Generic and scalable framework
for automated time-series anomaly detection,” in
Proceedings of the
21th ACM SIGKDD international conference on knowledge discovery
and data mining
, 2015, pp. 1939–1947.
[42] S. Guha, N. Mishra, G. Roy, and O. Schrijvers, “Robust random cut
forest based anomaly detection on streams,” in
International conference
on machine learning
.
PMLR, 2016, pp. 2712–2721.
[43] S. Ahmad, A. Lavin, S. Purdy, and Z. Agha, “Unsupervised real-time
anomaly detection for streaming data,”
Neurocomputing
, vol. 262, pp.
134–147, 2017.
[44] Z.
Li,
Y.
Zhao,
J.
Han,
Y.
Su,
R.
Jiao,
X.
Wen,
and
D.
Pei,
“Multivariate time series anomaly detection and interpretation using
hierarchical inter-metric and temporal embedding,” in
KDD ’21: The
27th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining, Virtual Event, Singapore, August 14-18, 2021
, F. Zhu, B. C.
Ooi, and C. Miao, Eds.
ACM, 2021, pp. 3220–3230. [Online].
Available: https://doi.org/10.1145/3447548.3467075
[45] J. Audibert, P. Michiardi, F. Guyard, S. Marti, and M. A. Zuluaga,
“Usad: Unsupervised anomaly detection on multivariate time series,”
in
Proceedings of the 26th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining
, 2020, pp. 3395–3404.
[46] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei, “Robust anomaly
detection for multivariate time series through stochastic recurrent neural
network,” in
Proceedings of the 25th ACM SIGKDD international
conference on knowledge discovery & data mining
, 2019, pp. 2828–
2837.
[47] Z. Li, Y. Zhao, J. Han, Y. Su, R. Jiao, X. Wen, and D. Pei, “Multivariate
time series anomaly detection and interpretation using hierarchical
inter-metric and temporal embedding,” in
Proceedings of the 27th ACM
SIGKDD Conference on Knowledge Discovery & Data Mining
, 2021,
pp. 3220–3230.
[48] W. Yang, K. Zhang, and S. C. Hoi, “Causality-based multivariate time
series anomaly detection,”
arXiv preprint arXiv:2206.15033
, 2022.
[49] F. Ayed, L. Stella, T. Januschowski, and J. Gasthaus, “Anomaly
detection at scale: The case for deep distributional time series models,”
in
International Conference on Service-Oriented Computing
.
Springer,
2020, pp. 97–109.
[50] S. Rabanser, T. Januschowski, K. Rasul, O. Borchert, R. Kurle,
J. Gasthaus, M. Bohlke-Schneider, N. Papernot, and V. Flunkert, “In-
trinsic anomaly detection for multi-variate time series,”
arXiv preprint
arXiv:2206.14342
, 2022.
[51] J. Hochenbaum, O. S. Vallis, and A. Kejariwal, “Automatic anomaly
detection
in
the
cloud
via
statistical
learning,”
arXiv
preprint
arXiv:1704.07706
, 2017.
[52] H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang,
J. Tong, and Q. Zhang, “Time-series anomaly detection service at
microsoft,” in
Proceedings of the 25th ACM SIGKDD international
conference on knowledge discovery & data mining
, 2019, pp. 3009–
3017.
[53] J. Soldani and A. Brogi, “Anomaly detection and failure root cause
analysis in (micro) service-based cloud applications: A survey,”
ACM
Comput.
Surv.
,
vol.
55,
no.
3,
pp.
59:1–59:39,
2023.
[Online].
Available: https://doi.org/10.1145/3501297
[54] J. Bu, Y. Liu, S. Zhang, W. Meng, Q. Liu, X. Zhu, and D. Pei,
“Rapid deployment of anomaly detection models for large number
of emerging KPI streams,” in
37th IEEE International Performance
Computing and Communications Conference, IPCCC 2018, Orlando,
FL, USA, November 17-19, 2018
.
IEEE, 2018, pp. 1–8. [Online].
Available: https://doi.org/10.1109/PCCC.2018.8711315
[55] Z.
Z.
Darban,
G.
I.
Webb,
S.
Pan,
C.
C.
Aggarwal,
and
M.
Salehi,
“Deep
learning
for
time
series
anomaly
detection:
A survey,”
CoRR
, vol. abs/2211.05244, 2022. [Online]. Available:
https://doi.org/10.48550/arXiv.2211.05244
[56] B. Huang, K. Zhang, M. Gong, and C. Glymour, “Causal discovery and
forecasting in nonstationary environments with state-space models,”
in
Proceedings of the 36th International Conference on Machine
Learning, ICML 2019, 9-15 June 2019, Long Beach, California,
USA
, ser. Proceedings of Machine Learning Research, K. Chaudhuri
and R. Salakhutdinov, Eds., vol. 97.
PMLR, 2019, pp. 2901–2910.
[Online]. Available: http://proceedings.mlr.press/v97/huang19g.html
[57] Q. Pham, C. Liu, D. Sahoo, and S. C. H. Hoi, “Learning fast and
slow for online time series forecasting,”
CoRR
, vol. abs/2202.11672,
2022. [Online]. Available: https://arxiv.org/abs/2202.11672
[58] K. Lai, D. Zha, J. Xu, Y. Zhao, G. Wang, and X. Hu, “Revisiting time
series outlier detection: Definitions and benchmarks,” in
Proceedings
of the Neural Information Processing Systems Track on Datasets and
Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December
2021, virtual
, J. Vanschoren and S. Yeung, Eds., 2021. [Online].
Available:
https://datasets-benchmarks-proceedings.neurips.cc/paper/
2021/hash/ec5decca5ed3d6b8079e2e7e7bacc9f2-Abstract-round1.html
[59] R. Wu and E. Keogh, “Current time series anomaly detection bench-
marks are flawed and are creating the illusion of progress,”
IEEE
Transactions on Knowledge and Data Engineering
, 2021.
[60] X.
Wu,
L.
Xiao,
Y.
Sun,
J.
Zhang,
T.
Ma,
and
L.
He,
“A
survey of human-in-the-loop for machine learning,”
Future Gener.
Comput. Syst.
, vol. 135, pp. 364–381, 2022. [Online]. Available:
https://doi.org/10.1016/j.future.2022.05.014
[61] D. Sahoo, Q. Pham, J. Lu, and S. C. H. Hoi, “Online deep learning:
Learning deep neural networks on the fly,”
CoRR
, vol. abs/1711.03705,
2017. [Online]. Available: http://arxiv.org/abs/1711.03705
[62] Z. Chen, J. Liu, W. Gu, Y. Su, and M. R. Lyu, “Experience report:
Deep learning-based system log analysis for anomaly detection,”
2021. [Online]. Available: https://arxiv.org/abs/2107.05908
[63] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing
approach with fixed depth tree,” in
2017 IEEE International Conference
on Web Services (ICWS)
, 2017, pp. 33–40.
[64] A. A. Makanju, A. N. Zincir-Heywood, and E. E. Milios, “Clustering
event logs using iterative partitioning,” in
Proceedings of the 15th
ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining
, ser. KDD ’09.
New York, NY, USA: Association
for Computing Machinery, 2009, p. 1255–1264. [Online]. Available:
https://doi.org/10.1145/1557019.1557154
[65] Z. M. Jiang, A. E. Hassan, P. Flora, and G. Hamann, “Abstracting
execution logs to execution events for enterprise applications (short
paper),” in
2008 The Eighth International Conference on Quality
Software
, 2008, pp. 181–186.
[66] M. Du and F. Li, “Spell: Streaming parsing of system event logs,” in
2016 IEEE 16th International Conference on Data Mining (ICDM)
,
2016, pp. 859–864.
[67] R. Vaarandi and M. Pihelgas, “Logcluster - A data clustering and
pattern
mining
algorithm
for
event
logs,”
in
11th
International
Conference
on
Network
and
Service
Management,
CNSM
2015,
Barcelona, Spain, November 9-13, 2015
, M. Tortonesi, J. Sch¨onw¨alder,
E.
R.
M.
Madeira,
C.
Schmitt,
and
J.
Serrat,
Eds.
IEEE
Computer
Society,
2015,
pp.
1–7.
[Online].
Available:
https:
//doi.org/10.1109/CNSM.2015.7367331
[68] Q. Fu, J.-G. Lou, Y. Wang, and J. Li, “Execution anomaly detection in
distributed systems through unstructured log analysis,” in
2009 Ninth
IEEE International Conference on Data Mining
, 2009, pp. 149–158.
[69] L.
Tang,
T.
Li,
and
C.-S.
Perng,
“Logsig:
Generating
system
events
from
raw
textual
logs,”
in
Proceedings
of
the
20th
ACM
International
Conference
on
Information
and
Knowledge
Management
, ser. CIKM ’11.
New York, NY, USA: Association
for Computing Machinery, 2011, p. 785–794. [Online]. Available:
https://doi.org/10.1145/2063576.2063690
[70] M. Mizutani, “Incremental mining of system log format,” in
2013 IEEE
International Conference on Services Computing
, 2013, pp. 595–602.
[71] K. Shima, “Length matters: Clustering system log messages using
length
of
words,”
CoRR
,
vol.
abs/1611.03213,
2016.
[Online].
Available: http://arxiv.org/abs/1611.03213
[72] H. Hamooni, B. Debnath, J. Xu, H. Zhang, G. Jiang, and A. Mueen,
“Logmine: Fast pattern recognition for log analytics,” in
Proceedings
of the 25th ACM International on Conference on Information and
Knowledge Management
, ser. CIKM ’16.
New York, NY, USA:
Association for Computing Machinery, 2016, p. 1573–1582. [Online].
Available: https://doi.org/10.1145/2983323.2983358
[73] R. Vaarandi, “A data clustering algorithm for mining patterns from
event logs,” in
Proceedings of the 3rd IEEE Workshop on IP Operations
& Management (IPOM 2003) (IEEE Cat. No.03EX764)
, 2003, pp. 119–
126.
[74] M. Nagappan and M. A. Vouk, “Abstracting log lines to log event
types for mining software system logs,” in
2010 7th IEEE Working
Conference on Mining Software Repositories (MSR 2010)
, 2010, pp.
114–117.
[75] S.
Messaoudi,
A.
Panichella,
D.
Bianculli,
L.
Briand,
and
R. Sasnauskas, “A search-based approach for accurate identification
of log message formats,” in
Proceedings of the 26th Conference
on Program Comprehension
, ser. ICPC ’18.
New York, NY, USA:
Association for Computing Machinery, 2018, p. 167–177. [Online].
Available: https://doi.org/10.1145/3196321.3196340
[76] S. Nedelkoski, J. Bogatinovski, A. Acker, J. Cardoso, and O. Kao,
“Self-supervised log parsing,” in
Machine Learning and Knowledge
Discovery
in
Databases:
Applied
Data
Science
Track
,
Y.
Dong,
D. Mladeni´c, and C. Saunders, Eds.
Cham: Springer International
Publishing, 2021, pp. 122–138.
[77] Y. Liu, X. Zhang, S. He, H. Zhang, L. Li, Y. Kang, Y. Xu, M. Ma,
Q. Lin, Y. Dang, S. Rajmohan, and D. Zhang, “Uniparser: A unified
log parser for heterogeneous log data,” in
Proceedings of the ACM
Web
Conference
2022
,
ser.
WWW
’22.
New
York,
NY,
USA:
Association for Computing Machinery, 2022, p. 1893–1901. [Online].
Available: https://doi.org/10.1145/3485447.3511993
[78] V.-H. Le and H. Zhang, “Log-based anomaly detection without log
parsing,” in
2021 36th IEEE/ACM International Conference on Auto-
mated Software Engineering (ASE)
, 2021, pp. 492–504.
[79] Y. Lee, J. Kim, and P. Kang, “Lanobert : System log anomaly detection
based on BERT masked language model,”
CoRR
, vol. abs/2111.09564,
2021. [Online]. Available: https://arxiv.org/abs/2111.09564
[80] M. Farshchi, J.-G. Schneider, I. Weber, and J. Grundy, “Experience
report: Anomaly detection of cloud application operations using log
and cloud metric correlation analysis,” in
2015 IEEE 26th International
Symposium on Software Reliability Engineering (ISSRE)
, 2015, pp. 24–
34.
[81] W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang,
S. Tao, P. Sun, and R. Zhou, “Loganomaly: Unsupervised detection
of sequential and quantitative anomalies in unstructured logs,” in
Proceedings of the 28th International Joint Conference on Artificial
Intelligence
, ser. IJCAI’19.
AAAI Press, 2019, p. 4739–4745.
[82] Q.
Lin,
H.
Zhang,
J.-G.
Lou,
Y.
Zhang,
and
X.
Chen,
“Log
clustering based problem identification for online service systems,”
in
Proceedings of the 38th International Conference on Software
Engineering
Companion
,
ser.
ICSE
’16.
New
York,
NY,
USA:
Association for Computing Machinery, 2016, p. 102–111. [Online].
Available: https://doi.org/10.1145/2889160.2889232
[83] R. Yang, D. Qu, Y. Qian, Y. Dai, and S. Zhu, “An online log template
extraction method based on hierarchical clustering,”
EURASIP J.
Wirel. Commun. Netw.
, vol. 2019, p. 135, 2019. [Online]. Available:
https://doi.org/10.1186/s13638-019-1430-4
[84] W. Xu, L. Huang, A. Fox, D. A. Patterson, and M. I. Jordan, “Detecting
large-scale system problems by mining console logs,” in
Proceedings
of the 22nd ACM Symposium on Operating Systems Principles 2009,
SOSP 2009, Big Sky, Montana, USA, October 11-14, 2009
, J. N.
Matthews and T. E. Anderson, Eds.
ACM, 2009, pp. 117–132.
[Online]. Available: https://doi.org/10.1145/1629575.1629587
[85] B. Joshi, U. Bista, and M. Ghimire, "Intelligent clustering scheme for log data streams," in Computational Linguistics and Intelligent Text Processing, A. Gelbukh, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 454–465.
[86] J. Liu, J. Zhu, S. He, P. He, Z. Zheng, and M. R. Lyu, "Logzip: Extracting hidden structures via iterative clustering for log compression," in Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '19. IEEE Press, 2019, p. 863–873. [Online]. Available: https://doi.org/10.1109/ASE.2019.00085
[87] M. Wurzenberger, F. Skopik, M. Landauer, P. Greitbauer, R. Fiedler, and W. Kastner, "Incremental clustering for semi-supervised anomaly detection applied on log data," in Proceedings of the 12th International Conference on Availability, Reliability and Security, ser. ARES '17. New York, NY, USA: Association for Computing Machinery, 2017. [Online]. Available: https://doi.org/10.1145/3098954.3098973
[88] D. Gunter, B. L. Tierney, A. Brown, M. Swany, J. Bresnahan, and J. M. Schopf, "Log summarization and anomaly detection for troubleshooting distributed systems," in 2007 8th IEEE/ACM International Conference on Grid Computing, 2007, pp. 226–234.
[89] W. Meng, F. Zaiter, Y. Huang, Y. Liu, S. Zhang, Y. Zhang, Y. Zhu, T. Zhang, E. Wang, Z. Ren, F. Wang, S. Tao, and D. Pei, "Summarizing unstructured logs in online services," CoRR, vol. abs/2012.08938, 2020. [Online]. Available: https://arxiv.org/abs/2012.08938
[90] R. Dijkman and A. Wilbik, "Linguistic summarization of event logs – a practical approach," Information Systems, vol. 67, pp. 114–125, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0306437916303192
[91] S. Locke, H. Li, T.-H. P. Chen, W. Shang, and W. Liu, "Logassist: Assisting log analysis through log summarization," IEEE Transactions on Software Engineering, pp. 1–1, 2021.
[92] P. Bodik, M. Goldszmidt, A. Fox, D. B. Woodard, and H. Andersen, "Fingerprinting the datacenter: Automated classification of performance crises," in Proceedings of the 5th European Conference on Computer Systems, ser. EuroSys '10. New York, NY, USA: Association for Computing Machinery, 2010, p. 111–124. [Online]. Available: https://doi.org/10.1145/1755913.1755926
[93] Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo, "Failure prediction in ibm bluegene/l event logs," in Seventh IEEE International Conference on Data Mining (ICDM 2007), 2007, pp. 583–588.
[94] M. Chen, A. Zheng, J. Lloyd, M. Jordan, and E. Brewer, "Failure diagnosis using decision trees," in International Conference on Autonomic Computing, 2004. Proceedings., 2004, pp. 36–43.
[95] S. He, Q. Lin, J.-G. Lou, H. Zhang, M. R. Lyu, and D. Zhang, "Identifying impactful service system problems via log analysis," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2018. New York, NY, USA: Association for Computing Machinery, 2018, p. 60–70. [Online]. Available: https://doi.org/10.1145/3236024.3236083
[96] J. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li, "Mining invariants from console logs for system problem detection," in 2010 USENIX Annual Technical Conference, Boston, MA, USA, June 23-25, 2010, P. Barham and T. Roscoe, Eds. USENIX Association, 2010. [Online]. Available: https://www.usenix.org/conference/usenix-atc-10/mining-invariants-console-logs-system-problem-detection
[97] A. Nandi, A. Mandal, S. Atreja, G. B. Dasgupta, and S. Bhattacharya, "Anomaly detection using program control flow graph mining from execution logs," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '16. New York, NY, USA: Association for Computing Machinery, 2016, p. 215–224. [Online]. Available: https://doi.org/10.1145/2939672.2939712
[98] T. Jia, P. Chen, L. Yang, Y. Li, F. Meng, and J. Xu, "An approach for anomaly diagnosis based on hybrid graph model with logs for distributed services," in 2017 IEEE International Conference on Web Services (ICWS), 2017, pp. 25–32.
[99] T. Jia, L. Yang, P. Chen, Y. Li, F. Meng, and J. Xu, "Logsed: Anomaly diagnosis through mining time-weighted control flow graph in logs," in 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), 2017, pp. 447–455.
[100] X. Zhang, Y. Xu, S. Qin, S. He, B. Qiao, Z. Li, H. Zhang, X. Li, Y. Dang, Q. Lin, M. Chintalapati, S. Rajmohan, and D. Zhang, "Onion: Identifying incident-indicating logs for cloud systems," in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2021. New York, NY, USA: Association for Computing Machinery, 2021, p. 1253–1263. [Online]. Available: https://doi.org/10.1145/3468264.3473919
[101] W. Meng, Y. Liu, Y. Huang, S. Zhang, F. Zaiter, B. Chen, and D. Pei, "A semantic-aware representation framework for online log analysis," in 2020 29th International Conference on Computer Communications and Networks (ICCCN), 2020, pp. 1–7.
[102] A. Farzad and T. A. Gulliver, "Unsupervised log message anomaly detection," ICT Express, vol. 6, no. 3, pp. 229–237, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2405959520300643
[103] S. Nedelkoski, J. Bogatinovski, A. Acker, J. Cardoso, and O. Kao, "Self-attentive classification-based anomaly detection in unstructured logs," in 2020 IEEE International Conference on Data Mining (ICDM), 2020, pp. 1196–1201.
[104] L. Yang, J. Chen, Z. Wang, W. Wang, J. Jiang, X. Dong, and W. Zhang, "Semi-supervised log-based anomaly detection via probabilistic label estimation," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021, pp. 1448–1460.
[105] H. Guo, S. Yuan, and X. Wu, "Logbert: Log anomaly detection via bert," in 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8.
[106] B. Xia, Y. Bai, J. Yin, Y. Li, and J. Xu, "Loggan: A log-level generative adversarial network for anomaly detection using permutation event modeling," Information Systems Frontiers, vol. 23, no. 2, p. 285–298, apr 2021. [Online]. Available: https://doi.org/10.1007/s10796-020-10026-3
[107] Z. Zhao, W. Niu, X. Zhang, R. Zhang, Z. Yu, and C. Huang, "Trine: Syslog anomaly detection with three transformer encoders in one generative adversarial network," Applied Intelligence, vol. 52, no. 8, p. 8810–8819, jun 2022. [Online]. Available: https://doi.org/10.1007/s10489-021-02863-9
[108] M. Du, F. Li, G. Zheng, and V. Srikumar, "Deeplog: Anomaly detection and diagnosis from system logs through deep learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '17. New York, NY, USA: Association for Computing Machinery, 2017, p. 1285–1298. [Online]. Available: https://doi.org/10.1145/3133956.3134015
[109] H. Ott, J. Bogatinovski, A. Acker, S. Nedelkoski, and O. Kao, "Robust and transferable anomaly detection in log data using pre-trained language models," in 2021 IEEE/ACM International Workshop on Cloud Intelligence (CloudIntelligence). Los Alamitos, CA, USA: IEEE Computer Society, may 2021, pp. 19–24. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CloudIntelligence52565.2021.00013
[110] R. Chen, S. Zhang, D. Li, Y. Zhang, F. Guo, W. Meng, D. Pei, Y. Zhang, X. Chen, and Y. Liu, "Logtransfer: Cross-system log anomaly detection for software systems with transfer learning," in 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), 2020, pp. 37–47.
[111] X. Zhang, Y. Xu, Q. Lin, B. Qiao, H. Zhang, Y. Dang, C. Xie, X. Yang, Q. Cheng, Z. Li, J. Chen, X. He, R. Yao, J.-G. Lou, M. Chintalapati, F. Shen, and D. Zhang, "Robust log-based anomaly detection on unstable log data," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2019. New York, NY, USA: Association for Computing Machinery, 2019, p. 807–817. [Online]. Available: https://doi.org/10.1145/3338906.3338931
[112] A. Brown, A. Tuor, B. Hutchinson, and N. Nichols, "Recurrent neural network attention mechanisms for interpretable system log anomaly detection," in Proceedings of the First Workshop on Machine Learning for Computing Systems, ser. MLCS'18. New York, NY, USA: Association for Computing Machinery, 2018. [Online]. Available: https://doi.org/10.1145/3217871.3217872
[113] X. Li, P. Chen, L. Jing, Z. He, and G. Yu, "Swisslog: Robust and unified deep learning based log anomaly detection for diverse faults," in 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), 2020, pp. 92–103.
[114] S. Lu, X. Wei, Y. Li, and L. Wang, "Detecting anomaly in big data system logs using convolutional neural network," in 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress, DASC/PiCom/DataCom/CyberSciTech 2018, Athens, Greece, August 12-15, 2018. IEEE Computer Society, 2018, pp. 151–158. [Online]. Available: https://doi.org/10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.00037
[115] Y. Xie, H. Zhang, and M. A. Babar, "Loggd: Detecting anomalies from system logs by graph neural networks," 2022. [Online]. Available: https://arxiv.org/abs/2209.07869
[116] Y. Wan, Y. Liu, D. Wang, and Y. Wen, "Glad-paw: Graph-based log anomaly detection by position aware weighted graph attention network," in Advances in Knowledge Discovery and Data Mining, K. Karlapalem, H. Cheng, N. Ramakrishnan, R. K. Agrawal, P. K. Reddy, J. Srivastava, and T. Chakraborty, Eds. Cham: Springer International Publishing, 2021, pp. 66–77.
[117] S. Huang, Y. Liu, C. Fung, R. He, Y. Zhao, H. Yang, and Z. Luan, "Hitanomaly: Hierarchical transformers for anomaly detection in system log," IEEE Transactions on Network and Service Management, vol. 17, no. 4, pp. 2064–2076, 2020.
[118] H. Cheng, D. Xu, S. Yuan, and X. Wu, "Fine-grained anomaly detection in sequential data via counterfactual explanations," 2022. [Online]. Available: https://arxiv.org/abs/2210.04145
[119] S. Nedelkoski, J. Cardoso, and O. Kao, "Anomaly detection from system tracing data using multimodal deep learning," in 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), 2019, pp. 179–186.
[120] C. Zhang, X. Peng, C. Sha, K. Zhang, Z. Fu, X. Wu, Q. Lin, and D. Zhang, "Deeptralog: Trace-log combined microservice anomaly detection through graph-based deep learning." Pittsburgh, PA, USA: IEEE, 2022, pp. 623–634.
[121] D. C. Arnold, D. H. Ahn, B. R. De Supinski, G. L. Lee, B. P. Miller, and M. Schulz, "Stack trace analysis for large scale debugging," in 2007 IEEE International Parallel and Distributed Processing Symposium. IEEE, 2007, pp. 1–10.
[122] Z. Cai, W. Li, W. Zhu, L. Liu, and B. Yang, "A real-time trace-level root-cause diagnosis system in alibaba datacenters," IEEE Access, vol. 7, pp. 142 692–142 702, 2019.
[123] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina, "Web graph similarity for anomaly detection," Journal of Internet Services and Applications, vol. 1, no. 1, pp. 19–30, 2010.
[124] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, "Pinpoint: Problem determination in large, dynamic internet services," in Proceedings International Conference on Dependable Systems and Networks. IEEE, 2002, pp. 595–604.
[125] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[126] F. Salfner, M. Lenk, and M. Malek, "A survey of online failure prediction methods," ACM Comput. Surv., vol. 42, no. 3, pp. 10:1–10:42, 2010. [Online]. Available: https://doi.org/10.1145/1670679.1670680
[127] Y. Chen, X. Yang, Q. Lin, H. Zhang, F. Gao, Z. Xu, Y. Dang, D. Zhang, H. Dong, Y. Xu, H. Li, and Y. Kang, "Outage prediction and diagnosis for cloud service systems," in The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, L. Liu, R. W. White, A. Mantrach, F. Silvestri, J. J. McAuley, R. Baeza-Yates, and L. Zia, Eds. ACM, 2019, pp. 2659–2665. [Online]. Available: https://doi.org/10.1145/3308558.3313501
[128] N. Zhao, J. Chen, Z. Wang, X. Peng, G. Wang, Y. Wu, F. Zhou, Z. Feng, X. Nie, W. Zhang, K. Sui, and D. Pei, "Real-time incident prediction for online service systems," in ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, P. Devanbu, M. B. Cohen, and T. Zimmermann, Eds. ACM, 2020, pp. 315–326. [Online]. Available: https://doi.org/10.1145/3368089.3409672
[129] S. Zhang, Y. Liu, W. Meng, Z. Luo, J. Bu, S. Yang, P. Liang, D. Pei, J. Xu, Y. Zhang, Y. Chen, H. Dong, X. Qu, and L. Song, "Prefix: Switch failure prediction in datacenter networks," Proc. ACM Meas. Anal. Comput. Syst., vol. 2, no. 1, pp. 2:1–2:29, 2018. [Online]. Available: https://doi.org/10.1145/3179405
[130] Q. Lin, K. Hsieh, Y. Dang, H. Zhang, K. Sui, Y. Xu, J. Lou, C. Li, Y. Wu, R. Yao, M. Chintalapati, and D. Zhang, "Predicting node failure in cloud service systems," in Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018, G. T. Leavens, A. Garcia, and C. S. Pasareanu, Eds. ACM, 2018, pp. 480–490. [Online]. Available: https://doi.org/10.1145/3236024.3236060
[131] E. Pinheiro, W. Weber, and L. A. Barroso, "Failure trends in a large disk drive population," in 5th USENIX Conference on File and Storage Technologies, FAST 2007, February 13-16, 2007, San Jose, CA, USA, A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau, Eds. USENIX, 2007, pp. 17–28. [Online]. Available: http://www.usenix.org/events/fast07/tech/pinheiro.html
[132] Y. Xu, K. Sui, R. Yao, H. Zhang, Q. Lin, Y. Dang, P. Li, K. Jiang, W. Zhang, J. Lou, M. Chintalapati, and D. Zhang, "Improving service availability of cloud systems by predicting disk error," in 2018 USENIX Annual Technical Conference, USENIX ATC 2018, Boston, MA, USA, July 11-13, 2018, H. S. Gunawi and B. Reed, Eds. USENIX Association, 2018, pp. 481–494. [Online]. Available: https://www.usenix.org/conference/atc18/presentation/xu-yong
[133] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam, "Critical event prediction for proactive management in large-scale computer clusters," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '03. New York, NY, USA: Association for Computing Machinery, 2003, p. 426–435. [Online]. Available: https://doi.org/10.1145/956750.956799
[134] F. Yu, H. Xu, S. Jian, C. Huang, Y. Wang, and Z. Wu, "Dram failure prediction in large-scale data centers," in 2021 IEEE International Conference on Joint Cloud Computing (JCC), 2021, pp. 1–8.
[135] J. Klinkenberg, C. Terboven, S. Lankes, and M. S. Müller, "Data mining-based analysis of hpc center operations," in 2017 IEEE International Conference on Cluster Computing (CLUSTER), 2017, pp. 766–773.
[136] S. Zhang, Y. Liu, W. Meng, Z. Luo, J. Bu, S. Yang, P. Liang, D. Pei, J. Xu, Y. Zhang, Y. Chen, H. Dong, X. Qu, and L. Song, "Prefix: Switch failure prediction in datacenter networks," Proc. ACM Meas. Anal. Comput. Syst., vol. 2, no. 1, apr 2018. [Online]. Available: https://doi.org/10.1145/3179405
[137] B. Russo, G. Succi, and W. Pedrycz, "Mining system logs to learn error predictors: a case study of a telemetry system," Empir. Softw. Eng., vol. 20, no. 4, pp. 879–927, 2015. [Online]. Available: https://doi.org/10.1007/s10664-014-9303-2
[138] I. Fronza, A. Sillitti, G. Succi, M. Terho, and J. Vlasenko, "Failure prediction based on log files using random indexing and support vector machines," J. Syst. Softw., vol. 86, no. 1, p. 2–11, jan 2013. [Online]. Available: https://doi.org/10.1016/j.jss.2012.06.025
[139] F. Salfner and M. Malek, "Using hidden semi-markov models for effective online failure prediction," in 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007), 2007, pp. 161–174.
[140] A. Das, F. Mueller, C. Siegel, and A. Vishnu, "Desh: Deep learning for system health prediction of lead times to failure in hpc," in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC '18. New York, NY, USA: Association for Computing Machinery, 2018, p. 40–51. [Online]. Available: https://doi.org/10.1145/3208040.3208051
[141] J. Gao, H. Wang, and H. Shen, "Task failure prediction in cloud data centers using deep learning," in 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 1111–1116.
[142] Z. Zheng, Z. Lan, B. H. Park, and A. Geist, "System log pre-processing to improve failure prediction," in 2009 IEEE/IFIP International Conference on Dependable Systems and Networks, 2009, pp. 572–577.
[143] Y. Chen, X. Yang, Q. Lin, H. Zhang, F. Gao, Z. Xu, Y. Dang, D. Zhang, H. Dong, Y. Xu, H. Li, and Y. Kang, "Outage prediction and diagnosis for cloud service systems," in The World Wide Web Conference, ser. WWW '19. New York, NY, USA: Association for Computing Machinery, 2019, p. 2659–2665. [Online]. Available: https://doi.org/10.1145/3308558.3313501
[144] Q. Lin, K. Hsieh, Y. Dang, H. Zhang, K. Sui, Y. Xu, J.-G. Lou, C. Li, Y. Wu, R. Yao, M. Chintalapati, and D. Zhang, "Predicting node failure in cloud service systems," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2018. New York, NY, USA: Association for Computing Machinery, 2018, p. 480–490. [Online]. Available: https://doi.org/10.1145/3236024.3236060
[145] X. Zhou, X. Peng, T. Xie, J. Sun, C. Ji, D. Liu, Q. Xiang, and C. He, "Latent error prediction and fault localization for microservice applications by learning from system trace logs," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2019. New York, NY, USA: Association for Computing Machinery, 2019, p. 683–694. [Online]. Available: https://doi.org/10.1145/3338906.3338961
[146] H. Nguyen, Y. Tan, and X. Gu, "Pal: Propagation-aware anomaly localization for cloud hosted distributed applications," ser. SLAML '11. New York, NY, USA: Association for Computing Machinery, 2011. [Online]. Available: https://doi.org/10.1145/2038633.2038634
[147] H. Nguyen, Z. Shen, Y. Tan, and X. Gu, "Fchain: Toward black-box online fault localization for cloud systems," in 2013 IEEE 33rd International Conference on Distributed Computing Systems, 2013, pp. 21–30.
[148] H. Shan, Y. Chen, H. Liu, Y. Zhang, X. Xiao, X. He, M. Li, and W. Ding, "e-diagnosis: Unsupervised and real-time diagnosis of small-window long-tail latency in large-scale microservice platforms," in The World Wide Web Conference, ser. WWW '19. New York, NY, USA: Association for Computing Machinery, 2019, p. 3215–3222. [Online]. Available: https://doi.org/10.1145/3308558.3313653
[149] J. Thalheim, A. Rodrigues, I. E. Akkus, P. Bhatotia, R. Chen, B. Viswanath, L. Jiao, and C. Fetzer, "Sieve: Actionable insights from monitored metrics in distributed systems," in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, ser. Middleware '17. New York, NY, USA: Association for Computing Machinery, 2017, p. 14–27. [Online]. Available: https://doi.org/10.1145/3135974.3135977
[150] L. Wu, J. Tordsson, E. Elmroth, and O. Kao, "Microrca: Root cause localization of performance issues in microservices," in NOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium, 2020, pp. 1–9.
[151] Álvaro Brandón, M. Solé, A. Huélamo, D. Solans, M. S. Pérez, and V. Muntés-Mulero, "Graph-based root cause analysis for service-oriented and microservice architectures," Journal of Systems and Software, vol. 159, p. 110432, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0164121219302067
[152] A. Samir and C. Pahl, "Dla: Detecting and localizing anomalies in containerized microservice architectures using markov models," in 2019 7th International Conference on Future Internet of Things and Cloud (FiCloud), 2019, pp. 205–213.
[153] P. Wang, J. Xu, M. Ma, W. Lin, D. Pan, Y. Wang, and P. Chen, "Cloudranger: Root cause identification for cloud native systems," in 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2018, pp. 492–502.
[154] L. Mariani, C. Monni, M. Pezzé, O. Riganelli, and R. Xin, "Localizing faults in cloud systems," in 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), 2018, pp. 262–273.
[155] P. Chen, Y. Qi, and D. Hou, "Causeinfer: Automated end-to-end performance diagnosis with hierarchical causality graph in cloud environment," IEEE Transactions on Services Computing, vol. 12, no. 2, pp. 214–230, 2019.
[156] Y. Meng, S. Zhang, Y. Sun, R. Zhang, Z. Hu, Y. Zhang, C. Jia, Z. Wang, and D. Pei, "Localizing failure root causes in a microservice through causality inference," in 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS), 2020, pp. 1–10.
[157] J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint performance issues with causal graphs in micro-service environments," in ICSOC, 2018.
[158] M. Ma, W. Lin, D. Pan, and P. Wang, "Ms-rank: Multi-metric and self-adaptive root cause diagnosis for microservice applications," in 2019 IEEE International Conference on Web Services (ICWS), 2019, pp. 60–67.
[159] M. Ma, J. Xu, Y. Wang, P. Chen, Z. Zhang, and P. Wang, AutoMAP: Diagnose Your Microservice-Based Web Applications Automatically. New York, NY, USA: Association for Computing Machinery, 2020, p. 246–258. [Online]. Available: https://doi.org/10.1145/3366423.3380111
[160] M. Kim, R. Sumbaly, and S. Shah, "Root cause detection in a service-oriented architecture," in Proceedings of the ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '13. New York, NY, USA: Association for Computing Machinery, 2013, p. 93–104. [Online]. Available: https://doi.org/10.1145/2465529.2465753
[161] P. Spirtes and C. Glymour, "An algorithm for fast recovery of sparse causal graphs," Social Science Computer Review, vol. 9, no. 1, pp. 62–72, 1991.
[162] D. M. Chickering, "Learning equivalence classes of Bayesian-network structures," J. Mach. Learn. Res., vol. 2, no. 3, pp. 445–498, 2002.
[163] ——, "Optimal structure identification with greedy search," J. Mach. Learn. Res., vol. 3, no. 3, pp. 507–554, 2003.
[164] J. Runge, P. Nowack, M. Kretschmer, S. Flaxman, and D. Sejdinovic, "Detecting and quantifying causal associations in large nonlinear time series datasets," Science Advances, vol. 5, no. 11, p. eaau4996, 2019. [Online]. Available: https://www.science.org/doi/abs/10.1126/sciadv.aau4996
[165] J. Runge, "Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets," in UAI, 2020.
[166] A. Gerhardus and J. Runge, "High-recall causal discovery for autocorrelated time series with latent confounders," in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 12 615–12 625. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/94e70705efae423efda1088614128d0b-Paper.pdf
[167] J. Qiu, Q. Du, K. Yin, S.-L. Zhang, and C. Qian, "A causality mining and knowledge graph based method of root cause diagnosis for performance anomaly in cloud applications," Applied Sciences, vol. 10, no. 6, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/6/2166
[168] K. Budhathoki, L. Minorics, P. Bloebaum, and D. Janzing, "Causal structure-based root cause analysis of outliers," in ICML 2022, 2022. [Online]. Available: https://www.amazon.science/publications/causal-structure-based-root-cause-analysis-of-outliers
[169] S. Lu, B. Rao, X. Wei, B. Tak, L. Wang, and L. Wang, "Log-based abnormal task detection and root cause analysis for spark," in 2017 IEEE International Conference on Web Services (ICWS), 2017, pp. 389–396.
[170] F. Lin, K. Muzumdar, N. P. Laptev, M.-V. Curelea, S. Lee, and S. Sankar, "Fast dimensional analysis for root cause investigation in a large-scale service environment," Proc. ACM Meas. Anal. Comput. Syst., vol. 4, no. 2, jun 2020. [Online]. Available: https://doi.org/10.1145/3392149
[171] L. Wang, N. Zhao, J. Chen, P. Li, W. Zhang, and K. Sui, "Root-cause metric location for microservice systems via log anomaly detection," in 2020 IEEE International Conference on Web Services (ICWS), 2020, pp. 142–150.
[172] C. Luo, J.-G. Lou, Q. Lin, Q. Fu, R. Ding, D. Zhang, and Z. Wang, "Correlating events with time series for incident diagnosis," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '14. New York, NY, USA: Association for Computing Machinery, 2014, p. 1583–1592. [Online]. Available: https://doi.org/10.1145/2623330.2623374
[173] E. Chuah, S.-h. Kuo, P. Hiew, W.-C. Tjhi, G. Lee, J. Hammond, M. T. Michalewicz, T. Hung, and J. C. Browne, "Diagnosing the root-causes of failures from cluster log files," in 2010 International Conference on High Performance Computing, 2010, pp. 1–10.
[174] T. S. Zaman, X. Han, and T. Yu, "Scminer: Localizing system-level concurrency faults from large system call traces," in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 515–526.
[175] K. Zhang, J. Xu, M. R. Min, G. Jiang, K. Pelechrinis, and H. Zhang, "Automated it system failure prediction: A deep learning approach," in 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 1291–1300.
[176] Y. Yuan, W. Shi, B. Liang, and B. Qin, "An approach to cloud execution failure diagnosis based on exception logs in openstack," in 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), 2019, pp. 124–131.
[177] H. Mi, H. Wang, Y. Zhou, M. R.-T. Lyu, and H. Cai, "Toward fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 6, pp. 1245–1255, 2013.
[178] H. Jiang, X. Li, Z. Yang, and J. Xuan, "What causes my test alarm? automatic cause analysis for test alarms in system and integration testing," in Proceedings of the 39th International Conference on Software Engineering, ser. ICSE '17. IEEE Press, 2017, p. 712–723. [Online]. Available: https://doi.org/10.1109/ICSE.2017.71
[179] A. Amar and P. C. Rigby, "Mining historical test logs to predict bugs and localize faults in the test logs," in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, pp. 140–151.
[180] F. Wang, A. Bundy, X. Li, R. Zhu, K. Nuamah, L. Xu, S. Mauceri, and J. Z. Pan, "Lekg: A system for constructing knowledge graphs from log extraction," in The 10th International Joint Conference on Knowledge Graphs, ser. IJCKG'21. New York, NY, USA: Association for Computing Machinery, 2021, p. 181–185. [Online]. Available: https://doi.org/10.1145/3502223.3502250
[181] A. Ekelhart, F. J. Ekaputra, and E. Kiesling, "The slogert framework for automated log knowledge graph construction," in The Semantic Web, R. Verborgh, K. Hose, H. Paulheim, P.-A. Champin, M. Maleshkova, O. Corcho, P. Ristoski, and M. Alam, Eds. Cham: Springer International Publishing, 2021, pp. 631–646.
[182] C. Bansal, S. Renganathan, A. Asudani, O. Midy, and M. Janakiraman, "Decaf: Diagnosing and triaging performance issues in large-scale cloud services," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice, ser. ICSE-SEIP '20. New York, NY, USA: Association for Computing Machinery, 2020, p. 201–210. [Online]. Available: https://doi.org/10.1145/3377813.3381353
[183] B. C. Tak, S. Tao, L. Yang, C. Zhu, and Y. Ruan, "Logan: Problem diagnosis in the cloud using log-based reference models," in 2016 IEEE International Conference on Cloud Engineering (IC2E), 2016, pp. 62–67.
[184] W. Shang, Z. M. Jiang, H. Hemmati, B. Adams, A. E. Hassan, and P. Martin, "Assisting developers of big data analytics applications when deploying on hadoop clouds," in 2013 35th International Conference on Software Engineering (ICSE), 2013, pp. 402–411.
[185] C. Pham, L. Wang, B. C. Tak, S. Baset, C. Tang, Z. Kalbarczyk, and R. K. Iyer, "Failure diagnosis for distributed systems using targeted fault injection," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 2, pp. 503–516, 2017.
[186] K. Nagaraj, C. Killian, and J. Neville, "Structured comparative analysis of systems logs to diagnose performance problems," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI'12. USA: USENIX Association, 2012, p. 26.
[187] H. Ikeuchi, A. Watanabe, T. Kawata, and R. Kawahara, "Root-cause diagnosis using logs generated by user actions," in 2018 IEEE Global Communications Conference (GLOBECOM), 2018, pp. 1–7.
[188] P. Aggarwal, A. Gupta, P. Mohapatra, S. Nagar, A. Mandal, Q. Wang, and A. Paradkar, "Localization of operational faults in cloud applications by mining causal dependencies in logs using golden signals," in Service-Oriented Computing – ICSOC 2020 Workshops, H. Hacid, F. Outay, H.-y. Paik, A. Alloum, M. Petrocchi, M. R. Bouadjenek, A. Beheshti, X. Liu, and A. Maaradji, Eds. Cham: Springer International Publishing, 2021, pp. 137–149.
[189] Y. Zhang, Z. Guan, H. Qian, L. Xu, H. Liu, Q. Wen, L. Sun, J. Jiang, L. Fan, and M. Ke, "Cloudrca: A root cause analysis framework for cloud computing platforms," in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, ser. CIKM '21. New York, NY, USA: Association for Computing Machinery, 2021, p. 4373–4382. [Online]. Available: https://doi.org/10.1145/3459637.3481903
[190] H. Wang, Z. Wu, H. Jiang, Y. Huang, J. Wang, S. Kopru, and T. Xie, "Groot: An event-graph-based approach for root cause analysis in industrial settings," in Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '21. IEEE Press, 2021, p. 419–429. [Online]. Available: https://doi.org/10.1109/ASE51524.2021.9678708
[191] X. Fu, R. Ren, S. A. McKee, J. Zhan, and N. Sun, "Digging deeper into cluster system logs for failure prediction and root cause diagnosis," in 2014 IEEE International Conference on Cluster Computing (CLUSTER), 2014, pp. 103–112.
[192] S. Kobayashi, K. Fukuda, and H. Esaki, "Mining causes of network events in log data with causal inference," in 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), 2017, pp. 45–53.
[193] S. Kobayashi, K. Otomo, and K. Fukuda, "Causal analysis of network logs with layered protocols and topology knowledge," in 2019 15th International Conference on Network and Service Management (CNSM), 2019, pp. 1–9.
[194] R. Jarry, S. Kobayashi, and K. Fukuda, "A quantitative causal analysis for network log data," in 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), 2021, pp. 1437–1442.
[195] D. Liu, C. He, X. Peng, F. Lin, C. Zhang, S. Gong, Z. Li, J. Ou, and Z. Wu, "Microhecl: High-efficient root cause localization in large-scale microservice systems," in Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice, ser. ICSE-SEIP '21. IEEE Press, 2021, p. 338–347. [Online]. Available: https://doi.org/10.1109/ICSE-SEIP52600.2021.00043
[196] Z. Li, J. Chen, R. Jiao, N. Zhao, Z. Wang, S. Zhang, Y. Wu, L. Jiang, L. Yan, Z. Wang et al., "Practical root cause localization for microservice systems via trace analysis," in 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS). IEEE, 2021, pp. 1–10.
[197] X. Zhou, X. Peng, T. Xie, J. Sun, C. Ji, D. Liu, Q. Xiang, and C. He, "Latent error prediction and fault localization for microservice applications by learning from system trace logs," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 683–694.
[198] P. Liu, H. Xu, Q. Ouyang, R. Jiao, Z. Chen, S. Zhang, J. Yang, L. Mo, J. Zeng, W. Xue et al., "Unsupervised detection of microservice trace anomalies through service-level deep bayesian networks," in 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2020, pp. 48–58.
[199] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, "Dapper, a large-scale distributed systems tracing infrastructure," 2010.
[200] J. Jiang, W. Lu, J. Chen, Q. Lin, P. Zhao, Y. Kang, H. Zhang, Y. Xiong, F. Gao, Z. Xu, Y. Dang, and D. Zhang, "How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2020. New York, NY, USA: Association for Computing Machinery, 2020, p. 1410–1420. [Online]. Available: https://doi.org/10.1145/3368089.3417054
[201] M. Shetty, C. Bansal, S. P. Upadhyayula, A. Radhakrishna, and A. Gupta, "Autotsg: Learning and synthesis for incident troubleshooting," CoRR, vol. abs/2205.13457, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.13457
[202] X. Nie, Y. Zhao, K. Sui, D. Pei, Y. Chen, and X. Qu, "Mining causality graph for automatic web-based service diagnosis," in 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC), 2016, pp. 1–8.
[203] W. Lin, M. Ma, D. Pan, and P. Wang, "Facgraph: Frequent anomaly correlation graph mining for root cause diagnose in micro-service architecture," in 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC), 2018, pp. 1–8.
[204] R. Ding, Q. Fu, J. G. Lou, Q. Lin, D. Zhang, and T. Xie, "Mining historical issue repositories to heal large-scale online service systems," in 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2014, pp. 311–322.
[205] M. Shetty, C. Bansal, S. Kumar, N. Rao, N. Nagappan, and T. Zimmermann, "Neural knowledge extraction from cloud service incidents," in Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice, ser. ICSE-SEIP '21. IEEE Press, 2021, p. 218–227. [Online]. Available: https://doi.org/10.1109/ICSE-SEIP52600.2021.00031
[206] M. Shetty, C. Bansal, S. Kumar, N. Rao, and N. Nagappan, "Softner: Mining knowledge graphs from cloud incidents," Empir. Softw. Eng., vol. 27, no. 4, p. 93, 2022. [Online]. Available: https://doi.org/10.1007/s10664-022-10159-w
[207] A. Saha and S. C. H. Hoi, "Mining root cause knowledge from cloud service incident investigations for aiops," in 44th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, ICSE (SEIP) 2022, Pittsburgh, PA, USA, May 22-24, 2022. IEEE, 2022, pp. 197–206. [Online]. Available: https://doi.org/10.1109/ICSE-SEIP55303.2022.9793994
[208] S. Becker, F. Schmidt, A. Gulenko, A. Acker, and O. Kao, "Towards aiops in edge computing environments," in 2020 IEEE International Conference on Big Data (Big Data), 2020, pp. 3470–3475.
[209] S. Levy, R. Yao, Y. Wu, Y. Dang, P. Huang, Z. Mu, P. Zhao, T. Ramani, N. Govindaraju, X. Li, Q. Lin, G. L. Shafriri, and M. Chintalapati, "Predictive and adaptive failure mitigation to avert production cloud VM interruptions," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, Nov. 2020, pp. 1155–1170. [Online]. Available: https://www.usenix.org/conference/osdi20/presentation/levy
[210] J. D. Hamilton, Time Series Analysis. Princeton University Press, 1994.
[211] R. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice, 2nd ed. OTexts, 2018.
[212] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[213] R. N. Calheiros, E. Masoumi, R. Ranjan, and R. Buyya, "Workload prediction using arima model and its impact on cloud applications' qos," IEEE Transactions on Cloud Computing, vol. 3, no. 4, pp. 449–458, 2015.
[214] D. Buchaca, J. L. Berral, C. Wang, and A. Youssef, "Proactive container auto-scaling for cloud native machine learning services," in 2020 IEEE 13th International Conference on Cloud Computing (CLOUD), 2020, pp. 475–479.
[215] M. Wajahat, A. Gandhi, A. Karve, and A. Kochut, "Using machine learning for black-box autoscaling," in 2016 Seventh International Green and Sustainable Computing Conference (IGSC), 2016, pp. 1–8.
[216] N.-M. Dang-Quang and M. Yoo, "Deep learning-based autoscaling using bidirectional long short-term memory for kubernetes," Applied Sciences, vol. 11, no. 9, 2021. [Online]. Available: https://www.mdpi.com/2076-3417/11/9/3835
[217] Y. Garí, D. A. Monge, E. Pacini, C. Mateos, and C. García Garino, "Reinforcement learning-based application autoscaling in the cloud: A survey," Engineering Applications of Artificial Intelligence, vol. 102, p. 104288, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0952197621001354
[218] S. Mustafa, B. Nazir, A. Hayat, A. ur Rehman Khan, and S. A. Madani, "Resource management in cloud computing: Taxonomy, prospects, and challenges," Computers and Electrical Engineering, vol. 47, pp. 186–203, 2015. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S004579061500275X
[219] T. Khan, W. Tian, G. Zhou, S. Ilager, M. Gong, and R. Buyya, "Machine learning (ml)-centric resource management in cloud computing: A review and future directions," Journal of Network and Computer Applications, vol. 204, p. 103405, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1084804522000649
[220] F. Nzanywayingoma and Y. Yang, "Efficient resource management techniques in cloud computing environment: a review and discussion," International Journal of Computers and Applications, vol. 41, no. 3, pp. 165–182, 2019. [Online]. Available: https://doi.org/10.1080/1206212X.2017.1416558
[221] N. M. Gonzalez, T. C. M. D. B. Carvalho, and C. C. Miers, "Cloud resource management: Towards efficient execution of large-scale scientific applications and workflows on complex infrastructures," J. Cloud Comput., vol. 6, no. 1, dec 2017. [Online]. Available: https://doi.org/10.1186/s13677-017-0081-4
[222] R. Bianchini, M. Fontoura, E. Cortez, A. Bonde, A. Muzio, A.-M. Constantin, T. Moscibroda, G. Magalhaes, G. Bablani, and M. Russinovich, "Toward ml-centric cloud platforms," Commun. ACM, vol. 63, no. 2, p. 50–59, jan 2020. [Online]. Available: https://doi.org/10.1145/3364684
[223] E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura, and R. Bianchini, "Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms," in Proceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP '17. New York, NY, USA: Association for Computing Machinery, 2017, p. 153–167. [Online]. Available: https://doi.org/10.1145/3132747.3132772
[224] K. Haghshenas and S. Mohammadi, "Prediction-based underutilized and destination host selection approaches for energy-efficient dynamic vm consolidation in data centers," The Journal of Supercomputing, vol. 76, no. 12, pp. 10 240–10 257, Dec 2020. [Online]. Available: https://doi.org/10.1007/s11227-020-03248-4
[225] S. Ilager, K. Ramamohanarao, and R. Buyya, "Thermal prediction for efficient energy management of clouds using machine learning," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 5, pp. 1044–1056, 2021.
[226] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh, "Learning scheduling algorithms for data processing clusters," in Proceedings of the ACM Special Interest Group on Data Communication, ser. SIGCOMM '19. New York, NY, USA: Association for Computing Machinery, 2019, p. 270–288. [Online]. Available: https://doi.org/10.1145/3341302.3342080
[227] D. Anderson, "What is apm?" 2021. [Online]. Available: https://www.dynatrace.com/news/blog/what-is-apm-2/
[228] J. Livens, "What is observability? not just logs, metrics and traces," 2021. [Online]. Available: https://www.dynatrace.com/news/blog/what-is-observability-2/
[229] Y. Guo, Y. Wen, C. Jiang, Y. Lian, and Y. Wan, "Detecting log anomalies with multi-head attention (LAMA)," CoRR, vol. abs/2101.02392, 2021. [Online]. Available: https://arxiv.org/abs/2101.02392
[230] S. Zhang, Y. Liu, X. Zhang, W. Cheng, H. Chen, and H. Xiong, "Cat: Beyond efficient transformer for content-aware anomaly detection in event sequences," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD '22. New York, NY, USA: Association for Computing Machinery, 2022, p. 4541–4550. [Online]. Available: https://doi.org/10.1145/3534678.3539155
[231] C. Zhang, X. Wang, H. Zhang, H. Zhang, and P. Han, "Log sequence anomaly detection based on local information extraction and globally sparse transformer model," IEEE Transactions on Network and Service Management, vol. 18, no. 4, pp. 4119–4133, 2021.
[232] Q. Wang, X. Zhang, X. Wang, and Z. Cao, "Log Sequence Anomaly Detection Method Based on Contrastive Adversarial Training and Dual Feature Extraction," Entropy, vol. 24, no. 1, p. 69, Dec. 2021.
[233] J. Qi, Z. Luan, S. Huang, Y. Wang, C. J. Fung, H. Yang, and D. Qian, "Adanomaly: Adaptive anomaly detection for system logs with adversarial learning," in 2022 IEEE/IFIP Network Operations and Management Symposium, NOMS 2022, Budapest, Hungary, April 25-29, 2022. IEEE, 2022, pp. 1–5. [Online]. Available: https://doi.org/10.1109/NOMS54207.2022.9789917
[234] M. Attariyan, M. Chow, and J. Flinn, "X-ray: Automating root-cause diagnosis of performance anomalies in production software," in 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), 2012, pp. 307–320.
[235] X. Guo, X. Peng, H. Wang, W. Li, H. Jiang, D. Ding, T. Xie, and L. Su, "Graph-based trace analysis for microservice architecture understanding and problem diagnosis," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1387–1397.
[236] Y. Xu, Y. Zhu, B. Qiao, H. Che, P. Zhao, X. Zhang, Z. Li, Y. Dang, and Q. Lin, "Tracelingo: Trace representation and learning for performance issue diagnosis in cloud services," in 2021 IEEE/ACM International Workshop on Cloud Intelligence (CloudIntelligence). IEEE, 2021, pp. 37–40.
[237] M. Li, Z. Li, K. Yin, X. Nie, W. Zhang, K. Sui, and D. Pei, "Causal inference-based root cause analysis for online service systems with intervention recognition," in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ser. KDD '22. New York, NY, USA: Association for Computing Machinery, 2022, p. 3230–3240. [Online]. Available: https://doi.org/10.1145/3534678.3539041