technical and non-technical perspectives. Based on our practice
and experience at Microsoft, we summarize the major
challenges of building AIOps solutions as follows.
A. Gaps in innovation methodologies and mindset
Gap in innovation methodologies. Building AIOps solutions
requires holistic thinking and sufficient understanding of the
whole problem space, from business value and constraints, data,
and models to system and process integration considerations.
Today, there is a lack of innovation methodologies that can guide
people in different disciplines (e.g., business stakeholders,
engineers, data scientists) to build AIOps solutions.
Difficulty of the mindset shift. The essential methodology of
AIOps solutions is to learn from history to predict the future and
to identify patterns from large amounts of data. This mindset is
substantially different from the traditional engineering mindset
(e.g., digging into individual cases by looking at bug
reproduction steps and detailed logs, which is inefficient or even
infeasible in large-scale service scenarios). Meanwhile, there is
also a strong AI-solves-everything mindset, which is not a
realistic expectation.
B. Engineering changes needed to support AIOps
Traditional engineering best practices do not fit the needs.
Building AIOps solutions requires significant engineering effort.
AIOps-oriented engineering is still at a very early stage, and
best practices, principles, and design patterns are not yet well
established in the industry. For example, AIOps engineering
principles should include data/label quality monitoring and
assurance, continuous model-quality validation, and
actionability of insights.
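As one illustration of what data/label quality monitoring could look like, the sketch below computes per-field missing rates for a batch of telemetry records and reports the fields that exceed a threshold. It is a minimal sketch only: the record layout, the field names (`node_id`, `cpu`, `label`), and the threshold are hypothetical and are not taken from any Microsoft system.

```python
from typing import Any, Dict, List, Sequence


def missing_rates(records: Sequence[Dict[str, Any]],
                  fields: Sequence[str]) -> Dict[str, float]:
    """Return, per field, the fraction of records where it is absent or None."""
    if not records:
        raise ValueError("cannot assess an empty batch")
    return {
        field: sum(1 for r in records if r.get(field) is None) / len(records)
        for field in fields
    }


def failing_fields(records: Sequence[Dict[str, Any]],
                   fields: Sequence[str],
                   max_missing_rate: float = 0.05) -> List[str]:
    """List the fields whose missing rate exceeds the allowed threshold."""
    rates = missing_rates(records, fields)
    return [field for field in fields if rates[field] > max_missing_rate]


# Hypothetical telemetry batch; "label" is the ground-truth field.
batch = [
    {"node_id": "n1", "cpu": 0.4, "label": 0},
    {"node_id": "n2", "cpu": None, "label": 0},
    {"node_id": "n3", "cpu": 0.9, "label": None},
    {"node_id": "n4", "cpu": 0.7},  # label missing entirely
]

print(missing_rates(batch, ["node_id", "cpu", "label"]))
print(failing_fields(batch, ["node_id", "cpu", "label"], max_missing_rate=0.3))
```

A check of this kind would typically run on every incoming batch, so that a drop in label or feature coverage is noticed before it degrades a model.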
The data quality and quantity available today do not serve
the needs of AIOps solutions. Although major cloud services
today collect terabytes and even petabytes of telemetry data
every day/month, there is still a lack of representative and
high-quality data for building AIOps solutions. Continuous
improvement of data quality and quantity is necessary. The
method of instrumentation and collection of telemetry also
needs to be revisited (e.g., principled instrumentation for AIOps
solutions instead of ad-hoc logging for debugging a few
issues).
C. Difficulty of building ML models for AIOps
Building ML/AI models for AIOps solutions poses unique
challenges that are not always seen in other ML/AI scenarios.
The challenges of building supervised machine learning models
for AIOps include: no clear ground-truth labels, or huge manual
effort needed to obtain high-quality ones (available labels are
extremely imbalanced, too few, very noisy, etc.) [6]; complex
dependencies/relations among components/services [7];
complicated feature engineering due to the high
complexity of cloud service behaviors; continuous model updates
and online learning; and the risk of service interruptions caused
by misbehaving ML models.
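To make the label-imbalance challenge concrete, the following sketch compares two hypothetical failure predictors on synthetic labels in which only 5 of 1000 nodes fail. The data and both predictors are invented for illustration and do not come from the cited studies. It shows why accuracy alone is a poor quality signal under extreme imbalance and why precision and recall on the rare class are usually reported instead.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive (failure) class, labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Synthetic labels: 5 failing nodes (1) among 1000 nodes.
y_true = [1] * 5 + [0] * 995

# Predictor A: always predicts "healthy".
always_healthy = [0] * 1000

# Predictor B: flags 10 nodes, 4 of which really fail.
selective = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 989

for name, y_pred in [("always_healthy", always_healthy), ("selective", selective)]:
    print(name, accuracy(y_true, y_pred), precision_recall(y_true, y_pred))
```

The trivial predictor reaches 99.5% accuracy while catching no failures, whereas the selective predictor has slightly lower accuracy (99.3%) but recalls 4 of the 5 failures.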
In many AIOps scenarios, due to the difficulty of obtaining
labeled data, only unsupervised or semi-supervised machine
learning models are feasible. One example is detecting anomalous
behavior of services [8]. It is difficult to have enough labels to
learn “what is abnormal” for a service, because almost every
service is ever evolving with changes in customer needs and
the underlying infrastructure. The difficulty of building
high-quality unsupervised models lies in the complexity of the
internal logic of services and the huge volume of telemetry
data that needs to be analyzed.
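As a simple illustration of label-free anomaly detection, the sketch below flags outliers in a metric series using the modified z-score based on the median and the median absolute deviation (MAD). This is a generic textbook technique chosen for brevity, not the method of [8]. The latency values are made up, and the 3.5 cutoff is the commonly used default for this score, an assumption rather than a tuned value.

```python
from statistics import median
from typing import List, Sequence


def modified_z_scores(values: Sequence[float]) -> List[float]:
    """Modified z-score of each value: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: at least half the points equal the median.
        return [0.0 if v == med else float("inf") for v in values]
    return [0.6745 * (v - med) / mad for v in values]


def find_anomalies(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Indices of points whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Hypothetical request latencies in milliseconds, with one spike.
latencies = [102, 98, 101, 99, 100, 97, 103, 100, 250, 101]
print(find_anomalies(latencies))
```

No labels are required here, but a single static threshold on one metric does not capture the evolving, multi-component behavior described above, which is where the real difficulty lies.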
IV. RESEARCH INNOVATIONS ON AIOPS
AIOps can be viewed as a cross-disciplinary research and
innovation area. We believe there is a long way to go for the
industry to achieve our AIOps vision. We will focus on the
technical innovations that are needed to achieve our AIOps
vision. Meanwhile, AIOps-related research is not entirely new.
For example, many of the research works on software analytics
[3] can be viewed as AIOps innovations.
A. Cross-disciplinary research
AIOps innovations involve research areas including (but not
limited to) system design, software engineering, big data,
artificial intelligence, machine learning, distributed computing,
and information visualization. For example, system
researchers need to work with machine learning experts to build
services with self-awareness and auto-adaptation [5].
B. Close collaboration between academia and industry
AIOps innovations call for a close partnership between
academia and industry. The real pain of software and service
engineers needs to be well understood. The running behaviors
of real-world services need to be researched. While the
proliferation of open-source software enables easy access to
source code for the research community, it is far from enough
for AIOps innovations.
V. OUTLINE OF TECHNICAL BRIEFING
In this technical briefing, we will present our position on
AIOps in detail: (1) discussing the motivation and
emerging importance of AIOps; (2) describing the real-world
challenges of building AIOps solutions based on our experience
at Microsoft; (3) introducing a set of sample AIOps solutions
that have successfully benefited Microsoft service products; and
(4) sharing some learnings from our AIOps practice.
REFERENCES
[1] “Everything you need to know about AIOps”, from
https://www.moogsoft.com/resources/aiops/guide/everything-aiops/
(retrieved as of Feb. 12, 2019)
[2] IDC FutureScape, “Worldwide CIO Agenda 2019 Predictions”, doc
#US44390218, October 2018
[3] D. Zhang, S. Han, et al., “Software Analytics in Practice”, IEEE
Software, 2013
[4] G. Kim, P. Debois, et al., “The DevOps Handbook: How to Create
World-Class Agility, Reliability, and Security in Technology
Organizations”, IT Revolution Press, Oct. 2016
[5] P. Huang, C. Guo, et al., “Capturing and Enhancing In Situ System
Observability for Failure Detection”, in Proceedings of OSDI 2018
[6] Y. Xu, K. Sui, et al., “Improving Service Availability of Cloud
Systems by Predicting Disk Error”, in Proceedings of USENIX ATC 2018
[7] Q. Lin, K. Hsieh, et al., “Predicting Node Failure in Cloud Service
Systems”, in Proceedings of FSE 2018
[8] Q. Lin, J. Lou, et al., “iDice: Problem Identification for Emerging
Issues”, in Proceedings of ICSE 2016