Anomaly Detection

An anomaly is a data point or event that is inconsistent with normal operating conditions. Detecting anomalies in order to locate problems and understand trends within infrastructure and applications is a key use case for AIOps. Detection allows tools to both recognize behavior that is out of the ordinary (such as a server that is responding more slowly than usual, or uncommon network activity generated by a breach) and react accordingly.
By using Anomaly Detection in our AIOps solution, you gain the following benefits:
  • Our AIOps solution ingests metrics, not just alarms or events. Metrics are a must-have for effective anomaly detection.
  • With our AIOps solution you don’t need to set up thresholds. You can just send metrics to the data lake and our AIOps solution will correlate data and identify anomalies.
  • Our AIOps solution does multi-variate anomaly detection, rather than relying on just a single variate.
  • Our AIOps solution features more than ten AI and ML algorithms, which we have tuned based on our domain expertise. These optimized algorithms enable you to do fast root cause analysis and predictive IT.
  • With our automation and topology mapping, we can accurately detect anomalies, reduce event noise, and pinpoint the root cause of issues.
  • If our AIOps solution is ever incorrect in identifying a root cause, it can take input from operators and learn from this information.
Dynamic Baselining
While understanding the concept of an anomaly might be easy, what makes anomaly detection particularly challenging for AIOps in modern software environments is that, in many instances, there is no consistent means of defining normal operating conditions. The amount of network traffic, memory and storage space that a given environment consumes might fluctuate widely throughout the day, as might the number of active users or application instances. Effective detection under these circumstances requires AIOps tools that are intelligent enough to set dynamic baselines. Dynamic baselines allow the tools to determine what constitutes normal activity under given circumstances (such as the time of day and the number of registered users for an application), then detect data or events that do not align with the dynamic baseline.
Time-Series Anomaly Detection
Time-series data represents time-stamped observations of various probes we have in the environment. In large deployments, we can collect tens of millions of metrics. The majority of these metrics are time averaged and can provide great detail on the transactional or resource-related state of the system.
Each individual metric follows a distribution. Without making any assumptions about that distribution, our KDE algorithm draws kernels of distributions for the historical data points of each metric. Using this distribution, it estimates the probability of a value occurring for a metric. Using a quartile-based breakdown, this distribution helps Anomaly Detection estimate how rare or common a data value is for a particular metric at a particular time of the day. These areas form bands that our Anomaly Detection can then consider normal.
Our AIOps solution then interprets these bands for you. An anomaly is generated when a metric value is in the rare band for long enough.
The raw metrics are published to the Data Science Platform (DSP), where the anomaly detection engine resides.
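The following sketch illustrates the dynamic baselining and time-series detection described above: a per-hour kernel density estimate, a rare band, and an alert only once a metric has stayed in that band long enough. It is an added example, not the product's own code. The Gaussian kernel, Silverman bandwidth, 5% rare-band cutoff, three-sample persistence window, and all names and sample data are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

def kde_density(x, samples, bw):
    """Gaussian-kernel density estimate at x (kernel choice is an assumption)."""
    norm = len(samples) * bw * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bw) ** 2) for s in samples) / norm

def silverman_bandwidth(samples):
    """Silverman's rule of thumb; an assumed bandwidth choice, floored to avoid 0."""
    n, mean = len(samples), sum(samples) / len(samples)
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    return max(1.06 * std * n ** -0.2, 1e-9)

class HourlyBaseline:
    """One KDE per hour of day, so 'normal' depends on when a value is seen."""
    def __init__(self, history, rare_quantile=0.05):
        buckets = defaultdict(list)
        for hour, value in history:
            buckets[hour].append(value)
        self.models = {}
        for hour, samples in buckets.items():
            bw = silverman_bandwidth(samples)
            dens = sorted(kde_density(s, samples, bw) for s in samples)
            # Rare band: density below that of the least likely 5% of history.
            self.models[hour] = (samples, bw, dens[int(rare_quantile * len(dens))])

    def is_rare(self, hour, value):
        samples, bw, cutoff = self.models[hour]
        return kde_density(value, samples, bw) < cutoff

def detect(baseline, stream, min_run=3):
    """Indices at which a metric has sat in the rare band for min_run samples."""
    alerts, run = [], 0
    for i, (hour, value) in enumerate(stream):
        run = run + 1 if baseline.is_rare(hour, value) else 0
        if run == min_run:
            alerts.append(i)
    return alerts

# Constructed sample data: request rate is ~20 at 03:00 and ~80 at 14:00.
rng = random.Random(7)
HISTORY = [(3, rng.gauss(20, 2)) for _ in range(200)] + \
          [(14, rng.gauss(80, 5)) for _ in range(200)]
# 78 is ordinary at 14:00, rare at 03:00; one rare sample alone raises nothing.
STREAM = [(14, 78), (14, 82), (3, 78), (3, 20), (3, 75), (3, 79), (3, 77)]

if __name__ == "__main__":
    print(detect(HourlyBaseline(HISTORY), STREAM))
```

The same value of 78 passes at 14:00 and falls in the rare band at 03:00, and the isolated rare sample at index 2 does not alert because the run resets on the next normal reading.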