Alert Noise Reduction using CloudFabrix AIOps Platform
May 22nd, 2019
Alert noise is a common problem in most IT organizations. It arises from an excessive number of alerts that are false, redundant or not actionable. This is primarily because traditional or legacy IT operations tools cannot meet the demands of modern IT environments.
The most common causes of alert noise are:
- Lack of stack awareness
- Static thresholds
- Alert storms
1) Lack of Stack Awareness –> Full Stack Awareness, Application Dependency Map
To bring full stack awareness
- First, we run AI discovery to discover and group similar hosts using a clustering algorithm, which is an unsupervised machine learning technique. Specifically, the platform uses the K-Means clustering algorithm.
- This is very different from traditional discovery approaches that rely solely on signature/footprint matching. Dimensions instead combines a user feedback loop (training) with machine learning to obtain the best possible results.
- Second, we establish a dependency map that provides the topology and application context needed for alert correlation. The dependency map captures connected-neighbor data and establishes the notion of an application or stack, which allows impact analysis and correlation of alerts/events from multiple stack elements.
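The clustering step in the first bullet can be illustrated with a small, self-contained K-Means sketch. This is not the platform's implementation: the host names, the three feature columns and the farthest-point initialisation are all assumptions made for the example, and a real deployment would normally scale the features first.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-Means on tuples of numbers; returns (centroids, labels)."""
    # Deterministic farthest-point initialisation (an assumption, chosen so
    # the example is reproducible; random initialisation is more common).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Hypothetical host features: (open_port_count, avg_cpu_percent, process_count)
hosts = {
    "web-01": (3, 20.0, 40),
    "web-02": (4, 25.0, 45),
    "db-01": (2, 70.0, 120),
    "db-02": (2, 75.0, 130),
}
names = list(hosts)
_, labels = kmeans([hosts[n] for n in names], k=2)

# Collect host names per cluster label.
groups = {}
for name, label in zip(names, labels):
    groups.setdefault(label, []).append(name)
```

With these made-up features, the two web hosts end up in one group and the two database hosts in the other. In practice a user feedback loop, as described above, would be used to correct or confirm such groupings.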
2) Static Thresholds –> Dynamic Thresholds and Granular Controls
Static thresholds ignore the dynamic nature of IT workloads and create alerts at pre-established levels, which does not work for the majority of IT workloads. This results in an excessive number of alerts.
To address this problem:
- Granular Controls: We provide granular alert controls to tune the telemetry collection interval. To minimize alerts caused by metric fluctuations, we also provide hi-watermark, lo-watermark and minimum occurrence controls.
- Dynamic Thresholds: Our dynamic thresholds establish a baseline for every metric and raise an alert only if the metric deviates from its baseline.
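As a rough illustration of the dynamic-threshold idea, the sketch below keeps a rolling window of recent samples as the baseline and flags a value only when it lies several standard deviations away from the window mean. The class name, the window size, the 3-sigma rule and the `min_spread` floor are assumptions for this example. The post does not describe how the platform actually computes its baselines.

```python
from collections import deque
from statistics import mean, stdev

class DynamicThreshold:
    """Rolling mean/stdev baseline for one metric (illustrative only)."""

    def __init__(self, window=30, sigmas=3.0, min_samples=5, min_spread=1.0):
        self.history = deque(maxlen=window)  # recent normal samples
        self.sigmas = sigmas
        self.min_samples = min_samples
        self.min_spread = min_spread

    def observe(self, value):
        """Feed one sample; return True if it deviates from the baseline."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            # Floor the spread so a nearly flat history does not alert
            # on trivial changes.
            spread = max(stdev(self.history), self.min_spread)
            anomalous = abs(value - baseline) > self.sigmas * spread
        if not anomalous:
            # Design choice for this sketch: anomalies are kept out of the
            # baseline so a single spike does not widen it.
            self.history.append(value)
        return anomalous

cpu = DynamicThreshold(window=10, sigmas=3.0, min_samples=5)
warmup = [cpu.observe(v) for v in [50, 52, 49, 51, 50, 51]]
spike = cpu.observe(90)
back_to_normal = cpu.observe(51)
```

Here the first six samples only build the baseline, the jump to 90 is flagged, and the return to 51 is not. Excluding anomalies from the history means a permanent level shift would keep alerting, so a production system would need some way to re-learn the baseline.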
We also identify heavily utilized assets where alerting should be disabled or more capacity should be added.
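The hi-watermark, lo-watermark and minimum occurrence controls mentioned in this section can be thought of as hysteresis plus a debounce counter. The sketch below is one plausible reading, not the product's documented behaviour: it assumes an alert is raised after `min_occurrences` consecutive samples at or above the hi-watermark and cleared only once a sample falls to or below the lo-watermark. The class and parameter names are hypothetical.

```python
class WatermarkAlert:
    """Hysteresis with a minimum-occurrence counter (assumed semantics)."""

    def __init__(self, hi, lo, min_occurrences):
        if lo >= hi:
            raise ValueError("lo watermark must be below hi watermark")
        self.hi = hi
        self.lo = lo
        self.min_occurrences = min_occurrences
        self.active = False   # is an alert currently raised?
        self.breaches = 0     # consecutive samples at or above hi

    def update(self, value):
        """Feed one sample; return 'raise', 'clear' or None."""
        if not self.active:
            if value >= self.hi:
                self.breaches += 1
                if self.breaches >= self.min_occurrences:
                    self.active = True
                    self.breaches = 0
                    return "raise"
            else:
                self.breaches = 0  # the streak was interrupted
        elif value <= self.lo:
            self.active = False
            return "clear"
        return None

cpu_alert = WatermarkAlert(hi=90, lo=70, min_occurrences=3)
events = [cpu_alert.update(v) for v in [95, 60, 92, 93, 95, 85, 91, 65]]
```

In this run the isolated 95 is ignored, the alert is raised on the third consecutive high sample, the dip to 85 does not clear it because it is still above the lo-watermark, and the alert clears at 65. A metric hovering around a single static threshold would have produced several raise/clear pairs instead of one.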
3) Alert Storms –> Actionable Alerts
Alert storms can occur at any time, but are most likely during planned and unplanned outages.
- Planned Outages: With our platform, alerts can be configured to be ignored during planned outages such as patching, backup or maintenance. In addition, we can automatically exclude network device access ports from monitoring, since these ports generate an excessive number of unwanted alerts whenever employees, remote users, phones etc. connect to or disconnect from the network.
- Unplanned Outages: Unplanned outages such as network disruptions or device-unavailable events, along with device fluctuations (flapping), cause alert storms. We detect these situations automatically and suppress the resulting alerts.
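The two suppression ideas above, maintenance windows and flap detection, can be sketched together. Everything here is an assumption for illustration: the device names, the "N state changes within a time window" definition of flapping, and the function names do not come from the platform.

```python
from collections import deque
from datetime import datetime, timedelta

class FlapDetector:
    """Treats a device as flapping after too many state changes in a window."""

    def __init__(self, max_changes=4, window=timedelta(minutes=10)):
        self.max_changes = max_changes
        self.window = window
        self.changes = {}  # device -> deque of recent change timestamps

    def record_change(self, device, when):
        q = self.changes.setdefault(device, deque())
        q.append(when)
        # Drop changes that have aged out of the window.
        while q and when - q[0] > self.window:
            q.popleft()

    def is_flapping(self, device):
        return len(self.changes.get(device, ())) >= self.max_changes

def in_maintenance(windows, device, when):
    """True if `when` falls inside any planned window for the device."""
    return any(start <= when < end for start, end in windows.get(device, ()))

def should_notify(device, when, windows, flaps):
    """Decide whether a state-change alert should reach an operator."""
    flaps.record_change(device, when)
    if in_maintenance(windows, device, when):
        return False  # planned outage: ignore
    if flaps.is_flapping(device):
        return False  # alert storm from a flapping device: suppress
    return True

t0 = datetime(2019, 5, 22, 2, 0)
windows = {"sw-1": [(t0, t0 + timedelta(hours=1))]}  # hypothetical patch window
flaps = FlapDetector(max_changes=4, window=timedelta(minutes=10))

patched = should_notify("sw-1", t0 + timedelta(minutes=10), windows, flaps)
router = [
    should_notify("rtr-1", t0 + timedelta(minutes=m), windows, flaps)
    for m in (0, 1, 2, 3, 30)
]
```

The switch alert is dropped because it falls inside its maintenance window. For the router, the first three changes are delivered, the fourth within ten minutes is suppressed as flapping, and the change at minute 30 is delivered again because the earlier changes have aged out.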
In short, all of these key capabilities and many other features help keep alert noise under control.
In addition, our platform provides automated diagnosis and remediation of incidents with Incident room, which we will cover in another blog post.
To learn more about CloudFabrix cfxDimensions platform, subscribe to one of our upcoming webinars or visit https://cloudfabrix.com