AIOps Incident Detection

Enable AIOps Incident Detection

Written by Josh Peters
Updated over a week ago

When managing large and distributed networks, it can be challenging to know where to focus your attention, especially when issues have a limited impact or are short-lived. User Experience Insight AIOps can help you transform your operations by identifying the most critical issues that need attention before users complain. The first component of AIOps is Incident Detection.

Motivation for AIOps Incident Detection

User Experience Insight sensors run synthetic tests one at a time in a continuous round-robin sequence. When these tests identify a problem, they generate an issue. There are two types of issues observed in the dashboard which generate notifications: threshold violations and test failures.

The Incident Detection system examines this issue data in real time to identify issues that are significantly different from your typical issue profile. These anomalous issues are consolidated into incidents and surfaced on the dashboard. An incident is a collection of related anomalous issues.

Incident Detection Models

User Experience Insight Incident Detection begins with training a machine learning model for the typical issue profile of your deployment, using historical data.

When an issue is detected and the timing of its arrival in relation to other issues does not conform to the model, the anomalous issue and other related non-conforming issues will appear red in the dashboard, letting you know that this set of issues is anomalous and might need immediate attention. Emails, alerts, and other notifications will only be sent for issues that are classified as incidents (in red) on the dashboard. When an issue is detected and the timing of its arrival in relation to other issues conforms to the model, the issue will appear blue to indicate that it is informational. You will not receive email or webhook alerts for these issues.

Incident: a group of issues that occur together within a short amount of time in an unexpected way.

What are Blue and Red issues?

Issues can be highlighted in the dashboard in two ways: Blue and Red.

A Blue issue is an issue that is not unusual, meaning it has not been identified as part of an incident.

A Red issue is an unusual issue. It appears when the rate of incoming issues is suddenly much higher than normal, and that rate increase has to include at least 11 issues.

An issue normally starts as Blue. The only time it turns directly to Red is when 11 or more issues happen in an unusually small amount of time. Here, "unusual" is measured relative to the network that the sensors are on (such as guest WiFi or an Ethernet VLAN), relative to a type of service (such as Slack or Facebook), or relative to a type of issue (such as External connectivity failure or HTTP Timeout).

For example, suppose we are seeing 10 issues for Slack that are ordinary. That is standard behavior, so those issues stay Blue. If you then suddenly get 100 issues over 2 minutes, that is recognized as unusual: those issues create an incident and turn Red.
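The Slack example above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb described in this article, not the actual machine learning model: the function name `classify_burst`, the `typical_rate_per_minute` baseline, and the `rate_factor` multiplier are all hypothetical stand-ins for what the trained model considers "normal". Only the minimum of 11 issues comes from the article.

```python
MIN_ISSUES_FOR_INCIDENT = 11  # from the article: an incident needs at least 11 issues


def classify_burst(issue_count, window_minutes, typical_rate_per_minute, rate_factor=5.0):
    """Return "red" if a burst of issues looks anomalous, otherwise "blue".

    typical_rate_per_minute and rate_factor are assumptions made for this
    sketch; the real product learns the typical issue profile from history.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    observed_rate = issue_count / window_minutes
    is_unusual = observed_rate > rate_factor * typical_rate_per_minute
    if issue_count >= MIN_ISSUES_FOR_INCIDENT and is_unusual:
        return "red"
    return "blue"


# Assumed baseline for Slack in this sketch: about 10 issues per hour.
slack_baseline = 10 / 60

print(classify_burst(10, 60, slack_baseline))  # ordinary volume -> blue
print(classify_burst(100, 2, slack_baseline))  # sudden spike -> red
```

Note that a burst of fewer than 11 issues stays blue in this sketch even if it arrives quickly, which mirrors the minimum described above.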

Blue to Red:

All issues start out Blue. When issues start increasing on an unusual scale (11 or more) within a short amount of time, those Blue issues turn Red and become part of the incident. There are also instances when issues turn Red directly, such as during a sudden spike in HTTP timeouts. These issues stay Red because they have been classified as an incident or anomaly.

Red to Blue:

At a high level, you should not expect issues to go from Red to Blue. Once an issue has been identified as part of an incident, it stays Red. The exception is when the incident is closed. For example, if an incident contains 100 issues and 80% of them are resolved, we typically resolve the incident. If any issues that were part of the incident are still ongoing at that stage, we assume that they should not have been put in this incident and release them. Those issues then turn from Red to Blue.
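The Red to Blue release can be pictured with a small Python sketch. It assumes, for illustration only, that an incident is a mapping from issue id to a resolved flag and that the 80% figure from the example acts as a fixed threshold. The function name `close_incident_if_resolved` and the data shape are hypothetical, and the real resolution logic may differ.

```python
RESOLVE_PERCENT = 80  # the article's example: resolve when 80% of issues are resolved


def close_incident_if_resolved(issues):
    """issues maps an issue id to True (resolved) or False (still ongoing).

    Returns (closed, released): whether the incident closes, and the ids of
    still-ongoing issues that are released and shown Blue again.
    """
    if not issues:
        return False, []
    resolved = sum(1 for done in issues.values() if done)
    # Integer comparison avoids floating-point rounding at the threshold.
    if resolved * 100 < RESOLVE_PERCENT * len(issues):
        return False, []  # incident stays open, every issue stays Red
    released = sorted(issue_id for issue_id, done in issues.items() if not done)
    return True, released


# 100 issues, of which the first 80 are resolved.
incident = {issue_id: issue_id < 80 for issue_id in range(100)}
closed, released = close_incident_if_resolved(incident)
print(closed, len(released))  # True 20
```

In this sketch the 20 ongoing issues are the ones that would turn from Red back to Blue.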

The model requires at least 20 active sensors and sufficient issue data to build relevant models. The model is recalculated every week. As this feature's capabilities expand, more models will be added.

How to Enable AIOps Incident Detection

Go to Settings → AIOps and follow the wizard to enable the feature. Once enabled, the transition to AI Incident mode may take up to 15 minutes. This feature can be toggled on or off only once every 4 hours, so once you have enabled it, you will need to wait 4 hours before you can disable it.
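If you want to plan around these timings, the arithmetic is simple. The Python sketch below is not part of the product; `toggle_schedule` is a hypothetical helper that takes the time you toggled the feature and returns the latest time the transition should finish and the earliest time you can toggle again, using only the 15-minute and 4-hour figures stated above.

```python
from datetime import datetime, timedelta

MAX_TRANSITION = timedelta(minutes=15)  # from the article: transition may take up to 15 minutes
TOGGLE_COOLDOWN = timedelta(hours=4)    # from the article: one toggle every 4 hours


def toggle_schedule(toggled_at):
    """Return the key times that follow a toggle made at toggled_at."""
    return {
        "transition_done_by": toggled_at + MAX_TRANSITION,
        "next_toggle_allowed": toggled_at + TOGGLE_COOLDOWN,
    }


schedule = toggle_schedule(datetime(2024, 1, 1, 9, 0))
print(schedule["transition_done_by"])   # 2024-01-01 09:15:00
print(schedule["next_toggle_allowed"])  # 2024-01-01 13:00:00
```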

Note: When Incident Detection is enabled, there are no yellow 'warning issues' on your dashboard.

How to View Incidents

To view incidents from the past 7 or 30 days, select the bell icon on the top right of the main dashboard. The naming convention for incidents is Month/Year-Incident Number.

Select an Incident to navigate to the Incident View. This view shows the specific time period of the incident and where the sensors are located. You can rename the incident and drill down into the triage to better understand the issue.

Current Limitations

  • Mutes - Mutes only affect the visual representation of the dashboard. They do not affect whether an issue can be added to an Incident, and they do not affect notifications.

  • Please note that read-only users will receive incident notification messages regardless of group assignment.

  • Weekly reports are issue-aware for now but will eventually evolve to provide full support for incidents.

  • When Incident Detection is enabled, there are no yellow 'warning issues' on your dashboard.

Improve Incident Detection for your account

Admin users can now vote thumbs up or thumbs down for each incident. This will tell us if the incident was relevant and useful to you.

Voting thumbs up indicates that you want to see more incidents like this one, and voting thumbs down indicates that you would like to see fewer such incidents moving forward.

Read-only users will not be able to use this voting functionality. The votes we receive from you will create direct feedback into our machine learning capabilities which will adapt to your preferences over time.

Ongoing Incidents but no red smiley on the dashboard

This is a fairly common scenario because the incident waits about 10 minutes before closing, in case more issues come in. As a result, an incident can remain open for a while after all of its issues are closed.
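The waiting period can be illustrated with a short Python sketch. It assumes that the wait is counted from the moment the last issue in the incident closes and that it is exactly 10 minutes. Both are simplifications for illustration, since the article only says "about 10 minutes", and `incident_still_open` is a hypothetical helper, not a product API.

```python
from datetime import datetime, timedelta

CLOSE_WAIT = timedelta(minutes=10)  # approximate figure from the article


def incident_still_open(last_issue_closed_at, now):
    """True while the incident is still waiting for possible new issues."""
    return now - last_issue_closed_at < CLOSE_WAIT


last_closed = datetime(2024, 1, 1, 12, 0)
print(incident_still_open(last_closed, datetime(2024, 1, 1, 12, 5)))   # True
print(incident_still_open(last_closed, datetime(2024, 1, 1, 12, 15)))  # False
```

During the first window the dashboard can show an ongoing incident even though no issue is currently active.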

Future Considerations

We are considering additional ML models and enhanced AIOps capabilities as we continue to improve upon this feature.

