AIOps Incident Detection | User Experience Insight Help Center

When managing large and distributed networks, it can be challenging to know where to focus your attention, especially when issues have a limited impact or are short-lived.

User Experience Insight AIOps can help transform your operations by identifying the most critical issues that need attention before users complain. The first component of AIOps is Incident Detection.

Motivation for AIOps Incident Detection

User Experience Insight sensors run synthetic tests one at a time in a continuous round-robin sequence. When these tests identify a problem, they generate an issue.

Two types of issues appear on the dashboard and generate notifications: threshold violations and test failures.

The Incident Detection system examines issue data in real time to identify issues that differ significantly from your typical issue profile. These anomalous issues are consolidated into incidents and surfaced on the dashboard. An incident is a collection of related anomalous issues.

Incident Detection Models

User Experience Insight Incident Detection begins by training a machine learning model on the typical issue profile of your deployment, using historical data.

When an issue is detected and the timing of its arrival relative to other issues does not conform to the model, that issue and the related non-conforming issues appear red on the dashboard. Red indicates that this set of issues is anomalous and might need immediate attention. Emails, alerts, and other notifications are sent only for issues that are classified as incidents (red) on the dashboard.

When an issue is detected and its timing in relation to other issues conforms to the model, the issue will appear blue, indicating it is informational. You will not receive email or webhook alerts for these issues.

An incident occurs when a group of issues happens together in a short amount of time in an unexpected way.

What are Blue and Red Issues?

In the User Experience Insight (UXI) dashboard, issues are highlighted in one of two colors: Blue or Red.

Blue Issues: These represent ordinary issues or issues that have not been identified as incidents. A Blue issue is part of the normal operation and does not require immediate attention.
Red Issues: These indicate unusual issues, specifically when the rate of incoming issues suddenly spikes much higher than normal. For an issue to turn Red, this rate increase must involve at least 11 issues occurring in a very short amount of time.

Typically, an issue starts as Blue. However, if 11 or more issues occur in an unusually short timeframe, the issue can turn Red immediately. What counts as unusual is judged relative to the network (e.g., guest WiFi, Ethernet VLAN), the type of service (e.g., Slack, Facebook), or the type of issue (e.g., External connectivity failure, HTTP Timeout).

Example: If you typically see 10 issues for Slack that are within normal behavior, these will stay Blue. However, if you suddenly see 100 issues over 2 minutes, this would be recognized as unusual, creating an incident, and those issues would turn Red.
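The Blue and Red behavior in this example can be approximated with a sliding-window count. The sketch below is not the UXI model, which is a trained machine learning model. It is a simplified illustration: the 2-minute window and the baseline count per window are hypothetical parameters, and only the minimum of 11 issues comes from this article.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

MIN_ISSUES_FOR_RED = 11  # minimum number of issues stated in this article


def classify_issues(start_times, window=timedelta(minutes=2), baseline_per_window=10.0):
    """Return a "red" or "blue" label for each issue, in chronological order.

    Assumptions (not taken from the product): `window` and
    `baseline_per_window` are hypothetical stand-ins for what the
    trained model learns from historical data.
    """
    times = sorted(start_times)
    labels = ["blue"] * len(times)
    for i, t in enumerate(times):
        # Index of the first issue inside the window that ends at this issue
        lo = bisect_left(times, t - window)
        count = i - lo + 1
        if count >= MIN_ISSUES_FOR_RED and count > baseline_per_window:
            # Every issue in the spike window becomes part of the incident
            for j in range(lo, i + 1):
                labels[j] = "red"
    return labels


base = datetime(2024, 1, 1)
quiet = [base + timedelta(hours=h) for h in range(10)]          # 10 issues, one per hour
spike = [base + timedelta(seconds=1.2 * k) for k in range(100)]  # 100 issues in 2 minutes
print(set(classify_issues(quiet)), set(classify_issues(spike)))
```

With these assumed parameters, the ten widely spaced issues stay Blue and all 100 issues in the spike turn Red.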

Blue to Red

Most issues begin as Blue. If the number of issues rises rapidly within a short period and on an unusual scale (11 or more), they turn Red and become part of an incident. Some issues, such as those in a sudden spike of HTTP timeouts, are Red from the start because they are classified as incidents or anomalies as soon as they arrive.

Red to Blue

Generally, issues do not revert from Red to Blue. Once an issue is identified as part of an incident, it stays Red. However, when an incident is resolved (e.g., 80% of the issues are addressed), the remaining ongoing issues may be reclassified, and those not directly related to the incident may turn from Red to Blue.

Model Requirements

The model requires at least 20 active sensors and sufficient issue data to build relevant models. The model is recalculated every week. As the feature's capabilities expand, more models will be added.

How to Enable AIOps Incident Detection

Go to Settings → AIOps and follow the wizard to enable the feature.
Once enabled, the transition to AI Incident mode may take up to 15 minutes.
This feature can be toggled only once every 4 hours, so after enabling it you must wait 4 hours before you can disable it.
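The timing rules above can be worked out with simple date arithmetic. The following sketch is illustrative only: the function names are hypothetical, and it assumes the 4-hour wait is measured from the moment the feature was last toggled.

```python
from datetime import datetime, timedelta

TOGGLE_COOLDOWN = timedelta(hours=4)     # from this article
MODE_TRANSITION = timedelta(minutes=15)  # upper bound from this article


def next_toggle_time(last_toggled: datetime) -> datetime:
    """Earliest time the feature can be toggled again (assumed rule)."""
    return last_toggled + TOGGLE_COOLDOWN


def can_toggle(last_toggled: datetime, now: datetime) -> bool:
    """True once the assumed 4-hour cooldown has elapsed."""
    return now >= next_toggle_time(last_toggled)


enabled_at = datetime(2024, 5, 1, 9, 0)
print("AI Incident mode active by:", enabled_at + MODE_TRANSITION)
print("Can be disabled from:", next_toggle_time(enabled_at))
```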

Note: When Incident Detection is enabled, there are no yellow 'warning issues' on your dashboard.

Currently, AIOps Incident Detection characterizes issues by:

Start Time
Issue Code (cause of issue, e.g., DNS Failure)
Category (e.g., DNS)
Network (e.g., SSID or Ethernet)
Application/Service Identifier (e.g., Gmail)
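As an illustration, the five attributes listed above could be represented as a simple record. This is a sketch, not the UXI data model: the class name, field names, and the grouping key are hypothetical. Grouping by network, service, and issue code reflects the statement in this article that what counts as unusual is relative to the network, the service, or the issue type.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Issue:
    """Hypothetical record holding the attributes listed in this article."""
    start_time: datetime
    issue_code: str  # cause of the issue, e.g. "DNS Failure"
    category: str    # e.g. "DNS"
    network: str     # e.g. an SSID or "Ethernet"
    service: str     # application/service identifier, e.g. "Gmail"


def profile_key(issue: Issue) -> tuple:
    """Assumed grouping key for comparing issues against a typical profile."""
    return (issue.network, issue.service, issue.issue_code)


def count_by_profile(issues) -> Counter:
    """Count issues per (network, service, issue_code) group."""
    return Counter(profile_key(i) for i in issues)


issues = [
    Issue(datetime(2024, 5, 1, 9, 0), "DNS Failure", "DNS", "Guest WiFi", "Gmail"),
    Issue(datetime(2024, 5, 1, 9, 1), "DNS Failure", "DNS", "Guest WiFi", "Gmail"),
    Issue(datetime(2024, 5, 1, 9, 2), "HTTP Timeout", "HTTP", "Ethernet", "Slack"),
]
print(count_by_profile(issues))
```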

How to View Incidents

To view incidents from the past 7 or 30 days, select the bell icon on the top right of the main dashboard. The naming convention for incidents is Month/Year-Incident Number.

Select an incident to navigate to the Incident View. This view shows the specific time period of the incident and the locations of the sensors involved. You can rename the incident and drill down into the triage to better understand the issue.

Current Limitations

Mutes: Mutes only affect the visual representation of the dashboard. They do not influence whether an issue can be added to an incident or affect notifications.
Incident Notifications for Read-Only Users: Read-only users will receive incident notification messages regardless of group assignment.
Weekly Reports: Weekly reports are currently based on issues; they will eventually provide full support for incidents.
No Yellow 'Warning Issues': When Incident Detection is enabled, there are no yellow 'warning issues' displayed on your dashboard.

Improve Incident Detection for Your Account

Admin users can now vote thumbs up or thumbs down for each incident. This feedback helps us understand if the incident was relevant and useful to you, allowing us to improve the accuracy of Incident Detection for your account.

Voting thumbs up indicates that you want to see more of such incidents, while voting thumbs down indicates that you would like to see fewer of these incidents moving forward.

Read-only users will not be able to use this voting functionality. The votes we receive from you provide direct feedback into our machine learning capabilities, which will adapt to your preferences over time.

Ongoing Incidents but No Red Smiley on the Dashboard

This scenario is fairly common. An incident waits about 10 minutes before closing in case more issues arrive, so it can remain open for a while after all of its issues have been resolved.
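The closing delay can be pictured as a quiet-period check. The sketch below is an assumption-based illustration, not the product's logic: it treats the wait as exactly 10 minutes (the article says "about 10 minutes") and measures it from the most recent issue activity on the incident. The function name is hypothetical.

```python
from datetime import datetime, timedelta

CLOSE_WAIT = timedelta(minutes=10)  # "about 10 minutes"; exact value is assumed


def incident_is_open(last_activity: datetime, now: datetime,
                     close_wait: timedelta = CLOSE_WAIT) -> bool:
    """True while the assumed quiet period since the last activity has not elapsed."""
    return now - last_activity < close_wait


last_activity = datetime(2024, 5, 1, 9, 0)
# Issues are resolved, but the incident is still within its quiet period
print(incident_is_open(last_activity, datetime(2024, 5, 1, 9, 5)))
# Quiet period has elapsed, so the incident closes
print(incident_is_open(last_activity, datetime(2024, 5, 1, 9, 10)))
```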

Future Considerations

We are considering additional ML models and enhanced AIOps capabilities as we continue to improve upon this feature.