AIOps Incident Detection
Enable AIOps Incident Detection
Written by Josh Peters
Updated over a week ago

When managing large and distributed networks, it can be challenging to know where to focus your attention, especially when issues have limited impact or are short-lived. User Experience Insight AIOps can help you transform your operations by identifying the most critical issues that need attention before users complain. The first component of AIOps is Incident Detection.

Motivation for AIOps Incident Detection

User Experience Insight sensors run synthetic tests one at a time in a continuous round-robin sequence. When a test identifies a problem, it generates an issue. The dashboard shows two types of issues that generate notifications: threshold violations and test failures.

The Incident Detection system examines this issue data in real-time to identify issues that are significantly different from your typical issue profile. These anomalous issues are consolidated into incidents and surfaced on the dashboard. An incident is a collection of related anomalous issues.
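To make the relationship between issues and incidents concrete, here is a minimal Python sketch. The `Issue` and `Incident` classes, the `anomalous` flag, and `group_into_incidents` are hypothetical names, not part of the product. Because the article does not say how issues are judged to be "related", the sketch assumes that anomalous issues starting within a fixed time window of one another belong to the same incident; the default window is likewise an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class IssueType(Enum):
    # The two issue types that generate notifications.
    THRESHOLD_VIOLATION = "threshold_violation"
    TEST_FAILURE = "test_failure"


@dataclass
class Issue:
    sensor: str
    issue_type: IssueType
    start: datetime
    anomalous: bool  # hypothetical flag set by the detection model


@dataclass
class Incident:
    issues: list = field(default_factory=list)


def group_into_incidents(issues, window=timedelta(minutes=10)):
    """Group anomalous issues into incidents.

    Assumption: "related" is approximated as "starting within `window`
    of the previous anomalous issue"; the real criteria are not documented.
    """
    incidents = []
    current = None
    last_start = None
    for issue in sorted(issues, key=lambda i: i.start):
        if not issue.anomalous:
            continue  # issues matching the typical profile never join an incident
        if current is None or issue.start - last_start > window:
            current = Incident()  # gap too large: start a new incident
            incidents.append(current)
        current.issues.append(issue)
        last_start = issue.start
    return incidents
```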

Incident Detection Models

User Experience Insight Incident Detection begins with training a machine learning model for the typical issue profile of your deployment, using historical data.

When an issue is detected and its arrival time, relative to other issues, does not conform to the model, that anomalous issue and any related non-conforming issues appear red on the dashboard, indicating that this set of issues is anomalous and might need immediate attention. Emails, alerts, and other notifications are sent only for issues classified as incidents (shown in red) on the dashboard. When an issue is detected and its arrival time, relative to other issues, conforms to the model, the issue appears blue to indicate that it is informational. You will not receive email or webhook alerts for these issues.
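The color and notification rules above can be summarized in a small Python sketch. `Display`, `classify`, and `should_notify` are hypothetical names, and the model's verdict is reduced to a single boolean because the article does not describe the model's internals.

```python
from enum import Enum


class Display(Enum):
    RED = "red"    # does not conform to the model: part of an incident
    BLUE = "blue"  # conforms to the model: informational only


def classify(conforms_to_model: bool) -> Display:
    """Map the model's verdict on an issue to its dashboard color."""
    return Display.BLUE if conforms_to_model else Display.RED


def should_notify(display: Display) -> bool:
    """Emails, alerts, and webhooks go out only for incident (red) issues."""
    return display is Display.RED
```

Keeping the color decision and the notification decision in separate functions mirrors the article: the notification rule depends only on whether an issue was classified as part of an incident.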

Building a relevant model requires at least 20 active sensors and sufficient issue data. The model is recalculated every week. More models will be added as the feature's capabilities expand.
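A minimal sketch of the documented prerequisites and schedule follows. The 20-sensor minimum and the weekly recalculation come from the article; the helper names are hypothetical, and because "sufficient issue data" is not quantified, it is passed in as a caller-supplied boolean.

```python
from datetime import datetime, timedelta

MIN_ACTIVE_SENSORS = 20                # documented minimum
RETRAIN_INTERVAL = timedelta(weeks=1)  # documented weekly recalculation


def can_build_model(active_sensors: int, has_enough_issue_data: bool) -> bool:
    # "Sufficient issue data" is not defined in the docs, so it is an input here.
    return active_sensors >= MIN_ACTIVE_SENSORS and has_enough_issue_data


def next_retrain(last_trained: datetime) -> datetime:
    """When the model would next be recalculated, one week after the last run."""
    return last_trained + RETRAIN_INTERVAL
```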

How to Enable AIOps Incident Detection

Go to Settings → AIOps and follow the wizard to enable the feature. Once it is enabled, the transition to AI Incident mode may take up to 15 minutes. The feature can be toggled on or off only once every 4 hours, so after you enable it, you must wait 4 hours before you can disable it.
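The toggle rules can be modeled as a simple cooldown. This is a hypothetical sketch, not a product API: `IncidentDetectionToggle` and its methods are illustrative names, while the 4-hour limit and the 15-minute worst-case transition come from the paragraph above.

```python
from datetime import datetime, timedelta
from typing import Optional

TOGGLE_COOLDOWN = timedelta(hours=4)    # documented toggle limit
TRANSITION_MAX = timedelta(minutes=15)  # documented worst-case switchover time


class IncidentDetectionToggle:
    """Hypothetical model of the toggle rules; not a real product API."""

    def __init__(self) -> None:
        self.enabled = False
        self.last_toggled: Optional[datetime] = None

    def can_toggle(self, now: datetime) -> bool:
        # Allowed if never toggled, or if 4 hours have passed since the last change.
        return self.last_toggled is None or now - self.last_toggled >= TOGGLE_COOLDOWN

    def toggle(self, now: datetime) -> bool:
        """Flip the setting if the cooldown has elapsed and return the new state."""
        if not self.can_toggle(now):
            remaining = TOGGLE_COOLDOWN - (now - self.last_toggled)
            raise RuntimeError(f"Toggle locked for another {remaining}")
        self.enabled = not self.enabled
        self.last_toggled = now
        return self.enabled

    def transition_done_by(self) -> Optional[datetime]:
        """Latest time by which the last change should have taken effect."""
        if self.last_toggled is None:
            return None
        return self.last_toggled + TRANSITION_MAX
```

For example, enabling the feature at 09:00 means AI Incident mode should be active by 09:15 at the latest, and the earliest it can be disabled is 13:00.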

Note: When Incident Detection is enabled, there are no yellow 'warning issues' on your dashboard.

How to View Incidents

To view incidents from the past 7 or 30 days, select the bell icon at the top right of the main dashboard. Incidents are named using the convention Month/Year-Incident Number.
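As an illustration only, a name in the Month/Year-Incident Number convention could be built as follows. The `incident_name` helper, the zero-padded two-digit month, and the four-digit year are assumptions; the article does not specify padding or whether incident numbers restart each month.

```python
from datetime import date


def incident_name(opened: date, number: int) -> str:
    """Format a name as Month/Year-Incident Number.

    Assumed formatting: two-digit month, four-digit year, plain integer number.
    """
    return f"{opened.month:02d}/{opened.year}-{number}"
```

Under these assumptions, the twelfth incident opened in March 2024 would be named `03/2024-12`.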

Select an Incident to navigate to the Incident View. This view shows the specific time period of the incident and where the sensors are located. You can rename the incident and drill down into the triage to better understand the issue.

Current Limitations

  • Mutes - Mutes only affect how the dashboard is displayed. They do not affect whether an issue can be added to an incident or whether notifications are sent.

  • Read-only users receive incident notifications regardless of group assignment.

  • Weekly reports are issue-aware for now but will eventually evolve to provide full support for incidents.


Improve Incident Detection for your account

Admin users can now vote thumbs up or thumbs down on each incident. This tells us whether the incident was relevant and useful to you.

Voting thumbs up indicates that you want to see more incidents like this one, and voting thumbs down indicates that you want to see fewer of them going forward.

Read-only users cannot use the voting feature. Your votes feed directly into our machine learning models, which adapt to your preferences over time.
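The voting rules can be modeled with a small hypothetical record. `IncidentFeedback`, `Role`, and the net `score` are illustrative names only; the article does not describe how votes are stored or weighted, and letting a repeat vote from the same user replace the earlier one is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    READ_ONLY = "read_only"


@dataclass
class IncidentFeedback:
    """Hypothetical record of thumbs-up/down votes on one incident."""
    votes: dict = field(default_factory=dict)  # user -> +1 (up) or -1 (down)

    def vote(self, user: str, role: Role, thumbs_up: bool) -> None:
        # Only admin users may vote; read-only users are rejected.
        if role is not Role.ADMIN:
            raise PermissionError("Only admin users can vote on incidents")
        # Assumption: a later vote from the same user replaces the earlier one.
        self.votes[user] = 1 if thumbs_up else -1

    def score(self) -> int:
        """Net preference: positive leans toward 'show more incidents like this'."""
        return sum(self.votes.values())
```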

Ongoing Incidents but no red smiley on the dashboard

This is a fairly common scenario: after its last issue closes, an incident waits about 10 minutes before closing, in case more issues arrive. As a result, an incident can remain open for a while after all of its issues have closed.
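The close-out delay can be sketched as follows, assuming a fixed 10-minute grace period measured from the moment the last issue closes. The article says only "about 10 mins", so both the exact value and the `incident_is_open` helper are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional

CLOSE_GRACE = timedelta(minutes=10)  # approximate value from the article


def incident_is_open(issue_end_times: List[Optional[datetime]], now: datetime) -> bool:
    """Each entry is an issue's close time, or None if that issue is still active."""
    if not issue_end_times:
        return False
    if any(end is None for end in issue_end_times):
        return True  # at least one issue is still active
    last_closed = max(issue_end_times)
    # Stay open during the grace period in case more issues arrive.
    return now < last_closed + CLOSE_GRACE
```

This is why the dashboard can show no red smiley while the incident list still shows the incident as ongoing: every issue has closed, but the grace period has not yet elapsed.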

Future Considerations

We are considering additional ML models and enhanced AIOps capabilities as we continue to improve upon this feature.

