How organizations handled incidents before and after deploying AIOps – Part 2
September 30th, 2021
In this highly dynamic environment, organizations are looking for ways to innovate and manage resources efficiently.
In the first part of this two-part blog series, we saw how organizations handled incidents without an AIOps solution and how long it took to resolve an incident, walking through each step of the resolution process.
In this second part, we look at how organizations can handle the same kind of incident after deploying AIOps.
What does AIOps stand for, and why do we need AIOps?
AIOps stands for Artificial Intelligence for IT Operations, a term coined by the analyst firm Gartner. An AIOps solution can do much more than resolve incidents faster. Deploying AIOps also gives organizations better visibility into when and where issues are happening, helps analyze root causes, reduces MTTR for known issues, and enables proactive planning of maintenance windows across the stack. AIOps can even automate service delivery; for example, it could automatically assign an engineer when an incident occurs, based on pre-determined rules. AIOps can also provide an in-depth root cause analysis by extracting information from logs and IT systems to generate insights.
Sample ServiceNow incident description
ACME's Java application's CPU consumption suddenly climbs until it stays at 80 – 100%. After recycling the application, it runs fine for a few hours, but the CPU consumption starts to spike again until the application crashes.
Incident management steps followed at ACME after deploying AIOps
Step 1: Monitoring tools detect an issue and raise an alert at 10 am on September 08.
Step 2: The AIOps solution processes the alert and creates an appropriate ticket after correlating it with past data. As soon as the ticket is created, it is automatically picked up by the Machine Learning engine, analyzed, and updated with key insights. The insights identified by the ML engine are:
- Assignment group – Java
- Ticket sentiment – Neutral
- Suggested Knowledge Base article – Resource Handling & Common Java Troubleshooting Steps
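The enrichment step above can be sketched in code. This is an illustrative stand-in only: a real AIOps ML engine uses trained models over historical tickets, whereas the `TicketEnricher` class and its keyword rules below are hypothetical, invented for this example.

```java
// Hypothetical sketch of ticket enrichment. A production AIOps ML engine
// would classify tickets with trained models; simple keyword rules stand
// in here purely for illustration.
public class TicketEnricher {

    // Pick an assignment group from the incident description.
    static String assignmentGroup(String description) {
        String d = description.toLowerCase();
        if (d.contains("java")) return "Java";
        if (d.contains("database") || d.contains("sql")) return "Database";
        return "General";
    }

    // Estimate the ticket's sentiment from its wording.
    static String sentiment(String description) {
        String d = description.toLowerCase();
        if (d.contains("outage") || d.contains("unacceptable")) return "Negative";
        return "Neutral";
    }

    public static void main(String[] args) {
        String desc = "The Java application's CPU consumption spikes to 80-100%";
        System.out.println("Assignment group: " + assignmentGroup(desc));
        System.out.println("Sentiment: " + sentiment(desc));
    }
}
```

Running this against the sample description yields the same insights the ML engine produced: assignment group "Java" and a "Neutral" sentiment.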
Step 3: AIOps then uses this context to recommend remediation steps/workflows based on historical learning. Since this issue has not appeared before, no learned workflow is available, but the platform suggests two generic recommended actions for resolving it:
- Restart the application
- Undo the latest changes/commit
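The lookup in Step 3 can be sketched as a mapping from known incident signatures to learned remediation steps, with a generic fallback for unseen issues. The `RemediationRecommender` class, its signatures, and its learned data below are all hypothetical, assumed only for this illustration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of Step 3: recommend remediation workflows learned
// from past incidents, falling back to generic actions when the incident
// signature has not been seen before.
public class RemediationRecommender {

    // Illustrative learned mappings (not data from a real AIOps platform).
    static final Map<String, List<String>> LEARNED = Map.of(
        "disk-full", List.of("Rotate logs", "Expand volume")
    );

    static List<String> recommend(String signature) {
        // Unseen signature: return the generic recovery actions.
        return LEARNED.getOrDefault(signature,
            List.of("Restart the application", "Undo the latest changes/commit"));
    }

    public static void main(String[] args) {
        // The CPU-spike incident is new, so the generic actions come back.
        System.out.println(recommend("cpu-spike-java"));
    }
}
```

Because the CPU-spike signature is new, the recommender returns the two generic actions listed above, matching the platform's behavior in this scenario.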
Step 4: The L1 team restarts the application; it works fine for some time, but then the issue recurs.
Step 5: The team proceeds to the next recommendation. Since the L1 team cannot undo code changes, they immediately escalate to the L3 team to perform that action.
Step 6: The L3 team reverts the latest changes to the source code, which immediately resolves the issue. Reviewing the code later, the team finds multiple threads calling the HashMap's get and put methods concurrently, resulting in an endless loop that caused the CPU spike. The team creates a separate KB article about the incident for future reference.
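The root cause the L3 team found is a well-known Java pitfall: an unsynchronized `java.util.HashMap` mutated by multiple threads can corrupt its internal bucket structure during a resize, after which a lookup can spin forever and pin the CPU. A minimal sketch of the fix is shown below; the `CacheFix` class and its key names are hypothetical, but the remedy of swapping in `ConcurrentHashMap` is the standard one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CacheFix {
    // After the fix: a thread-safe map designed for concurrent get/put.
    // (The buggy version used a plain java.util.HashMap here, which is
    // unsafe under concurrent modification.)
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Two threads write the same n keys concurrently. With an unsynchronized
    // HashMap, a concurrent resize could corrupt the buckets and make a
    // later get() loop endlessly; ConcurrentHashMap handles this safely.
    static int fillConcurrently(int n) throws InterruptedException {
        Runnable writer = () -> {
            for (int i = 0; i < n; i++) {
                cache.put("key-" + i, "value-" + i);
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return cache.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(fillConcurrently(10_000)); // 10000: no lost or corrupted entries
    }
}
```

External synchronization around a plain HashMap would also work, but `ConcurrentHashMap` avoids a global lock and is the idiomatic choice for a shared read/write map.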
Step 7: Once the situation is remediated, the monitoring tool clears the original alert, and the ticket is closed at 12 pm on September 08.
The combined Mean Time to Detect (MTTD) and Mean Time to Resolution (MTTR) for this incident is about 3 hours, a vast improvement over the roughly 32 hours the same incident took without an AIOps platform.