How Organizations Handled Incidents Before and After Deploying AIOps – Part 1

Organizations are always looking for new ways to innovate, reduce costs, and allocate resources more efficiently. In this blog post, we will look at how enterprises handled incidents before and after deploying AIOps.

Why AIOps, and how do you get the most value out of it?

AIOps is a management approach that uses data analytics and automation to optimize IT service delivery across the entire application lifecycle: deployment, monitoring, troubleshooting, and remediation. AIOps helps organizations make better decisions by combining Machine Learning (ML), Artificial Intelligence (AI), and Robotic Data Automation (RDA), and it can be deployed on-premises or on cloud-based infrastructure.

Let us walk through what the traditional IT Incident Management process at ACME Corporation (name changed) looks like.

Quick summary of ACME Corporation: ACME is a $500Mn organization headquartered in the US. Because its teams are distributed across different geographical regions, they rely heavily on online collaboration tools such as Slack and Microsoft Teams. The production support, delivery, QA, and ITOps teams are all responsible for the smooth functioning of IT. Once an incident is created, a Service Desk operator is assigned to triage, diagnose, and resolve the problem. If an incident is outside their purview, it is escalated to personnel with more specialized skills. These teams operate in a tiered fashion and are colloquially referred to as L1, L2, and L3 engineers, i.e., Level-1, Level-2, and Level-3 (or in some cases Tier-1/Tier-2/Tier-3), with L3 holding the most specialized skill set. In ACME's case, the L1 and L2 engineers are outsourced to an MSP, while the L3 team is ACME's own.

Sample ServiceNow Incident Description:

The CPU consumption of ACME's Java application suddenly climbs until it stays at 80–100%. After the application is recycled, it runs fine for a few hours, but the CPU consumption spikes again until the application crashes.

Incident Management steps followed at ACME

The following are the steps that the L1/L2 operators performed when the incident was reported at 10 am on September 08.

  1. Triage and prioritize incidents. – The priority of this incident is high as per the incident priority matrix.
  2. Identify services or assets impacted. – The incident has been assessed, and ACME's Java application is impacted.
  3. Issue ping, traceroute, or other such simple sanity checks. – These checks are performed, and everything is normal.
  4. Log into multiple monitoring tools and check metric values. – The monitoring tool, Splunk, shows that CPU usage is extremely high, which is why the incident was created.
  5. In some cases, remote login to a device to check the status. – This process is skipped as it is not relevant in this scenario. 
  6. Check log files on various systems to detect any errors. – The Java log files are checked, and there seem to be no issues. 
  7. Issue some low- to medium-risk commands that change system state (e.g., service restart, add a stateless VM/container). – The L1 and L2 teams performed this step. They tried restarting the service, and it ran fine for a while until the same issue recurred.
  8. If the issue is likely due to a problem from a vendor, create & track the issue with customer support case numbers. – This issue is not due to a vendor, so we are skipping this step.
  9. Look for any known or published vulnerabilities. – There are no known/published vulnerabilities. 
  10. Verify if there are any known defects or bugs that report the same kind of issues in the CMDB – There are no known bugs similar to this incident in the CMDB.
  11. Update incident notes with findings from the above activities. – All the steps performed so far are documented in the incident.
  12. ‘Resolve’ the incident if a successful resolution is determined or escalate to L3. – Since the issue couldn’t be resolved by L1 and L2 teams, the incident was escalated to the L3 team.
  13. L3 may perform deeper analysis, coordinate with multiple teams, and call a war-room to get to the bottom of the problem. – The L3 team did a deeper analysis in coordination with multiple teams and discovered that an update pushed two days earlier allowed multiple threads to access the HashMap’s get and put methods concurrently, resulting in an endless loop that caused the CPU spike (see the sketch after this list). The L3 team then fixed the bug and resolved the incident, updating all the details at 6 pm on September 09.
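For context on the root cause the L3 team identified, the sketch below illustrates this class of bug: an unsynchronized java.util.HashMap shared across threads can have its internal bucket structure corrupted during a concurrent resize, leaving get/put spinning in a tight loop and pinning the CPU. The class and field names here are illustrative only, not ACME's actual code; the common remedy is to switch to ConcurrentHashMap (or add external synchronization).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the failure mode described above (hypothetical names,
// not ACME's code). Multiple request threads read and write a shared map;
// with a plain HashMap, a concurrent resize can corrupt the bucket chain
// and leave get()/put() looping forever, which shows up as a CPU spike.
public class SessionCache {

    // BUG (before the fix): java.util.HashMap is not thread-safe.
    // private final Map<String, String> cache = new HashMap<>();

    // FIX: ConcurrentHashMap supports safe concurrent get/put without
    // external locking, removing the endless-loop failure mode.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String lookup(String key) {
        return cache.get(key);
    }

    public void store(String key, String value) {
        cache.put(key, value);
    }
}
```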

The combined Mean Time to Detect (MTTD) and Mean Time to Resolution (MTTR) for this incident was about 32 hours, which breached the SLA set by ACME.
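As a quick sanity check on that figure, the elapsed time from the report at 10 am on September 08 to the resolution at 6 pm on September 09 works out to 32 hours. The snippet below is a simple illustration using java.time; the year is an assumption, since the incident dates in this example do not specify one.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Arithmetic behind the 32-hour figure: reported 10 am Sep 08,
// resolved 6 pm Sep 09 (the year is an assumed placeholder).
public class ResolutionTime {
    public static void main(String[] args) {
        LocalDateTime reported = LocalDateTime.of(2021, 9, 8, 10, 0);
        LocalDateTime resolved = LocalDateTime.of(2021, 9, 9, 18, 0);

        Duration elapsed = Duration.between(reported, resolved);
        System.out.println("Time to resolution: " + elapsed.toHours() + " hours"); // prints 32
    }
}
```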

A key part of any organization’s IT Operations strategy is incident response. In this article, we looked at the measures an ITOps team takes to handle IT incidents before deploying AIOps. Check out Part 2, where we’ll explore how organizations deal with incidents after deploying AIOps.

Learn more about our AIOps tools and solutions for handling incidents in a smart and efficient manner.

Gurubaran Baskaran
https://www.linkedin.com/in/bgurubaran