A 2018 McKinsey Global Institute report states that artificial intelligence (AI) has the potential to create $3.5 trillion to $5.8 trillion in annual value across industry sectors. Today, AI in finance and IT alone accounts for about $100 billion, and it is becoming quite the game changer in the IT world.

With the onset of cloud adoption, the world of IT DevOps has changed dramatically. The focus of IT Ops is shifting to an integrated, service-centric approach that maximizes the availability of business services. AI can help IT Ops with early detection of outages, prediction of potential root causes, identification of systems and nodes that are susceptible to outages, estimation of average resolution time and more. This article highlights a few use cases where AI can be integrated with IT Ops, simplifying day-to-day operations and making remediation more robust.

1.) Predictive analytics of outages: False positives cause alert fatigue for IT Ops teams. One survey indicates that about 52% of security alerts are false positives. This puts a lot of pressure on the teams, as they have to review each of these alerts manually. In such a scenario, deep neural networks can predict whether an alert will result in an outage.

[Figure: alerts as inputs, hidden layers, Yes/No outage prediction as output]

A feed-forward network with two hidden layers, trained with backpropagation, should yield good results in predicting outages, as illustrated above. All alert types raised within a stipulated time window act as inputs, and the outage is the output. Historical data should be used to train the model. Every enterprise has its own fault lines and weaknesses, and it is only through historical data that these latent features surface. Hence every enterprise should build its own customized model, as a “one size fits all” model is less likely to deliver the expected outcomes.

An alternative method is logistic regression, where all “alert types” are input variables and a binary outage indicator is the output.

Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution. It thus addresses the same set of problems as probit regression using similar techniques, with the latter using a cumulative normal distribution curve instead.
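As a rough illustration of the logistic regression approach, the sketch below fits a model with plain gradient descent in standard-library Python. The two features (scaled counts of two alert types in a time window), the sample history, the learning rate and the function names are all assumptions made for the example; a real deployment would train on the enterprise's own alert history, probably with an established library.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Estimated probability that this alert pattern leads to an outage."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def train(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for j in range(d):
                grad_w[j] += error * features[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical history: each row holds scaled counts of two alert types
# seen in a time window; each label says whether an outage followed (1) or not (0).
X = [[0.1, 0.0], [0.2, 0.1], [0.0, 0.2], [0.9, 0.8], [0.8, 1.0], [1.0, 0.7]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = train(X, y)
quiet_window = predict_proba(weights, bias, [0.05, 0.05])
noisy_window = predict_proba(weights, bias, [0.9, 0.9])
print(f"quiet window: {quiet_window:.2f}, noisy window: {noisy_window:.2f}")
```

The output is a probability rather than a hard Yes/No, so the team can choose the threshold at which an alert pattern is escalated.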

2.) Root Cause classification and prediction: This is a two-step process. In the first step, root causes are classified based on keyword search. Natural Language Processing (NLP) is used to extract key values from free-text Root Cause Analysis fields and classify them into predefined root causes. This can be either supervised or unsupervised.
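The first step can be sketched in its simplest form as a keyword lookup. The categories, keyword sets and function name below are hypothetical, and plain keyword matching stands in here for a fuller NLP pipeline.

```python
import re

# Hypothetical predefined root causes and the keywords that point to them.
ROOT_CAUSE_KEYWORDS = {
    "storage": {"disk", "volume", "filesystem", "inode"},
    "network": {"dns", "latency", "firewall", "packet"},
    "deployment": {"release", "rollback", "config", "deploy"},
}

def classify_root_cause(text):
    """Return the category whose keywords occur most often in the free text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = {
        category: sum(1 for token in tokens if token in keywords)
        for category, keywords in ROOT_CAUSE_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    # No keyword matched at all: leave the ticket for manual review.
    return best if hits[best] > 0 else "unclassified"

print(classify_root_cause("Disk full on the database host, filesystem went read-only"))
print(classify_root_cause("Bad config pushed in last release, rollback fixed it"))
```

The resulting labels can then serve as the target variable for the prediction step.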

In the second step, a Random Forest or a Multi-Class Neural Network can be used to predict root causes, with the other attributes acting as inputs. One can choose the right classification model based on the data volume and data type. In general, Random Forest has better accuracy, but it needs structured data and correct labeling, and it is less tolerant of poor data quality. A Multi-Class Neural Network needs a large volume of data to train and is slightly less accurate, but it is more tolerant of data quality issues.

3.) Prediction of average time to close a ticket: A simple weighted average formula can be used to predict the time taken for ticket resolution.

Avg time (t) = (a1.T1 + a2.T2 + a3.T3) / (count of T1 + T2 + T3)

Where T1, T2 and T3 are ticket types and a1, a2 and a3 are their weights.

Other attributes can be used to segment tickets into the right cohorts to make resolution time more predictable. This helps in better resource planning and utilization. Feature weights can be set heuristically or empirically.
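The article does not pin down the terms of the formula, so the sketch below takes one possible reading: each weight is the historical mean resolution time of a ticket type, each T term is the number of tickets of that type, and the denominator is the total ticket count. The ticket types, figures and function name are illustrative assumptions.

```python
def predicted_avg_resolution_time(ticket_stats):
    """Weighted average of per-type resolution times, weighted by ticket count.

    ticket_stats maps a ticket type to (mean_hours_to_resolve, ticket_count).
    """
    total_tickets = sum(count for _, count in ticket_stats.values())
    if total_tickets == 0:
        raise ValueError("no tickets to average over")
    weighted_sum = sum(hours * count for hours, count in ticket_stats.values())
    return weighted_sum / total_tickets

# Hypothetical cohort: (mean hours to resolve, number of tickets) per type.
ticket_stats = {
    "password_reset": (0.5, 40),
    "disk_space": (4.0, 10),
    "network_outage": (8.0, 5),
}

overall_hours = predicted_avg_resolution_time(ticket_stats)
print(f"expected resolution time: {overall_hours:.2f} hours")
```

Running the same calculation per cohort (for example per team or per priority) gives the segmented estimates mentioned above.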

4.) Unusual Load on System: Simple anomaly detection algorithms can tell whether the system is under normal load or is showing high variance. A large deviation from the average in a time series can point to unusual activity or to resources that are not being freed. However, the algorithm should account for seasonality, as system load is a function of time and season.
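A minimal sketch of seasonality-aware detection follows, assuming load is sampled at a fixed interval and repeats with a known period (four samples per cycle here). Each reading is compared against the mean and standard deviation of its own seasonal slot rather than the global average. The sample data, the threshold of three standard deviations and the function names are assumptions for illustration.

```python
from statistics import mean, pstdev

def seasonal_baseline(history, period):
    """Mean and standard deviation of load for each slot in the seasonal cycle."""
    baseline = []
    for slot in range(period):
        values = history[slot::period]  # every reading taken at this slot
        baseline.append((mean(values), pstdev(values)))
    return baseline

def is_anomalous(value, time_index, baseline, threshold=3.0):
    """Flag a reading that deviates strongly from the norm for its slot."""
    slot_mean, slot_std = baseline[time_index % len(baseline)]
    if slot_std == 0:
        return value != slot_mean
    return abs(value - slot_mean) / slot_std > threshold

# Hypothetical load history: four cycles of (low, medium, peak, medium) load.
history = [10, 20, 30, 20,
           12, 22, 28, 18,
           11, 19, 31, 21,
            9, 21, 29, 19]
baseline = seasonal_baseline(history, period=4)

peak_reading = is_anomalous(30, time_index=18, baseline=baseline)      # peak slot
off_peak_reading = is_anomalous(30, time_index=16, baseline=baseline)  # low slot
print(peak_reading, off_peak_reading)
```

The same load of 30 is ordinary in the peak slot but stands out in the low slot, which a single global average would miss.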

Given the above scenarios, it is clear that AI has a tremendous opportunity to serve IT operations. It can be used for several IT Ops tasks, including prediction, event correlation, detection of unusual loads on systems (e.g. a cyber-attack) and remediation based on root cause analysis.

About the Author:

Vivek Singh is the Senior Director at Relevance Lab and has around 22 years of IT experience in several large enterprises and startups. He is a data architect, an open source evangelist and a chief contributor of Open Source Data Quality project. He is the author of a novel “The Reverse Journey”.

(The article was originally published in Devops.com and can be read here: https://devops.com/artificial-intelligence-coming-to-the-rescue-of-itops/ )


The Need of an ATC for managing complex IT Operations

By RL Admin on September, 2018


A few days ago, I was traveling from Bangalore to Mumbai. It was an overcast and wet morning, so I started early to avoid any road traffic delays. At the airport, I checked in, went through the usual formalities and boarded the flight. I was anticipating a delay, but to my surprise the flight was on time.

While we were approaching the main runway, I could see many flights ahead of us in a queue, waiting for their turn to take off. At the same time, two flights landed within a couple of minutes of each other. The runway and its surroundings looked terribly busy. While our flight was preparing to take off, the Air Traffic Controller (ATC) tower grabbed my attention. That tall structure looked very calm in the midst of what seemed like chaos, orchestrating every move of all the aircraft and making sure that ground operations were smooth, error-free and efficient in difficult weather conditions.

I started comparing the runway and airport ground operations with the complex IT environment in enterprises today, and the challenges it poses to IT operations teams. Today, critical business services reside on complex IT infrastructure spanning on-premise, cloud and hybrid cloud environments. These require security, scalability and continuous monitoring. But do they have an ATC or Command Center that can orchestrate and monitor all the IT assets and infrastructure so that they function smoothly? For instance, if the payment service of an e-commerce provider is down for a few minutes, the provider incurs significant losses and its overall business opportunities suffer.

Perhaps today’s IT operations teams need one such Command Center, just like an ATC at the airport, so that they can fight downtime, eliminate irrelevant noise in operations and provide critical remediation. This Command Center should provide a 360-degree view of the health of the IT infrastructure and the availability of business services, along with a topology view of the dependent node structure. This could help in analyzing the root cause of a particular IT incident or event. The Command Center should also provide a complete view of all IT assets, aggregated alerts, outage history, past incidents and related communication, enabling the IT team to predict such events or incidents and prevent outages of critical business services. If outages or incidents do occur, it would be a boon for the IT operations team if the Command Center could provide critical data-driven insights and suggest remedial actions, which in turn could be carried out by proactive BOTs.

I arrived at my destination on time, thanks to the ATC, despite the challenging weather conditions. This brings me to a critical question: do you have the ATC or Command Center for IT operations that can help you sustain, pre-empt and continue business operations in a complex IT environment?

About the Author:

Neeraj Deuskar is the Director and Global Head of Marketing for Relevance Lab. Relevance Lab is a DevOps and Automation specialist company that makes cloud adoption easy for global enterprises. In his current role, Neeraj formulates and implements the global marketing strategy, with key responsibilities for brand and pipeline impact. Prior to his current role, he managed global marketing teams for various IT product and services organizations and handled responsibilities including strategy formulation, product and solutions marketing, demand generation, digital marketing, influencer marketing, thought leadership and branding. Neeraj holds a B.E. in Production Engineering and an MBA in Marketing, both from the University of Mumbai, India.

(This blog was originally published in Devops.com and can also be read here: https://devops.com/the-need-for-a-command-center-in-managing-complex-it-operations/ )
