What you need to know about AIOps

Content provided by IBM and TNW

As our lives become more digitized, the IT infrastructure supporting the applications and services we use has become increasingly complex. Services can run in the cloud, on-premise, serverless, or in hybrid setups, which makes it possible to accommodate different kinds of applications, environments, and audiences.

However, managing such complex IT architectures is becoming increasingly difficult. There are too many moving parts, which makes it hard to optimize IT, predict and prevent outages, and respond to incidents after they happen.

Fortunately, AIOps — the use of AI in IT operations — is a fast-developing field that can address some of these challenges through automation. Here’s everything you need to know about AIOps and what it can do for your organization.

The challenges of modern IT

“The industry is facing three major trends, and first is complexity,” says Pratik Gupta, CTO at IBM Automation.

More organizations are using cloud IT, in many instances alongside on-premise servers. This is in addition to all kinds of serverless technologies, APIs, microservices, and the like that are integrated into applications.

“Many organizations use multiple clouds — up to five. You have on-prem environments, you have cloud environments. It is much more complex than it used to be,” Gupta says.

The second trend is scale.

“We have seen ten years of digitization in one year during the covid pandemic. Organizations are moving to more digital experiences and more applications for getting work done. There are a lot more applications in this hybrid cloud state,” Gupta says.

And third? Well, that is skills.

“Most C-level execs do not have the time or talent to manually manage IT environments, which as we know, are becoming extraordinarily complex,” Gupta says.

These trends are driving interest in automating the IT environment and getting help from AI.

“AI and automation, which can be referred to as intelligent automation, are no longer nice-to-have. It’s a necessity and it’s actually differentiating companies, and those who use automation and AI are going to fare much better,” Gupta says.

This is where AIOps enters the picture. AIOps is a series of tools and services that use AI to automate all IT operations, from monitoring and collecting information to optimizing machines and services, and predicting and resolving incidents.

Observability

“We think of applying AIOps as a transformation not only for technology but also for people,” Gupta says. “People have to understand that this is a way of augmenting their job and not a way of replacing them.”

In short, AIOps helps IT staff do things that were impossible with their previous tools. The first step to implementing AIOps is to collect quality information about your IT infrastructure and operations. This is important not only to give you a clearer picture of your IT infrastructure, but also to train and guide the AI systems that optimize and monitor it. This first stage of AIOps is called “observability.”

“Observability is different from past application performance monitoring (APM) in the sense that observability is about collecting all the data,” Gupta says. “Whereas old APM legacy tools may sample information purely from a performance management perspective, observability is capturing information to do AIOps.”

An example of observability tools is IBM Instana Observability, a solution that can capture metrics, traces, and logs from applications running on different computing platforms, going all the way from mobile devices to on-premise servers to mainframes and virtual machines running in the cloud. According to Gupta:

“One of the things observability tools like Instana help you do is to find root causes faster, which application or microservice is causing errors, and directly pinpoint it using very strong heuristics and algorithms and AI.”

AI-powered observability can lead to huge gains. Consider ExaVault (acquired by Files.com), a company that provides file-transfer services to large organizations. ExaVault’s API receives 35,000 requests per minute and over 50 million calls per day. Availability is critical for ExaVault, but since each customer uses the service and API in different ways, it is very difficult for the company to oversee all activity through traditional monitoring methods.

Using Instana, ExaVault was able to establish observability in its API to monitor and control availability in a way that was impossible with previous APM tools. As a result, the company was able to track and resolve issues faster than before. It reached 99.99% availability and reduced mean time to resolution (MTTR) by around 57%.
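To put those two figures in perspective, here is a minimal Python sketch of the arithmetic behind an availability target and an MTTR reduction. The 60-minute baseline MTTR is a hypothetical value chosen for illustration; the article does not report ExaVault’s actual resolution times.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_pct / 100) * period_minutes


def mttr_after_reduction(baseline_minutes: float, reduction_pct: float) -> float:
    """Mean time to resolution after a percentage reduction."""
    return baseline_minutes * (1 - reduction_pct / 100)


if __name__ == "__main__":
    # 99.99% availability leaves roughly 52.56 minutes of downtime per year.
    print(f"{allowed_downtime_minutes(99.99):.2f} minutes of downtime per year")
    # Hypothetical baseline: a 60-minute MTTR cut by 57% becomes about 25.8 minutes.
    print(f"{mttr_after_reduction(60, 57):.1f} minutes MTTR")
```

In other words, a 99.99% target leaves less than an hour of total downtime per year, which is why faster root-cause analysis matters so much at that level.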

Optimization

“In today’s complex environments of cloud, on-prem, and hybrid, once an app is deployed, no human being can mentally monitor and manage how to set things up, configure them properly, and make sure they have the right performance, right server size, memory allocation, and so on,” Gupta says. “These are currently managed through smart guesses.”

Another important aspect of AIOps is the optimization of IT resources. An example is IBM Turbonomic, a tool that analyzes end-to-end environments and creates a single-view topology of the system. Turbonomic can process data from different aspects of the system, including service-level objectives, application configurations, and pricing and contracts. It takes in all this information and helps you optimize the components of your IT ecosystem to achieve different goals, such as improving availability or reducing waste and costs. Depending on your requirements, Turbonomic can automatically optimize your IT components or provide you with recommendations.

A Forrester Total Economic Impact study found that on average, the application of Turbonomic results in a 471% return on investment and the payback period is under six months. Automation tools like Turbonomic help IT departments avoid overprovisioning infrastructure, which on average results in a 75% reduction in IT spend.

For example, BBC Studios used Turbonomic to manage its network of more than 1,000 virtual machines. Upon implementing Turbonomic, the BBC Studios team had a full-stack view of their environment. This allowed them to better understand what was causing performance problems and identify where they could execute resizing or placement actions to bring their environment back into a maximally efficient and performant state. Not only did Turbonomic provide specific actions to take, but it also predicted the impact each action would have before being executed.

The team began by manually executing Turbonomic’s resizing recommendations, significantly reducing end-user complaints and eliminating downtime in the process. Once they saw the results of manual resizing, the team automated scheduled resizing on a select set of mission-critical applications, proactively and holistically assuring application performance. Automated scheduled resizing enabled the team to reclaim 228 GB of memory and 22 virtual CPUs (vCPUs) in one month alone. Because of Turbonomic, the team can now be confident they are using their existing resources as effectively as possible, and they can free themselves up to focus on strategic initiatives rather than fighting fires or searching for resizing opportunities.
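The idea behind a resizing recommendation can be illustrated with a deliberately simple Python sketch. This is not Turbonomic’s algorithm, which the article does not describe; the `VM` class, the 25% headroom, and the sample fleet are all hypothetical. The sketch only shows how comparing allocated memory against observed peak usage yields a target size and a reclaimable amount.

```python
import math
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    allocated_gb: int    # memory currently assigned to the VM
    peak_used_gb: float  # highest observed usage in the review window


def recommend_memory_gb(vm: VM, headroom: float = 0.25) -> int:
    """Smallest whole-GB allocation covering peak usage plus headroom."""
    return math.ceil(vm.peak_used_gb * (1 + headroom))


def reclaimable_gb(fleet: list[VM], headroom: float = 0.25) -> int:
    """Total memory freed if every oversized VM is resized down."""
    return sum(
        max(0, vm.allocated_gb - recommend_memory_gb(vm, headroom))
        for vm in fleet
    )


if __name__ == "__main__":
    # Hypothetical fleet, not BBC Studios data.
    fleet = [
        VM("render-01", allocated_gb=32, peak_used_gb=10.0),
        VM("db-01", allocated_gb=16, peak_used_gb=14.0),  # undersized: needs more
        VM("cache-01", allocated_gb=64, peak_used_gb=20.0),
    ]
    for vm in fleet:
        print(vm.name, vm.allocated_gb, "->", recommend_memory_gb(vm), "GB")
    print("Reclaimable:", reclaimable_gb(fleet), "GB")
```

A real optimizer would also weigh CPU, placement, cost, and service-level targets; the point here is only that the recommendation can go in either direction, which is why the undersized VM contributes nothing to the reclaimable total.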

Incident prevention and resolution

One of the challenges of complex IT infrastructures is predicting when and where failures will happen — and taking the right measures to prevent them. Another challenge is finding the cause of failures and responding to them in a timely manner. Fortunately, this is another area where AIOps can help.

An example is IBM Cloud Pak for Watson AIOps, a solution that collects all the incidents, metrics, traces, logs, and tickets from an IT system and analyzes them in a generalized AI framework with machine learning models. Cloud Pak for Watson AIOps can help predict blast radius, which is the effect that the outage of a particular component will have on other parts of the system. Accordingly, it can provide recommendations on how to prevent such incidents. As Gupta explains:

“It is a tool that provides a general framework for understanding what happens in the system and taking actions in response to incidents both predictably and proactively.”
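The blast-radius concept mentioned above can be sketched as a graph traversal. The Python example below is an illustration under stated assumptions, not how Cloud Pak for Watson AIOps works internally: the component names and the dependency map are hypothetical, and a real system would infer the topology from collected data rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical topology: each key is a component, and each value lists the
# components that depend on it directly.
DEPENDENTS = {
    "database": ["orders-api", "billing-api"],
    "orders-api": ["web-frontend"],
    "billing-api": ["web-frontend", "invoicing-job"],
    "search-index": ["web-frontend"],
    "web-frontend": [],
    "invoicing-job": [],
}


def blast_radius(failed: str, dependents: dict[str, list[str]]) -> set[str]:
    """Return every component affected, directly or indirectly, by `failed`."""
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            # Skip components already seen so cycles cannot loop forever.
            if downstream != failed and downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected


if __name__ == "__main__":
    print(sorted(blast_radius("database", DEPENDENTS)))
    print(sorted(blast_radius("search-index", DEPENDENTS)))
```

In this toy topology, a database outage reaches all four downstream components, while losing the search index only touches the web frontend. Ranking components by the size of this set is one simple way to decide where prevention effort pays off most.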

Incident prediction is especially useful for organizations that are responsible for critical infrastructure. For example, Taiwan’s National Center for High-performance Computing (NCHC) runs dozens of supercomputers and provides computation resources for all kinds of operations, including drug research and scientific projects. NCHC used Cloud Pak for Watson AIOps to establish an AI-based automation system for predicting incidents and improving resilience.

Cloud Pak for Watson AIOps used structured and unstructured data from NCHC’s compute network to train AI models to automatically and proactively manage problems and incidents. Thanks to automation, NCHC was able to achieve a 55% shorter mean time to detect (MTTD) issues that would affect its service. They were also able to detect potential outages 25 hours in advance, giving them vital time to resolve incidents before they happen.

Beyond IT

The benefits of AIOps can go beyond reducing IT costs and outages to creating better applications and serving customers. According to Gupta:

“We’re seeing a shift in the thinking from managing IT as a cost center to managing IT as an enabler for revenue. Not only does AIOps optimize IT infrastructure dynamically and result in savings, but it also frees up the people to do more business-critical work.”

For example, AIOps can help developer teams understand bottlenecks and the effects of failures in advance. This helps them design their applications and systems with robustness built into them, instead of responding to failures ad hoc.

“If you shift left and say how should a development team build their application to be more resilient to failure, the things we do include how code changes affect the quality of the release going out,” Gupta says.

By spending less time addressing technical failures, developers can focus more on creating better products that solve customer problems.

“Several studies show AIOps are resulting in more clients coming to web applications,” Gupta says. “The reason is that the people in IT were now more focused on doing work that is aligned with the business and generates revenue.”

The field is just beginning to take off, and there are many developments in artificial intelligence research that can find their way into AIOps.

“We started off with advanced heuristics, added machine learning models, and we are seeing more and more foundation models in IT and AIOps,” Gupta says.

Going forward, we’ll see a lot more use of natural language processing and foundation models impacting how IT is managed. We’re going to see a huge amount of intelligence and AI brought to bear in managing IT systems. We see an exciting road ahead with this evolution of using AI in IT. We should stay tuned because the next few years are going to be very exciting in terms of how AI is affecting IT.

Story by Ben Dickson

Ben Dickson is the founder of TechTalks. He writes regularly about business, technology and politics. Follow him on Twitter and Facebook.
