Predictive Analytics for Server Incident Reduction

by Jasmina Bogojeska, Ioana Giurgiu, David Lanyi and Dorothea Wiesmann

As IT infrastructures become more heterogeneous, with cloud and local servers increasingly intermingled in multi-vendor datacentre environments, CIOs and senior IT decision makers are struggling to optimize the cost of technology refreshes. They need to justify the cost of each refresh, manage the risk of service disruption introduced by change, and balance this activity against business-led IT changes.

The decision about when to modernize which elements of the server HW/SW stack is often made manually, based on simple business rules. The goal of our project is to alleviate this problem by supporting the decision process with an automated approach. To this end, we developed the Predictive Analytics for Server Incident Reduction (PASIR) method and service (conceptually summarized in Figure 1), which correlates the occurrence of incidents with server configuration and utilization data.

In a first step, we need to identify past availability and performance issues in the large set of incident tickets. This incident ticket classification, however, is a very challenging task for the following reasons:

The number of tickets is very large (on the order of thousands per year for a large IT environment), which makes manual labelling practically impossible.
Ticket resolution text is a mixture of human-written and machine-generated text (from the monitoring system) with a very problem-specific vocabulary.
Different ticket types have very different sample abundances.
Ticket texts differ greatly between IT environments because they are written by different teams who use different monitoring systems and jargon. This makes it infeasible to reuse manually labelled tickets or to transfer knowledge from one IT environment to another.

To address these challenges, we implemented an automatic incident ticket classification method that uses a small, preselected set of manually labelled incident tickets to classify the complete set of incidents available from a given IT environment. In the first step, we select the training data for the supervised learning: we apply the k-means clustering algorithm to group the incident tickets into bins with similar texts, and then sample tickets for training from each cluster, with the ratio of samples taken from each cluster computed from the silhouette widths of the clusters. This increases the representation of incident tickets from rare classes in the training data. In the second step, we use the manually labelled set of incident tickets to train a gradient boosting machine (GBM), a powerful, flexible method that can effectively capture complex non-linear dependencies and offers high-quality results in terms of prediction accuracy and generalization.
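The sampling step can be illustrated with a minimal sketch. It is not the PASIR implementation: the article does not state how the silhouette widths are turned into sampling ratios, so the rule used here (weight = 1 - mean silhouette width, giving less cohesive clusters more labelled samples) is purely an assumption. The toy 2-D points stand in for vectorized ticket texts, and the cluster labels are assumed to come from a prior k-means run. The function names are hypothetical.

```python
import math
from collections import defaultdict


def mean_silhouette_by_cluster(points, labels):
    """Return the average silhouette width of each cluster (Euclidean distance).

    Needs at least two clusters, because the silhouette compares a point's
    own cluster with the nearest other cluster.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    widths = defaultdict(list)
    for i, label in enumerate(labels):
        if len(clusters[label]) == 1:
            widths[label].append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_dist(i, clusters[label])  # mean distance within own cluster
        b = min(mean_dist(i, members)      # mean distance to nearest other cluster
                for other, members in clusters.items() if other != label)
        widths[label].append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return {label: sum(w) / len(w) for label, w in widths.items()}


def allocate_samples(silhouettes, budget):
    """Split a manual-labelling budget across clusters.

    ASSUMPTION (not from the article): a cluster's share is proportional to
    1 - mean silhouette width, so poorly separated clusters get more samples.
    """
    weights = {c: 1.0 - s for c, s in silhouettes.items()}
    total = sum(weights.values())
    if total == 0:  # all clusters perfectly separated: split evenly
        weights = {c: 1.0 for c in silhouettes}
        total = len(weights)
    return {c: round(budget * w / total) for c, w in weights.items()}


# Toy data: cluster 0 is tight, cluster 1 is spread out.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (5.0, 5.0), (7.0, 9.0), (9.0, 4.0)]
labels = [0, 0, 0, 1, 1, 1]
silhouettes = mean_silhouette_by_cluster(points, labels)
allocation = allocate_samples(silhouettes, budget=10)
print(silhouettes, allocation)
```

Because the per-cluster counts are rounded independently, they may not sum exactly to the budget, and a real sampler would also cap each count at the cluster size.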

Figure 1: Overview of the PASIR concept.

Next, we define a threshold on the number of incident tickets of a certain class to identify servers with problematic availability or performance. Based on this historical data, a Random Forest classifier is trained to identify and rank servers with problematic behaviour as candidates for modernization. Random Forest models are ensembles of classification or regression trees. While individual tree models are attractive and widely used nonlinear models because of their interpretability, they exhibit high variance and therefore generalize less well. The Random Forest model reduces the variance by averaging a collection of decorrelated trees, which provides performance comparable to that of support vector machines (SVMs) and boosting methods. Such a model can capture nonlinear relationships between the attributes of the server hardware, operating system and utilization on the one hand, and the server behaviour characterized by the corresponding incident tickets on the other.
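The thresholding step that produces the training labels for the classifier might look like the following sketch. The ticket representation (server, class) pairs, the class names, the threshold value and the "greater than or equal" comparison are all illustrative assumptions, not details given in the article.

```python
from collections import Counter


def label_problematic_servers(tickets, ticket_class, threshold):
    """Flag each server whose ticket count for `ticket_class` reaches `threshold`.

    `tickets` is an iterable of (server_id, predicted_class) pairs, i.e. the
    output of the ticket classification step. Servers that never appear in
    `tickets` are not part of the result.
    """
    tickets = list(tickets)
    counts = Counter(server for server, cls in tickets if cls == ticket_class)
    servers = {server for server, _ in tickets}
    return {server: counts[server] >= threshold for server in servers}


# Hypothetical classified tickets for three servers.
tickets = [
    ("srv-a", "availability"), ("srv-a", "availability"), ("srv-a", "performance"),
    ("srv-b", "availability"),
    ("srv-c", "other"),
]
server_labels = label_problematic_servers(tickets, "availability", threshold=2)
print(server_labels)
```

The resulting boolean labels, joined with each server's hardware, operating system and utilization attributes, would form the training set for the classifier.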

Figure 2: Overview of the training procedure for a Random Forest model.

A summary of the procedure for training random forest models is given in Figure 2. Once trained, the predictive model is used to evaluate the impact of different modernization actions and to suggest the most effective ones. Each modernization action modifies one or several server features. Given a set of modernization actions, a random forest prediction model, and a target server, we quantify their improvement impact by taking the difference between the probabilities of the server being problematic before and after applying the actions considered. This enables us to rank all modernization actions based on their improvement impact and select the most effective ones.
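The impact computation described above can be sketched as follows. The function names, feature names and the hand-written `toy_model` are hypothetical; `toy_model` merely stands in for a trained random forest that returns the probability of a server being problematic.

```python
def improvement_impact(predict_problematic, server, changes):
    """Probability of being problematic before the action minus after it."""
    modified = {**server, **changes}  # an action overrides one or more features
    return predict_problematic(server) - predict_problematic(modified)


def rank_actions(predict_problematic, server, actions):
    """Return (action_name, impact) pairs, most effective action first."""
    scored = [
        (name, improvement_impact(predict_problematic, server, changes))
        for name, changes in actions.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_model(server):
    """Stand-in for a trained model's predicted probability (illustrative only)."""
    p = 0.1
    if server["os_age_years"] > 5:
        p += 0.4
    if server["cpu_utilization"] > 0.8:
        p += 0.3
    return p


server = {"os_age_years": 7, "cpu_utilization": 0.9}
actions = {
    "add_cpu": {"cpu_utilization": 0.5},
    "upgrade_os": {"os_age_years": 0},
    "no_change": {},
}
ranking = rank_actions(toy_model, server, actions)
print(ranking)
```

A positive impact means the action lowers the predicted probability of the server being problematic; an action that changes nothing scores zero.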

The PASIR tool has been applied to over one hundred IT environments. The modernization actions it recommended have led to significant reductions in account incident volumes, with a corresponding increase in the availability of the IT environment. The primary use cases of our tool are planning a refresh program, identifying an at-risk application environment, identifying servers for cloud migration, and contributing to cost penalty analyses for at-risk servers.

Link:
http://www.zurich.ibm.com/csc/services/textmining.html

References:
[1] J. Bogojeska et al.: “Classifying Server Behavior and Predicting Impact of Modernization Actions”, in proc. of the IFIP/IEEE 9th International Conference on Network and Service Management (CNSM), 2013.
[2] J. Bogojeska et al.: “Impact of HW and OS Type and Currency on Server Availability Derived From Problem Ticket Analysis”, in proc. of the IFIP/IEEE Network Operations and Management Symposium (NOMS), 2014.
[3] L. Breiman: “Random Forests”, Machine Learning, 2001.

Please contact:
Dorothea Wiesmann
IBM Research Zurich, Switzerland
