I would like to thank ConocoPhillips for providing the dataset used to solve a real-world prediction problem with machine learning models. Let's dive into solving a real-world business problem with machine learning.
Table of Contents
- Business Problem
- Introduction of Equipment Maintenance Types
- Machine Learning Formulation of Business Problem
- Business Objectives and Constraints
- Dataset Analysis
- Evaluation of Performance Metrics
- Exploratory Data Analysis
- Data Pre-Processing
- Feature Engineering
- Preparing Data for Modeling
- Apply Machine Learning Models
- Conclusion
- Challenges
- Future Work
- GitHub Repository
- References
1. Business Problem
Organizations use stripper well equipment to produce oil at the well level. Because the operation and maintenance costs of this equipment are low, organizations earn a healthy profit on oil production. However, the equipment is subject to failure (breakdown) due to stripper rubber shrinkage, oil color change, feed gas temperature increases or decreases, etc.
Any unplanned downtime of equipment would degrade or interrupt a company's core business, potentially resulting in significant penalties and unmeasurable reputation loss.
The main business problem is to predict asset failures upfront. Based on the prediction, the organization can send a crew to the physical location to fix the equipment, or ship a replacement before it fails, which rapidly reduces equipment downtime, drastically lowers operational costs, and increases revenue.
2. Introduction of Equipment Maintenance Types
A brief introduction to the equipment maintenance techniques that organizations commonly use to reduce downtime to some extent, each with its own limitations.
Corrective Maintenance: Also called run-to-failure maintenance. Maintenance is performed only after the asset has failed.
Limitations: Since this is unplanned maintenance, it can expose the organization to high risk.
Preventive Maintenance: Scheduled, time-based maintenance performed periodically according to a planned schedule in order to reduce equipment downtime.
Limitations: It can lead to unnecessary maintenance, which increases operational costs.
Condition-Based Maintenance: Maintenance performed only when certain indicators show signs of degrading performance or an upcoming failure.
Limitations: Thresholds or indicators must be defined per piece of equipment to decide whether maintenance should be performed.
All of the above maintenance categories reduce equipment downtime to some extent, but not fully.
Predictive Maintenance (PdM): While unplanned and preventive maintenance involve a trade-off, predictive maintenance is a promising technique that breaks that trade-off by maximizing the useful life of a component and its uptime simultaneously.
PdM monitors the performance and condition of equipment during normal operation to reduce the likelihood of failure. Accordingly, machine downtime and maintenance cost can be reduced significantly while keeping the maintenance frequency as low as possible.
3. Machine Learning Formulation of Business Problem
Machine Learning (ML) techniques have emerged as a promising tool in PdM applications for predicting equipment failures, minimizing downtime, and increasing profits for organizations.
ML can be defined as a technology by which outcomes are forecast by a model built and trained on past (historical) input data and its observed output behavior.
With the help of sensors, real-time data is streamed from the equipment.
Using real-time sensor data, we can apply machine learning algorithms to predict equipment failure based on historical input data that contains failures. Based on the prediction, the organization can send a crew to fix the equipment or ship a replacement before it fails, which rapidly reduces equipment downtime, lowers operational costs, and increases revenue.
The figure below shows the process of capturing real-time data from sensors tagged to the equipment, followed by the ML model's prediction and the automatic raising of a maintenance ticket to perform maintenance on the equipment.
The above explanation is the general business practice for implementing predictive maintenance with ML.
Coming to the ConocoPhillips dataset: it documents failure events that occurred on
- Surface equipment: equipment installed on the surface.
- Down-hole equipment: equipment installed below the surface.
For each failure event, data has been collected from 107 sensors that collect a variety of physical information both on the surface and below the ground.
The machine learning formulation of the business problem: apply ML techniques to predict equipment failure (surface and down-hole) from real-time sensor data, so that equipment maintenance can be carried out before it fails.
4. Business Objectives and Constraints
- No strict latency requirement
- Errors are very costly
5. Dataset Analysis
The dataset can be downloaded from this link.
There are 60,000 failure (down-hole and surface) events in the dataset, captured using 107 sensors. Let's understand the features available in the dataset.
6. Evaluation of Performance Metrics
Since this is a binary classification problem, we will use the performance metric below to evaluate the machine learning models.
F1-Score: F1 is the harmonic mean of precision and recall. Precision is the fraction of predicted positive cases that are actually positive; it matters when the cost of false positives is high. Recall is the fraction of actual positive cases that are correctly identified; it matters when the cost of false negatives is high.
The F1 score is a better metric than accuracy for evaluating models on highly imbalanced datasets.
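To make the metric concrete, here is a minimal sketch of how precision, recall, and F1 relate, using scikit-learn on made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy ground truth and predictions (illustrative only)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```

With 3 true positives, 1 false positive, and 1 false negative here, precision = recall = F1 = 0.75.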
7. Exploratory Data Analysis (EDA)
Let's perform data analysis on the dataset so that it informs preprocessing, feature engineering, and the ML models we apply.
7.1 Reading Data
There are 172 features in total in the dataset.
7.2 Plot “Target” feature Distribution
As per the analysis below, 98% of the recorded failures are surface-related and 1.6% are down-hole-related.
Clearly, the data is highly imbalanced. We will apply the following methods to handle the imbalance during ML modelling.
1) Random Over Sampler (UpSampling)
2) SMOTE Over Sampling
3) SMOTE TOMEK Over Sampling
4) Modify the model to account for class imbalance
7.3 Describe Dataset Features
Let's summarize the features to understand statistical information such as percentiles, mean, median, count, etc.
We observe that some features contain the string 'na' as well as NaN (null) data points.
7.4 Dealing with ‘na’ and ‘NaN’ values
Let's replace 'na' values with NaN across the dataset and list the percentage of NaN data points for each feature in the data frame.
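A minimal sketch of this step on a toy frame (the column names are illustrative, not the actual dataset columns):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the sensor data
df = pd.DataFrame({
    "sensor1_measure": ["1.5", "na", "2.0", "na"],
    "sensor2_measure": ["0.1", "0.2", "na", "0.4"],
})

# Replace the literal string 'na' with a proper NaN marker
df = df.replace("na", np.nan)

# Percentage of NaN data points per feature
nan_pct = df.isna().mean() * 100
print(nan_pct)
```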
A few features contain more than 50% NaN values. We will remove those features, as they will not be very useful for modelling.
7.5 Dealing with Feature Data Types
We can observe that the sensor columns have the Object datatype because they previously contained 'na' strings. Since we have already replaced 'na' with NaN, let's convert the sensor columns to float to carry on with the analysis.
All columns are now converted to the float datatype.
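The conversion can be sketched as follows (the toy column stands in for the real sensor columns):

```python
import numpy as np
import pandas as pd

# Object-typed column, as left over from the 'na' strings
df = pd.DataFrame({"sensor1_measure": ["1.5", np.nan, "2.0"]})

# Cast the object-typed sensor columns to float now that the 'na' strings are gone
sensor_cols = [c for c in df.columns if c.endswith("_measure")]
df[sensor_cols] = df[sensor_cols].astype(float)

print(df.dtypes)
```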
7.6 Univariate Analysis Through Line Plot
Let's start with univariate analysis to understand the sensor signals through line plots.
- sensor92_measure shows very little trend for Target 0 (surface failure) and Target 1 (down-hole failure). Most sensor92_measure values are either zero or null.
- sensor102_measure shows high peaks for surface (Target 0) failures.
- Most sensor 3 data points are very high for Targets 0 and 1. There is a high chance that the equipment fails when sensor 3 reaches a high value. There are noticeably fewer spikes for Target 0 than for Target 1.
- Sensors 36, 37, and 38 have similar distributions. The Target 0 (surface failure) data points are widely spread.
- The sensor 54 distribution looks odd.
7.7 Univariate Analysis Through PDF
Let's analyze the sensor information through probability density functions (PDFs).
- The sensor92_measure plot overlaps heavily between target values 0 and 1. The PDF shows that sensor92_measure is right-skewed (with outliers).
- The PDFs of the sensors below are similar, and the data overlaps for target values 0 and 1:
- sensor27_measure, sensor47_measure, sensor46_measure, sensor67_measure, sensor48_measure, sensor53_measure, sensor59_measure
- There are outliers for sensor3_measure beyond the value 2 on the X-axis.
- The PDFs of the sensors below are also similar, with overlap between target values 0 and 1:
- sensor15_measure, sensor8_measure, sensor32_measure
- All sensor features are skewed and contain outliers.
7.8 Univariate Analysis Through Box Plot
Let's analyze the sensor information through box plots.
- sensor92_measure: the IQR for Targets 0 and 1 is near zero because most values are zero. Target 0 has many more outliers than Target 1.
- Observed the following for the sensors (sensor27_measure, sensor47_measure, sensor46_measure, sensor67_measure, sensor48_measure, sensor53_measure, sensor59_measure, sensor14_measure, sensor15_measure, sensor8_measure, sensor32_measure):
- The 75th percentile of surface failures (Target 0) is below the 25th percentile of down-hole failures (Target 1), meaning the down-hole failure values trend higher than the surface failure values.
- The median line of one target's box lies outside the other target's box, indicating a difference between Targets 0 and 1.
- None of the sensors is normally distributed; all are right-skewed with outliers.
7.9 Multi Variate Analysis through Correlation Matrix
Correlation is a useful technique for analyzing data and a good way to understand relationships between variables.
The features below are highly correlated with each other. Since highly correlated features can hurt the models, we will keep one feature from each pair and remove the other so the model predicts better.
- sensor(32,8),sensor(27,67),sensor(27,47),sensor(27,46),sensor(92,93),sensor(14,15),sensor(72,78),sensor(95,94),sensor(12,13),sensor(89,33),sensor(90,91),sensor(27,14),sensor(14,46),sensor(14,67),sensor(14,47),sensor(32,14),sensor(8,14),sensor(97,96),sensor(32,33),sensor(33,8),sensor(1,45),sensor(89,35),sensor(14,33),sensor(33,27),sensor(8,27),sensor(33,17),sensor(15,27),sensor(32,27),sensor(67,33),sensor(33,47),sensor(46,33),sensor(8,46),sensor(8,67),sensor(8,47),sensor(46,15),sensor(15,67),sensor(47,15),sensor(46,89),sensor(67,89),sensor(47,89),sensor(46,32),sensor(67,32),sensor(47,32),sensor(34,16),sensor(32,15),sensor(27,89),sensor(8,15),sensor(6,5) ,sensor(35,33),sensor(35,17)
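The kind of analysis that surfaces these pairs can be sketched as below; the data here is synthetic, with one nearly collinear pair standing in for the real sensors, and the 0.95 threshold is an assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "sensor8_measure": a,
    "sensor32_measure": a * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear
    "sensor14_measure": rng.normal(size=200),                      # independent
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c) for c in upper.columns for r in upper.index if upper.loc[r, c] > 0.95]
print(pairs)
```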
8. Data Preprocessing
8.1 Remove Features
Remove useless features from the dataset
8.2 Remove Least Performed Features
After applying ML models with several approaches, I found that the features below contribute the least to the model's predictions. We remove them, as they do not help the model predict better.
Please refer to my GitHub repository to see the approaches I followed to reach this conclusion about the least useful features.
8.3 Impute ‘Median’ in missing values
Considering the facts below, let's impute the median for missing values across the dataset:
- Data points range from zero to very large numbers, so imputing the median is a better option than the mean.
- As observed during EDA, there are many outliers across the dataset, so it is better to replace NaN with the median, which is robust to outliers, rather than with the mean or a variance-based statistic.
- The median gives sensible results even when the data contains outliers.
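A minimal sketch of median imputation, with a deliberate outlier to show why the median is the safer choice (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy column with an outlier; the mean would be dragged up, the median is not
df = pd.DataFrame({"sensor3_measure": [1.0, 2.0, np.nan, 1000.0]})

# Fill NaNs with each numeric column's median
df = df.fillna(df.median(numeric_only=True))
print(df["sensor3_measure"].tolist())  # the NaN becomes 2.0, the column median
```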
9. Feature Engineering
9.1 Add Mean and Median Features:
Add new features to the dataset to help the model predict better. Here we calculate the mean and median of each row and add the results as new 'mean' and 'median' columns.
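A sketch of the row-wise aggregate features (toy values, illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    "sensor1_measure": [1.0, 4.0],
    "sensor2_measure": [3.0, 8.0],
})

# Row-wise mean and median over the sensor columns, added as new features
sensor_cols = [c for c in df.columns if c.endswith("_measure")]
df["mean"] = df[sensor_cols].mean(axis=1)
df["median"] = df[sensor_cols].median(axis=1)
print(df)
```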
9.2 Add interaction features for highly correlated features
Take the highly correlated features and add interaction terms to strengthen the relationship.
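One way to add such an interaction term, sketched for a single hypothetical pair (the product is just one possible interaction):

```python
import pandas as pd

df = pd.DataFrame({"sensor8_measure": [1.0, 2.0], "sensor32_measure": [3.0, 4.0]})

# Hypothetical multiplicative interaction for one highly correlated pair
df["sensor8_x_sensor32"] = df["sensor8_measure"] * df["sensor32_measure"]
print(df["sensor8_x_sensor32"].tolist())
```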
9.3 Remove Highly Correlated features after Feature Engineering
The code below finds the features that are highly correlated with another feature and automatically drops one of each pair from the dataset.
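A common pattern for this step, sketched on synthetic data; the 0.95 threshold is an assumption, not necessarily the one used in the repository:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    "f1": a,
    "f2": a + rng.normal(scale=0.01, size=100),  # nearly identical to f1
    "f3": rng.normal(size=100),                  # independent
})

corr = df.corr().abs()
# Upper triangle so each pair is considered once; drop one member of each pair
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
print(df.columns.tolist())
```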
10. Preparing Data for Modelling
10.1 Prepare Data
Separate the dataset into two parts.
- The X data frame contains all features except the 'Target' feature
- The y variable contains only the 'Target' feature
10.2 Split Dataset into Train and Test
Split the dataset into train and test sets, with 20 percent as test data. Since the data is highly imbalanced, it is better to use the 'stratify' parameter, which returns training and test subsets with the same class-label proportions as the input dataset.
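A sketch of the stratified split on toy imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 of class 0 and 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both splits keep a 10% minority share
```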
10.3 Apply Normalization
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
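A sketch of normalization with scikit-learn's MinMaxScaler; note that the scaler is fit on the training split only, to avoid leaking test information (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy train/test matrices; the two columns have very different ranges
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

# Fit on the training split only, then transform both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)  # every training column now lies in [0, 1]
print(X_test_scaled)
```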
11. Applying Machine Learning Models
Below are the ML models chosen for this highly imbalanced dataset.
- Random Forest Classifier: a supervised learning algorithm that builds an ensemble of decision trees, usually trained with the bagging method. The general idea of bagging is that combining learning models improves the overall result. Random forests are very good for classification problems but slightly weaker at regression.
- LightGBM Classifier: a gradient boosting framework based on decision tree algorithms. It is very fast compared with other boosting frameworks, takes less memory to run, and can handle large amounts of data.
- XGBoost Classifier: one of the most popular machine learning algorithms today. It belongs to the family of boosting algorithms and uses the gradient boosting (GBM) framework at its core; it is an optimized, distributed gradient boosting library.
I tried other ML models too, but the models above performed better than the rest.
I did not use GridSearchCV or RandomizedSearchCV to tune hyperparameters because of their high time complexity. Instead, I tuned the hyperparameters manually, based on the article, for the XGBoost and LightGBM models.
11.1 Applying Random Forest ML Model with no Sampling Technique
Let's store the train and test splits in their respective train and test variables.
Apply the Random Forest model with the max_depth and n_estimators parameters and compute the F1 scores.
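A sketch of this step on synthetic imbalanced data; the hyperparameter values here are illustrative, not the tuned ones from the repository:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sensor data, with a 95/5 class imbalance
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

train_f1 = f1_score(y_train, rf.predict(X_train))
test_f1 = f1_score(y_test, rf.predict(X_test))
print(f"Train F1: {train_f1:.2f}  Test F1: {test_f1:.2f}")
```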
Let's print the confusion matrix for the model.
Random Forest is a sensible model: the TPR and TNR are high compared with the FNR and FPR in both the train and test confusion matrices. However, a few points are misclassified.
Let's print the F1 scores.
The Random Forest model gives a train F1 score of 1.00 and a test F1 score of 0.83. The F1 scores are pretty good, but the model overfits.
11.2 RandomForest with RandomOverSampler Technique
Let's store the train and test splits in their respective train and test variables.
Apply the RandomOverSampler method to randomly oversample the minority class label.
As shown below, the minority class is successfully oversampled, and the data points are now equal for Target 0 and Target 1.
Apply the Random Forest model with the n_estimators parameter and compute the F1 scores.
Let's print the confusion matrix for the model.
Random Forest is a sensible model: the TPR and TNR are high compared with the FNR and FPR in both the train and test confusion matrices. However, a few points are misclassified.
Let's print the F1 scores.
The F1 scores are noticeably worse, and the model overfits more badly than the previous one.
11.3 LightGBM Classifier with No Sampling
Apply the LightGBM classifier with the defined parameters and compute the F1 scores.
Let's print the confusion matrix for the model.
LightGBM is a sensible model: the TPR and TNR are high compared with the FNR and FPR in both the train and test confusion matrices. However, a few points are misclassified.
The LightGBM model gives the best train F1 score so far at 1.00 and a test F1 score of 0.84. The F1 scores are pretty good, but the model overfits.
This model scores better than the previous models.
11.4 LightGBM Classifier with RandomOverSampler
Apply the LightGBM classifier with the defined parameters and compute the F1 scores.
The RandomOverSampler technique did not perform well here; the F1 scores did not improve compared with the previous models.
11.5 LightGBM Classifier with SMOTE
Apply SMOTE oversampling to oversample the minority class.
SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing its k nearest neighbors. Synthetic points are added between the chosen point and its neighbors.
Apply ML model on Smote over sample dataset
The SMOTE oversampling technique did not perform well here; the F1 scores did not improve compared with the previous models.
11.6 LightGBM Classifier with SMOTE TOMEK
Apply the SMOTE Tomek technique to resample the class labels.
SMOTE Tomek is a hybrid method combining upsampling and downsampling: it pairs an undersampling method (Tomek links) with an oversampling method (SMOTE).
Apply the ML model to the SMOTE Tomek resampled dataset.
The scores improve compared with the other models. However, there is still an overfitting problem, as with the other ML models.
These are the best F1 scores so far.
11.7 XGBoost With No Sampling
Apply XGBoost with the predefined parameters.
XGBoost without any sampling technique gives a good F1 score; however, the model overfits.
11.8 XGBoost with RandomOverSampler
Apply the XGBoostClassifier to the randomly oversampled dataset.
The XGBoost RandomOverSampler scores are much the same as those of XGBoost with no sampling technique.
12. Conclusion
Apart from the ML models above, I applied other techniques as well, but the ones above produced the best F1 scores.
We can conclude that the LGBMClassifier with SMOTE Tomek oversampling provided the best test F1 score (0.85), macro F1 score (0.93), and test AUC score (0.94) among the ML models.
13. Challenges
I trained many ML models on the given dataset, but all of them overfit. Hence I chose regularization hyperparameters specific to each ML model to reduce the overfitting problem.
14. Future Work
- Applying a deep learning model might improve the scores
- A stacking classifier may work well
- Imputing NaN with an imputation other than the median might improve the model's performance metrics
15. GitHub Repository
The code is available in my GitHub repository. Kindly have a glance at it.