I would like to thank ConocoPhillips for providing the dataset used to solve a real-world prediction problem with machine learning models. Let's dive into solving a real-world business problem with machine learning.
Table of Contents
- Business Problem
- Introduction of Equipment Maintenance Types
- Machine Learning Formulation of Business Problem
- Business Objectives and Constraints
- Dataset Analysis
- Evaluation of Performance Metrics
- Exploratory Data Analysis
- Data Pre-Processing
- Feature Engineering
- Preparing Data for Modeling
- Apply Machine Learning Models
- Conclusion
- Challenges
- Future Work
- GitHub Repository
- References
1. Business Problem
Organizations use stripper well equipment to produce oil at the well level. Because the operation and maintenance costs of this equipment are low, organizations earn a healthy profit on oil production. However, the equipment is subject to failure (breakdown) due to stripper rubber shrinkage, oil color change, feed gas temperature increases or decreases, etc.
Any unplanned downtime of equipment would degrade or interrupt a company's core business, potentially resulting in significant penalties and unmeasurable reputation loss.
The main business problem is to predict asset failures upfront. Based on the prediction, the organization can send a crew to the physical location to fix the equipment, or ship a replacement before it fails, which rapidly reduces equipment downtime, drastically lowers operational costs, and increases revenue.
2. Introduction of Equipment Maintenance Types
A brief introduction to the equipment maintenance techniques that organizations commonly use to reduce downtime to some extent, each with its own limitations.
Corrective Maintenance: Also called run-to-failure maintenance. Maintenance is performed only after the asset has failed.
Limitations: Since this is unplanned maintenance, it can expose the organization to high risk.
Preventive Maintenance: Scheduled, time-based maintenance performed periodically according to a planned schedule in order to reduce equipment downtime.
Limitations: It can lead to unnecessary maintenance, which increases operational costs.
Condition-Based Maintenance: Maintenance performed only when certain indicators show signs of degrading performance or an upcoming failure.
Limitations: Thresholds or indicators must be defined per piece of equipment to decide whether maintenance should be performed.
All of the above maintenance categories reduce equipment downtime to some extent, but not fully.
Predictive Maintenance (PdM): While unplanned and preventive maintenance involve a trade-off, predictive maintenance is a promising technique that breaks that trade-off by maximizing the useful life of a component and its uptime simultaneously.
PdM monitors the performance and condition of equipment during normal operation to reduce the likelihood of failure. Accordingly, machine downtime and maintenance cost can be reduced significantly while keeping the maintenance frequency as low as possible.
3. Machine Learning Formulation of Business Problem
Machine Learning (ML) techniques have emerged as a promising tool in PdM applications for predicting equipment failures, minimizing downtime, and increasing profits for organizations.
ML can be defined as a technology by which outcomes are forecast by a model built and trained on past (historical) input data and its observed output behavior.
With the help of sensors, real-time data is streamed from the equipment.
Using real-time sensor data, we can apply machine learning algorithms to predict equipment failure based on historical input data that contains failures. Based on the prediction, the organization can send a crew to fix the equipment or ship a replacement before it fails, which rapidly reduces equipment downtime, lowers operational costs, and increases revenue.
The figure below shows the process of capturing real-time data from sensors tagged to the equipment, followed by the ML model's prediction and the automatic raising of a maintenance ticket to perform maintenance on the equipment.
The above explanation is the general business practice for implementing predictive maintenance with ML.
Coming to the ConocoPhillips dataset: it documents failure events that occurred on
- Surface equipment: equipment installed on the surface.
- Down-hole equipment: equipment installed below the surface.
For each failure event, data has been collected from 107 sensors that collect a variety of physical information both on the surface and below the ground.
The machine learning formulation of the business problem: apply ML techniques to predict equipment failure (surface and down-hole) from real-time sensor data, so that equipment maintenance can be carried out before it fails.
4. Business Objectives and Constraints
- No strict latency requirement
- Errors are very costly
5. Dataset Analysis
The dataset can be downloaded from this link.
There are 60,000 failure (down-hole and surface) events in the dataset, captured using 107 sensors. Let's understand the features available in the dataset.
6. Evaluation of Performance Metrics
Since this is a binary classification problem, we will use the performance metric below to evaluate the machine learning models.
F1-Score: F1 is the harmonic mean of precision and recall. Precision is the fraction of predicted positive cases that are actually positive; it matters when the cost of false positives is high. Recall is the fraction of actual positive cases that are correctly identified; it matters when the cost of false negatives is high.
The F1 score is a better metric than accuracy for evaluating models on highly imbalanced datasets.
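To make the metric concrete, here is a minimal sketch of how precision, recall, and F1 relate, using scikit-learn on made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy ground truth and predictions (illustrative only)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```

With 3 true positives, 1 false positive, and 1 false negative here, precision = recall = F1 = 0.75.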
7. Exploratory Data Analysis (EDA)
Let's perform data analysis on the dataset so that it informs preprocessing, feature engineering, and the ML models we apply.
7.1 Reading Data
There are 172 features in total in the dataset.
7.2 Plot “Target” feature Distribution
As per the analysis below, 98% of the recorded failures are surface-related and 1.6% are down-hole-related.
Clearly, the data is highly imbalanced. We will apply the following methods to handle the imbalance during ML modelling.
1) Random Over Sampler (UpSampling)
2) SMOTE Over Sampling
3) SMOTE TOMEK Over Sampling
4) Modify the model to account for class imbalance
7.3 Describe Dataset Features
Let's summarize the features to understand statistical information such as percentiles, mean, median, count, etc.
We observe that some features contain the string 'na' as well as NaN (null) data points.
7.4 Dealing with ‘na’ and ‘NaN’ values
Let's replace 'na' values with NaN across the dataset and list the percentage of NaN data points for each feature in the data frame.
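A minimal sketch of this step on a toy frame (the column names are illustrative, not the actual dataset columns):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the sensor data
df = pd.DataFrame({
    "sensor1_measure": ["1.5", "na", "2.0", "na"],
    "sensor2_measure": ["0.1", "0.2", "na", "0.4"],
})

# Replace the literal string 'na' with a proper NaN marker
df = df.replace("na", np.nan)

# Percentage of NaN data points per feature
nan_pct = df.isna().mean() * 100
print(nan_pct)
```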
A few features contain more than 50% NaN values. We will remove those features, as they will not be very useful for modelling.
7.5 Dealing with Feature Data Types
We can observe that the sensor columns have the Object datatype because they previously contained 'na' strings. Since we have already replaced 'na' with NaN, let's convert the sensor columns to float to carry on with the analysis.
All columns are now converted to the float datatype.
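The conversion can be sketched as follows (the toy column stands in for the real sensor columns):

```python
import numpy as np
import pandas as pd

# Object-typed column, as left over from the 'na' strings
df = pd.DataFrame({"sensor1_measure": ["1.5", np.nan, "2.0"]})

# Cast the object-typed sensor columns to float now that the 'na' strings are gone
sensor_cols = [c for c in df.columns if c.endswith("_measure")]
df[sensor_cols] = df[sensor_cols].astype(float)

print(df.dtypes)
```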
7.6 Univariate Analysis Through Line Plot
Let's start with univariate analysis to understand the sensor signals through line plots.
- sensor92_measure shows very little trend for Target 0 (surface failure) and Target 1 (down-hole failure). Most sensor92_measure values are either zero or null.
- sensor102_measure shows high peaks for surface (Target 0) failures.
- Most sensor 3 data points are very high for Targets 0 and 1. There is a high chance that the equipment fails when sensor 3 reaches a high value. There are noticeably fewer spikes for Target 0 than for Target 1.
- Sensors 36, 37, and 38 have similar distributions. The Target 0 (surface failure) data points are widely spread.
- The sensor 54 distribution looks odd.
7.7 Univariate Analysis Through PDF
Let's analyze the sensor information through probability density functions (PDFs).
- The sensor92_measure plot overlaps heavily between target values 0 and 1. The PDF shows that sensor92_measure is right-skewed (with outliers).
- The PDFs of the sensors below are similar, and the data overlaps for target values 0 and 1:
- sensor27_measure, sensor47_measure, sensor46_measure, sensor67_measure, sensor48_measure, sensor53_measure, sensor59_measure
- There are outliers for sensor3_measure beyond the value 2 on the X-axis.
- The PDFs of the sensors below are also similar, with overlap between target values 0 and 1:
- sensor15_measure, sensor8_measure, sensor32_measure
- All sensor features are skewed and contain outliers.
7.8 Univariate Analysis Through Box Plot
Let's analyze the sensor information through box plots.
- sensor92_measure: the IQR for Targets 0 and 1 is near zero because most values are zero. Target 0 has many more outliers than Target 1.
- Observed the following for the sensors (sensor27_measure, sensor47_measure, sensor46_measure, sensor67_measure, sensor48_measure, sensor53_measure, sensor59_measure, sensor14_measure, sensor15_measure, sensor8_measure, sensor32_measure):
- The 75th percentile of surface failures (Target 0) is below the 25th percentile of down-hole failures (Target 1), meaning the down-hole failure values trend higher than the surface failure values.
- The median line of one target's box lies outside the other target's box, indicating a difference between Targets 0 and 1.
- None of the sensors is normally distributed; all are right-skewed with outliers.
7.9 Multi Variate Analysis through Correlation Matrix
Correlation is a useful technique for analyzing data and a good way to understand relationships between variables.
The features below are highly correlated with each other. Since highly correlated features can hurt the models, we will keep one feature from each pair and remove the other so the model predicts better.
- sensor(32,8),sensor(27,67),sensor(27,47),sensor(27,46),sensor(92,93),sensor(14,15),sensor(72,78),sensor(95,94),sensor(12,13),sensor(89,33),sensor(90,91),sensor(27,14),sensor(14,46),sensor(14,67),sensor(14,47),sensor(32,14),sensor(8,14),sensor(97,96),sensor(32,33),sensor(33,8),sensor(1,45),sensor(89,35),sensor(14,33),sensor(33,27),sensor(8,27),sensor(33,17),sensor(15,27),sensor(32,27),sensor(67,33),sensor(33,47),sensor(46,33),sensor(8,46),sensor(8,67),sensor(8,47),sensor(46,15),sensor(15,67),sensor(47,15),sensor(46,89),sensor(67,89),sensor(47,89),sensor(46,32),sensor(67,32),sensor(47,32),sensor(34,16),sensor(32,15),sensor(27,89),sensor(8,15),sensor(6,5) ,sensor(35,33),sensor(35,17)
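The kind of analysis that surfaces these pairs can be sketched as below; the data here is synthetic, with one nearly collinear pair standing in for the real sensors, and the 0.95 threshold is an assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "sensor8_measure": a,
    "sensor32_measure": a * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear
    "sensor14_measure": rng.normal(size=200),                      # independent
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c) for c in upper.columns for r in upper.index if upper.loc[r, c] > 0.95]
print(pairs)
```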
8. Data Preprocessing
8.1 Remove Features
Remove useless features from the dataset
8.2 Remove Least Performed Features
After applying ML models with several approaches, I found that the features below contribute the least to the model's predictions. We remove them, as they do not help the model predict better.
Please refer to my GitHub repository to see the approaches I followed to reach this conclusion about the least useful features.
8.3 Impute ‘Median’ in missing values
Considering the facts below, let's impute the median for missing values across the dataset:
- Data points range from zero to very large numbers, so imputing the median is a better option than the mean.
- As observed during EDA, there are many outliers across the dataset, so it is better to replace NaN with the median, which is robust to outliers, rather than with the mean or a variance-based statistic.
- The median gives sensible results even when the data contains outliers.
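A minimal sketch of median imputation, with a deliberate outlier to show why the median is the safer choice (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy column with an outlier; the mean would be dragged up, the median is not
df = pd.DataFrame({"sensor3_measure": [1.0, 2.0, np.nan, 1000.0]})

# Fill NaNs with each numeric column's median
df = df.fillna(df.median(numeric_only=True))
print(df["sensor3_measure"].tolist())  # the NaN becomes 2.0, the column median
```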
9. Feature Engineering
9.1 Add Mean and Median Features:
Add new features to the dataset to help the model predict better. Here we calculate the mean and median of each row and add the results as new 'mean' and 'median' columns.
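A sketch of the row-wise aggregate features (toy values, illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    "sensor1_measure": [1.0, 4.0],
    "sensor2_measure": [3.0, 8.0],
})

# Row-wise mean and median over the sensor columns, added as new features
sensor_cols = [c for c in df.columns if c.endswith("_measure")]
df["mean"] = df[sensor_cols].mean(axis=1)
df["median"] = df[sensor_cols].median(axis=1)
print(df)
```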
9.2 Add interaction features for highly correlated features
Take the highly correlated features and add interaction terms to strengthen the relationship.
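One way to add such an interaction term, sketched for a single hypothetical pair (the product is just one possible interaction):

```python
import pandas as pd

df = pd.DataFrame({"sensor8_measure": [1.0, 2.0], "sensor32_measure": [3.0, 4.0]})

# Hypothetical multiplicative interaction for one highly correlated pair
df["sensor8_x_sensor32"] = df["sensor8_measure"] * df["sensor32_measure"]
print(df["sensor8_x_sensor32"].tolist())
```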
9.3 Remove Highly Correlated features after Feature Engineering
The code below finds the features that are highly correlated with another feature and automatically drops one of each pair from the dataset.
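A common pattern for this step, sketched on synthetic data; the 0.95 threshold is an assumption, not necessarily the one used in the repository:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    "f1": a,
    "f2": a + rng.normal(scale=0.01, size=100),  # nearly identical to f1
    "f3": rng.normal(size=100),                  # independent
})

corr = df.corr().abs()
# Upper triangle so each pair is considered once; drop one member of each pair
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
print(df.columns.tolist())
```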
10. Preparing Data for Modelling
10.1 Prepare Data
Separate the dataset into two parts.
- The X data frame contains all features except the 'Target' feature
- The y variable contains only the 'Target' feature
10.2 Split Dataset into Train and Test
Split the dataset into train and test sets, with 20 percent as test data. Since the data is highly imbalanced, it is better to use the 'stratify' parameter, which returns training and test subsets with the same class-label proportions as the input dataset.
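A sketch of the stratified split on toy imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 of class 0 and 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both splits keep a 10% minority share
```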
10.3 Apply Normalization
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
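A sketch of normalization with scikit-learn's MinMaxScaler; note that the scaler is fit on the training split only, to avoid leaking test information (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy train/test matrices; the two columns have very different ranges
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

# Fit on the training split only, then transform both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)  # every training column now lies in [0, 1]
print(X_test_scaled)
```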
11. Applying Machine Learning Models
Below are the ML models chosen for this highly imbalanced dataset.
- Random Forest Classifier: a supervised learning algorithm that builds an ensemble of decision trees, usually trained with the bagging method. The general idea of bagging is that combining learning models improves the overall result. Random forests are very good for classification problems but slightly weaker at regression.
- LightGBM Classifier: a gradient boosting framework based on decision tree algorithms. It is very fast compared with other boosting frameworks, takes less memory to run, and can handle large amounts of data.
- XGBoost Classifier: one of the most popular machine learning algorithms today. It belongs to the family of boosting algorithms and uses the gradient boosting (GBM) framework at its core; it is an optimized, distributed gradient boosting library.
I tried other ML models too, but the models above performed better than the rest.
I did not use GridSearchCV or RandomizedSearchCV to tune hyperparameters because of their high time complexity. Instead, I tuned the hyperparameters manually, based on the article, for the XGBoost and LightGBM models.
11.1 Applying Random Forest ML Model with no Sampling Technique
Let's store the train and test splits in their respective train and test variables.
Apply the Random Forest model with the max_depth and n_estimators parameters and compute the F1 scores.
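A sketch of this step on synthetic imbalanced data; the hyperparameter values here are illustrative, not the tuned ones from the repository:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sensor data, with a 95/5 class imbalance
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

train_f1 = f1_score(y_train, rf.predict(X_train))
test_f1 = f1_score(y_test, rf.predict(X_test))
print(f"Train F1: {train_f1:.2f}  Test F1: {test_f1:.2f}")
```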
Let's print the confusion matrix for the model.
Random Forest is a sensible model: the TPR and TNR are high compared with the FNR and FPR in both the train and test confusion matrices. However, a few points are misclassified.
Let's print the F1 scores.
The Random Forest model gives a train F1 score of 1.00 and a test F1 score of 0.83. The F1 scores are pretty good, but the model overfits.
11.2 RandomForest with RandomOverSampler Technique
Let's store the train and test splits in their respective train and test variables.
Apply the RandomOverSampler method to randomly oversample the minority class label.
As shown below, the minority class is successfully oversampled, and the data points are now equal for Target 0 and Target 1.
Apply the Random Forest model with the n_estimators parameter and compute the F1 scores.
Let's print the confusion matrix for the model.
Random Forest is a sensible model: the TPR and TNR are high compared with the FNR and FPR in both the train and test confusion matrices. However, a few points are misclassified.
Let's print the F1 scores.
The F1 scores are noticeably worse, and the model overfits more badly than the previous one.
11.3 LightGBM Classifier with No Sampling
Apply the LightGBM classifier with the defined parameters and compute the F1 scores.
Let's print the confusion matrix for the model.
LightGBM is a sensible model: the TPR and TNR are high compared with the FNR and FPR in both the train and test confusion matrices. However, a few points are misclassified.
The LightGBM model gives the best train F1 score so far at 1.00 and a test F1 score of 0.84. The F1 scores are pretty good, but the model overfits.
This model scores better than the previous models.
11.4 LightGBM Classifier with RandomOverSampler
Apply the LightGBM classifier with the defined parameters and compute the F1 scores.
The RandomOverSampler technique did not perform well here; the F1 scores did not improve compared with the previous models.
11.5 LightGBM Classifier with SMOTE
Apply SMOTE oversampling to oversample the minority class.
SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing its k nearest neighbors. Synthetic points are added between the chosen point and its neighbors.
Apply ML model on Smote over sample dataset
The SMOTE oversampling technique did not perform well here; the F1 scores did not improve compared with the previous models.
11.6 LightGBM Classifier with SMOTE TOMEK
Apply the SMOTE Tomek technique to resample the class labels.
SMOTE Tomek is a hybrid method combining upsampling and downsampling: it pairs an undersampling method (Tomek links) with an oversampling method (SMOTE).
Apply the ML model to the SMOTE Tomek resampled dataset.
The scores improve compared with the other models. However, there is still an overfitting problem, as with the other ML models.
These are the best F1 scores so far.
11.7 XGBoost With No Sampling
Apply XGBoost with the predefined parameters.
XGBoost without any sampling technique gives a good F1 score; however, the model overfits.
11.8 XGBoost with RandomOverSampler
Apply the XGBoostClassifier to the randomly oversampled dataset.
The XGBoost RandomOverSampler scores are much the same as those of XGBoost with no sampling technique.
12. Conclusion
Apart from the ML models above, I applied other techniques as well, but the ones above produced the best F1 scores.
We can conclude that the LGBMClassifier with SMOTE Tomek oversampling provided the best test F1 score (0.85), macro F1 score (0.93), and test AUC score (0.94) among the ML models.
13. Challenges
I trained many ML models on the given dataset, but all of them overfit. Hence I chose regularization hyperparameters specific to each ML model to reduce the overfitting problem.
14. Future Work
- Applying a deep learning model might improve the scores
- A stacking classifier may work well
- Imputing NaN with an imputation other than the median might improve the model's performance metrics
15. GitHub Repository
The code is available in my GitHub repository. Kindly have a glance at it.