Vijay Dwivedi | Freelancer Predictive Maintenance by ML

Predictive Maintenance by ML

Transformer Failure Detection by Machine Learning
Vijay Dwivedi, Shrut Makde, Nikhil Gupta
Feb 25, 2025
Restricted

Content
1) Transformer Failure Definition
   ➢ Objective and parameter definition
2) Data Aggregation Logic
3) Exploratory Data Analysis -)
4) Parameter Selection Criteria
5) Model Development Steps
   ➢ Data selection -) for model building
6) Test Set Metrics (2024)
7) Vibration_1 Analysis and Summary
8) Relay Analysis
9) Model Metric with Ground Truth
10) Confidence Level Definition and Plot
11) Model Deployment in SageMaker
12) Appendix

Transformer Failure Definition
Objective: To predict the failure probability of a transformer over a forecasting window of 5-15 days.
Failure definition is as follows:

LT1:
1. Beyond 140 for 2 continuous hours, for 7-8 days
2. 120-140 for 40-60 min in summer (Mar – June), 10-15 times
3. Either 1 or 2
4. Ambient temperature is used as an additional parameter, based on SME suggestions

LT2:
1. In the afternoon (12-17 hours), if a trip in the 100-110 temperature range goes beyond 110 for 2 hours, the transformer will fail
2. Ambient temperature is used as an additional parameter, based on SME suggestions

Relay:
1. Vibration_1 clearly segregates the relay categories.
2. Rule-based relay colour prediction. More detail in the appendix.
Metric: Accuracy ~ 70%

Data Aggregation Logic
Data analysis for three parameters (LT1, LT2 and Vibration_1):
Assumption: Data is gathered every 5 minutes. To separate the noise from the signal, a first smoothing step (averaging) is carried out over 1-hour windows. The averaged data retains the signature of the original data.
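The aggregation step described above can be sketched as follows. This is a minimal illustration, not the deck's actual pipeline: it assumes the 1-hour average is taken over fixed clock-hour buckets (the source does not say whether the window is fixed or rolling), and the function name `hourly_average` and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_average(readings):
    """Average (timestamp, value) readings into one value per clock hour.

    Assumption: fixed clock-hour buckets; the source only says the
    5-minute data is averaged over 1 hour.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        # Truncate each timestamp to the start of its hour.
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    # Return the hourly means in chronological order.
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical readings: two hours of 5-minute samples (12 per hour).
start = datetime(2024, 1, 1, 0, 0)
readings = [
    (start + timedelta(minutes=5 * i), 100.0 if i < 12 else 110.0)
    for i in range(24)
]
smoothed = hourly_average(readings)
```

Each hourly value here is the mean of the twelve 5-minute samples that fall in that hour, so 24 raw readings reduce to 2 smoothed points.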
Model Development Steps
The next step is to train the forecasting models for the parameters (LT2, LT1 and Vibration_1):
Step 1: Divide the data into two groups: group 1 -) and group 2 (2024)
Step 2: Build the model on the group 1 data set (80:20 :: train:validation)
Step 3: Select the XGBoost model and the RMSE metric to train the model
Step 4: Check for overfitting/underfitting and check performance on the training set
Step 5: Calculate the RMSE on the test data and present charts of prediction vs test data
(Charts: LT1, LT2)

Test Set Metrics (2024)
LT1 – Test data and predicted value comparison. Test RMSE: 0.77
LT2 – Test data and predicted value comparison. Test RMSE: 1.73
Good match between the actual test data and the predictions of the XGB model.

Vibration_1 Analysis and Summary
Vibration_1 – Test data and predicted value comparison

Relay Analysis
Three parameters are considered for relay analysis: LT1, LT2 and Vibration.
Rule-based logic is used for relay colour prediction and verification.

Count of Colour_numeric (rows: ground truth, columns: prediction)
Ground truth   Ash     Purple   White
Ash            8       0        0
Purple         0       17       0
White          14      1        26264

Share of each ground-truth row
Ground truth   Ash     Purple   White
Ash            100%    0%       0%
Purple         0%      100%     0%
White          0.05%   0.004%   99.94%

Model Metric with Ground Truth
Work Order (Ground Truth)

workorderid    failuredatetime    prioritylevel   closuredate        timedelta in days   Failure   remark
workorder-8    4/15/2023 13:35    Medium          4/19/2023 8:08     3                   YES       LT2 > 110 for 2 hours
workorder-9    5/5/2023 16:30     High            5/10/2023 2:33     4                   NO        LT2 > 110 for < 2 hours
workorder-10   5/12/2023 15:50    High            5/16/2023 5:22     3                   NO        LT2 > 110 for < 2 hours
workorder-15   4/6/2024 13:55     Medium          4/10/2024 21:23    4                   NO        LT2 > 110 for < 2 hours

failuredescription:
workorder-8: LT1 value is 105.0 and LT2 value is 110.0 (warning state)
workorder-9: LT1 value is 120.0 and LT2 value is 130.0 (alarm state)
workorder-10: LT2 value is 130.0 (alarm state)
workorder-15: LT2 value is 110.0 (warning state)

Failure output for LT2: the model catches 3 failure points (with time relaxation) and misses one.
Sensitivity: 75% (3/4) {TP/(TP+FN)}
Precision: 25% (3/12) {TP/(TP+FP)}
Accuracy: 86% (63/73) {(TP+TN)/(TP+TN+FP+FN)}
{TP+TN+FP+FN = 73 ≈ 26304/(24*15)}

Confidence Level Definition and Plot
Definition:
VL: None of the conditions meets the failure criteria
L: At least one condition meets the failure criteria
M: At least two conditions meet the failure criteria
H: An L or M event happening consecutively in the past 12 hours

Level   Colour   Confidence
VL      White    8.0%
VL      Ash      16.0%
VL      Purple   24.0%
L       White    32.0%
L       Ash      40.0%
L       Purple   48.0%
M       White    56.0%
M       Ash      64.0%
M       Purple   72.0%
H       White    80.0%
H       Ash      88.0%
H       Purple   96.0%

Model Deployment Strategy
Components:
1. RDS database, which holds the input table
2. Two SageMaker notebook instances (ml.t3.medium and ml.c5.2xlarge)
3. S3 bucket (to store and access the models and to store the prediction output)

Workflow:
1. Three models (LT1, LT2, vib) are trained on the ml.c5.2xlarge instance and sent to the specified S3 bucket, as shown in the figure. This is done once every 6 months so that the latest model is available for prediction and replaces the previous one. The input is fetched from the RDS database, and all of the records (e.g. 3.5 years' worth of data) are used to train the models.
2. A second instance, Scanner (ml.t3.medium), runs daily. It loads the three models saved in the bucket and uses them to generate predictions.
3. The scanner runs the models on the same RDS database table as input (data of the last month). The prediction covers 10 days * 24 hours (240 prediction records) and is saved to the S3 bucket under the particular run date, as shown in the figure.
4. For every run, the predictions for that run date are also written to another table in the RDS database, which feeds the API for the dashboard.

Duration each instance will run:
1. ml.c5.2xlarge: runs once every 6 months to train the latest model, for approximately 3 hours.
2. ml.t3.medium: runs once daily to predict 10 days' worth of data, for approximately 5-10 minutes.

Appendix

Primary Factor for the Failure
1. Alarm Distribution
2. Buchholz Relay
3. Temperature (mean deviation)
4. Cost Impact Analysis
5. Forecast (10 days)
Predictive Analysis: work in progress
Notes:
4. Details are in the next slides (WIP)
5. From ML analysis

Asset Cost Impact Analysis (ACIA)
Transformer Input: Costs Impact
1. Direct Costs:
1.1 Repair/replacement cost (50-70%)
1.2 Diagnostics cost (5-15%)
1.3 Labor cost (15-30%)
2. Indirect Costs:
2.1 Downtime cost (40-60%)
2.2 Operational cost (10-20%)
2.3 Revenue loss (20-40%)
2.4 Penalties (5-15%)
2.5 HSE cost (5-10%)
Total cost under the various heads, comprising failure and maintenance activity planned for a quarter.

Asset Cost Impact Analysis based on Work Order History
Expected Cost Over 3 Months (E[C])
E[C] = E[DC] + E[IC]
Where:
E[DC] = Expected Direct Cost
E[IC] = Expected Indirect Cost

Expected Direct Cost (E[DC])
Direct costs apply in all cases (whether the transformer trips or not), so we break them into two components:
E[DC] = C_sch + P_trip × C_unsch
Where:
C_sch = Cost of scheduled maintenance over 3 months (1-2 times per quarter)
P_trip = Probability of an unscheduled failure (from the ML model)
C_unsch = Cost of unscheduled repair/replacement

Expected Indirect Cost (E[IC])
Indirect costs apply only if the transformer trips.
E[IC] = P_trip × C_ind
Where:
C_ind = Indirect cost of transformer failure (lost revenue, operational losses, etc.)

C_sch = Cost (Repair/replacement, Diagnostics, Labor)
C_unsch = Cost (NIL)
C_ind = Cost (Downtime, Operational, Revenue Loss, Penalties, HSE Cost)

Final Formula
E[C] = C_sch + P_trip × (C_unsch + C_ind)
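The expected-cost formula from the Asset Cost Impact Analysis (scheduled cost plus trip probability times the sum of unscheduled and indirect costs) can be worked through with a short sketch. The function name `expected_cost` and all cost figures below are hypothetical and chosen only to make the arithmetic easy to follow; in the deck, the trip probability comes from the ML model and the costs from work order history.

```python
def expected_cost(c_sch, p_trip, c_unsch, c_ind):
    """Return (E[DC], E[IC], E[C]) for one 3-month period.

    c_sch   : cost of scheduled maintenance over the quarter
    p_trip  : probability of an unscheduled failure, in [0, 1]
    c_unsch : cost of unscheduled repair/replacement
    c_ind   : indirect cost incurred if the transformer trips
    """
    if not 0.0 <= p_trip <= 1.0:
        raise ValueError("p_trip must be a probability in [0, 1]")
    e_dc = c_sch + p_trip * c_unsch   # expected direct cost
    e_ic = p_trip * c_ind             # expected indirect cost
    return e_dc, e_ic, e_dc + e_ic


# Hypothetical inputs, not taken from the source.
e_dc, e_ic, e_c = expected_cost(
    c_sch=1000.0, p_trip=0.25, c_unsch=4000.0, c_ind=8000.0
)
print(e_dc, e_ic, e_c)  # 2000.0 2000.0 4000.0
```

With these inputs the total matches the final formula directly: 1000 + 0.25 × (4000 + 8000) = 4000.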