Predictive Maintenance by ML
Transformer Failure Detection
by Machine Learning
Vijay Dwivedi
Shrut Makde
Nikhil Gupta
Feb 25, 2025
| Restricted
Content
1) Transformer Failure Definition
   ➢ Objective and Parameter Definition
2) Data Aggregation Logic
3) Exploratory Data Analysis -)
4) Parameter Selection Criteria
5) Model Development Steps
   ➢ Data Selection -) for model building
6) Test Set Metrics (2024)
7) Vibration_1 Analysis and Summary
8) Relay Analysis
9) Model Metric with Ground Truth
10) Confidence Level Definition and Plot
11) Model Deployment in SageMaker
12) Appendix
Transformer Failure Definition
Objective: Predict the failure probability of a transformer over a forecasting window of 5-15 days.
Failure definition is as follows:
LT1:
1. Beyond 140 for 2 continuous hours, for 7-8 days
2. 120-140 for 40-60 min in summer (Mar – June), 10-15 times
3. Either 1 or 2
4. Ambient temperature is used as an additional parameter, based on SME suggestions
LT2:
1. In the afternoon (12-17 hours), if a trip in the 100-110 temperature range goes beyond 110 for 2 hours, it will fail
2. Ambient temperature is used as an additional parameter, based on SME suggestions
Relay:
1. Vibration_1 clearly segregates the relay categories.
2. Rule-based relay colour prediction. More detail in the appendix.
Metric: Accuracy ~ 70%
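The "beyond 110 for 2 hours in the afternoon" rule can be illustrated with a small run-length check. This is a minimal sketch, not the project's implementation: it assumes hourly averaged readings with no gaps, the function name and the sample values are hypothetical, and the afternoon window is applied to the hour in which the run starts.

```python
from datetime import datetime, timedelta


def sustained_exceedances(readings, threshold, min_hours, start_hours=None):
    """Find runs where the value stays above `threshold` for `min_hours` samples.

    readings    -- list of (datetime, value) pairs, one per hour, sorted by time
    start_hours -- optional (first, last) hour-of-day range, first inclusive and
                   last exclusive, in which the run has to begin (assumption)
    Returns the start time of every qualifying run.
    """
    starts = []
    run_start, run_length = None, 0
    for timestamp, value in readings:
        if value > threshold:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            # Record the run once, at the moment it reaches the required length.
            if run_length == min_hours:
                in_window = (start_hours is None
                             or start_hours[0] <= run_start.hour < start_hours[1])
                if in_window:
                    starts.append(run_start)
        else:
            run_length = 0
    return starts


# Hypothetical hourly LT2 averages, starting at 10:00.
first_hour = datetime(2024, 1, 1, 10)
values = [100, 105, 108, 112, 113, 109, 111, 104]
lt2 = [(first_hour + timedelta(hours=i), v) for i, v in enumerate(values)]
afternoon_failures = sustained_exceedances(lt2, threshold=110, min_hours=2,
                                           start_hours=(12, 17))
```

In this sample only the run beginning at 13:00 qualifies; the single reading above 110 at 16:00 is too short to count.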
Data Aggregation Logic
Data analysis for three parameters (LT1, LT2 and Vibration_1):
Assumption: Data is gathered at a 5-minute frequency. To separate out the noise, the data is first smoothed by averaging over 1 hour. The averaged data retains the signature of the original data.
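The hourly smoothing step can be sketched as follows. This is an illustration under assumptions: samples arrive as (timestamp, value) pairs, each hour is averaged as a fixed clock-hour bucket rather than a rolling window, and the function name and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_average(samples):
    """Average 5-minute samples into one value per clock hour.

    samples -- iterable of (datetime, value) pairs
    Returns a list of (hour_start, mean_value) pairs sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    return [(hour, mean(values)) for hour, values in sorted(buckets.items())]


# Two hours of hypothetical 5-minute readings: 0, 1, ..., 23.
start = datetime(2024, 1, 1, 0, 0)
raw = [(start + timedelta(minutes=5 * i), float(i)) for i in range(24)]
hourly = hourly_average(raw)
```

Each clock hour holds twelve 5-minute samples, so the 24 raw readings collapse into two hourly values.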
Model Development Steps
The next step is to train the forecasting models for the parameters (LT2, LT1 and Vibration_1):
Step 1: Divide the data into two groups: group 1 -) and group 2 (2024)
Step 2: Build the model on the group 1 data set (80:20 :: train:validation)
Step 3: Select the XGBoost model and the RMSE metric to train the model
Step 4: Check for overfitting/underfitting and the performance on the training set
Step 5: Calculate the RMSE on the test data and present charts of prediction vs test data
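The data split and the RMSE metric from the steps above can be sketched in plain Python. Several points here are assumptions rather than statements from the slides: group 1 is taken to be every record before 2024, the 80:20 split is taken chronologically, and all function names are hypothetical. Fitting the XGBoost model itself is not shown.

```python
import math
from datetime import datetime, timedelta


def split_groups(records, test_year=2024):
    """Group 2 is the test year; group 1 is assumed to be everything before it."""
    group1 = [r for r in records if r[0].year < test_year]
    group2 = [r for r in records if r[0].year == test_year]
    return group1, group2


def train_validation_split(group1, train_fraction=0.8):
    """Chronological 80:20 split (assumption: the slides do not say how it is drawn)."""
    cut = int(len(group1) * train_fraction)
    return group1[:cut], group1[cut:]


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Ten hypothetical daily records straddling the year boundary.
first_day = datetime(2023, 12, 27)
records = [(first_day + timedelta(days=i), 100.0 + i) for i in range(10)]
group1, group2 = split_groups(records)
train, validation = train_validation_split(group1)
example_error = rmse([2.0, 4.0], [1.0, 5.0])
```

Comparing the RMSE on the training, validation and test sets is one way to carry out the overfitting/underfitting check: a training error far below the other two suggests overfitting.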
Test Set Metrics (2024)
LT1 – Test data and predicted value comparison
Test RMSE: 0.77
LT2 – Test data and predicted value comparison
Test RMSE: 1.73
Good match between the actual test data and the predictions of the XGB model.
Vibration_1 analysis and Summary
Vibration_1 – Test data and predicted value comparison
Relay Analysis
For relay analysis, three parameters are considered: LT1, LT2 and Vibration.
Rule-based logic is used for relay colour prediction and verification.

Count of Colour_numeric (rows: ground truth, columns: prediction)
Ground Truth   Ash   Purple   White
Ash              8        0       0
Purple           0       17       0
White           14        1   26264

Share of each ground-truth row (rows: ground truth, columns: prediction)
Ground Truth   Ash      Purple    White
Ash            100%     0%        0%
Purple         0%       100%      0%
White          0.05%    0.004%    99.94%
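The percentage table follows from the count table by dividing each count by its ground-truth row total. The sketch below reproduces that arithmetic; it assumes that blank cells in the count table are zero, and the variable and function names are hypothetical.

```python
# Columns are predictions in the order Ash, Purple, White.
# Blank cells in the count table are assumed to be zero.
COUNTS = {
    "Ash": [8, 0, 0],
    "Purple": [0, 17, 0],
    "White": [14, 1, 26264],
}


def row_percentages(counts):
    """Convert each ground-truth row of counts into percentages of that row."""
    result = {}
    for truth, row in counts.items():
        total = sum(row)
        result[truth] = [100.0 * n / total if total else 0.0 for n in row]
    return result


percentages = row_percentages(COUNTS)
total_records = sum(sum(row) for row in COUNTS.values())
```

The White row gives 14/26279, 1/26279 and 26264/26279, which round to the 0.05%, 0.004% and 99.94% shown in the table, and the counts sum to 26304 records.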
Model Metric with Ground Truth
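Work Order (Ground Truth)

workorderid: workorder-8
  failuredescription: LT1 value is 105.0 and LT2 value is 110.0 (warning state)
  failuredatetime: 4/15/2023 13:35
  prioritylevel: Medium
  closuredate: 4/19/2023 8:08
  timedelta in days: 3
  Failure: YES
  remark: LT2 > 110 for 2 hours

workorderid: workorder-9
  failuredescription: LT1 value is 120.0 and LT2 value is 130.0 (alarm state)
  failuredatetime: 5/5/2023 16:30
  prioritylevel: High
  closuredate: 5/10/2023 2:33
  timedelta in days: 4
  Failure: NO
  remark: LT2 > 110 for < 2 hours

workorderid: workorder-10
  failuredescription: LT2 value is 130.0 (alarm state)
  failuredatetime: 5/12/2023 15:50
  prioritylevel: High
  closuredate: 5/16/2023 5:22
  timedelta in days: 3
  Failure: NO
  remark: LT2 > 110 for < 2 hours

workorderid: workorder-15
  failuredescription: LT2 value is 110.0 (warning state)
  failuredatetime: 4/6/2024 13:55
  prioritylevel: Medium
  closuredate: 4/10/2024 21:23
  timedelta in days: 4
  Failure: NO
  remark: LT2 > 110 for < 2 hours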
Failure output for LT2
The model catches 3 failure points (with time relaxation) and misses one.
Sensitivity: 75% (3/4) {TP/(TP+FN)}
Precision: 25% (3/12) {TP/(TP+FP)}
{TP+TN+FP+FN = 73 ≈ 26304/(24*15)}
Accuracy: 86% (63/73) {(TP+TN)/(TP+TN+FP+FN)}
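The three metrics can be recomputed from the confusion counts implied by the fractions above. TP = 3 and FN = 1 come from 3/4, and FP = 9 from 3/12. TN = 60 is not stated on the slide; it is inferred here as 73 - 3 - 1 - 9. The function name is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, precision and accuracy from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
    }


# TN is inferred from the total of 73 windows, not quoted from the slide.
metrics = classification_metrics(tp=3, fn=1, fp=9, tn=60)
```

With these counts the accuracy is 63/73, which agrees with the 86% quoted above.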
Confidence Level Definition and Plot
Definition:
VL: None of the conditions meets the failure criteria
L: At least one condition meets the failure criteria
M: At least two conditions meet the failure criteria
H: An L or M event occurring consecutively over the past 12 hours
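One possible reading of these definitions is sketched below. It assumes the conditions are evaluated once per hour, that "consecutively in past 12 hours" means every one of the last 12 hourly evaluations was L or M, and that L applies when exactly one condition is met. The function names are hypothetical.

```python
def base_level(conditions_met):
    """Level from the number of failure conditions met in a single hour."""
    if conditions_met <= 0:
        return "VL"
    if conditions_met == 1:
        return "L"
    return "M"


def confidence_level(history):
    """Level for the latest hour.

    history -- number of conditions met per hour, oldest first
    Assumption: H requires all of the last 12 hourly values to be L or M.
    """
    if not history:
        raise ValueError("history must contain at least one hour")
    window = history[-12:]
    if len(window) == 12 and all(n >= 1 for n in window):
        return "H"
    return base_level(history[-1])


latest = confidence_level([1] * 12)
```

Twelve consecutive hours with at least one condition met yield H; a single hour with no condition met inside the window drops the result back to the level of the latest hour.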
Level   Colour   Confidence
VL      White    8.0%
VL      Ash      16.0%
VL      Purple   24.0%
L       White    32.0%
L       Ash      40.0%
L       Purple   48.0%
M       White    56.0%
M       Ash      64.0%
M       Purple   72.0%
H       White    80.0%
H       Ash      88.0%
H       Purple   96.0%
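The confidence values in the table rise in equal steps of 8% through the level and colour combinations, so the table can be reproduced with a small lookup. This is a sketch of that pattern only; the function name is hypothetical and the slides do not state how the values were derived.

```python
LEVELS = ["VL", "L", "M", "H"]        # order used in the table
COLOURS = ["White", "Ash", "Purple"]  # order used within each level
STEP_PERCENT = 8.0                    # spacing between consecutive table rows


def confidence_percent(level, colour):
    """Confidence for a (level, colour) pair, matching the table row by row."""
    rank = LEVELS.index(level) * len(COLOURS) + COLOURS.index(colour) + 1
    return rank * STEP_PERCENT


table = [(lvl, col, confidence_percent(lvl, col))
         for lvl in LEVELS for col in COLOURS]
```

The generated rows run from (VL, White, 8.0) to (H, Purple, 96.0), in the same order as the table.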
Model Deployment Strategy
Components:
1. An RDS database that holds the input table
2. Two SageMaker notebook instances (ml.t3.medium and ml.c5.2xlarge)
3. An S3 bucket (to store and access the models and to store the prediction output)
Workflow:
1. Three models (LT1, LT2, vib) are trained on the ml.c5.2xlarge instance and sent to the specified S3 bucket, as shown in the figure. This is done once every 6 months so that the latest model is available for prediction and replaces the previous one. The input is fetched from the RDS database, and all records (e.g. 3.5 years of data) are used to train the models.
2. A second instance, Scanner (ml.t3.medium), runs daily. It accesses and loads the three models saved in the bucket to generate the predictions.
3. The Scanner uses the models to generate predictions, taking the same RDS table as input (data of the last one month). The prediction covers 10 days * 24 hours (240 prediction records) and is saved to the S3 bucket under the particular run date, as shown in the figure.
4. On every run, the predictions for the run date are also written to another table in the RDS database, which feeds the API for the dashboard.
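The size of one daily prediction run can be illustrated with the short sketch below. The 10 days * 24 hours horizon is from the workflow above; everything else is an assumption made for illustration: the horizon is taken to start at midnight of the run date, and the object key layout and function names are hypothetical, not the project's actual naming.

```python
from datetime import date, datetime, timedelta

FORECAST_DAYS = 10  # from the workflow: 10 days * 24 hours


def forecast_timestamps(run_date):
    """Hourly timestamps covered by one run (assumed to start at midnight)."""
    start = datetime(run_date.year, run_date.month, run_date.day)
    return [start + timedelta(hours=h) for h in range(FORECAST_DAYS * 24)]


def output_key(run_date, parameter):
    """Hypothetical S3 object key that groups the output by run date."""
    return f"predictions/{run_date.isoformat()}/{parameter}.csv"


run_date = date(2025, 2, 25)
timestamps = forecast_timestamps(run_date)
keys = [output_key(run_date, name) for name in ("LT1", "LT2", "vib")]
```

Each run therefore produces 240 hourly records per model, stored under a key that identifies the run date.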
Duration each instance will run:
1. ml.c5.2xlarge: runs once every 6 months to train the latest model, for approximately 3 hours.
2. ml.t3.medium: runs once daily to predict 10 days' worth of data, for approximately 5-10 minutes.
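The run durations above translate into yearly instance hours as follows. This is simple arithmetic on the quoted figures, assuming two training runs and 365 scanner runs per year; the function name is hypothetical and no pricing is implied.

```python
def annual_hours(runs_per_year, minutes_per_run):
    """Total instance hours per year for a job of fixed duration."""
    return runs_per_year * minutes_per_run / 60.0


# Training: once every 6 months, about 3 hours (180 minutes) per run.
training_hours = annual_hours(runs_per_year=2, minutes_per_run=180)

# Scanner: once daily, about 5 to 10 minutes per run.
scanner_hours_low = annual_hours(runs_per_year=365, minutes_per_run=5)
scanner_hours_high = annual_hours(runs_per_year=365, minutes_per_run=10)
```

Under these assumptions the training instance runs about 6 hours per year and the scanner roughly 30 to 61 hours per year.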
Appendix
Predictive Analysis: Primary Factor for the Failure
1. Alarm Distribution
2. Buchholz Relay
3. Temperature (mean deviation)
4. Cost Impact Analysis (details are in the next slides)
5. Forecast (10 days) (from ML analysis)
Work in progress (WIP)
Asset Cost Impact Analysis (ACIA): Transformer
Input: Cost impact
1. Direct Costs:
1.1 Repair/replacement cost (50-70%)
1.2 Diagnostics cost (5-15%)
1.3 Labor cost (15-30%)
2. Indirect Costs:
2.1 Downtime cost (40-60%)
2.2 Operational cost (10-20%)
2.3 Revenue loss (20-40%)
2.4 Penalties (5-15%)
2.5 HSE cost (5-10%)
Asset Cost Impact Analysis Based on Work Order History
Total cost under various heads, comprising the failure and maintenance activity planned for a quarter
Expected Cost Over 3 Months (E[C])
E[C] = E[DC] + E[IC]
Where:
E[DC] = Expected direct cost
E[IC] = Expected indirect cost
Expected Direct Cost (E[DC])
Direct costs apply in all cases (whether the transformer trips or not), so we break them into two components:
E[DC] = C_sch + P_trip × C_unsch
Where:
C_sch = Cost of scheduled maintenance over 3 months (1-2 times per quarter)
P_trip = Probability of an unscheduled failure (from the ML model)
C_unsch = Cost of unscheduled repair/replacement
Expected Indirect Cost (E[IC])
Indirect costs apply only if the transformer trips.
E[IC] = P_trip × C_ind
Where:
C_ind = Indirect cost of transformer failure (lost revenue, operational losses, etc.)
C_sch = Cost (Repair/replacement, Diagnostics, Labor)
C_unsch = Cost (NIL)
C_ind = Cost (Downtime, Operational, Revenue Loss, Penalties, HSE Cost)
Final Formula
E[C] = C_sch + P_trip × (C_unsch + C_ind)
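The final formula can be evaluated directly once the cost inputs and the trip probability are known. The sketch below sums the direct and indirect terms exactly as defined above; the function name and the example figures are hypothetical and stand in for values that would come from the work order history and the ML model.

```python
def expected_cost(c_sch, p_trip, c_unsch, c_ind):
    """Expected 3-month cost: C_sch + P_trip * (C_unsch + C_ind)."""
    if not 0.0 <= p_trip <= 1.0:
        raise ValueError("p_trip must be a probability between 0 and 1")
    expected_direct = c_sch + p_trip * c_unsch   # E[DC]
    expected_indirect = p_trip * c_ind           # E[IC]
    return expected_direct + expected_indirect


# Hypothetical inputs in arbitrary cost units; C_unsch is NIL as stated above.
quarterly_cost = expected_cost(c_sch=100.0, p_trip=0.2, c_unsch=0.0, c_ind=500.0)
```

With these illustrative numbers the scheduled cost of 100 is joined by an expected indirect cost of 0.2 × 500 = 100, for an expected quarterly cost of 200.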