Predictive Maintenance ML: Low Performance Despite Preprocessing — Suspecting Structural Issues (Time Leakage, Feature Engineering, Imbalance Handling) #191278

Hari-2782 · 2026-04-01T06:35:54Z

🏷️ Discussion Type

Question

Body

Harisanth_vimalarjah_Technical Challenge 2.ipynb
predictive_maintenance_dirty_dataset_3000-45-15-03-2026-2202.csv
predictive_maintenance_improved_dataset_3500-1.xlsx
I am working on a predictive maintenance task using sensor data to predict binary failure (0/1). I have applied preprocessing (outlier handling, encoding, imputation), handled class imbalance using SMOTETomek and class weighting, and trained a tree-based model (XGBoost). I also engineered basic time features (hour, day, month), but used a random train-test split.

Despite this, my model struggles to reliably detect failures.

Could the main issue be that I am incorrectly treating this as a standard tabular classification problem instead of a time-dependent problem? Specifically:

- Should I be using a time-based split instead of random splitting to avoid leakage?
- Do I need temporal feature engineering (lag features, rolling statistics) to capture failure patterns?
- Is my imbalance handling approach (SMOTETomek + class weighting) potentially harming performance?

I would appreciate guidance on whether my pipeline has structural flaws and how to properly approach failure prediction in this type of sensor dataset.

syedsafeer · 2026-04-01T07:15:06Z

Hi @Hari-2782,
To put it simply: your data is like a movie, but you’re treating it like a pile of random photos.
The main problem is the random split. Sensor data is sequential, so the order of the readings matters. When you shuffle it, rows recorded after the test rows end up in the training set, and the model 'peeks' into the future during training. This is why it can look okay in evaluation but fail in reality.

The Fix:

- Stop shuffling: use the first 70% of your history to train and the last 30% to test.
- Add 'yesterday's' data: don't just look at the sensor right now. Let the model see the average of the last few hours.

Fix the time split first, and you’ll see the real performance of your model.
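
A minimal sketch of the 70/30 chronological split described above, assuming the data sits in a pandas DataFrame with a timestamp column. The column names (`timestamp`, `temperature`, `failure`) and the helper `chronological_split` are made up for illustration and are not taken from the attached notebook or dataset.

```python
import numpy as np
import pandas as pd


def chronological_split(df, time_col, train_frac=0.7):
    """Sort by time and cut once: earliest rows train, latest rows test."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Synthetic stand-in for the sensor log (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2026-01-01", periods=100, freq="h"),
    "temperature": rng.normal(70.0, 5.0, size=100),
    "failure": (rng.random(100) < 0.1).astype(int),
})

# Shuffle the rows to show that the split does not depend on file order.
df = df.sample(frac=1.0, random_state=0)

train, test = chronological_split(df, "timestamp")

# Every training row is strictly older than every test row.
print(train["timestamp"].max(), "<", test["timestamp"].min())
```

If the dataset has several machines, it is probably safer to cut all of them at the same point in time rather than splitting each one separately, so that no training row is newer than any test row.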

WilliamRossCrane · 2026-04-01T09:56:57Z

Yes, the main issue likely comes from treating this as a standard tabular classification problem rather than a time-dependent predictive maintenance problem. A few key points to consider:

Time-based splitting
- Using a random train-test split can cause data leakage, because future information may end up in the training set.
- Use a time-based split where all training data occurs before test data. This better simulates real-world failure prediction.
Temporal feature engineering
- Basic time features (hour, day, month) are not enough. Failures often depend on recent sensor trends.
- Add lag features, rolling means, rolling standard deviations, or exponentially weighted averages to capture temporal patterns.
- Consider window-based aggregations (e.g., last 5 readings) to give the model context.
Class imbalance handling
- SMOTETomek and class weighting can help, but if resampling is applied before the split, synthetic samples are built from rows that later end up in the test set, which leaks future information into training.
- You might be better off using only class weighting, or resampling only within the training period, to avoid leakage.
Pipeline structure
- Ensure preprocessing (outlier handling, encoding, imputation) is fit only on training data and applied to test data.
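- Sequence matters: feature engineering (from past values only) → time-based train-test split → fit preprocessing on the training set → resampling/class weighting on the training set only → model training.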
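
The points above can be sketched end to end as follows. This is only an illustration on synthetic data: the column names (`machine_id`, `timestamp`, `vibration`, `failure`) are assumptions, median imputation stands in for whatever preprocessing the notebook actually uses, and the window size of 5 is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; all column names are hypothetical.
rng = np.random.default_rng(42)
n_machines, n_steps = 3, 200
timestamps = pd.date_range("2026-01-01", periods=n_steps, freq="h")
df = pd.DataFrame({
    "machine_id": np.repeat(np.arange(n_machines), n_steps),
    "timestamp": list(timestamps) * n_machines,
    "vibration": rng.normal(1.0, 0.2, size=n_machines * n_steps),
    "failure": (rng.random(n_machines * n_steps) < 0.05).astype(int),
})
# Knock out a few readings so the imputation step has something to do.
df.loc[rng.choice(len(df), size=30, replace=False), "vibration"] = np.nan

# 1. Temporal features, computed per machine and looking backward only.
df = df.sort_values(["machine_id", "timestamp"]).reset_index(drop=True)
grouped = df.groupby("machine_id")["vibration"]
df["vib_lag1"] = grouped.shift(1)  # previous reading
df["vib_roll_mean5"] = grouped.transform(
    lambda s: s.rolling(5, min_periods=1).mean()  # current + previous 4
)
df["vib_roll_std5"] = grouped.transform(
    lambda s: s.rolling(5, min_periods=2).std()
)

# 2. Time-based split: one cutoff timestamp shared by all machines.
times = df["timestamp"].drop_duplicates().sort_values()
cutoff = times.iloc[int(len(times) * 0.7)]
train = df[df["timestamp"] < cutoff].copy()
test = df[df["timestamp"] >= cutoff].copy()

# 3. Preprocessing statistics come from the training period only.
feature_cols = ["vibration", "vib_lag1", "vib_roll_mean5", "vib_roll_std5"]
train_medians = train[feature_cols].median()
X_train = train[feature_cols].fillna(train_medians)
X_test = test[feature_cols].fillna(train_medians)
y_train, y_test = train["failure"], test["failure"]

# 4. Class weight computed from the training labels only.
n_pos = int(y_train.sum())
n_neg = len(y_train) - n_pos
scale_pos_weight = n_neg / max(n_pos, 1)
print(f"cutoff={cutoff}, scale_pos_weight={scale_pos_weight:.1f}")
```

The resulting ratio is the kind of value one would typically pass to XGBoost's `scale_pos_weight` parameter. If SMOTETomek is kept instead, the same ordering applies: call its `fit_resample` on `X_train` and `y_train` only, and leave the test period untouched.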

✅ Takeaway: Predictive maintenance is temporal by nature. Treat your dataset as a time series rather than tabular, use a time-based split, engineer temporal features, and be careful with imbalance techniques to avoid leakage.

Mark this as answered please :D
