Data Leakage
⭐️ Occurs when a model is trained using information that would not be available at prediction time,
leading to good training performance but poor real‑world 🌎 performance.
The model is essentially ‘cheating’ by inadvertently accessing information about the target variable.
👉Any information from the validation/test set must NOT influence training, directly or indirectly.
❓So, how do we prevent information from the validation or test set from leaking into training?
Train-Test Contamination
- ❌ Wrong: Applying preprocessing (like a global StandardScaler, mean imputation, target encoding, etc.) on the entire dataset before splitting.
- ✅ Right: Compute statistics (mean, variance, encoding maps, etc.) only on the training data, then apply the same fitted transformation to validation and test data.
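A minimal sketch of the right approach, using synthetic data for illustration: split first, then fit the scaler on the training set only and reuse its statistics on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic toy data (illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Split FIRST, then fit preprocessing on the training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from train only
X_test_scaled = scaler.transform(X_test)        # reuse the SAME train statistics
```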
Preventing Leakage in Cross-Validation:
- ❌ Wrong: Perform preprocessing (e.g., scaling, normalization, missing value imputation) on the entire dataset before passing it to cross_val_score.
- ✅ Right: Use sklearn.pipeline.Pipeline; within each CV iteration, the pipeline fits every transformer on the training fold only, so the validation fold remains unseen until the fitted transformation is applied to it.
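A short sketch of this pattern with synthetic data: the imputer and scaler sit inside the pipeline, so cross_val_score re-fits them on each training fold rather than on the full dataset.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with some missing values (illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
X[rng.rand(200, 5) < 0.1] = np.nan
y = rng.randint(0, 2, size=200)

# Imputation and scaling are re-fitted inside every CV training fold,
# so no statistics from a validation fold ever influence training.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
```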
Temporal Leakage
This happens in Time Series ⏰ data.
- ❌ Wrong: Use standard random CV; it allows the model to ‘peek into the future’.
- ✅ Right: Use Time-Series Nested Cross-Validation (Forward Chaining) instead of random shuffling.
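A minimal sketch of forward chaining using scikit-learn's TimeSeriesSplit: each training window contains only rows that come strictly before the corresponding test window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Rows are assumed to be in chronological order (illustration only)
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Every test index comes strictly after every train index,
    # so the model can never 'peek into the future'.
    assert train_idx.max() < test_idx.min()
```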
Target Leakage
- ❌ Wrong: Include features that are only available after the event we are trying to predict and are proxy for the target.
- e.g. Including number_of_late_payments in a model that predicts whether a loan applicant will default; late payments are only recorded after the loan is issued, so the feature is a proxy for the outcome.
- ✅ Right: Do not include such features during training.
Group Leakage:
- ❌ Wrong: Splitting correlated rows (e.g., multiple records from the same patient or user) so that some land in Train and others in Test.
- ✅ Right: Use GroupKFold to ensure all data from a specific group stays together in one fold.
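A minimal sketch with GroupKFold on synthetic data: every record from a given group lands entirely in either the training or the test split, never both.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic data: 4 'patients' with 3 records each (illustration only)
X = np.arange(12).reshape(-1, 1)
y = np.tile([0, 1], 6)
groups = np.repeat([0, 1, 2, 3], 3)

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups):
    # No group appears in both the train and the test split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```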