Data Leakage


⭐️ Occurs when a model is trained using information that would not be available at prediction time in the real world 🌎, leading to good training performance but poor real-world performance.
It is essentially the model ‘cheating’ by inadvertently accessing information about the target variable.

👉Any information from the validation/test set must NOT influence training, directly or indirectly.
❓So, how do we prevent information from leaking out of the validation/test set and into training?

Train-Test Contamination
  • Wrong: Applying preprocessing (like a global StandardScaler, mean imputation, target encoding, etc.) to the entire dataset before splitting.
  • Right: Compute the mean, variance, and other statistics on the training data only, then reuse those same statistics to transform the validation and test data.
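As a minimal sketch (the toy data here is purely illustrative), the leakage-free version fits the scaler on the training split only and reuses its statistics on the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data, assumed for illustration only.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Right: fit the scaler on the training split only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ...then transform the test split with the *training* mean/variance.
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on all of `X` before splitting would bake the test rows' mean and variance into the transformation, which is exactly the contamination described above.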

Preventing Leakage in Cross-Validation:

  • Wrong: Perform preprocessing (e.g., scaling, normalization, missing value imputation) on the entire dataset before passing it to cross_val_score.
  • Right: Use sklearn.pipeline.Pipeline; within each CV iteration, the pipeline fits its transformers on the training folds only and applies the fitted parameters to the validation fold, so the validation fold stays unseen during fitting.
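A minimal sketch of the right approach (the synthetic data and the choice of LogisticRegression are assumptions for illustration): the scaler lives inside the pipeline, so cross_val_score refits it on each set of training folds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] > 0).astype(int)

# Scaling happens *inside* the pipeline, so each CV iteration
# fits the scaler on its training folds only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling `X` once up front and then calling `cross_val_score` on the scaled array would leak each validation fold's statistics into training.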
Temporal Leakage

This happens with time-series ⏰ data, where the order of observations matters.

  • Wrong: Using standard random (shuffled) CV; it allows the model to ‘peek into the future’.
  • Right: Use time-series cross-validation with forward chaining (expanding-window splits) instead of random shuffling.
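One way to sketch forward chaining is sklearn's TimeSeriesSplit (the 12-sample array is an assumption for illustration): each split trains on an expanding prefix and validates on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations, assumed for illustration only.
X = np.arange(12).reshape(12, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index,
    # so the model never sees the future.
    assert train_idx.max() < test_idx.min()
```

Random shuffling, by contrast, would routinely place later observations in the training folds and earlier ones in validation.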
Target Leakage
  • Wrong: Including features that only become available after the event we are trying to predict, making them proxies for the target.
    • e.g. Including number_of_late_payments in a model that predicts whether a loan applicant will default: late payments are only recorded after the loan has been granted.
  • Right: Exclude such features during training (they will not exist at prediction time anyway).

Group Leakage:

  • Wrong: Splitting randomly when multiple rows are correlated (e.g., several records from the same user).
    • For the same patient or user, some rows end up in Train and others in Test, so the model effectively memorizes individuals.
  • Right: Use GroupKFold to ensure all data from a specific group stays together in one fold.
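A minimal sketch with GroupKFold (the group IDs standing in for patient IDs are an assumption for illustration): every split keeps all rows of a group on one side of the boundary.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Two rows per "patient"; group IDs are illustrative only.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1] * 4)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups):
    # No group appears in both train and test.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

A plain KFold on the same data could put one of a patient's rows in Train and the other in Test, which is the group leakage described above.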
