Interview Questions

Machine Learning Interview Questions
Question
How would you evaluate a model for an imbalanced classification problem?
Which metrics would you report and why?
Answer

We should evaluate an imbalanced classification model using metrics that focus on per-class performance, especially on the minority class.

Why?
Say we have a highly imbalanced dataset, i.e., 99% of the data belongs to the positive class and only 1% to the negative class.
In such a case, a standard metric such as accuracy is misleading, because a model can achieve 99% accuracy
by simply predicting the positive class all the time.

So, what should we do?
First of all, start with the confusion matrix (focusing on the minority class).
It provides the raw counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). This is the foundation for all other metrics.

Confusion Matrix:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
  • Precision: Of all instances the model predicted as positive, how many were actually positive?
    • \( Precision = \frac{TP}{TP + FP} \)
    • Use Case: The cost of a False Positive is high (e.g., marking a legitimate email as spam).
  • Recall (Sensitivity): Of all actual positive instances, how many did the model find?
    • \( Recall = \frac{TP}{TP + FN} \)
    • Use Case: The cost of a False Negative is high (e.g., missing a cancer diagnosis or fraud transaction).
  • F1-Score: Harmonic mean of precision and recall.
    • \( F1\text{-}Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}\)
    • Why report F1-Score?: To balance precision and recall. A model with 1.0 precision and 0.0 recall will have an F1-score of 0.
  • Precision-Recall (PR) AUC: Plots Precision against Recall for different classification thresholds.
    Better than ROC curve because it uses Precision instead of False Positive Rate (FPR), which can be misleading for imbalanced data.
    • \(FPR = \frac{FP}{FP + TN}\)
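The metrics above can be computed directly with scikit-learn; a minimal sketch (the tiny label and score arrays below are made up for illustration):

```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, average_precision_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # imbalanced: 3 positives, 7 negatives
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # hard predictions
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.2, 0.1, 0.05]  # predicted probabilities

# Raw counts: the foundation for all other metrics
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                            # 2 1 1 6

print(precision_score(y_true, y_pred))           # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))              # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred))                  # harmonic mean = 2/3
print(average_precision_score(y_true, y_score))  # PR-AUC summary (uses scores, not hard labels)
```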

Read more about Performance Metrics

Question
Why might ROC-AUC be misleading for imbalanced classes?
Answer

The ROC curve plots TPR vs. FPR, where \(FPR = \frac{FP}{FP + TN}\); for imbalanced data, the large TN count keeps the FPR artificially low, which can be misleading.
For imbalanced data, it is therefore better to use the Precision-Recall curve, which uses Precision instead of FPR and is more reliable.
Let’s look at the fraud detection example below: N = 10,000 transactions, Fraud = 100, NOT fraud = 9,900:

Confusion Matrix:

|                  | Predicted Fraud | Predicted NOT Fraud |
|------------------|-----------------|---------------------|
| Actual Fraud     | 80 (TP)         | 20 (FN)             |
| Actual NOT Fraud | 220 (FP)        | 9680 (TN)           |
\[FPR = \frac{FP}{FP + TN} = \frac{220}{220 + 9680} \approx 0.022\]

\[Precision = \frac{TP}{TP + FP} = \frac{80}{80 + 220} = \frac{80}{300}\approx 0.267\]

The FPR is very low due to the class imbalance, and hence Precision gives us a better view of the model’s performance.
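The arithmetic above can be checked in a few lines (numbers taken straight from the fraud example):

```python
# Counts from the fraud-detection confusion matrix above
TP, FN, FP, TN = 80, 20, 220, 9680

fpr = FP / (FP + TN)         # 220 / 9900 -> looks deceptively good
precision = TP / (TP + FP)   # 80 / 300  -> reveals the real picture
tpr = TP / (TP + FN)         # recall

print(round(fpr, 3), round(precision, 3), tpr)  # 0.022 0.267 0.8
```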

Question
Describe how to avoid data leakage when performing feature engineering and cross-validation.
Answer

👉 Any information from the validation/test set must NOT influence training, directly or indirectly.
So, how do we prevent this leakage of information from the validation or test set into training?

  1. Train-Test Contamination:
  • Wrong: Applying preprocessing (like a global StandardScaler, mean imputation, target encoding, etc.) on the entire dataset before splitting.
  • Right: Compute mean, variance, etc. only on the training data and reuse those statistics for the validation and test data.
  2. Preventing Leakage in Cross-Validation:
  • Wrong: Performing preprocessing (e.g., scaling, normalization, missing-value imputation) on the entire dataset before passing it to cross_val_score.
  • Right: Use sklearn.pipeline.Pipeline; the Pipeline ensures that each validation fold remains unseen until the transformation is applied using the training fold’s parameters.
  3. Time Series Data:
  • Wrong: Using standard random CV; it allows the model to ‘peek into the future’.
  • Right: Use time-series cross-validation (forward chaining) instead of random shuffling.
  4. Target Leakage:
  • Wrong: Including features that are only available after the event we are trying to predict and are a proxy for the target.
    • e.g., including number_of_late_payments in a model to predict whether a loan applicant will default.
  • Right: Do not include such features during training.
  5. Group Leakage:
  • Wrong: Splitting correlated rows (e.g., multiple rows from the same patient or user) so that some land in Train and others in Test.
  • Right: Use GroupKFold to ensure all data from a specific group stays together in one fold.
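A minimal sketch of leak-free cross-validation with a Pipeline, assuming scikit-learn (the synthetic data is made up for illustration): the scaler is re-fit inside each fold, so no statistics from the validation fold leak into training.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

# cross_val_score re-fits the whole pipeline on each training fold,
# then applies the fitted scaler to the held-out fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Compare this with calling `StandardScaler().fit_transform(X)` before `cross_val_score`: that would compute the mean and variance on all rows, including future validation folds, which is exactly the train-test contamination described above.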
Question
Explain bias-variance tradeoff and write the bias and variance decomposition for squared error.
Answer

Bias-Variance Decomposition:
For Mean Squared Error (MSE) = \(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\)

Total Error = Bias^2 + Variance + Irreducible Error

  • Bias = Systematic Error
    • Bias measures how far the average prediction of a model is from the true value.
  • Variance = Sensitivity to Data
    • Variance measures how much the predictions of a model vary for different training datasets.
  • Irreducible Error = Sensor noise, Human randomness
    • Inherent uncertainty in the data generation process itself and cannot be reduced by any model.

Bias-Variance Trade-Off:

  • High Bias (Underfitting): A model with high bias is too simple to capture the underlying patterns in the data
    • e.g., fitting a straight line to curved data.
  • High Variance (Overfitting): A model with high variance is too complex and learns the noise in the training data rather than the true relationship
    • e.g., a high-degree polynomial curve that perfectly fits training data points but performs poorly on new data.

🎯 The goal is to find a balance; a ‘sweet spot’ that minimizes the total error.

🦉 A good model ‘generalizes’ well, i.e., it is neither too simple (high bias) nor too complex (high variance).
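To see the trade-off numerically, here is an illustrative sketch on synthetic data (the quadratic ground truth and the degree choices are assumptions for illustration): an underfit straight line vs. a high-degree polynomial.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 30)
y = x**2 + rng.normal(scale=0.1, size=x.size)   # true curve + noise
x_test = np.linspace(-1, 1, 300)
y_test_true = x_test**2                          # noiseless ground truth

results = {}
for degree in (1, 12):
    coefs = np.polyfit(x, y, degree)             # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test_true) ** 2)
    results[degree] = (train_mse, test_mse)
    print(degree, round(train_mse, 4), round(test_mse, 4))

# Degree 1 (high bias): large error on both sets, since a line cannot
# capture the curvature. Degree 12 (high variance): the training error
# drops sharply because the polynomial also fits the noise.
```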

Question
How does L1 (Lasso) differ from L2 (Ridge) mathematically, and when does Lasso produce sparse solutions?
Answer
  • L1 Regularization:
    • \( \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ J(w) + \lambda_1 \sum_{j=1}^n |w_j| \)
  • L2 Regularization:
    • \( \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}}\ J(w) + \lambda_2 \sum_{j=1}^n w_j^2 \)

L1 regularization produces sparse solutions when the regularization coefficient \(\lambda_1\) is sufficiently high.

  • Because the gradient of the L1 penalty (absolute value function) is a constant, i.e., \(\pm 1\), the weight is reduced by a constant amount at each step, so it reaches exactly 0 in a finite number of steps.
  • Whereas the derivative of the L2 penalty is proportional to the weight (\(2w_j\)); as the weight approaches 0, the gradient also becomes very small, so the weight gets very close to 0 but never exactly 0.
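A quick empirical check with scikit-learn (synthetic data; the alpha value is an arbitrary illustration): only two of the ten features are informative, and Lasso zeroes out the rest while Ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 actually drive the target
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(np.sum(lasso.coef_ == 0))   # Lasso: uninformative coefficients are exactly 0
print(np.sum(ridge.coef_ == 0))   # Ridge: coefficients are small but non-zero
```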
Question
What is heteroscedasticity and how does it affect OLS? How would you test for it?
Answer

💡 Heteroscedasticity = Variance NOT Constant

Note: Linear regression assumes that the data has homoscedasticity (constant variance).

Ordinary Least Squares (OLS) is an unweighted estimator. It treats every data point as equally ‘informative’.

  • Under Homoscedasticity: Every point has the same amount of noise, so giving them equal weight is logical.
  • Under Heteroscedasticity: Some points have very low variance (high certainty) and some have very high variance (lots of noise).

👉 By treating all the points equally, OLS is ‘wasting’ the precision of the low-variance points and being ‘skewed’ by the high-variance points.
This is why OLS is no longer efficient; it does not produce the smallest possible standard errors.
Which means:

  • t-tests become unreliable.
  • p-values become misleading.
  • Confidence intervals are wrong.

👉 OLS is NO longer B.L.U.E. (Best Linear Unbiased Estimator).

  • While the coefficients remain unbiased,
    they are no longer the ‘best’ because there is another estimator (like Weighted Least Squares) that could provide a lower variance.

👉 How to Test for Heteroscedasticity?

  • Visual (Residual Plot):
    • Heteroscedasticity: The points form a ‘fan’ or ‘funnel’ shape, widening or narrowing as values increase.
    • Homoscedasticity: The points look like a random ‘cloud’ with consistent thickness.
  • Breusch–Pagan Test
  • White Test
  • Goldfeld–Quandt Test
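The Breusch–Pagan test can be sketched by hand in a few lines (numpy/scipy only; the synthetic data is constructed so the noise standard deviation grows with x, i.e., deliberately heteroscedastic): regress the squared residuals on the regressors and compare \(n \cdot R^2\) to a chi-squared distribution.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 10, size=n)
y = 2 * x + rng.normal(scale=x, size=n)   # noise std grows with x

X = np.column_stack([np.ones(n), x])

# Step 1: OLS fit, collect residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Step 2: auxiliary regression of squared residuals on the regressors
u2 = resid ** 2
gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
r2 = 1 - np.sum((u2 - X @ gamma) ** 2) / np.sum((u2 - u2.mean()) ** 2)

# Step 3: LM statistic ~ chi-squared with (number of regressors - 1) dof
lm = n * r2
p_value = chi2.sf(lm, df=1)
print(p_value)   # a very small p-value rejects homoscedasticity
```

In practice, `statsmodels.stats.diagnostic.het_breuschpagan` performs the same computation.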
Question
Explain the difference between likelihood and probability; explain how MAP differs from MLE with example.
Answer

Probability vs. Likelihood:
The difference lies in which quantity is fixed and which varies.

  • Probability (Forward View):
    • Quantifies the chance of observing a specific outcome given known, fixed parameters \(\theta\).
  • Likelihood (Backward/Inverse View):
    • The inverse concept, used for inference (working backward from results to causes).
    • It is a function of the parameters \(\theta\) and measures how ‘likely’ a specific set of parameters makes the observed (fixed) data appear.

MLE vs. MAP:
Both help us answer the question:
Which parameter \(\theta\) best explains the data we just saw?

  • Maximum Likelihood Estimation (MLE):
    • MLE believes the data should speak for itself.
    • It asks: ‘Which value of \(\theta\) makes the observed data most probable?’
    • It ignores any outside context or common sense.
  • Maximum A Posteriori (MAP):
    • MAP believes the data is important, but so is prior knowledge.
    • It asks: ‘Given the data AND what we already know about the problem at hand, which value of \(\theta\) is most likely?’

The relationship between them is rooted in Bayes’ Theorem:

\[P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) P(\theta)}{P(\text{data})}\]
  • MLE maximizes only the Likelihood: \(P(\text{data} \mid \theta)\)
  • MAP maximizes the Posterior, which is proportional to \(P(\text{data} \mid \theta)\, P(\theta)\)

Note:

  1. P(data) is a constant (a ‘normalizing factor’), so we ignore it during maximization.
  2. For a prior with uniform distribution where every value is equally likely, MAP becomes MLE.

👉 Coin Toss Example:

  • Data: Toss the coin 3 times, 2H + 1T.
    • MLE: estimate of the probability of heads (\(\theta\))
      • \(\theta_{MLE}\) = 2/3 ≈ 0.67
    • MAP (with a prior belief that the coin is fair):
      • Assume prior: \(\theta \sim \mathrm{Beta}(10,10)\)
      • Posterior = \(\mathrm{Beta}(12,11)\)
      • \(\theta_{MAP}\) = 11/21 ≈ 0.52 (the prior pulls the estimate towards 0.5)
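The coin-toss numbers follow from the closed-form Beta-Bernoulli update: the posterior is Beta(a + heads, b + tails), and the MAP estimate is the Beta mode (a − 1)/(a + b − 2).

```python
heads, tails = 2, 1
a, b = 10, 10                                       # prior Beta(a, b) ~ "coin is fair"

theta_mle = heads / (heads + tails)                 # data only: 2/3
a_post, b_post = a + heads, b + tails               # posterior Beta(12, 11)
theta_map = (a_post - 1) / (a_post + b_post - 2)    # mode of Beta(12, 11) = 11/21

print(round(theta_mle, 2), round(theta_map, 2))     # 0.67 0.52
```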

✅ Use MLE when:

  • large dataset.
  • no reliable prior knowledge.

✅ Use MAP when:

  • small dataset.
  • reliable prior/domain knowledge.

Read more about MLE & MAP

Question
How is entropy defined for a binary split? Derive information gain and show how it is used to choose a decision tree split.
Answer

Entropy (H) is a measure of impurity or randomness in a dataset.

\[H(S)=-\sum _{i=1}^{n}p_{i}\log_2(p_{i})\]

For binary classification, where the outcome is Yes/No, 0/1 etc., entropy will be:

\[H(S) = -p \log_2(p) - (1-p) \log_2(1-p)\]
  • Max Entropy: H(S) = 1, when the classes are split 50/50 (maximum uncertainty).
  • Min Entropy: H(S) = 0 when the set is pure (all examples belong to one class).

Information Gain:
Measures the reduction in entropy (uncertainty) achieved by splitting a dataset based on a specific attribute.

\[ IG=Entropy(Parent)-\left[\frac{N_{left}}{N_{parent}}Entropy(Child_{left})+\frac{N_{right}}{N_{parent}}Entropy(Child_{right})\right] \]

Note: The goal of a decision tree algorithm is to find the split that maximizes information gain, meaning it removes the most uncertainty from the data.

👉 To understand how a Decision Tree selects the ‘best’ root node, let’s use the example below:

The Dataset: “Will they buy the product?”

| ID | Age    | Income | Credit Score | Buy? (Target) |
|----|--------|--------|--------------|---------------|
| 1  | Youth  | High   | Good         | No            |
| 2  | Youth  | High   | Excellent    | No            |
| 3  | Middle | High   | Good         | Yes           |
| 4  | Senior | Medium | Good         | Yes           |
| 5  | Senior | Low    | Good         | Yes           |
| 6  | Senior | Low    | Excellent    | No            |
  1. Calculate parent node’s entropy:
  • P(yes) = P(no) = 3/6 = 0.5
  • \(H(Parent) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = \mathbf{1.0}\)
  2. Evaluate feature ‘Age’:
  • Youth: 2 samples (0 Yes, 2 No). \(H(Youth) = \mathbf{0}\)
  • Middle: 1 sample (1 Yes, 0 No). \(H(Middle) = \mathbf{0}\)
  • Senior: 3 samples (2 Yes, 1 No). \(H(Senior) = -(\frac{2}{3} \log_2 \frac{2}{3} + \frac{1}{3} \log_2 \frac{1}{3}) \approx \mathbf{0.918}\)
  • Weighted Entropy for Age:
    • \((\frac{2}{6} \times 0) + (\frac{1}{6} \times 0) + (\frac{3}{6} \times 0.918) = \mathbf{0.459}\)
  • Information Gain (Age): \(1.0 - 0.459 = \mathbf{0.541}\)
  3. Evaluate feature ‘Income’:
  • High: 3 samples (1 Yes, 2 No). \(H(High) \approx \mathbf{0.918}\)
  • Medium: 1 sample (1 Yes, 0 No). \(H(Medium) = \mathbf{0}\)
  • Low: 2 samples (1 Yes, 1 No). \(H(Low) = \mathbf{1.0}\)
  • Weighted Entropy for Income:
    • \((\frac{3}{6} \times 0.918) + (\frac{1}{6} \times 0) + (\frac{2}{6} \times 1.0) = \mathbf{0.792}\)
  • Information Gain (Income): \(1.0 - 0.792 = \mathbf{0.208}\)
  4. Evaluate feature ‘Credit Score’:
  • Good: 4 samples (3 Yes, 1 No). \(H(Good) \approx \mathbf{0.811}\)
  • Excellent: 2 samples (0 Yes, 2 No). \(H(Excellent) = \mathbf{0}\)
  • Weighted Entropy for Credit Score:
    • \((\frac{4}{6} \times 0.811) + (\frac{2}{6} \times 0) = \mathbf{0.541}\)
  • Information Gain (Credit Score): \(1.0 - 0.541 = \mathbf{0.459}\)
  5. The Decision Tree algorithm compares the information gain for all the features and splits on the feature with the maximum information gain.
  • In our case that is ‘Age’, with IG = 0.541.
  • The algorithm chooses ‘Age’ as the root node.
  • It splits the data into three branches (Youth, Middle, Senior).
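The worked example above can be verified with a short script (child counts are given as [Yes, No] per branch):

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count list, skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
    return entropy(parent_counts) - weighted

parent = [3, 3]   # 3 Yes, 3 No

print(round(info_gain(parent, [[0, 2], [1, 0], [2, 1]]), 3))  # Age          -> 0.541
print(round(info_gain(parent, [[1, 2], [1, 0], [1, 1]]), 3))  # Income       -> 0.208
print(round(info_gain(parent, [[3, 1], [0, 2]]), 3))          # Credit Score -> 0.459
```

‘Age’ indeed yields the highest information gain, matching the hand calculation.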



End of Section