Correlation
Covariance:
It measures the direction of the linear relationship between two variables \(X\) and \(Y\).
For a population:
\(\text{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{x})(y_i - \mu_{y})\)
\(N\) = size of population
\(\mu_{x}\) = population mean of \(X\)
\(\mu_{y}\) = population mean of \(Y\)
For a sample:
\(\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
\(n\) = size of sample
\(\bar{x}\) = sample mean of \(X\)
\(\bar{y}\) = sample mean of \(Y\)
Note: The sample formula has \(n-1\) instead of \(n\) in the denominator to make it an unbiased estimate; this is called Bessel's correction.
If \((x_i - \bar{x})\) and \((y_i - \bar{y})\) have the same sign, then the product is positive (+ve).
If \((x_i - \bar{x})\) and \((y_i - \bar{y})\) have opposite signs, then the product is negative (-ve).
The final value of covariance depends on the sum of the above individual products.
\( \begin{aligned} \text{Cov}(X, Y) &> 0 &&\Rightarrow \text{ } X \text{ and } Y \text{ increase or decrease together} \\ \text{Cov}(X, Y) &= 0 &&\Rightarrow \text{ } \text{No linear relationship} \\ \text{Cov}(X, Y) &< 0 &&\Rightarrow \text{ } \text{If } X \text{ increases, } Y \text{ decreases (and vice versa)} \end{aligned} \)
Limitation:
Covariance is scale-dependent, i.e., the units of X and Y affect its magnitude.
This makes it hard to compare covariances across different datasets.
E.g., the covariance between age and height is NOT on the same scale as the covariance between years of experience and salary, so the two values cannot be compared directly.
Note: It only measures the direction of the relationship, but does NOT give any information about the strength of the relationship.
- \(X = [1, 2, 3] \) and \(Y = [2, 4, 6] \)
Let’s calculate the covariance:
\(\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
\(\bar{x} = 2\) and \(\bar{y} = 4\)
\(\text{Cov}(X, Y) = \frac{1}{3-1}[(1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)]\)
\( = \frac{1}{2}[2+0+2] = 2\)
=> \(\text{Cov}(X, Y) > 0\), i.e., if X increases, Y increases, and vice versa.
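As a quick sanity check, here is a minimal Python sketch (NumPy is my choice here, not something the post prescribes) that reproduces this result both by the formula and with the built-in:

```python
import numpy as np

X = np.array([1, 2, 3])
Y = np.array([2, 4, 6])

# Sample covariance by the formula: (1/(n-1)) * sum((x_i - x_bar)(y_i - y_bar))
n = len(X)
cov_manual = np.sum((X - X.mean()) * (Y - Y.mean())) / (n - 1)

# np.cov also divides by n-1 by default (Bessel's correction)
cov_numpy = np.cov(X, Y)[0, 1]

print(cov_manual, cov_numpy)  # 2.0 2.0
```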
Correlation:
It measures both the strength and direction of the linear relationship between two variables \(X\) and \(Y\).
It is a standardized version of covariance that gives a dimensionless measure of linear relationship.
There are 2 popular ways to calculate correlation coefficient:
- Pearson Correlation Coefficient (r)
- Spearman Rank Correlation Coefficient (\(\rho\))
Pearson Correlation Coefficient (r):
It is a standardized version of covariance and the most widely used measure of correlation.
Assumption: Data is normally distributed.
\(r_{xy} = \frac{\text{Cov}(X, Y)}{\sigma_{x} \sigma_{y}}\)
where \(\sigma_{x}\) and \(\sigma_{y}\) are the standard deviations of \(X\) and \(Y\).
Range of \(r\) is between -1 and 1.
\(r = 1\) => perfect +ve linear relationship between X and Y
\(r = -1\) => perfect -ve linear relationship between X and Y
\(r = 0\) => NO linear relationship between X and Y.
Note: A correlation coefficient of 0.9 means that there is a strong linear relationship between X and Y,
irrespective of their units.
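To see this unit-independence concretely, here is a small illustrative sketch (the height/weight values are made up for the example) showing that rescaling the units changes the covariance but leaves r untouched:

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative data: heights in metres, weights in kg
heights = np.array([1.5, 1.6, 1.7, 1.8, 1.9])
weights = np.array([55.0, 62.0, 66.0, 75.0, 80.0])

r_original, _ = pearsonr(heights, weights)
r_rescaled, _ = pearsonr(heights * 100, weights * 1000)  # cm and grams

# Covariance blows up under the rescaling; r is dimensionless and unchanged
print(np.cov(heights, weights)[0, 1])               # small number
print(np.cov(heights * 100, weights * 1000)[0, 1])  # 100,000x larger
print(np.isclose(r_original, r_rescaled))           # True
```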
- \(X = [1, 2, 3] \) and \(Y = [2, 4, 6] \)
Let’s calculate the covariance:
\(\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
\(\bar{x} = 2\) and \(\bar{y} = 4\)
\(\text{Cov}(X, Y) = \frac{1}{3-1}[(1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)]\)
=> \(\text{Cov}(X, Y) = \frac{1}{2}[2+0+2] = 2\)
Let’s calculate the standard deviation of \(X\) and \(Y\):
\(\sigma_{x} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \)
\(= \sqrt{\frac{1}{3-1}[(1-2)^2 + (2-2)^2 + (3-2)^2]}\)
\(= \sqrt{\frac{1+0+1}{2}} =\sqrt{\frac{2}{2}} = 1 \)
Similarly, we can calculate the standard deviation of \(Y\):
\(\sigma_{y} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2} \)
\(= \sqrt{\frac{1}{3-1}[(2-4)^2 + (4-4)^2 + (6-4)^2]}\)
\(= \sqrt{\frac{4+0+4}{2}} =\sqrt{\frac{8}{2}} = 2 \)
Now, we can calculate the Pearson correlation coefficient (r):
\(r_{xy} = \frac{\text{Cov}(X, Y)}{\sigma_{x} \sigma_{y}}\)
=> \(r_{xy} = \frac{2}{1 \times 2}\)
=> \(r_{xy} = 1\)
Therefore, we can say that there is a perfect +ve linear relationship between X and Y.
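The same calculation in Python (a minimal sketch; SciPy is assumed here as the library, though any standard implementation works):

```python
import numpy as np
from scipy.stats import pearsonr

X = np.array([1, 2, 3])
Y = np.array([2, 4, 6])

# Manual: r = Cov(X, Y) / (sigma_x * sigma_y), with n-1 throughout
r_manual = np.cov(X, Y)[0, 1] / (np.std(X, ddof=1) * np.std(Y, ddof=1))

# SciPy returns the coefficient along with a two-sided p-value
r_scipy, p_value = pearsonr(X, Y)

print(r_manual, r_scipy)  # 1.0 1.0
```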
Spearman Rank Correlation Coefficient (\(\rho\)):
It is a measure of the strength and direction of the monotonic relationship between two ranked variables \(X\) and \(Y\).
It captures monotonic relationships, meaning the variables move in the same or opposite direction, but not necessarily at a linear rate.
- It is used when Pearson's correlation is not suitable, e.g., for ordinal data, or when continuous data does not meet the assumptions of linear methods such as Pearson's correlation.
- Non-parametric measure of correlation that uses ranks instead of raw data.
- Quantifies how well the ranks of one variable predict the ranks of the other variable.
- Range of \(\rho\) is between -1 and 1.
- Compute the correlation of ranks awarded to a group of 5 students by 2 different teachers.
| Student | Teacher A Rank | Teacher B Rank | \(d_i\) | \(d_i^2\) |
|---------|----------------|----------------|---------|-----------|
| S1      | 1              | 2              | -1      | 1         |
| S2      | 2              | 1              | 1       | 1         |
| S3      | 3              | 3              | 0       | 0         |
| S4      | 4              | 5              | -1      | 1         |
| S5      | 5              | 4              | 1       | 1         |
\(\sum_{i}d_i^2 = 4 \)
\( n = 5 \)
\(\rho_{xy} = 1 - \frac{6\sum_{i}d_i^2}{n(n^2-1)}\)
=> \(\rho_{xy} = 1 - \frac{6*4}{5(5^2-1)}\)
=> \(\rho_{xy} = 1 - \frac{24}{5*24}\)
=> \(\rho_{xy} = 1 - \frac{1}{5}\)
=> \(\rho_{xy} = 0.8\)
Therefore, we can say that there is a strong +ve correlation between the ranks given by teacher A and teacher B.
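Verifying with SciPy (a small sketch; spearmanr ranks the data internally, which matches the \(d_i\) formula above when there are no tied ranks):

```python
from scipy.stats import spearmanr

# Ranks awarded to the 5 students by the two teachers
teacher_a = [1, 2, 3, 4, 5]
teacher_b = [2, 1, 3, 5, 4]

# spearmanr computes Pearson's r on the ranks; with no ties this equals
# 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
rho, p_value = spearmanr(teacher_a, teacher_b)
print(rho)  # 0.8
```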
- \(X = [1, 2, 3] \) and \(Y = [1, 8, 27] \)
Here, Spearman's rank correlation coefficient \(\rho\) will be a perfect 1, as there is a monotonic relationship, i.e., as X increases, Y increases and vice versa.
But Pearson's correlation coefficient (r) will be slightly less than 1: \(r \approx 0.9662\).
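A quick sketch comparing the two on this data (again assuming SciPy):

```python
from scipy.stats import pearsonr, spearmanr

X = [1, 2, 3]
Y = [1, 8, 27]  # Y = X^3: monotonic, but not linear

r, _ = pearsonr(X, Y)
rho, _ = spearmanr(X, Y)

print(round(r, 4))  # 0.9662 -> penalized, since the relationship is not linear
print(rho)          # 1.0    -> perfect, since the ranks move together exactly
```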
Correlation Application
Correlation is very useful in feature selection for training machine learning models.
- If 2 features are highly correlated => they provide redundant information.
- One of the features can be removed without significant loss of information.
- Keeping both can cause issues such as multicollinearity.
- If a feature is highly correlated with the target variable => this feature is a strong predictor, so keep it.
- A feature with very low or near-zero correlation with the target variable may be considered for removal, as it has little predictive power (see the sketch below).
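A minimal pandas sketch of this idea (the dataset, feature names, and the 0.9 threshold are all illustrative assumptions, not a prescription):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'rooms' is nearly a rescaling of 'area' (redundant pair);
# 'age' is noise; 'price' is the target.
rng = np.random.default_rng(0)
area = rng.uniform(50, 200, 100)
df = pd.DataFrame({
    "area": area,
    "rooms": area / 25 + rng.normal(0, 0.2, 100),
    "age": rng.uniform(0, 50, 100),
    "price": 3000 * area + rng.normal(0, 20000, 100),
})

corr = df.corr()                    # pairwise Pearson correlation matrix
print(corr["price"].drop("price"))  # feature-vs-target correlations

# Flag highly correlated feature pairs as redundancy candidates
features = ["area", "rooms", "age"]
for i, f1 in enumerate(features):
    for f2 in features[i + 1:]:
        if abs(corr.loc[f1, f2]) > 0.9:
            print(f"{f1} and {f2} are highly correlated -> consider dropping one")
```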
Correlation Vs Causation
Causation means that one variable directly causes the change in another variable, i.e., a direct cause->effect relationship.
Correlation, on the other hand, only means that two variables move together.
- Correlation does NOT imply Causation.
- Correlation simply shows an association between two variables that could be coincidental or due to some third, unobserved, factor.
E.g., election results and the stock market: there may be some correlation between the two, but establishing a clear causal link is difficult.