Pearson's Correlation Coefficient — Formulae, Interpretation [Worked Examples]
by Jack Bodeley on September 14, 2021
Pearson's Coefficient of Correlation, or Pearsonian Correlation Coefficient, is a mathematical measure of the intensity or magnitude of the linear relationship between two variables, as suggested by Karl Pearson (1857 - 1936), a British biometrician and statistician.
It is by far the most widely used method of measuring correlation in practice today.
A correlation coefficient is merely a mathematical relationship; it says nothing about cause and effect.
Correlation is a measure of the degree of relatedness.
It measures the association between two or more variables. When movement in one variable tends to be accompanied by corresponding movements in other variables, they are said to be correlated. Correlation can be positive or negative, it can be linear or curvilinear, and it can also be simple, partial or multiple.
Formulae
$$r_{xy}=\frac{n\sum xy-\sum x\sum y}{\sqrt{n\sum x^2-(\sum x)^2}\times \sqrt{n\sum y^2-(\sum y)^2}}$$
This formula gives the Pearsonian Correlation Coefficient between two variables $x$ and $y$ observed in $n$ pairs. It is usually denoted $r(x,y)$, $r_{xy}$ or simply $r$.
It is a numerical measure of the linear relationship between the two variables.
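As a rough illustration, the raw-sums formula above can be sketched in plain Python. The function name `pearson_r` and the sample data are assumptions made up for this sketch, not part of any library.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r computed from raw sums, term by term as in the formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    # A constant variable makes the denominator zero, so r is undefined there.
    return numerator / denominator

# Made-up data: y is exactly 2x, so the relationship is perfectly linear.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

The raw-sums form is convenient for hand calculation because it avoids computing deviations from the means first.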
This relationship can also be defined by the ratio of the covariance between $x$ and $y$, to the product of the standard deviations of $x$ and $y$.
$$r_{xy}=\frac{Cov(x, y)}{\sigma_x\sigma_y}$$
where, for a bivariate distribution of $n$ paired observations:
$$Cov(x, y)=\frac{\sum(x-\bar{x})(y-\bar{y})}{n}$$
$$\sigma_x=\sqrt\frac{\sum(x-\bar{x})^2}{n}$$
$$\sigma_y=\sqrt\frac{\sum(y-\bar{y})^2}{n}$$
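The covariance form can be sketched the same way. This is a minimal version that divides by $n$ throughout, matching the definitions above; the helper names are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Population covariance: average product of deviations (divides by n)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / len(x)

def std_dev(values):
    """Population standard deviation (divides by n)."""
    v_bar = mean(values)
    return sqrt(sum((v - v_bar) ** 2 for v in values) / len(values))

def pearson_r_cov(x, y):
    """Pearson's r as covariance over the product of standard deviations."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

# Made-up data: y falls by 2 whenever x rises by 1.
print(pearson_r_cov([1, 2, 3, 4], [8, 6, 4, 2]))
```

Because the same divisor $n$ appears in the covariance and in both standard deviations, it cancels, which is why this form agrees with the raw-sums formula.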
Properties
The important properties of Pearsonian Correlation Coefficient are:
- Pearsonian Correlation Coefficient cannot exceed 1 numerically i.e. it always lies between -1 and +1, that is $-1\le r\le1$. Any value of $r$ lying out of these limits is incorrect.
- Pearsonian Correlation Coefficient is independent of the change of origin and scale. Given variables $x$ and $y$, for instance, if these are mathematically transformed to new variables $u$ and $v$ by change of origin and scale i.e. $$u=\frac{x-a}{h}$$ $$and;$$ $$v=\frac{y-b}{k}$$ where $a$, $h$, $b$ and $k$ are constants, $h\gt0$ and $k\gt0$, then the correlation coefficient between $u$ and $v$ is the same, $r_{xy}=r_{uv}$. (If $h$ and $k$ had opposite signs, the sign of $r$ would flip.) This is one of the most important properties of the correlation coefficient and is extremely helpful in the numerical computation of $r$.
- If two variables are independent they are uncorrelated, but the converse need not be true i.e. uncorrelated variables need not be independent. Zero correlation between two variables $x$ and $y$ i.e. $r_{xy}=0$, implies absence of a linear relationship, but they may still be related quadratically, logarithmically or trigonometrically.
- Pearsonian Correlation Coefficient is the geometric mean of the two regression coefficients i.e. $$r_{xy}=\pm\sqrt{b_{xy}\times b_{yx}}$$ Both regression coefficients are either positive or negative, and $r$ takes their common sign.
- The square of a Pearsonian Correlation Coefficient is known as the Coefficient of Determination. It measures the proportion of variation in the dependent variable that is accounted for by the independent variable, which is useful in interpreting the value of $r$.
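Three of the properties above can be checked numerically. The sketch below uses made-up data and hypothetical helper names. It assumes positive scale constants $h$ and $k$, and takes the regression coefficients as $b_{yx}=S_{xy}/S_{xx}$ and $b_{xy}=S_{xy}/S_{yy}$ (the usual least-squares slopes, stated here as an assumption since the article does not define them).

```python
from math import copysign, sqrt

def sums_of_deviations(x, y):
    """Return (Sxx, Syy, Sxy): sums of squared and cross deviations from the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return sxx, syy, sxy

def pearson_r(x, y):
    sxx, syy, sxy = sums_of_deviations(x, y)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative data.
x = [2, 4, 5, 7, 9]
y = [1, 3, 4, 8, 9]

# 1. Change of origin and scale (h > 0, k > 0) leaves r unchanged.
a, h, b, k = 5, 2, 4, 10
u = [(xi - a) / h for xi in x]
v = [(yi - b) / k for yi in y]
r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)

# 2. Uncorrelated does not mean independent: t and t^2 on symmetric values.
t = [-2, -1, 0, 1, 2]
r_sq = pearson_r(t, [ti * ti for ti in t])

# 3. r is the geometric mean of the regression coefficients, with their sign.
sxx, syy, sxy = sums_of_deviations(x, y)
b_yx = sxy / sxx  # slope of the regression of y on x
b_xy = sxy / syy  # slope of the regression of x on y
r_from_slopes = copysign(sqrt(b_yx * b_xy), sxy)

print(r_xy, r_uv)           # should agree up to rounding
print(r_sq)                 # zero although y is fully determined by t
print(r_from_slopes, r_xy)  # should agree up to rounding
```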
Interpretation
Positive values of $r$ indicate positive correlation, negative values of $r$ indicate negative correlation, and $r=0$ indicates absence of linear correlation.
The degree of correlation corresponding to various values of $r$ can be summed up as follows:
Value of $r$ | Degree of Correlation |
---|---|
$\pm1$ | perfect correlation |
$\pm0.9$ | very high correlation |
$\pm0.75$ | sufficiently high correlation |
$\pm0.6$ | moderate correlation |
$\pm0.3$ | possible correlation |
$0$ | absence of correlation |
Probable Error of Correlation Coefficient $(PE_r)$
After obtaining $r$, we want to find out the dependability or reliability of the coefficient.
Probable error of the correlation coefficient ($PE_r$) measures the reliability of an obtained correlation coefficient. It applies when the conditions of random sampling are satisfied, and is computed from the standard error of the coefficient, $SE_r$, as follows:
$$PE_r=0.6745~SE_r$$
$$or;$$
$$PE_r=0.6745\frac{1-r^2}{\sqrt{n}}$$
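In code, the two formulas above amount to a couple of lines. The function names are hypothetical, and the values $r=0.5$, $n=25$ are made up purely for illustration.

```python
from math import sqrt

def standard_error_r(r, n):
    """Standard error of r as used above: (1 - r^2) / sqrt(n)."""
    return (1 - r ** 2) / sqrt(n)

def probable_error_r(r, n):
    """Probable error of r: 0.6745 times the standard error."""
    return 0.6745 * standard_error_r(r, n)

# Made-up illustration: r = 0.5 from a sample of n = 25 pairs.
r, n = 0.5, 25
pe = probable_error_r(r, n)
print(f"SE_r = {standard_error_r(r, n):.4f}, PE_r = {pe:.4f}")
print(f"limits: {r - pe:.4f} to {r + pe:.4f}")
```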
Importance of Probable Error
The probable error $(PE_r)$ is used to determine limits for the population correlation coefficient, namely $r\pm PE_r$. This means that if we take another random sample of size $n$ from the same population, its sample correlation coefficient will fall within these limits with probability 0.5. However, the smaller the sample size, the higher the probability of inaccuracy, so ideally the sample size $n$ should be fairly large.
Interpretation of Probable Error
The interpretation of $r$ based on $(PE_r)$ is as follows:
Value of $r$ | Correlation |
---|---|
$\lt PE_r$ | insignificant correlation |
$\gt 6\times PE_r$ | significant correlation |
If $PE_r$ is very small, correlation is taken to exist where $r\gt0.5$.
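The two-row rule in the table can be sketched as a small helper. Two choices here are assumptions rather than something the table states: the comparison uses $|r|$ so that negative coefficients are handled, and values between $PE_r$ and $6\times PE_r$ are labelled "inconclusive".

```python
from math import sqrt

def probable_error_r(r, n):
    return 0.6745 * (1 - r ** 2) / sqrt(n)

def interpret_by_probable_error(r, n):
    """Classify r against PE_r using the two thresholds in the table."""
    pe = probable_error_r(r, n)
    size = abs(r)  # assumption: compare the magnitude of r
    if size < pe:
        return "insignificant correlation"
    if size > 6 * pe:
        return "significant correlation"
    return "inconclusive"  # assumption: the table gives no label for this band

# Made-up (r, n) pairs for illustration.
for r, n in [(0.1, 9), (0.5, 25), (0.9, 25)]:
    print(r, n, interpret_by_probable_error(r, n))
```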
Worked Examples
Example 1
Find the Pearsonian Correlation Coefficient between volatility and prices of the following 10 stocks and interpret it:
Stock | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Volatility | 50 | 50 | 55 | 60 | 65 | 65 | 65 | 60 | 60 | 50 |
Price | 11 | 13 | 14 | 16 | 16 | 15 | 15 | 14 | 13 | 13 |
Solution 1
Stock | Volatility $(x)$ | Price $(y)$ | $(x-\bar{x})^2$ | $(y-\bar{y})^2$ | $(x-\bar{x})(y-\bar{y})$ |
---|---|---|---|---|---|
1 | 50 | 11 | 64 | 9 | 24 |
2 | 50 | 13 | 64 | 1 | 8 |
3 | 55 | 14 | 9 | 0 | 0 |
4 | 60 | 16 | 4 | 4 | 4 |
5 | 65 | 16 | 49 | 4 | 14 |
6 | 65 | 15 | 49 | 1 | 7 |
7 | 65 | 15 | 49 | 1 | 7 |
8 | 60 | 14 | 4 | 0 | 0 |
9 | 60 | 13 | 4 | 1 | -2 |
10 | 50 | 13 | 64 | 1 | 8 |
Total | 580 | 140 | 360 | 22 | 70 |
$$\bar{x}=\frac{\sum~x}{n}=\frac{580}{10}=58$$
$$\bar{y}=\frac{\sum~y}{n}=\frac{140}{10}=14$$
Pearsonian correlation coefficient:
$$\sigma_x=\sqrt\frac{\sum(x-\bar{x})^2}{n}=\sqrt\frac{360}{10}=6$$
$$\sigma_y=\sqrt\frac{\sum(y-\bar{y})^2}{n}=\sqrt\frac{22}{10}\approx1.4832$$
$$Cov(x, y)=\frac{\sum(x-\bar{x})(y-\bar{y})}{n}=\frac{70}{10}=7$$
$$r_{xy}=\frac{Cov(x, y)}{\sigma_x\sigma_y}=\frac{7}{6\times 1.4832}\approx0.7866$$
Interpretation: There is a high positive correlation between volatility and price.
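The hand calculation can be cross-checked with a short script. This is only a sketch that repeats the same steps on the data from the table: means, column totals, standard deviations, covariance, then $r$.

```python
from math import sqrt

volatility = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
price = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]

n = len(volatility)
x_bar = sum(volatility) / n
y_bar = sum(price) / n

# Column totals of the solution table.
sum_dx2 = sum((x - x_bar) ** 2 for x in volatility)
sum_dy2 = sum((y - y_bar) ** 2 for y in price)
sum_dxdy = sum((x - x_bar) * (y - y_bar) for x, y in zip(volatility, price))

sigma_x = sqrt(sum_dx2 / n)
sigma_y = sqrt(sum_dy2 / n)
cov_xy = sum_dxdy / n
r_xy = cov_xy / (sigma_x * sigma_y)

print(f"means: {x_bar}, {y_bar}")
print(f"totals: {sum_dx2}, {sum_dy2}, {sum_dxdy}")
print(f"r = {r_xy:.4f}")
```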
Example 2
The data on price and quantity of a commodity in a market for 5 months is given below:
Month | Jan | Feb | Mar | Apr | May |
---|---|---|---|---|---|
Price | 10 | 10 | 11 | 12 | 12 |
Quantity | 5 | 6 | 4 | 3 | 3 |
- Find the Pearsonian correlation coefficient between price and quantity and comment on its sign and magnitude.
Solution 2.1
Price $(x)$ | Quantity $(y)$ | $x^2$ | $y^2$ | $xy$ |
---|---|---|---|---|
10 | 5 | 100 | 25 | 50 |
10 | 6 | 100 | 36 | 60 |
11 | 4 | 121 | 16 | 44 |
12 | 3 | 144 | 9 | 36 |
12 | 3 | 144 | 9 | 36 |
Total = 55 | 21 | 609 | 95 | 226 |
Pearsonian correlation coefficient:
$$r_{xy}=\frac{n\sum xy-\sum x\sum y}{\sqrt{n\sum x^2-(\sum x)^2}\times \sqrt{n\sum y^2-(\sum y)^2}}$$
$$r_{xy}=\frac{5(226)-55(21)}{\sqrt{5(609)-(55)^2}\times \sqrt{5(95)-(21)^2}}$$
$$r_{xy}\approx-0.9587$$
Comment: The negative sign of $r$ indicates a negative correlation between price and quantity. The magnitude of 0.9587 indicates a very high negative correlation.
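The same result can be reproduced from the raw sums in a few lines. This is a sketch for checking the arithmetic on the table's data, with variable names chosen here for readability.

```python
from math import sqrt

price = [10, 10, 11, 12, 12]
quantity = [5, 6, 4, 3, 3]

n = len(price)
sum_x, sum_y = sum(price), sum(quantity)
sum_x2 = sum(x * x for x in price)
sum_y2 = sum(y * y for y in quantity)
sum_xy = sum(x * y for x, y in zip(price, quantity))

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r_xy = numerator / denominator

print(f"totals: {sum_x}, {sum_y}, {sum_x2}, {sum_y2}, {sum_xy}")
print(f"r = {r_xy:.4f}")
```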
Example 3
Consider the following series of scores obtained by two teams:
Team | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
A | 45 | 70 | 65 | 30 | 90 | 40 | 50 | 75 | 85 | 60 |
B | 35 | 90 | 70 | 40 | 95 | 40 | 60 | 80 | 80 | 50 |
- Find the Pearsonian correlation coefficient
- Find the probable error and interpret it
Solution 3.1
A $(x)$ | B $(y)$ | $u$ | $v$ | $u^2$ | $v^2$ | $uv$ |
---|---|---|---|---|---|---|
45 | 35 | 1 | -1.5 | 1 | 2.25 | -1.5 |
70 | 90 | 6 | 4 | 36 | 16 | 24 |
65 | 70 | 5 | 2 | 25 | 4 | 10 |
30 | 40 | -2 | -1 | 4 | 1 | 2 |
90 | 95 | 10 | 4.5 | 100 | 20.25 | 45 |
40 | 40 | 0 | -1 | 0 | 1 | 0 |
50 | 60 | 2 | 1 | 4 | 1 | 2 |
75 | 80 | 7 | 3 | 49 | 9 | 21 |
85 | 80 | 9 | 3 | 81 | 9 | 27 |
60 | 50 | 4 | 0 | 16 | 0 | 0 |
Total | | 42 | 14 | 316 | 63.5 | 129.5 |
$$u=\frac{x-a}{h}$$ $$and;$$ $$v=\frac{y-b}{k}$$
If we assume that $a=40$, $h=5$, $b=50$ and $k=10$, then:
$$u=\frac{x-40}{5}$$
$$v=\frac{y-50}{10}$$
Pearsonian correlation coefficient:
$$r_{xy}=r_{uv}=\frac{n\sum uv-\sum u\sum v}{\sqrt{n\sum u^2-(\sum u)^2}\times \sqrt{n\sum v^2-(\sum v)^2}}$$
$$r_{uv}=\frac{10(129.5)-42(14)}{\sqrt{10(316)-(42)^2}\times \sqrt{10(63.5)-(14)^2}}$$
$$r_{uv}=\frac{707}{\sqrt{1396}\times \sqrt{439}}\approx0.9031$$
This is a very high positive correlation.
Solution 3.2
$$PE_r=0.6745\frac{1-r^2}{\sqrt{n}}$$
$$PE_r=0.6745\frac{1-0.9031^2}{\sqrt{10}}$$
$$PE_r\approx0.0393$$
Interpretation: Because $r_{uv}=0.9031$ is greater than six times the $PE_r$ ($6\times0.0393\approx0.2358$), the correlation is significant.
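For readers who want to rerun this example, the sketch below applies the change of origin and scale ($a=40$, $h=5$, $b=50$, $k=10$) to the scores exactly as listed in the problem table. It then computes $r$ on both the transformed and the original values, followed by the probable error. The helper name `pearson_r` is hypothetical.

```python
from math import sqrt

team_a = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]
team_b = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]

def pearson_r(x, y):
    """Pearson's r from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    sum_y2 = sum(q * q for q in y)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / (sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2))

# Change of origin and scale used in the solution.
a, h, b, k = 40, 5, 50, 10
u = [(x - a) / h for x in team_a]
v = [(y - b) / k for y in team_b]

r_uv = pearson_r(u, v)
r_xy = pearson_r(team_a, team_b)  # same value, by the invariance property

n = len(team_a)
pe = 0.6745 * (1 - r_uv ** 2) / sqrt(n)

print(f"sum u = {sum(u)}, sum v = {sum(v)}")
print(f"r_uv = {r_uv:.4f}, r_xy = {r_xy:.4f}")
print(f"PE_r = {pe:.4f}, 6 x PE_r = {6 * pe:.4f}")
print("significant" if abs(r_uv) > 6 * pe else "not shown to be significant")
```

Working with $u$ and $v$ keeps the numbers small enough for hand calculation; the script confirms that nothing is lost by doing so.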