Correlation Tutorial
- Gary Waldman
- Mar 16, 2019
CORRELATION AND LINEAR REGRESSION
There is a mathematically objective way to test the relationship between two quantities, such as the top marginal tax rate and the unemployment rate: one can calculate the correlation coefficient. This number is +1 if the two are perfectly correlated: when one variable increases or decreases, the other always does likewise, in proportion. For values between 0 and +1 the relationship is looser, but the two still tend to move in the same direction. A correlation coefficient of zero means there is no linear relationship between the two variables. It does not by itself prove that they are independent, because the coefficient only measures straight-line association; a variable that depends on another through a symmetric curve can still show zero correlation. Negative values of the correlation coefficient, down to -1, mean that when one variable increases, the other tends to decrease, and vice versa. A correlation of +1 or -1 means that the value of one of the two variables completely determines the value of the other. The correlation coefficient is often symbolized by the letter “r”.
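To make the calculation concrete, here is a minimal Python sketch of the usual (Pearson) correlation coefficient. The function name `pearson_r` and the small data sets are made up for illustration; they are not taken from the tax or unemployment figures mentioned above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (proportional to the covariance).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the two variances).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable never varies")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # moves in exact proportion: r = 1.0
print(pearson_r(x, [10, 8, 6, 4, 2]))  # moves exactly opposite: r = -1.0
print(pearson_r(x, [2, 1, 4, 3, 5]))   # same general direction, looser: r = 0.8
```

The third data set rises overall but not in lockstep, which is why its coefficient lands between 0 and +1.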
The square of the correlation coefficient is called the coefficient of determination (COD): COD = r². This number gives the fraction of the variation (strictly, the variance) in one quantity that could be attributed to a straight-line relationship with the other. If r = 1 or r = -1, COD = 1; if r = 0, COD = 0.
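A few lines of Python show how quickly the COD falls off as r moves away from +1 or -1. The helper name `coefficient_of_determination` is just an illustrative choice.

```python
def coefficient_of_determination(r):
    """COD is the square of the correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    return r * r

for r in (1.0, 0.96, 0.5, 0.0, -0.5):
    cod = coefficient_of_determination(r)
    print(f"r = {r:+.2f}  ->  COD = {cod:.4f}")
```

Note that r = 0.5 and r = -0.5 both give COD = 0.25: the sign of r is lost in squaring, and a correlation that sounds moderate accounts for only a quarter of the variation.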
Linear regression analysis is closely related. This calculation finds the straight line that best fits the data (usually in the least-squares sense) in a graph of one variable versus the other. Such a graph is called a scatterplot. The slope of the regression line has the same sign (+ or -) as the correlation coefficient. When the correlation coefficient is zero, so is the slope of the regression line, and in this case the fitted line tells you nothing about the value of one variable from the value of the other. The closer the magnitude of the correlation coefficient is to unity, the closer the empirical data points are to the regression line in a scatterplot; for a correlation coefficient of +1 or -1, the points all fall exactly on the line.
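The link between the regression line and the correlation coefficient can be checked numerically. The sketch below assumes an ordinary least-squares fit; the names `fit_line` and `r_squared_from_residuals` and the data are invented for illustration.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus the correlation r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx                    # shares its sign with sxy, and so with r
    intercept = mean_y - slope * mean_x  # the line passes through the two means
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

def r_squared_from_residuals(xs, ys, slope, intercept):
    """1 - (scatter about the fitted line) / (total scatter about the mean of y)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up illustrative data: one rising trend, one falling trend.
x = [1, 2, 3, 4, 5]
for y in ([2, 1, 4, 3, 5], [5, 3, 4, 1, 2]):
    slope, intercept, r = fit_line(x, y)
    cod = r_squared_from_residuals(x, y, slope, intercept)
    print(f"slope = {slope:+.2f}, r = {r:+.2f}, r*r = {r * r:.2f}, COD from residuals = {cod:.2f}")
```

In both cases the slope and r carry the same sign, and the COD computed from the scatter of the points about the line agrees with r squared. That is the sense in which a correlation near +1 or -1 means the points hug the line.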
As an example, Figure 1 shows a scatterplot of two highly correlated variables: the annual stock market return over a 22-year period as measured by the Standard & Poor’s 500 index and by the Dow Jones Industrial Average. We would expect a high correlation because both indices are trying to measure the same thing: stock market performance for the year.
In this case r = 0.96 and COD = 0.92. The data points all cluster closely along the regression line. We cannot say that variations in either index cause the variations in the other; instead, both are driven by the actual market performance. A high correlation does not necessarily mean causation, but if there are logical reasons to think that one variable affects the other, then it can be evidence for causation. On the other hand, a zero correlation, or a correlation of the wrong sign, can rule out some presumed causes.
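The point about a shared underlying cause can be illustrated with a small simulation. Everything in it is an assumption made for the sake of the sketch: the "market" returns are random numbers, not real data, and each simulated index is simply the market value plus its own independent measurement noise.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so reruns give the same numbers

# Hypothetical "true" market returns in percent (invented, not real data).
market = [rng.gauss(8.0, 15.0) for _ in range(500)]

# Two indices that each track the market with their own independent error.
# Neither index is computed from the other.
index_a = [m + rng.gauss(0.0, 3.0) for m in market]
index_b = [m + rng.gauss(0.0, 3.0) for m in market]

r = pearson_r(index_a, index_b)
print(f"r between the two simulated indices: {r:.2f}")
print(f"COD: {r * r:.2f}")
```

The two simulated indices come out strongly correlated even though, by construction, neither one influences the other. The correlation arises only because both follow the same underlying quantity.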
FIGURE 1
