Thursday, March 16, 2023

Correlation Coefficient

The correlation coefficient measures the strength and direction of the linear relationship between two variables.  It is usually represented by the Greek letter ρ (rho) for the population and by r for the sample.  The value ranges from -1 to +1.  The sign of the value indicates the direction of the linear relationship.  A positive (+) sign indicates a positive relationship, which can be viewed as an upward sloping line, where an increase in one variable means an increase in the other variable.  A negative (-) sign indicates a negative relationship, which can be viewed as a downward sloping line, where an increase in one variable means a decrease in the other variable and vice versa.

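The formula for the sample correlation coefficient is:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

        where:     x_i, y_i = the paired observations of the two variables

                        \bar{x}, \bar{y} = the sample means of the two variables

                        n = the number of paired observations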

Generally, a correlation coefficient with an absolute value of 0.90 and above is considered high correlation.  An absolute value of 0.50 and below is considered low correlation.  An absolute value between 0.50 and 0.90 is considered medium correlation.
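To make the definition and the high/medium/low bands concrete, here is a minimal Python sketch.  The function names (pearson_r, correlation_strength) and the small data set are made up for illustration; the cut-offs are simply the 0.50 and 0.90 values mentioned above, applied to the absolute value of r.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

def correlation_strength(r):
    """Label r as high, medium or low using the 0.90 and 0.50 cut-offs."""
    size = abs(r)  # the sign only gives the direction
    if size >= 0.90:
        return "high"
    if size > 0.50:
        return "medium"
    return "low"

# Made-up illustrative data, not taken from any real study
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_r(hours, scores)
print(round(r, 4), correlation_strength(r))
```

For this made-up data the coefficient is positive and falls in the medium band, so it would still need the significance test described further down.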

Some things to note about the value of the correlation coefficient:

a)  The share of the variation in one variable that is explained by the other is the square of the correlation coefficient (r^2).  A correlation coefficient of 0.50 therefore means that one variable explains only about 25% of the variation in the other variable;

b)  It takes a correlation coefficient of about 0.70 for one variable to explain about 50% of the variation in the other variable (0.70^2 = 0.49);

c)  For a correlation coefficient to be convincing evidence of a linear relationship between the two variables, a medium correlation needs to be tested for statistical significance using the t-Statistic;

d)  It is often safe to say that a low correlation means "no correlation".  Strictly speaking, no correlation means a correlation value of exactly 0.
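The arithmetic behind notes (a) and (b) is just squaring r, which a short Python sketch can confirm.  The helper names explained_share and r_needed are hypothetical and exist only for this illustration.

```python
import math

def explained_share(r):
    """Share of the variation in one variable explained by the other (r squared)."""
    return r ** 2

def r_needed(share):
    """Size of r required to explain a given share of the variation."""
    if not 0 <= share <= 1:
        raise ValueError("share must be between 0 and 1")
    return math.sqrt(share)

for r in (0.50, 0.70, 0.90):
    print(f"r = {r:.2f} explains about {explained_share(r):.0%} of the variation")

print(f"r needed to explain 50%: {r_needed(0.50):.4f}")
```

Because the explained share grows with the square of r, moving from 0.50 to 0.70 roughly doubles the explained variation even though r itself rises by only 0.20.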


 

t-Statistic of the Correlation Coefficient

The t-Statistic of the Correlation Coefficient is given by the formula:

$$ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

        where:     r = value of the correlation coefficient

                        n = the sample size (number of paired observations)

 

The t-Statistic is used to test the following hypotheses:

a)  Ho (Null Hypothesis) that the population correlation (ρ) is equal to 0 and there is no significant correlation between the two variables under study;

b)  Ha (Alternative Hypothesis) that the population correlation (ρ) is not equal to 0 and that there is a significant correlation between the two variables under study.


As a rule of thumb for samples larger than 60, a t-Statistic with an absolute value of 2 or more means that the correlation is statistically significant, and therefore the Ho (null hypothesis) is rejected in favor of the Ha (alternative hypothesis).  Conversely, a t-Statistic with an absolute value of less than 2 means that the correlation is NOT statistically significant, and therefore the Ho (null hypothesis) cannot be rejected.
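As a rough illustration, the test can be sketched in a few lines of Python.  The function names and the example values (r = 0.30 and r = 0.10 with n = 100) are assumptions chosen for the demonstration, and the cut-off of 2 is the rule of thumb above, not an exact critical value.

```python
import math

def correlation_t_statistic(r, n):
    """t-Statistic of a sample correlation r computed from n paired observations."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    if abs(r) >= 1:
        raise ValueError("r must be strictly between -1 and +1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def is_significant_rule_of_thumb(t, threshold=2.0):
    """Rule of thumb for samples larger than 60: |t| of 2 or more is significant."""
    return abs(t) >= threshold

# Hypothetical example values, not taken from any real data set
for r in (0.30, 0.10):
    t = correlation_t_statistic(r, 100)
    verdict = "reject Ho" if is_significant_rule_of_thumb(t) else "cannot reject Ho"
    print(f"r = {r:.2f}, n = 100, t = {t:.4f}: {verdict}")
```

Note that with 100 observations even a low correlation of 0.30 passes the test, while 0.10 does not.  Statistical significance says that the correlation is unlikely to be zero, not that it is strong.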

For values near this boundary, the reference (critical) t-Statistic can be looked up in the tables found in most college level textbooks.  It can also be obtained using the tinv() function found in most computer spreadsheets, which returns the two-tailed value.  In LibreOffice Calc, the function is entered as =tinv(risk level,degrees of freedom).  The risk level is 1 minus the confidence level, and the degrees of freedom is the sample size minus 2.  To illustrate, for a sample of 200 with a confidence level of 95%, the risk level is 1 - 0.95 = 0.05 and the degrees of freedom is 200 - 2 = 198, so we enter =tinv(0.05,198).  This gives a reference t-Statistic of 1.972017.
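The same reference value can be approximated outside a spreadsheet.  The sketch below uses only the Python standard library: it integrates the t density numerically (Simpson's rule) and then searches for the cut-off by bisection.  This is an illustrative approximation, not how tinv() is implemented, and all function names are hypothetical; in practice a statistics library routine such as scipy.stats.t.ppf would normally be preferred.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= |t|), with the area between 0 and |t| found by Simpson's rule."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return 1.0 - 2.0 * central_area

def t_critical(risk_level, df):
    """Two-tailed reference t value, found by bisection on two_tailed_p."""
    # Assumption: the answer lies below 50, which holds for df >= 2
    # at the usual risk levels of 0.10, 0.05 and 0.01.
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_p(mid, df) > risk_level:
            lo = mid  # tail area still too large, move right
        else:
            hi = mid
    return (lo + hi) / 2

n = 200
critical = t_critical(0.05, n - 2)
print(round(critical, 4))
```

For a sample of 200 at a risk level of 0.05, the result should land very close to the spreadsheet value of 1.972017 quoted above.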

As a note, the t-Statistic of the Correlation Coefficient is equal to the t-Statistic of the slope coefficient in the Simple Regression Equation.  As such, the t-Statistic can be derived from the F-Statistic of the Simple Regression Model, since F = t^2.  The absolute value of the t-Statistic is the square root of the F-Statistic, and its sign is the same as the sign of r.
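The F = t^2 relationship can be checked numerically.  The Python sketch below fits a simple least-squares line to a small made-up data set, computes the F-Statistic from the regression and error sums of squares, and compares it with the square of the correlation t-Statistic.  The function name and the data are assumptions for illustration only.

```python
import math

def regression_f_and_correlation_t(x, y):
    """Return (F-Statistic of the simple regression, t-Statistic of r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    # Least-squares line: y = intercept + slope * x
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Error (residual) and regression sums of squares
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ssr = syy - sse

    # One predictor, so F has 1 and n - 2 degrees of freedom
    f_stat = ssr / (sse / (n - 2))

    # t-Statistic of the correlation coefficient
    r = sxy / math.sqrt(sxx * syy)
    t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
    return f_stat, t_stat

# Made-up illustrative data, not taken from any real study
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

f_stat, t_stat = regression_f_and_correlation_t(x, y)
print(f"F = {f_stat:.4f}, t = {t_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
```

The two printed values of F and t^2 agree up to floating-point rounding, which is what the identity predicts for a regression with a single explanatory variable.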