Wednesday, May 29, 2019

Simple Regression using SQL

A.  Calculating for the Slope

Given the following data in the MariaDB database table named "simreg":

XY
11234
21768
32234
42555
52800

Where X is the independent variable and Y is the dependent variable, the slope can be calculated using the following SQL command:

SELECT (sum((x - (select avg(x) from simreg)) * (y - (select avg(y) from simreg)))) / sum(power(x - (select avg(x) from simreg),2)) AS 'Slope' FROM simreg;

This query produces the following output:

Slope
391.9

The slope value of 391.9 is correct.  We can convert the query into an SQL function as follows:

DELIMITER |
CREATE FUNCTION c_slope() RETURNS DOUBLE
BEGIN
    DECLARE myres DOUBLE;
    SELECT (sum((x - (select avg(x) from simreg)) * (y - (select avg(y) from simreg)))) / sum(power(x - (select avg(x) from simreg),2)) INTO myres FROM simreg;
    RETURN myres;
END |
DELIMITER ;

Such will create the function c_slope() and can be queried using the command:

SELECT c_slope() AS 'Slope';


B.  Calculating for the Intercept

The Intercept can be calculated by issuing the following SQL command:

SELECT avg(y) - (sum((x - (select avg(x) from simreg)) * (y - (select avg(y) from simreg)))) / (sum((x - (select avg(x) from simreg))* (x - (select avg(x) from simreg)))) * avg(x) AS 'Intercept' FROM simreg;

This query produces the following output:

Intercept
942.5

The Intercept value of 942.5 is correct.  Such query can be converted into an SQL function as follows:

DELIMITER |
CREATE FUNCTION c_intercept() RETURNS DOUBLE
BEGIN
     DECLARE myres DOUBLE;
     SELECT avg(y) - (sum((x - (select avg(x) from simreg)) * (y - (select avg(y) from simreg)))) / (sum((x - (select avg(x) from simreg))* (x - (select avg(x) from simreg)))) * avg(x) INTO myres FROM simreg;
     RETURN myres;
END |
DELIMITER ;

Likewise, such will create the function c_intercept() and can be queried using the command:

SELECT c_intercept() AS 'Intercept';


C.  Forecasting

Given the Slope (b) and Intercept (a), the equation of the straight line is "Y = a + b X".  We can use this equation to forecast the value of Y by supplying the value of X.  To execute this in SQL, the command is:

SELECT c_intercept()+c_slope()*x AS 'Y Forecast' FROM simreg;

This query produces the following output:

Y Forecast
1334.4
1726.3
2118.2
2510.1
2902

If we want to have the forecast side by side with the original Y, we can just add the Y into the query as:

SELECT c_intercept()+c_slope()*x AS 'Y Forecast', y AS 'Y Original' FROM simreg;

This will produce the following output:

Y ForecastY Original
1334.41234
1726.31768
2118.22234
2510.12555
29022800

Likewise, if we want to get the forecast error, we can just modify the query a bit to deduct the 'Y Original' with the 'Y Forecast' as follows:

SELECT c_intercept()+c_slope()*x AS 'Y Forecast', y AS 'Y Original', (c_intercept()+c_slope()*x) - y AS 'Y Error' FROM simreg;

This will produce the following output:

Y ForecastY OriginalY Error
1334.41234100.40000000000009
1726.31768-41.700000000000045
2118.22234-115.80000000000018
2510.12555-44.90000000000009
29022800102


D.  Calculating the Mean Absolute Difference

We can calculate the Mean Absolute Difference (MAD) as a means to compare the "Simple Regression Method" with other forecasting methods.  The SQL command to get the MAD is:

SELECT avg(abs((c_intercept()+c_slope()*x) - y)) AS 'MAD' FROM simreg;

Which produces the output:

MAD
80.96000000000008


E.  Calculating the Root Mean Square Error

As an alternative the MAD, we can also get the Root Mean Square Error (RMSE) of the method.  The SQL command is:

SELECT SQRT(AVG(POWER((c_intercept()+c_slope()*x - y),2))) AS 'RMSE' FROM simreg;

Which produces the output:

RMSE
86.77407446927921


F.  Coefficient of Determination

The Coefficient of Determination, also commonly referred to as R-Squared, is the percentage of the variation in the dependent variable (Y) that can be explained by the independent variable (X).  The higher percentage and closer the value to 1, the better.  To be precise, R-Squared is the Regression Sum of Squares (RSS) over the Total Sum of Squares (TSS).

To calculate R-Squared, the following SQL functions needs to be created to facilitate the query.

a) Mean of X function

DELIMITER |
CREATE FUNCTION c_avex() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT avg(x) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;

b) Mean of Y function

DELIMITER |
CREATE FUNCTION c_avey() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT avg(y) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;

c)  Regression Sum of Squares (RSS) function

DELIMITER |
CREATE FUNCTION c_rss() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sum(power(((c_intercept()+c_slope()*x) - c_avey()),2)) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;

d)  Total Sum of Squares (TSS) function

DELIMITER |
CREATE FUNCTION c_tss() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT  sum(power(y-c_avey(),2)) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;


e)  R-Squared function

DELIMITER |
CREATE FUNCTION c_r_squared() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT c_rss()/c_tss() INTO myres; 
return myres;
END |
DELIMITER ;

After creating all these SQL functions, R-Squared can be calculated by issuing the following command:

SELECT c_r_squared() AS 'R-Squared';

This produces the following output:

R-Squared
0.9760733246803284


G.  Correlation Coefficient

The Correlation Coefficient is designated in Statistics by the symbol R and this is for a good reason.  It means that if we have already the Coefficient of Determination, which is designated by R-Squared, we can readily derive R by just squaring the value of the R-Squared.  Thus, we can create the following function in SQL to get R:

DELIMITER |
CREATE FUNCTION c_r() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sqrt( c_r_squared() ) INTO myres; 
return myres;
END |
DELIMITER ;


After creating the c_r() function, we can get the Correlation Coefficient by issuing the command:

SELECT c_r() AS 'Correlation';

This produces the output:

Correlation
0.9879642448615398


H.  Calculating the F Statistic

The F Statistic is the difference between the regression mean square (RMS) and the error mean square (EMS).  Since the RMS is equal to the regression sum of square (RSS), we just need to create a function to derive the EMS as follows:

DELIMITER |
CREATE FUNCTION c_ems() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sum(power(y-(c_intercept()+c_slope()*x),2)) / (count(x)-2) INTO myres FROM simreg; 
return myres;
END |
DELIMITER ;

After this, the function that will create the F Statistic can be created as:


DELIMITER |
CREATE FUNCTION c_f_stat() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT c_rss()/c_ems() INTO myres; 
return myres;
END |
DELIMITER ;

Then, the following command to calculate the F Statistic can be invoke:

SELECT c_f_stat() AS 'F Statistic';

This produces the following output;

F Statistic
122.38319782621953


I.  The Student's t Statistic

In a simple regression equation, the relation between t and F statistic is that the square of the t Statistic is equal to the F Statistic.  Therefore, if we got the F, the t can be derived by just getting the square root of F.  This is undertaken by the following function:

DELIMITER |
CREATE FUNCTION c_t_stat() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sqrt(c_rss()/c_ems()) INTO myres; 
return myres;
END |
DELIMITER ;

The Student's t Statistic can be calculated by issuing the following command:

SELECT c_t_stat() AS 't Statistic';

This produces the following output:

t Statistic
11.062693967846148


J.  Putting it all together

To put it all together, we can create the procedure:

DELIMITER |
CREATE PROCEDURE c_simple_regression()
BEGIN
SELECT c_intercept() as 'Intercept',c_slope() as 'Slope',c_t_stat() as 't Stat',c_r_squared() as 'R Squared',c_r() as 'Correlation', c_f_stat() as 'F Stat';
END |
DELIMITER ;

and the simple regression statistics can be calculated by issuing the command:

CALL c_simple_regression();

which produces the output:

InterceptSlopet StatR SquaredCorrelationF Stat
942.5391.911.0626930.9760730.987964122.383197


Note:

The above-mentioned procedure has been tested in MariaDB 10.3 database.  My guess is, it should also work with MySQL, PostgreSQL and other SQL databases with little modification and with some differences in precision mathematics.