Given the following data in the MariaDB database table named "simreg":
X | Y |
1 | 1234 |
2 | 1768 |
3 | 2234 |
4 | 2555 |
5 | 2800 |
Where X is the independent variable and Y is the dependent variable, the slope can be calculated using the following SQL command:
SELECT (sum((x - (select avg(x) from simreg)) * (y - (select avg(y) from simreg)))) / sum(power(x - (select avg(x) from simreg),2)) AS 'Slope' FROM simreg;
This query produces the following output:
This query produces the following output:
Slope |
391.9 |
The slope value of 391.9 is correct. We can convert the query into an SQL function as follows:
DELIMITER |
CREATE FUNCTION c_slope() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT (sum((x - (select avg(x) from simreg)) * (y - (select avg(y) from simreg)))) / sum(power(x - (select avg(x) from simreg),2)) INTO myres FROM simreg;
RETURN myres;
END |
DELIMITER ;
Such will create the function c_slope() and can be queried using the command:
SELECT c_slope() AS 'Slope';
B. Calculating for the Intercept
The Intercept can be calculated by issuing the following SQL command:
SELECT avg(y) - (sum((x - (select avg(x) from simreg)) * (y - (select avg(y) from simreg)))) / (sum((x - (select avg(x) from simreg))* (x - (select avg(x) from simreg)))) * avg(x) AS 'Intercept' FROM simreg;
B. Calculating for the Intercept
The Intercept can be calculated by issuing the following SQL command:
SELECT avg(y) - (sum((x - (select avg(x) from simreg)) * (y - (select avg(y) from simreg)))) / (sum((x - (select avg(x) from simreg))* (x - (select avg(x) from simreg)))) * avg(x) AS 'Intercept' FROM simreg;
This query produces the following output:
Intercept |
942.5 |
The Intercept value of 942.5 is correct. Such query can be converted into an SQL function as follows:
DELIMITER |
CREATE FUNCTION c_intercept() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT avg(y) - (sum((x - (select avg(x) from simreg)) * (y -
(select avg(y) from simreg)))) / (sum((x - (select avg(x) from simreg))*
(x - (select avg(x) from simreg)))) * avg(x) INTO myres FROM simreg;
RETURN myres;
END |
DELIMITER ;
Likewise, such will create the function c_intercept() and can be queried using the command:
SELECT c_intercept() AS 'Intercept';
C. Forecasting
Given the Slope (b) and Intercept (a), the equation of the straight line is "Y = a + b X". We can use this equation to forecast the value of Y by supplying the value of X. To execute this in SQL, the command is:
SELECT c_intercept()+c_slope()*x AS 'Y Forecast' FROM simreg;
This query produces the following output:
SELECT c_intercept() AS 'Intercept';
C. Forecasting
Given the Slope (b) and Intercept (a), the equation of the straight line is "Y = a + b X". We can use this equation to forecast the value of Y by supplying the value of X. To execute this in SQL, the command is:
SELECT c_intercept()+c_slope()*x AS 'Y Forecast' FROM simreg;
This query produces the following output:
Y Forecast |
1334.4 |
1726.3 |
2118.2 |
2510.1 |
2902 |
If we want to have the forecast side by side with the original Y, we can just add the Y into the query as:
SELECT c_intercept()+c_slope()*x AS 'Y Forecast', y AS 'Y Original' FROM simreg;
This will produce the following output:
Y Forecast | Y Original |
1334.4 | 1234 |
1726.3 | 1768 |
2118.2 | 2234 |
2510.1 | 2555 |
2902 | 2800 |
Likewise, if we want to get the forecast error, we can just modify the query a bit to deduct the 'Y Original' with the 'Y Forecast' as follows:
SELECT c_intercept()+c_slope()*x AS 'Y Forecast', y AS 'Y Original', (c_intercept()+c_slope()*x) - y AS 'Y Error' FROM simreg;
This will produce the following output:
Y Forecast | Y Original | Y Error |
1334.4 | 1234 | 100.40000000000009 |
1726.3 | 1768 | -41.700000000000045 |
2118.2 | 2234 | -115.80000000000018 |
2510.1 | 2555 | -44.90000000000009 |
2902 | 2800 | 102 |
D. Calculating the Mean Absolute Difference
We can calculate the Mean Absolute Difference (MAD) as a means to compare the "Simple Regression Method" with other forecasting methods. The SQL command to get the MAD is:
SELECT avg(abs((c_intercept()+c_slope()*x) - y)) AS 'MAD' FROM simreg;
Which produces the output:
MAD |
80.96000000000008 |
E. Calculating the Root Mean Square Error
As an alternative the MAD, we can also get the Root Mean Square Error (RMSE) of the method. The SQL command is:
SELECT SQRT(AVG(POWER((c_intercept()+c_slope()*x - y),2))) AS 'RMSE' FROM simreg;
Which produces the output:
RMSE |
86.77407446927921 |
F. Coefficient of Determination
The
Coefficient of Determination, also commonly referred to as R-Squared,
is the percentage of the variation in the dependent variable (Y) that
can be explained by the independent variable (X). The higher percentage
and closer the value to 1, the better. To be precise, R-Squared is the
Regression Sum of Squares (RSS) over the Total Sum of Squares (TSS).
To calculate R-Squared, the following SQL functions needs to be created to facilitate the query.
a) Mean of X function
DELIMITER |
CREATE FUNCTION c_avex() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT avg(x) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;
b) Mean of Y function
DELIMITER |
CREATE FUNCTION c_avey() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT avg(y) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;
c) Regression Sum of Squares (RSS) function
DELIMITER |
CREATE FUNCTION c_rss() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sum(power(((c_intercept()+c_slope()*x) - c_avey()),2)) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;
d) Total Sum of Squares (TSS) function
DELIMITER |
CREATE FUNCTION c_tss() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sum(power(y-c_avey(),2)) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;
e) R-Squared function
DELIMITER |
CREATE FUNCTION c_r_squared() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT c_rss()/c_tss() INTO myres;
return myres;
END |
DELIMITER ;
After creating all these SQL functions, R-Squared can be calculated by issuing the following command:
SELECT c_r_squared() AS 'R-Squared';
This produces the following output:
R-Squared |
0.9760733246803284 |
G. Correlation Coefficient
The Correlation Coefficient is designated in Statistics by the symbol R and this is for a good reason. It means that if we have already the Coefficient of Determination, which is designated by R-Squared, we can readily derive R by just squaring the value of the R-Squared. Thus, we can create the following function in SQL to get R:
DELIMITER |
CREATE FUNCTION c_r() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sqrt( c_r_squared() ) INTO myres;
return myres;
END |
DELIMITER ;
After creating the c_r() function, we can get the Correlation Coefficient by issuing the command:
SELECT c_r() AS 'Correlation';
This produces the output:
Correlation |
0.9879642448615398 |
H. Calculating the F Statistic
The F Statistic is the difference between the regression mean square (RMS) and the error mean square (EMS). Since the RMS is equal to the regression sum of square (RSS), we just need to create a function to derive the EMS as follows:
DELIMITER |
CREATE FUNCTION c_ems() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sum(power(y-(c_intercept()+c_slope()*x),2)) / (count(x)-2) INTO myres FROM simreg;
return myres;
END |
DELIMITER ;
After this, the function that will create the F Statistic can be created as:
DELIMITER |
CREATE FUNCTION c_f_stat() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT c_rss()/c_ems() INTO myres;
return myres;
END |
DELIMITER ;
CREATE FUNCTION c_f_stat() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT c_rss()/c_ems() INTO myres;
return myres;
END |
DELIMITER ;
Then, the following command to calculate the F Statistic can be invoke:
SELECT c_f_stat() AS 'F Statistic';
This produces the following output;
F Statistic |
122.38319782621953 |
I. The Student's t Statistic
In a simple regression equation, the relation between t and F statistic is that the square of the t Statistic is equal to the F Statistic. Therefore, if we got the F, the t can be derived by just getting the square root of F. This is undertaken by the following function:
DELIMITER |
CREATE FUNCTION c_t_stat() RETURNS DOUBLE
BEGIN
DECLARE myres DOUBLE;
SELECT sqrt(c_rss()/c_ems()) INTO myres;
return myres;
END |
DELIMITER ;
The Student's t Statistic can be calculated by issuing the following command:
SELECT c_t_stat() AS 't Statistic';
This produces the following output:
t Statistic |
11.062693967846148 |
J. Putting it all together
To put it all together, we can create the procedure:
DELIMITER |
CREATE PROCEDURE c_simple_regression()
BEGIN
SELECT c_intercept() as 'Intercept',c_slope() as 'Slope',c_t_stat() as 't Stat',c_r_squared() as 'R Squared',c_r() as 'Correlation', c_f_stat() as 'F Stat';
END |
DELIMITER ;
and the simple regression statistics can be calculated by issuing the command:
CALL c_simple_regression();
which produces the output:
CALL c_simple_regression();
which produces the output:
Intercept | Slope | t Stat | R Squared | Correlation | F Stat |
942.5 | 391.9 | 11.062693 | 0.976073 | 0.987964 | 122.383197 |
Note:
The above-mentioned
procedure has been tested in MariaDB 10.3 database. My guess is, it
should also work with MySQL, PostgreSQL and other SQL databases with
little modification and with some differences in precision mathematics.