PostgreSQL 聚合函数讲解 - 5 线性回归
首先讲个线性回归分析linear regression (最小二乘法least-squares-fit)的小故事(取自百度) :
1801年,意大利天文学家朱赛普·皮亚齐发现了第一颗小行星谷神星。经过40天的跟踪观测后,由于谷神星运行至太阳背后,使得皮亚齐失去了谷神星的位置。随后全世界的科学家利用皮亚齐的观测数据开始寻找谷神星,但是根据大多数人计算的结果来寻找谷神星都没有结果。时年24岁的高斯也计算了谷神星的轨道。奥地利天文学家海因里希·奥尔伯斯根据高斯计算出来的轨道重新发现了谷神星。
高斯使用的最小二乘法的方法发表于1809年他的着作《天体运动论》中。
法国科学家勒让德于1806年独立发现“最小二乘法”,但因不为世人所知而默默无闻。
勒让德曾与高斯为谁最早创立最小二乘法原理发生争执。
1829年,高斯提供了最小二乘法的优化效果强于其他方法的证明,因此被称为高斯-马尔可夫定理.
上面的故事说明通过已有数据可以对未来的数据进行预测. 但是预测结果是否准确有不确定因素, 所以需要不断的调整和校验.
如何做回归分析呢? (取自百度)
研究一个或多个随机变量Y1 ,Y2 ,…,Yi与另一些变量X1、X2,…,Xk之间的关系的统计方法,又称多重回归分析。通常称Y1,Y2,…,Yi为因变量,X1、X2,…,Xk为自变量。回归分析是一类数学模型,特别当因变量和自变量为线性关系时,它是一种特殊的线性模型。最简单的情形是一个自变量和一个因变量,且它们大体上有线性关系,这叫一元线性回归,即模型为Y=a+bX+ε,这里X是自变量,Y是因变量,ε是随机误差,通常假定随机误差的均值为0,方差为σ^2(σ^2大于0)σ^2与X的值无关。若进一步假定随机误差遵从正态分布,就叫做正态线性模型。一般的情形,它有k个自变量和一个因变量,因变量的值可以分解为两部分:一部分是由于自变量的影响,即表示为自变量的函数,其中函数形式已知,但含一些未知参数;另一部分是由于其他未被考虑的因素和随机性的影响,即随机误差。当函数形式为未知参数的线性函数时,称线性回归分析模型;当函数形式为未知参数的非线性函数时,称为非线性回归分析模型。当自变量的个数大于1时称为多元回归,当因变量个数大于1时称为多重回归。
回归分析的主要内容为:
①从一组数据出发,确定某些变量之间的定量关系式,即建立数学模型并估计其中的未知参数。估计参数的常用方法是最小二乘法。
②对这些关系式的可信程度进行检验。
③在许多自变量共同影响着一个因变量的关系中,判断哪个(或哪些)自变量的影响是显着的,哪些自变量的影响是不显着的,将影响显着的自变量入模型中,而剔除影响不显着的变量,通常用逐步回归、向前回归和向后回归等方法。
④利用所求的关系式对某一生产过程进行预测或控制。回归分析的应用是非常广泛的,统计软件包使各种回归方法计算十分方便。
在回归分析中,把变量分为两类。一类是因变量,它们通常是实际问题中所关心的一类指标,通常用Y表示;而影响因变量取值的的另一类变量称为自变量,用X来表示。
回归分析研究的主要问题是:
(1)确定Y与X间的定量关系表达式,这种表达式称为回归方程;
(2)对求得的回归方程的可信度进行检验;
(3)判断自变量X对因变量Y有无影响;
(4)利用所求得的回归方程进行预测和控制。
一个例子 :
例如,如果要研究质量和用户满意度之间的因果关系,从实践意义上讲,产品质量会影响用户的满意情况,因此设用户满意度为因变量,记为Y;质量为自变量,记为X。根据图8-3的散点图,可以建立下面的线性关系: Y=A+BX+§
式中:A和B为待定参数,A为回归直线的截距;B为回归直线的斜率,表示X变化一个单位时,Y的平均变化情况;§为依赖于用户满意度的随机误差项。
对于经验回归方程: y=0.857+0.836x
回归直线在y轴上的截距为0.857、斜率0.836,即质量每提高一分,用户满意度平均上升0.836分;或者说质量每提高1分对用户满意度的贡献是0.836分。
在PostgreSQL中提供了回归分析的一些聚合函数,
regr_avgx(Y, X) |
double precision | double precision | average of the independent variable (sum(X)/N) |
regr_avgy(Y, X) |
double precision | double precision | average of the dependent variable (sum(Y)/N) |
regr_count(Y, X) |
double precision | bigint | number of input rows in which both expressions are nonnull |
regr_intercept(Y, X) |
double precision | double precision | y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs |
regr_r2(Y, X) |
double precision | double precision | square of the correlation coefficient |
regr_slope(Y, X) |
double precision | double precision | slope of the least-squares-fit linear equation determined by the (X, Y) pairs |
regr_sxx(Y, X) |
double precision | double precision | sum(X^2) - sum(X)^2/N ("sum of squares" of the independent variable) |
regr_sxy(Y, X) |
double precision | double precision | sum(X*Y) - sum(X) * sum(Y)/N ("sum of products" of independent times dependent variable) |
regr_syy(Y, X) |
double precision | double precision | sum(Y^2) - sum(Y)^2/N ("sum of squares" of the dependent variable) |
本文会用到如下几个 :
regr_intercept, 计算截距.
regr_slope, 计算斜率.
regr_r2, 计算相关性, 相关性越高, 说明这组数据用于预估的准确度越高.
下面来举个例子 :
将最近一年的某个业务的日访问量数据统计放到一张测试表.
利用这一年的数据进行一元回归分析, 要预测的是因变量, 用作预测的是自变量. 因为要预测的数据未发生, 所以我们可以把时间交错一下, 就可以作为自变量来使用.
例如下面这组数据, 自变量就是数据下移一行产生的. :
因变量 | 自变量
48000 | 54624
47454 | 48000
56766 | 47454
60488 | 56766
58191 | 60488
57443 | 58191
54277 | 57443
55508 | 54277
52716 | 55508
63748 | 52716
43462 | 63748
44248 | 43462
40145 | 44248
如果影响的因素较多, 需要做多元回归, 你可以选择PostgreSQL R语言插件来分析.
本文做的是一元回归的例子, 接下来进入测试 :
创建一张过去365天的某业务的日下载量数据表.
digoal=> create table test as select row_number() over(order by dt) as rn,cnt from
(select date(createtime) as dt, count(*) as cnt from tbl_app_download where createtime>=( date(now())-366 ) and createtime<( date(now())-1 ) group by date(createtime)) as t;
SELECT 365
数据大概是这样的
digoal=> select * from test;
....
329 | 36293
330 | 40886
331 | 34465
332 | 30785
333 | 33318
334 | 34480
....
接下来我们要来测试在不同数据范围内的回归的线性相关性,
例如最近362天的数据交错后的回归线性相关性.
digoal=> select count(*),regr_r2(t.cnt,test.cnt) from
(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>2;
count | regr_r2
-------+-------------------
362 | 0.835282212765017
(1 row)
但是如果时间放大到最近363天, 相关性就降低到0.32915622582628了
digoal=> select count(*),regr_r2(t.cnt,test.cnt) from
(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>1;
count | regr_r2
-------+------------------
363 | 0.32915622582628
(1 row)
我们要不断的尝试来得到更好的相关性, 当获得最高的相关性(接近1)时, 预测数据最准确.
接下来我们看看以上两个时间段产生的截距和斜率的预测准确度.
截距
digoal=> select count(*),regr_intercept(t.cnt,test.cnt) from
(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>2;
count | regr_intercept
-------+------------------
362 | 6274.25023499543
(1 row)
斜率
digoal=> select count(*),regr_slope(t.cnt,test.cnt) from
(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>2;
count | regr_slope
-------+-------------------
362 | 0.906861594725424
(1 row)
使用自变量44248推测因变量40145.
使用公式推测的结果为46401.062078405991152
digoal=> select 44248*0.906861594725424+6274.25023499543;
?column?
-----------------------
46401.062078405991152
(1 row)
准确度 :
digoal=> select 40145/46401.062078405991152;
?column?
------------------------
0.86517416200873079820
(1 row)
当我们使用另一组截距和斜率时, 准确度最低是0.32915622582628. 所以得到的预测结果可能不及以上的.
截距
digoal=> select count(*),regr_intercept(t.cnt,test.cnt) from
(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>1;
count | regr_intercept
-------+------------------
363 | 49279.0342891155
(1 row)
斜率
digoal=> select count(*),regr_slope(t.cnt,test.cnt) from
(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>1;
count | regr_slope
-------+-------------------
363 | 0.292250474909646
(1 row)
预测结果
digoal=> select 44248*0.292250474909646+49279.0342891155;
?column?
-----------------------
62210.533302917516208
(1 row)
准确度
digoal=> select 40145/62210.533302917516208;
?column?
------------------------
0.64530872616900233730
(1 row)
最后再提一下另外几个和回归相关的函数 :
regr_avgx(Y, X) |
double precision | double precision | average of the independent variable (sum(X)/N) |
regr_avgy(Y, X) |
double precision | double precision | average of the dependent variable (sum(Y)/N) |
regr_count(Y, X) |
double precision | bigint | number of input rows in which both expressions are nonnull |
regr_sxx(Y, X) |
double precision | double precision | sum(X^2) - sum(X)^2/N ("sum of squares" of the independent variable) |
regr_sxy(Y, X) |
double precision | double precision | sum(X*Y) - sum(X) * sum(Y)/N ("sum of products" of independent times dependent variable) |
regr_syy(Y, X) |
double precision | double precision | sum(Y^2) - sum(Y)^2/N ("sum of squares" of the dependent variable) |
regr_avgx(y, x)其实就是算x的平均值(数学期望), y在这里没有任何作用.
regr_avgy(y, x)其实就是算y的平均值(数学期望), x在这里没有任何作用.
regr_count(y, x) 计算x和y都不是空的记录数.
另外三个是辅助函数, 计算诊断统计信息(总方差, 总协方差).
regr_sxx(y, x) : sum(X^2) - sum(X)^2/N
regr_sxy(y, x) : sum(X*Y) - sum(X) * sum(Y)/N
regr_syy(y, x) : sum(Y^2) - sum(Y)^2/N
REGR_SXY, REGR_SXX, REGR_SYY are auxiliary functions that are used to compute various diagnostic statistics.
REGR_SXX makes the following computation after the elimination of null (expr1, expr2) pairs:
REGR_COUNT(expr1, expr2) * VAR_POP(expr2)
REGR_SYY makes the following computation after the elimination of null (expr1, expr2) pairs:
REGR_COUNT(expr1, expr2) * VAR_POP(expr1)
REGR_SXY makes the following computation after the elimination of null (expr1, expr2) pairs:
REGR_COUNT(expr1, expr2) * COVAR_POP(expr1, expr2)
验证regr_sxx, sxy, syy.
postgres=# select regr_sxx(y,x), REGR_COUNT(y,x)*VAR_POP(x) from (values(2,400),(6,401),(7,400),(3,400),(1000,488)) as t(x,y);
regr_sxx | ?column?
----------+---------------------
792833.2 | 792833.200000000000
(1 row)
postgres=# select regr_sxy(y,x), REGR_COUNT(y,x)*COVAR_POP(x,y) from (values(2,400),(6,401),(7,400),(3,400),(1000,488)) as t(x,y);
regr_sxy | ?column?
----------+----------
69885.6 | 69885.6
(1 row)
postgres=# select regr_syy(y,x), REGR_COUNT(y,x)*VAR_POP(y) from (values(2,400),(6,401),(7,400),(3,400),(1000,488)) as t(x,y);
regr_syy | ?column?
----------+-----------------------
6160.8 | 6160.8000000000000000
(1 row)
好了, 先写到这里, 你还可以尝试更多的折腾玩法.
自动选择最优相关的例子如下 .
=> select * from test order by to_char desc limit 10;
to_char | count
------------+--------
2015030123 | 149496
2015030122 | 165320
2015030121 | 167663
2015030120 | 161071
2015030119 | 145570
2015030118 | 133155
2015030117 | 133962
2015030116 | 130484
2015030115 | 126182
2015030114 | 122998
(10 rows)
=> do language plpgsql $$
declare
r2_1 numeric := 0;
r2_2 numeric := 0;
var int;
inter numeric;
slope numeric;
inter_2 numeric;
slope_2 numeric;
realv numeric;
predicv numeric;
offset_var int := 0; -- 最后一个值的预测值
begin
for i in 1..450 loop
with t1 as (select row_number() over(order by to_char) as rn,count from test order by to_char desc offset offset_var),
t2 as (select row_number() over(order by to_char)-1 as rn,count from test order by to_char desc offset offset_var)
select regr_intercept(t2.count,t1.count),regr_slope(t2.count,t1.count),regr_r2(t1.count,t2.count) into inter,slope,r2_1 from t1,t2 where t1.rn=t2.rn and t1.rn>i;
if r2_1>r2_2 then
inter_2:=inter;
slope_2:=slope;
r2_2:=r2_1;
var:=i;
end if;
end loop;
raise notice '%, %, %, %', var, inter_2,slope_2,r2_2;
select slope_2*count+inter_2 into predicv from test order by to_char desc offset offset_var+1 limit 1;
select count into realv from test order by to_char desc offset offset_var limit 1;
raise notice '%, %', realv, predicv;
end;
$$;
NOTICE: 436, 16599.0041292694, 0.896184690654355, 0.925125327496365
NOTICE: 149496, 164756.257188247368600
DO
=> select 149496/164756.2;
?column?
------------------------
0.90737708201572990880
(1 row)
=> do language plpgsql $$
declare
r2_1 numeric := 0; -- 相关性
r2_2 numeric := 0; -- 最大相关性
var int; -- 样本数量
inter_1 numeric; -- 截距
slope_1 numeric; -- 斜率
inter_2 numeric; -- 最大相关性截距
slope_2 numeric; -- 最大相关性斜率
realv numeric; -- 真实数据
predicv numeric; -- 预测数据
offset_var int := 1; -- 倒数第二个值的预测值, 不停迭代, 最后计算所有的实际值和预测值的corr, 看看相似度如何?
begin
for i in 1..450 loop
with t1 as (select row_number() over(order by to_char) as rn,count from test order by to_char desc offset offset_var),
t2 as (select row_number() over(order by to_char)-1 as rn,count from test order by to_char desc offset offset_var)
select regr_intercept(t2.count,t1.count),regr_slope(t2.count,t1.count),regr_r2(t1.count,t2.count) into inter_1,slope_1,r2_1 from t1,t2 where t1.rn=t2.rn and t1.rn>i;
if r2_1>r2_2 then
inter_2 := inter_1;
slope_2 := slope_1;
r2_2 := r2_1;
var := i;
end if;
end loop;
raise notice '样本数量%, 截距%, 斜率%, 相关性%', var, round(inter_2,4), round(slope_2,4), round(r2_2,4);
select slope_2*count+inter_2 into predicv from test order by to_char desc offset offset_var+1 limit 1;
select count into realv from test order by to_char desc offset offset_var limit 1;
raise notice '真实数据%, 预测数据%, 本次预测偏差,%%%', realv, round(predicv), abs(1-round(predicv/realv,4))*100;
end;
$$;
NOTICE: 样本数量436, 截距10109.8500, 斜率0.9573, 相关性0.9476
NOTICE: 真实数据165320, 预测数据170611, 本次预测偏差,%3.2000
DO
to_char | count
------------+--------
2015030123 | 149496
2015030122 | 165320
2015030121 | 167663
2015030120 | 161071
2015030119 | 145570
2015030118 | 133155
2015030117 | 133962
2015030116 | 130484
2015030115 | 126182
2015030114 | 122998
(10 rows)
=> do language plpgsql $$
declare
r2_1 numeric := 0;
r2_2 numeric := 0;
var int;
inter numeric;
slope numeric;
inter_2 numeric;
slope_2 numeric;
realv numeric;
predicv numeric;
offset_var int := 0; -- 最后一个值的预测值
begin
for i in 1..450 loop
with t1 as (select row_number() over(order by to_char) as rn,count from test order by to_char desc offset offset_var),
t2 as (select row_number() over(order by to_char)-1 as rn,count from test order by to_char desc offset offset_var)
select regr_intercept(t2.count,t1.count),regr_slope(t2.count,t1.count),regr_r2(t1.count,t2.count) into inter,slope,r2_1 from t1,t2 where t1.rn=t2.rn and t1.rn>i;
if r2_1>r2_2 then
inter_2:=inter;
slope_2:=slope;
r2_2:=r2_1;
var:=i;
end if;
end loop;
raise notice '%, %, %, %', var, inter_2,slope_2,r2_2;
select slope_2*count+inter_2 into predicv from test order by to_char desc offset offset_var+1 limit 1;
select count into realv from test order by to_char desc offset offset_var limit 1;
raise notice '%, %', realv, predicv;
end;
$$;
NOTICE: 436, 16599.0041292694, 0.896184690654355, 0.925125327496365
NOTICE: 149496, 164756.257188247368600
DO
=> select 149496/164756.2;
?column?
------------------------
0.90737708201572990880
(1 row)
=> do language plpgsql $$
declare
r2_1 numeric := 0; -- 相关性
r2_2 numeric := 0; -- 最大相关性
var int; -- 样本数量
inter_1 numeric; -- 截距
slope_1 numeric; -- 斜率
inter_2 numeric; -- 最大相关性截距
slope_2 numeric; -- 最大相关性斜率
realv numeric; -- 真实数据
predicv numeric; -- 预测数据
offset_var int := 1; -- 倒数第二个值的预测值, 不停迭代, 最后计算所有的实际值和预测值的corr, 看看相似度如何?
begin
for i in 1..450 loop
with t1 as (select row_number() over(order by to_char) as rn,count from test order by to_char desc offset offset_var),
t2 as (select row_number() over(order by to_char)-1 as rn,count from test order by to_char desc offset offset_var)
select regr_intercept(t2.count,t1.count),regr_slope(t2.count,t1.count),regr_r2(t1.count,t2.count) into inter_1,slope_1,r2_1 from t1,t2 where t1.rn=t2.rn and t1.rn>i;
if r2_1>r2_2 then
inter_2 := inter_1;
slope_2 := slope_1;
r2_2 := r2_1;
var := i;
end if;
end loop;
raise notice '样本数量%, 截距%, 斜率%, 相关性%', var, round(inter_2,4), round(slope_2,4), round(r2_2,4);
select slope_2*count+inter_2 into predicv from test order by to_char desc offset offset_var+1 limit 1;
select count into realv from test order by to_char desc offset offset_var limit 1;
raise notice '真实数据%, 预测数据%, 本次预测偏差,%%%', realv, round(predicv), abs(1-round(predicv/realv,4))*100;
end;
$$;
NOTICE: 样本数量436, 截距10109.8500, 斜率0.9573, 相关性0.9476
NOTICE: 真实数据165320, 预测数据170611, 本次预测偏差,%3.2000
DO
校验函数
=> create or replace function check_predict(IN ov int, OUT rv numeric, OUT pv numeric, OUT dev numeric) returns record as $$
declare
r2_1 numeric := 0; -- 相关性
r2_2 numeric := 0; -- 最大相关性
var int; -- 样本数量
inter_1 numeric; -- 截距
slope_1 numeric; -- 斜率
inter_2 numeric; -- 最大相关性截距
slope_2 numeric; -- 最大相关性斜率
realv numeric; -- 真实数据
predicv numeric; -- 预测数据
offset_var int := ov; -- 倒数第二个值的预测值, 不停迭代, 最后计算所有的实际值和预测值的corr, 看看相似度如何?
lps int := 0;
begin
select count(*)-offset_var-4 into lps from test; -- 循环不要超过总样本数, 同时至少给2个样本.
for i in 1..lps loop
with t1 as (select row_number() over(order by to_char) as rn,to_char,count from test order by to_char desc offset offset_var),
t2 as (select row_number() over(order by to_char)-1 as rn,to_char,count from test order by to_char desc offset offset_var)
select regr_intercept(t2.count,t1.count),regr_slope(t2.count,t1.count),regr_r2(t1.count,t2.count) into inter_1,slope_1,r2_1 from t1,t2 where t1.rn=t2.rn and t1.rn>i;
if r2_1>r2_2 then
inter_2 := inter_1;
slope_2 := slope_1;
r2_2 := r2_1;
var := i;
end if;
end loop;
raise notice '样本数量%, 截距%, 斜率%, 相关性%', var, round(inter_2,4), round(slope_2,4), round(r2_2,4);
select slope_2*count+inter_2 into predicv from test order by to_char desc offset offset_var+1 limit 1;
select count into realv from test order by to_char desc offset offset_var limit 1;
raise notice '真实数据%, 预测数据%, 本次预测偏差%%%', realv, round(predicv), abs(1-round(predicv/realv,4))*100;
rv := realv;
pv := round(predicv);
dev := abs(1-round(predicv/realv,4));
return;
end;
$$ language plpgsql;
校验测试 :
=> select check_predict(i) from generate_series(1,100) t(i);
NOTICE: 样本数量436, 截距10109.8500, 斜率0.9573, 相关性0.9476
NOTICE: 真实数据165320, 预测数据170611, 本次预测偏差%3.2000
NOTICE: 样本数量436, 截距6909.3635, 斜率0.9872, 相关性0.9419
NOTICE: 真实数据167663, 预测数据165922, 本次预测偏差%1.0400
NOTICE: 样本数量436, 截距8151.8730, 斜率0.9754, 相关性0.9249
NOTICE: 真实数据161071, 预测数据150145, 本次预测偏差%6.7800
NOTICE: 样本数量436, 截距14388.5296, 斜率0.9135, 相关性0.9275
NOTICE: 真实数据145570, 预测数据136026, 本次预测偏差%6.5600
NOTICE: 样本数量437, 截距30451.0167, 斜率0.7726, 相关性0.9570
NOTICE: 真实数据133155, 预测数据133953, 本次预测偏差%0.6000
NOTICE: 样本数量446, 截距343.4262, 斜率1.0262, 相关性0.9785
NOTICE: 真实数据133962, 预测数据134241, 本次预测偏差%0.2100
NOTICE: 样本数量437, 截距31491.5019, 斜率0.7616, 相关性0.9494
NOTICE: 真实数据130484, 预测数据127596, 本次预测偏差%2.2100
NOTICE: 样本数量438, 截距48512.9273, 斜率0.6126, 相关性0.9484
NOTICE: 真实数据126182, 预测数据123864, 本次预测偏差%1.8400
NOTICE: 样本数量438, 截距50299.8161, 斜率0.5940, 相关性0.9526
NOTICE: 真实数据122998, 预测数据124578, 本次预测偏差%1.2800
NOTICE: 样本数量442, 截距33561.3690, 斜率0.7444, 相关性0.9983
NOTICE: 真实数据125052, 预测数据125119, 本次预测偏差%0.0500
NOTICE: 样本数量438, 截距50126.2968, 斜率0.5954, 相关性0.9475
NOTICE: 真实数据123000, 预测数据121572, 本次预测偏差%1.1600
NOTICE: 样本数量438, 截距52640.6564, 斜率0.5687, 相关性0.9400
NOTICE: 真实数据119991, 预测数据118710, 本次预测偏差%1.0700
NOTICE: 样本数量438, 截距55198.2911, 斜率0.5404, 相关性0.9301
NOTICE: 真实数据116182, 预测数据118363,最后更新:2017-04-01 13:38:50 上一篇:
下一篇: