PostgreSQL 聚合函數講解 - 5 線性回歸

首先講個線性回歸分析linear regression (最小二乘法least-squares-fit)的小故事(取自百度) :

1801年，意大利天文學家朱賽普·皮亞齊發現了第一顆小行星穀神星。經過40天的跟蹤觀測後，由於穀神星運行至太陽背後，使得皮亞齊失去了穀神星的位置。隨後全世界的科學家利用皮亞齊的觀測數據開始尋找穀神星，但是根據大多數人計算的結果來尋找穀神星都沒有結果。時年24歲的高斯也計算了穀神星的軌道。奧地利天文學家海因裏希·奧爾伯斯根據高斯計算出來的軌道重新發現了穀神星。

高斯使用的最小二乘法的方法發表於1809年他的著作《天體運動論》中。

法國科學家勒讓德於1806年獨立發現“最小二乘法”，但因不為世人所知而默默無聞。

勒讓德曾與高斯為誰最早創立最小二乘法原理發生爭執。

1829年，高斯提供了最小二乘法的優化效果強於其他方法的證明，因此被稱為高斯-馬爾可夫定理.

上麵的故事說明通過已有數據可以對未來的數據進行預測. 但是預測結果是否準確有不確定因素, 所以需要不斷的調整和校驗.

如何做回歸分析呢? (取自百度)

研究一個或多個隨機變量Y1 ，Y2 ，…，Yi與另一些變量X1、X2，…，Xk之間的關係的統計方法，又稱多重回歸分析。通常稱Y1，Y2，…，Yi為因變量，X1、X2，…，Xk為自變量。回歸分析是一類數學模型，特別當因變量和自變量為線性關係時，它是一種特殊的線性模型。最簡單的情形是一個自變量和一個因變量，且它們大體上有線性關係，這叫一元線性回歸，即模型為Y=a+bX+ε，這裏X是自變量，Y是因變量，ε是隨機誤差，通常假定隨機誤差的均值為0，方差為σ^2（σ^2大於0）σ^2與X的值無關。若進一步假定隨機誤差遵從正態分布，就叫做正態線性模型。一般的情形，它有k個自變量和一個因變量，因變量的值可以分解為兩部分：一部分是由於自變量的影響，即表示為自變量的函數，其中函數形式已知，但含一些未知參數；另一部分是由於其他未被考慮的因素和隨機性的影響，即隨機誤差。當函數形式為未知參數的線性函數時，稱線性回歸分析模型；當函數形式為未知參數的非線性函數時，稱為非線性回歸分析模型。當自變量的個數大於1時稱為多元回歸，當因變量個數大於1時稱為多重回歸。

回歸分析的主要內容為：

①從一組數據出發，確定某些變量之間的定量關係式，即建立數學模型並估計其中的未知參數。估計參數的常用方法是最小二乘法。

②對這些關係式的可信程度進行檢驗。

③在許多自變量共同影響著一個因變量的關係中，判斷哪個（或哪些）自變量的影響是顯著的，哪些自變量的影響是不顯著的，將影響顯著的自變量入模型中，而剔除影響不顯著的變量，通常用逐步回歸、向前回歸和向後回歸等方法。

④利用所求的關係式對某一生產過程進行預測或控製。回歸分析的應用是非常廣泛的，統計軟件包使各種回歸方法計算十分方便。

在回歸分析中，把變量分為兩類。一類是因變量，它們通常是實際問題中所關心的一類指標，通常用Y表示；而影響因變量取值的的另一類變量稱為自變量，用X來表示。

回歸分析研究的主要問題是：

（1）確定Y與X間的定量關係表達式，這種表達式稱為回歸方程；

（2）對求得的回歸方程的可信度進行檢驗；

（3）判斷自變量X對因變量Y有無影響；

（4）利用所求得的回歸方程進行預測和控製。

一個例子 :

例如，如果要研究質量和用戶滿意度之間的因果關係，從實踐意義上講，產品質量會影響用戶的滿意情況，因此設用戶滿意度為因變量，記為Y；質量為自變量，記為X。根據圖8－3的散點圖，可以建立下麵的線性關係： Y=A+BX+§

式中：A和B為待定參數，A為回歸直線的截距；B為回歸直線的斜率，表示X變化一個單位時，Y的平均變化情況；§為依賴於用戶滿意度的隨機誤差項。

對於經驗回歸方程： y=0.857+0.836x

回歸直線在y軸上的截距為0.857、斜率0.836，即質量每提高一分，用戶滿意度平均上升0.836分；或者說質量每提高1分對用戶滿意度的貢獻是0.836分。

在PostgreSQL中提供了回歸分析的一些聚合函數,

`regr_avgx(Y, X)`	double precision	double precision	average of the independent variable (sum(X)/N)
`regr_avgy(Y, X)`	double precision	double precision	average of the dependent variable (sum(Y)/N)
`regr_count(Y, X)`	double precision	bigint	number of input rows in which both expressions are nonnull
`regr_intercept(Y, X)`	double precision	double precision	y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs
`regr_r2(Y, X)`	double precision	double precision	square of the correlation coefficient
`regr_slope(Y, X)`	double precision	double precision	slope of the least-squares-fit linear equation determined by the (X, Y) pairs
`regr_sxx(Y, X)`	double precision	double precision	sum(X^2) - sum(X)^2/N ("sum of squares" of the independent variable)
`regr_sxy(Y, X)`	double precision	double precision	sum(XY) - sum(X) sum(Y)/N ("sum of products" of independent times dependent variable)
`regr_syy(Y, X)`	double precision	double precision	sum(Y^2) - sum(Y)^2/N ("sum of squares" of the dependent variable)

本文會用到如下幾個 :

regr_intercept, 計算截距.

regr_slope, 計算斜率.

regr_r2, 計算相關性, 相關性越高, 說明這組數據用於預估的準確度越高.

下麵來舉個例子 :

將最近一年的某個業務的日訪問量數據統計放到一張測試表.

利用這一年的數據進行一元回歸分析, 要預測的是因變量, 用作預測的是自變量. 因為要預測的數據未發生, 所以我們可以把時間交錯一下, 就可以作為自變量來使用.

例如下麵這組數據, 自變量就是數據下移一行產生的. :

因變量 | 自變量

48000 | 54624

47454 | 48000

56766 | 47454

60488 | 56766

58191 | 60488

57443 | 58191

54277 | 57443

55508 | 54277

52716 | 55508

63748 | 52716

43462 | 63748

44248 | 43462

40145 | 44248

如果影響的因素較多, 需要做多元回歸, 你可以選擇PostgreSQL R語言插件來分析.

本文做的是一元回歸的例子, 接下來進入測試 :

創建一張過去365天的某業務的日下載量數據表.

digoal=> create table test as select row_number() over(order by dt) as rn,cnt from

(select date(createtime) as dt, count(*) as cnt from tbl_app_download where createtime>=( date(now())-366 ) and createtime<( date(now())-1 ) group by date(createtime)) as t;

SELECT 365

數據大概是這樣的

digoal=> select * from test;

....

329 | 36293

330 | 40886

331 | 34465

332 | 30785

333 | 33318

334 | 34480

....

接下來我們要來測試在不同數據範圍內的回歸的線性相關性,

例如最近362天的數據交錯後的回歸線性相關性.

digoal=> select count(*),regr_r2(t.cnt,test.cnt) from

(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>2;

count | regr_r2

-------+-------------------

362 | 0.835282212765017

(1 row)

但是如果時間放大到最近363天, 相關性就降低到0.32915622582628了

digoal=> select count(*),regr_r2(t.cnt,test.cnt) from

(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>1;

count | regr_r2

-------+------------------

363 | 0.32915622582628

(1 row)

我們要不斷的嚐試來得到更好的相關性, 當獲得最高的相關性(接近1)時, 預測數據最準確.

接下來我們看看以上兩個時間段產生的截距和斜率的預測準確度.

截距

digoal=> select count(*),regr_intercept(t.cnt,test.cnt) from

(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>2;

count | regr_intercept

-------+------------------

362 | 6274.25023499543

(1 row)

斜率

digoal=> select count(*),regr_slope(t.cnt,test.cnt) from

(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>2;

count | regr_slope

-------+-------------------

362 | 0.906861594725424

(1 row)

使用自變量44248推測因變量40145.

使用公式推測的結果為46401.062078405991152

digoal=> select 44248*0.906861594725424+6274.25023499543;

?column?

-----------------------

46401.062078405991152

(1 row)

準確度 :

digoal=> select 40145/46401.062078405991152;

?column?

------------------------

0.86517416200873079820

(1 row)

當我們使用另一組截距和斜率時, 準確度最低是0.32915622582628. 所以得到的預測結果可能不及以上的.

截距

digoal=> select count(*),regr_intercept(t.cnt,test.cnt) from

(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>1;

count | regr_intercept

-------+------------------

363 | 49279.0342891155

(1 row)

斜率

digoal=> select count(*),regr_slope(t.cnt,test.cnt) from

(select rn-1 as rn,cnt from test) as t,test where t.rn=test.rn and test.rn>1;

count | regr_slope

-------+-------------------

363 | 0.292250474909646

(1 row)

預測結果

digoal=> select 44248*0.292250474909646+49279.0342891155;

?column?

-----------------------

62210.533302917516208

(1 row)

準確度

digoal=> select 40145/62210.533302917516208;

?column?

------------------------

0.64530872616900233730

(1 row)

最後再提一下另外幾個和回歸相關的函數 :

`regr_avgx(Y, X)`	double precision	double precision	average of the independent variable (sum(X)/N)
`regr_avgy(Y, X)`	double precision	double precision	average of the dependent variable (sum(Y)/N)
`regr_count(Y, X)`	double precision	bigint	number of input rows in which both expressions are nonnull

`regr_sxx(Y, X)`	double precision	double precision	sum(X^2) - sum(X)^2/N ("sum of squares" of the independent variable)
`regr_sxy(Y, X)`	double precision	double precision	sum(XY) - sum(X) sum(Y)/N ("sum of products" of independent times dependent variable)
`regr_syy(Y, X)`	double precision	double precision	sum(Y^2) - sum(Y)^2/N ("sum of squares" of the dependent variable)

regr_avgx(y, x)其實就是算x的平均值(數學期望), y在這裏沒有任何作用.

regr_avgy(y, x)其實就是算y的平均值(數學期望), x在這裏沒有任何作用.

regr_count(y, x) 計算x和y都不是空的記錄數.

另外三個是輔助函數, 計算診斷統計信息(總方差, 總協方差).

regr_sxx(y, x) : sum(X^2) - sum(X)^2/N

regr_sxy(y, x) : sum(X*Y) - sum(X) * sum(Y)/N

regr_syy(y, x) : sum(Y^2) - sum(Y)^2/N

REGR_SXY, REGR_SXX, REGR_SYY are auxiliary functions that are used to compute various diagnostic statistics.

REGR_SXX makes the following computation after the elimination of null (expr1, expr2) pairs:

REGR_COUNT(expr1, expr2) * VAR_POP(expr2)

REGR_SYY makes the following computation after the elimination of null (expr1, expr2) pairs:

REGR_COUNT(expr1, expr2) * VAR_POP(expr1)

REGR_SXY makes the following computation after the elimination of null (expr1, expr2) pairs:

REGR_COUNT(expr1, expr2) * COVAR_POP(expr1, expr2)

驗證regr_sxx, sxy, syy.

postgres=# select regr_sxx(y,x), REGR_COUNT(y,x)*VAR_POP(x) from (values(2,400),(6,401),(7,400),(3,400),(1000,488)) as t(x,y);

regr_sxx | ?column?

----------+---------------------

792833.2 | 792833.200000000000

(1 row)

postgres=# select regr_sxy(y,x), REGR_COUNT(y,x)*COVAR_POP(x,y) from (values(2,400),(6,401),(7,400),(3,400),(1000,488)) as t(x,y);

regr_sxy | ?column?

----------+----------

69885.6 | 69885.6

(1 row)

postgres=# select regr_syy(y,x), REGR_COUNT(y,x)*VAR_POP(y) from (values(2,400),(6,401),(7,400),(3,400),(1000,488)) as t(x,y);

regr_syy | ?column?

----------+-----------------------

6160.8 | 6160.8000000000000000

(1 row)

好了, 先寫到這裏, 你還可以嚐試更多的折騰玩法.

自動選擇最優相關的例子如下 .

=> select * from test order by to_char desc limit 10;
to_char | count
------------+--------
2015030123 | 149496
2015030122 | 165320
2015030121 | 167663
2015030120 | 161071
2015030119 | 145570
2015030118 | 133155
2015030117 | 133962
2015030116 | 130484
2015030115 | 126182
2015030114 | 122998
(10 rows)

=> do language plpgsql $$
declare
r2_1 numeric := 0;
r2_2 numeric := 0;
var int;
inter numeric;
slope numeric;
inter_2 numeric;
slope_2 numeric;
realv numeric;
predicv numeric;
offset_var int := 0; -- 最後一個值的預測值
begin
for i in 1..450 loop
with t1 as (select row_number() over(order by to_char) as rn,count from test order by to_char desc offset offset_var),
t2 as (select row_number() over(order by to_char)-1 as rn,count from test order by to_char desc offset offset_var)
select regr_intercept(t2.count,t1.count),regr_slope(t2.count,t1.count),regr_r2(t1.count,t2.count) into inter,slope,r2_1 from t1,t2 where t1.rn=t2.rn and t1.rn>i;
if r2_1>r2_2 then
inter_2:=inter;
slope_2:=slope;
r2_2:=r2_1;
var:=i;
end if;
end loop;

raise notice '%, %, %, %', var, inter_2,slope_2,r2_2;
select slope_2*count+inter_2 into predicv from test order by to_char desc offset offset_var+1 limit 1;
select count into realv from test order by to_char desc offset offset_var limit 1;
raise notice '%, %', realv, predicv;
end;
$$;
NOTICE: 436, 16599.0041292694, 0.896184690654355, 0.925125327496365
NOTICE: 149496, 164756.257188247368600
DO

=> select 149496/164756.2;
?column?
------------------------
0.90737708201572990880
(1 row)

=> do language plpgsql $$
declare
r2_1 numeric := 0; -- 相關性
r2_2 numeric := 0; -- 最大相關性
var int; -- 樣本數量
inter_1 numeric; -- 截距
slope_1 numeric; -- 斜率
inter_2 numeric; -- 最大相關性截距
slope_2 numeric; -- 最大相關性斜率
realv numeric; -- 真實數據
predicv numeric; -- 預測數據
offset_var int := 1; -- 倒數第二個值的預測值, 不停迭代, 最後計算所有的實際值和預測值的corr, 看看相似度如何?
begin
for i in 1..450 loop
with t1 as (select row_number() over(order by to_char) as rn,count from test order by to_char desc offset offset_var),
t2 as (select row_number() over(order by to_char)-1 as rn,count from test order by to_char desc offset offset_var)
select regr_intercept(t2.count,t1.count),regr_slope(t2.count,t1.count),regr_r2(t1.count,t2.count) into inter_1,slope_1,r2_1 from t1,t2 where t1.rn=t2.rn and t1.rn>i;
if r2_1>r2_2 then
inter_2 := inter_1;
slope_2 := slope_1;
r2_2 := r2_1;
var := i;
end if;
end loop;

raise notice '樣本數量%, 截距%, 斜率%, 相關性%', var, round(inter_2,4), round(slope_2,4), round(r2_2,4);
select slope_2*count+inter_2 into predicv from test order by to_char desc offset offset_var+1 limit 1;
select count into realv from test order by to_char desc offset offset_var limit 1;
raise notice '真實數據%, 預測數據%, 本次預測偏差,%%%', realv, round(predicv), abs(1-round(predicv/realv,4))*100;
end;
$$;
NOTICE: 樣本數量436, 截距10109.8500, 斜率0.9573, 相關性0.9476
NOTICE: 真實數據165320, 預測數據170611, 本次預測偏差,%3.2000
DO

校驗函數

=> create or replace function check_predict(IN ov int, OUT rv numeric, OUT pv numeric, OUT dev numeric) returns record as $$

declare

r2_1 numeric := 0; -- 相關性

r2_2 numeric := 0; -- 最大相關性

var int; -- 樣本數量

inter_1 numeric; -- 截距

slope_1 numeric; -- 斜率

inter_2 numeric; -- 最大相關性截距

slope_2 numeric; -- 最大相關性斜率

realv numeric; -- 真實數據

predicv numeric; -- 預測數據

offset_var int := ov; -- 倒數第二個值的預測值, 不停迭代, 最後計算所有的實際值和預測值的corr, 看看相似度如何?

lps int := 0;

begin

select count(*)-offset_var-4 into lps from test; -- 循環不要超過總樣本數, 同時至少給2個樣本.

for i in 1..lps loop

with t1 as (select row_number() over(order by to_char) as rn,to_char,count from test order by to_char desc offset offset_var),

t2 as (select row_number() over(order by to_char)-1 as rn,to_char,count from test order by to_char desc offset offset_var)

select regr_intercept(t2.count,t1.count),regr_slope(t2.count,t1.count),regr_r2(t1.count,t2.count) into inter_1,slope_1,r2_1 from t1,t2 where t1.rn=t2.rn and t1.rn>i;

if r2_1>r2_2 then

inter_2 := inter_1;

slope_2 := slope_1;

r2_2 := r2_1;

var := i;

end if;

end loop;

raise notice '樣本數量%, 截距%, 斜率%, 相關性%', var, round(inter_2,4), round(slope_2,4), round(r2_2,4);

select slope_2*count+inter_2 into predicv from test order by to_char desc offset offset_var+1 limit 1;

select count into realv from test order by to_char desc offset offset_var limit 1;

raise notice '真實數據%, 預測數據%, 本次預測偏差%%%', realv, round(predicv), abs(1-round(predicv/realv,4))*100;

rv := realv;

pv := round(predicv);

dev := abs(1-round(predicv/realv,4));

return;

end;

$$ language plpgsql;

校驗測試 :

=> select check_predict(i) from generate_series(1,100) t(i);

NOTICE: 樣本數量436, 截距10109.8500, 斜率0.9573, 相關性0.9476

NOTICE: 真實數據165320, 預測數據170611, 本次預測偏差%3.2000

NOTICE: 樣本數量436, 截距6909.3635, 斜率0.9872, 相關性0.9419

NOTICE: 真實數據167663, 預測數據165922, 本次預測偏差%1.0400

NOTICE: 樣本數量436, 截距8151.8730, 斜率0.9754, 相關性0.9249

NOTICE: 真實數據161071, 預測數據150145, 本次預測偏差%6.7800

NOTICE: 樣本數量436, 截距14388.5296, 斜率0.9135, 相關性0.9275

NOTICE: 真實數據145570, 預測數據136026, 本次預測偏差%6.5600

NOTICE: 樣本數量437, 截距30451.0167, 斜率0.7726, 相關性0.9570

NOTICE: 真實數據133155, 預測數據133953, 本次預測偏差%0.6000

NOTICE: 樣本數量446, 截距343.4262, 斜率1.0262, 相關性0.9785

NOTICE: 真實數據133962, 預測數據134241, 本次預測偏差%0.2100

NOTICE: 樣本數量437, 截距31491.5019, 斜率0.7616, 相關性0.9494

NOTICE: 真實數據130484, 預測數據127596, 本次預測偏差%2.2100

NOTICE: 樣本數量438, 截距48512.9273, 斜率0.6126, 相關性0.9484

NOTICE: 真實數據126182, 預測數據123864, 本次預測偏差%1.8400

NOTICE: 樣本數量438, 截距50299.8161, 斜率0.5940, 相關性0.9526

NOTICE: 真實數據122998, 預測數據124578, 本次預測偏差%1.2800

NOTICE: 樣本數量442, 截距33561.3690, 斜率0.7444, 相關性0.9983

NOTICE: 真實數據125052, 預測數據125119, 本次預測偏差%0.0500

NOTICE: 樣本數量438, 截距50126.2968, 斜率0.5954, 相關性0.9475

NOTICE: 真實數據123000, 預測數據121572, 本次預測偏差%1.1600

NOTICE: 樣本數量438, 截距52640.6564, 斜率0.5687, 相關性0.9400

NOTICE: 真實數據119991, 預測數據118710, 本次預測偏差%1.0700

NOTICE: 樣本數量438, 截距55198.2911, 斜率0.5404, 相關性0.9301

NOTICE: 真實數據116182, 預測數據118363,最後更新：2017-04-01 13:38:50

PostgreSQL 聚合函數講解 - 5 線性回歸

上一篇： Docker基礎之六: Docker基礎命令

下一篇： Docker基礎之七: 鏡像操作

相關內容

熱門內容

最新內容