PostgreSQL 聚合函數講解 - 2 相關性統計
PostgreSQL統計信息中, 有一個相關性的統計, 在pg_stats.correlation中可以查看到,
統計值範圍從-1到1, 趨向於-1表示逆向相關, 趨向於1表示正向相關, 趨向於0表示不相關.
postgres=# \d pg_stats
View "pg_catalog.pg_stats"
Column | Type | Modifiers
------------------------+----------+-----------
schemaname | name |
tablename | name |
attname | name |
inherited | boolean |
null_frac | real |
avg_width | integer |
n_distinct | real |
most_common_vals | anyarray |
most_common_freqs | real[] |
histogram_bounds | anyarray |
correlation | real |
most_common_elems | anyarray |
most_common_elem_freqs | real[] |
elem_count_histogram | real[] |
correlation的含義是什麼呢?
即列的物理順序和列的邏輯順序的相關性.
相關性越高, 走索引掃描的離散塊掃描更少, 也就是說, 相關性越高, 走索引掃描的離散塊掃描代價越低.
相關性在其他領域也有非常重要的應用, 例如廣告投入和銷售額的數據, 看百度提到的例子 :
軟件公司在全國有許多代理商,為研究它的財務軟件產品的廣告投入與銷售額的關係,統計人員隨機選擇10家代理商進行觀察,搜集到年廣告投入費和月平均銷售額的數據,並編製成相關表,見表1:
表1 廣告費與月平均銷售額相關表 單位:萬元
年廣告費投入
月均銷售額
12.5 15.3 23.2 26.4 33.5 34.4 39.4 45.2 55.4 60.9
21.2 23.9 32.9 34.1 42.5 43.2 49.0 52.8 59.4 63.5
參照表1,可計算相關係數如表2:
序號
廣告投入(萬元)
x
月均銷售額(萬元)
y
1 2 3 4 5 6 7 8 9 10
12.5 15.3 23.2 26.4 33.5 34.4 39.4 45.2 55.4 60.9
21.2 23.9 32.9 34.1 42.5 43.2 49.0 52.8 59.4 63.5
156.25 234.09 538.24 696.96 1122.25 1183.36 1552.36 2043.04 3069.16 3708.81
449.44 571.21 1082.41 1162.81 1806.25 1866.24 2401.00 2787.84 3528.36 4032.25
265.00 365.67 763.28 900.24 1423.75 1486.08 1930.60 2386.56 3290.76 3867.15
合計
346.2
422.5
14304.52
19687.81
16679.09
=0.9942
相關係數為0.9942,說明廣告投入費與月平均銷售額之間有高度的線性正相關關係。
相關性越高, 說明廣告投入和銷售額的關係越明顯.
相關性是如何計算的呢? 實際上是 "協方差(x,y)除以(平方根(方差(x)*方差(y)))" .
Postgresql attr correlation for values(logical order) ctid (physcial order) - 德哥@Digoal - PostgreSQL research
在運維領域, 也可以做相對應的統計, 例如服務器的內存使用量, 負載, 進程數, 網絡吞吐量, 用戶請求量, 用戶請求響應時間 等數據, 可以做相關性的統計, 觀察他們之間的關係.
接下來進入正題, 看看PostgreSQL是如何計算列的邏輯和物理順序相關性的
首選看一下pg_stats這個視圖對應的correlation是怎麼來的
postgres=# \d+ pg_stats
CASE
WHEN s.stakind1 = 3 THEN s.stanumbers1[1]
WHEN s.stakind2 = 3 THEN s.stanumbers2[1]
WHEN s.stakind3 = 3 THEN s.stanumbers3[1]
WHEN s.stakind4 = 3 THEN s.stanumbers4[1]
WHEN s.stakind5 = 3 THEN s.stanumbers5[1]
ELSE NULL::real
END AS correlation,
。。。
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT a.attisdropped AND has_column_privilege(c.oid, a.attnum, 'select'::text);
其實是來自pg_statistic這個表, corr的統計是在analyze中完成的.
相關性計算的代碼如下, 注意是采樣統計 :
src/backend/commands/analyze.c
/*
* Now scan the values in order, find the most common ones, and also
* accumulate ordering-correlation statistics.
*
* To determine which are most common, we first have to count the
* number of duplicates of each value. The duplicates are adjacent in
* the sorted list, so a brute-force approach is to compare successive
* datum values until we find two that are not equal. However, that
* requires N-1 invocations of the datum comparison routine, which are
* completely redundant with work that was done during the sort. (The
* sort algorithm must at some point have compared each pair of items
* that are adjacent in the sorted order; otherwise it could not know
* that it's ordered the pair correctly.) We exploit this by having
* compare_scalars remember the highest tupno index that each
* ScalarItem has been found equal to. At the end of the sort, a
* ScalarItem's tupnoLink will still point to itself if and only if it
* is the last item of its group of duplicates (since the group will
* be ordered by tupno).
*/
corr_xysum = 0;
ndistinct = 0;
nmultiple = 0;
dups_cnt = 0;
for (i = 0; i < values_cnt; i++)
{
int tupno = values[i].tupno;
corr_xysum += ((double) i) * ((double) tupno);
dups_cnt++;
if (tupnoLink[tupno] == tupno)
{
/* Reached end of duplicates of this value */
ndistinct++;
if (dups_cnt > 1)
{
nmultiple++;
if (track_cnt < num_mcv ||
dups_cnt > track[track_cnt - 1].count)
{
/*
* Found a new item for the mcv list; find its
* position, bubbling down old items if needed. Loop
* invariant is that j points at an empty/ replaceable
* slot.
*/
int j;
if (track_cnt < num_mcv)
track_cnt++;
for (j = track_cnt - 1; j > 0; j--)
{
if (dups_cnt <= track[j - 1].count)
break;
track[j].count = track[j - 1].count;
track[j].first = track[j - 1].first;
}
track[j].count = dups_cnt;
track[j].first = i + 1 - dups_cnt;
}
}
dups_cnt = 0;
}
}
.........................
/* Generate a correlation entry if there are multiple values */
if (values_cnt > 1)
{
MemoryContext old_context;
float4 *corrs;
double corr_xsum,
corr_x2sum;
/* Must copy the target values into anl_context */
old_context = MemoryContextSwitchTo(stats->anl_context);
corrs = (float4 *) palloc(sizeof(float4));
MemoryContextSwitchTo(old_context);
/*----------
* Since we know the x and y value sets are both
* 0, 1, ..., values_cnt-1
* we have sum(x) = sum(y) =
* (values_cnt-1)*values_cnt / 2
* and sum(x^2) = sum(y^2) =
* (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
*----------
*/
corr_xsum = ((double) (values_cnt - 1)) *
((double) values_cnt) / 2.0;
corr_x2sum = ((double) (values_cnt - 1)) *
((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
/* And the correlation coefficient reduces to */
corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /
(values_cnt * corr_x2sum - corr_xsum * corr_xsum);
stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
stats->staop[slot_idx] = mystats->ltopr;
stats->stanumbers[slot_idx] = corrs;
stats->numnumbers[slot_idx] = 1;
slot_idx++;
}
PostgreSQL 提供了相關性統計的函數, corr供用戶使用.
參考
https://www.postgresql.org/docs/9.4/static/functions-aggregate.html
corr代碼如下 :
src/backend/utils/adt/float.c
Datum
float8_corr(PG_FUNCTION_ARGS)
{
ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);
float8 *transvalues;
float8 N,
sumX,
sumX2,
sumY,
sumY2,
sumXY,
numeratorX,
numeratorY,
numeratorXY;
transvalues = check_float8_array(transarray, "float8_corr", 6);
N = transvalues[0];
sumX = transvalues[1];
sumX2 = transvalues[2];
sumY = transvalues[3];
sumY2 = transvalues[4];
sumXY = transvalues[5];
/* if N is 0 we should return NULL */
if (N < 1.0)
PG_RETURN_NULL();
numeratorX = N * sumX2 - sumX * sumX;
CHECKFLOATVAL(numeratorX, isinf(sumX2) || isinf(sumX), true);
numeratorY = N * sumY2 - sumY * sumY;
CHECKFLOATVAL(numeratorY, isinf(sumY2) || isinf(sumY), true);
numeratorXY = N * sumXY - sumX * sumY;
CHECKFLOATVAL(numeratorXY, isinf(sumXY) || isinf(sumX) ||
isinf(sumY), true);
if (numeratorX <= 0 || numeratorY <= 0)
PG_RETURN_NULL();
PG_RETURN_FLOAT8(numeratorXY / sqrt(numeratorX * numeratorY));
}
我們可以用corr來驗證PostgreSQL的采樣統計, 但是注意, 要驗證的話, 數據量小一點比較好, 這樣的話PG會全量采樣, 和corr得到的結果一致, 如果數據量太大, 得到的結果可能有少量偏差.
postgres=# create table t(id int);
CREATE TABLE
postgres=# insert into t values (2),(5),(8),(3),(4),(6),(9),(7),(1);
INSERT 0 9
行號, ID值如下
postgres=# select ctid,* from t;
ctid | id
-------+----
(0,1) | 2
(0,2) | 5
(0,3) | 8
(0,4) | 3
(0,5) | 4
(0,6) | 6
(0,7) | 9
(0,8) | 7
(0,9) | 1
(9 rows)
使用窗口函數進行輸出
postgres=# select * from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
rn | id
----+----
1 | 2
2 | 5
3 | 8
4 | 3
5 | 4
6 | 6
7 | 9
8 | 7
9 | 1
(9 rows)
分析 :
postgres=# analyze t;
ANALYZE
查詢統計信息的correlation
postgres=# select * from pg_stats where attname ='id' and tablename='t';
-[ RECORD 1 ]----------+--------------------
schemaname | public
tablename | t
attname | id
inherited | f
null_frac | 0
avg_width | 4
n_distinct | -1
most_common_vals |
most_common_freqs |
histogram_bounds | {1,2,3,4,5,6,7,8,9}
correlation | 0.116667
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |
結果和corr得到的結果一致
postgres=# select corr(rn,id) from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
-[ RECORD 1 ]-----------
corr | 0.116666666666667
如果隨機插入大量數據, 因此采樣的關係, 分析得到的相關性可能和實際的相關性有偏差
postgres=# insert into t select * from generate_series(1,100000) order by random();
INSERT 0 100000
如下 :
postgres=# select ctid,* from t limit 100;
ctid | id
---------+-------
(0,1) | 2
(0,2) | 5
(0,3) | 8
(0,4) | 3
(0,5) | 4
(0,6) | 6
(0,7) | 9
(0,8) | 7
(0,9) | 1
(0,10) | 4607
(0,11) | 39521
(0,12) | 92869
(0,13) | 80094
(0,14) | 13214
(0,15) | 15509
(0,16) | 8380
(0,17) | 22281
(0,18) | 99252
(0,19) | 60018
(0,20) | 55716
....
postgres=# analyze t;
ANALYZE
postgres=# select correlation from pg_stats where attname ='id' and tablename='t';
correlation
--------------
-0.000263469
(1 row)
和實際相關性偏差較大
postgres=# select corr(rn,id) from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
corr
---------------------
0.00110293570728894
(1 row)
修改id列的采樣係數, 重新分析, 得到的相關性結果和實際的相關性基本一致.
postgres=# alter table t alter column id SET STATISTICS 10000;
ALTER TABLE
postgres=# analyze t;
ANALYZE
postgres=# select correlation from pg_stats where attname ='id' and tablename='t';
correlation
-------------
0.00110296
(1 row)
[參考]
1. https://zh.wikipedia.org/zh-cn/%E7%9B%B8%E5%85%B3
2. https://baike.baidu.com/view/172091.htm
3. https://en.wikipedia.org/wiki/Correlation_and_dependence
4. https://www.postgresql.org/docs/9.4/static/functions-aggregate.html
最後更新:2017-04-01 13:37:06