441 阿裏雲技術社區[雲棲]
PostgreSQL 聚合函數講解 - 2 相關性統計

PostgreSQL統計信息中, 有一個相關性的統計, 在pg_stats.correlation中可以查看到, 
統計值範圍從-1到1, 趨向於-1表示逆向相關, 趨向於1表示正向相關, 趨向於0表示不相關.
postgres=# \d pg_stats
          View "pg_catalog.pg_stats"
         Column         |   Type   | Modifiers 
------------------------+----------+-----------
 schemaname             | name     | 
 tablename              | name     | 
 attname                | name     | 
 inherited              | boolean  | 
 null_frac              | real     | 
 avg_width              | integer  | 
 n_distinct             | real     | 
 most_common_vals       | anyarray | 
 most_common_freqs      | real[]   | 
 histogram_bounds       | anyarray | 
 correlation            | real     | 
 most_common_elems      | anyarray | 
 most_common_elem_freqs | real[]   | 
 elem_count_histogram   | real[]   | 
correlation的含義是什麼呢?
即列的物理順序和列的邏輯順序的相關性.
相關性越高, 走索引掃描的離散塊掃描更少, 也就是說, 相關性越高, 走索引掃描的離散塊掃描代價越低.
相關性在其他領域也有非常重要的應用, 例如廣告投入和銷售額的數據, 看百度提到的例子 : 
軟件公司在全國有許多代理商，為研究它的財務軟件產品的廣告投入與銷售額的關係，統計人員隨機選擇10家代理商進行觀察，搜集到年廣告投入費和月平均銷售額的數據，並編製成相關表，見表1:
表1　廣告費與月平均銷售額相關表　單位：萬元
年廣告費投入
月均銷售額
12.5　　15.3　　23.2　　26.4　　33.5　　34.4　　39.4　　45.2　　55.4　　60.9
21.2　　23.9　　32.9　　34.1　　42.5　　43.2　　49.0　　52.8　　59.4　　63.5
參照表1，可計算相關係數如表2：
序號
廣告投入(萬元)
　　x
月均銷售額(萬元)
　　y



1　　2　　3　　4　　5　　6　　7　　8　　9　　10
12.5　　15.3　　23.2　　26.4　　33.5　　34.4　　39.4　　45.2　　55.4　　60.9
21.2　　23.9　　32.9　　34.1　　42.5　　43.2　　49.0　　52.8　　59.4　　63.5
156.25　　234.09　　538.24　　696.96　　1122.25　　1183.36　　1552.36　　2043.04　　3069.16　　3708.81
449.44　　571.21　　1082.41　　1162.81　　1806.25　　1866.24　　2401.00　　2787.84　　3528.36　　4032.25
265.00　　365.67　　763.28　　900.24　　1423.75　　1486.08　　1930.60　　2386.56　　3290.76　　3867.15
合計
346.2
422.5
14304.52
19687.81
16679.09
=0.9942
相關係數為0.9942，說明廣告投入費與月平均銷售額之間有高度的線性正相關關係。
相關性越高, 說明廣告投入和銷售額的關係越明顯.
相關性是如何計算的呢? 實際上是 "協方差(x,y)除以(平方根(方差(x)*方差(y)))" . 
Postgresql attr correlation for values(logical order)  ctid (physcial order) - 德哥@Digoal - PostgreSQL research

在運維領域, 也可以做相對應的統計, 例如服務器的內存使用量, 負載, 進程數, 網絡吞吐量, 用戶請求量, 用戶請求響應時間 等數據, 可以做相關性的統計, 觀察他們之間的關係.
接下來進入正題, 看看PostgreSQL是如何計算列的邏輯和物理順序相關性的
首選看一下pg_stats這個視圖對應的correlation是怎麼來的
postgres=# \d+ pg_stats
        CASE
            WHEN s.stakind1 = 3 THEN s.stanumbers1[1]
            WHEN s.stakind2 = 3 THEN s.stanumbers2[1]
            WHEN s.stakind3 = 3 THEN s.stanumbers3[1]
            WHEN s.stakind4 = 3 THEN s.stanumbers4[1]
            WHEN s.stakind5 = 3 THEN s.stanumbers5[1]
            ELSE NULL::real
        END AS correlation,
。。。
   FROM pg_statistic s
     JOIN pg_class c ON c.oid = s.starelid
     JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum
     LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE NOT a.attisdropped AND has_column_privilege(c.oid, a.attnum, 'select'::text);
其實是來自pg_statistic這個表, corr的統計是在analyze中完成的.
相關性計算的代碼如下, 注意是采樣統計 : 
src/backend/commands/analyze.c
                /*
                 * Now scan the values in order, find the most common ones, and also
                 * accumulate ordering-correlation statistics.
                 *
                 * To determine which are most common, we first have to count the
                 * number of duplicates of each value.  The duplicates are adjacent in
                 * the sorted list, so a brute-force approach is to compare successive
                 * datum values until we find two that are not equal. However, that
                 * requires N-1 invocations of the datum comparison routine, which are
                 * completely redundant with work that was done during the sort.  (The
                 * sort algorithm must at some point have compared each pair of items
                 * that are adjacent in the sorted order; otherwise it could not know
                 * that it's ordered the pair correctly.) We exploit this by having
                 * compare_scalars remember the highest tupno index that each
                 * ScalarItem has been found equal to.  At the end of the sort, a
                 * ScalarItem's tupnoLink will still point to itself if and only if it
                 * is the last item of its group of duplicates (since the group will
                 * be ordered by tupno).
                 */
                corr_xysum = 0;
                ndistinct = 0;
                nmultiple = 0;
                dups_cnt = 0;
                for (i = 0; i < values_cnt; i++)
                {
                        int                     tupno = values[i].tupno;

                        corr_xysum += ((double) i) * ((double) tupno);
                        dups_cnt++;
                        if (tupnoLink[tupno] == tupno)
                        {
                                /* Reached end of duplicates of this value */
                                ndistinct++;
                                if (dups_cnt > 1)
                                {
                                        nmultiple++;
                                        if (track_cnt < num_mcv ||
                                                dups_cnt > track[track_cnt - 1].count)
                                        {
                                                /*
                                                 * Found a new item for the mcv list; find its
                                                 * position, bubbling down old items if needed. Loop
                                                 * invariant is that j points at an empty/ replaceable
                                                 * slot.
                                                 */
                                                int                     j;

                                                if (track_cnt < num_mcv)
                                                        track_cnt++;
                                                for (j = track_cnt - 1; j > 0; j--)
                                                {
                                                        if (dups_cnt <= track[j - 1].count)
                                                                break;
                                                        track[j].count = track[j - 1].count;
                                                        track[j].first = track[j - 1].first;
                                                }
                                                track[j].count = dups_cnt;
                                                track[j].first = i + 1 - dups_cnt;
                                        }
                                }
                                dups_cnt = 0;
                        }
                }

.........................
                /* Generate a correlation entry if there are multiple values */
                if (values_cnt > 1)
                {
                        MemoryContext old_context;
                        float4     *corrs;
                        double          corr_xsum,
                                                corr_x2sum;

                        /* Must copy the target values into anl_context */
                        old_context = MemoryContextSwitchTo(stats->anl_context);
                        corrs = (float4 *) palloc(sizeof(float4));
                        MemoryContextSwitchTo(old_context);

                        /*----------
                         * Since we know the x and y value sets are both
                         *              0, 1, ..., values_cnt-1
                         * we have sum(x) = sum(y) =
                         *              (values_cnt-1)*values_cnt / 2
                         * and sum(x^2) = sum(y^2) =
                         *              (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
                         *----------
                         */
                        corr_xsum = ((double) (values_cnt - 1)) *
                                ((double) values_cnt) / 2.0;
                        corr_x2sum = ((double) (values_cnt - 1)) *
                                ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;

                        /* And the correlation coefficient reduces to */
                        corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /
                                (values_cnt * corr_x2sum - corr_xsum * corr_xsum);

                        stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
                        stats->staop[slot_idx] = mystats->ltopr;
                        stats->stanumbers[slot_idx] = corrs;
                        stats->numnumbers[slot_idx] = 1;
                        slot_idx++;
                }
PostgreSQL 提供了相關性統計的函數, corr供用戶使用.
參考
https://www.postgresql.org/docs/9.4/static/functions-aggregate.html
corr代碼如下 : 
src/backend/utils/adt/float.c
Datum
float8_corr(PG_FUNCTION_ARGS)
{
        ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);
        float8     *transvalues;
        float8          N,
                                sumX,
                                sumX2,
                                sumY,
                                sumY2,
                                sumXY,
                                numeratorX,
                                numeratorY,
                                numeratorXY;

        transvalues = check_float8_array(transarray, "float8_corr", 6);
        N = transvalues[0];
        sumX = transvalues[1];
        sumX2 = transvalues[2];
        sumY = transvalues[3];
        sumY2 = transvalues[4];
        sumXY = transvalues[5];

        /* if N is 0 we should return NULL */
        if (N < 1.0)
                PG_RETURN_NULL();

        numeratorX = N * sumX2 - sumX * sumX;
        CHECKFLOATVAL(numeratorX, isinf(sumX2) || isinf(sumX), true);
        numeratorY = N * sumY2 - sumY * sumY;
        CHECKFLOATVAL(numeratorY, isinf(sumY2) || isinf(sumY), true);
        numeratorXY = N * sumXY - sumX * sumY;
        CHECKFLOATVAL(numeratorXY, isinf(sumXY) || isinf(sumX) ||
                                  isinf(sumY), true);
        if (numeratorX <= 0 || numeratorY <= 0)
                PG_RETURN_NULL();

        PG_RETURN_FLOAT8(numeratorXY / sqrt(numeratorX * numeratorY));
}
我們可以用corr來驗證PostgreSQL的采樣統計, 但是注意, 要驗證的話, 數據量小一點比較好, 這樣的話PG會全量采樣, 和corr得到的結果一致, 如果數據量太大, 得到的結果可能有少量偏差.
postgres=# create table t(id int);
CREATE TABLE
postgres=# insert into t values (2),(5),(8),(3),(4),(6),(9),(7),(1);
INSERT 0 9
行號, ID值如下
postgres=# select ctid,* from t;
 ctid  | id 
-------+----
 (0,1) |  2
 (0,2) |  5
 (0,3) |  8
 (0,4) |  3
 (0,5) |  4
 (0,6) |  6
 (0,7) |  9
 (0,8) |  7
 (0,9) |  1
(9 rows)
使用窗口函數進行輸出
postgres=# select * from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
 rn | id 
----+----
  1 |  2
  2 |  5
  3 |  8
  4 |  3
  5 |  4
  6 |  6
  7 |  9
  8 |  7
  9 |  1
(9 rows)
分析 : 
postgres=# analyze t;
ANALYZE
查詢統計信息的correlation            
postgres=# select * from pg_stats where attname ='id' and tablename='t';
-[ RECORD 1 ]----------+--------------------
schemaname             | public
tablename              | t
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -1
most_common_vals       | 
most_common_freqs      | 
histogram_bounds       | {1,2,3,4,5,6,7,8,9}
correlation            | 0.116667
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   | 
結果和corr得到的結果一致
postgres=# select corr(rn,id) from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
-[ RECORD 1 ]-----------
corr | 0.116666666666667
如果隨機插入大量數據, 因此采樣的關係, 分析得到的相關性可能和實際的相關性有偏差
postgres=# insert into t select * from generate_series(1,100000) order by random();
INSERT 0 100000
如下 : 
postgres=# select ctid,* from t limit 100;
  ctid   |  id   
---------+-------
 (0,1)   |     2
 (0,2)   |     5
 (0,3)   |     8
 (0,4)   |     3
 (0,5)   |     4
 (0,6)   |     6
 (0,7)   |     9
 (0,8)   |     7
 (0,9)   |     1
 (0,10)  |  4607
 (0,11)  | 39521
 (0,12)  | 92869
 (0,13)  | 80094
 (0,14)  | 13214
 (0,15)  | 15509
 (0,16)  |  8380
 (0,17)  | 22281
 (0,18)  | 99252
 (0,19)  | 60018
 (0,20)  | 55716
....

postgres=# analyze t;
ANALYZE
postgres=# select correlation from pg_stats where attname ='id' and tablename='t';
 correlation  
--------------
 -0.000263469
(1 row)
和實際相關性偏差較大
postgres=# select corr(rn,id) from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
        corr         
---------------------
 0.00110293570728894
(1 row)
修改id列的采樣係數, 重新分析, 得到的相關性結果和實際的相關性基本一致.
postgres=# alter table t alter column id SET STATISTICS 10000;
ALTER TABLE
postgres=# analyze t;
ANALYZE
postgres=# select correlation from pg_stats where attname ='id' and tablename='t';
 correlation 
-------------
  0.00110296
(1 row)

[參考]
1. https://zh.wikipedia.org/zh-cn/%E7%9B%B8%E5%85%B3
2. https://baike.baidu.com/view/172091.htm
3. https://en.wikipedia.org/wiki/Correlation_and_dependence
4. https://www.postgresql.org/docs/9.4/static/functions-aggregate.html
最後更新：2017-04-01 13:37:06
PostgreSQL 聚合函數講解 - 2 相關性統計

上一篇： Mac OS終端提示符前綴”bogon”

下一篇： PostgreSQL 聚合函數講解 - 3 總體|樣本方差, 標準方差

相關內容

熱門內容

最新內容

PostgreSQL 聚合函數講解 - 2 相關性統計

上一篇： Mac OS終端提示符前綴”bogon”

下一篇： PostgreSQL 聚合函數講解 - 3 總體|樣本 方差, 標準方差

相關內容

熱門內容

最新內容

下一篇： PostgreSQL 聚合函數講解 - 3 總體|樣本方差, 標準方差