閱讀573 返回首頁    go 阿裏雲 go 技術社區[雲棲]


PostgreSQL 10.0 preview 功能增強 - 自由定義統計信息維度

標簽

PostgreSQL , 10.0 , 統計信息 , 自由定義 , 自由分組統計信息


背景

統計信息是數據庫優化器用來計算SQL plan tree中每個NODE成本的依據,目前PostgreSQL的統計信息是開源數據庫中最完備的。

包括繼承表、以及全局的統計信息,每個字段的NULL值個數或百分比,平均字段寬度,唯一值個數或百分比,高頻詞,高頻詞的百分比,柱狀圖,線性相關性,數組字段的高頻元素,高頻元素的百分比,數組元素的柱狀圖。

          View "pg_catalog.pg_stats"  
         Column         |   Type   | Modifiers   
------------------------+----------+-----------  
 schemaname             | name     |   
 tablename              | name     |   
 attname                | name     |   
 inherited              | boolean  |   
 null_frac              | real     |   
 avg_width              | integer  |   
 n_distinct             | real     |   
 most_common_vals       | anyarray |   
 most_common_freqs      | real[]   |   
 histogram_bounds       | anyarray |   
 correlation            | real     |   
 most_common_elems      | anyarray |   
 most_common_elem_freqs | real[]   |   
 elem_count_histogram   | real[]   |   

實際上統計信息已經非常完備了,10.0提出了一個更大膽的統計信息,自定義統計信息。

比如我們將多個字段作為一個分組,生成分組的統計信息。有什麼用呢?

例如select count(distinct a,b) from tbl, select count(*),a,b from tbl group by a,b;

以上SQL,如果有a,b分組的統計信息,有多少唯一值,線性相關性如何。優化器就可以評估group by需要多少內存,結果集大概多大。

是的,現在PostgreSQL 10.0允許你定義字段分組的統計信息了。是不是很開森呢?

你甚至可以用它來做多個字段組合的TOP N的評估呢。

Implement multivariate n-distinct coefficients  
  
Add support for explicitly declared statistic objects (CREATE  
STATISTICS), allowing collection of statistics on more complex  
combinations that individual table columns.  Companion commands DROP  
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are  
added too.  All this DDL has been designed so that more statistic types  
can be added later on, such as multivariate most-common-values and  
multivariate histograms between columns of a single table, leaving room  
for permitting columns on multiple tables, too, as well as expressions.  
  
This commit only adds support for collection of n-distinct coefficient  
on user-specified sets of columns in a single table.  This is useful to  
estimate number of distinct groups in GROUP BY and DISTINCT clauses;  
estimation errors there can cause over-allocation of memory in hashed  
aggregates, for instance, so it's a worthwhile problem to solve.  A new  
special pseudo-type pg_ndistinct is used.  
  
(num-distinct estimation was deemed sufficiently useful by itself that  
this is worthwhile even if no further statistic types are added  
immediately; so much so that another version of essentially the same  
functionality was submitted by Kyotaro Horiguchi:  
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp  
though this commit does not use that code.)  
  
Author: Tomas Vondra.  Some code rework by Álvaro.  
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,  
    Ideriha Takeshi  
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz  
    https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql  

新增的係統表

pg_statistic_ext  

新增的語法

CREATE STATISTICS  

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=doc/src/sgml/ref/create_statistics.sgml;h=60184a347bf72c4e4d075976f2959bd2b9c8e99b;hb=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=doc/src/sgml/catalogs.sgml;h=ac39c639edcbbc05b3b18c31cc5ab6b99b93efeb;hp=c531c73aac9f661cad9e3c03e0f16197932edfd4;hb=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b;hpb=f120b614e070aed39586d1443193738a149a90d4

這個patch的討論,詳見郵件組,本文末尾URL。

PostgreSQL社區的作風非常嚴謹,一個patch可能在郵件組中討論幾個月甚至幾年,根據大家的意見反複的修正,patch合並到master已經非常成熟,所以PostgreSQL的穩定性也是遠近聞名的。

參考

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b

最後更新:2017-04-01 17:04:39

  上一篇:go 這顆膠囊能在你肚子裏24小時監測你的健康!| 醫療軟件
  下一篇:go #內含福利#2T免費的雲存儲,能帶我們飛多高?