889 技術社區[雲棲]

自動建立正確索引(btree,hash,gin,gist,sp-gist,brin,bitmap...)的方法

標簽

PostgreSQL , 索引接口 , 自動創建索引 , 自動選擇索引接口 , (btree,hash,bitmap,gin,gist,sp-gist,brin,rum,bloom,zoomdb)

背景

PostgreSQL的索引接口是開放的，支持btree,hash,bitmap,gin,gist,sp-gist,brin,rum,bloom,zoomdb等索引接口。因此，不同的數據類型，有不同的索引結構可以選擇。

由於索引接口眾多（應對不同的場景），一些用戶可能無法判斷應該選擇使用哪種索引方法。

雖然我在很多文章中有提到過索引的原理以及選擇方法，但是一切為懶人服務，所以我們看看如何把創建索引變得更簡單，讓數據庫自動選擇最合適的索引方法。

如果你想了解各種索引的用途，你也可以參考一下手冊或者以下CASE的文章，了解索引的原理和應用場景。

《懶人推動社會進步 - 多列聚合, gin與數據分布(選擇性)》

語法

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]  
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )  
    [ WITH ( storage_parameter = value [, ... ] ) ]  
    [ TABLESPACE tablespace_name ]  
    [ WHERE predicate ]

索引選擇的要素

必要約束

1、如果用戶要創建 unique 索引，那麼隻能選擇btree索引接口。

2、某些類型指支持某類索引接口，例如數組類型、全文檢索類型，隻支持GIN索引。GIS類型隻支持gist或sp-gist索引。

選擇性

如果某個類型支持多種索引接口，那麼到底選擇哪種接口比較好呢？

和數據的選擇性是有關係的。

1、選擇性差（例如1億記錄，有100-10萬條唯一值），建議使用gin或bitmap索引。

2、選擇性好（例如1億記錄，有8000萬唯一值），建議使用btree或hash索引。

數據分布

1、對於數據值與行號呈現較強的線性相關特性時，加入用戶沒有按該列順序輸出的需求。則建議使用brin塊級索引。

2、當列長度超過數據塊的1/3時，不能使用btree，建議使用hash索引。或者使用表達式btree索引，建少索引entry的大小。

查詢需求

如果某個類型支持多種索引接口，那麼到底選擇哪種接口比較好呢？

和數據的查詢需要也是有關係的。

1、範圍查詢、排序查詢、等值查詢

可以使用btree, brin.

2、僅僅有等值查詢

可以使用btree, hash

3、有多個列的任意組合查詢需求

可以使用bitmap, gin, btree等索引接口

4、有包含、相交等查詢需求

可以使用gin等索引接口

5、有距離、距離排序、相交、包含、貫穿等查詢需求

可以使用gist等索引接口

性能要求

當一個列支持多種索引接口時，應該選擇哪個索引接口，和業務對性能的要求也有關係。

例如，某個列同時支持btree和brin索引，應該選哪個呢？

除了前麵提到的線性相關性，還需要考慮業務的查詢要求，以及對性能的要求，BTREE對寫入性能影響比BRIN大很多，但是明細查詢速度，排序速度，limit輸出等，都比GIN好很多。

但是當重複值較多時，建議使用GIN，因為它是將元素值作為索引KEY，將行號作為VALUE的倒排索引。

小結

在創建索引時，我們需要指定索引方法，拋開其他的原因，對於同一列可以使用多種索引方法建立索引時，到底使用什麼方法？

本文接下來的例子將給大家介紹這個問題的解法。

統計時需要使用參與索引的字段的pg_class, pg_stats的統計信息。

pg_class.reltuples，行數  
  
pg_stats.n_distinct，唯一值個數或比例  
  
pg_stats.correlation，列的線性相關性

將來PG 10還能支持自定義組合列的統計信息收集，比如A,B列的唯一值，相關性等統計信息。支持更加複雜的優化器成本計算。

例子

我講字段選擇索引方法的功能加入到UDF中，利用參數控製自動建立索引的名字。要求等。

Greenplum

語法

Command:     CREATE INDEX  
Description: define a new index  
Syntax:  
CREATE [UNIQUE] INDEX name ON table   
       [USING btree|bitmap|gist]  
       ( {column | (expression)} [opclass] [, ...] )  
       [ WITH ( FILLFACTOR = value ) ]  
       [TABLESPACE tablespace]  
       [WHERE predicate]

UDF

指定表名，列名，索引名，自動創建合適的索引方法的索引。

例子如下，用戶可以更精細的調整內容，自動建立合適的索引。

create or replace function build_idx(  
sch name, -- schema name  
tbl name, -- 表名  
col name, -- 列名  
idx name -- 索引名  
) returns void as   
$$  
declare  
  v_tuples numeric;  
  v_n_distinct_cnt numeric;  
  v_n_distinct numeric;  
  v_corr numeric;  
  v_sql text;  
  am text := 'btree';  
begin  
  -- 建議先收集統計信息, 或者根據統計信息的狀態，判斷是否要更新統計信息  
  
  select reltuples into v_tuples from pg_class where relname=tbl and relnamespace=(select oid from pg_namespace where nspname=sch);  
  
  -- 唯一值個數，100 ~ 10萬 , 同時唯一值比例低於1/10000，使用bitmap(gp) , 或者使用gin(pg)  
  select case when n_distinct >=1 then n_distinct when n_distinct=-1 then v_tuples else abs(n_distinct*v_tuples) end into v_n_distinct_cnt   
  from pg_stats where schemaname=sch and tablename=tbl and attname=col;  
      
  -- 唯一值比例，低於1/10000 ，使用bitmap(gp) , 或者使用gin(pg)。否則使用btree  
  select case when n_distinct >=1 then n_distinct/v_tuples when n_distinct=-1 then 1 else abs(n_distinct) end into v_n_distinct  
  from pg_stats where schemaname=sch and tablename=tbl and attname=col;  
  
  if v_n_distinct <= 1/10000.0 and v_n_distinct_cnt between 100 and 100000 then  
    am = 'bitmap';  
  end if;  
  
  v_sql := 'create index '||quote_ident(idx)||' on '||quote_ident(sch)||'.'||quote_ident(tbl)||' using '||am||' ('||quote_ident(col)||')';  
  raise notice '%', v_sql;  
  execute v_sql;  
end;  
$$ language plpgsql strict;

測試，如下，自動創建bitmap索引。

postgres=# create table tbl(id int, info text);  
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.  
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.  
CREATE TABLE  
postgres=# insert into tbl select 200*random(), 'test' from generate_series(1,3000000);  
INSERT 0 3000000  
postgres=# select build_idx('public','tbl','id','test_idx_id');  
NOTICE:  create index test_idx_id on public.tbl using bitmap (id)  
 build_idx   
-----------  
   
(1 row)

PostgreSQL

語法

Command:     CREATE INDEX  
Description: define a new index  
Syntax:  
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]  
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )  
    [ WITH ( storage_parameter = value [, ... ] ) ]  
    [ TABLESPACE tablespace_name ]  
    [ WHERE predicate ]

UDF

指定表名，列名，索引名，自動創建合適的索引方法的索引。

create or replace function build_idx(  
sch name, -- schema name  
tbl name, -- 表名  
col name, -- 列名  
idx name -- 索引名  
) returns void as   
$$  
declare  
  v_tuples numeric;  
  v_n_distinct_cnt numeric;  
  v_n_distinct numeric;  
  v_corr numeric;  
  v_sql text;  
  am text := 'btree';  
begin  
  -- 建議先收集統計信息, 或者根據統計信息的狀態，判斷是否要更新統計信息  
  
  select reltuples into v_tuples from pg_class where relname=tbl and relnamespace=sch::regnamespace;  
  
  -- 唯一值個數，100 ~ 10萬 , 同時唯一值比例低於1/10000，使用bitmap(gp) , 或者使用gin(pg)  
  select case when n_distinct >=1 then n_distinct when n_distinct=-1 then v_tuples else abs(n_distinct*v_tuples) end into v_n_distinct_cnt   
  from pg_stats where schemaname=sch and tablename=tbl and attname=col;  
      
  -- 唯一值比例，低於1/10000 ，使用bitmap(gp) , 或者使用gin(pg)。否則使用btree  
  select case when n_distinct >=1 then n_distinct/v_tuples when n_distinct=-1 then 1 else abs(n_distinct) end into v_n_distinct  
  from pg_stats where schemaname=sch and tablename=tbl and attname=col;  
  
  -- 相關性, abs>0.9 使用brin  
  select abs(correlation) into v_corr  
  from pg_stats where schemaname=sch and tablename=tbl and attname=col;  
    
  if v_corr>=0.9 then   
    am = 'brin';  
  elsif v_n_distinct <= 1/10000.0 and v_n_distinct_cnt between 100 and 100000 then  
    am = 'gin';  
  end if;  
  
  -- 如何列大小超過1000字節，am='hash'  
  
  v_sql := format('create index %I on %I.%I using '||am||' (%I)', idx, sch, tbl, col);  
  raise notice '%', v_sql;  
  execute v_sql;  
end;  
$$ language plpgsql strict;

測試，如下，自動創建brin索引。

postgres=# create table test(id int, info text);  
CREATE TABLE  
postgres=# insert into test select 1, 'test' from generate_series(1,1000000);  
INSERT 0 1000000  
postgres=# analyze test;  
ANALYZE  
postgres=# select build_idx('public','test','id','test_id_idx1212');  
NOTICE:  00000: create index test_id_idx1212 on public.test using brin (id)  
LOCATION:  exec_stmt_raise, pl_exec.c:3337  
 build_idx   
-----------  
   
(1 row)

延伸

用戶還可以把其他邏輯寫入UDF，使得這個自動選擇AM並創建索引的UDF更加智能。

最後更新：2017-06-17 08:32:49

自動建立正確索引(btree,hash,gin,gist,sp-gist,brin,bitmap...)的方法

標簽

背景

語法

索引選擇的要素

必要約束

選擇性

數據分布

查詢需求

性能要求

小結

例子

Greenplum

UDF

PostgreSQL

UDF

延伸

上一篇： java學習 JavaScript學習心得

下一篇：史上最全“大數據”學習資源整理

相關內容

熱門內容

最新內容

自動建立正確索引(btree,hash,gin,gist,sp-gist,brin,bitmap...)的方法

標簽

背景

語法

索引選擇的要素

必要約束

選擇性

數據分布

查詢需求

性能要求

小結

例子

Greenplum

UDF

PostgreSQL

UDF

延伸

上一篇： java學習 JavaScript學習心得

下一篇： 史上最全“大數據”學習資源整理

相關內容

熱門內容

最新內容

下一篇：史上最全“大數據”學習資源整理