懶人促進社會進步 - 5種索引的原理和優化Case (btree,hash,gin,gist,brin)

標簽

PostgreSQL , 多列聚合 , gin , btree , n_distinct , 選擇性 , 如何選擇索引方法(hash,btree,gin,gist,brin) , 如何優化索引 , 相關性

背景

在廣告行業，精準營銷是一個較熱的話題，之前寫過一個案例，如何使用PostgreSQL的array類型和GIN索引實時圈人的場景。

《萬億級營銷(圈人)邁入毫秒時代 - 萬億user_tags級實時推薦係統數據庫設計》

使用以上方法，程序需要作出一些調整（當然，如果用戶原本就是PostgreSQL技術棧，改動量會很少），改動量舉例

假設用戶使用了多個列來表示不同的屬性，每個屬性對應一些TAG取值空間。

create table user_tags(uid int8 primary key, lab1 int, lab2 text, lab3 timestamp, lab4 text, lab5 interval, lab6 json);

用戶原有的圈人、維度統計查詢可能是這樣的

select * from user_tags where lab1 ? xxx and lab2 ? xxx or lab4 ? xxx;  
  
select xx, count(*) from user_tags where lab1 ? xxx and lab2 ? xxx or lab4 ? xxx group by xxx;

由於屬性取值空間可能連續，使用《萬億級營銷(圈人)邁入毫秒時代 - 萬億user_tags級實時推薦係統數據庫設計》提到的方法，需要建立標簽庫，將數據階梯化，查詢也要進行轉換。

例如between and這種連續查詢需要轉換為in的散列查詢。使得程序設計更加複雜，（雖然這樣也許可以將性能最大化）。

那麼PostgreSQL有沒有什麼折中的辦法呢？

當然有，一切辦法都是為懶人準備的，懶人推動了社會的進步。

如果你閱讀一下這些文檔，你會發現PG裏麵辦法還是蠻多的。

1、使用bitmapand, bitmapor+任意索引，解決圈人問題。

《多字段，任意組合條件查詢(0建模) - 毫秒級實時圈人最佳實踐》

2、使用varbitx插件，解決圈人問題。

《阿裏雲RDS for PostgreSQL varbitx插件與實時畫像應用場景介紹》

接下來針對有連續查詢，等值查詢多種組合查詢的圈人場景，我們來看看如何解決。

建模和測試

構建一張TAG表

postgres=# create table tbl_label(uid int primary key, c1 int, c2 text, c3 numeric, c4 timestamp, c5 interval, c6 int);  
CREATE TABLE  
Time: 31.145 ms

插入一批數據

postgres=# insert into tbl_label select id,   
random()*10000, md5(random()::text),   
10000*random(), clock_timestamp(),   
(random()*1000::int||' hour')::interval,   
random()*99999   
from generate_series(1,10000000) t(id);  
INSERT 0 10000000

數據樣式

postgres=# select * from tbl_label limit 10;  
 uid |  c1  |                c2                |        c3        |             c4             |        c5        |  c6     
-----+------+----------------------------------+------------------+----------------------------+------------------+-------  
   1 | 1692 | 505662aa4a6b33e1775cea660063ba58 | 9761.26249413937 | 2017-06-12 18:38:57.515097 | 322:06:55.266882 | 67699  
   2 | 8611 | a60d564b7f4d58029dfd5e16f0457305 | 1003.07232700288 | 2017-06-12 18:38:57.515282 | 780:59:39.081975 | 89283  
   3 |  290 | f226358e08372d4b79c8ecdd27172244 | 8240.20517989993 | 2017-06-12 18:38:57.515296 | 261:29:59.658099 | 87975  
   4 | 7829 | 32bc5d97731ddaf2c1630218e43d1e85 | 9061.87939457595 | 2017-06-12 18:38:57.515303 | 760:47:18.267513 | 76409  
   5 | 7735 | 3813b4bcdaadc21a55da143f6aceeac9 | 6651.74870751798 | 2017-06-12 18:38:57.515309 | 512:45:50.116217 | 11252  
   6 | 9945 | ff72917169cdea9225e429e438f22586 | 2114.50539063662 | 2017-06-12 18:38:57.515316 | 63:30:34.15014   | 33288  
   7 | 9144 | 7cf4067f22c5edbb1fc4e08ecee7242c | 5662.74457611144 | 2017-06-12 18:38:57.515322 | 890:30:28.360096 | 55484  
   8 | 2433 | 8ac9732bec2b1c175483c16e82467653 | 9184.17258188128 | 2017-06-12 18:38:57.515328 | 343:34:40.88581  | 53265  
   9 | 8113 | 2dd724e82dc7c2a15dfda45f6a41cd53 | 5094.92502082139 | 2017-06-12 18:38:57.515334 | 635:16:39.096908 | 63410  
  10 | 3893 | b8abdb00228f09efb04c1e2a8a022c22 | 6397.02362008393 | 2017-06-12 18:38:57.51534  | 295:26:24.752753 | 17894  
(10 rows)

分析表的統計信息

postgres=# analyze tbl_label ;  
ANALYZE

查看每列的散列程度

n_distinct解釋  
  
-1表示唯一，也就是說這個列的每一行都不一樣.  
  
>=1時，表示這個列有多少唯一值.  
  
<1時，表示這個列的  唯一值數量/總數.    
  
correlation解釋  
表示該列與數據堆存儲的線性相關性, 1表示正向完全相關。越接近0表示數據分布越離散。<0表示反向相關。  
  
uid是自增的, c4是時間遞增的，所以都是1，完全相關。  
  
postgres=# select tablename,attname,n_distinct,correlation from pg_stats where tablename='tbl_label';  
 tablename | attname | n_distinct | correlation   
-----------+---------+------------+-------------  
 tbl_label | uid     |         -1 |           1  
 tbl_label | c1      |      10018 |  0.00431651  
 tbl_label | c2      |  -0.957505 | -0.00796595  
 tbl_label | c3      |         -1 |  0.00308372  
 tbl_label | c4      |         -1 |           1  
 tbl_label | c5      |         -1 | 0.000382809  
 tbl_label | c6      |     100688 |  0.00156045  
(7 rows)

針對以上統計信息，對於唯一列，建立btree索引，對於鬆散列，建立gin索引（倒排），以達到最好的效果。

為了讓普通類型支持gin，需要創建btree_gin插件

postgres=# create extension btree_gin;  
CREATE EXTENSION

創建c1,c6的gin複合索引

postgres=# set maintenance_work_mem ='32GB';  
SET  
Time: 0.168 ms  
postgres=# create index idx_tbl_label_1 on tbl_label using gin(c1,c6);  
CREATE INDEX  
Time: 1504.542 ms (00:01.505)

查詢測試，查詢c1,c6的任意組合，效果非常棒。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where c1 between 1 and 100;  
                                                            QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl_label  (cost=125.76..8759.97 rows=10074 width=80) (actual time=40.856..50.480 rows=9922 loops=1)  
   Output: uid, c1, c2, c3, c4, c5, c6  
   Recheck Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100))  
   Heap Blocks: exact=7222  
   Buffers: shared hit=7982  
   ->  Bitmap Index Scan on idx_tbl_label_1  (cost=0.00..123.24 rows=10074 width=0) (actual time=39.773..39.773 rows=9922 loops=1)  
         Index Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100))  
         Buffers: shared hit=760  
 Planning time: 0.105 ms  
 Execution time: 51.043 ms  
(10 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where c1 between 1 and 100 or c6=100;  
                                                               QUERY PLAN                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl_label  (cost=134.36..8799.70 rows=10085 width=80) (actual time=41.133..50.187 rows=9932 loops=1)  
   Output: uid, c1, c2, c3, c4, c5, c6  
   Recheck Cond: (((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100)) OR (tbl_label.c6 = 100))  
   Heap Blocks: exact=7228  
   Buffers: shared hit=7992  
   ->  BitmapOr  (cost=134.36..134.36 rows=10085 width=0) (actual time=40.045..40.045 rows=0 loops=1)  
         Buffers: shared hit=764  
         ->  Bitmap Index Scan on idx_tbl_label_1  (cost=0.00..123.24 rows=10074 width=0) (actual time=40.031..40.031 rows=9922 loops=1)  
               Index Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100))  
               Buffers: shared hit=760  
         ->  Bitmap Index Scan on idx_tbl_label_1  (cost=0.00..6.08 rows=11 width=0) (actual time=0.012..0.012 rows=10 loops=1)  
               Index Cond: (tbl_label.c6 = 100)  
               Buffers: shared hit=4  
 Planning time: 0.125 ms  
 Execution time: 50.758 ms  
(15 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where c1 between 1 and 100 and c6=100;  
                                                        QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl_label  (cost=22.50..24.02 rows=1 width=80) (actual time=36.193..36.193 rows=0 loops=1)  
   Output: uid, c1, c2, c3, c4, c5, c6  
   Recheck Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100) AND (tbl_label.c6 = 100))  
   Buffers: shared hit=763  
   ->  Bitmap Index Scan on idx_tbl_label_1  (cost=0.00..22.50 rows=1 width=0) (actual time=36.190..36.190 rows=0 loops=1)  
         Index Cond: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100) AND (tbl_label.c6 = 100))  
         Buffers: shared hit=763  
 Planning time: 0.115 ms  
 Execution time: 36.226 ms  
(9 rows)

創建其他列的btree索引，因為其他列的n_distinct表明這些列基本唯一，所以我們建立btree索引，可以精準的進行定位。

對於線性相關性好的列，創建brin索引。後麵會講到索引的原理和選擇。

postgres=# create index idx_tbl_label2 on tbl_label using btree(c2);  
CREATE INDEX  
Time: 1388.756 ms (00:01.389)  
  
postgres=# create index idx_tbl_label3 on tbl_label using btree(c3);  
CREATE INDEX  
Time: 1028.865 ms (00:01.029)

多列組合查詢，效果非常好

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where c1 between 1 and 100 and c6=100 and c2='abc';  
                                                            QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_tbl_label2 on public.tbl_label  (cost=0.42..3.45 rows=1 width=80) (actual time=0.032..0.032 rows=0 loops=1)  
   Output: uid, c1, c2, c3, c4, c5, c6  
   Index Cond: (tbl_label.c2 = 'abc'::text)  
   Filter: ((tbl_label.c1 >= 1) AND (tbl_label.c1 <= 100) AND (tbl_label.c6 = 100))  
   Buffers: shared read=3  
 Planning time: 0.248 ms  
 Execution time: 0.056 ms  
(7 rows)

多個索引通過bitmapAnd, bitmapOr對數據進行過濾，大幅提升任意條件查詢的性能。原理如下

《多字段，任意組合條件查詢(0建模) - 毫秒級實時圈人最佳實踐》

那麼應該如何選擇索引呢？後麵會講到。

贈送彩蛋

實際上前麵用到的是GIN多列複合索引，還有一種方法，將多列轉換為數組，然後建立數組索引(PostgreSQL 表達式索引。)。

1、如何將多列轉換為數組？

postgres=# create or replace function to_array(VARIADIC anyarray) returns anyarray as $$  
  select $1;                        
$$ language sql strict immutable;  
CREATE FUNCTION

例子

postgres=# select to_array('a'::text,'b','c');  
 to_array   
----------  
 {a,b,c}  
(1 row)  
  
postgres=# select to_array(now(),clock_timestamp());  
                             to_array                                
-------------------------------------------------------------------  
 {"2017-06-12 17:50:47.992274+08","2017-06-12 17:50:47.992489+08"}  
(1 row)  
  
postgres=# select to_array(1,2,3);  
 to_array   
----------  
 {1,2,3}  
(1 row)

2、數組表達式索引

例子

create index idx_tbl_label_combin on tbl_label using gin (to_array(c1,c6));   
  
當列的類型不一致時，可以轉換為一致的，然後建立表達式索引，類型轉換可能需要使用immutable函數，如果沒有則需要自建immutable轉換函數，也很簡單  
  
postgres=# create index idx_tbl_label_combin1 on tbl_label using gin (to_array('c1:'||c1,'c6:'||c6));

3、如何命中數組表達式索引

查詢條件與索引中的表達式一致，即可命中。

例子

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where to_array(c1,c6) && array[1,2];  
                                                              QUERY PLAN                                                                
--------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl_label  (cost=840.56..86397.30 rows=99750 width=80) (actual time=0.777..4.030 rows=2254 loops=1)  
   Output: uid, c1, c2, c3, c4, c5, c6  
   Recheck Cond: (ARRAY[tbl_label.c1, tbl_label.c6] && '{1,2}'::integer[])  
   Heap Blocks: exact=2242  
   Buffers: shared hit=2251  
   ->  Bitmap Index Scan on idx_tbl_label_combin  (cost=0.00..815.62 rows=99750 width=0) (actual time=0.465..0.465 rows=2254 loops=1)  
         Index Cond: (ARRAY[tbl_label.c1, tbl_label.c6] && '{1,2}'::integer[])  
         Buffers: shared hit=9  
 Planning time: 0.361 ms  
 Execution time: 4.176 ms  
(10 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl_label where to_array('c1:'||c1,'c6:'||c6) && array['c1:1'];  
                                                              QUERY PLAN                                                                 
---------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl_label  (cost=422.00..54015.43 rows=50000 width=80) (actual time=0.331..1.888 rows=1021 loops=1)  
   Output: uid, c1, c2, c3, c4, c5, c6  
   Recheck Cond: (ARRAY[('c1:'::text || (tbl_label.c1)::text), ('c6:'::text || (tbl_label.c6)::text)] && '{c1:1}'::text[])  
   Heap Blocks: exact=1019  
   Buffers: shared hit=1024  
   ->  Bitmap Index Scan on idx_tbl_label_combin1  (cost=0.00..409.50 rows=50000 width=0) (actual time=0.195..0.195 rows=1021 loops=1)  
         Index Cond: (ARRAY[('c1:'::text || (tbl_label.c1)::text), ('c6:'::text || (tbl_label.c6)::text)] && '{c1:1}'::text[])  
         Buffers: shared hit=5  
 Planning time: 0.173 ms  
 Execution time: 1.972 ms  
(10 rows)

小結

1、什麼時候選擇btree

btree索引適合選擇性好的列（n_distinct很大，或者=-1），唯一值比例越高越適合btree。

2、什麼時候選擇gin

與btree相反，選擇性越差，采用GIN索引效率越高。

另外GIN的倒排特性，還特別適合多值類型的元素組合查詢，例如數組、全文檢索類型、TOKEN類型、等等。

同時GIN索引接口是開放的，用戶可以根據數據特征，自定義GIN索引。支持更多的數據類型，例如圖像特征值相似查詢，文本的相似度查詢等。

3、什麼時候選擇gist

GIST是PG的一種通用索引接口，適合各種數據類型，特別適合異構的類型，例如幾何類型，空間類型，範圍類型等。

GIST索引的原理可參考

《從難纏的模煳查詢聊開 - PostgreSQL獨門絕招之一 GIN , GiST , SP-GiST , RUM 索引原理與技術背景》

4、什麼時候選擇hash

如何用好隻有等值查詢，並且被索引的列長度很長，可能超過數據庫block的1/3時，建議使用hash索引。 PG 10 hash索引會產生WAL，確保了可靠性，同時支持流複製。

PG 10 以前的版本，不建議使用hash index，crash後需要rebuild，不支持流複製。

5、什麼時候選擇brin

當數據與堆存儲線性相關性很好時，可以采用BRIN索引。

BRIN是塊級索引，存儲每個（或者每一段連續的）數據塊的原子信息（最大值，最小值，平均值，空值比例，COUNT等）。

特別適合範圍掃描。

不同的索引方法支持什麼類型的查詢？

1、btree

適合排序、>=, <=, =, in, >, < 等查詢。

2、HASH

適合=查詢。

3、GIN

不同的數據類型，適應不同的查詢需求。

例如數組類型，適合相交，包含等。

4、GIST

不同的數據類型，適應不同的查詢需求。

例如空間類型，適合，距離排序，KNN，包含，相交，左，右等。

5、BRIN

適合範圍查詢，=查詢。

如何優化索引效率

前麵的方法告訴你應該如何選擇索引，但是沒有提索引本身的優化，實際上數據分布會影響索引的效率。

例如

《索引順序掃描引發的堆掃描IO放大背後的統計學原理與解決辦法 - PostgreSQL index scan enlarge heap page scans when index and column correlation small.》

因此，根據索引的掃描特點，對數據進行重分布，可以大幅度優化索引查詢的效率。

例如bitmap index scan（按BLOCK ID順序讀取）就是PostgreSQL用於減少離散IO的手段。

1、btree數據分布優化

線性相關越好，掃描或返回多條數據的效率越高。

2、hash數據分布優化

線性相關越好，掃描或返回多條數據的效率越高。

3、gin數據分布優化

如果是普通類型，則線性相關越好，掃描或返回多條數據的效率越高。

如果是多值類型（如數組、全文檢索、TOKENs），則元素越集中（元素聚類分析，橫坐標為行號，縱坐標為元素值，數據分布越集中），效率越高。

元素集中通常不好實現，但是我們可以有集中方法來聚集數據，1. 根據元素的出現頻率進行排序重組，當用戶搜索高頻詞時，掃描的塊更少，減少IO放大。2. 根據（被搜索元素的次數*命中條數）的值進行排序，按排在最前的元素進行聚集，逐級聚集。

(以上方法可能比較燒腦，下次發一篇文檔專門講GIN的數據重組優化)

《索引掃描優化之 - GIN數據重組優化(按元素聚合) 想象在玩多階魔方》

4、gist數據分布優化

如果是普通類型，則線性相關越好，掃描或返回多條數據的效率越高。

如果是空間類型，則元素越集中（例如數據按geohash連續分布），效率越高。

5、brin數據分布優化

線性相關越好，掃描或返回多條數據的效率越高。

6、多列複合索引數據分布優化

對於多列符合索引，則看索引的類型，要求與前麵一樣。

增加一個，多個列的線性相關性越好，性能越好。

多列線性相關性計算方法如下

《PostgreSQL 計算任意類型字段之間的線性相關性》

數據分布還有一個好處，對於列存儲，可以大幅提升壓縮比

《一個簡單算法可以幫助物聯網,金融用戶節約98%的數據存儲成本 (PostgreSQL,Greenplum幫你做到)》

參考

《阿裏雲RDS for PostgreSQL varbitx插件與實時畫像應用場景介紹》

《多字段，任意組合條件查詢(0建模) - 毫秒級實時圈人最佳實踐》

《PostgreSQL GIN 單列聚集索引應用》

《寶劍贈英雄 - 任意組合字段等效查詢, 探探PostgreSQL多列展開式B樹 (GIN)》

《PostgreSQL GIN索引實現原理》

《從難纏的模煳查詢聊開 - PostgreSQL獨門絕招之一 GIN , GiST , SP-GiST , RUM 索引原理與技術背景》

《PostgreSQL 9.3 pg_trgm imporve support multi-bytes char and gist,gin index for reg-exp search》

《萬億級營銷(圈人)邁入毫秒時代 - 萬億user_tags級實時推薦係統數據庫設計》

最後更新：2017-06-13 02:32:06

懶人促進社會進步 - 5種索引的原理和優化Case (btree,hash,gin,gist,brin)

標簽

背景

建模和測試

贈送彩蛋

小結

不同的索引方法支持什麼類型的查詢？

如何優化索引效率

參考

上一篇：多階魔方複原在數據優化中的應用

下一篇： PostgreSQL數據保留窗口功能的使用

相關內容

熱門內容

最新內容

懶人促進社會進步 - 5種索引的原理和優化Case (btree,hash,gin,gist,brin)

標簽

背景

建模和測試

贈送彩蛋

小結

不同的索引方法支持什麼類型的查詢？

如何優化索引效率

參考

上一篇： 多階魔方複原在數據優化中的應用

下一篇： PostgreSQL數據保留窗口功能的使用

相關內容

熱門內容

最新內容

上一篇：多階魔方複原在數據優化中的應用