全文檢索 (不包含、不等於) 索引優化 - 阿裏雲RDS PostgreSQL最佳實踐

背景

PostgreSQL內置了GIN索引，支持全文檢索，支持數組檢索等多值數據類型的檢索。

在全文檢索中，不包含某個關鍵字能用到索引嗎？

實際上GIN是倒排索引，不包含某個關鍵字的查詢，實際上是跳過主tree上麵的TOKEN的掃描。

隻要被跳過的TOKEN包含了大量數據，那麼就是劃算的。PostgreSQL是基於CBO的執行計劃優化器，所以會自動選擇最優的索引。

例子1，全文檢索不包含查詢

1、創建測試表

postgres=# create table notcontain (id int, info tsvector);  
CREATE TABLE

2、創建生成隨機字符串的函數

CREATE OR REPLACE FUNCTION   
gen_rand_str(integer)    
 RETURNS text    
 LANGUAGE sql    
 STRICT    
AS $function$    
  select string_agg(a[(random()*6)::int+1],'') from generate_series(1,$1), (select array['a','b','c','d','e','f',' ']) t(a);    
$function$;

3、插入100萬測試數據

postgres=# insert into notcontain select generate_series(1,1000000), to_tsvector(gen_rand_str(256));

4、創建全文索引（GIN索引）

create index idx_notcontain_info on notcontain using gin (info);

5、查詢某一條記錄

postgres=# select * from notcontain limit 1;  
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
id   | 1  
info | 'afbbeeccbf':3 'b':16 'bdcdfd':2 'bdcfbcecdeeaed':8 'bfedfecbfab':7 'cd':9 'cdcaefaccdccadeafadededddcbdecdaefbcfbdaefcec':14 'ceafecff':6 'd':17,18 'dbc':12 'dceabcdcbdca':10 'dddfdbffffeaca':13 'deafcccfbcdebdaecda':11 'dfbadcdebdedbfa':19 'eb':15 'ebe':1 'febdcbdaeaeabbdadacabdbbedfafcaeabbdcedaeca':5 'fedeecbcdfcdceabbabbfcdd':4

6、測試不包含某個關鍵字

數據庫自動選擇了全表掃描，沒有使用GIN索引。

為什麼沒有使用索引呢，我前麵解釋了，因為這個關鍵字的數據記錄太少了，不包含它時使用索引過濾不劃算。

（當包含它時，使用GIN索引就非常劃算。包含和不包含是相反的過程，成本也是反的）

select * from notcontain t1 where info @@ to_tsquery ('!eb');  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!eb');  
                                                             QUERY PLAN                                                               
------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..318054.51 rows=950820 width=412) (actual time=0.016..1087.463 rows=947911 loops=1)  
   Output: id, info  
   Filter: (t1.info @@ to_tsquery('!eb'::text))  
   Rows Removed by Filter: 52089  
   Buffers: shared hit=55549  
 Planning time: 0.131 ms  
 Execution time: 1134.571 ms  
(7 rows)

7、強製關閉全表掃描，讓數據庫選擇索引。

可以看到，使用索引確實是慢的，我們大多數時候應該相信數據庫的成本規劃是準確的。（隻要成本因子和環境性能匹配足夠的準，這些都是可以校準的，有興趣的同學可以參考我寫的因子校準方法。）

postgres=# set enable_seqscan=off;  
SET  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!eb');  
                                                                       QUERY PLAN                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on postgres.notcontain t1  (cost=82294981.00..82600120.25 rows=950820 width=412) (actual time=1325.587..1540.145 rows=947911 loops=1)  
   Output: id, info  
   Recheck Cond: (t1.info @@ to_tsquery('!eb'::text))  
   Heap Blocks: exact=55549  
   Buffers: shared hit=171948  
   ->  Bitmap Index Scan on idx_notcontain_info  (cost=0.00..82294743.30 rows=950820 width=0) (actual time=1315.663..1315.663 rows=947911 loops=1)  
         Index Cond: (t1.info @@ to_tsquery('!eb'::text))  
         Buffers: shared hit=116399  
 Planning time: 0.135 ms  
 Execution time: 1584.670 ms  
(10 rows)

例子2，全文檢索不包含查詢

這個例子造一份傾斜數據，這個TOKEN包含了大量的重複記錄，通過不包含過濾它。看看能否使用索引。

1、生成測試數據

postgres=# truncate notcontain ;  
TRUNCATE TABLE  
postgres=# insert into notcontain select generate_series(1,1000000), to_tsvector('abc');  
INSERT 0 1000000

2、測試不包含ABC的檢索

數據庫自動選擇了索引掃描，跳過了不需要檢索的數據塊。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!abc');  
                                                              QUERY PLAN                                                                 
---------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on postgres.notcontain t1  (cost=220432.15..220433.71 rows=1 width=21) (actual time=107.936..107.936 rows=0 loops=1)  
   Output: id, info  
   Recheck Cond: (t1.info @@ to_tsquery('!abc'::text))  
   Buffers: shared hit=268  
   ->  Bitmap Index Scan on idx_notcontain_info  (cost=0.00..220432.15 rows=1 width=0) (actual time=107.933..107.933 rows=0 loops=1)  
         Index Cond: (t1.info @@ to_tsquery('!abc'::text))  
         Buffers: shared hit=268  
 Planning time: 0.183 ms  
 Execution time: 107.962 ms  
(9 rows)

3、強製使用全表掃描，發現性能確實不如索引掃描，也驗證了我們說的PostgreSQL是基於成本的優化器，自動選擇最優的執行計劃。

postgres=# set enable_bitmapscan =off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!abc');  
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..268870.00 rows=1 width=21) (actual time=1065.436..1065.436 rows=0 loops=1)  
   Output: id, info  
   Filter: (t1.info @@ to_tsquery('!abc'::text))  
   Rows Removed by Filter: 1000000  
   Buffers: shared hit=6370  
 Planning time: 0.059 ms  
 Execution time: 1065.449 ms  
(7 rows)

例子3，普通類型BTREE索引，不等於檢索

這個例子是普通類型，使用BTREE索引，看看是否支持不等於的索引檢索。

測試方法與GIN測試類似，使用傾斜和非傾斜兩種測試數據。

1、非傾斜數據的不包含查詢，使用索引過濾的記錄非常少。

目前內核層麵沒有實現BTREE索引的不包含檢索。（雖然技術上是可以通過INDEX SKIP SCAN來實現的，跳過不需要掃描的BRANCH節點）

postgres=# truncate notcontain ;  
TRUNCATE TABLE  
postgres=# insert into notcontain select generate_series(1,1000000);  
INSERT 0 1000000  
postgres=# create index idx1 on notcontain (id);  
CREATE INDEX  
postgres=# set enable_bitmapscan =on;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where id<>1;  
                                                           QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=0.00..16925.00 rows=999999 width=36) (actual time=0.011..110.592 rows=999999 loops=1)  
   Output: id, info  
   Filter: (t1.id <> 1)  
   Rows Removed by Filter: 1  
   Buffers: shared hit=4425  
 Planning time: 0.195 ms  
 Execution time: 156.013 ms  
(7 rows)  
  
  
postgres=# set enable_seqscan=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where id<>1;  
                                                                   QUERY PLAN                                                                      
-------------------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on postgres.notcontain t1  (cost=10000000000.00..10000016925.00 rows=999999 width=36) (actual time=0.011..110.964 rows=999999 loops=1)  
   Output: id, info  
   Filter: (t1.id <> 1)  
   Rows Removed by Filter: 1  
   Buffers: shared hit=4425  
 Planning time: 0.062 ms  
 Execution time: 156.461 ms  
(7 rows)

2、更換SQL寫法，可以實現索引檢索。但實際上由於不是使用的INDEX SKIP SCAN，所以需要一個JOIN過程，實際上效果並不佳。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                      QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------  
 Merge Anti Join  (cost=0.85..25497.28 rows=999999 width=36) (actual time=0.023..277.639 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Merge Cond: (t1.id = t2.id)  
   Buffers: shared hit=7164  
   ->  Index Scan using idx1 on postgres.notcontain t1  (cost=0.42..22994.22 rows=1000000 width=36) (actual time=0.009..148.520 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=7160  
   ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=1)  
         Output: t2.id  
         Index Cond: (t2.id = 1)  
         Heap Fetches: 1  
         Buffers: shared hit=4  
 Planning time: 0.223 ms  
 Execution time: 322.798 ms  
(14 rows)  
postgres=# set enable_mergejoin=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                  QUERY PLAN                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------  
 Hash Anti Join  (cost=3.05..27053.05 rows=999999 width=36) (actual time=0.060..251.232 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Hash Cond: (t1.id = t2.id)  
   Buffers: shared hit=4432  
   ->  Seq Scan on postgres.notcontain t1  (cost=0.00..14425.00 rows=1000000 width=36) (actual time=0.011..84.659 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=4425  
   ->  Hash  (cost=3.04..3.04 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1)  
         Output: t2.id  
         Buckets: 1024  Batches: 1  Memory Usage: 9kB  
         Buffers: shared hit=4  
         ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.010..0.011 rows=1 loops=1)  
               Output: t2.id  
               Index Cond: (t2.id = 1)  
               Heap Fetches: 1  
               Buffers: shared hit=4  
 Planning time: 0.147 ms  
 Execution time: 297.127 ms  
(18 rows)  
  
postgres=# set enable_seqscan=off;  
SET  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1);  
                                                                      QUERY PLAN                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------  
 Hash Anti Join  (cost=3.48..35622.27 rows=999999 width=36) (actual time=0.036..324.401 rows=999999 loops=1)  
   Output: t1.id, t1.info  
   Hash Cond: (t1.id = t2.id)  
   Buffers: shared hit=7164  
   ->  Index Scan using idx1 on postgres.notcontain t1  (cost=0.42..22994.22 rows=1000000 width=36) (actual time=0.017..149.383 rows=1000000 loops=1)  
         Output: t1.id, t1.info  
         Buffers: shared hit=7160  
   ->  Hash  (cost=3.04..3.04 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=1)  
         Output: t2.id  
         Buckets: 1024  Batches: 1  Memory Usage: 9kB  
         Buffers: shared hit=4  
         ->  Index Only Scan using idx1 on postgres.notcontain t2  (cost=0.42..3.04 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)  
               Output: t2.id  
               Index Cond: (t2.id = 1)  
               Heap Fetches: 1  
               Buffers: shared hit=4  
 Planning time: 0.141 ms  
 Execution time: 369.749 ms  
(18 rows)

3、PostgreSQL還支持多核並行，所以全表掃描還可以暴力提升性能。

如果記錄數非常多，使用並行掃描，性能提升非常明顯。

postgres=# create  unlogged table ptbl(id int);  
CREATE TABLE  
postgres=# insert into ptbl select generate_series(1,100000000);  
  
postgres=# alter table ptbl set (parallel_workers =32);  
  
\timing  
  
非並行查詢：  
postgres=# set max_parallel_workers_per_gather =0;  
postgres=# select count(*) from ptbl where id<>1;  
  count     
----------  
 99999999  
(1 row)  
  
Time: 11863.151 ms (00:11.863)  
  
並行查詢：  
postgres=# set max_parallel_workers_per_gather =32;  
postgres=# select count(*) from ptbl where id<>1;  
  count     
----------  
 99999999  
(1 row)  
  
Time: 610.017 ms

使用並行查詢後，性能提升非常明顯。

例子4，普通類型partial BTREE索引，不等於檢索

對於固定的不等於查詢，我們可以使用PostgreSQL的partial index功能。

create table tbl (id int, info text, crt_time timestamp, c1 int);

select * from tbl where c1<>1;

insert into tbl select generate_series(1,10000000), 'test', now(), 1;
insert into tbl values (1,'abc',now(),2);

create index idx_tbl_1 on tbl(id) where c1<>1;

cool，使用PARTIAL INDEX，0.03毫秒，在1000萬數據中進行不等於檢索。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1<>1;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_tbl_1 on postgres.tbl  (cost=0.12..1.44 rows=1 width=21) (actual time=0.015..0.015 rows=1 loops=1)
   Output: id, info, crt_time, c1
   Buffers: shared hit=1 read=1
 Planning time: 0.194 ms
 Execution time: 0.030 ms
(5 rows)

小結

1、PostgreSQL內置了GIN索引，支持全文檢索、支持數組等多值類型的搜索。

2、PostgreSQL使用基於成本的執行計劃優化器，會自動選擇最優的執行計劃，在進行不包含檢索時，PostgreSQL會自動選擇是否使用索引掃描。

3、對於BTREE索引，理論上也能實現不等於的搜索（INDEX SKIP SCAN），目前內核層麵還沒有實現它，目前可以通過調整SQL的寫法來使用索引掃描。

4、PostgreSQL還支持多核並行，所以全表掃描還可以暴力提升性能。如果記錄數非常多，使用並行掃描，性能提升非常明顯。

5、PostgreSQL支持partial index，可以用於分區索引，或者部分索引。對於固定條件的不等於查詢，效果非常顯著。

最後更新：2017-08-13 22:52:18

全文檢索 (不包含、不等於) 索引優化 - 阿裏雲RDS PostgreSQL最佳實踐

背景

例子1，全文檢索不包含查詢

例子2，全文檢索不包含查詢

例子3，普通類型BTREE索引，不等於檢索

例子4，普通類型partial BTREE索引，不等於檢索

小結

上一篇：機票業務(單實例 2700萬行/s return)數據庫架構設計 - 阿裏雲RDS PostgreSQL最佳實踐

下一篇： Greenplum在企業生產中的最佳實踐（上）

相關內容

熱門內容

最新內容

全文檢索 (不包含、不等於) 索引優化 - 阿裏雲RDS PostgreSQL最佳實踐

背景

例子1，全文檢索不包含查詢

例子2，全文檢索不包含查詢

例子3，普通類型BTREE索引，不等於檢索

例子4，普通類型partial BTREE索引，不等於檢索

小結

上一篇： 機票業務(單實例 2700萬行/s return)數據庫架構設計 - 阿裏雲RDS PostgreSQL最佳實踐

下一篇： Greenplum在企業生產中的最佳實踐（上）

相關內容

熱門內容

最新內容

上一篇：機票業務(單實例 2700萬行/s return)數據庫架構設計 - 阿裏雲RDS PostgreSQL最佳實踐