929 阿裏雲技術社區[雲棲]

索引順序掃描引發的heap scan IO放大, 背後的統計學原理與解決辦法

標簽

PostgreSQL , 優化器 , 索引掃描 , 堆掃描 , IO放大

背景

通過B-TREE索引掃描可能會帶來了巨大的heap page scan數目，即IO的放大.

為什麼呢？

示例視頻如下 :

https://www.tudou.com/programs/view/yQ0SzBqx_4w/

如果數據庫的單個數據塊（block_size）很大的話, 這種情況帶來的負麵影響也將被放大. 例如32k的block_size顯然比8k的block_size掃描開銷更大.

本文將講解一下索引掃描引發的heap page scan放大的原因, 以及解決辦法。告誡大家注意這樣的事情發生，以及如何對付。

正文

測試環境的成本因子如下 :

shared_buffers = 8192MB                 # min 128kB  
#seq_page_cost = 1.0                    # measured on an arbitrary scale  
random_page_cost = 1.0                  # same scale as above  
#cpu_tuple_cost = 0.01                  # same scale as above  
cpu_index_tuple_cost = 0.005            # same scale as above  
#cpu_operator_cost = 0.0025             # same scale as above  
effective_cache_size = 96GB

我們先創建一個測試表, 插入一些測試數據, 創建一個索引 :

digoal=> create table test_indexscan(id int, info text);  
CREATE TABLE  
digoal=> insert into test_indexscan select generate_series(1,5000000),md5(random()::text);  
INSERT 0 5000000  
digoal=> create index idx_test_indexscan_id on test_indexscan (id);  
CREATE INDEX

我們查看這個表和索引占用了多少數據塊.

digoal=> select relpages from pg_class where relname='test_indexscan';  
 relpages   
----------  
    10396  
(1 row)  
digoal=> select relpages from pg_class where relname='idx_test_indexscan_id';  
 relpages   
----------  
     3402  
(1 row)

接下來分析以下查詢, 我們看到走索引掃描, 並且掃描的數據塊是13547個. (10209 +3338).

掃描的數據塊和實際表占用的數據塊和索引塊相當.

digoal=> explain (analyze,verbose,costs,buffers,timing) select * from test_indexscan where id>90000;  
                                                                           QUERY PLAN                                                 
                               
------------------------------------------------------------------------------------------------------------------------------------  
-----------------------------  
 Index Scan using idx_test_indexscan_id on digoal.test_indexscan  (cost=0.43..99518.57 rows=4912065 width=37) (actual time=0.180..21  
72.949 rows=4910000 loops=1)  
   Output: id, info  
   Index Cond: (test_indexscan.id > 90000)  
   Buffers: shared hit=10209 read=3338  
 Total runtime: 2674.637 ms  
(5 rows)

這裏使用索引掃描為什麼沒有帶來heap page掃描的放大呢? 原因和值的順序與物理存儲順序一致.

如下, 那麼索引掃描的時候沒有發生塊的跳躍 :

digoal=> select correlation from pg_stats where tablename='test_indexscan' and attname='id';  
 correlation   
-------------  
  1  
(1 row)  
digoal=> select ctid,id from test_indexscan limit 10;  
  ctid  |   id      
--------+---------  
 (0,1)  | 1  
 (0,2)  | 2  
 (0,3)  | 3  
 (0,4)  | 4  
 (0,5)  | 5  
 (0,6)  | 6  
 (0,7)  | 7  
 (0,8)  | 8  
 (0,9)  | 9  
 (0,10) | 10  
(10 rows)

接下來我們插入隨機數據, 使得索引掃描時發生heap page的跳躍.

digoal=> truncate test_indexscan ;  
TRUNCATE TABLE  
digoal=> insert into test_indexscan select (random()*5000000)::int,md5(random()::text) from generate_series(1,100000);  
INSERT 0 100000

查詢當前的ID列的順性, 非常小, 說明這個值非常的離散.

digoal=> select correlation from pg_stats where tablename='test_indexscan' and attname='id';  
 correlation   
-------------  
  0.00986802  
(1 row)

從數據分布結果中也能看到這點.

digoal=> select ctid,id from test_indexscan limit 10;  
  ctid  |   id      
--------+---------  
 (0,1)  | 4217216  
 (0,2)  | 2127868  
 (0,3)  | 2072952  
 (0,4)  |   62641  
 (0,5)  | 4927312  
 (0,6)  | 3000894  
 (0,7)  | 2799439  
 (0,8)  | 4165217  
 (0,9)  | 2446438  
 (0,10) | 2835211  
(10 rows)

按以下順序掃描, 顯然會出現大量的數據塊的跳躍.

digoal=> select id,ctid from test_indexscan order by id limit 10;  
 id  |   ctid      
-----+-----------  
  56 | (192,318)  
  73 | (119,163)  
 218 | (189,2)  
 235 | (7,209)  
 260 | (41,427)  
 340 | (37,371)  
 548 | (118,363)  
 607 | (143,174)  
 690 | (161,38)  
 714 | (1,21)  
(10 rows)

當前這個表和索引占用的數據塊如下 :

digoal=> select relpages from pg_class where relname='test_indexscan';  
 relpages   
----------  
      208  
(1 row)  
  
digoal=> select relpages from pg_class where relname='idx_test_indexscan_id';  
 relpages   
----------  
       86  
(1 row)

接下來我們執行這個SQL, 發現走索引掃描了, 但是顯然shared hit變得非常的大, 原因就是每掃描一個索引條目, 對應到heap page number都是跳躍的. 造成了heap page掃描的放大. 具體放大多少行呢, 和差出來的行差不多.

digoal=> explain (analyze,verbose,costs,buffers,timing) select * from test_indexscan where id>90000;  
                                                                        QUERY PLAN                                                    
                        
------------------------------------------------------------------------------------------------------------------------------------  
----------------------  
 Index Scan using idx_test_indexscan_id on digoal.test_indexscan  (cost=0.29..2035.38 rows=99719 width=37) (actual time=0.027..87.45  
6 rows=98229 loops=1)  
   Output: id, info  
   Index Cond: (test_indexscan.id > 90000)  
   Buffers: shared hit=97837  
 Total runtime: 97.370 ms  
(5 rows)

heap page scan放大評估和索引掃描了多少條目有關, 但至少有98229個條目 :

digoal=> select count(*) from test_indexscan where id>90000;  
 count   
-------  
 98229  
(1 row)

如果純隨機掃描, 那麼將要掃描98229次heap page. 也就不難理解這裏的Buffers: shared hit=97837.

但是實際上, PostgreSQL的優化器似乎沒有關注這些開銷, 因為我們看到的成本隻有2035.38 (這裏和random_page_cost以及effective_cache_size 大於整個表和索引的空間有關)

接下來把random_page_cost設置為2和1, 兩個cost相減, 看看到底優化器評估了多少個塊掃描.

digoal=> set random_page_cost=2;  
SET  
digoal=> set enable_seqscan=off;  
SET  
digoal=> explain (analyze,verbose,costs,buffers,timing) select * from test_indexscan where id>90000;  
                                                                        QUERY PLAN                                                    
                        
------------------------------------------------------------------------------------------------------------------------------------  
----------------------  
 Index Scan using idx_test_indexscan_id on digoal.test_indexscan  (cost=0.29..2305.73 rows=98255 width=37) (actual time=0.045..81.76  
8 rows=98229 loops=1)  
   Output: id, info  
   Index Cond: (test_indexscan.id > 90000)  
   Buffers: shared hit=97837  
 Total runtime: 92.186 ms  
(5 rows)  
  
digoal=> set random_page_cost=1;  
SET  
digoal=> explain (analyze,verbose,costs,buffers,timing) select * from test_indexscan where id>90000;  
                                                                        QUERY PLAN                                                    
                        
------------------------------------------------------------------------------------------------------------------------------------  
----------------------  
 Index Scan using idx_test_indexscan_id on digoal.test_indexscan  (cost=0.29..2012.75 rows=98255 width=37) (actual time=0.028..80.05  
5 rows=98229 loops=1)  
   Output: id, info  
   Index Cond: (test_indexscan.id > 90000)  
   Buffers: shared hit=97837  
 Total runtime: 90.549 ms  
(5 rows)

相減得到293, 即優化器認為index scan需要掃描293個數據塊.

digoal=> select 2305-2012;  
 ?column?   
----------  
      293  
(1 row)

接下來我把enable_indexscan關閉, 讓優化器選擇bitmap scan.

digoal=> set enable_indexscan=off;  
SET  
digoal=> explain (analyze,verbose,costs,buffers,timing) select * from test_indexscan where id>90000;  
                                                                QUERY PLAN                                                            
        
------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on digoal.test_indexscan  (cost=846.77..2282.96 rows=98255 width=37) (actual time=15.291..35.911 rows=98229 loops=  
1)  
   Output: id, info  
   Recheck Cond: (test_indexscan.id > 90000)  
   Buffers: shared hit=292  
   ->  Bitmap Index Scan on idx_test_indexscan_id  (cost=0.00..822.21 rows=98255 width=0) (actual time=15.202..15.202 rows=98229 loo  
ps=1)  
         Index Cond: (test_indexscan.id > 90000)  
         Buffers: shared hit=84  
 Total runtime: 45.838 ms  
(8 rows)

從bitmap scan的結果可以看到, 實際掃描的塊為292個, 相比index scan少掃描了9.7萬多數據塊. 並且實際的執行時間也是bitmap scan要快很多.

本例PostgreSQL在計算index scan的random page的成本時, 評估得到的index scan成本小於bitmap index scan的成本, 然而實際上當correlation 很小時, index scan會掃描更多次的heap page, 成本遠遠大於bitmap scan.

本例發生這樣的情況, 具體的原因和我們的成本因子設置有關係, 因為錯誤的設置了random_page_cost以及表和索引的大小小於effective_cache_size, PostgreSQL在使用這樣的成本因子計算成本時, 出現了bitmap scan大於index scan成本的結果.

所以設置正確的成本因子非常重要, 這也是我們需要校準成本因子的原因.

例子 :

[postgres@digoal pgdata]$ psql  
psql (9.3.4)  
Type "help" for help.

默認的成本因子如下

digoal=# show seq_page_cost;  
 seq_page_cost   
---------------  
 1  
(1 row)  
  
digoal=# show random_page_cost;  
 random_page_cost   
------------------  
 4  
(1 row)  
  
digoal=# show cpu_tuple_cost;  
 cpu_tuple_cost   
----------------  
 0.01  
(1 row)  
  
digoal=# show cpu_index_tuple_cost;  
 cpu_index_tuple_cost   
----------------------  
 0.005  
(1 row)  
  
digoal=# show cpu_operator_cost;  
 cpu_operator_cost   
-------------------  
 0.0025  
(1 row)  
  
digoal=# show effective_cache_size;  
 effective_cache_size   
----------------------  
 128MB  
(1 row)

表和索引的大小如下

digoal=# \dt+ tbl_cost_align   
                         List of relations  
 Schema |      Name      | Type  |  Owner   |  Size  | Description   
--------+----------------+-------+----------+--------+-------------  
 public | tbl_cost_align | table | postgres | 219 MB |   
(1 row)  
  
digoal=# \di+ tbl_cost_align_id   
                                  List of relations  
 Schema |       Name        | Type  |  Owner   |     Table      | Size  | Description   
--------+-------------------+-------+----------+----------------+-------+-------------  
 public | tbl_cost_align_id | index | postgres | tbl_cost_align | 64 MB |   
(1 row)

把random_page_cost校準為10, 這個在一般的硬件環境中都適用.

digoal=# set random_page_cost=10;  
SET

默認選擇了全表掃描

digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                            QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on public.tbl_cost_align  (cost=0.00..65538.00 rows=2996963 width=45) (actual time=0.050..1477.028 rows=2997015 loops=1)  
   Output: id, info, crt_time  
   Filter: (tbl_cost_align.id > 2000000)  
   Rows Removed by Filter: 2985  
   Buffers: shared hit=28038  
 Total runtime: 2011.742 ms  
(6 rows)

關閉全表掃描後, 選擇了bitmap scan

digoal=# set enable_seqscan=off;  
SET  
digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                                     QUERY PLAN                                                       
                  
------------------------------------------------------------------------------------------------------------------------------------  
----------------  
 Bitmap Heap Scan on public.tbl_cost_align  (cost=105426.89..170926.93 rows=2996963 width=45) (actual time=1221.104..2911.889 rows=2  
997015 loops=1)  
   Output: id, info, crt_time  
   Recheck Cond: (tbl_cost_align.id > 2000000)  
   Rows Removed by Index Recheck: 2105  
   Buffers: shared hit=36229  
   ->  Bitmap Index Scan on tbl_cost_align_id  (cost=0.00..104677.65 rows=2996963 width=0) (actual time=1214.865..1214.865 rows=2997  
015 loops=1)  
         Index Cond: (tbl_cost_align.id > 2000000)  
         Buffers: shared hit=8191  
 Total runtime: 3585.699 ms  
(9 rows)

關閉bitmap scan後選擇了index scan, index scan的cost遠遠大於評估到的bitmap scan. 因為我們使用了正確的成本因子.

digoal=# set enable_bitmapscan=off;  
SET  
digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                                           QUERY PLAN                                                 
                              
------------------------------------------------------------------------------------------------------------------------------------  
----------------------------  
 Index Scan using tbl_cost_align_id on public.tbl_cost_align  (cost=0.43..16601388.04 rows=2996963 width=45) (actual time=0.064..566  
2.361 rows=2997015 loops=1)  
   Output: id, info, crt_time  
   Index Cond: (tbl_cost_align.id > 2000000)  
   Buffers: shared hit=3005084  
 Total runtime: 6173.067 ms  
(5 rows)

當錯誤的設置了random_page_cost=1=seq_page_cost時, 執行計劃會有所改變(改變出現在effective_cache_size大於表和索引的大小時).

the wrong plan cost occur when i set random_page_cost to 1, and effective_cache_size big then index size and table size in this case.

重新進入psql, 所有因子重回默認值.

digoal=# set random_page_cost=1;  
SET  
digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                            QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on public.tbl_cost_align  (cost=0.00..65538.00 rows=2996963 width=45) (actual time=0.040..1692.712 rows=2997015 loops=1)  
   Output: id, info, crt_time  
   Filter: (tbl_cost_align.id > 2000000)  
   Rows Removed by Filter: 2985  
   Buffers: shared hit=28038  
 Total runtime: 2249.313 ms  
(6 rows)

目前看來還正確

digoal=# set enable_seqscan=off;  
SET  
digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                                    QUERY PLAN                                                        
                
------------------------------------------------------------------------------------------------------------------------------------  
--------------  
 Bitmap Heap Scan on public.tbl_cost_align  (cost=31446.89..96946.93 rows=2996963 width=45) (actual time=1224.445..2454.797 rows=299  
7015 loops=1)  
   Output: id, info, crt_time  
   Recheck Cond: (tbl_cost_align.id > 2000000)  
   Rows Removed by Index Recheck: 2105  
   Buffers: shared hit=36229  
   ->  Bitmap Index Scan on tbl_cost_align_id  (cost=0.00..30697.65 rows=2996963 width=0) (actual time=1220.404..1220.404 rows=29970  
15 loops=1)  
         Index Cond: (tbl_cost_align.id > 2000000)  
         Buffers: shared hit=8191  
 Total runtime: 2955.816 ms  
(9 rows)

當effective_cache_size還是小於表和索引時, 執行計劃依舊正確

digoal=# set effective_cache_size='280MB';  
SET  
digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                                   QUERY PLAN                                                         
               
------------------------------------------------------------------------------------------------------------------------------------  
-------------  
 Bitmap Heap Scan on public.tbl_cost_align  (cost=31446.89..96946.93 rows=2996963 width=45) (actual time=963.845..2060.463 rows=2997  
015 loops=1)  
   Output: id, info, crt_time  
   Recheck Cond: (tbl_cost_align.id > 2000000)  
   Rows Removed by Index Recheck: 2105  
   Buffers: shared hit=36229  
   ->  Bitmap Index Scan on tbl_cost_align_id  (cost=0.00..30697.65 rows=2996963 width=0) (actual time=959.673..959.673 rows=2997015  
 loops=1)  
         Index Cond: (tbl_cost_align.id > 2000000)  
         Buffers: shared hit=8191  
 Total runtime: 2515.649 ms  
(9 rows)

當effective_cache_size大於表和索引的大小時, index scan的成本低於bitmap scan的成本了.

When effective_cache_size large then table and index's size. then use index scan first than bitmap scan.

digoal=# set effective_cache_size='283MB';  
SET  
digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                                         QUERY PLAN                                                   
                           
------------------------------------------------------------------------------------------------------------------------------------  
-------------------------  
 Index Scan using tbl_cost_align_id on public.tbl_cost_align  (cost=0.43..92030.24 rows=2996963 width=45) (actual time=0.045..5238.3  
61 rows=2997015 loops=1)  
   Output: id, info, crt_time  
   Index Cond: (tbl_cost_align.id > 2000000)  
   Buffers: shared hit=3005084  
 Total runtime: 5689.583 ms  
(5 rows)

如果這個時候再把random_page_cost調回正常值10, 則執行計劃回歸正常.

digoal=# set random_page_cost=10;  
SET  
digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                                    QUERY PLAN                                                        
                 
------------------------------------------------------------------------------------------------------------------------------------  
---------------  
 Bitmap Heap Scan on public.tbl_cost_align  (cost=105426.89..170926.93 rows=2996963 width=45) (actual time=918.225..2195.414 rows=29  
97015 loops=1)  
   Output: id, info, crt_time  
   Recheck Cond: (tbl_cost_align.id > 2000000)  
   Rows Removed by Index Recheck: 2105  
   Buffers: shared hit=36229  
   ->  Bitmap Index Scan on tbl_cost_align_id  (cost=0.00..104677.65 rows=2996963 width=0) (actual time=913.935..913.935 rows=299701  
5 loops=1)  
         Index Cond: (tbl_cost_align.id > 2000000)  
         Buffers: shared hit=8191  
 Total runtime: 2698.429 ms  
(9 rows)  
  
digoal=# set enable_seqscan=on;  
SET  
digoal=# explain (analyze,costs,buffers,timing,verbose) select * from tbl_cost_align where id>2000000;  
                                                            QUERY PLAN                                                               
-----------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on public.tbl_cost_align  (cost=0.00..65538.00 rows=2996963 width=45) (actual time=0.020..1522.791 rows=2997015 loops=1)  
   Output: id, info, crt_time  
   Filter: (tbl_cost_align.id > 2000000)  
   Rows Removed by Filter: 2985  
   Buffers: shared hit=28038  
 Total runtime: 2104.057 ms  
(6 rows)

本例說明了成本因子的重要性. 千萬不能隨意設置, 即使完全內存命中, random_page_cost也應該大於seq_page_cost.

我在前一篇BLOG中測試了這樣的場景, 完全內存命中的場景可以設置 random_page_cost=1.6; seq_page_cost=1;

《優化器成本因子校對 - PostgreSQL explain cost constants alignment to timestamp》

B-TREE掃描，對於線性相關性不好的列，會放大HEAP SCAN 的IO消耗，使用bitmap可以解決。

線性相關性的知識如下

《PostgreSQL 計算任意類型字段之間的線性相關性》

《PostgreSQL 統計信息之 - 邏輯與物理存儲的線性相關性》

參考

1. https://www.tudou.com/programs/view/yQ0SzBqx_4w/

2. https://www.postgresql.org/message-id/flat/13668.1398541533@sss.pgh.pa.us

3. 《優化器成本因子校對 - PostgreSQL explain cost constants alignment to timestamp》

4. src/backend/optimizer/path/costsize.c

cost_index function :   
        /*  
         * Now interpolate based on estimated index order correlation to get total  
         * disk I/O cost for main table accesses.  
         */  
        csquared = indexCorrelation * indexCorrelation;  
  
        run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

最後更新：2017-05-07 07:57:20

索引順序掃描引發的heap scan IO放大, 背後的統計學原理與解決辦法

標簽

背景

正文

參考

上一篇： trichedit帶格式文本操作

下一篇： Application.ProcessMessages使用說明

相關內容

熱門內容

最新內容