Greenplum 點查詢的優化(分布鍵)

標簽

PostgreSQL , Greenplum , 分布式查詢 , 分布式執行計劃 , 點查

背景

Greenplum是分布式數據庫，在建表時，可以指定隨機分布、或按指定字段或多個字段進行分布。

因此在做點查時，Greenplum可以根據WHERE條件來判斷是需要在所有節點查詢，還是隻需要到若幹節點查詢。

假設tbl按id分布，那麼下麵幾種情況應該到哪個節點查詢呢：

-- 到單個節點查詢  
select * from tbl where id=?  
  
-- 到若幹個節點查詢  
select * from tbl where id in (?,?,?)  
  
-- 到所有節點查詢  
select * from tbl where id >= ? and id < ?;  
  
-- 到所有節點查詢  
select * from tbl where col1 ...;

到多個節點查詢和到一個節點查詢，性能是不一樣的。因此我們就有了優化的方法。

點查的優化與分布式QUERY

測試

1、構建測試表和數據

create table tbl2 (id int, c1 int) DISTRIBUTED BY (id);  
  
insert into tbl2 select id,id from generate_series(1,10000000) t(id);

2、創建索引

create index idx_tbl2_1 on tbl2(id);  
create index idx_tbl2_2 on tbl2(c1);

3、按分布鍵，點查，隻需要查詢一個節點。

postgres=# explain analyze select * from tbl2 where id=1;  
                                                QUERY PLAN                                                   
-----------------------------------------------------------------------------------------------------------  
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.28 rows=1 width=8)  
   Rows out:  1 rows at destination with 2.421 ms to first row, 2.422 ms to end, start offset by 0.212 ms.  
   ->  Index Scan using idx_tbl2_1 on tbl2  (cost=0.00..200.28 rows=1 width=8)  
         Index Cond: id = 1  
         Rows out:  1 rows with 0.017 ms to first row, 0.019 ms to end, start offset by 2.576 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 226K bytes.  
   (slice1)    Executor memory: 172K bytes (seg42).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 2.732 ms  
(13 rows)

4、按非分布鍵查詢，需要查詢所有節點。

postgres=# explain analyze select * from tbl2 where c1=1;  
                                              QUERY PLAN                                                 
-------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice1; segments: 48)  (cost=0.00..200.28 rows=1 width=8)  
   Rows out:  1 rows at destination with 58 ms to first row, 74 ms to end, start offset by 0.341 ms.  
   ->  Index Scan using idx_tbl2_2 on tbl2  (cost=0.00..200.28 rows=1 width=8)  
         Index Cond: c1 = 1  
         Rows out:  1 rows (seg42) with 0.294 ms to first row, 0.297 ms to end, start offset by 58 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 226K bytes.  
   (slice1)    Executor memory: 172K bytes avg x 48 workers, 172K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 74.553 ms  
(13 rows)

分析

在單個節點執行和在多個節點執行性能完全不一樣對吧，從現象來看：

1、首先master需要生成分布式執行計劃，耗費若幹毫秒。

2、建立master與segment的連接，這一步可能是串行的，節點越多，建立連接耗時越久。

這個是在所有節點執行的，可以看到offset 58 ms，看起來是串行帶來的問題。

Rows out:  1 rows (seg42) with 0.294 ms to first row, 0.297 ms to end, start offset by 58 ms.

3、執行，執行實際上是並行的，而且也沒有花多少時間。

優化

1、如果需要經常點查，並且條件字段分布均勻的話。可以選擇這個字段作為分布鍵，提高查詢效率。（並且這個表的UK或PK也是這個列，或者沒有UK或者PK列的需求時。）

2、如果不能做到按查詢條件字段進行分布，參數可以優化（效果不明顯）。

gp_cached_segworkers_threshold

When a user starts a session with Greenplum Database and issues a query, the system creates groups or  
  
'gangs' of worker processes on each segment to do the work. After the work is done, the segment worker  
  
processes are destroyed except for a cached number which is set by this parameter. A lower setting  
  
conserves system resources on the segment hosts, but a higher setting may improve performance for  
  
power-users that want to issue many complex queries in a row.

gp_interconnect_type

Sets the networking protocol used for Greenplum Database interconnect traffic. With the TCP protocol,  
  
Greenplum Database has an upper limit of 1000 segment instances - less than that if the query workload  
  
involves complex, multi-slice queries.  
  
The UDP protocol allows for greater interconnect scalability. Note that the Greenplum Database software  
  
does the additional packet verification and checking not performed by UDP, so reliability and performance  
  
is equivalent to TCP.  
  
UDPIFC specifies using UDP with flow control for interconnect traffic. Specify the interconnect flow control  
  
method with gp_interconnect_fc_method.  
  
  
Note: The Greenplum Database interconnect types TCP and UDP are deprecated. In the next  
  
major release, only the UDPIFC interconnect type will be supported by Greenplum Database.

3、在master與segment之間使用連接池，也可能是一種優化方法。

https://www.linkedin.com/pulse/scaling-greenplum-pgbouncer-sandeep-katta-/?articleId=6128769027482402816

https://greenplum.org/docs/admin_guide/access_db/topics/pgbouncer.html

最後更新：2017-10-28 23:34:32

Greenplum 點查詢的優化(分布鍵)

標簽

背景

點查的優化與分布式QUERY

測試

分析

優化

上一篇： Greenplum 類型一致性使用規範 - 索引條件、JOIN的類型一致性限製

下一篇： PostgreSQL DISTINCT 和 DISTINCT ON 語法的使用

相關內容

熱門內容

最新內容