615 阿裏雲技術社區[雲棲]
Greenplum 注意對其數據類型，否則優化器讓你好看

在測試tpch時，發現有一些SQL跑超時，原因是測試SQL中有一些JOIN查詢，而這些JOIN的列數據類型不一致，導致無法使用索引，或者無法使用hash join。
例子：
int和int8，都是整型，隻是長度不一樣。關聯時，會遇到性能問題，因為不能使用HASH JOIN。
digoal=# create table t(id int);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id int8);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# insert into t select generate_series(1,10000);
INSERT 0 10000
digoal=# insert into t1 select generate_series(1,10000);
INSERT 0 10000
關聯字段為int和int8，不能使用hash join，隻能用nestloop。
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
即使強製關掉nestloop也不行。
digoal=# set enable_nestloop=off;
SET
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_nestloop=off; enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
和數據量當然並沒有關係。
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_nestloop=off; enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
接下來，使用同樣的數據類型，可以用HASH JOIN關聯。性能大幅提升。
digoal=# create table t2(id int8);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t2 select * from t1;
INSERT 0 10010000
digoal=# create index idx_t2 on t2(id);
CREATE INDEX
digoal=# explain select * from t2 join t1 on( t2.id=t1.id );
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=236320.07..497677.13 rows=10013203 width=16)
   ->  Hash Join  (cost=236320.07..497677.13 rows=455146 width=16)
         Hash Cond: t1.id = t2.id
         ->  Seq Scan on t1  (cost=0.00..111158.22 rows=455161 width=8)
         ->  Hash  (cost=111155.03..111155.03 rows=455146 width=8)
               ->  Seq Scan on t2  (cost=0.00..111155.03 rows=455146 width=8)
 Optimizer status: legacy query optimizer
(7 rows)

另一個例子，當char長度不一樣，不會有以上類似的問題。
digoal=# create table t(id char(10));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id char(20));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t join t1 on( t.id=t1.id );
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=240352.38..503423.59 rows=9975039 width=32)
   ->  Hash Join  (cost=240352.38..503423.59 rows=453411 width=32)
         Hash Cond: t.id = t1.id
         ->  Seq Scan on t  (cost=0.00..113396.58 rows=454303 width=11)
         ->  Hash  (cost=115664.39..115664.39 rows=453411 width=21)
               ->  Seq Scan on t1  (cost=0.00..115664.39 rows=453411 width=21)
 Optimizer status: legacy query optimizer
(7 rows)

digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# explain select * from t join t1 on( t.id=t1.id );
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=240352.38..503423.59 rows=9975039 width=32)
   ->  Hash Join  (cost=240352.38..503423.59 rows=453411 width=32)
         Hash Cond: t.id = t1.id
         ->  Seq Scan on t  (cost=0.00..113396.58 rows=454303 width=11)
         ->  Hash  (cost=115664.39..115664.39 rows=453411 width=21)
               ->  Seq Scan on t1  (cost=0.00..115664.39 rows=453411 width=21)
 Optimizer status: legacy query optimizer
(7 rows)

digoal=# explain select * from t join t1 on( t.id=t1.id and t.id='1');
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..401.54 rows=4 width=32)
   ->  Nested Loop  (cost=0.00..401.54 rows=1 width=32)
         ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
               Index Cond: id = '1'::bpchar
         ->  Index Scan using idx_t1 on t1  (cost=0.00..200.83 rows=1 width=21)
               Index Cond: '1'::bpchar = id
 Optimizer status: legacy query optimizer
(7 rows)
varchar和char(n)也不存在問題。
postgres=# create table t(id varchar);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# create table t1(id char(10));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# explain select * from t , t1 where t.id=t1.id;
                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.01..0.07 rows=4 width=76)
   ->  Hash Join  (cost=0.01..0.07 rows=1 width=76)
         Hash Cond: t.id::bpchar = t1.id
         ->  Redistribute Motion 22:22  (slice1; segments: 22)  (cost=0.00..0.02 rows=1 width=32)
               Hash Key: t.id::bpchar
               ->  Seq Scan on t  (cost=0.00..0.00 rows=1 width=32)
         ->  Hash  (cost=0.00..0.00 rows=1 width=44)
               ->  Seq Scan on t1  (cost=0.00..0.00 rows=1 width=44)
 Optimizer status: legacy query optimizer
(9 rows)

數據類型和查詢條件的類型不一致時，不走索引的例子。
digoal=# explain select * from t where id=1;
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=0.00..163369.87 rows=9995 width=11)
   ->  Seq Scan on t  (cost=0.00..163369.87 rows=455 width=11)
         Filter: id::text = '1'::text
 Optimizer status: legacy query optimizer
(4 rows)

digoal=# explain select * from t where id='1';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.70 rows=1 width=11)
   ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
         Index Cond: id = '1'::bpchar
 Optimizer status: legacy query optimizer
(4 rows)
顯示轉換後，可以走索引。
digoal=# explain select * from t where id=1::bpchar;
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.70 rows=1 width=11)
   ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
         Index Cond: id = '1'::bpchar
 Optimizer status: legacy query optimizer
(4 rows)

greenplum因為用了早期的postgresql版本，所以這一方麵優化器並不好，在使用greenplum時，需要注意一下。
greenplum將合並到postgresql 9.5的版本，以上問題可以消除。
postgresql 9.4的例子：
digoal=# create table t(id int);
CREATE TABLE
digoal=# create table t1(id int8);
CREATE TABLE
digoal=# insert into t select generate_series(1,100000);
INSERT 0 100000
digoal=# insert into t1 select generate_series(1,100000);
INSERT 0 100000
digoal=# explain select * from t,t1 where t.id=t1.id;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Hash Join  (cost=2693.00..6136.00 rows=100000 width=12)
   Hash Cond: (t.id = t1.id)
   ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=4)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
         ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8)
(5 rows)
最後更新：2017-04-01 13:37:08
Greenplum 注意對其數據類型，否則優化器讓你好看

上一篇： MongoDB請求處理流程

下一篇：答疑解惑 · 歸檔進程cp命令的core文件追查

相關內容

熱門內容

最新內容

Greenplum 注意對其數據類型，否則優化器讓你好看

上一篇： MongoDB請求處理流程

下一篇： 答疑解惑 · 歸檔進程cp命令的core文件追查

相關內容

熱門內容

最新內容

下一篇：答疑解惑 · 歸檔進程cp命令的core文件追查