504 阿裏雲技術社區[雲棲]

阿裏雲HybridDB for PG實踐 - 行存、列存，堆表、AO表的原理和選擇

標簽

PostgreSQL , Greenplum , 向量計算 , 行存儲 , 列存 , AO表

背景

Greenplum支持行存和列存，支持堆表和AO表，那麼他們有什麼不同，如何選擇呢？

行存和列存的原理

1、行存，以行為形式組織存儲，一行是一個tuple，存在一起。當需要讀取某列時，需要將這列前麵的所有列都進行deform，所以訪問第一列和訪問最後一列的成本實際上是不一樣的。

在這篇文檔中，有deform的詳細介紹。《PostgreSQL 向量化執行插件(瓦片式實現) 10x提速OLAP》

行存小結：

全表掃描要掃描更多的數據塊。

壓縮比較低。

讀取任意列的成本不一樣，越靠後的列，成本越高。

不適合向量計算、JIT架構。（簡單來說，就是不適合批處理形式的計算）

需要REWRITE表時，需要對全表進行REWRITE，例如加字段有默認值。

2、列存，以列為形式組織存儲，每列對應一個或一批文件。讀取任一列的成本是一樣的，但是如果要讀取多列，需要訪問多個文件，訪問的列越多，開銷越大。

列存小結：

壓縮比高。

僅僅支持AO存儲（後麵會將）。

讀取任意列的成本是一樣的。

非常適合向量計算、JIT架構。對大批量數據的訪問和統計，效率更高。

讀取很多列時，由於需要訪問更多的文件，成本更高。例如查詢明細。

需要REWRITE表時，不需要對全表操作，例如加字段有默認值，隻是添加字段對應的那個文件。

什麼時候選擇行存

如果OLTP的需求偏多，例如經常需要查詢表的明細（輸出很多列），需要更多的更新和刪除操作時。可以考慮行存。

什麼時候選擇列存

如果OLAP的需求偏多，經常需要對數據進行統計時，選擇列存。

需要比較高的壓縮比時，選擇列存。

如果用戶有混合需求，可以采用分區表，例如按時間維度的需求分區，近期的數據明細查詢多，那就使用行存，對曆史的數據統計需求多那就使用列存。

堆表和AO表的原理

1、堆表，實際上就是PG的堆存儲，堆表的所有變更都會產生REDO，可以實現時間點恢複。但是堆表不能實現邏輯增量備份（因為表的任意一個數據塊都有可能變更，不方便通過堆存儲來記錄位點。）。

一個事務結束時，通過clog以及REDO來實現它的可靠性。同時支持通過REDO來構建MIRROR節點實現數據冗餘。

2、AO表，看名字就知道，隻追加的存儲，刪除更新數據時，通過另一個BITMAP文件來標記被刪除的行，通過bit以及偏移對齊來判定AO表上的某一行是否被刪除。

事務結束時，需要調用FSYNC，記錄最後一次寫入對應的數據塊的偏移。（並且這個數據塊即使隻有一條記錄，下次再發起事務又會重新追加一個數據塊）同時發送對應的數據塊給MIRROR實現數據冗餘。

因此AO表不適合小事務，因為每次事務結束都會FSYNC，同時事務結束後這個數據塊即使有空餘也不會被複用。（你可以測試一下，AO表單條提交的IO放大很嚴重）。

雖然如此，AO表非常適合OLAP場景，批量的數據寫入，高壓縮比，邏輯備份支持增量備份，因此每次記錄備份到的偏移量即可。加上每次備份全量的BITMAP刪除標記（很小）。

什麼時候選擇堆表

當數據寫入時，小事務偏多時選擇堆表。

當需要時間點恢複時，選擇堆表。

什麼時候選擇AO表

當需要列存時，選擇AO表。

當數據批量寫入時，選擇AO表。

測試對比行存deform和列存的性能差別

1、創建一個函數，用於創建400列的表（行存堆表、AO行存表、AO列存表）。

create or replace function f(name, int, text) returns void as $$  
declare  
  res text := '';  
begin  
  for i in 1..$2 loop  
    res := res||'c'||i||' int8,';  
  end loop;  
  res := rtrim(res, ',');  
  if $3 = 'ao_col' then  
    res := 'create table '||$1||'('||res||') with  (appendonly=true, blocksize=8192, compresstype=none, orientation=column)';  
  elsif $3 = 'ao_row' then  
    res := 'create table '||$1||'('||res||') with  (appendonly=true, blocksize=8192, orientation=row)';  
  elsif $3 = 'heap_row' then  
    res := 'create table '||$1||'('||res||') with  (appendonly=false)';  
  else  
    raise notice 'use ao_col, ao_row, heap_row as $3';  
    return;  
  end if;  
  execute res;  
end;  
$$ language plpgsql;

2、創建表如下

postgres=# select f('tbl_ao_col', 400, 'ao_col');  
postgres=# select f('tbl_ao_row', 400, 'ao_row');  
postgres=# select f('tbl_heap_row', 400, 'heap_row');

3、創建1個函數，用於填充數據，其中第一個和最後3個字段為測試數據的字段，其他都填充1。

create or replace function f_ins1(name, int, int8) returns void as $$  
declare  
  res text := '';  
begin  
  for i in 1..($2-4) loop  
    res := res||'1,';  
  end loop;  
  res := 'id,'||res;  
  res := rtrim(res, ',');  
  res := 'insert into '||$1||' select '||res||'id,random()*10000,random()*100000 from generate_series(1,'||$3||') t(id)';  
  execute res;  
end;  
$$ language plpgsql;

4、填充數據

postgres=# select f_ins1('tbl_ao_col',400,1000000);

5、創建1個函數，用於填充數據，其中前4個字段為測試數據的字段，其他都填充1。

create or replace function f_ins2(name, int, int8) returns void as $$  
declare  
  res text := '';  
begin  
  for i in 1..($2-4) loop  
    res := res||'1,';  
  end loop;  
  res := 'id,id,random()*10000,random()*100000,'||res;  
  res := rtrim(res, ',');  
  res := 'insert into '||$1||' select '||res||' from generate_series(1,'||$3||') t(id)';  
  execute res;  
end;  
$$ language plpgsql;

6、填充數據

postgres=# select f_ins1('tbl_ao_col',400,1000000);  
 f_ins1   
--------  
   
(1 row)  
  
postgres=# insert into tbl_ao_row select * from tbl_ao_col;  
INSERT 0 1000000  
postgres=# insert into tbl_heap_row select * from tbl_ao_col;  
INSERT 0 1000000

7、表分析

postgres=# analyze tbl_ao_col ;  
ANALYZE  
postgres=# analyze tbl_ao_row;  
ANALYZE  
postgres=# analyze tbl_heap_row;  
ANALYZE

8、表大小

postgres=# select pg_size_pretty(pg_relation_size('tbl_ao_col'));  
 pg_size_pretty   
----------------  
 3060 MB  
(1 row)  
  
postgres=# select pg_size_pretty(pg_relation_size('tbl_ao_row'));  
 pg_size_pretty   
----------------  
 3117 MB  
(1 row)  
  
postgres=# select pg_size_pretty(pg_relation_size('tbl_heap_row'));  
 pg_size_pretty   
----------------  
 3473 MB  
(1 row)

9、行存堆表，前麵幾個字段的統計

postgres=# explain analyze select c2,count(*),sum(c3),avg(c3),min(c3),max(c3) from tbl_heap_row group by c2;  
                                                                        QUERY PLAN                                                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=136132.40..136132.42 rows=1 width=96)  
   Rows out:  1 rows at destination with 135 ms to end, start offset by 1.922 ms.  
   ->  HashAggregate  (cost=136132.40..136132.42 rows=1 width=96)  
         Group By: tbl_heap_row.c2  
         Rows out:  1 rows (seg42) with 0.002 ms to first row, 36 ms to end, start offset by 48 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=136132.35..136132.37 rows=1 width=96)  
               Hash Key: tbl_heap_row.c2  
               Rows out:  48 rows at destination (seg42) with 53 ms to end, start offset by 48 ms.  
               ->  HashAggregate  (cost=136132.35..136132.35 rows=1 width=96)  
                     Group By: tbl_heap_row.c2  
                     Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.008 ms to first row, 1.993 ms to end, start offset by 48 ms.  
                     ->  Seq Scan on tbl_heap_row  (cost=0.00..121134.54 rows=20831 width=16)  
                           Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 40 ms to first row, 73 ms to end, start offset by 50 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 345K bytes.  
   (slice1)    Executor memory: 751K bytes avg x 48 workers, 751K bytes max (seg0).  
   (slice2)    Executor memory: 359K bytes avg x 48 workers, 374K bytes max (seg42).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 138.524 ms  
(22 rows)

10、行存堆表，末尾幾個字段的統計

postgres=# explain analyze select c398,count(*),sum(c399),avg(c399),min(c399),max(c399) from tbl_heap_row group by c398;  
                                                                         QUERY PLAN                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=136576.82..136799.05 rows=9877 width=96)  
   Rows out:  10001 rows at destination with 212 ms to end, start offset by 1.917 ms.  
   ->  HashAggregate  (cost=136576.82..136799.05 rows=206 width=96)  
         Group By: tbl_heap_row.c398  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.001 ms to first row, 70 ms to end, start offset by 14 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=136132.35..136329.89 rows=206 width=96)  
               Hash Key: tbl_heap_row.c398  
               Rows out:  Avg 8762.2 rows x 48 workers at destination.  Max 9422 rows (seg46) with 93 ms to end, start offset by 48 ms.  
               ->  HashAggregate  (cost=136132.35..136132.35 rows=206 width=96)  
                     Group By: tbl_heap_row.c398  
                     Rows out:  Avg 8762.2 rows x 48 workers.  Max 8835 rows (seg2) with 0.003 ms to first row, 12 ms to end, start offset by 49 ms.  
                     ->  Seq Scan on tbl_heap_row  (cost=0.00..121134.54 rows=20831 width=16)  
                           Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 40 ms to first row, 133 ms to end, start offset by 51 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 377K bytes.  
   (slice1)    Executor memory: 1156K bytes avg x 48 workers, 1156K bytes max (seg0).  
   (slice2)    Executor memory: 414K bytes avg x 48 workers, 414K bytes max (seg1).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 214.024 ms  
(22 rows)

11、行存AO表，前麵幾個字段的統計

postgres=# explain analyze select c2,count(*),sum(c3),avg(c3),min(c3),max(c3) from tbl_ao_row group by c2;  
                                                                  QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=124755.04..124755.07 rows=1 width=96)  
   Rows out:  1 rows at destination with 149 ms to end, start offset by 1.890 ms.  
   ->  HashAggregate  (cost=124755.04..124755.07 rows=1 width=96)  
         Group By: tbl_ao_row.c2  
         Rows out:  1 rows (seg42) with 0.004 ms to first row, 55 ms to end, start offset by 64 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=124755.00..124755.02 rows=1 width=96)  
               Hash Key: tbl_ao_row.c2  
               Rows out:  48 rows at destination (seg42) with 32 ms to end, start offset by 64 ms.  
               ->  HashAggregate  (cost=124755.00..124755.00 rows=1 width=96)  
                     Group By: tbl_ao_row.c2  
                     Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.001 ms to first row, 46 ms to end, start offset by 59 ms.  
                     ->  Append-only Scan on tbl_ao_row  (cost=0.00..109755.00 rows=20834 width=16)  
                           Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 24 ms to end, start offset by 59 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 345K bytes.  
   (slice1)    Executor memory: 770K bytes avg x 48 workers, 770K bytes max (seg0).  
   (slice2)    Executor memory: 359K bytes avg x 48 workers, 374K bytes max (seg42).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 152.386 ms  
(22 rows)

12、行存AO表，末尾幾個字段的統計

postgres=# explain analyze select c398,count(*),sum(c399),avg(c399),min(c399),max(c399) from tbl_ao_row group by c398;  
                                                                     QUERY PLAN                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=125186.01..125401.52 rows=9578 width=96)  
   Rows out:  10001 rows at destination with 183 ms to end, start offset by 1.846 ms.  
   ->  HashAggregate  (cost=125186.01..125401.52 rows=200 width=96)  
         Group By: tbl_ao_row.c398  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.003 ms to first row, 97 ms to end, start offset by 22 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=124755.00..124946.56 rows=200 width=96)  
               Hash Key: tbl_ao_row.c398  
               Rows out:  Avg 8762.2 rows x 48 workers at destination.  Max 9422 rows (seg46) with 32 ms to end, start offset by 68 ms.  
               ->  HashAggregate  (cost=124755.00..124755.00 rows=200 width=96)  
                     Group By: tbl_ao_row.c398  
                     Rows out:  Avg 8762.2 rows x 48 workers.  Max 8835 rows (seg2) with 0.013 ms to first row, 48 ms to end, start offset by 22 ms.  
                     ->  Append-only Scan on tbl_ao_row  (cost=0.00..109755.00 rows=20834 width=16)  
                           Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 22 ms to end, start offset by 71 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 377K bytes.  
   (slice1)    Executor memory: 1144K bytes avg x 48 workers, 1144K bytes max (seg0).  
   (slice2)    Executor memory: 414K bytes avg x 48 workers, 414K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 184.723 ms  
(22 rows)

13、列存AO表，前麵幾個字段的統計

postgres=# explain analyze select c2,count(*),sum(c3),avg(c3),min(c3),max(c3) from tbl_ao_col group by c2;  
                                                                    QUERY PLAN                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=122928.04..122928.07 rows=1 width=96)  
   Rows out:  1 rows at destination with 104 ms to end, start offset by 1.878 ms.  
   ->  HashAggregate  (cost=122928.04..122928.07 rows=1 width=96)  
         Group By: tbl_ao_col.c2  
         Rows out:  1 rows (seg42) with 0.003 ms to first row, 18 ms to end, start offset by 55 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=122928.00..122928.02 rows=1 width=96)  
               Hash Key: tbl_ao_col.c2  
               Rows out:  48 rows at destination (seg42) with 30 ms to end, start offset by 55 ms.  
               ->  HashAggregate  (cost=122928.00..122928.00 rows=1 width=96)  
                     Group By: tbl_ao_col.c2  
                     Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.007 ms to first row, 3.991 ms to end, start offset by 54 ms.  
                     ->  Append-only Columnar Scan on tbl_ao_col  (cost=0.00..107928.00 rows=20834 width=16)  
                           Rows out:  0 rows (seg0) with 40 ms to end, start offset by 56 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 345K bytes.  
   (slice1)    Executor memory: 903K bytes avg x 48 workers, 903K bytes max (seg0).  
   (slice2)    Executor memory: 359K bytes avg x 48 workers, 374K bytes max (seg42).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 106.859 ms  
(22 rows)

14、列存AO表，末尾幾個字段的統計

postgres=# explain analyze select c398,count(*),sum(c399),avg(c399),min(c399),max(c399) from tbl_ao_col group by c398;  
                                                                       QUERY PLAN                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=123364.18..123582.28 rows=9693 width=96)  
   Rows out:  10001 rows at destination with 120 ms to end, start offset by 1.921 ms.  
   ->  HashAggregate  (cost=123364.18..123582.28 rows=202 width=96)  
         Group By: tbl_ao_col.c398  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.001 ms to first row, 54 ms to end, start offset by 35 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=122928.00..123121.86 rows=202 width=96)  
               Hash Key: tbl_ao_col.c398  
               Rows out:  Avg 8762.2 rows x 48 workers at destination.  Max 9422 rows (seg46) with 31 ms to end, start offset by 63 ms.  
               ->  HashAggregate  (cost=122928.00..122928.00 rows=202 width=96)  
                     Group By: tbl_ao_col.c398  
                     Rows out:  Avg 8762.2 rows x 48 workers.  Max 8835 rows (seg2) with 0.004 ms to first row, 8.004 ms to end, start offset by 82 ms.  
                     ->  Append-only Columnar Scan on tbl_ao_col  (cost=0.00..107928.00 rows=20834 width=16)  
                           Rows out:  0 rows (seg0) with 28 ms to end, start offset by 64 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 377K bytes.  
   (slice1)    Executor memory: 1272K bytes avg x 48 workers, 1272K bytes max (seg0).  
   (slice2)    Executor memory: 414K bytes avg x 48 workers, 414K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 122.173 ms  
(22 rows)

小結，對於非分布鍵的分組聚合請求，Greenplum采用了3階段聚合：

第一階段，在SEGMENT本地聚合。（需要掃描所有數據，這裏不同存儲，前麵的列和後麵的列的差別就體現出來了，行存儲的deform開銷，在對後麵的列進行統計時性能影響很明顯。）

第二階段，根據分組字段，重分布數據。（需要重分布需要用到的字段。）

第三階段，在SEGMENT本地聚合。（需要對重分布後的數據進行聚合。）

第四階段，返回結果。

參考

《PostgreSQL 向量化執行插件(瓦片式實現) 10x提速OLAP》

最後更新：2017-08-20 17:06:22

阿裏雲HybridDB for PG實踐 - 行存、列存，堆表、AO表的原理和選擇

標簽

背景

行存和列存的原理

什麼時候選擇行存

什麼時候選擇列存

堆表和AO表的原理

什麼時候選擇堆表

什麼時候選擇AO表

測試對比行存deform和列存的性能差別

參考

上一篇：阿裏雲HybridDB for PG實踐 - 列存儲加字段和默認值

下一篇： PostgreSQL 單列多條件查詢優化 - 多個多邊形查詢4倍提升的技巧

相關內容

熱門內容

最新內容

阿裏雲HybridDB for PG實踐 - 行存、列存，堆表、AO表的原理和選擇

標簽

背景

行存和列存的原理

什麼時候選擇行存

什麼時候選擇列存

堆表和AO表的原理

什麼時候選擇堆表

什麼時候選擇AO表

測試對比行存deform和列存的性能差別

參考

上一篇： 阿裏雲HybridDB for PG實踐 - 列存儲加字段和默認值

下一篇： PostgreSQL 單列多條件查詢優化 - 多個多邊形查詢4倍提升的技巧

相關內容

熱門內容

最新內容

上一篇：阿裏雲HybridDB for PG實踐 - 列存儲加字段和默認值