661 阿裏雲技術社區[雲棲]

行為、審計日誌 (實時索引/實時搜索)建模 - 最佳實踐 2

標簽

PostgreSQL , ES , 搜索引擎 , 全文檢索 , 日誌分析 , 倒排索引 , 優化 , 分區 , 分片 , 審計日誌 , 行為日誌 , schemaless

背景

在很多係統中會記錄用戶的行為日誌，行為日誌包括瀏覽行為、社交行為、操作行為等。

典型的應用例如：數據庫的SQL審計、企業內部的堡壘機（行為審計）等。

前麵寫了一篇最佳實踐，通過PostgreSQL來存儲審計日誌，同時對審計日誌需要檢索的字段建立全文索引。

SSD機器可以達到7萬/s的寫入（換算成全文索引條目，約280萬/s的條目建立速度）。達到這個性能指標時，CPU，磁盤IO全部吃滿。

全文如下：

《行為、審計日誌 (實時索引/實時搜索)建模 - 最佳實踐》

除了使用全文索引，還有其他方法呢？

本文將采用另一個角度來解決審計日誌的檢索和高速寫入問題。

審計日誌帶三個維度的查詢條件，一個是UID，一個是時間範圍，最後是詞條匹配。

1. UID表示客戶ID，用來區分不同用戶產生的行為數據。

2. TS字段，表示日誌是什麼時間點產生的。

3. 行為數據字段，表示用戶的行為。

優化思路：

1. 將UID作為表名的一部分，每個UID一張表。

（好處：省一個字段，節約空間。同時在數據組織時不會混淆不同用戶的數據，查詢時消除了IO放大的問題，提升了查詢效率。）

（缺點：每個UID一張表，表可能很多，元數據會變大。變更結構時，可能需要涉及較多表。）

2. TS字段，采用BRIN塊級索引，因為每個用戶產生的行為數據，都是時間順序的，所以堆存儲與值順序有非常強的線性相關性。

3. 將數據打散存放，使用元數據記錄UID對應的DB list，隨機寫入對應的DBs，查詢時按TS範圍查詢，查詢所有的DBs匯聚（應用層負責merge sort）後返回（行為字段不使用索引）。

PostgreSQL 10內置了merge sort的功能，所以你如果需要一個中間層來實現merge sort的話，PG也是個不錯的選擇。

隻需要將所有的數據源配置為fdw子表即可。

例如

postgres=# explain select * from bptest where ts between now()-interval '20 min' and now()-interval '10 min' order by ts;  
                                                     QUERY PLAN                                                        
---------------------------------------------------------------------------------------------------------------------  
 Merge Append  (cost=34.00..634325.20 rows=4808182 width=524)  
   Sort Key: bptest.ts  
   ->  Index Scan using idx_bptest_ts on bptest  (cost=0.14..3.15 rows=1 width=40)  
         Index Cond: ((ts >= (now() - '00:20:00'::interval)) AND (ts <= (now() - '00:10:00'::interval)))  
   ->  Index Scan using bptest_32_20170522_ts_idx on bptest_32_20170522  (cost=0.30..4802.19 rows=49918 width=524)  
         Index Cond: ((ts >= (now() - '00:20:00'::interval)) AND (ts <= (now() - '00:10:00'::interval)))  
   ->  Index Scan using bptest_64_20170522_ts_idx on bptest_64_20170522  (cost=0.30..4114.22 rows=42820 width=524)  
         Index Cond: ((ts >= (now() - '00:20:00'::interval)) AND (ts <= (now() - '00:10:00'::interval)))  
   ->  Index Scan using bptest_34_20170522_ts_idx on bptest_34_20170522  (cost=0.30..3984.00 rows=41459 width=524)  
         Index Cond: ((ts >= (now() - '00:20:00'::interval)) AND (ts <= (now() - '00:10:00'::interval)))  
   ->  Index Scan using bptest_37_20170522_ts_idx on bptest_37_20170522  (cost=0.30..4898.77 rows=50972 width=524)  
         Index Cond: ((ts >= (now() - '00:20:00'::interval)) AND (ts <= (now() - '00:10:00'::interval)))  
   ->  Index Scan using bptest_31_20170522_ts_idx on bptest_31_20170522  (cost=0.30..4682.19 rows=48668 width=524)  
   .....................

設計

方案1：

GIN索引 BUILD全文索引的方式，6萬tps時，基本榨幹了CPU和IO資源。BCACHE GC或輕微的IO抖動，會導致比較嚴重的性能變化。

方案2：

通過uid+ts_prefix分區，確保一個用戶的數據在一份堆存儲中，減少檢索時的IO開銷。

ts字段具備時序屬性，通過brin塊級索引降低索引大小。

當數據量達到一定程度時，自動觸發PG10並行查詢特性，提升查詢性能。

由於UID數據已經分片，查詢時會輸入TS和文本匹配兩個變量，數據分配到每個節點已經不多，使用模煳查詢代替全文檢索，加上PG10的多核並行，完全可以滿足查詢響應時延需求。

元數據表結構

create table db_meta
(
dbid int primary key, -- 每個數據庫節點一條記錄，表示一個數據庫分片
groupid int, -- 每個分片屬於一個分組
conn_info text -- 連接信息(URL)
);

create table uid_mapping
(
uid int primary key, -- 客戶唯一標示
dbgroupid int -- 數據庫分組，表示這個用戶的數據隨機寫入這個分組的所有分片中。
);

行為數據保留一段時間後清除。

如果用戶覺得這樣設計比較麻煩，可以將所有的數據庫作為一個大池，所有用戶都隨機寫入這個大池。

這種設計就好像greenplum和hawq的設計理念。greenplum是大池思想，hawq是分而治之思想。

行為數據表結構

主表結構：

create table bptest  
(  
  ts timestamptz,   -- 行為發生時間  
  content text    -- 行為內容  
);

每個用戶的表名為bptest_$uid_$yyyymmdd。

結構和索引與主表保持一致。

行為數據索引

ts字段的存儲順序與值的順序有非常強的線性相關性，采用塊級索引。

BRIN索引相比btree索引節省幾百倍空間，同時提升寫入性能。

create index idx_bptest_ts on bptest using brin(ts);  -- TS字段塊級索引

數據分片設計

每個分片屬於一個組，每個UID的數據隨機的寫入一個指定組的所有分片。

如果用戶覺得這樣設計比較麻煩，可以將所有的數據庫作為一個大池，所有用戶都隨機寫入這個大池。

就好像greenplum和hawq的設計理念。greenplum是大池思想，hawq是分而治之思想。

數據匯聚設計

當需要查詢某個UID的行為數據時，並行查詢所有分片的數據，按ts字段merge sort並返回。

merge sort可以放在數據庫中實現，也可以在應用層實現。

數據merge sort設計（建議業務層實現，以提升性能）

如果merge sort放在數據庫層實現，可以使用PostgreSQL 10的postgres_fdw，每個UID的每個分片對應一張FDW TABLE，掛在UID對應的父表中。

當查詢父表時，按TS排序，會使用merge sort。

merge sort功能詳見：

《PostgreSQL 10.0 preview 性能增強 - mergesort(Gather merge)》

排序下推功能詳見：

《PostgreSQL 10.0 preview sharding增強 - pushdown 增強》

如果在應用層實現，方法與之類似，並行的查詢UID對應的所有分片，每個分片都是有order by返回，在應用層使用merge sort的方法返回給客戶端。

數據寫入schema less設計（建議業務層實現，以提高性能）

由於每個UID對應若幹張表bptest_$uid_$yyyymmdd，我們可以在數據庫端設計類似mongo的schemaless寫入風格：

有表時則插入，沒有表時則創建後再插入。

實現方法詳見

《PostgreSQL schemaless 的實現(類mongodb collection)》

創建一個自動建表的函數，用於自動創建目標表。

create or replace function create_schemaless(target name) returns void as $$    
declare    
begin    
  execute format('create table if not exists %I (like bptest including all)', target);    
exception when others then    
  return;    
end;    
$$ language plpgsql strict;

創建一個插入數據的函數，使用動態SQL，如果遇到表不存在的錯誤，則調用建表函數進行建表。

create or replace function ins_schemaless(uid int, ts timestamptz, content text) returns void as $$    
declare   
  target name := 'bptest_'||uid||'_'||to_char(ts,'yyyymmdd');  
begin    
  execute format('insert into %I values (%L, %L)', target, ts, content);    
  exception     
    WHEN SQLSTATE '42P01' THEN     
    perform create_schemaless(target);    
    execute format('insert into %I values (%L, %L)', target, ts, content);     
end;    
$$ language plpgsql strict;

數據庫端的schemaless會犧牲一部分性能，因為無法使用綁定變量。

建議業務層實現schemaless(自動拚接表名，自動建表)，以提高性能。

曆史數據處理設計

曆史數據，可以清除，直接DROP分表即可(bptest_$uid_$yyyymmdd)。

如果有保留數據的需求，可以通過阿裏雲RDS PostgreSQL的OSS_FDW接口將數據寫入OSS對象存儲永久保存，要讀取時，通過FDW讀取。

雲端存儲與計算分離用法：

《RDS PostgreSQL : 使用 oss_fdw 讀寫OSS對象存儲》

《HybridDB PostgreSQL : 使用 oss_fdw 讀寫OSS對象存儲》

分析需求

如果有審計日誌的分析需求，可以將RDS PostgreSQL數據寫入OSS，通過HybridDB for PostgreSQL進行分析。

性能測試

環境準備

與上文《行為、審計日誌 (實時索引/實時搜索)建模 - 最佳實踐》一樣，使用同樣的服務器，依舊是12個實例。

1. 環境變量配置

vi env_pg10.sh   
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=$(($1+1920))  
export PGDATA=/disk$1/digoal/pgdata/pg_root$(($1+1920))  
export LANG=en_US.utf8  
export PGHOME=/home/digoal/pgsql10_8k  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export LD_RUN_PATH=$LD_LIBRARY_PATH  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=127.0.0.1  
export PGUSER=postgres  
export PGDATABASE=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi

2. 初始化SQL

vi init.sql  
  
-- 清除環境  
drop table bptest cascade;  
  
-- 初始父表  
create table bptest  
(  
  ts timestamptz,   -- 行為發生時間  
  content text    -- 行為內容  
);   
  
-- 建立索引  
-- create index idx_bptest_ts on bptest using brin(ts) with(pages_per_range=1);  -- TS字段塊級索引，如果不需要按ts排序輸出結果，建議使用brin索引  
create index idx_bptest_ts on bptest using btree(ts) ;  -- 如果需要按ts sort輸出結果，建議使用btree  
  
-- schemaless相關，自動建表函數  
create or replace function create_schemaless(target name) returns void as $$    
declare    
begin    
  execute format('create table if not exists %I (like bptest including all) inherits(bptest)', target);    
  -- 這裏可以設置列的statistic  
exception when others then    
  return;    
end;    
$$ language plpgsql strict;    
  
-- 數據寫入函數  
create or replace function ins_schemaless(uid int, ts timestamptz, content text) returns void as $$    
declare   
  target name := 'bptest_'||uid||'_'||to_char(ts,'yyyymmdd');  
begin    
  execute format('insert into %I values (%L, %L)', target, ts, content);    
  exception     
    WHEN SQLSTATE '42P01' THEN     
    perform create_schemaless(target);    
    execute format('insert into %I values (%L, %L)', target, ts, content);     
end;    
$$ language plpgsql strict;    
  
-- 產生隨機字符串的函數, 本例的CPU消耗大量來自這個函數  
CREATE OR REPLACE FUNCTION public.gen_rand_str(integer)  
 RETURNS text  
 LANGUAGE sql  
 STRICT  
AS $function$  
  select string_agg(a[(random()*6)::int+1],'') from generate_series(1,$1), (select array['a','b','c','d','e','f',' ']) t(a);  
$function$;  
  
-- 分頁評估函數, 用於評估返回多少條記錄，盡量避免使用count(*)估算精確值  
CREATE FUNCTION count_estimate(query text) RETURNS INTEGER AS    
$func$    
DECLARE    
    rec   record;    
    ROWS  INTEGER;    
BEGIN    
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP    
        ROWS := SUBSTRING(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');    
        EXIT WHEN ROWS IS NOT NULL;    
    END LOOP;    
    RETURN ROWS;    
END    
$func$ LANGUAGE plpgsql;

初始化每個數據庫實例

psql -f ./init.sql -p 1921  
...  
psql -f ./init.sql -p 1932

寫入壓測

12個庫，100個UID。

每個UID每個庫寫入1000萬記錄，每個UID總共寫入1.2億，所有UID總共寫入120億記錄。

使用gen_rand_str生成指定長度的隨機字符串。

vi test.sql  
  
  
\set uid random(1,100)  
select ins_schemaless(:uid, now(), gen_rand_str(512));

測試腳本

vi test.sh  
  
  
for ((i=1;i<13;i++))  
do  
  . /home/digoal/env_pg10.sh ${i}  
  nohup pgbench -M prepared -n -r -P 3 -f ./test.sql -c 10 -j 10 -t 100000000 >/tmp/bptest_${i}.log 2>&1 &  
done

測試結果

寫入性能:

1. 使用brin索引時 9.47萬/s

2. 使用btree索引時 7.9萬/s

3. 服務器資源開銷：

CPU：  
  
Cpu(s): 90.7%us,  7.1%sy,  0.1%ni,  0.4%id,  0.1%wa,  0.0%hi,  1.6%si,  0.0%st  
  
IOUITL：  
  
SSD: 50 %

瓶頸分析

1. 大部分CPU開銷在產生隨機串的函數中，所以實際場景，CPU的消耗會小很多。

如下

perf top -ag  
  
            60953.00  6.0% ExecInterpExpr                          /home/digoal/pgsql10_8k/bin/postgres  
            33647.00  3.3% AllocSetAlloc                           /home/digoal/pgsql10_8k/bin/postgres  
            27560.00  2.7% advance_aggregates                      /home/digoal/pgsql10_8k/bin/postgres  
            22894.00  2.3% base_yyparse                            /home/digoal/pgsql10_8k/bin/postgres  
            21976.00  2.2% SearchCatCache                          /home/digoal/pgsql10_8k/bin/postgres  
            21768.00  2.2% array_seek                              /home/digoal/pgsql10_8k/bin/postgres  
            20957.00  2.1% appendBinaryStringInfo                  /home/digoal/pgsql10_8k/bin/postgres  
            19912.00  2.0% memcpy                                  /lib64/libc-2.12.so                     
            17058.00  1.7% array_get_element                       /home/digoal/pgsql10_8k/bin/postgres  
            14916.00  1.5% heap_form_minimal_tuple                 /home/digoal/pgsql10_8k/bin/postgres  
            13617.00  1.3% heap_fill_tuple                         /home/digoal/pgsql10_8k/bin/postgres  
            12201.00  1.2% __rint                                  /lib64/libm-2.12.so                     
            10938.00  1.1% palloc0                                 /home/digoal/pgsql10_8k/bin/postgres  
            10683.00  1.1% MemoryContextAllocZeroAligned           /home/digoal/pgsql10_8k/bin/postgres  
            10591.00  1.0% ExecStoreMinimalTuple                   /home/digoal/pgsql10_8k/bin/postgres  
            10292.00  1.0% string_agg_transfn                      /home/digoal/pgsql10_8k/bin/postgres  
            10202.00  1.0% advance_transition_function             /home/digoal/pgsql10_8k/bin/postgres  
             9973.00  1.0% core_yylex                              /home/digoal/pgsql10_8k/bin/postgres

2. bcache問題

bcache垃圾回收時，對IO的影響非常嚴重。

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
40207 root      20   0     0    0    0 R 100.0  0.0  34:47.86 [bch_gc-0b34a79a]

await已經到秒級

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util  
bcache11          0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00  
bcache10          0.00  6054.00    1.00  105.00     8.00 48640.00   458.94    17.19  157.73   9.43 100.00  
bcache9           0.00  2431.00    0.00  100.00     0.00 29408.00   294.08     4.22  102.87   9.86  98.60  
bcache8           0.00  4013.00    0.00   85.00     0.00 37352.00   439.44    10.04  174.19  11.76 100.00  
bcache7           0.00  1661.00    0.00   58.00     0.00 12952.00   223.31     2.34   53.47  17.02  98.70  
bcache6           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00  
bcache5           0.00    48.00    0.00   71.00     0.00 44896.00   632.34   183.24 2124.06  14.08 100.00  
bcache4           0.00  6506.00    0.00  211.00     0.00 56312.00   266.88    74.89  488.20   4.74 100.00  
bcache3           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00  
bcache2           0.00  5502.00    0.00  103.00     0.00 44168.00   428.82   144.85  709.10   9.71 100.00  
bcache1           0.00 12302.00    0.00   34.00     0.00 13464.00   396.00    84.11  877.03  29.41 100.00  
bcache0           0.00 15148.00    0.00  260.00     0.00 42504.00   163.48    47.39  155.56   3.85 100.00

3. 配置了smooth checkpoint後，checkpoint已經沒有問題, sync時間非常短暫。

2017-05-23 10:49:55.749 CST,,,25095,,59239d76.6207,9,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,"LogCheckpointStart, xlog.c:8369",""  
2017-05-23 10:50:25.434 CST,,,25095,,59239d76.6207,10,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint complete: wrote 233036 buffers (5.6%); 0 WAL file(s) added, 0 removed, 131 recycled; write=29.250 s, sync=0.018 s, total=29.685 s; sync files=300, longest=0.004 s, average=0.000 s; distance=2145650 kB, estimate=2145650 kB",,,,,,,,"LogCheckpointEnd, xlog.c:8451",""  
2017-05-23 10:54:55.529 CST,,,25095,,59239d76.6207,11,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,"LogCheckpointStart, xlog.c:8369",""  
2017-05-23 10:57:59.222 CST,,,25095,,59239d76.6207,12,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint complete: wrote 191434 buffers (4.6%); 0 WAL file(s) added, 0 removed, 131 recycled; write=118.012 s, sync=59.816 s, total=183.693 s; sync files=300, longest=16.126 s, average=0.199 s; distance=1752021 kB, estimate=2106288 kB",,,,,,,,"LogCheckpointEnd, xlog.c:8451",""  
2017-05-23 10:59:55.328 CST,,,25095,,59239d76.6207,13,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,"LogCheckpointStart, xlog.c:8369",""  
2017-05-23 11:00:25.350 CST,,,25095,,59239d76.6207,14,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint complete: wrote 176278 buffers (4.2%); 0 WAL file(s) added, 0 removed, 107 recycled; write=29.688 s, sync=0.009 s, total=30.021 s; sync files=300, longest=0.003 s, average=0.000 s; distance=1630757 kB, estimate=2058734 kB",,,,,,,,"LogCheckpointEnd, xlog.c:8451",""  
2017-05-23 11:04:55.467 CST,,,25095,,59239d76.6207,15,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,"LogCheckpointStart, xlog.c:8369",""  
2017-05-23 11:05:25.381 CST,,,25095,,59239d76.6207,16,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint complete: wrote 232981 buffers (5.6%); 0 WAL file(s) added, 0 removed, 99 recycled; write=29.555 s, sync=0.013 s, total=29.914 s; sync files=300, longest=0.006 s, average=0.000 s; distance=2142180 kB, estimate=2142180 kB",,,,,,,,"LogCheckpointEnd, xlog.c:8451",""  
2017-05-23 11:06:44.503 CST,,,36368,"127.0.0.1:44645",5923a744.8e10,1,"",2017-05-23 11:06:44 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=44645",,,,,,,,"BackendInitialize, postmaster.c:4178",""  
2017-05-23 11:06:44.504 CST,"postgres","postgres",36368,"127.0.0.1:44645",5923a744.8e10,2,"authentication",2017-05-23 11:06:44 CST,13/223,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,"PerformAuthentication, postinit.c:272",""  
2017-05-23 11:09:55.512 CST,,,25095,,59239d76.6207,17,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,"LogCheckpointStart, xlog.c:8369",""  
2017-05-23 11:09:56.221 CST,"postgres","postgres",36368,"127.0.0.1:44645",5923a744.8e10,3,"idle",2017-05-23 11:06:44 CST,,0,LOG,00000,"disconnection: session time: 0:03:11.717 user=postgres database=postgres host=127.0.0.1 port=44645",,,,,,,,"log_disconnections, postgres.c:4500","psql"  
2017-05-23 11:10:25.512 CST,,,25095,,59239d76.6207,18,,2017-05-23 10:24:54 CST,,0,LOG,00000,"checkpoint complete: wrote 232090 buffers (5.5%); 0 WAL file(s) added, 0 removed, 131 recycled; write=29.489 s, sync=0.141 s, total=29.999 s; sync files=300, longest=0.033 s, average=0.000 s; distance=2136269 kB, estimate=2141589 kB",,,,,,,,"LogCheckpointEnd, xlog.c:8451",""

查詢壓測

單節點2100萬記錄。

postgres=# select count(*) from bptest_1_20170522;  
  count     
----------  
 21106948  
(1 row)  
Time: 1776.889 ms (00:01.777)

查詢需求：

1. 範圍查詢，排序輸出

返回462萬記錄，2.5秒。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from bptest_1_20170522 where ts between '2017-05-22 19:04:19.05347+08' and '2017-05-22 20:04:19.05347+08' order by ts;  
                                                                                        QUERY PLAN                                                                                           
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using bptest_1_20170522_ts_idx on public.bptest_1_20170522  (cost=0.44..2373942.95 rows=4631011 width=524) (actual time=0.015..2326.653 rows=4622534 loops=1)  
   Output: ts, content  
   Index Cond: ((bptest_1_20170522.ts >= '2017-05-22 19:04:19.05347+08'::timestamp with time zone) AND (bptest_1_20170522.ts <= '2017-05-22 20:04:19.05347+08'::timestamp with time zone))  
   Buffers: shared hit=4639344  
 Planning time: 0.207 ms  
 Execution time: 2578.147 ms  
(6 rows)  
Time: 2578.789 ms (00:02.579)

2. 範圍+全文檢索查詢，排序輸出

返回2941196萬記錄，8.5秒。

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from bptest_1_20170522    
where ts between '2017-05-22 19:04:19.05347+08' and '2017-05-22 20:04:19.05347+08'   
and content ~ 'abc' and content ~ 'bcd'   
order by ts;  
                                                     QUERY PLAN                                                                                                    
--------------------------------------------------------------------------------------------------------------------------  
 Gather Merge  (cost=1890834.11..1933136.32 rows=2870936 width=524) (actual time=6601.842..8136.187 rows=2941196 loops=1)  
   Output: ts, content  
   Workers Planned: 4  
   Workers Launched: 4  
   Buffers: shared hit=84046, temp read=102440 written=102588  
   ->  Sort  (cost=1890834.06..1892628.39 rows=717734 width=524) (actual time=6584.684..6804.063 rows=588239 loops=5)  
         Output: ts, content  
         Sort Key: bptest_1_20170522.ts  
         Sort Method: external merge  Disk: 313080kB  
         Buffers: shared hit=347169, temp read=501609 written=502338  
         Worker 0: actual time=6582.649..6803.139 rows=588224 loops=1  
           Buffers: shared hit=66037, temp read=100242 written=100388  
         Worker 1: actual time=6590.768..6813.019 rows=587934 loops=1  
           Buffers: shared hit=66168, temp read=100191 written=100337  
         Worker 2: actual time=6579.297..6799.509 rows=587915 loops=1  
           Buffers: shared hit=66014, temp read=100172 written=100318  
         Worker 3: actual time=6569.191..6785.155 rows=578030 loops=1  
           Buffers: shared hit=64904, temp read=98564 written=98707  
         ->  Parallel Bitmap Heap Scan on public.bptest_1_20170522  (cost=72481.78..1603389.84 rows=717734 width=524) (actual time=338.604..5182.340 rows=588239 loops=5)  
               Output: ts, content  
               Recheck Cond: ((bptest_1_20170522.ts >= '2017-05-22 19:04:19.05347+08'::timestamp with time zone) AND (bptest_1_20170522.ts <= '2017-05-22 20:04:19.05347+08'::timestamp with time zone))  
               Rows Removed by Index Recheck: 19  
               Filter: ((bptest_1_20170522.content ~ 'abc'::text) AND (bptest_1_20170522.content ~ 'bcd'::text))  
               Rows Removed by Filter: 336268  
               Heap Blocks: exact=7063 lossy=60173  
               Buffers: shared hit=347141  
               Worker 0: actual time=336.885..5215.415 rows=588224 loops=1  
                 Buffers: shared hit=66030  
               Worker 1: actual time=337.105..5239.414 rows=587934 loops=1  
                 Buffers: shared hit=66161  
               Worker 2: actual time=337.128..5213.433 rows=587915 loops=1  
                 Buffers: shared hit=66007  
               Worker 3: actual time=337.078..5129.862 rows=578030 loops=1  
                 Buffers: shared hit=64897  
               ->  Bitmap Index Scan on bptest_1_20170522_ts_idx  (cost=0.00..71764.05 rows=4631011 width=0) (actual time=336.215..336.215 rows=4622534 loops=1)  
                     Index Cond: ((bptest_1_20170522.ts >= '2017-05-22 19:04:19.05347+08'::timestamp with time zone) AND (bptest_1_20170522.ts <= '2017-05-22 20:04:19.05347+08'::timestamp with time zone))  
                     Buffers: shared hit=16810  
 Planning time: 1.996 ms  
 Execution time: 8560.577 ms  
(39 rows)  
Time: 8563.154 ms (00:08.563)

3. 分頁數評估

如果業務允許，建議使用評估值，評估值的準確性取決於統計信息的準確性，使用alter table 表名 alter column 列名 SET STATISTICS 1000可以調整列的統計精準度，默認為100。

《論count與offset使用不當的罪名和分頁的優化》

評估記錄數與實際記錄數對比如下，足夠精確：

postgres=# \timing  
Timing is on.  
  
-- 評估記錄數  
postgres=# select count_estimate($$select * from bptest_1_20170522 where ts between '2017-05-22 19:04:19.05347+08' and '2017-05-22 20:04:19.05347+08'$$);  
 count_estimate   
----------------  
        4631011  
(1 row)  
Time: 0.733 ms  
  
-- 實際記錄數  
postgres=# select count(*) from bptest_1_20170522 where ts between '2017-05-22 19:04:19.05347+08' and '2017-05-22 20:04:19.05347+08';  
  count    
---------  
 4622534  
(1 row)  
Time: 1389.424 ms (00:01.389)  
  
-- 評估記錄數  
postgres=# select count_estimate($$select * from bptest_1_20170522    
where ts between '2017-05-22 19:04:19.05347+08' and '2017-05-22 20:04:19.05347+08'   
and content ~ 'abc' and content ~ 'bcd' and content ~ 'cdef'$$);  
 count_estimate   
----------------  
         914755  
(1 row)  
Time: 3.713 ms  
  
-- 實際記錄數  
postgres=# select count(*) from bptest_1_20170522    
where ts between '2017-05-22 19:04:19.05347+08' and '2017-05-22 20:04:19.05347+08'   
and content ~ 'abc' and content ~ 'bcd' and content ~ 'cdef';  
 count    
--------  
 962780  
(1 row)  
Time: 7756.863 ms (00:07.757)

4. 分頁查詢返回

流式返回，返回10行僅需0.562毫秒。

postgres=# begin;  
BEGIN  
Time: 0.170 ms  
postgres=# declare cur cursor for select * from bptest_1_20170522    
postgres-# where ts between '2017-05-22 19:04:19.05347+08' and '2017-05-22 20:04:19.05347+08'   
postgres-# and content ~ 'abc' and content ~ 'bcd'   
postgres-# order by ts;  
DECLARE CURSOR  
  
postgres=# fetch 10 from cur;  
 .........................  
(10 rows)  
Time: 0.562 ms  
  
下一頁，fetch繼續獲取

如果要回翻，使用SCROLL遊標

postgres=# begin;  
BEGIN  
Time: 0.114 ms  
postgres=# declare cur SCROLL cursor for select * from bptest_1_20170522    
where ts between '2017-05-22 19:04:19.05347+08' and '2017-05-22 20:04:19.05347+08'   
and content ~ 'abc' and content ~ 'bcd'   
order by ts;  
DECLARE CURSOR  
Time: 2.362 ms  
  
回翻  
postgres=# fetch BACKWARD 10 from cur;  
前翻  
postgres=# fetch FORWARD 10 from cur;

小結

性能指標

1. 數據量：

單個UID，單節點，一天2100萬記錄(12GB, 索引600MB)。（100個節點/分片，單個用戶一天約21億數據量）

2. 寫入性能

2.1. 使用brin索引時 9.47萬/s

2.2. 使用btree索引時 7.9萬/s

3. 範圍查詢，排序輸出

返回462萬記錄，2.5秒。

4. 範圍+全文檢索查詢，排序輸出

返回294萬記錄，8.5秒。

5. 分頁數評估

精確度：+- 5% 左右

響應速度：1毫秒左右。

6. 精確分頁數

與實際數據量、條件有關。1秒以上

7. 分頁查詢

範圍+全文檢索查詢，排序輸出：每獲取1000條記錄約11毫秒。

（與命中率有關），極端情況為處理所有記錄，隻有最後一條記錄滿足條件。

隨機分布與可用性、數據傾斜問題

使用jdbc或libpq時，一個連接可以設置多個實例，將從先到後，自動選擇一個可讀寫的實例。（相當於客戶端自動failover）。

配置示例，假設有4個數據庫實例，可以配置4個數據源如下：

db1: host1:port1,host2:port2,host3:port3,host4:port4  
  
db2: host2,port2,host3:port3,host4:port4,host1:port1  
  
db3: host3:port3,host4:port4,host1:port1,host2,port2  
  
db4: host4:port4,host1:port1,host2:port2,host3:port3

當任意一個實例出現問題時，每個數據源還是能獲取到下一個可用的連接，不會堵塞寫入。

當實例修複後，依舊使用首選實例。

使用這種方法，可以最大化的提高可用性，無需備庫。

另外異常的實例活了之後，就會繼續被首選，無需擔心傾斜問題，因為不保留曆史。時間會抹平傾斜問題。