PostgreSQL 十億級模煳查詢最佳實踐
標簽
PostgreSQL , 模煳查詢 , 正則查詢 , pg_trgm , bytea , gin , 函數索引
背景
前模煳(有前綴的模煳),後模煳(有後綴的模煳),前後模煳(無前後綴的模煳),正則匹配都屬於文本搜索領域常見的需求。
PostgreSQL擁有很強的文本搜索能力,除了支持全文檢索,還支持模煳查詢、正則查詢。內置的pg_trgm插件是一般數據庫沒有的,可能很多人沒有聽說過。同時還內置了表達式索引、GIN索引的功能。
不同的模煳查詢需求,有不同的優化方法。
對於前模煳和後模煳,PostgreSQL則與其他數據庫一樣,可以使用btree來加速。後模煳可以使用反轉函數的函數索引來加速。
對於前後模煳和正則匹配,一種方法是使用pg_trgm插件,利用GIN索引加速模煳和正則查詢(輸入3個或3個以上字符的模煳查詢效果很好)。另一種方法是自定義GIN表達式索引的方法,適合於定製的模煳查詢。
一、前模煳與後模煳的優化
1. 前模煳(有前綴的模煳)優化方法
使用b-tree可以支持前模煳的查詢。僅適合於collate="C"的查詢,當數據庫默認的lc_collate<>C時,索引和查詢都需要明確指定collate "C"。
索引、查詢條件的collate必須一致才能使用索引。
例子
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info collate "C");
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)
Output: id, info
Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.424 ms
Execution time: 0.124 ms
(7 rows)
2. 後模煳(有後綴的模煳)的優化方法
使用反轉函數(reverse)索引,可以支持後模煳的查詢。僅適合於collate="C"的查詢,當數據庫默認的lc_collate<>C時,索引和查詢都需要明確指定collate "C"。
索引、查詢條件的collate必須一致才能使用索引。
例子
test=# create index idx1 on test(reverse(info) collate "C");
CREATE INDEX
test=# select * from test limit 1;
id | info
----+----------------------------------
1 | b3275976cdd437a033d4329775a52514
(1 row)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text))
Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.128 ms
Execution time: 0.122 ms
(7 rows)
test=# select * from test where reverse(info) like '4152%' collate "C";
id | info
--------+----------------------------------
847904 | abe2ecd90393b5275df8e34a39702514
414702 | 97f66d26545329321164042657d02514
191232 | 7820972c6220c2b01d46c11ebb532514
752742 | 93232ac39c6632e2540df44627c42514
217302 | 39e518893a1a7b1e691619bd1fc42514
1 | b3275976cdd437a033d4329775a52514
615718 | 4948f94c484c13dc6c4fae8a3db52514
308815 | fc2918ceff7c7a4dafd2e04031062514
149521 | 546d963842ea5ca593e622c810262514
811093 | 4b6eca2eb6b665af67b2813e91a62514
209000 | 1dfd0d4e326715c1739f031cca992514
937616 | 8827fd81f5b673fb5afecbe3e11b2514
419553 | bd6e01ce360af16137e8b6abc8ab2514
998324 | 7dff51c19dc5e5d9979163e7d14c2514
771518 | 8a54e30003a48539fff0aedc73ac2514
691566 | f90368348e3b6bf983fcbe10db2d2514
652274 | 8bf4a97b5f122a5540a21fa85ead2514
233437 | 739ed715fc203d47e37e79b5bcbe2514
(18 rows)
3. 前、後模煳的合體優化方法
使用pg_trgm索引,可以支持前、後模煳的查詢,注意:
(有前綴的模煳)至少輸入1個字符,(有後綴的模煳)至少輸入2個字符。才有好的索引過濾效果。如果要支持中文,數據庫lc_collate,lc_ctype不能為"C"。
索引、查詢條件的collate必須一致才能使用索引。
test=# \l+ test
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
------+----------+----------+------------+------------+-------------------+--------+------------+-------------
test | postgres | UTF8 | zh_CN.utf8 | zh_CN.utf8 | | 245 MB | pg_default |
(1 row)
test=# create extension pg_trgm;
test=# create table test001(c1 text);
CREATE TABLE
生成隨機中文字符串的函數
test=# create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
CREATE FUNCTION
生成隨機數據
test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000);
INSERT 0 100000
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
CREATE INDEX
test=# select * from test001 limit 5;
c1
------------------------------------------
埳噪辦甾訥昃碇玾隉箖燋邢賀浮媊踮菵暔谉橅
秌橑籛鴎擬倶敤麁黿醠轇坙騉鏦纗蘛婃坹嫻儅
蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓
馳泅薬鐗愅撞竅溮滲蛁灎厀攚摐瞪拡擜詜隝緼
襳鋪煃匶瀌懲荼黹樆惺篋搔羾憯墆鋃硍蔓恧顤
(5 rows)
模煳查詢
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Heap Blocks: exact=3
Buffers: shared hit=7
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=4
Planning time: 0.119 ms
Execution time: 0.063 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%恧顤';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%恧顤'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=2
Buffers: shared hit=6
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1)
Index Cond: (test001.c1 ~~ '%恧顤'::text)
Buffers: shared hit=4
Planning time: 0.136 ms
Execution time: 0.062 ms
(11 rows)
二、前後均模煳的優化
使用pg_trgm插件,支持前後模煳的查詢。注意:
如果需要讓pg_trgm支持中文的模煳查詢,數據庫lc_collate,lc_ctype不能為"C"。
比如輸入3個或3個以上字符,否則效果不佳。
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢賀%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢賀%'::text)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢賀%'::text)
Buffers: shared hit=4
Planning time: 0.170 ms
Execution time: 0.076 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢%'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=1137
Buffers: shared hit=14429
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢%'::text)
Buffers: shared hit=13292
Planning time: 0.133 ms
Execution time: 178.265 ms
(11 rows)
三、正則匹配的優化
目前,pg_trgm對中文(或多字節字符)的正則匹配效果不佳,對ascii字符的正則匹配效果很好。
Rows Removed by Index Recheck很小,說明索引過濾性很好。
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '12[0-9]{3,9}';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=1
Buffers: shared hit=50
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1)
Index Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)
Buffers: shared hit=49
Planning time: 0.452 ms
Execution time: 0.221 ms
(11 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '燋邢賀';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=143.846..232.278 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~ '燋邢賀'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=1137
Buffers: shared hit=14429
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=143.688..143.688 rows=100000 loops=1)
Index Cond: (test001.c1 ~ '燋邢賀'::text)
Buffers: shared hit=13292
Planning time: 0.254 ms
Execution time: 232.312 ms
(11 rows)
pg_trgm模煳查詢的原理
首先,pg_trgm將字符串的前端添加2個空格,末尾添加1個空格。
然後,每連續的3個字符為一個TOKEN,拆開。
最後,對TOKEN建立GIN倒排索引。
查看字符串的TOKEN,可以使用如下方法。
test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)
pg_trgm前後模煳字符個數要求的原因
使用pg_trgm時,如果要獲得最好的效果,最好滿足這些條件。
1. 有前綴的模煳查詢,例如a%,至少需要提供1個字符。( 搜索的是token=' a' )
2. 有後綴的模煳查詢,例如%ab,至少需要提供2個字符。( 搜索的是token='ab ' )
3. 前後模煳查詢,例如%abcd%,至少需要提供3個字符。( 這個使用數組搜索,搜索的是token(s) 包含 {" a"," ab",abc,bcd,"cd "} )
原因是什麼呢?
因為pg_trgm生成的TOKEN是三個字符,隻有在以上三個條件下,才能匹配到對應的TOKEN。
test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)
四、小於3個輸入字符的模煳查詢的優化
當需要前後模煳搜索1個或者2個字符時,pg_trgm無法滿足需求,但是我們可以使用表達式GIN索引。
使用表達式,將字符串拆成1個單字,兩個連續的字符的數組,對數組建立GIN索引即可。
例子
test=# create or replace function split001(text) returns text[] as $$
declare
res text[];
begin
select regexp_split_to_array($1,'') into res;
for i in 1..length($1)-1 loop
res := array_append(res, substring($1,i,2));
end loop;
return res;
end;
$$ language plpgsql strict immutable;
CREATE FUNCTION
test=# create index idx_test001_2 on test001 using gin (split001(c1));
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你好'];
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.041..0.041 rows=0 loops=1)
Output: c1
Recheck Cond: (split001(test001.c1) @> '{你好}'::text[])
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.039..0.039 rows=0 loops=1)
Index Cond: (split001(test001.c1) @> '{你好}'::text[])
Buffers: shared hit=4
Planning time: 0.104 ms
Execution time: 0.068 ms
(9 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你'];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.063..0.183 rows=86 loops=1)
Output: c1
Recheck Cond: (split001(test001.c1) @> '{你}'::text[])
Heap Blocks: exact=80
Buffers: shared hit=84
-> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.048..0.048 rows=86 loops=1)
Index Cond: (split001(test001.c1) @> '{你}'::text[])
Buffers: shared hit=4
Planning time: 0.101 ms
Execution time: 0.217 ms
(10 rows)
test=# select * from test001 where split001(c1) @> array['你'];
c1
------------------------------------------
殐踨洪冨垓丩賢閚偉垢胸鍘崩你萭隡劭芛雫袰
靅慨熱臉罆淓寘鰻總襎戍謸棖陪丼倫柆套你仮
......
五、相似查詢優化
模煳查詢和正則匹配都是找出完全符合條件的記錄,還有一種需求是相似查詢。
例如postgresql字符串,輸入 p0stgresgl 也能根據相似度匹配到。
例子
test=# create index idx_test001_3 on test001 using gist (c1 gist_trgm_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) SELECT t, c1 <-> '癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡' AS dist
FROM test001 t
ORDER BY dist LIMIT 5;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..0.52 rows=5 width=89) (actual time=37.462..37.639 rows=5 loops=1)
Output: t.*, ((c1 <-> '癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡'::text))
Buffers: shared hit=1631
-> Index Scan using idx_test001_3 on public.test001 t (cost=0.28..4763.28 rows=100000 width=89) (actual time=37.461..37.636 rows=5 loops=1)
Output: t.*, (c1 <-> '癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡'::text)
Order By: (t.c1 <-> '癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡'::text)
Buffers: shared hit=1631
Planning time: 0.089 ms
Execution time: 37.668 ms
(9 rows)
test=# SELECT t, c1 <-> '癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡' AS dist
FROM test001 t
ORDER BY dist LIMIT 5;
t | dist
--------------------------------------------+----------
(癷磛鶿蠌鰓蠲你鶡埀婎鯿苿奶垨惸溴蔻筴熝憡) | 0.307692
(坆桻悁斾耾瑚豌腏炁悿隖軻盃掗稐睟礓蜮鉛湆) | 0.976744
(癷鉜餯祂鼃恫蝅瓟顡廕梍蛸歡僷贊敔欓侑韌鐹) | 0.976744
(癷謔鳬戚蹪熼胘檙佌欔韜挹樷覄惶蹝頊鑜鞖媗) | 0.976744
(癷饎瞲餿堒歃峽盾豼擔禞嵪豦咢脈餛竨濟隘緘) | 0.976744
(5 rows)
六、小結
1. 如果隻有前模煳查詢需求,使用collate "C"的b-tree索引。
2. 如果隻有後模煳的查詢需求,使用collate "C"的reverse()表達式的b-tree索引。
3. 如果有前後模煳查詢需求,並且包含中文,請使用lc_collate,lc_ctype <> "C"的數據庫,同時使用pg_trgm插件的gin索引。
4. 如果有前後模煳查詢需求,並且不包含中文,請使用pg_trgm插件的gin索引。
5. 如果有含有ascii字符的正則表達式查詢需求,請使用pg_trgm插件的gin索引。
6. 如果有輸入條件少於3個字符的模煳查詢需求,可以使用GIN表達式索引,通過數組包含的方式進行搜索,性能一樣非常好。
七、性能
1億條記錄,每條記錄15個隨機中文,一共8GB。測試前後模煳查詢性能。
1. 生成測試數據
vi test.sql
insert into test001 select gen_hanzi(15) from generate_series(1,2500000);
pgbench -n -r -P 1 -f ./test.sql -c 40 -j 40 -t 1 test
test=# select count(*) from test001;
count
-----------
100000000
(1 row)
test=# select * from test001 limit 10;
c1
--------------------------------
釾笉皜鰈確艄騚馺腃彊釲忰采汦擇
槮搮圮墔婂蹾飄孡鶒鎮貲聵線麯櫕
孨鄈韞萅赫炧暤蟠檼駧餪崉媧譌筯
烸喖醝稦怩鷟棾奜妛曫仾飛饡繪韋
撐豁襉峊炠眏罱襄彊鰮莆壏妒辷闤
蜁愊鶱磹貳帵眲嚉榑蒼潵簷簄椰魨
瑄翁蠃巨躋壾蛸湗鑂顂櫟砣八癱栵
餷巍笿鞽裝棊嘢恓煓熴錩鋈蹃煿屓
訆韄踔牤嘇糺絢軿鵑燿螛梋鰢謇郼
撲蓨傷釱糕觩嬖蓷鰼繩圓醷熌靉掑
(10 rows)
2. 創建索引
test=# set maintenance_work_mem ='32GB';
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
3. 模煳查詢性能測試
3.1 前模煳
響應時間:9毫秒
返回4701行
select * from test001 where c1 like '你%';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=1.546..8.868 rows=4701 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Rows Removed by Index Recheck: 85
Heap Blocks: exact=4776
Buffers: shared hit=4784
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.879..0.879 rows=4786 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=8
Planning time: 0.099 ms
Execution time: 9.166 ms
(11 rows)
3.2 後模煳
響應時間:0.25毫秒
返回2行
select * from test001 where c1 like '%靉掑';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%靉掑';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.049..0.223 rows=2 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%靉掑'::text)
Rows Removed by Index Recheck: 87
Heap Blocks: exact=89
Buffers: shared hit=94
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.031..0.031 rows=89 loops=1)
Index Cond: (test001.c1 ~~ '%靉掑'::text)
Buffers: shared hit=5
Planning time: 0.113 ms
Execution time: 0.249 ms
(11 rows)
3.3 前後模煳
響應時間:0.2毫秒
返回1行
select * from test001 where c1 like '%螛梋鰢%';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%螛梋鰢%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.044..0.175 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%螛梋鰢%'::text)
Rows Removed by Index Recheck: 81
Heap Blocks: exact=82
Buffers: shared hit=87
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.027..0.027 rows=82 loops=1)
Index Cond: (test001.c1 ~~ '%螛梋鰢%'::text)
Buffers: shared hit=5
Planning time: 0.112 ms
Execution time: 0.201 ms
(11 rows)
最後更新:2017-04-26 17:00:37