PostgreSQL 十亿级模糊查询最佳实践
标签
PostgreSQL , 模糊查询 , 正则查询 , pg_trgm , bytea , gin , 函数索引
背景
前模糊(有前缀的模糊),后模糊(有后缀的模糊),前后模糊(无前后缀的模糊),正则匹配都属于文本搜索领域常见的需求。
PostgreSQL拥有很强的文本搜索能力,除了支持全文检索,还支持模糊查询、正则查询。内置的pg_trgm插件是一般数据库没有的,可能很多人没有听说过。同时还内置了表达式索引、GIN索引的功能。
不同的模糊查询需求,有不同的优化方法。
对于前模糊和后模糊,PostgreSQL则与其他数据库一样,可以使用btree来加速。后模糊可以使用反转函数的函数索引来加速。
对于前后模糊和正则匹配,一种方法是使用pg_trgm插件,利用GIN索引加速模糊和正则查询(输入3个或3个以上字符的模糊查询效果很好)。另一种方法是自定义GIN表达式索引的方法,适合于定制的模糊查询。
一、前模糊与后模糊的优化
1. 前模糊(有前缀的模糊)优化方法
使用b-tree可以支持前模糊的查询。仅适合于collate="C"的查询,当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate "C"。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info collate "C");
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)
Output: id, info
Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.424 ms
Execution time: 0.124 ms
(7 rows)
2. 后模糊(有后缀的模糊)的优化方法
使用反转函数(reverse)索引,可以支持后模糊的查询。仅适合于collate="C"的查询,当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate "C"。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create index idx1 on test(reverse(info) collate "C");
CREATE INDEX
test=# select * from test limit 1;
id | info
----+----------------------------------
1 | b3275976cdd437a033d4329775a52514
(1 row)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text))
Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.128 ms
Execution time: 0.122 ms
(7 rows)
test=# select * from test where reverse(info) like '4152%' collate "C";
id | info
--------+----------------------------------
847904 | abe2ecd90393b5275df8e34a39702514
414702 | 97f66d26545329321164042657d02514
191232 | 7820972c6220c2b01d46c11ebb532514
752742 | 93232ac39c6632e2540df44627c42514
217302 | 39e518893a1a7b1e691619bd1fc42514
1 | b3275976cdd437a033d4329775a52514
615718 | 4948f94c484c13dc6c4fae8a3db52514
308815 | fc2918ceff7c7a4dafd2e04031062514
149521 | 546d963842ea5ca593e622c810262514
811093 | 4b6eca2eb6b665af67b2813e91a62514
209000 | 1dfd0d4e326715c1739f031cca992514
937616 | 8827fd81f5b673fb5afecbe3e11b2514
419553 | bd6e01ce360af16137e8b6abc8ab2514
998324 | 7dff51c19dc5e5d9979163e7d14c2514
771518 | 8a54e30003a48539fff0aedc73ac2514
691566 | f90368348e3b6bf983fcbe10db2d2514
652274 | 8bf4a97b5f122a5540a21fa85ead2514
233437 | 739ed715fc203d47e37e79b5bcbe2514
(18 rows)
3. 前、后模糊的合体优化方法
使用pg_trgm索引,可以支持前、后模糊的查询,注意:
(有前缀的模糊)至少输入1个字符,(有后缀的模糊)至少输入2个字符。才有好的索引过滤效果。如果要支持中文,数据库lc_collate,lc_ctype不能为"C"。
索引、查询条件的collate必须一致才能使用索引。
test=# \l+ test
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
------+----------+----------+------------+------------+-------------------+--------+------------+-------------
test | postgres | UTF8 | zh_CN.utf8 | zh_CN.utf8 | | 245 MB | pg_default |
(1 row)
test=# create extension pg_trgm;
test=# create table test001(c1 text);
CREATE TABLE
生成随机中文字符串的函数
test=# create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
CREATE FUNCTION
生成随机数据
test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000);
INSERT 0 100000
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
CREATE INDEX
test=# select * from test001 limit 5;
c1
------------------------------------------
埳噪办甾讷昃碇玾陧箖燋邢贺浮媊踮菵暔谉橅
秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅
蔎緾铠爪鹏二悲膼朠麻鸂鋬桢窷违繇糭嘓索籓
驰泅薬鐗愅撞窍浉渗蛁灎厀攚摐瞪拡擜詜隝緼
襳铺煃匶瀌惩荼黹樆惺箧搔羾憯墆锒硍蔓恧顤
(5 rows)
模糊查询
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Heap Blocks: exact=3
Buffers: shared hit=7
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=4
Planning time: 0.119 ms
Execution time: 0.063 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%恧顤';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%恧顤'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=2
Buffers: shared hit=6
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1)
Index Cond: (test001.c1 ~~ '%恧顤'::text)
Buffers: shared hit=4
Planning time: 0.136 ms
Execution time: 0.062 ms
(11 rows)
二、前后均模糊的优化
使用pg_trgm插件,支持前后模糊的查询。注意:
如果需要让pg_trgm支持中文的模糊查询,数据库lc_collate,lc_ctype不能为"C"。
比如输入3个或3个以上字符,否则效果不佳。
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢贺%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢贺%'::text)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢贺%'::text)
Buffers: shared hit=4
Planning time: 0.170 ms
Execution time: 0.076 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢%'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=1137
Buffers: shared hit=14429
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢%'::text)
Buffers: shared hit=13292
Planning time: 0.133 ms
Execution time: 178.265 ms
(11 rows)
三、正则匹配的优化
目前,pg_trgm对中文(或多字节字符)的正则匹配效果不佳,对ascii字符的正则匹配效果很好。
Rows Removed by Index Recheck很小,说明索引过滤性很好。
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '12[0-9]{3,9}';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=1
Buffers: shared hit=50
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1)
Index Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)
Buffers: shared hit=49
Planning time: 0.452 ms
Execution time: 0.221 ms
(11 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '燋邢贺';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=143.846..232.278 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~ '燋邢贺'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=1137
Buffers: shared hit=14429
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=143.688..143.688 rows=100000 loops=1)
Index Cond: (test001.c1 ~ '燋邢贺'::text)
Buffers: shared hit=13292
Planning time: 0.254 ms
Execution time: 232.312 ms
(11 rows)
pg_trgm模糊查询的原理
首先,pg_trgm将字符串的前端添加2个空格,末尾添加1个空格。
然后,每连续的3个字符为一个TOKEN,拆开。
最后,对TOKEN建立GIN倒排索引。
查看字符串的TOKEN,可以使用如下方法。
test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)
pg_trgm前后模糊字符个数要求的原因
使用pg_trgm时,如果要获得最好的效果,最好满足这些条件。
1. 有前缀的模糊查询,例如a%,至少需要提供1个字符。( 搜索的是token=' a' )
2. 有后缀的模糊查询,例如%ab,至少需要提供2个字符。( 搜索的是token='ab ' )
3. 前后模糊查询,例如%abcd%,至少需要提供3个字符。( 这个使用数组搜索,搜索的是token(s) 包含 {" a"," ab",abc,bcd,"cd "} )
原因是什么呢?
因为pg_trgm生成的TOKEN是三个字符,只有在以上三个条件下,才能匹配到对应的TOKEN。
test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)
四、小于3个输入字符的模糊查询的优化
当需要前后模糊搜索1个或者2个字符时,pg_trgm无法满足需求,但是我们可以使用表达式GIN索引。
使用表达式,将字符串拆成1个单字,两个连续的字符的数组,对数组建立GIN索引即可。
例子
test=# create or replace function split001(text) returns text[] as $$
declare
res text[];
begin
select regexp_split_to_array($1,'') into res;
for i in 1..length($1)-1 loop
res := array_append(res, substring($1,i,2));
end loop;
return res;
end;
$$ language plpgsql strict immutable;
CREATE FUNCTION
test=# create index idx_test001_2 on test001 using gin (split001(c1));
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你好'];
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.041..0.041 rows=0 loops=1)
Output: c1
Recheck Cond: (split001(test001.c1) @> '{你好}'::text[])
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.039..0.039 rows=0 loops=1)
Index Cond: (split001(test001.c1) @> '{你好}'::text[])
Buffers: shared hit=4
Planning time: 0.104 ms
Execution time: 0.068 ms
(9 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你'];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.063..0.183 rows=86 loops=1)
Output: c1
Recheck Cond: (split001(test001.c1) @> '{你}'::text[])
Heap Blocks: exact=80
Buffers: shared hit=84
-> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.048..0.048 rows=86 loops=1)
Index Cond: (split001(test001.c1) @> '{你}'::text[])
Buffers: shared hit=4
Planning time: 0.101 ms
Execution time: 0.217 ms
(10 rows)
test=# select * from test001 where split001(c1) @> array['你'];
c1
------------------------------------------
殐踨洪冨垓丩贤閚伟垢胸铡崩你萭隡劭芛雫袰
靅慨热脸罆淓寘鳗总襎戍謸枨陪丼伦柆套你仮
......
五、相似查询优化
模糊查询和正则匹配都是找出完全符合条件的记录,还有一种需求是相似查询。
例如postgresql字符串,输入 p0stgresgl 也能根据相似度匹配到。
例子
test=# create index idx_test001_3 on test001 using gist (c1 gist_trgm_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) SELECT t, c1 <-> '癷磛鹚蠌鳃蠲123鹖埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist
FROM test001 t
ORDER BY dist LIMIT 5;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..0.52 rows=5 width=89) (actual time=37.462..37.639 rows=5 loops=1)
Output: t.*, ((c1 <-> '癷磛鹚蠌鳃蠲123鹖埀婎鳊苿奶垨惸溴蔻筴熝憡'::text))
Buffers: shared hit=1631
-> Index Scan using idx_test001_3 on public.test001 t (cost=0.28..4763.28 rows=100000 width=89) (actual time=37.461..37.636 rows=5 loops=1)
Output: t.*, (c1 <-> '癷磛鹚蠌鳃蠲123鹖埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)
Order By: (t.c1 <-> '癷磛鹚蠌鳃蠲123鹖埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)
Buffers: shared hit=1631
Planning time: 0.089 ms
Execution time: 37.668 ms
(9 rows)
test=# SELECT t, c1 <-> '癷磛鹚蠌鳃蠲123鹖埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist
FROM test001 t
ORDER BY dist LIMIT 5;
t | dist
--------------------------------------------+----------
(癷磛鹚蠌鳃蠲你鹖埀婎鳊苿奶垨惸溴蔻筴熝憡) | 0.307692
(坆桻悁斾耾瑚豌腏炁悿隖轲盃挜稐睟礓蜮铅湆) | 0.976744
(癷鉜餯祂鼃恫蝅瓟顡廕梍蛸欢僷贊敔欓侑韧鐹) | 0.976744
(癷嚯鳬戚蹪熼胘檙佌欔韬挹樷覄惶蹝顼鑜鞖媗) | 0.976744
(癷饎瞲馊堒歃峡盾豼担禞嵪豦咢脉馄竨济隘缄) | 0.976744
(5 rows)
六、小结
1. 如果只有前模糊查询需求,使用collate "C"的b-tree索引。
2. 如果只有后模糊的查询需求,使用collate "C"的reverse()表达式的b-tree索引。
3. 如果有前后模糊查询需求,并且包含中文,请使用lc_collate,lc_ctype <> "C"的数据库,同时使用pg_trgm插件的gin索引。
4. 如果有前后模糊查询需求,并且不包含中文,请使用pg_trgm插件的gin索引。
5. 如果有含有ascii字符的正则表达式查询需求,请使用pg_trgm插件的gin索引。
6. 如果有输入条件少于3个字符的模糊查询需求,可以使用GIN表达式索引,通过数组包含的方式进行搜索,性能一样非常好。
七、性能
1亿条记录,每条记录15个随机中文,一共8GB。测试前后模糊查询性能。
1. 生成测试数据
vi test.sql
insert into test001 select gen_hanzi(15) from generate_series(1,2500000);
pgbench -n -r -P 1 -f ./test.sql -c 40 -j 40 -t 1 test
test=# select count(*) from test001;
count
-----------
100000000
(1 row)
test=# select * from test001 limit 10;
c1
--------------------------------
釾笉皜鲽确艄騚馺腃彊釲忰采汦择
槮搮圮墔婂蹾飘孡鶒镇赀聩线麯櫕
孨鄈韫萅赫炧暤蟠檼駧餪崉娲譌筯
烸喖醝稦怩鷟棾奜妛曫仾飞饡绘韦
撑豁襉峊炠眏罱襄彊鳁莆壏妒辷阛
蜁愊鶱磹贰帵眲嚉榑苍潵檐簄椰鲀
瑄翁蠃巨跻壾蛸湗鑂顂栎砣八瘫栵
馇巍笿鞒装棊嘢恓煓熴锠鋈蹃煿屓
訆韄踔牤嘇糺绚軿鹃燿螛梋鰢謇郼
扑蓨伤釱糕觩嬖蓷鳛绳圆醷熌叆掑
(10 rows)
2. 创建索引
test=# set maintenance_work_mem ='32GB';
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
3. 模糊查询性能测试
3.1 前模糊
响应时间:9毫秒
返回4701行
select * from test001 where c1 like '你%';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=1.546..8.868 rows=4701 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Rows Removed by Index Recheck: 85
Heap Blocks: exact=4776
Buffers: shared hit=4784
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.879..0.879 rows=4786 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=8
Planning time: 0.099 ms
Execution time: 9.166 ms
(11 rows)
3.2 后模糊
响应时间:0.25毫秒
返回2行
select * from test001 where c1 like '%叆掑';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%叆掑';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.049..0.223 rows=2 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%叆掑'::text)
Rows Removed by Index Recheck: 87
Heap Blocks: exact=89
Buffers: shared hit=94
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.031..0.031 rows=89 loops=1)
Index Cond: (test001.c1 ~~ '%叆掑'::text)
Buffers: shared hit=5
Planning time: 0.113 ms
Execution time: 0.249 ms
(11 rows)
3.3 前后模糊
响应时间:0.2毫秒
返回1行
select * from test001 where c1 like '%螛梋鰢%';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%螛梋鰢%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.044..0.175 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%螛梋鰢%'::text)
Rows Removed by Index Recheck: 81
Heap Blocks: exact=82
Buffers: shared hit=87
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.027..0.027 rows=82 loops=1)
Index Cond: (test001.c1 ~~ '%螛梋鰢%'::text)
Buffers: shared hit=5
Planning time: 0.112 ms
Execution time: 0.201 ms
(11 rows)
最后更新:2017-04-26 17:00:37