閱讀755 返回首頁    go 阿裏雲 go 技術社區[雲棲]


海量數據,海明距離高效檢索(smlar) - 阿裏雲RDS PosgreSQL最佳實踐

標簽

PostgreSQL , 海明距離 , smlar , GiST索引


背景

https://www.cnblogs.com/lushilin/p/6549665.html

SimHash的應用

通過上麵的步驟,我們可以利用SimHash算法為每一個網頁生成一個向量指紋,那麼問題來了,如何判斷2篇文本的相似性?
這裏麵主要應用到是海明距離。

(1)什麼是海明距離
兩個碼字的對應比特取值不同的比特數稱為這兩個碼字的海明距離。在一個有效編碼集中,任意兩個碼字的海明距離的最小值稱為該編碼集的海明距離。舉例如下:10101和00110從第一位開始依次有第一位、第四、第五位不同,則海明距離為3。

(2)海明距離的幾何意義
n位的碼字可以用n維空間的超立方體的一個頂點來表示。兩個碼字之間的海明距離就是超立方體兩個頂點之間的一條邊,而且是這兩個頂點之間的最短距離。

(3)海明距離的應用場景
用於編碼的檢錯和糾錯

經過SimHash算法提取來的指紋(Simhash對長文本500字+比較適用,短文本可能偏差較大,具體需要根據實際場景測試),最後使用海明距離,求相似,在google的論文給出的數據中,64位的簽名,在海明距離為3的情況下,可認為兩篇文檔是相似的或者是重複的,當然這個值隻是參考值,針對自己的應用可能又不同的測試取值

到這裏相似度問題基本解決,但是按這個思路,在海量數據幾百億的數量下,效率問題還是沒有解決的,因為數據是不斷添加進來的,不可能每來一條數據,都要和全庫的數據做一次比較,按照這種思路,處理速度會越來越慢,線性增長。

針對海量數據的去重效率,我們可以將64位指紋,切分為4份16位的數據塊,根據抽屜原理在海明距離為3的情況,如果兩個文檔相似,那麼它必有一個塊的數據是相等的。

那麼數據庫是否支持這種高效率的檢索呢?

反正PostgreSQL是支持的,通過黑科技smlar插件。

一、需求

二、架構設計

在PostgreSQL中,從海量數據中,搜索海明距離小於N的數據,有多種設計手段。每種方法的能耗比都不一樣,讀者可以按需選擇。

1 暴力計算

1、單機多核並行計算,暴力掃描。采用阿裏雲RDS PostgreSQL 10提供的多核並行能力,暴力掃描。

2、多機多核並行計算,暴力掃描。采用阿裏雲HybridDB for PostgreSQL提供的多級並行計算能力,暴力掃描。

3、利用GPU、FPGA加速暴力運算。

PostgreSQL提供了擴展接口,可以利用GPU,FPGA的能力對數據進行計算。

4、利用CPU向量計算指令,暴力計算。

PostgreSQL提供了擴展接口,可以利用CPU向量計算指令的能力加速計算。

2 索引

索引是高效的做法,例如PostgreSQL smlar插件,在阿裏的導購平台就有使用,用於實時導購文的海量相似度查詢。

如果要讓smlar加速海明距離的搜索,需要采用更科學的方法,比如切片。

直接使用位置,會有問題,因為smlar的第一道工序是塊級收斂,而海明碼是bit64的編碼,在一個數據塊中,有若幹條記錄,任何位置都可能同時出現0和1,任何數據塊都包含0和1,因此無法完成第一道過濾。

我們可以采用切片,減少這種可能性。例如每2個BIT一片,或者每4個BIT一片,或者更多。

通常海明距離大於3的,就沒有什麼相關性了。

三、DEMO與性能

1 暴力計算

1、全掃,並行掃描

創建測試表

create table hm (  
  id int,        -- id  
  hmval bit(64)  -- 海明HASH  
);  

寫入1000萬測試數據

postgres=# insert into hm select id, val::int8::bit(64) from (select id, sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric as val from generate_series(1,10000000) t(id)) t;  
INSERT 0 10000000  
  
postgres=# select * from hm limit 10;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
  1 | 0000101001110110110101010111101011100110101010000111100011110111  
  2 | 0110011100110101101000001010101111010001011101100111111011001110  
  3 | 1010110111001011011110110000111111101101101111010111111100101110  
  4 | 0110011110110000001011000010010000101011100101010100111000101001  
  5 | 0101110100101111010110010110000000101110000010001011010110110000  
  6 | 0011010000100000101011011100000101111110010110111101100001100001  
  7 | 1011110011101101101000011101011101010111011001011010110111101000  
  8 | 1110010011000101001101110010001111110100001101010101111101110010  
  9 | 0110111111110011101001001000101101011011111100010010111010001111  
 10 | 0011100011000010101011010001111000000110100011100100111011011001  
(10 rows)  

設置暴力並行度

postgres=# set force_parallel_mode = on;  
postgres=# set min_parallel_table_scan_size = 0;  
postgres=# set parallel_setup_cost = 0;  
postgres=# set parallel_tuple_cost = 0;  
postgres=# alter table hm set (parallel_workers = 128);  
postgres=# set max_parallel_workers_per_gather = 64;  

並行查詢海明距離小於4的記錄,耗時463毫秒。

postgres=# select * from hm where length(replace(bitxor(bit'0011110001011010110010001011010101001000111110000111110010010110', hmval)::text,'0','')) < 4;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
 16 | 0011110001011010110010001011010101001000111110000111110010010110  
(1 row)  
  
Time: 463.314 ms  

非並行查詢海明距離小於4的記錄,耗時16秒。

postgres=# select * from hm where length(replace(bitxor(bit'0011110001011010110010001011010101001000111110000111110010010110', hmval)::text,'0','')) < 4;  
 id |                              hmval                                 
----+------------------------------------------------------------------  
 16 | 0011110001011010110010001011010101001000111110000111110010010110  
(1 row)  
  
Time: 16791.215 ms (00:16.791)  

求兩個BIT的不同位數,還有更高效率的方法。理論上可以達到100毫秒以內。

https://www.postgresql.org/message-id/flat/ab1ea6540903121110l2a3021d4h6632b206e2419898%40mail.gmail.com#ab1ea6540903121110l2a3021d4h6632b206e2419898@mail.gmail.com  

2 索引

阿裏雲RDS PostgreSQL提供了一個smlar插件,用於高效率的求數組的相似度。

我們需要將海明HASH,轉換為數組,根據前麵的設計,我們采用8個BIT一片的切法,支持索引查詢海明距離為8以內的值。

切之前,驗證一下切片後的過濾性:

postgres=# select relpages from pg_class where relname='hm';  
 relpages   
----------  
    63695  
(1 row)  
  
1、單個片為1時,不用說,每個塊都包含。  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,1)='0' group by 1)t;  
 count   
-------  
 63695  
(1 row)  
  
2、單個片為8時,有接近一半的塊包含。  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,8)='00000000' group by 1)t;  
 count   
-------  
 29100  
(1 row)  
  
3、單個片為16時,隻有100多個塊包含了。  
  
postgres=# select count(*) from (select substring(ctid::text,'(\d+),') from hm where substring(hmval,1,16)='0000000000000000' group by 1)t;  
 count   
-------  
   160  
(1 row)  

8片切法的性能驗證

創建插件

create extension smlar;  

創建測試表

create table hm1 (id int, hmval bit(64), hmarr text[]);  

生成1000萬測試數據,生成測試數據時,按切分手段進行切分,記錄為TEXT數組。

insert into hm1   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,8)||',2_'||substring(val,9,8)||',3_'||substring(val,17,8)||',4_'||substring(val,25,8)||',5_'||substring(val,33,8)||',6_'||substring(val,41,8)||',7_'||substring(val,49,8)||',8_'||substring(val,57,8), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm1 limit 10;  
 id |                              hmval                               |                                           hmarr                                             
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------------  
  1 | 0000001110101101100110011000100111100100001100100101101010010011 | {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}  
  2 | 0001001000010101001100100010101010111001001000000110101101100100 | {1_00010010,2_00010101,3_00110010,4_00101010,5_10111001,6_00100000,7_01101011,8_01100100}  
  3 | 0011111111010100011001001010110110100010101110101001101111010000 | {1_00111111,2_11010100,3_01100100,4_10101101,5_10100010,6_10111010,7_10011011,8_11010000}  
  4 | 1100110010011001001110101110111111111111010000100000010011000010 | {1_11001100,2_10011001,3_00111010,4_11101111,5_11111111,6_01000010,7_00000100,8_11000010}  
  5 | 0011000011010001011111010101010111100110000110000011101100000101 | {1_00110000,2_11010001,3_01111101,4_01010101,5_11100110,6_00011000,7_00111011,8_00000101}  
  6 | 0111101101111110101000010110101101110011011110100100010111011001 | {1_01111011,2_01111110,3_10100001,4_01101011,5_01110011,6_01111010,7_01000101,8_11011001}  
  7 | 0010001011111111100010101011110001001101001011100100011000010000 | {1_00100010,2_11111111,3_10001010,4_10111100,5_01001101,6_00101110,7_01000110,8_00010000}  
  8 | 1110001111100011011110110111101111010101000111000100111111111101 | {1_11100011,2_11100011,3_01111011,4_01111011,5_11010101,6_00011100,7_01001111,8_11111101}  
  9 | 0111110010111000010111001000000101111000000110110110000011101110 | {1_01111100,2_10111000,3_01011100,4_10000001,5_01111000,6_00011011,7_01100000,8_11101110}  
 10 | 0111001101100010001101101111000000100100000000010001010011100101 | {1_01110011,2_01100010,3_00110110,4_11110000,5_00100100,6_00000001,7_00010100,8_11100101}  
(10 rows)  

創建smlar索引

postgres=# create index idx_hm1 on hm1 using gin(hmarr _text_sml_ops );  

搜索海明距離小於等於1的VALUE。用到了smlar索引,耗時63毫秒。

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 7;  
  
select    
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=117.83..420.48 rows=100 width=169) (actual time=62.928..62.929 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[]))  
   Buffers: shared hit=166  
   ->  Bitmap Heap Scan on public.hm1  (cost=117.83..10205.17 rows=3333 width=169) (actual time=62.927..62.927 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
         Recheck Cond: (hm1.hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
         Filter: (length(replace((bitxor(B'0000001110101101100110011000100111100100001100100101101010010011'::"bit", hm1.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=166  
         ->  Bitmap Index Scan on idx_hm1  (cost=0.00..117.00 rows=10000 width=0) (actual time=62.898..62.898 rows=1 loops=1)  
               Index Cond: (hm1.hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'::text[])  
               Buffers: shared hit=165  
 Planning time: 0.147 ms  
 Execution time: 62.975 ms  
(14 rows)  
  
postgres=# select                  
    *,    
    smlar( hmarr, '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}')    
  from    
    hm1    
  where    
    hmarr % '{1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011}'      
    and length(replace(bitxor(bit'0000001110101101100110011000100111100100001100100101101010010011', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                           hmarr                                           | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-------  
  1 | 0000001110101101100110011000100111100100001100100101101010010011 | {1_00000011,2_10101101,3_10011001,4_10001001,5_11100100,6_00110010,7_01011010,8_10010011} |     8  
(1 row)  
Time: 61.227 ms  

如果我們隻需要查詢4以內的海明距離,實際上可以使用16的分組,或者我們可以使用混合切法。

6片混合切法的性能驗證

切法為8,16,8,8,16,8。支持海明距離6以內的查詢。

create table hm2 (id int, hmval bit(64), hmarr text[]);  
  
insert into hm2   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,8)||',2_'||substring(val,9,16)||',3_'||substring(val,25,8)||',4_'||substring(val,33,8)||',5_'||substring(val,41,16)||',6_'||substring(val,57,8), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm2 limit 10;  
 id |                              hmval                               |                                        hmarr                                          
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------  
  1 | 1100111011000001100100100111111110100011100111111101101001101010 | {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}  
  2 | 0111111000101011000111010011011000000010010001111001000111011101 | {1_01111110,2_0010101100011101,3_00110110,4_00000010,5_0100011110010001,6_11011101}  
  3 | 0111111000101111000101011100100000001111011101101100110100000101 | {1_01111110,2_0010111100010101,3_11001000,4_00001111,5_0111011011001101,6_00000101}  
  4 | 0111010101010010100000110001100011110010111000001011000010010010 | {1_01110101,2_0101001010000011,3_00011000,4_11110010,5_1110000010110000,6_10010010}  
  5 | 1111101100110100101111000011001011111110111000100110101001100001 | {1_11111011,2_0011010010111100,3_00110010,4_11111110,5_1110001001101010,6_01100001}  
  6 | 0011110000100010101001000001100010000010111011100010011001000110 | {1_00111100,2_0010001010100100,3_00011000,4_10000010,5_1110111000100110,6_01000110}  
  7 | 0000111111001110100110011110000110001101110111111111111010111001 | {1_00001111,2_1100111010011001,3_11100001,4_10001101,5_1101111111111110,6_10111001}  
  8 | 0110100010010100111100110110000011101110101001001111010101011111 | {1_01101000,2_1001010011110011,3_01100000,4_11101110,5_1010010011110101,6_01011111}  
  9 | 0111001111001100101011001001100100000000111100000110110001000011 | {1_01110011,2_1100110010101100,3_10011001,4_00000000,5_1111000001101100,6_01000011}  
 10 | 1101111101011000111100101010101000100001101100101110100001111000 | {1_11011111,2_0101100011110010,3_10101010,4_00100001,5_1011001011101000,6_01111000}  
(10 rows)  
  
create index idx_hm2 on hm2 using gin(hmarr _text_sml_ops );  

查詢海明距離小於等於1的值,提高到2毫秒了。

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 5;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}')    
  from    
    hm2   
  where    
    hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'      
    and length(replace(bitxor(bit'1100111011000001100100100111111110100011100111111101101001101010', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                        hmarr                                        | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------+-------  
  1 | 1100111011000001100100100111111110100011100111111101101001101010 | {1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010} |     6  
(1 row)  
Time: 1.954 ms  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}')    
  from    
    hm2   
  where    
    hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'      
    and length(replace(bitxor(bit'1100111011000001100100100111111110100011100111111101101001101010', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=103.83..406.06 rows=100 width=153) (actual time=2.414..2.416 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[]))  
   Buffers: shared hit=102  
   ->  Bitmap Heap Scan on public.hm2  (cost=103.83..10177.17 rows=3333 width=153) (actual time=2.414..2.415 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
         Recheck Cond: (hm2.hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
         Filter: (length(replace((bitxor(B'1100111011000001100100100111111110100011100111111101101001101010'::"bit", hm2.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=102  
         ->  Bitmap Index Scan on idx_hm2  (cost=0.00..103.00 rows=10000 width=0) (actual time=2.374..2.374 rows=1 loops=1)  
               Index Cond: (hm2.hmarr % '{1_11001110,2_1100000110010010,3_01111111,4_10100011,5_1001111111011010,6_01101010}'::text[])  
               Buffers: shared hit=101  
 Planning time: 0.149 ms  
 Execution time: 2.463 ms  
(14 rows)  

4片切法的性能驗證

create table hm3 (id int, hmval bit(64), hmarr text[]);  
  
insert into hm3   
select   
  id,   
  val::bit(64),   
  regexp_split_to_array('1_'||substring(val,1,16)||',2_'||substring(val,17,16)||',3_'||substring(val,33,16)||',4_'||substring(val,41,16), ',')    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,10000000) t(id)) t;  
  
postgres=# select * from hm3 limit 10;  
 id |                              hmval                               |                                     hmarr                                       
----+------------------------------------------------------------------+-------------------------------------------------------------------------------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}  
  2 | 1101011000010000000000000000111011011111011101110100000010101011 | {1_1101011000010000,2_0000000000001110,3_1101111101110111,4_0111011101000000}  
  3 | 0101000010110110110010001010100010101001001010111111011000110011 | {1_0101000010110110,2_1100100010101000,3_1010100100101011,4_0010101111110110}  
  4 | 0111000111100011111000100111000011101111110000011110101101000100 | {1_0111000111100011,2_1110001001110000,3_1110111111000001,4_1100000111101011}  
  5 | 0010111010101011111010011110110010011110111111110011101110010011 | {1_0010111010101011,2_1110100111101100,3_1001111011111111,4_1111111100111011}  
  6 | 0110111110011100100110010111010000000011100011000011110001010110 | {1_0110111110011100,2_1001100101110100,3_0000001110001100,4_1000110000111100}  
  7 | 0100110100111001110011011110100111101110101001000101010110110110 | {1_0100110100111001,2_1100110111101001,3_1110111010100100,4_1010010001010101}  
  8 | 0110010111001100111000011011011100001100111111101111011010100010 | {1_0110010111001100,2_1110000110110111,3_0000110011111110,4_1111111011110110}  
  9 | 0110111010110000001010101111000101110000010011100011100101000100 | {1_0110111010110000,2_0010101011110001,3_0111000001001110,4_0100111000111001}  
 10 | 0101101000000110100101100011111111000101110001010011100110101011 | {1_0101101000000110,2_1001011000111111,3_1100010111000101,4_1100010100111001}  
(10 rows)  
  
create index idx_hm3 on hm3 using gin(hmarr _text_sml_ops );  

查詢海明距離小於等於1的值,提高到0.2毫秒了。

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 3;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 2  
  limit 100;  
 id |                              hmval                               |                                     hmarr                                     | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------+-------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} |     4  
(1 row)  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 2  
  limit 100;  
                                                                            QUERY PLAN                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=99.83..401.19 rows=100 width=134) (actual time=0.169..0.170 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[]))  
   Buffers: shared hit=14  
   ->  Bitmap Heap Scan on public.hm3  (cost=99.83..10144.17 rows=3333 width=134) (actual time=0.168..0.169 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
         Recheck Cond: (hm3.hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
         Filter: (length(replace((bitxor(B'0101011111111010000001001011101101100011111101111101101100000011'::"bit", hm3.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=14  
         ->  Bitmap Index Scan on idx_hm3  (cost=0.00..99.00 rows=10000 width=0) (actual time=0.145..0.145 rows=1 loops=1)  
               Index Cond: (hm3.hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'::text[])  
               Buffers: shared hit=13  
 Planning time: 0.101 ms  
 Execution time: 0.200 ms  
(14 rows)  

查詢海明距離小於等於4的,依舊在毫秒返回。

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 0;  
  
postgres=# select    
    *,    
    smlar( hmarr, '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}')    
  from    
    hm3  
  where    
    hmarr % '{1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}'      
    and length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 5  
  limit 100;  
 id |                              hmval                               |                                     hmarr                                     | smlar   
----+------------------------------------------------------------------+-------------------------------------------------------------------------------+-------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011} |     4  
(1 row)  
Time: 6.983 ms  

不使用索引,23秒。

postgres=# select * from hm3 where length(replace(bitxor(bit'0101011111111010000001001011101101100011111101111101101100000011', hmval)::text,'0','')) < 5;  
 id |                              hmval                               |                                     hmarr                                       
----+------------------------------------------------------------------+-------------------------------------------------------------------------------  
  1 | 0101011111111010000001001011101101100011111101111101101100000011 | {1_0101011111111010,2_0000010010111011,3_0110001111110111,4_1111011111011011}  
(1 row)  
  
Time: 22954.686 ms  

相比沒有索引的情況,性能從23秒提升到了0.2毫秒。提升了11.48萬倍。

自動切分

有人會說,怎麼自動生成simhash切分後的數組呢?

不用擔心,PostgreSQL提供了強大的UDF功能,可以建立UDF和TRIGGER,在寫入數據時,自動生成切分後的數組。

例子

create table hm4 (id int, hmval bit(64), hmarr text[]);  

create or replace function sp(val bit(64)) returns text[] as $$
select regexp_split_to_array('1_'||substring(val::text,1,10)||',2_'||substring(val::text,11,10)||',3_'||substring(val::text,21,10)||',4_'||substring(val::text,31,10)||',5_'||substring(val::text,41,10)||',6_'||substring(val::text,51,14), ',') ;            
$$ language sql strict;

postgres=# select sp(123::bit(64));
                                         sp                                          
-------------------------------------------------------------------------------------
 {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000001111011}
(1 row)

-- 寫入1億記錄

insert into hm4  
select   
  id,   
  val::bit(64),   
  sp(val::bit(64))    
from   
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,100000000) t(id)) t;  

-- 索引

create index idx_hm4 on hm4 using gin(hmarr _text_sml_ops );  

-- 查詢海明距離小於等於1的記錄,性能杠杠的

postgres=# set smlar.type = overlap;  
postgres=# set smlar.threshold = 5; 

select    
    *,    
    smlar( hmarr, sp(123::bit(64)))     -- 查詢與123::bit(64)海明距離小於2的記錄
  from    
    hm3  
  where    
    hmarr % sp(123::bit(64))      -- 查詢與123::bit(64)海明距離小於2的記錄
    and length(replace(bitxor(123::bit(64), hmval)::text,'0','')) < 2      -- 查詢與123::bit(64)海明距離小於2的記錄
  limit 100;  

創建觸發器,寫入simhash時,自動寫入切分數組

create or replace function tg() returns trigger as $$
declare
begin
  NEW.hmarr := sp(NEW.hmval);
  return NEW;
end;
$$ language plpgsql strict;

postgres=# create trigger tg before insert or update on hm4 for each row execute procedure tg();
CREATE TRIGGER

-- 效果很讚

postgres=# truncate hm4;
TRUNCATE TABLE
postgres=# insert into hm4 values (1,1::bit(64));
INSERT 0 1
postgres=# select * from hm4;
 id |                              hmval                               |                                        hmarr                                        
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------
  1 | 0000000000000000000000000000000000000000000000000000000000000001 | {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000000,6_00000000000001}
(1 row)

postgres=# update hm4 set hmval=123456::bit(64);
UPDATE 1
postgres=# select * from hm4;
 id |                              hmval                               |                                        hmarr                                        
----+------------------------------------------------------------------+-------------------------------------------------------------------------------------
  1 | 0000000000000000000000000000000000000000000000011110001001000000 | {1_0000000000,2_0000000000,3_0000000000,4_0000000000,5_0000000111,6_10001001000000}
(1 row)

爽就點個讚吧。

四、技術點

1、阿裏雲RDS PostgreSQL smlar插件,使用GIN索引,塊級收斂,二重過濾。0.2毫秒的響應速度,1000萬數據中,檢索海明距離2以內的記錄。

2、阿裏雲RDS PostgreSQL 10,使用多核並行,暴力掃描,1000萬數據,檢索海明距離為N以內的數據,約450毫秒。

3、阿裏雲HybridDB for PostgreSQL,使用多機並行,橫向擴展計算能力,也可以做到暴力掃描。根據實際的節點數計算查詢效率。

五、雲端產品

阿裏雲 RDS PostgreSQL

阿裏雲 HybridDB for PostgreSQL

六、類似場景、案例

《電商內容去重\內容篩選應用(實時識別轉載\盜圖\侵權?) - 文本、圖片集、商品集、數組相似判定的優化和索引技術》

《基於 阿裏雲 RDS PostgreSQL 打造實時用戶畫像推薦係統》

七、小結

采用阿裏雲RDS PostgreSQL的SMLAR插件,對千萬量級的simhash數據檢索相似文本,(更多數據量的測試後續提供,響應速度應該在毫秒級),相比沒有索引的情況,性能從23秒提升到了0.2毫秒。提升了11.48萬倍。

開不開心,意不意外。

八、參考

《從難纏的模煳查詢聊開 - PostgreSQL獨門絕招之一 GIN , GiST , SP-GiST , RUM 索引原理與技術背景》

《PostgreSQL結合餘弦、線性相關算法 在文本、圖片、數組相似 等領域的應用 - 3 rum, smlar應用場景分析》

《PostgreSQL結合餘弦、線性相關算法 在文本、圖片、數組相似 等領域的應用 - 2 smlar插件詳解》

《PostgreSQL結合餘弦、線性相關算法 在文本、圖片、數組相似 等領域的應用 - 1 文本(關鍵詞)分析理論基礎 - TF(Term Frequency 詞頻)/IDF(Inverse Document Frequency 逆向文本頻率)》

《17種文本相似算法與GIN索引 - pg_similarity》

《電商內容去重\內容篩選應用(實時識別轉載\盜圖\侵權?) - 文本、圖片集、商品集、數組相似判定的優化和索引技術》

《從相似度算法談起 - Effective similarity search in PostgreSQL》

《PostgreSQL (varbit, roaring bitmap) VS pilosa(bitmap庫)》

《阿裏雲RDS for PostgreSQL varbitx插件與實時畫像應用場景介紹》

《基於 阿裏雲 RDS PostgreSQL 打造實時用戶畫像推薦係統》

《塊級(ctid)掃描在IoT(物聯網)極限寫和消費讀並存場景的應用》

最後更新:2017-08-13 22:34:55

  上一篇:go  淘寶開源Android容器化框架Atlas開發者指南
  下一篇:go  【阿裏雲大學課程】機器學習入門:概念原理及常用算法