radix tree在數據庫PostgreSQL中的一些應用舉例

標簽

PostgreSQL , radix tree , suffix tree , trie , SP-GIST , 字符轉換 , 基因 , 路由表 , 全文檢索 , 關聯數組

背景

PostgreSQL 10.0發布了一個特性，使用radix tree來提升字符集轉換的效率。

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=aeed17d00037950a16cc5ebad5b5592e5fa1ad0f

Use radix tree for character encoding conversions.    
    
Replace the mapping tables used to convert between UTF-8 and other    
character encodings with new radix tree-based maps. Looking up an entry in    
a radix tree is much faster than a binary search in the old maps. As a    
bonus, the radix tree representation is also more compact, making the    
binaries slightly smaller.    
    
The "combined" maps work the same as before, with binary search. They are    
much smaller than the main tables, so it doesn't matter so much. However,    
the "combined" maps are now stored in the same .map files as the main    
tables. This seems more clear, since they're always used together, and    
generated from the same source files.

既然與搜索有關，順便也提一下PostgreSQL已經支持的索引接口如下:

B-Tree    
HASH    
GIN    
GiST    
SP-GiST    
BRIN    
BLOOM    
RUM

這些索引的原理，可以參考

https://leopard.in.ua/2015/04/13/postgresql-indexes

其中SP-GiST索引接口可以支持radix tree、quad-trees, k-d trees等非平衡數據結構的數據。用以支持合適場景的數據搜索需求。

https://www.postgresql.org/docs/devel/static/spgist-intro.html

SP-GiST is an abbreviation for space-partitioned GiST.     
    
SP-GiST supports partitioned search trees,     
which facilitate development of a wide range of different non-balanced data structures,     
such as quad-trees, k-d trees, and radix trees (tries).     
    
The common feature of these structures is that they repeatedly divide the search space into     
partitions that need not be of equal size.     
    
Searches that are well matched to the partitioning rule can be very fast.

radix tree是什麼？簡單描述如下。

一、radix tree 結構

https://en.wikipedia.org/wiki/Radix_tree

radix tree也被稱為"收縮的 prefix tree(trie)"。是一種多叉搜索樹，樹的葉子結點是實際的數據條目。

除了根節點以外，每個子節點都有一個唯一的父節點，你可以認為父節點就是子節點的prefix，並有一個指針指向父結點。

raidx tree的內部節點（除了根、葉子節點），每一個節點都有r個子節點，r=2^x，(x>=1)；也就是說r可能是2,4,8,......；

每個內部結點有一個固定的、2^n指針指向子結點（每個指針稱為槽slot），並有一個指針指向父結點。

這種結構使得radix tree的邊界，可能是1個葉子(如ulus)，也可能是若幹個葉子(如e, us)。

從層次上來看也是不平衡的，例如romulus就隻有三層，而romane有4層。

二、radix tree 應用

radix tree的常見應用特征如下

一個較小的數據集，每條記錄都是比較長的字符串，這些字符串有較多的共享的prefix。

例如：

1. 路由表結構，數據集不大，網絡地址有大量的prefix是可以共享的（如192.168.0.0/24, 192.168.1.0/24,......)，很符合radix tree的應用特征。

有一個開源項目，（Fast, robust and memory-savvy IP radix tree (Patricia trie) implementation in Java）

https://github.com/openstat/ip-radix-tree

2. 關聯數組（associative array）

Linux 內核中radix tree使用的例子。

https://www.infradead.org/~mchehab/kernel_docs/core-api/assoc_array.html

https://github.com/MintCN/linux-insides-zh/blob/master/DataStructures/radix-tree.md

擴展

3. 全文檢索

https://www.itu.dk/people/pagh/DBT09/07-text-indexing.pdf

PS：PostgreSQL 全文檢索、模煳查詢相關的例子(gin和rtree的應用)

《PostgreSQL 全文檢索接口 - RUM索引》

《PostgreSQL 行級全文檢索》

《PostgreSQL 模煳查詢最佳實踐》

《聊一聊雙十一背後的技術 - 分詞和搜索》

《聊一聊雙十一背後的技術 - 正則和相似度查詢加速》

原理參考

《電商內容去重\內容篩選應用(實時識別轉載\盜圖\侵權?) - 文本、圖片集、商品集、數組相似判定的優化和索引技術》

《PostgreSQL 模煳查詢最佳實踐》

4. 基因搜索

基因數據具有非常長的字符串，而且有很多是共享的片段，也特別適合radix tree。

PS：PostgreSQL 基因類型的例子(huffman tree的應用)

《如何通過PostgreSQL基因配對，產生優良下一代》

https://colab.mpi-bremen.de/wiki/display/pbis/PostBIS

postbis-src/sequence/code_set_creation.c

參考

https://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html

https://en.wikipedia.org/wiki/Huffman_coding

https://mitpress.mit.edu/sicp/full-text/sicp/book/node41.html

三、radix tree 數據操作

搜索

範例(取自wiki)

function lookup(string x)  
{  
  // Begin at the root with no elements found  
  Node traverseNode := root;  
  int elementsFound := 0;  
    
  // Traverse until a leaf is found or it is not possible to continue  
  while (traverseNode != null && !traverseNode.isLeaf() && elementsFound < x.length)  
  {  
    // Get the next edge to explore based on the elements not yet found in x  
    Edge nextEdge := select edge from traverseNode.edges where edge.label is a prefix of x.suffix(elementsFound)  
      // x.suffix(elementsFound) returns the last (x.length - elementsFound) elements of x  
    
    // Was an edge found?  
    if (nextEdge != null)  
    {  
      // Set the next node to explore  
      traverseNode := nextEdge.targetNode;  
      
      // Increment elements found based on the label stored at the edge  
      elementsFound += nextEdge.label.length;  
    }  
    else  
    {  
      // Terminate loop  
      traverseNode := null;  
    }  
  }  
    
  // A match is found if we arrive at a leaf node and have used up exactly x.length elements  
  return (traverseNode != null && traverseNode.isLeaf() && elementsFound == x.length);  
}

插入

插入時，首先搜索(直到最深的共享的prefix)，分為多種情況，可能直接插入到root下麵，也可能插入到最後一個共享的prefix下麵，也可能進行分裂，。。。。

一些例子如下：

刪除

刪除操作，首先進行定位，定位到之後，分為兩種情況：

如果是leaf node，則直接刪除；

如果是internal node（說明你要刪除的數據下麵還有節點，那麼將其他節點補齊缺的prefix並掛到你的父節點下麵，然後你就變成了leaf node，然後刪除）。

其他搜索

1. 搜索包含指定prefix(前綴)的所有數據。

2. 搜索指定字符串的前一個字符串。（比指定字符串更小的最大字符串）。

3. 搜索指定字符串的下一個字符串。（比指定字符串更大的最小字符串）。

四、如何開發PostgreSQL SP-GiST索引

https://www.postgresql.org/docs/devel/static/indexam.html

https://www.postgresql.org/docs/devel/static/spgist.html

https://www.postgresql.org/docs/devel/static/pgtrgm.html

樣板代碼

postgresql-src/contrib/pg_trgm/trgm_gist.c

postgresql-src/src/backend/access/spgist/spgtextproc.c

PostgreSQL 索引原理

這裏有講到，B-tree, R-tree, hash, bitmap, GIN, GIST, BRIN 索引的原理。

https://leopard.in.ua/2015/04/13/postgresql-indexes

小結

PostgreSQL的可擴展性，包容的BSD-Like許可，使得PG在很多垂直行業有許多和業務結合很緊密的應用。例如本文提到生命科學(基因)場景，危化品管理（化學RDkit插件），全文檢索，圖搜索，流計算，時序數據、GIS數據的處理等等場景，我們可以看到數據庫和應用結合得非常緊密，從特殊數據類型的支持，到數據的檢索支持，數據的操作支持，數據的服務端函數編程支持，一應俱全。