RocksDB · 特性介紹 · HashLinkList 內存表
Table of Contents
RocksDB 內存表簡介
RocksDB 是一個基於 LSM 樹(Log-Structured Merge-tree)結構的單機數據庫引擎,內存表是它最重要的數據結構之一。除了默認的跳表(SkipList)之外,它還增加了各種其他的內存表,例如:HashSkipList、HashLinkList、Vector 等。HashSkipList 是在跳表外套了一層 Hash,每個桶對應了一個跳表。類似的,HashLinkList 是在鏈表外套了一層 Hash,每個桶對應了一個鏈表。這兩種內存表都需要配置 prefix_extractor,以計算 hash 值。RocksDB 支持的內存表類型見下圖:
為了方便使用,內存表有對應的工廠類 MemTableRepFactory。與內存表類型對應,完整的內存表工廠類的繼承關係如下圖:
從內存表的基類 MemTableRep 中可以看到,它支持的 API 有 Get、Insert 等,但是沒有 Delete。這是因為 Delete 被上層轉換成了一個 Insert,真正的刪除是在數據合並過程中做的。
HashLinkList 內存表
應用示例
相比 HashSkipList 而言,HashLinkList 要更簡潔/簡單一些,也可以節約內存的占用。它的缺點是不支持並發的寫入。為了測試各種內存表的表現,RocksDB 提供了一個簡單的測試工具,參看:memtablerep_bench.cc。我們也可以在示例代碼 examples/simple_example.cc 中修改一下來驗證 HashLinkList 的使用。例如:
$ git diff examples/simple_example.cc
diff --git a/examples/simple_example.cc b/examples/simple_example.cc
index 57d1b25..7b67ff0 100644
--- a/examples/simple_example.cc
+++ b/examples/simple_example.cc
@@ -9,6 +9,7 @@
#include "rocksdb/db.h"
#include "rocksdb/slice.h"
#include "rocksdb/options.h"
+#include "rocksdb/slice_transform.h"
using namespace rocksdb;
@@ -17,6 +18,9 @@ std::string kDBPath = "/tmp/rocksdb_simple_example";
int main() {
DB* db;
Options options;
+ options.memtable_factory.reset(NewHashLinkListRepFactory(4, 0, 1, true, 4));
+ options.prefix_extractor.reset(NewFixedPrefixTransform(1));
+ options.allow_concurrent_memtable_write = false;
// Optimize RocksDB. This is the easiest way to get RocksDB to perform well
options.IncreaseParallelism();
options.OptimizeLevelStyleCompaction();
需要注意的幾個地方:
1. options.memtable_factory.reset() 是將默認的 SkipList 替換為我們要測試的 HashLinkList。
2. options.prefix_extractor.reset() 是設置 Hash 需要的 prefix_extractor,如果不設置,默認用的還會是 SkipList。
3. options.allow_concurrent_memtable_write = false 是因為 HashLinkList 不支持並發寫入,不設置運行時會報錯。
4. 數據存放在 /tmp/rocksdb_simple_example,如果數據目錄已經存在,則會打開已經存在的數據庫。
實現代碼
HashLinkList 的代碼存在以下兩個文件中:memtable/hash_linklist_rep.h 以及 memtable/hash_linklist_rep.cc。可以看到它做了一些細節的優化,從簡單的 Hash 鏈表逐步演化到 Hash 跳表。代碼注釋得非常清楚,參看下圖:
由於存在上述的一些優化,Hash 表在不同時刻會存在不同的形式。
- 第一種情況最簡單,也就是桶為空。不占用額外內存空間。
- 第二種情況下,桶裏頭隻有一條記錄,存在鏈表第一個節點中。Next 指針占用額外空間。Next 指針取值為 NULL。後續其他情況下 Next 指針都不是 NULL,可以依賴這一點來區分不同的場景。
- 第三種情況下,桶裏頭的記錄多餘一條,鏈表的第一個節點是個表頭,記錄鏈表中的記錄數。記錄數是為了判斷鏈表是否應該轉換為跳表而設的。額外空間包括這個表頭節點以及每個節點中的 Next 指針。
- 第四種情況下,桶裏頭的記錄數超過了設定的閾值(threshold_use_skiplist),鏈表被轉換為一個跳表。鏈表的第一個節點的 Next 指針指向自己。Count 繼續保留,可以用來做測試等。
如果 HashLinkList 要支持並發寫,就需要對數據結構做適當的控製。不過當前它並不支持並發寫,而是單寫多讀。它實現時采用了 C++ 11 的 Atomic 做一些特殊的處理避免加鎖。另外需要注意的是,這個 Iterator 的實現 MemTableRep::Iterator* HashLinkListRep::GetIterator() 比較費資源,它會 new 一個 MemtableSkipList,把記錄都遍曆出來並插入進去。
采用修改後的 simple_example.cc,可以看到插入、查找、刪除所對應的執行路徑,用來理解代碼的主要執行流程。
Put
(gdb) bt
#0 rocksdb::(anonymous namespace)::HashLinkListRep::Insert (this=0xcd2af0, handle=0xcf3250) at memtable/hash_linklist_rep.cc:580
#1 0x000000000044af7f in rocksdb::MemTable::Add (this=0xcf3000, s=1, type=rocksdb::kTypeValue, key=..., value=..., allow_concurrent=false, post_process_info=0x0) at db/memtable.cc:452
#2 0x00000000004a9850 in rocksdb::MemTableInserter::PutCF (this=0x7fffffffb550, column_family_id=0, key=..., value=...) at db/write_batch.cc:944
#3 0x00000000004a422e in rocksdb::WriteBatch::Iterate (this=0x7fffffffbb60, handler=0x7fffffffb550) at db/write_batch.cc:386
#4 0x00000000004a61f1 in rocksdb::WriteBatchInternal::InsertInto (writers=..., sequence=1, memtables=0xceb780, flush_scheduler=0xcf0780, ignore_missing_column_families=false, recovery_log_number=0,
db=0xcf0000, concurrent_memtable_writes=false) at db/write_batch.cc:1294
#5 0x0000000000609f2e in rocksdb::DBImpl::WriteImpl (this=0xcf0000, write_options=..., my_batch=0x7fffffffbb60, callback=0x0, log_used=0x0, log_ref=0, disable_memtable=false)
at db/db_impl_write.cc:215
#6 0x00000000006093c2 in rocksdb::DBImpl::Write (this=0xcf0000, write_options=..., my_batch=0x7fffffffbb60) at db/db_impl_write.cc:48
#7 0x000000000060ca99 in rocksdb::DB::Put (this=0xcf0000, opt=..., column_family=0xcce9e0, key=..., value=...) at db/db_impl_write.cc:794
#8 0x0000000000609240 in rocksdb::DBImpl::Put (this=0xcf0000, o=..., column_family=0xcce9e0, key=..., val=...) at db/db_impl_write.cc:23
#9 0x00000000005f7ee7 in rocksdb::DB::Put (this=0xcf0000, options=..., key=..., value=...) at ./include/rocksdb/db.h:201
#10 0x00000000004082a0 in main () at simple_example.cc:40
Get
(gdb) bt
#0 rocksdb::(anonymous namespace)::HashLinkListRep::Get (this=0xcd2af0, k=..., callback_args=0x7fffffffb750, callback_func=0x44b82c <rocksdb::SaveValue(void*, char const*)>)
at memtable/hash_linklist_rep.cc:727
#1 0x000000000044c1fc in rocksdb::MemTable::Get (this=0xcf3000, key=..., value=0x7fffffffc020, s=0x7fffffffc090, merge_context=0x7fffffffba60, range_del_agg=0x7fffffffb910, seq=0x7fffffffb898,
read_opts=...) at db/memtable.cc:678
#2 0x00000000005f9800 in rocksdb::MemTable::Get (this=0xcf3000, key=..., value=0x7fffffffc020, s=0x7fffffffc090, merge_context=0x7fffffffba60, range_del_agg=0x7fffffffb910, read_opts=...)
at ./db/memtable.h:196
#3 0x00000000005e667a in rocksdb::DBImpl::GetImpl (this=0xcf0000, read_options=..., column_family=0xcce9e0, key=..., pinnable_val=0x7fffffffbba0, value_found=0x0) at db/db_impl.cc:959
#4 0x00000000005e6296 in rocksdb::DBImpl::Get (this=0xcf0000, read_options=..., column_family=0xcce9e0, key=..., value=0x7fffffffbba0) at db/db_impl.cc:905
#5 0x00000000005f811f in rocksdb::DB::Get (this=0xcf0000, options=..., column_family=0xcce9e0, key=..., value=0x7fffffffc020) at ./include/rocksdb/db.h:289
#6 0x00000000005f8233 in rocksdb::DB::Get (this=0xcf0000, options=..., key=..., value=0x7fffffffc020) at ./include/rocksdb/db.h:299
#7 0x0000000000408364 in main () at simple_example.cc:44
Delete
(gdb) bt
#0 rocksdb::(anonymous namespace)::HashLinkListRep::Insert (this=0xcd2af0, handle=0xcf3278) at memtable/hash_linklist_rep.cc:580
#1 0x000000000044af7f in rocksdb::MemTable::Add (this=0xcf3000, s=2, type=rocksdb::kTypeDeletion, key=..., value=..., allow_concurrent=false, post_process_info=0x0) at db/memtable.cc:452
#2 0x00000000004a9d9e in rocksdb::MemTableInserter::DeleteImpl (this=0x7fffffffb680, column_family_id=0, key=..., value=..., delete_type=rocksdb::kTypeDeletion) at db/write_batch.cc:999
#3 0x00000000004a9eb8 in rocksdb::MemTableInserter::DeleteCF (this=0x7fffffffb680, column_family_id=0, key=...) at db/write_batch.cc:1018
#4 0x00000000004a42db in rocksdb::WriteBatch::Iterate (this=0x7fffffffbc60, handler=0x7fffffffb680) at db/write_batch.cc:393
#5 0x00000000004a61f1 in rocksdb::WriteBatchInternal::InsertInto (writers=..., sequence=2, memtables=0xceb780, flush_scheduler=0xcf0780, ignore_missing_column_families=false, recovery_log_number=0,
db=0xcf0000, concurrent_memtable_writes=false) at db/write_batch.cc:1294
#6 0x0000000000609f2e in rocksdb::DBImpl::WriteImpl (this=0xcf0000, write_options=..., my_batch=0x7fffffffbc60, callback=0x0, log_used=0x0, log_ref=0, disable_memtable=false)
at db/db_impl_write.cc:215
#7 0x00000000006093c2 in rocksdb::DBImpl::Write (this=0xcf0000, write_options=..., my_batch=0x7fffffffbc60) at db/db_impl_write.cc:48
#8 0x0000000000408480 in main () at simple_example.cc:53
最後更新:2017-05-21 09:01:53