閱讀973 返回首頁    go 阿裏雲 go 技術社區[雲棲]


HybridTime - Accessible Global Consistency with High Clock Uncertainty

Amazon’s Dynamo [9] and Facebook’s Cassandra [13], relax the consistency model,and offer only eventual consistency.

Others such as HBase [1] and BigTable [4] offer strong consistency only for operations touching a single partition, but not across the database as a whole.

Dynamo和Cassandra都是基於Vector Clocks和Gossip算法,強調高可用性,可以達到eventual consistency

而HBase,可以提供單分區的強一致性,但對於跨分區,或跨數據庫,就無法保證;

for example, in a geo-replicated service, a user may first be routed to a local datacenter and perform some action such as sending an email message. 
If they then reload their browser and are routed to another datacenter backend, they would still expect to see their sent message in their email outbox.

這個例子比較典型,要達到異地的一致性

 

一般要解決分布式問題的一致性問題的最簡單的辦法就是用synchronized clocks across all servers,但這基本是達不到的

the traditional approach to global consistency has been to use logical clocks. 
Logical clocks, such as Lamport Clocks [15] or Vector Clocks [10, 20].

所以傳統的做法就是用邏輯時鍾

https://www.zhihu.com/question/19994133

貼個回答,寫的不錯

當然可以,但是不適合在分布式數據庫裏麵。為什麼呢? 首先,需要先確定你這個時間戳取自哪裏。1、所有的時間戳都取自一個中心點節點,這樣讓時間戳得到一個全序關係,這是最簡單的實現,如果你的分布式數據庫隻是部署在一個數據中心的話,但是這樣就增加了一個中心節點,影響數據庫的擴展性,而且任何一個取時間戳的動作都要增加上網絡延遲的開銷。如果分布式數據庫部署需要跨數據中心,這種方案變的不可行,跨數據中心網絡時延太大。從而影響整體性能2、所有的時間戳取自已所在的節點。這就涉及到時鍾同步的問題,就像Dongdong說的,如果取的物理時間那麼這些節點並沒有一個一致的時間,總會有一點誤差,這種解決方案就有兩種思路: 
一、就是現在以Leslie Lamport 提出的邏輯時鍾的方法,重新定義了一種分布式的全序關係。vector clock就是以邏輯時鍾為基礎上發展而來的。 
二、使用物理時鍾,就是google spanner使用gps 加原子鍾進行時間同步,不同節點之間的時間是不能精確同步的,隻能把不同節點之間的時間同步到一個範圍之內,spanner一個節點上獲得一個時間戳,不是一個時間值,而是一個時間值加上一個誤差範圍,也就是一個時間窗,如果兩個時間窗有重合,就無法比較大小,從而無法比較出來時間窗代表的事件的先後關係。而時間同步算法就是讓分布式各節點之間的時間的誤差盡量的小,這樣取時間戳這個動作效率是最高的,隻在本節點取一下物理時間即可,但是為了同步時間,各個節點肯定在周期與其它節點進行通訊來同步物理時間(google時間同步的算法好像沒有開源)。時間同步算法隻能讓時間盡量的精確。分布式數據庫裏麵是不能直接使用NTP來同步時間的,因為NTP時間同步協議裏麵允許節點的時間能夠回退,而數據庫裏麵要求本地的時鍾必須是遞增的。 
三、還有一種混合邏輯時鍾(Logical Physical Clocks),也是由邏輯時鍾演變而來,但是時鍾的值比較接近於物理時鍾,而且不依賴於下麵物理時鍾的同步的算法,它的時間戳都可以在本地取,而且取出來是一個時間值,而不像spanner一樣,取出來是一個時間窗。詳細可以參見Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases這篇論文

關於Lamport或向量時鍾,

Lamport時鍾就是邏輯時鍾,即不依賴於物理時鍾,而依賴event之間的因果關係來定義的一種偏序

參考,全序, 分布式一致性的本質

而Vector時鍾是,邏輯時鍾的一種實現,具體的邏輯看下文,

Why Vector Clock are Easy or Hard?

摘自上麵的知乎回答

“dynamo paper是七、八年前寫的了,現在的amazon dynamo早已摒棄了version vector,而采用了synchronous replication(類似paxos的protocol),每一個partition會有一個leader負責寫入。其實version vector並不scale,因為對於一個key來說,隨著writer數量的增加,version vector數量成指數級的增長” 

邏輯時鍾的缺點,

first, they require that all clients must propagate clock data to achieve consistent views, 
and second, the assigned timestamps have no relation to physical time.

 

Spanner introduced commit-wait, a way of ensuring physical-time based consistent global state by forcing operations to wait long enough so that all participants agree that the operation’s timestamp has passed based on worst case synchronization error bounds.

While innovative, the system performance becomes highly dependent on the quality of the time synchronization infrastructure, and thus may have unacceptable performance absent specialized hardware such as atomic clocks and GPS receivers.

Spanner利用原子鍾和GPS接收器,實現了一個較為精確的時鍾,這個時鍾叫做TrueTime,每次調用TrueTime API返回的是一個時間區間,而不是一個具體的值,這個TrueTime保證的是真實時間(absolute time/real time)一定在這個區間內,這個區間範圍通常大約14ms,甚至更小。

所以隻有每個transaction的時間區間,都不重合,那麼就可以比較容易的做到全局有序;

所以這裏需要commit-wait,

image

可以看到,我在開始transaction的時候,取一次true time,[t1.earliest, t1.latest]

那麼什麼時候,我可以把這個transaction commit掉,即釋放掉鎖,即別人可以開始新的transaction

當我再取一次true time,[t2.earliest, t2.latest],當t2.earliest > t1.latest的時候,即可

因為這樣兩個transaction的時間區間就不可能重疊了,當然代價就是每個transaction都有等待大概2個error bound的時間

當然,spanner的實現依賴硬件的infra設備,普通用戶無法複製;

 

HybridTime(HT), a hybrid between physical and logical clocks and we show how HT can be used to achieve the same semantics as Spanner, but with good performance even with commonly available time synchronization.

Like Spanner, our approach relies on physical time measurements with bounded error to assign HybridTime timestamps to events that occur in the system. 
However, unlike Spanner, our approach does not usually require the error to be waited out, thus allowing for usage in common deployment scenarios where clocks are synchronized through common protocols such as the Network Time Protocol, in which clock synchronization error is often higher than with Spanner’s TrueTime.

The trade-off is that, in order to avoid commit-wait, HybridTime requires that timestamps be propagated across machines to achieve the same consistency semantics as Spanner. 
Contrary to vector clocks, which can expand as the number of participants in the cluster grows, HybridTime timestamps have constant and small size.

HybridTime說是physical and logical clocks的結合,和spanner一樣,也可以使用帶error bound的物理時間 
但由於他沒有spanner的硬件設施,所以做不到低延遲的時間同步,所以他依然使用邏輯時鍾來達到一致性;並且他解決了vector clocks隨著client的數量增加而會size膨脹的問題

HybridTime clocks follow similar update rules to Lamport clocks, but the time values are not purely logical: 
each time value has both a logical component, which helps in guaranteeing the same properties as a Lamport Clocks, and a physical component which allows the event to be associated with a physical point-in-time.

 

One of Spanner’s key benefits is that is externally consistent, which is defined as fully linearizable, even in the presence of hidden channels.

Additionally we use the term globally consistent to describe a system which provides the same linearizability semantics, provided that there are no hidden channels present.

區分為,是否有hidden channels?意思是clients間存在因果或先後關係,比如通過發message,都是這個因果關係對於係統是不可知的,比如A write,然後發送消息給B,然後B write,那麼因果關係上,B write一定後於A write,但是對於數據庫而言他並不知道,所以並不能保證A write先寫入;

spanner是可以保證externally consistent, 由於他通過 commit wait,來保證每個transaction之間一定是全局有序的

相對的globally consistent,更簡單些,因為並沒有hidden channels

 

Spanner’s key innovation is that timestamps assigned by the system can be used to achieve external consistency, but also have physical meaning.

HybridTime is always globally consistent, and through selective application of commit-wait is externally consistent.

Spanner的主要創新在於實現external一致性的,同時還保留了物理時間的

而HybridTime,默認情況下是可以實現globally consistent,即偏序,因為他本身就是使用lamport時鍾,並且當你選擇commit-wait時,也可以保證externally consistent;

 

HybridTime Assumptions

1. HybridTime assumes that machines have a reasonably accurate physical clock, represented by the PCi(e) function, which outputs the numeric timestamp returned by the physical clock as read by process i for event e, that is able to provide absolute time measurements (usually in milli- or microseconds since 1 January 1970).

2. keeps the physical clocks across different servers synchronized with regard to a reference server, the “reference” time, represented by the PCref(e) function which outputs the numeric timestamp returned by the “reference” process for event e;

Additionally, we assume that such a substrate is able to provide an error bound along with each time measurement, denoted by the Ei(e) function, which outputs the numeric value ε error of process i at the time e occurred

3. we make no assumptions on the actual accuracy of the clocks, i.e. the physical timestamps returned by server’s clocks may have an arbitrarily large but finite error, as long as this error’s bound is known

說白了,假設

有個相對靠譜的物理時鍾,一個理想的參照時鍾,以及他們之間相差的,error bound

最後,我們並不假設這個error bound會很小,隻要有限就可以

假設1,我們有有限的error bound

image

 

假設2, 進程級別的物理時鍾單調遞增,注意是進程級別

image

 

基於以上假設,提出HybridTime Clock and Protocol

 

HybridTime Clock and Protocol

HybridTime clock(HTC) is a pair (physical,logical) where the first component is a representation of the physical time at which the event occurred and the second component is a logical sequence number.

定義其實顯而易見。。。

Algorithm 2 depicts the HTC algorithm.

image

HTC算法如上,兩部分,NOW和UPDATE

now就是取當前的HybridTime clock

upate就是根據in 來更新當前的clock

Algorithm 2 implements a Lamport Clock, with the additional advantage that generated timestamps have physical meaning and are accurate representations of physical time within a bound error.

算法本身,其實就是Lamport Clock,隻是增加了physical clock的部分

給出一個例子,

image

 

To order the events timestamped using the HybridTime Clock algorithm we use Definition 1.

Definition 1. HCT(e) < HCT( f ) is defined as the lexicographical ordering of the timestamp two-tuple (physical,logical)

Theorem 1. The HybriTime clock happened-before relation forms a total order of events

Theorem 2. For any event in a causal chain f , the physical component of a HTC timestamp approximates the “real” time the event occurred, with a error defined and bounded by

image

 

Implementation

No Consistency - In this mode there are no external consistency guarantees, transactions are assigned timestamps from each server’s physical clock and no guarantee is made that reads are consistent or repeatable.

直接用local的物理時間,不保證一致性

HybridTime Consistency - In this mode our implementation guarantees the global consistency as Spanner, absent hidden channels, but using HybridTime instead of commit-wait. 
Clients choosing this consistency mode on writes must make sure that the timestamp that is received from the server is propagated to other servers and/or clients. 
Within the same client process, timestamps are automatically propagated on behalf of the user.

這個其實和邏輯時鍾,沒有區別,就是保證偏序

Commit-wait Consistency - In this mode our implementation guarantees the same external consistency semantics as Spanner by also using commit-wait in the way described in the original paper. 
However instead of using TrueTime, which is a proprietary and private API, we implemented commit-wait on top of the widely used Network Time Protocol (NTP). Hence, in this consistency mode we support hidden channels.

這個估計沒啥用,在沒有spanner硬件保證的情況下,commit-wait,性能不能忍吧

最後更新:2017-04-07 21:05:50

  上一篇:go Off-heap Memory in Apache Flink and the curious JIT compiler
  下一篇:go Flink DataStream API Programming Guide