MySQL · Source Code Analysis · Data Consistency in MySQL Semi-Synchronous Replication

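Introduction

MySQL Replication gives MySQL users a solution for high availability and scalability. This article outlines the main milestones in the development of MySQL Replication, then briefly analyzes the data consistency of MySQL semi-synchronous replication through the configuration of three parameters: rpl_semi_sync_master_wait_point, sync_binlog, and sync_relay_log.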

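The Evolution of MySQL Replication

In 2000, MySQL 3.23.15 introduced Replication. As a near-real-time synchronization mechanism, it became widely used.

At that time the Replication implementation involved two threads, one on the Master and one on the Slave. On the Slave, the I/O and SQL functions ran in a single thread: events fetched from the Master were applied directly, and there was no relay log. As a result, the speed at which events were read was limited by the speed at which the Slave replayed them. When the Slave lagged far behind the Master, a large amount of binary log had not yet been copied to the Slave.

In 2002, MySQL 4.0.2 split event reading and event execution on the Slave into two threads (the IO thread and the SQL thread) and introduced the relay log. The IO thread reads events and writes them to the relay log; the SQL thread reads events from the relay log and executes them. Even if the SQL thread is slow, the Master's binary log is still copied to the Slave as far as possible, so a failover to the Slave after a Master crash does not lose a large amount of data.

Before version 5.5, released in 2010, MySQL used asynchronous replication only. The master executes transactions without regard to how far the slave has caught up, so if the slave is behind and the master crashes, data is lost.

MySQL 5.5 introduced semi-synchronous replication: before acknowledging a committed transaction to the client, the master must make sure that at least one slave has received the transaction and written it to its relay log. Does that mean semi-synchronous replication never loses data?

In 2016, MySQL 5.7.17 introduced Group Replication.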

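Data Consistency of MySQL Semi-Synchronous Replication

Source Code Walkthrough

All source code below is from the official MySQL 5.7.
MySQL semi-sync is implemented as a plugin under the plugin/semisync directory. We study the semi-sync source here by starting from its main function calls.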

plugin/semisync/semisync_master.cc
/*******************************************************************************
 *
 * <ReplSemiSyncMaster> class: the basic code layer for sync-replication master.
 * <ReplSemiSyncSlave>  class: the basic code layer for sync-replication slave.
 *
 * The most important functions during semi-syn replication listed:
 *
 * Master:
          // Actually called by the Ack_receiver thread: handles semi-sync replication status, records the slave's latest binlog position, and wakes up the corresponding threads
 *  . reportReplyBinlog():  called by the binlog dump thread when it receives
 *                          the slave's status information.
          // Sets the semi-sync flag in the packet header according to the running state of semi-sync
 *  . updateSyncHeader():   based on transaction waiting information, decide
 *                          whether to request the slave to reply.
          // Stores the current binlog file name and offset, and updates the position of the largest transaction in the binlog
 *  . writeTranxInBinlog(): called by the transaction thread when it finishes
 *                          writing all transaction events in binlog.
          // Implements the logic that makes the client wait synchronously
 *  . commitTrx():          transaction thread wait for the slave reply.
 *
 * Slave:
          // Checks whether the network packet header carries the semi-sync flag
 *  . slaveReadSyncHeader(): read the semi-sync header from the master, get the
 *                           sync status and get the payload for events.
          // Sends the ACK packet to the Master
 *  . slaveReply():          reply to the master about the replication progress.
 *
 ******************************************************************************/

The Ack_receiver thread loops over the slaves, uses select to listen for their network packets, handles semi-sync replication status, and wakes up waiting threads.
plugin/semisync/semisync_master_ack_receiver.cc Ack_receiver::run()
->plugin/semisync/semisync_master.cc ReplSemiSyncMaster::reportReplyPacket
  ->plugin/semisync/semisync_master.cc ReplSemiSyncMaster::reportReplyBinlog

The binlog dump thread. If the slave is a semi-sync slave, add_slave adds it to the listening queue. When sending network packets, the thread sets the semi-sync flag in the packet header according to the running state of semi-sync.
sql/rpl_binlog_sender.cc Binlog_sender::run()
->sql/rpl_binlog_sender.cc Binlog_sender::send_binlog
  ->sql/rpl_binlog_sender.cc Binlog_sender::send_events
    ->sql/rpl_binlog_sender.cc Binlog_sender::before_send_hook
      ->plugin/semisync/semisync_master_plugin.cc repl_semi_before_send_event
        ->plugin/semisync/semisync_master.cc ReplSemiSyncMaster::updateSyncHeader

Transaction commit phase: after the binlog is flushed, store the current binlog file name and offset, and update the position of the largest transaction in the binlog.
sql/binlog.cc MYSQL_BIN_LOG::ordered_commit
 ->plugin/semisync/semisync_master_plugin.cc repl_semi_report_binlog_update//after_flush
   ->plugin/semisync/semisync_master.cc repl_semisync.writeTranxInBinlog

Transaction commit phase: the logic that makes the client wait, with two cases, after_sync and after_commit.
sql/binlog.cc MYSQL_BIN_LOG::ordered_commit
  ->sql/binlog.cc process_after_commit_stage_queue || call_after_sync_hook
    ->plugin/semisync/semisync_master_plugin.cc repl_semi_report_commit || repl_semi_report_binlog_sync
      ->plugin/semisync/semisync_master.cc ReplSemiSyncMaster::commitTrx

The Slave IO thread: after reading data, check whether the packet header carries the semi-sync flag.
sql/rpl_slave.cc handle_slave_io
  ->plugin/semisync/semisync_slave_plugin.cc repl_semi_slave_read_event
    ->plugin/semisync/semisync_slave.cc ReplSemiSyncSlave::slaveReadSyncHeader
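
The header handshake between updateSyncHeader on the master and slaveReadSyncHeader on the slave can be sketched roughly as follows. This is a minimal sketch rather than the plugin's code: the two-byte layout (a magic byte followed by a flag byte whose lowest bit requests a reply) and the values 0xef and 0x01 are my reading of the 5.7 plugin and should be treated as assumptions, and `SyncHeader`, `make_sync_packet`, and `read_sync_header` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed values; check plugin/semisync/semisync.h before relying on them.
constexpr std::uint8_t kPacketMagicNum = 0xef;  // marks a semi-sync packet
constexpr std::uint8_t kPacketFlagSync = 0x01;  // master asks for an ACK

// Hypothetical result type: does the master want a reply, and where does
// the event payload start inside the packet?
struct SyncHeader {
  bool need_reply;
  std::size_t payload_offset;
};

// Master side (the role played by updateSyncHeader): prepend the header.
std::vector<std::uint8_t> make_sync_packet(
    bool need_reply, const std::vector<std::uint8_t>& event) {
  std::vector<std::uint8_t> packet;
  packet.reserve(event.size() + 2);
  packet.push_back(kPacketMagicNum);
  packet.push_back(static_cast<std::uint8_t>(need_reply ? kPacketFlagSync : 0));
  packet.insert(packet.end(), event.begin(), event.end());
  return packet;
}

// Slave side (the role played by slaveReadSyncHeader): returns false when
// the packet does not start with the semi-sync magic byte.
bool read_sync_header(const std::vector<std::uint8_t>& packet,
                      SyncHeader* out) {
  assert(out != nullptr);
  if (packet.size() < 2 || packet[0] != kPacketMagicNum) return false;
  out->need_reply = (packet[1] & kPacketFlagSync) != 0;
  out->payload_offset = 2;  // the event itself follows the two header bytes
  return true;
}
```

In this model the slave sends an ACK only for packets whose `need_reply` is true, which corresponds to the "if one is required" condition on the slaveReply path.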

The Slave IO thread: after queuing an event, send an ACK packet to the Master if one is required.
sql/rpl_slave.cc handle_slave_io
  ->plugin/semisync/semisync_slave_plugin.cc repl_semi_slave_queue_event
    ->plugin/semisync/semisync_slave.cc ReplSemiSyncSlave::slaveReply

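First, note that in semi-synchronous mode the master degrades to asynchronous replication if it times out while waiting for the slave's ACK, and this can lose data. The analysis below assumes that rpl_semi_sync_master_timeout is large enough that this fallback never happens.

The data consistency of semi-sync is analyzed here through the configuration of three parameters: rpl_semi_sync_master_wait_point, sync_binlog, and sync_relay_log.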

Configuring rpl_semi_sync_master_wait_point

Source code:

plugin/semisync/semisync_master_plugin.cc

int repl_semi_report_binlog_sync(Binlog_storage_param *param,
                                 const char *log_file,
                                 my_off_t log_pos)
{
  if (rpl_semi_sync_master_wait_point == WAIT_AFTER_SYNC)
    return repl_semisync.commitTrx(log_file, log_pos);
  return 0;
}

int repl_semi_report_commit(Trans_param *param)
  ...
  if (rpl_semi_sync_master_wait_point == WAIT_AFTER_COMMIT &&
  ...
    return repl_semisync.commitTrx(binlog_name, param->log_pos);

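With WAIT_AFTER_COMMIT

When rpl_semi_sync_master_wait_point is WAIT_AFTER_COMMIT, commitTrx is called after the engine-level commit (from process_after_commit_stage_queue in ordered_commit), as shown in the figure above. While the master waits for the Slave ACK, it has not yet replied to the current client, but the transaction is already committed and other clients can read it. If the Slave has not yet read the transaction's events and the master crashes, a failover to the slave follows, and the transaction that was read earlier disappears: a phantom read, as shown in the figure below. Figure from Loss-less Semi-Synchronous Replication on MySQL 5.7.2.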

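With WAIT_AFTER_SYNC

To address this problem, MySQL 5.7.2 introduced Loss-less Semi-Synchronous Replication, which waits for the Slave ACK after binlog sync and before the engine-level commit. A transaction is therefore committed only once a Slave is confirmed to have received its events. Waiting before commit also lets transactions accumulate, which favors group commit and helps performance. The figure below, from Loss-less Semi-Synchronous Replication on MySQL 5.7.2, shows this flow.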
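
The difference between the two wait points comes down to where the ACK wait sits relative to the engine commit. The following is a simplified model of that ordering, not MySQL code: the step names and the functions `commit_steps` and `visible_before_ack` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class WaitPoint { AfterSync, AfterCommit };

// Ordered steps of the commit path for one transaction (simplified).
std::vector<std::string> commit_steps(WaitPoint wp) {
  std::vector<std::string> steps;
  steps.push_back("flush binlog");
  steps.push_back("sync binlog");
  if (wp == WaitPoint::AfterSync) steps.push_back("wait for slave ACK");
  steps.push_back("engine commit");  // other sessions can now read the change
  if (wp == WaitPoint::AfterCommit) steps.push_back("wait for slave ACK");
  steps.push_back("reply to client");
  return steps;
}

// True if other sessions can read the transaction before any slave has
// acknowledged it, which is the window that allows the lost read above.
bool visible_before_ack(WaitPoint wp) {
  const std::vector<std::string> steps = commit_steps(wp);
  std::size_t commit_at = steps.size();
  std::size_t ack_at = steps.size();
  for (std::size_t i = 0; i < steps.size(); ++i) {
    if (steps[i] == "engine commit") commit_at = i;
    if (steps[i] == "wait for slave ACK") ack_at = i;
  }
  assert(commit_at < steps.size() && ack_at < steps.size());
  return commit_at < ack_at;
}
```

In this model `visible_before_ack(WaitPoint::AfterCommit)` is true and `visible_before_ack(WaitPoint::AfterSync)` is false, which matches the behavior described for the two settings.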

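In fact, the flow in the figure above still contains a case that can make master and slave diverge and break replication. See the sync_binlog analysis below.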

Configuring sync_binlog

Source code:

sql/binlog.cc ordered_commit
       // When sync_period (sync_binlog) is 1, update the binlog end pos after the sync
  update_binlog_end_pos_after_sync= (get_sync_period() == 1);
       ...
    if (!update_binlog_end_pos_after_sync)
           // Update the binlog end position; the dump thread then sends the newly visible events
      update_binlog_end_pos();
       ...
    std::pair<bool, bool> result= sync_binlog_file(false);
       ...
  if (update_binlog_end_pos_after_sync)
  {
       ...
      update_binlog_end_pos(tmp_thd->get_trans_pos());
  }

sql/binlog.cc sync_binlog_file
std::pair<bool, bool>
MYSQL_BIN_LOG::sync_binlog_file(bool force)
{
  bool synced= false;
  unsigned int sync_period= get_sync_period(); // the value of sync_binlog
       // A sync_period of 0 means no sync; any other value means sync once the call count reaches it
  if (force || (sync_period && ++sync_counter >= sync_period))
  {

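Analysis

When sync_binlog is 0, flushing the binlog to disk is left to the operating system. When it is nonzero, its value is the number of binlog commit groups between syncs to disk. When sync_binlog is greater than 1, a given sync binlog step may not actually put the binlog on disk. If it has not, and the Master loses power before the transaction commits and then recovers, the transaction is rolled back. But the Slave may already have received and executed that transaction's events, so the Slave ends up with more transactions than the Master and replication fails. To keep master and slave consistent, sync_binlog must therefore be set to 1.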
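
The counter logic in sync_binlog_file can be modeled in a few lines to see how many commit groups are exposed for a given setting. This is a sketch under assumptions: the excerpt above stops at the opening brace, so resetting the counter after a sync is my assumption about the elided body, and `BinlogSyncModel` and `groups_at_risk` are hypothetical names. The model is also pessimistic for sync_binlog = 0, since it assumes the operating system never flushes on its own.

```cpp
#include <cassert>

// Simplified model of the sync decision made once per commit group.
struct BinlogSyncModel {
  unsigned sync_period;   // the sync_binlog setting
  unsigned sync_counter;  // commit groups seen since the last sync

  explicit BinlogSyncModel(unsigned period)
      : sync_period(period), sync_counter(0) {}

  // Same condition as the excerpt; returns true when a sync would happen.
  bool sync_binlog_file(bool force) {
    if (force || (sync_period && ++sync_counter >= sync_period)) {
      sync_counter = 0;  // assumption: the counter restarts after each sync
      return true;       // a real server would fsync the binlog file here
    }
    return false;
  }
};

// After `groups` commit groups, how many of the most recent ones are not
// known to be on disk and could be rolled back after a power loss?
unsigned groups_at_risk(unsigned sync_period, unsigned groups) {
  BinlogSyncModel log(sync_period);
  unsigned last_synced = 0;
  for (unsigned g = 1; g <= groups; ++g) {
    if (log.sync_binlog_file(false)) last_synced = g;
  }
  assert(last_synced <= groups);
  return groups - last_synced;
}
```

For seven commit groups, the model reports none at risk with sync_binlog = 1, one at risk with sync_binlog = 3 (syncs happen after groups 3 and 6), and all seven at risk with sync_binlog = 0.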

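The position of Send Events in the WAIT_AFTER_SYNC and WAIT_AFTER_COMMIT figures can also make master and slave diverge and break replication. The figures in the rpl_semi_sync_master_wait_point analysis actually depict the case where sync_binlog is not 1. According to the source above, the flow is as shown in the figure below: the Master performs flush binlog, update binlog position, and sync binlog, in that order. If the Master loses power after update binlog position but before sync binlog, the transaction is rolled back when the Master restarts. The Slave, however, may already have fetched the events, which again leaves the Slave with more data than the Master and breaks replication.

For this reason, when sync_binlog is 1, MySQL updates the binlog end pos after the sync, as shown in the figure below. In this mode every commit group requires a binlog sync, and the binlog sync and the network send of events are serialized, so performance drops noticeably.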
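
The ordering rule in ordered_commit can be expressed as a small model that shows when the dump thread is allowed to see events that are not yet synced. This is an illustrative sketch, not MySQL code: `Step`, `flush_stage_steps`, and `slave_can_get_ahead` are hypothetical names, and only the `sync_binlog == 1` test is taken from the excerpt above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Step { FlushBinlog, UpdateEndPos, SyncBinlog };

// Order of the three steps for one commit group, following the
// update_binlog_end_pos_after_sync flag in the excerpt.
std::vector<Step> flush_stage_steps(unsigned sync_binlog) {
  const bool update_after_sync = (sync_binlog == 1);
  std::vector<Step> steps;
  steps.push_back(Step::FlushBinlog);
  if (!update_after_sync) steps.push_back(Step::UpdateEndPos);
  steps.push_back(Step::SyncBinlog);
  if (update_after_sync) steps.push_back(Step::UpdateEndPos);
  return steps;
}

// True if the end position is published (so the dump thread may send the
// events) before the sync step, leaving a window in which a power loss on
// the master rolls back a transaction the slave already holds.
bool slave_can_get_ahead(unsigned sync_binlog) {
  const std::vector<Step> steps = flush_stage_steps(sync_binlog);
  const auto publish =
      std::find(steps.begin(), steps.end(), Step::UpdateEndPos);
  const auto sync = std::find(steps.begin(), steps.end(), Step::SyncBinlog);
  assert(publish != steps.end() && sync != steps.end());
  return publish < sync;
}
```

Only `slave_can_get_ahead(1)` is false in this model; 0 and every value above 1 publish the end position before the sync step.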

Configuring sync_relay_log

Source code:

sql/rpl_slave.cc handle_slave_io

      if (queue_event(mi, event_buf, event_len))
           ...
      if (RUN_HOOK(binlog_relay_io, after_queue_event,
                   (thd, mi, event_buf, event_len, synced)))

after_queue_event
->plugin/semisync/semisync_slave_plugin.cc repl_semi_slave_queue_event
->plugin/semisync/semisync_slave.cc ReplSemiSyncSlave::slaveReply

queue_event
->sql/binlog.cc MYSQL_BIN_LOG::append_buffer(const char* buf, uint len, Master_info *mi)
->sql/binlog.cc after_append_to_relay_log(mi);
->sql/binlog.cc flush_and_sync(0)
->sql/binlog.cc sync_binlog_file(force)

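Analysis

In the Slave IO thread, get_sync_period returns the value of sync_relay_log, which controls syncing in the same way sync_binlog does. When sync_relay_log is not 1, the position that semi-sync reports back to the Master may not yet be synced to disk. With gtid_mode enabled and the previous two parameters configured correctly, a single crash of either the Master or the Slave causes neither data loss nor a replication failure, even when sync_relay_log is not 1. Suppose the Slave has not synced its relay log, the Master commits the transaction, the client observes the commit, and then the Slave crashes. The Slave then loses the events of a transaction it has already acknowledged to the Master.

If the Slave restarts and the Master crashes before the Slave has re-fetched the lost events from the Master, users who access the Slave will find data missing.

This case shows that, for MySQL semi-sync to lose no data whenever any one machine fails, sync_relay_log must also be set to 1. The relay log sync happens in queue_event and is performed for every event, so with sync_relay_log set to 1 transaction response time suffers, and latency increases considerably for transactions that involve a lot of data.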
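
The double-failure case above can be replayed with a toy model of the slave's relay log. This is a sketch under stated assumptions, not MySQL code: each event stands for a whole transaction, a crash is treated as a power loss that discards everything not yet synced, the operating system is assumed never to flush on its own, and `SlaveModel` and `acked_trx_survives_double_failure` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct SlaveModel {
  unsigned sync_relay_log;     // the sync_relay_log setting
  unsigned sync_counter;       // events queued since the last sync
  std::vector<int> relay_log;  // transaction ids written to the relay log
  std::size_t durable;         // how many leading entries are on disk

  explicit SlaveModel(unsigned period)
      : sync_relay_log(period), sync_counter(0), durable(0) {}

  // IO thread: append the event, sync if the counter says so, then ACK.
  // The ACK is sent whether or not the event reached the disk.
  bool queue_event_and_ack(int trx_id) {
    relay_log.push_back(trx_id);
    if (sync_relay_log && ++sync_counter >= sync_relay_log) {
      sync_counter = 0;
      durable = relay_log.size();
    }
    return true;
  }

  // Power loss: everything after the last sync is gone.
  void crash_and_restart() {
    assert(durable <= relay_log.size());
    relay_log.resize(durable);
    sync_counter = 0;
  }

  bool has(int trx_id) const {
    return std::find(relay_log.begin(), relay_log.end(), trx_id) !=
           relay_log.end();
  }
};

// Slave ACKs a transaction, the slave crashes, then the master crashes
// before the slave can re-fetch the events. Does the slave still have it?
bool acked_trx_survives_double_failure(unsigned sync_relay_log) {
  SlaveModel slave(sync_relay_log);
  const int trx = 42;
  const bool acked = slave.queue_event_and_ack(trx);  // client sees commit
  slave.crash_and_restart();
  return acked && slave.has(trx);  // slave is promoted with what it has
}
```

In this model the acknowledged transaction survives only when sync_relay_log is 1; with 0 or any larger value the promoted slave no longer has it.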

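MySQL with Three Nodes

The data consistency analysis of one-master-one-slave semi-sync above gave up high availability: when the network between master and slave is unstable or either machine goes down, service stops. To achieve high availability, the natural idea is one master with two slaves, which handles availability when one network link is unstable or one machine goes down. However, the configuration requirements described above for data consistency still apply, so performance under normal conditions does not improve. We also have to solve how to choose the new master when the Master goes down, and how to avoid ending up with multiple masters.

When choosing a new master, both slaves must be read to see which one has the latest log; otherwise data may be lost. Such a three-node scheme resembles a distributed quorum mechanism: a write must succeed on a quorum of the three nodes, and determining the new master requires reading a quorum. The distributed consensus protocols Paxos and Raft can solve the data consistency, leader election, and multi-master problems, so in recent years most database teams in China have built three-node solutions based on Paxos/Raft. More recently, MySQL itself has introduced Group Replication, a plugin-based solution that supports multi-master clusters.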
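
The rule of reading both slaves and promoting the one with the latest log can be sketched as follows. This is a minimal illustration, not a failover tool: `ReplicaState`, `compare_pos`, `is_quorum`, and `pick_new_master` are hypothetical names, the positions are assumed to be the master binlog coordinates each slave has received, and comparing file names as strings assumes the usual fixed-width numeric suffix. A GTID-based setup would compare GTID sets instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ReplicaState {
  std::string name;
  std::string master_log_file;        // last master binlog file received
  unsigned long long master_log_pos;  // offset received within that file
};

// Negative, zero, or positive when a is behind, level with, or ahead of b.
int compare_pos(const ReplicaState& a, const ReplicaState& b) {
  const int by_file = a.master_log_file.compare(b.master_log_file);
  if (by_file != 0) return by_file;
  if (a.master_log_pos == b.master_log_pos) return 0;
  return a.master_log_pos < b.master_log_pos ? -1 : 1;
}

// A quorum is a strict majority of the cluster, e.g. 2 of 3 nodes.
bool is_quorum(std::size_t responders, std::size_t cluster_size) {
  return responders * 2 > cluster_size;
}

// Index of the replica with the most advanced log among those that replied.
std::size_t pick_new_master(const std::vector<ReplicaState>& replicas) {
  assert(!replicas.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < replicas.size(); ++i) {
    if (compare_pos(replicas[i], replicas[best]) > 0) best = i;
  }
  return best;
}
```

With three nodes and the old master down, both surviving slaves together form a quorum (`is_quorum(2, 3)`), while a single slave on its own does not, which is why promoting one slave without consulting the other risks losing acknowledged transactions.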

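Summary

Since replication was first introduced, official MySQL has kept improving it. At the same time, the current native MySQL master-slave replication implementation finds it very hard to deliver high availability and high performance while still guaranteeing data consistency.

Last updated: 2017-04-21 09:01:15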
