

PostgreSQL 10.0 preview feature enhancement - standby lag now measurable in six dimensions of "time and bytes"

Tags

PostgreSQL , 10.0 , time-based standby lag , pg_stat_replication


Background

The pg_stat_replication view contains four columns that record the WAL positions fed back by the standby:

postgres=# \d pg_stat_replication   
          View "pg_catalog.pg_stat_replication"  
      Column      |           Type           | Modifiers   
------------------+--------------------------+-----------  
 pid              | integer                  |   
 usesysid         | oid                      |   
 usename          | name                     |   
 application_name | text                     |   
 client_addr      | inet                     |   
 client_hostname  | text                     |   
 client_port      | integer                  |   
 backend_start    | timestamp with time zone |   
 backend_xmin     | xid                      |   
 state            | text                     |   
 sent_location    | pg_lsn                   | -- last LSN sent by the primary  
 write_location   | pg_lsn                   | -- LSN the standby has WRITTEN  
 flush_location   | pg_lsn                   | -- LSN the standby has FLUSHED (persisted)  
 replay_location  | pg_lsn                   | -- LSN the standby has APPLIED (replayed)  
 sync_priority    | integer                  |   
 sync_state       | text                     |   

From these we can compute the standby's current lag behind the primary, in WAL bytes, along each of the four dimensions.

pg_current_xlog_insert_location  --  returns the database's current WAL insert position  
  
pg_xlog_location_diff  --  computes the difference between two WAL LSN positions  
  
-- For example, the send lag between primary and standby:  
  
select pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_insert_location(), sent_location)) from pg_stat_replication;  
  
For the other dimensions, simply substitute the corresponding column for sent_location.  
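Putting the pieces together, here is a query sketch reporting all four byte-level lag dimensions at once (9.x function names; in 10+ these were renamed, e.g. pg_current_wal_insert_lsn and pg_wal_lsn_diff):

```sql
-- All four byte-level lag dimensions, per connected standby
select application_name,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_insert_location(), sent_location))   as sent_lag,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_insert_location(), write_location))  as write_lag,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location))  as flush_lag,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_insert_location(), replay_location)) as replay_lag
from pg_stat_replication;
```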

With this method, however, the lag can only be reported in bytes.

There is currently no good way to convert this into a time-based figure, short of reading the timestamp of the corresponding LSN record out of the WAL files themselves.
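As a rough workaround, a query run on the standby itself can approximate time lag with pg_last_xact_replay_timestamp(). Note that this only reflects the last replayed commit, so it is only meaningful while the primary is actively generating WAL:

```sql
-- Run on the standby: approximate replay time lag.
-- If the primary is idle, this value keeps growing even though
-- the standby is fully caught up.
select now() - pg_last_xact_replay_timestamp() as approx_replay_lag;
```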

10.0 adds several columns to the pg_stat_replication view that express lag in terms of time: write_lag, flush_lag and replay_lag.
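With this change, a simple query returns time-based lag directly (column names as introduced by the commit quoted below):

```sql
-- PostgreSQL 10+: time-based lag per standby, reported as intervals
select application_name, write_lag, flush_lag, replay_lag
from pg_stat_replication;
```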

Replication lag tracking for walsenders  
  
Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.  
  
Implements a lag tracker module that reports the lag times based upon  
measurements of the time taken for recent WAL to be written, flushed and  
replayed and for the sender to hear about it. These times  
represent the commit lag that was (or would have been) introduced by each  
synchronous commit level, if the remote server was configured as a  
synchronous standby.  For an asynchronous standby, the replay_lag column  
approximates the delay before recent transactions became visible to queries.  
If the standby server has entirely caught up with the sending server and  
there is no more WAL activity, the most recently measured lag times will  
continue to be displayed for a short time and then show NULL.  
  
Physical replication lag tracking is automatic. Logical replication tracking  
is possible but is the responsibility of the logical decoding plugin.  
Tracking is a private module operating within each walsender individually,  
with values reported to shared memory. Module not used outside of walsender.  
  
Design and code is good enough now to commit - kudos to the author.  
In many ways a difficult topic, with important and subtle behaviour so this  
shoudl be expected to generate discussion and multiple open items: Test now!  
  
Author: Thomas Munro, following designs by Fujii Masao and Simon Riggs  
Review: Simon Riggs, Ian Barwick and Craig Ringer  

How it works


1. The primary allocates a buffer in which it samples and records the LSN positions at which the primary flushes WAL records, together with the corresponding times.

+/* A sample associating a log position with the time it was written. */    
+typedef struct  
+{  
+   XLogRecPtr lsn;  
+   TimestampTz time;  
+} WalTimeSample;  
+  
+/* The size of our buffer of time samples. */  -- samples of (WAL flush LSN, time)  
+#define LAG_TRACKER_BUFFER_SIZE 8192  
+  
+/* A mechanism for tracking replication lag. */    
+static struct  
+{  
+   XLogRecPtr last_lsn;  
+   WalTimeSample buffer[LAG_TRACKER_BUFFER_SIZE];  
+   int write_head;  
+   int read_heads[NUM_SYNC_REP_WAIT_MODE];            --  one read head each for the standby's write, flush, replay modes  
+   WalTimeSample last_read[NUM_SYNC_REP_WAIT_MODE];   --  last sample consumed by each read head  
+} LagTracker;  
  
  
  
  
/* SyncRepWaitMode */  
#define SYNC_REP_NO_WAIT                (-1)  
#define SYNC_REP_WAIT_WRITE             0  
#define SYNC_REP_WAIT_FLUSH             1  
#define SYNC_REP_WAIT_APPLY             2  
  
#define NUM_SYNC_REP_WAIT_MODE  3  

2. Strictly speaking, what is currently recorded is not the LSN and time at which a WAL record was flushed, but the time at which the walsender read the send LSN. So with the current code the lag should be understood as the time from sending an LSN until the standby writes, flushes, and replays it,

not the time from the primary's flush of that LSN until the standby's write, flush, and replay.

When the WAL send lag between primary and standby is small, this approximation is quite accurate; when the send lag is large, it is not.


The code is as follows.

    /*  
+    * Record the current system time as an approximation of the time at which  
+    * this WAL position was written for the purposes of lag tracking.  
+    *  
+    * In theory we could make XLogFlush() record a time in shmem whenever WAL  
+    * is flushed and we could get that time as well as the LSN when we call  
+    * GetFlushRecPtr() above (and likewise for the cascading standby  
+    * equivalent), but rather than putting any new code into the hot WAL path  
+    * it seems good enough to capture the time here.  We should reach this  
+    * after XLogFlush() runs WalSndWakeupProcessRequests(), and although that  
+    * may take some time, we read the WAL flush pointer and take the time  
+    * very close to together here so that we'll get a later position if it  
+    * is still moving.  
+    *  
+    * Because LagTrackerWriter ignores samples when the LSN hasn't advanced,  
+    * this gives us a cheap approximation for the WAL flush time for this  
+    * LSN.  
+    *  
+    * Note that the LSN is not necessarily the LSN for the data contained in  
+    * the present message; it's the end of the WAL, which might be  
+    * further ahead.  All the lag tracking machinery cares about is finding  
+    * out when that arbitrary LSN is eventually reported as written, flushed  
+    * and applied, so that it can measure the elapsed time.  
+    */  
+   LagTrackerWrite(SendRqstPtr, GetCurrentTimestamp());  
  
+/*  
+ * Record the end of the WAL and the time it was flushed locally, so that  
+ * LagTrackerRead can compute the elapsed time (lag) when this WAL position is  
+ * eventually reported to have been written, flushed and applied by the  
+ * standby in a reply message.  
+ * Exported to allow logical decoding plugins to call this when they choose.  
+ */  
+void  
+LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)  
+{  
+   bool buffer_full;  
+   int new_write_head;  
+   int i;  
+  
+   if (!am_walsender)  
+       return;  
+  
+   /*  
+    * If the lsn hasn't advanced since last time, then do nothing.  This way  
+    * we only record a new sample when new WAL has been written.  
+    */  
+   if (LagTracker.last_lsn == lsn)  
+       return;  
+   LagTracker.last_lsn = lsn;  
+  
+   /*  
+    * If advancing the write head of the circular buffer would crash into any  
+    * of the read heads, then the buffer is full.  In other words, the  
+    * slowest reader (presumably apply) is the one that controls the release  
+    * of space.  
+    */  
+   new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;  
+   buffer_full = false;  
+   for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)  
+   {  
+       if (new_write_head == LagTracker.read_heads[i])  
+           buffer_full = true;  
+   }  
+  
+   /*  
+    * If the buffer is full, for now we just rewind by one slot and overwrite  
+    * the last sample, as a simple (if somewhat uneven) way to lower the  
+    * sampling rate.  There may be better adaptive compaction algorithms.  
+    */  
+   if (buffer_full)  
+   {  
+       new_write_head = LagTracker.write_head;  
+       if (LagTracker.write_head > 0)  
+           LagTracker.write_head--;  
+       else  
+           LagTracker.write_head = LAG_TRACKER_BUFFER_SIZE - 1;  
+   }  
+  
+   /* Store a sample at the current write head position. */  
+   LagTracker.buffer[LagTracker.write_head].lsn = lsn;  
+   LagTracker.buffer[LagTracker.write_head].time = local_flush_time;  
+   LagTracker.write_head = new_write_head;  
+}  

3. Each walsender maintains its own LagTracker.

4. The standby's walreceiver process feeds back the standby's WAL write, flush, and replay LSN positions. When the primary's walsender receives the feedback, it calls LagTrackerRead, which compares the (LSN, timestamp) samples recorded in the buffer against the current time to derive the standby's lag.

The code is as follows.

+/*  
+ * Find out how much time has elapsed between the moment WAL position 'lsn'  
+ * (or the highest known earlier LSN) was flushed locally and the time 'now'.  
+ * We have a separate read head for each of the reported LSN locations we  
+ * receive in replies from standby; 'head' controls which read head is  
+ * used.  Whenever a read head crosses an LSN which was written into the  
+ * lag buffer with LagTrackerWrite, we can use the associated timestamp to  
+ * find out the time this LSN (or an earlier one) was flushed locally, and  
+ * therefore compute the lag.  
+ *  
+ * Return -1 if no new sample data is available, and otherwise the elapsed  
+ * time in microseconds.  
+ */  
+static TimeOffset  
+LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)  
+{  
+   TimestampTz time = 0;  
+  
+   /* Read all unread samples up to this LSN or end of buffer. */  
+   while (LagTracker.read_heads[head] != LagTracker.write_head &&  
+          LagTracker.buffer[LagTracker.read_heads[head]].lsn <= lsn)  
+   {  
+       time = LagTracker.buffer[LagTracker.read_heads[head]].time;  
+       LagTracker.last_read[head] =  
+           LagTracker.buffer[LagTracker.read_heads[head]];  
+       LagTracker.read_heads[head] =  
+           (LagTracker.read_heads[head] + 1) % LAG_TRACKER_BUFFER_SIZE;  
+   }  
+  
+   if (time > now)  
+   {  
+       /* If the clock somehow went backwards, treat as not found. */  
+       return -1;  
+   }  
+   else if (time == 0)  
+   {  
+       /*  
+        * We didn't cross a time.  If there is a future sample that we  
+        * haven't reached yet, and we've already reached at least one sample,  
+        * let's interpolate the local flushed time.  This is mainly useful for  
+        * reporting a completely stuck apply position as having increasing  
+        * lag, since otherwise we'd have to wait for it to eventually start  
+        * moving again and cross one of our samples before we can show the  
+        * lag increasing.  
+        */  
+       if (LagTracker.read_heads[head] != LagTracker.write_head &&  
+           LagTracker.last_read[head].time != 0)  
+       {  
+           double fraction;  
+           WalTimeSample prev = LagTracker.last_read[head];  
+           WalTimeSample next = LagTracker.buffer[LagTracker.read_heads[head]];  
+  
+           Assert(lsn >= prev.lsn);  
+           Assert(prev.lsn < next.lsn);  
+  
+           if (prev.time > next.time)  
+           {  
+               /* If the clock somehow went backwards, treat as not found. */  
+               return -1;  
+           }  
+  
+           /* See how far we are between the previous and next samples. */  
+           fraction =  
+               (double) (lsn - prev.lsn) / (double) (next.lsn - prev.lsn);  
+  
+           /* Scale the local flush time proportionally. */  
+           time = (TimestampTz)  
+               ((double) prev.time + (next.time - prev.time) * fraction);  
+       }  
+       else  
+       {  
+           /* Couldn't interpolate due to lack of data. */  
+           return -1;  
+       }  
+   }  
+  
+   /* Return the elapsed time since local flush time in microseconds. */  
+   Assert(time != 0);  
+   return now - time;  
+}  

The above applies only to physical replication.

For logical replication, the decoding plugin is responsible for implementing the corresponding lag recording and retrieval itself.

Drawbacks of the current implementation, and alternative design ideas

The 10.0 approach minimizes the amount of code change needed to obtain an approximate, time-based measure of primary/standby lag. The drawback is that when the walsender disconnects, or when sending itself lags significantly, the time-based lag computed this way can deviate considerably from reality.

An alternative approach would be to interleave extra records into the WAL (for example via a custom WAL record interface, recording LSN positions and times). Since COMMIT/ROLLBACK WAL records, as well as these interleaved records, carry timestamps, the standby could obtain both the LSN and the time directly while decoding, and thus feed back write, flush, and replay times directly. With this method the primary would not need to maintain a buffer tracking LSN-to-time mappings, and the resulting times would be comparatively precise; however, it would increase WAL volume and might require changes to the streaming replication protocol.

For the discussion of this patch, see the mailing-list thread reachable via the URL at the end of this article.

The PostgreSQL community works very rigorously: a patch may be discussed on the mailing list for months or even years and revised repeatedly based on feedback, so by the time it is merged into master it is already very mature. This is why PostgreSQL's stability is so widely renowned.

References

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6912acc04f0bbcfdb799a120618507601e862490

Last updated: 2017-04-01 16:42:10
