

A Case Study of MySQL Instance Blocking (Threads in the statistics State)

This article works through a real case of MySQL blocking in which threads were stuck in the statistics state.

1. Symptoms

One afternoon, during the off-peak period after work hours, one of our production MySQL databases suddenly reported a large number of slow queries whose state was statistics; yet when we re-ran those same SQL statements afterwards, they finished quickly. The threads in the statistics state shared one trait: they were all querying views, while monitoring for that time window showed no obvious spike in update/delete/insert traffic.
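For reference, a quick way to spot such sessions at the time is to filter the processlist by state (a minimal sketch; the 10-second threshold and the use of information_schema.PROCESSLIST are my own choices here, not part of the original troubleshooting):

-- Sessions that have been sitting in the "statistics" state for a while
SELECT id, user, host, db, time, state, LEFT(info, 100) AS query_head
FROM information_schema.PROCESSLIST
WHERE state = 'statistics'
  AND time >= 10
ORDER BY time DESC;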

Using our snapshot program to analyze the InnoDB status captured at the time, we found the following:

SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 17208994
--Thread 139964610234112 has waited at srv0srv.cc line 2132 for 14.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x1635a00 created in file dict0dict.cc line 900
a writer (thread id 139964610234112) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 720
Last time write locked in file /home/admin/146_20161018140650857_13830810_code/rpm_workspace/storage/innobase/srv/srv0srv.cc line 2132
OS WAIT ARRAY INFO: signal count 256984450
Mutex spin waits 626367674, rounds 2776951802, OS waits 1973672
RW-shared spins 149944457, rounds 1650148561, OS waits 3972058
RW-excl spins 72090467, rounds 2017802579, OS waits 11148264
Spin rounds per wait: 4.43 mutex, 11.01 RW-shared, 27.99 RW-excl
...
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: doing file i/o (read thread) ev set
I/O thread 4 state: waiting for i/o request (read thread)
I/O thread 5 state: doing file i/o (read thread) ev set
I/O thread 6 state: doing file i/o (write thread) ev set
I/O thread 7 state: waiting for i/o request (write thread)
I/O thread 8 state: waiting for i/o request (write thread)
I/O thread 9 state: waiting for i/o request (write thread)
Pending normal aio reads: 18 [0, 12, 0, 6] , aio writes: 1 [1, 0, 0, 0] ,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
1346747614 OS file reads, 2869418806 OS file writes, 524616747 OS fsyncs
22 pending preads, 1 pending pwrites
6.00 reads/s, 16384 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
...
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
38 read views open inside InnoDB
Main thread process no. 34414, id 139964610234112, state: enforcing dict cache limit
Number of rows inserted 2546811699, updated 1708150459, deleted 1004154696, read 413168628410
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 54.19 reads/s

2. Analysis

From the output above we know that Thread 139964610234112 is the InnoDB master thread. It had been waiting for 14 seconds at srv0srv.cc line 2132 on an exclusive RW-latch that was created at dict0dict.cc line 900. Oddly enough, the master thread appeared to be waiting on a latch it had itself reserved (strictly speaking it is in wait_ex mode: it has registered its exclusive request and is waiting for the existing reader, see "number of readers 1" above, to release the latch).

Since the environment is Alibaba Cloud RDS (based on MySQL 5.6.16-log) and we cannot get their source code, I pulled up the 5.6.35 source instead, where the line numbers do not match exactly. Fortunately the output above ends with Main thread ... state: enforcing dict cache limit, and that state is set in srv0srv.cc, inside srv_master_do_active_tasks(), at roughly line 2137:

    if (cur_time % SRV_MASTER_DICT_LRU_INTERVAL == 0) {
        srv_main_thread_op_info = "enforcing dict cache limit";
        srv_master_evict_from_table_cache(50);
        MONITOR_INC_TIME_IN_MICRO_SECS(
            MONITOR_SRV_DICT_LRU_MICROSECOND, counter_time);
    }

So the wait should be happening inside the call to srv_master_evict_from_table_cache(), which evicts entries from the InnoDB table cache (it does not necessarily evict anything: it first checks whether there is still enough room, and the argument 50 means that at most 50% of the unused-table list is scanned). Here is srv_master_evict_from_table_cache():

static
ulint
srv_master_evict_from_table_cache(
/*==============================*/
    ulint    pct_check)    /*!< in: max percent to check */
{
    ulint    n_tables_evicted = 0;

    rw_lock_x_lock(&dict_operation_lock);

    dict_mutex_enter_for_mysql();

    n_tables_evicted = dict_make_room_in_cache(  /* defined in dict0dict.cc */
        innobase_get_table_cache_size(), pct_check);

    dict_mutex_exit_for_mysql();

    rw_lock_x_unlock(&dict_operation_lock);

    return(n_tables_evicted);
}

So the 14-second wait happened while acquiring the latch at rw_lock_x_lock(&dict_operation_lock). That latch is created in the data dictionary module, in dict_init() in dict0dict.cc at roughly line 1065, which matches the innodb status output.
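Incidentally, the MONITOR_SRV_DICT_LRU_MICROSECOND counter incremented in the snippet above is exposed through information_schema.INNODB_METRICS, so you can watch how much time the master thread spends in this eviction path. A sketch, assuming the counter is named innodb_dict_lru_usec in 5.6 (it may need to be enabled first):

-- Enable the counter (if not already on), then read it
SET GLOBAL innodb_monitor_enable = 'innodb_dict_lru_usec';
SELECT name, subsystem, count, time_elapsed
FROM information_schema.INNODB_METRICS
WHERE name LIKE 'innodb_dict_lru%';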

As for dict_operation_lock, the source comment explains it well:

/** @brief the data dictionary rw-latch protecting dict_sys
table create, drop, etc. reserve this in X-mode; 

implicit or backround operations purge, rollback, foreign key checks reserve this in S-mode;

we cannot trust that MySQL protects implicit or background operations a table drop since MySQL does not know of them;

therefore we need this; NOTE: a transaction which reserves this must keep book on the mode in trx_t::dict_operation_lock_mode */

在嚐試把表定義逐出緩存時,獲取的是 dict_operation_lock X-mode lock,可是從已有的信息裏看不到另一個數據字典鎖是什麼。 之前是懷疑是不是 table_definition_cache, table_open_cache, innodb_open_files 設置小了,視圖一般是多表join,更容易消耗打開表的數量,導致不斷的逐出cache而導致鎖爭用。但是檢查一番並沒發現什麼問題,更何況是14s的等待。關於它們的設置和關係,可以參考我的文章 table_open_cache 與 table_definition_cache 對MySQL的影響(詳見文末參考文獻)。

Time for a different angle, then. The processlist showed 13 threads that had been in the statistics state for a long time, which means they were calculating statistics in order to build a query execution plan. When a thread stays in this state for an extended period, disk I/O performance is probably poor, or the disks are busy with other work.

At this point, note the line Pending normal aio reads: 18 [0, 12, 0, 6] in the output above: 18 read I/O requests were pending (judging from the innodb_data_pending_reads monitoring graph, it actually reached 50 at one point), and two of the four read threads were busy doing file i/o. Moreover, innodb_buffer_pool_pages_flushed did not change at all in the 10 seconds before the incident, meaning no dirty pages were successfully flushed to disk. In addition, this is a replica, and in the 10 seconds before the incident the replication delay briefly spiked to over 20 seconds:

(Monitoring graphs: innodb_data_pending_reads spiking, and replica delay briefly exceeding 20 seconds, in the window before the incident.)

(All of this focuses on the data before 18:59:05. After that point, once things recover there is usually a momentary surge in rows read; do not mistake that for the cause.)
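The same pending-I/O picture can also be pulled from status counters instead of parsing SHOW ENGINE INNODB STATUS, which is convenient for periodic sampling (a sketch; which counters you trend is up to you):

-- Pending InnoDB I/O and flushing activity at a point in time
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Innodb_data_pending_reads', 'Innodb_data_pending_writes',
   'Innodb_data_pending_fsyncs', 'Innodb_os_log_pending_writes',
   'Innodb_buffer_pool_pages_flushed');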

3. Conclusion

Putting together the enforcing dict cache limit state above and the pending I/O behind statistics, I found two related bug reports:

  • https://bugs.launchpad.net/percona-server/+bug/1500176
  • https://bugs.mysql.com/bug.php?id=84424

The first can occur when using pt-online-schema-change to alter the structure of a partitioned table, but its status is still Undecided, and our environment has no partitioned tables, no foreign keys, and no schema changes in progress. The second was in fact closed as "Not a bug":

Thank you for your bug report. This is, however, not a bug, but a very well known issue.
You have to do several things in order to alleviate the problem:
* increase the additional memory pool
(Note: in my opinion this should not be the additional memory pool but rather the buffer pool, since InnoDB memory management now essentially calls the system malloc,
i.e. innodb_use_sys_malloc=ON; see https://dev.mysql.com/doc/refman/5.7/en/innodb-performance-use_sys_malloc.html)
* increase total number of file handles available to MySQL
* increase number of file handles for InnoDB
* improve performance of the I/O on your operating system

In short, the database server ran into an I/O problem. You can increase the buffer pool to cache more data, or improve the server's I/O capacity, which is a broad topic in itself; see "8.5.8 Optimizing InnoDB Disk I/O" (in the references at the end). However, these production servers had been running for about a year without any I/O issues even at peak hours, let alone during this off-peak window, and no manual operations were going on. So the blame can only land on Alibaba Cloud RDS: we suspect disk jitter on the physical host where the instance lives.
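Mapped to concrete settings, the bug report's advice comes down to checking and, if needed, raising knobs like the following (a sketch; the value shown is a placeholder, not a recommendation, and on 5.6 innodb_buffer_pool_size can only be changed via my.cnf plus a restart, as online resizing arrived in 5.7.5):

-- Settings the bug report's advice maps to
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('innodb_buffer_pool_size', 'open_files_limit', 'innodb_open_files',
   'innodb_read_io_threads', 'innodb_write_io_threads');

-- my.cnf example (placeholder value):
-- [mysqld]
-- innodb_buffer_pool_size = 32G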

After all that analysis, this conclusion leaves us unable to do much, because we have no way to see server-level I/O stats. In fact, a few instances ran into similar statistics problems last year as well; each time we opened a ticket with Alibaba Cloud to confirm the state of the physical host, the answer was always "Yes, the physical host had some jitter. Would you like to request an instance migration?", but we never got any supporting evidence. If we could see OS-level monitoring ourselves, the long-winded analysis in this article would not have been necessary.

This article comes from "老葉茶館", a partner of the Yunqi Community (雲棲社區). For related updates, follow "老葉茶館".

Last updated: 2017-10-25 13:33:45
