Disk Utilization and Saturation
How do you observe disk IO utilization and saturation? This article has the answer.
In this blog post, I will look at disk utilization and saturation.
In my previous blog post, I wrote about CPU utilization and saturation, the practical difference between them, and how each impacts response times differently.
Now we will look at another critical component of database performance: the storage subsystem. In this post, I will refer to the storage subsystem as “disk” (as a casual catch-all).
The most common tool for command line IO performance monitoring is iostat, which shows information like this:
root@ts140i:~# iostat -x nvme0n1 5
Linux 4.4.0-89-generic (ts140i) 08/05/2017 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.51 0.00 2.00 9.45 0.00 88.04
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 3555.57 5887.81 52804.15 87440.73 29.70 0.53 0.06 0.13 0.01 0.05 50.71
avg-cpu: %user %nice %system %iowait %steal %idle
0.60 0.00 1.06 20.77 0.00 77.57
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 7612.80 0.00 113507.20 0.00 29.82 0.97 0.13 0.13 0.00 0.12 93.68
avg-cpu: %user %nice %system %iowait %steal %idle
0.50 0.00 1.26 6.08 0.00 92.16
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 7653.20 0.00 113497.60 0.00 29.66 0.99 0.13 0.13 0.00 0.12 93.52
The first report shows the average performance since system start. In some cases, it is useful to compare the current load to the long-term average. Since this is a test system, it can be safely ignored here. The next reports show the current performance metrics over five-second intervals (as specified on the command line).
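If you only care about the interval reports, recent sysstat builds of iostat also accept a -y flag that omits that first since-boot report. A small variation on the command above (assuming your iostat version supports -y):

root@ts140i:~# iostat -x -y nvme0n1 5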
The iostat command reports utilization information in the %util column, and you can look at saturation by either looking at the average request queue size (the avgqu-sz column) or looking at the r_await and w_await columns (which show the average wait for read and write operations). If these go well above “normal”, then the device is over-saturated.
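Both of these metrics are ultimately derived from counters the kernel exposes in /proc/diskstats. Here is a minimal sketch of the arithmetic, assuming the common field layout (device name in column 3, milliseconds spent doing IO in column 13, weighted milliseconds in column 14); the device name and five-second interval simply mirror this post's setup:

#!/bin/bash
# Rough approximation of iostat's %util and avgqu-sz for one device,
# computed from /proc/diskstats counter deltas over a fixed interval.
DEV=nvme0n1
INTERVAL=5

sample() { awk -v d="$DEV" '$3 == d {print $13, $14}' /proc/diskstats; }

read io1 wt1 <<< "$(sample)"
sleep "$INTERVAL"
read io2 wt2 <<< "$(sample)"

# %util: share of the interval during which at least one request was in flight
awk -v a="$io1" -v b="$io2" -v i="$INTERVAL" 'BEGIN {printf "util%%   : %.2f\n", (b - a) / (i * 1000) * 100}'
# avgqu-sz: average number of requests in flight over the interval
awk -v a="$wt1" -v b="$wt2" -v i="$INTERVAL" 'BEGIN {printf "avgqu-sz: %.2f\n", (b - a) / (i * 1000)}'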
As in my previous blog post, we’ll perform some Sysbench runs and observe how the iostat command line tool and the Percona Monitoring and Management (PMM) graphs behave.
To focus specifically on the disk, we’re using the Sysbench fileio test. I’m using just one 100GB file, and since I’m using DirectIO all requests hit the disk directly. I’m also using “sync” request submission mode so I can get better control of request concurrency.
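The run commands below assume the 100GB test file already exists. If you are reproducing this, a prepare step along these lines creates it first (the path and size simply match the runs that follow):

root@ts140i:/mnt/data# sysbench fileio --file-num=1 --file-total-size=100G prepare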
I’m using an Intel 750 NVME SSD in this test (though it does not really matter).
Sysbench FileIO 1 Thread
root@ts140i:/mnt/data# sysbench --threads=1 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
reads/s: 7113.16
writes/s: 0.00
fsyncs/s: 0.00
Throughput:
read, MiB/s: 111.14
written, MiB/s: 0.00
General statistics:
total time: 600.0001s
total number of events: 4267910
Latency (ms):
min: 0.07
avg: 0.14
max: 6.18
95th percentile: 0.17
A single thread run is always great as a baseline, as with only one request in flight we should expect the best response time possible (though typically not the best throughput possible).
Iostat
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 7612.80 0.00 113507.20 0.00 29.82 0.97 0.13 0.13 0.00 0.12 93.68
The Disk Latency graph confirms the disk IO latency we saw in the iostat command; it will be highly device-specific. We use it as a baseline for comparison with the higher-concurrency runs.
Disk IO Utilization
Disk IO utilization is close to 100% even though we have just one outstanding IO request (queue depth). This is the problem with Linux disk utilization reporting: unlike for CPUs, Linux does not have direct visibility into how the IO device is designed. How many “execution units” does it really have? How are they utilized? A single spinning disk can be seen as a single execution unit, while RAID arrays, SSDs and cloud storage (such as EBS) have more than one.
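On Linux you can at least peek at how many hardware submission queues the kernel is driving for a blk-mq device such as an NVMe drive, which hints at its internal parallelism. A hedged example (the sysfs paths assume a blk-mq kernel, and the device name is just this post's):

root@ts140i:~# ls /sys/block/nvme0n1/mq/
# one subdirectory per hardware dispatch queue
root@ts140i:~# cat /sys/block/nvme0n1/queue/nr_requests
# the maximum number of requests the block layer will queue for the device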
Disk Load
This graph shows the disk load (or request queue size), which roughly matches the number of threads that are hitting the disk as hard as possible.
Saturation (IO Load)
The IO load on the Saturation Metrics graph shows pretty much the same numbers. The only difference is that, unlike the Disk IO statistics, it shows a summary for the whole system.
Sysbench FileIO 4 Threads
Now let’s increase IO to four concurrent threads and see how the disk responds:
root@ts140i:/mnt/data# sysbench --threads=4 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
reads/s: 26248.44
writes/s: 0.00
fsyncs/s: 0.00
Throughput:
read, MiB/s: 410.13
written, MiB/s: 0.00
General statistics:
total time: 600.0002s
total number of events: 15749205
Latency (ms):
min: 0.06
avg: 0.15
max: 8.73
95th percentile: 0.21
We can see the number of requests scales almost linearly, while request latency changes very little: 0.14ms vs. 0.15ms. This shows the device has enough execution units internally to handle the load in parallel, and that there are no other bottlenecks (such as the connection interface).
Iostat
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 28808.60 0.00 427668.00 0.00 29.69 4.05 0.14 0.14 0.00 0.03 99.92
Disk Read/Write Latency
Saturation (IO Load)
Sysbench FileIO 16 Threads
root@ts140i:/mnt/data# sysbench --threads=16 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
reads/s: 76845.96
writes/s: 0.00
fsyncs/s: 0.00
Throughput:
read, MiB/s: 1200.72
written, MiB/s: 0.00
General statistics:
total time: 600.0003s
total number of events: 46107727
Latency (ms):
min: 0.07
avg: 0.21
max: 9.72
95th percentile: 0.36
Going from four to 16 threads, we again see a good throughput increase with a mild response time increase. If you look at the results closely, you will notice one more interesting thing: the average response time has increased from 0.15ms to 0.21ms (which is a 40% increase), while the 95% response time has increased from 0.21ms to 0.36ms (which is 71%). I also ran a separate test measuring the 99% response time, and the difference is even larger: 0.26ms vs. 0.48ms (or 84%).
This is an important observation to make: once saturation starts to happen, the variance is likely to increase and some of the requests will be disproportionately affected (beyond what the average response time shows).
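For the separate 99% measurement mentioned above, sysbench lets you choose which percentile it reports. Something along these lines should do it (the flag is --percentile in sysbench 1.0; everything else matches the 16-thread run):

root@ts140i:/mnt/data# sysbench --threads=16 --time=600 --percentile=99 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run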
Iostat
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 82862.20 0.00 1230567.20 0.00 29.70 16.33 0.20 0.20 0.00 0.01 100.00
Disk Read/Write Latency
Disk IO Utilization
Disk Load
The graphs show the expected picture: the disk load and the IO load on the saturation graph are both up to about 16, and disk IO utilization remains at 100%.
One thing to notice is the increased jitter in the graphs. IO utilization jumps to over 100% and the disk IO load spikes to 18, when there should not be that many requests in flight. This comes from how the information is gathered: an attempt is made to sample the data every second, but on a loaded system this takes time, so when we try to get data for a one-second interval we sometimes really get data for a 1.05- or 0.95-second interval. When the math is applied to the data, it creates spikes and dips in the graph where there should be none. You can simply ignore them if you’re looking at the big picture.
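If you collect these counters yourself, you can avoid such artificial spikes by dividing the counter deltas by the elapsed time you actually measured, rather than by the nominal interval. A minimal sketch (again using this post's device name as a placeholder):

DEV=nvme0n1
# take the "milliseconds spent doing IO" counter together with a high-resolution timestamp
t0=$(date +%s.%N); io0=$(awk -v d="$DEV" '$3 == d {print $13}' /proc/diskstats)
sleep 1
t1=$(date +%s.%N); io1=$(awk -v d="$DEV" '$3 == d {print $13}' /proc/diskstats)
# divide by the real elapsed time, so a 1.05-second sample does not inflate the result
awk -v a="$io0" -v b="$io1" -v s="$t0" -v e="$t1" 'BEGIN {printf "util%%: %.2f\n", (b - a) / ((e - s) * 1000) * 100}'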
Sysbench FileIO 64 Threads
Now let’s run sysbench with 64 concurrent threads hitting the disk:
root@ts140i:/mnt/data# sysbench --threads=64 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
reads/s: 127840.59
writes/s: 0.00
fsyncs/s: 0.00
Throughput:
read, MiB/s: 1997.51
written, MiB/s: 0.00
General statistics:
total time: 600.0014s
total number of events: 76704744
Latency (ms):
min: 0.08
avg: 0.50
max: 9.34
95th percentile: 1.25
We can see the average response time has risen from 0.21ms to 0.50ms (more than two times), and the 95% response time has more than tripled, from 0.36ms to 1.25ms. From a practical standpoint, we can see some saturation starting to happen, but we’re still not seeing a linear response time increase with the increasing number of parallel operations, as we saw with CPU saturation. I guess this points to the fact that this IO device has a lot of parallel capacity inside and can process requests effectively even going from 16 to 64 concurrent threads.
Over the series of tests, as we increased concurrency from one to 64, we saw response times increase from 0.14ms to 0.5ms (approximately three times). The 95% response time grew from 0.17ms to 1.25ms (about seven times). For practical purposes, this is where the IO device's saturation starts to show.
Iostat
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 138090.20 0.00 2049791.20 0.00 29.69 65.99 0.48 0.48 0.00 0.01 100.24
We’ll skip the rest of the graphs as they basically look the same, just with higher latency and 64 requests in flight.
Sysbench FileIO 256 Threads
root@ts140i:/mnt/data# sysbench --threads=256 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
reads/s: 131558.79
writes/s: 0.00
fsyncs/s: 0.00
Throughput:
read, MiB/s: 2055.61
written, MiB/s: 0.00
General statistics:
total time: 600.0026s
total number of events: 78935828
Latency (ms):
min: 0.10
avg: 1.95
max: 17.08
95th percentile: 3.89
Iostat
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 142227.60 0.00 2112719.20 0.00 29.71 268.30 1.89 1.89 0.00 0.0
With 256 threads, we finally see the linear growth in average response time that indicates overload and queueing of requests. There is no easy way to tell if it is due to saturation of the IO bus (we’re reading 2GB/sec here) or of the device's internal processing capacity.
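One factor you can at least check directly is the PCIe link itself: lspci reports the negotiated link speed and width for the NVMe device, and the theoretical bus bandwidth follows from that. A hedged example (the PCI address below is hypothetical; find yours with lspci | grep -i 'non-volatile'):

root@ts140i:~# lspci -vv -s 02:00.0 | grep -E 'LnkCap|LnkSta'
# e.g. "LnkSta: Speed 8GT/s, Width x4" means PCIe 3.0 x4, roughly 3.9GB/s of usable
# bandwidth, comfortably above the ~2GB/sec we are reading here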
As we’ve seen a less than linear increase in response time going from 16 to 64 connections, and a linear increase going from 64 to 256, we can see the “optimal” concurrency for this device: somewhere between 16 and 64 connections. This allows for peak throughput without a lot of queuing.
Before we get to the summary, I want to make an important note about this particular test. It is a random reads test, which is a very important pattern for many database workloads, but it might not be the dominant load in your environment. You might be write-bound instead, or have mainly sequential IO access patterns (which could behave differently). For those other workloads, I hope this post gives you some ideas on how to analyze them.
Another Way to Think About Saturation
When I asked the Percona staff for feedback on this blog post, my colleague Yves Trudeau suggested another way to think about saturation: measure it as the percent increase in average response time compared to the single-thread baseline. Like this:
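The original graphic is not reproduced here, but the arithmetic is easy to redo from the averages reported in this post, taking the 1-thread average of 0.14ms as the baseline. A small illustrative awk snippet (the numbers are copied from the runs above):

root@ts140i:~# awk 'BEGIN {
  base = 0.14                                   # 1-thread average response time, ms
  n = split("4:0.15 16:0.21 64:0.50 256:1.95", runs)
  for (i = 1; i <= n; i++) {
    split(runs[i], r, ":")
    printf "%3d threads: avg %.2fms -> +%.0f%% vs. single thread\n", r[1], r[2], (r[2] / base - 1) * 100
  }
}'

By this measure the drive goes from essentially unsaturated at four threads (+7%) to heavily saturated at 256 threads (roughly +1300%).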
Summary
- We can see how understanding disk utilization and saturation is much more complicated than for the CPU:
- The Utilization metric (as reported by iostat and by PMM) is not very helpful for showing true storage utilization, as it only measures the time when there is at least one request in flight. If you had the same metric for the CPU, it would correspond to something running on at least one of the cores (not very useful for highly parallel systems).
- Unlike a CPU, Linux tools do not provide us with information about the structure of the underlying storage and how much parallel load it should be able to handle without saturation. Even more so, storage may well have different low-level resources that cause saturation. For example, it could be the network connection, the SATA bus, or even the kernel IO stack for older kernels and very fast storage.
- Saturation as measured by the number of requests in flight is helpful for guessing whether there might be saturation, but since we do not know how many requests the device can efficiently process concurrently, just looking at the raw metric doesn't let us determine that the device is overloaded.
- Average response time is a great metric for looking at saturation, but as with response time in general, you can't say in isolation what response time is good or bad for a given device. You need to look at it in context and compare it to a baseline. When you're looking at average response time, make sure you look at read and write request response times separately, and keep the average request size in mind to ensure you are comparing apples to apples.
Original publication date: 2017-09-11
Author: 菜鳥盟@知數堂
This article comes from 老葉茶館, a Yunqi Community partner. For more information, follow the 老葉茶館 WeChat official account.