

MOM Series: Zero Copy Demystified (Part 1)

    I have recently prepared two articles introducing zero copy, a key technique in MOM (message-oriented middleware), at both the physical and the logical level.

    In file-based MOM systems such as Kafka and ActiveMQ, and in the Journal designs of others such as HornetQ and Kestrel, zero copy shows its power everywhere. To that end I have prepared this series of articles, which I hope will lift the veil of mystery from zero copy; I also hope you enjoy it.

    This article focuses on the fundamentals of zero copy. First, through a guided reading of the English-language source material, we examine the underlying mechanism: why zero copy can improve the performance of I/O-intensive applications, and why it can cut context switches from four to two and data copies from four to three (note: only one of those copies consumes CPU cycles). Next, we briefly survey zero copy in the Java world, especially its design and implementation in Netty. Finally, a few pieces of further reading broaden the view and show some of the research on zero copy done by our peers abroad. OK, let's begin~

Zero Copy Overview


    Many Web applications serve a significant amount of static content, which amounts to reading data off of a disk and writing the exact same data back to the response socket. This activity might appear to require relatively little CPU activity, but it's somewhat inefficient: the kernel reads the data off of disk and pushes it across the kernel-user boundary to the application, and then the application pushes it back across the kernel-user boundary to be written out to the socket. In effect, the application serves as an inefficient intermediary that gets the data from the disk file to the socket.
    Each time data traverses the user-kernel boundary, it must be copied, which consumes CPU cycles and memory bandwidth. Fortunately, you can eliminate these copies through a technique called — appropriately enough — zero copy. Applications that use zero copy request that the kernel copy the data directly from the disk file to the socket, without going through the application. Zero copy greatly improves application performance and reduces the number of context switches between kernel and user mode.
    
    Below, we take data transfer as an example and compare the traditional and zero-copy transfer approaches in detail:

    Traditional approach
                                   

    Figure 1: Traditional data copying approach

    Figure 2: Traditional context switching


    The steps involved are:
    1. The read() call causes a context switch (see Figure 2) from user mode to kernel mode. Internally a sys_read() (or equivalent) is issued to read the data from the file. The first copy (see Figure 1) is performed by the direct memory access (DMA) engine, which reads file contents from the disk and stores them into a kernel address space buffer.
    2. The requested amount of data is copied from the read buffer into the user buffer, and the read() call returns. The return from the call causes another context switch from kernel back to user mode. Now the data is stored in the user address space buffer.
    3. The send() socket call causes a context switch from user mode to kernel mode. A third copy is performed to put the data into a kernel address space buffer again. This time, though, the data is put into a different buffer, one that is associated with the destination socket.
    4. The send() system call returns, creating the fourth context switch. Independently and asynchronously, a fourth copy happens as the DMA engine passes the data from the kernel buffer to the protocol engine.
    Use of the intermediate kernel buffer (rather than a direct transfer of the data into the user buffer) might seem inefficient. But intermediate kernel buffers were introduced into the process to improve performance. Using the intermediate buffer on the read side allows the kernel buffer to act as a "readahead cache" when the application hasn't asked for as much data as the kernel buffer holds. This significantly improves performance when the requested data amount is less than the kernel buffer size. The intermediate buffer on the write side allows the write to complete asynchronously.
    Unfortunately, this approach itself can become a performance bottleneck if the size of the data requested is considerably larger than the kernel buffer size. The data gets copied multiple times among the disk, kernel buffer, and user buffer before it is finally delivered to the application.
    Zero copy improves performance by eliminating these redundant data copies.
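
    To make the four copies concrete, here is a minimal Java sketch of this traditional read/send path (class and method names are illustrative, not from any particular library): copies 1 and 4 are the DMA transfers that happen inside the kernel on either side of the loop, while the loop itself performs copies 2 and 3 across the kernel-user boundary.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

// Traditional file-to-socket transfer: the application is the intermediary.
public class TraditionalCopy {
    static void serveFile(File file, Socket socket) throws IOException {
        byte[] buf = new byte[8192]; // user-space intermediary buffer
        try (InputStream in = new FileInputStream(file);
             OutputStream out = socket.getOutputStream()) {
            int n;
            while ((n = in.read(buf)) != -1) { // copy 2: kernel read buffer -> user buffer
                out.write(buf, 0, n);          // copy 3: user buffer -> kernel socket buffer
            }
        }
    }
}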

    Zero-copy approach


                                    


    Figure 3: Zero-copy data copying approach

    Figure 4: Zero-copy context switching

   The steps taken when you use transferTo() are:
   1. The transferTo() method causes the file contents to be copied into a read buffer by the DMA engine. Then the data is copied by the kernel into the kernel buffer associated with the output socket.
   2. The third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.
   This is an improvement: we've reduced the number of context switches from four to two and reduced the number of data copies from four to three (only one of which involves the CPU). But this does not yet get us to our goal of zero copy. We can further reduce the data duplication done by the kernel if the underlying network interface card supports gather operations. In Linux kernels 2.4 and later, the socket buffer descriptor was modified to accommodate this requirement. This approach not only reduces multiple context switches but also eliminates the duplicated data copies that require CPU involvement. The user-side usage still remains the same, but the intrinsics have changed:
   1. The transferTo() method causes the file contents to be copied into a kernel buffer by the DMA engine.
   2. No data is copied into the socket buffer. Instead, only descriptors with information about the location and length of the data are appended to the socket buffer. The DMA engine passes data directly from the kernel buffer to the protocol engine, thus eliminating the remaining final CPU copy.
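
   On the JVM this path is exposed as FileChannel.transferTo() (backed by sendfile() on Linux). A minimal sketch with illustrative names follows; note that transferTo() may move fewer bytes than requested, so it must be called in a loop:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

// Zero-copy file-to-socket transfer: the bytes never surface in user space.
public class ZeroCopyTransfer {
    static void serveFile(String path, SocketChannel socketChannel) throws IOException {
        try (FileChannel fileChannel = new FileInputStream(path).getChannel()) {
            long position = 0;
            long remaining = fileChannel.size();
            while (remaining > 0) {
                // Ask the kernel to move the bytes; no user-space buffer is involved.
                long sent = fileChannel.transferTo(position, remaining, socketChannel);
                position += sent;
                remaining -= sent;
            }
        }
    }
}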

   On zero-copy performance:
        Michael Santy ran some experiments (https://zeromq.org/results:copying): for 256 MB of data, a single data copy added as much as 0.1 seconds of latency. This shows how much room for improvement there is when transferring large volumes of data.
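
   For a rough feel for that number on your own hardware, the following crude sketch (illustrative and machine-dependent; run with a heap of at least ~1 GB, e.g. -Xmx1g) times a single in-memory copy of 256 MB. It says nothing about DMA or the page cache, but it shows that copies are far from free:

// Time one 256 MB in-memory copy; results vary widely by machine.
public class CopyCost {
    public static void main(String[] args) {
        byte[] src = new byte[256 * 1024 * 1024];
        byte[] dst = new byte[src.length];
        long start = System.nanoTime();
        System.arraycopy(src, 0, dst, 0, src.length);
        System.out.printf("one 256 MB copy: %.1f ms%n", (System.nanoTime() - start) / 1e6);
    }
}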

   Zero Copy in Java


      In Java, zero-copy support is concentrated mainly in FileChannel and MappedByteBuffer. Correspondingly, in Netty 4, the well-known network communication framework, zero copy centers on FileRegion and CompositeByteBuf.
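
      Since FileChannel.transferTo() was sketched above, here is a minimal MappedByteBuffer example (the file name is illustrative): the mapping shares pages with the OS page cache, so touching the buffer does not first copy the file into a separate user-space buffer.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Map a file into memory instead of read()-ing it into a heap buffer.
public class MappedRead {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println("first byte: " + buf.get(0)); // pages fault in on demand
        }
    }
}

      And a sketch of Netty's logical zero copy (assuming Netty 4.1): CompositeByteBuf presents several buffers, say a protocol header and a body, as one logical buffer without copying their contents into a new one.

import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.CharsetUtil;

// Stitch header and body together without a byte-level copy.
public class CompositeDemo {
    public static void main(String[] args) {
        ByteBuf header = Unpooled.copiedBuffer("HEAD", CharsetUtil.US_ASCII);
        ByteBuf body = Unpooled.copiedBuffer("BODY", CharsetUtil.US_ASCII);
        CompositeByteBuf message = Unpooled.compositeBuffer();
        message.addComponents(true, header, body); // true: advance the writer index
        System.out.println(message.toString(CharsetUtil.US_ASCII)); // prints HEADBODY
    }
}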

     Zero Copy Readings


      Reference 2 introduces the Sockets Direct Protocol (the authors improved the original open-source SDP implementation, adding zero-copy support for synchronous operations to the open-source implementation of TCP SOCK_STREAM semantics; they claim that with eight simultaneous connections on the same host, CPU utilization fell by a factor of eight, at the sole cost of bandwidth pressure growing from 500 MB/s to 800 MB/s). SDP is widely used in InfiniBand fabrics (a low-latency, high-bandwidth data-center interconnect that uses RDMA for high-performance IPC; it is deployed in a wide range of critical communication and computing environments such as HPC systems, large data centers, and embedded applications).
      Reference 3 describes how zero-copy buffers are allocated in MoD scenarios (static or dynamic allocation, before transmission starts). Static allocation avoids per-operation allocation of memory and thereby lowers the per-packet cost (e.g., CPU cycles).
      Reference 4 presents the Advanced Data Transfer Service (ADTS), a zero-copy-based design for efficient FTP across wide-area networks; their measurements show a speed improvement of nearly 80% when transferring large data sets.
      Reference 5 divides zero copy into two forms: passive zero-copy is well suited to applications with deterministic communication timing and sizes, while active zero-copy is the opposite (it suits non-deterministic ones).
     
      OK, that's about it for this article. In the next one I will dig into zero-copy internals through code, monitoring, and other means~

References:

1. https://zeromq.org/blog:zero-copy
2. https://www.mellanox.com/pdf/whitepapers/SDP_Whitepaper.pdf
3. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.8050&rep=rep1&type=pdf
4. https://www.cse.ohio-state.edu/~subramon/Tech-Reports/ftp08-tr.pdf
5. https://cscjournals.org/csc/manuscript/Journals/IJCSS/volume6/Issue4/IJCSS-756.pdf
6. https://www.info.kochi-tech.ac.jp/yama/papers/ispdc05_active.pdf
7. https://www.linuxplumbersconf.org/2011/ocw/system/presentations/129/original/2011-09-LPC-Towards-Zero-Copy-PV-Networking.pdf
8. https://www-old.itm.uni-luebeck.de/teaching/ws1112/vs/Uebung/GrossUebungNetty/VS-WS1112-xx-Zero-Copy_Event-Driven_Servers_with_Netty.pdf?lang=de
9. https://community.emc.com/docs/DOC-20777
10. https://www.ibm.com/developerworks/library/j-zerocopy/


