"Kafka Chinese Handbook" - Quick Start (Part 1)
- 1.1 Introduction
- Topics and Logs
- Distribution
- Producers
- Consumers
- Guarantees
- Kafka as a Messaging System
- Kafka as a Storage System
- Kafka for Stream Processing
- Putting the Pieces Together
- 1.2 Use Cases
- 1.3 Quick Start
- 1.4 Ecosystem
- 1.5 Upgrading From Previous Versions
1.1 Introduction
Kafka™ is a distributed streaming platform. What exactly does that mean?
We think of a streaming platform as having three key capabilities:
- It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
- It lets you store streams of records in a fault-tolerant way.
- It lets you process streams of records as they occur.
What is Kafka good for?
It gets used for two broad classes of application:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let’s dive in and explore Kafka’s capabilities from the bottom up.
First, a few concepts:
- Kafka is run as a cluster on one or more servers.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.
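To make the Producer API above concrete, here is a minimal sketch using that Java client; the broker address, topic name, and serializer choices are assumptions made for the example, not part of the original text:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record carries a key, a value, and (implicitly) a timestamp.
            producer.send(new ProducerRecord<>("my-topic", "my-key", "hello kafka"));
        }
    }
}
```

A consumer-side counterpart appears in the Consumers section below.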
Topics and Logs
Let’s first dive into the core abstraction Kafka provides for a stream of records: the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of records that is continually appended to: a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka’s performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
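As a sketch of how the two-day policy above could be set per topic at creation time with the Java AdminClient; the broker address, topic name, partition count, and replication factor are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2 (illustrative values).
            NewTopic topic = new NewTopic("my-topic", 3, (short) 2)
                    // retention.ms set to 2 days, matching the example above.
                    .configs(Collections.singletonMap("retention.ms",
                            String.valueOf(2L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```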
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from “now”.
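A minimal sketch of rewinding a consumer with the Java client’s seek methods; the topic, partition number, and group id are assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "replay-group");            // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singleton(partition));
            // Reprocess from the beginning of the partition
            // (takes effect on the next poll)...
            consumer.seekToBeginning(Collections.singleton(partition));
            // ...or jump straight to "now" and skip the backlog:
            // consumer.seekToEnd(Collections.singleton(partition));
        }
    }
}
```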
This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to “tail” the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.
Distribution
The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the “leader” and zero or more servers which act as “followers”. The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others, so load is well balanced within the cluster.
Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
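As an illustration of a semantic partition function, here is a sketch of a custom Partitioner for the Java client that routes records by a hash of the key; the class name and hashing choice are assumptions, and the Java client’s default partitioner already hashes keys in a similar way:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Illustrative partitioner: records with the same key always land
// in the same partition, preserving per-key ordering.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // no key: this sketch simply pins keyless records to partition 0
        }
        // Mask off the sign bit so the modulo result is non-negative.
        return (java.util.Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

It could be enabled by setting the producer’s partitioner.class property to this class.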
Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
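A minimal sketch of a consumer-group subscriber using the Java client; the group id and topic are assumptions, and running several copies of this process with the same group.id spreads the topic’s partitions across them:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "my-group");                // all instances share this name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("my-topic"));
            while (true) {
                // Each record in the topic goes to exactly one instance per group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```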
[Figure] A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
More commonly, however, we have found that topics have a small number of consumer groups, one for each “logical subscriber”. Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a “fair share” of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
Guarantees
At a high level, Kafka gives the following guarantees:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees records in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
More details on these guarantees are given in the design section of the documentation.
Kafka as a Messaging System
How does Kafka’s notion of streams compare to a traditional enterprise messaging system?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren’t multi-subscriber: once one process reads the data it’s gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
The consumer group concept in Kafka generalizes these two concepts. As with a queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
The advantage of Kafka’s model is that every topic has both these properties: it can scale processing and is also multi-subscriber, so there is no need to choose one or the other.
Kafka has stronger ordering guarantees than a traditional messaging system, too.
A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of “exclusive consumer” that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism (the partition) within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
Kafka as a Storage System
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.
Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn’t considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
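For example, a sketch of the producer settings that make a send wait for full replication; the broker address and topic are assumptions, and the durability actually achieved still depends on the topic’s replication factor and the broker-side min.insync.replicas setting:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class DurableProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the leader waits until the in-sync replicas have the record
        // before acknowledging the write.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("my-topic", "key", "durable write"))
                    .get(); // block until the replicated write is acknowledged
            System.out.println("committed at offset " + meta.offset());
        }
    }
}
```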
The disk structures Kafka uses scale well: Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
Kafka for Stream Processing
It isn’t enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.
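A minimal Streams API sketch that reads one topic, transforms each value, and writes the result to another topic; the topic names and application id are assumptions, and this uses the StreamsBuilder API from newer Java clients:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        // Continuously transform each record and emit it to the output topic.
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```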
This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.
The Streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
Last updated: 2017-05-19 10:25:02