

The world beyond batch: Streaming 101

The first thing this article sets out to do is give "streaming" a proper definition.

What is streaming?

The crux of the problem is that many things that ought to be described by what they are (e.g., unbounded data processing, approximate results, etc.), have come to be described colloquially by how they historically have been accomplished (i.e., via streaming execution engines).

Our current definition of streaming is imprecise, which leads to misunderstandings about it.
For example, the belief that streaming implies low latency, approximate results, and a lack of precision.

The crux is that we have conflated what a thing essentially is with how it has historically been accomplished.

So the author gives his own definition of streaming:

I prefer to isolate the term streaming to a very specific meaning: a type of data processing engine that is designed with infinite data sets in mind. Nothing more.

He also separates out and defines the terms that commonly appear alongside streaming:

Unbounded data: A type of ever-growing, essentially infinite data set. 
This term describes a property of the data set itself, whereas streaming describes the processing engine.

Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data.
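Calling this kind of processing "streaming" implies a streaming execution engine, which is at best misleading: repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived.
A batch engine can process unbounded data through repeated runs.
Likewise, a streaming engine can process bounded data.
So this term is not equivalent to streaming either.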

Low-latency, approximate, and/or speculative results:

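The author's view is simply that batch engines were not designed with low-latency scenarios in mind; batch can still achieve low latency, and it can also produce approximate or speculative results.
Conversely, a streaming engine can trade away some latency to arrive at accurate results.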

So,

From here on out, any time I use the term “streaming,” you can safely assume I mean an execution engine designed for unbounded data sets, and nothing more.

 

What can streaming do?

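The recent rise of stream processing began with Storm, created by Twitter's Nathan Marz, which also left streaming with the labels of low-latency and inaccurate/speculative results.

To provide eventually correct results, Marz proposed the Lambda Architecture. The architecture looks simple, but it offers one way to balance consistency against availability.

The problem is just as obvious: you have to maintain two pipelines, one streaming and one batch, and that is costly.

The author describes this architecture as "a bit unsavory".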

Unsurprisingly, I was a huge fan of Jay Kreps' Questioning the Lambda Architecture post when it came out.

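So next on stage is LinkedIn's Jay Kreps, who proposed the Kafka-based Kappa Architecture.

This architecture is also simple, but it shows how to merge the two pipelines into one. More importantly, it replaces the batch pipeline with a well-designed streaming system, which the author found very inspiring.

The author's assessment of this architecture: "I'm not convinced that notion itself requires a name, but I fully support the idea in principle."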

 

Quite honestly, I’d take things a step further. 
I would argue that well-designed streaming systems actually provide a strict superset of batch functionality.

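The author goes a step further and argues that streaming is a superset of batch; in other words, batch is no longer needed and should be retired.

To beat batch, streaming only needs to do two things: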

Correctness — This gets you parity with batch.

Achieve this and streaming is at least on par with batch.

At the core, correctness boils down to consistent storage. 
Streaming systems need a method for checkpointing persistent state over time (something Kreps has talked about in his Why local state is a fundamental primitive in stream processing post), and it must be well-designed enough to remain consistent in light of machine failures.
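To make the idea of checkpointing persistent state concrete, here is a minimal sketch. It assumes a single process, a local JSON file standing in for "consistent storage", and an input that can be replayed by index; `CheckpointedCounter` and its parameters are hypothetical names, and real systems such as MillWheel or Spark Streaming are distributed and far more elaborate.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Counts records per key and checkpoints (counts, offset) together."""

    def __init__(self, path):
        self.path = path
        self.counts = {}
        self.offset = 0  # index of the next input record to process
        if os.path.exists(path):
            # Recover the last consistent state after a restart.
            with open(path) as f:
                saved = json.load(f)
            self.counts = saved["counts"]
            self.offset = saved["offset"]

    def process(self, records, checkpoint_every=2):
        # Resume from the checkpointed offset so no record is counted twice.
        for i in range(self.offset, len(records)):
            key = records[i]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset = i + 1
            if self.offset % checkpoint_every == 0:
                self.checkpoint()

    def checkpoint(self):
        # Write to a temp file and rename it over the old checkpoint, so a
        # crash leaves either the old or the new file, never a partial one.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"counts": self.counts, "offset": self.offset}, f)
        os.replace(tmp, self.path)
```

The key design choice in this sketch is that the state and the input offset are saved in the same write. If they were saved separately, a failure between the two writes could leave counts that do not match the position in the input.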

 

If you’re curious to learn more about what it takes to get strong consistency in a streaming system, I recommend you check out the MillWheel and Spark Streaming papers.

 

Tools for reasoning about time — This gets you beyond batch.

Achieve this and streaming goes beyond batch.

Good tools for reasoning about time are essential for dealing with unbounded, unordered data of varying event-time skew.

This is the author's main focus: how to process unbounded, unordered data.

In practice we often need to process data by event time rather than by processing time.

image

 

In the context of unbounded data, disorder and variable skew induce a completeness problem for event time windows: 
lacking a predictable mapping between processing time and event time, how can you determine when you’ve observed all the data for a given event time X? For many real-world data sources, you simply can’t. The vast majority of data processing systems in use today rely on some notion of completeness, which puts them at a severe disadvantage when applied to unbounded data sets.
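The completeness problem can be illustrated with a few hand-made events. This is only a sketch: the `Event` type, the timestamps, and the helper names are invented for illustration, and times are plain integers in seconds.

```python
from typing import NamedTuple


class Event(NamedTuple):
    event_time: int       # when the event actually happened
    processing_time: int  # when the pipeline observed it


def skews(events):
    """Lag of each event's processing time behind its event time."""
    return [e.processing_time - e.event_time for e in events]


def seen_by(events, window_end, now):
    """Events of the event-time window [0, window_end) observed by time `now`."""
    return [e for e in events
            if e.event_time < window_end and e.processing_time <= now]


# Hypothetical input: the third event comes from a device that was offline.
events = [
    Event(event_time=1, processing_time=2),
    Event(event_time=5, processing_time=6),
    Event(event_time=3, processing_time=40),
    Event(event_time=12, processing_time=13),
]
```

With this input the skew is small for three events and large for one. At processing time 15 the window [0, 10) looks finished, yet one of its events has not arrived, and nothing in the data observed so far reveals that.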

This problem is covered in detail in Streaming 102; it is essentially the content of the Dataflow paper.

 

Data processing patterns

Finally, the author walks through the data processing patterns in use today.

Bounded data

image

 

Unbounded data — batch

Fixed windows

image

 

Sessions

image

The difference from the fixed windows above: artificially imposed fixed windows cut sessions apart, as shown in red in the figure.
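The way fixed batch windows cut sessions can be shown with a small sketch. It assumes a session is a run of activity with no idle gap longer than `gap`, and that a batch job sees only the records of one fixed window at a time; the function names and sample timestamps are made up for illustration.

```python
def sessionize(timestamps, gap):
    """Group activity into sessions separated by more than `gap` idle time."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def split_into_batches(timestamps, size):
    """Partition timestamps into fixed windows of length `size`."""
    batches = {}
    for ts in sorted(timestamps):
        batches.setdefault(ts - ts % size, []).append(ts)
    return batches


activity = [50, 55, 62, 68, 200, 204]

# Sessions computed over the whole data set.
whole = sessionize(activity, gap=10)

# Sessions computed independently inside each 60-second batch.
per_batch = {start: sessionize(ts, gap=10)
             for start, ts in split_into_batches(activity, size=60).items()}
```

Over the whole data set there are two sessions, but the batch boundary at 60 splits the first one in two, so the per-batch run reports three. Stitching the pieces back together requires extra logic across batches.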

 

Unbounded data — streaming

In practice, unbounded data tends to have two characteristics:

  • Highly unordered with respect to event times, meaning you need some sort of time-based shuffle in your pipeline if you want to analyze the data in the context in which they occurred.
  • Of varying event time skew, meaning you can’t just assume you’ll always see most of the data for a given event time X within some constant epsilon of time Y.

For data like this, the processing approaches fall into the following categories:

Time-agnostic

Time-agnostic processing is used in cases where time is essentially irrelevant — i.e., all relevant logic is data driven.

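This is the simplest case: applications where time does not matter and so no state is needed. Operations such as map or filter fall into this category.

There is not much to say about this scenario; any streaming platform handles it well.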
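A time-agnostic pipeline can be sketched with plain Python generators, which process one record at a time and so work on an unbounded input just as well as a bounded one. The record format ("user,bytes") and the function names are assumptions made up for this example.

```python
def parse(lines):
    """Map: turn each raw 'user,bytes' line into a (user, bytes) record."""
    for line in lines:
        user, size = line.split(",")
        yield user, int(size)


def large_only(records, threshold):
    """Filter: keep only records whose size exceeds `threshold`."""
    for user, size in records:
        if size > threshold:
            yield user, size
```

Neither step looks at a timestamp or keeps anything between records, which is why ordering and skew do not matter here.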

 

Approximation algorithms

The second major category of approaches is approximation algorithms, such as approximate Top-N, streaming K-means, etc.
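As one example of this family, the sketch below uses the Misra-Gries frequent-items summary, a classic bounded-memory algorithm. It is chosen here for illustration and is not necessarily what the author had in mind by approximate Top-N.

```python
def misra_gries(stream, k):
    """Summarize a stream with at most k - 1 counters.

    Any item occurring more than n / k times in a stream of n items is
    guaranteed to be among the returned keys. The stored counts are lower
    bounds on the true frequencies, not exact values.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free slot: decrement every counter, dropping those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Memory stays bounded by `k` no matter how long the stream runs, at the cost of approximate counts and possible false positives among the surviving keys.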

 

Windowing by processing time

There are a few nice properties of processing time windowing:

  • It’s simple.
  • Judging window completeness is straightforward.
  • If you’re wanting to infer information about the source as it is observed, processing time windowing is exactly what you want.
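The properties listed above can be seen in a minimal sketch of processing-time windowing. It assumes the caller passes in the current processing time `now` and that `now` never decreases; the class name and interface are hypothetical.

```python
class ProcessingTimeWindows:
    """Buffers records into fixed windows keyed by arrival time."""

    def __init__(self, size):
        self.size = size
        self.current_start = None
        self.buffer = []

    def on_record(self, record, now):
        """Add a record observed at time `now`; return any window that closed."""
        start = now - now % self.size
        closed = []
        if self.current_start is not None and start != self.current_start:
            # The clock has moved past the current window, so it is complete.
            closed.append((self.current_start, self.buffer))
            self.buffer = []
        self.current_start = start
        self.buffer.append(record)
        return closed
```

Completeness is trivial here because a window is defined by the clock of the observer: once that clock moves on, nothing more can land in the old window. This sketch only closes a window when the next record arrives; a real engine would typically use a timer instead.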

 

Windowing by event time

Event time windowing is what you use when you need to observe a data source in finite chunks that reflect the times at which those events actually happened.

It’s the gold standard of windowing. Sadly, most data processing systems in use today lack native support for it.

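This is the approach the author adopts. He considers it the gold standard of windowing, yet current systems mostly lack native support for it because it is hard to implement. This is also the author's main contribution, which Streaming 102 describes in detail.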
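A minimal sketch of event-time windowing over out-of-order input follows. It assumes each record carries its own event timestamp and that the whole input is available at once, which sidesteps the question of when a window is complete; the function name and sample data are illustrative.

```python
from collections import defaultdict


def window_by_event_time(events, size):
    """Group (event_time, value) pairs into fixed event-time windows.

    Arrival order is irrelevant: each value lands in the window in which it
    actually happened.
    """
    windows = defaultdict(list)
    for event_time, value in events:
        windows[event_time - event_time % size].append(value)
    return dict(windows)


# Listed in arrival order; "late" happened at t=3 but arrived last.
arrivals = [(1, "a"), (12, "b"), (7, "c"), (25, "d"), (3, "late")]
```

The record that arrived last still ends up in the first window, because the window is chosen from the event timestamp rather than from the moment of arrival.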

 

Of course, powerful semantics rarely come for free, and event time windows are no exception. Event time windows have two notable drawbacks due to the fact that windows must often live longer (in processing time) than the actual length of the window itself:

Buffering: Due to extended window lifetimes, more buffering of data is required.

Completeness: Given that we often have no good way of knowing when we’ve seen all the data for a given window, how do we know when the results for the window are ready to materialize? In truth, we simply don’t.
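Both drawbacks show up in the sketch below. It assumes a simple heuristic that is not taken from the article: a window is emitted once processing time passes its end plus a fixed `allowed_lateness`. That is only a guess at completeness, and the class and parameter names are hypothetical.

```python
class EventTimeWindows:
    """Event-time windows emitted after a fixed lateness allowance."""

    def __init__(self, size, allowed_lateness):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.open = {}     # window start -> values buffered so far
        self.emitted = {}  # window start -> values at emission time
        self.dropped = []  # values that arrived after their window was emitted

    def on_record(self, event_time, value, now):
        start = event_time - event_time % self.size
        if start in self.emitted:
            # Too late: the result already emitted for this window was incomplete.
            self.dropped.append(value)
        else:
            self.open.setdefault(start, []).append(value)
        self._emit_ready(now)

    def _emit_ready(self, now):
        # Emit every open window whose deadline has passed in processing time.
        for start in sorted(self.open):
            if now >= start + self.size + self.allowed_lateness:
                self.emitted[start] = self.open.pop(start)
```

A larger `allowed_lateness` keeps more windows open at once (the buffering cost), while any finite value can still be beaten by a sufficiently late record (the completeness problem).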

Last updated: 2017-04-07 21:05:52
