Flink DataStream API Programming Guide

Example Program

The following program is a complete, working example of streaming window word count application, that counts the words coming from a web socket in 5 second windows.

public class WindowWordCount {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> dataStream = env
                .socketTextStream("localhost", 9999)
                .flatMap(new Splitter())
                .keyBy(0)
                .timeWindow(Time.of(5, TimeUnit.SECONDS))
                .sum(1);

        dataStream.print();

        env.execute("Window WordCount");
    }
    
    public static class Splitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
            for (String word: sentence.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
    
}

Flink應用的代碼結構如下，

Flink DataStream programs look like regular Java programs with a main() method. Each program consists of the same basic parts:

Obtaining a StreamExecutionEnvironment,
Connecting to data stream sources,
Specifying transformations on the data streams,
Specifying output for the processed data,
Executing the program.

以這個例子，說明

首先會創建socketTextStream，從socket讀入text流

接著是個flatMap，和map的不同是，map，1->1，而flatMap為1->n，而這個splitter就是將text用“”分割，將每個word作為一個tuple輸出

最後，keyBy產生一個有key的tuple流，這裏是以word為key

基於5s的timeWindow，對後麵的計數進行sum

最終，output是print

Transformations

太常用的就不列了

==============================================================================================

Reduce
KeyedStream → DataStream

A "rolling" reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.

keyedStream.reduce(new ReduceFunction<Integer>() {
    @Override
    public Integer reduce(Integer value1, Integer value2)
    throws Exception {
        return value1 + value2;
    }
});

Fold
KeyedStream → DataStream

A "rolling" fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.

A fold function that, when applied on the sequence (1,2,3,4,5), emits the sequence "start-1", "start-1-2", "start-1-2-3", ...

DataStream<String> result = 
  keyedStream.fold("start", new FoldFunction<Integer, String>() {
    @Override
    public String fold(String current, Integer value) {
        return current + "-" + value;
    }
  });

Fold和reduce的區別，fold可以有個初始值，而且foldfunciton可以將一種類型fold到另一種類型

而reduce function，隻能是一種類型

Aggregations
KeyedStream → DataStream

Rolling aggregations on a keyed data stream.
The difference between min and minBy is that min returns the minimun value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy).

keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");

可以認為是特殊的reduce

不帶by，隻是返回value

帶by，返回整個element

=============================================================================================

Union
DataStream* → DataStream

Union of two or more data streams creating a new stream containing all the elements from all the streams. Node: If you union a data stream with itself you will get each element twice in the resulting stream.

dataStream.union(otherStream1, otherStream2, ...);

Connect
DataStream,DataStream → ConnectedStreams

"Connects" two data streams retaining their types. Connect allowing for shared state between the two streams.

DataStream<Integer> someStream = //...
DataStream<String> otherStream = //...

ConnectedStreams<Integer, String> connectedStreams = someStream.connect(otherStream);

connect就是兩個不同type的流可以共享一個流，tuple可以同時拿到來自兩個流的數據

CoMap, CoFlatMap
ConnectedStreams → DataStream

Similar to map and flatMap on a connected data stream

connectedStreams.map(new CoMapFunction<Integer, String, Boolean>() {
    @Override
    public Boolean map1(Integer value) {
        return true;
    }

    @Override
    public Boolean map2(String value) {
        return false;
    }
});

Split
DataStream → SplitStream

Split the stream into two or more streams according to some criterion.

SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<String>();
        if (value % 2 == 0) {
            output.add("even");
        }
        else {
            output.add("odd");
        }
        return output;
    }
});

Select
SplitStream → DataStream

Select one or more streams from a split stream.

SplitStream<Integer> split;
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even","odd");

====================================================================================

Project
DataStream → DataStream

Selects a subset of fields from the tuples

DataStream<Tuple3<Integer, Double, String>> in = // [...]
DataStream<Tuple2<String, Integer>> out = in.project(2,0);

===========================================================================================

Window
KeyedStream → WindowedStream

Windows can be defined on already partitioned KeyedStreams. Windows group the data in each key according to some characteristic (e.g., the data that arrived within the last 5 seconds). See windows for a complete description of windows.

dataStream.keyBy(0).window(TumblingTimeWindows.of(Time.of(5, TimeUnit.SECONDS))); // Last 5 seconds of data

基於keyedStream的window

WindowAll
DataStream → AllWindowedStream

Windows can be defined on regular DataStreams. Windows group all the stream events according to some characteristic (e.g., the data that arrived within the last 5 seconds). See windows for a complete description of windows.

WARNING: This is in many cases a non-parallel transformation. All records will be gathered in one task for the windowAll operator.

dataStream.windowAll(TumblingTimeWindows.of(Time.of(5, TimeUnit.SECONDS))); // Last 5 seconds of data

主要，由於沒有key，所以如果要對all做transform，是無法parallel的，隻能在一個task裏麵做

Window Apply
WindowedStream → DataStream
AllWindowedStream → DataStream

Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window.

Note: If you are using a windowAll transformation, you need to use an AllWindowFunction instead.

Window Reduce
WindowedStream → DataStream

Applies a functional reduce function to the window and returns the reduced value.

Aggregations on windows
WindowedStream → DataStream

Aggregates the contents of a window. The difference between min and minBy is that min returns the minimun value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy).

windowedStream.sum(0);
windowedStream.sum("key");

Window Join
DataStream,DataStream → DataStream

Join two data streams on a given key and a common window.

dataStream.join(otherStream)
    .where(0).equalTo(1)
    .window(TumblingTimeWindows.of(Time.of(3, TimeUnit.SECONDS)))
    .apply (new JoinFunction () {...});

Physical partitioning

類似storm的group方式，可以自己配置

Hash partitioning，等同於 groupby field
DataStream → DataStream

Identical to keyBy but returns a DataStream instead of a KeyedStream.

dataStream.partitionByHash("someKey");
dataStream.partitionByHash(0);

Custom partitioning
DataStream → DataStream

Uses a user-defined Partitioner to select the target task for each element.

dataStream.partitionCustom(new Partitioner(){...}, "someKey");
dataStream.partitionCustom(new Partitioner(){...}, 0);

Random partitioning，等同於shuffle
DataStream → DataStream

Partitions elements randomly according to a uniform distribution.

dataStream.partitionRandom();

Rebalancing (Round-robin partitioning)
DataStream → DataStream

Partitions elements round-robin, creating equal load per partition. Useful for performance optimization in the presence of data skew.

dataStream.rebalance();

這個保證數據不會skew，round-robin就是每個一條，輪流來

Broadcasting，等同於globle
DataStream → DataStream

Broadcasts elements to every partition.

dataStream.broadcast();

Task chaining and resource groups

Chaining two subsequent transformations means co-locating them within the same thread for better performance.
Flink by default chains operators if this is possible (e.g., two subsequent map transformations).

The API gives fine-grained control over chaining if desired:

A resource group is a slot in Flink, see slots. You can manually isolate operators in separate slots if desired.

Start new chain

Begin a new chain, starting with this operator. The two mappers will be chained, and filter will not be chained to the first mapper.

someStream.filter(...).map(...).startNewChain().map(...);

注意startNewChain是應用於，左邊的那個operator，所以上麵從第一個map開始start new chain

Disable chaining

Do not chain the map operator

someStream.map(...).disableChaining();

Start a new resource group

Start a new resource group containing the map and the subsequent operators.

someStream.map(...).startNewResourceGroup();

意思就是他們share同一個slot？

Isolate resources

Isolate the operator in its own slot.

someStream.map(...).isolateResources();

使用獨立的slot

Execution Configuration

隻有下麵兩個和batch的配置不同，

Parameters in the ExecutionConfig that pertain specifically to the DataStream API are:

enableTimestamps() / disableTimestamps(): Attach a timestamp to each event emitted from a source.areTimestampsEnabled()returns the current value.
setAutoWatermarkInterval(long milliseconds): Set the interval for automatic watermark emission. You can get the current value withlong getAutoWatermarkInterval()

Debugging

A LocalEnvironment is created and used as follows:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

DataStream<String> lines = env.addSource(/* some source */);
// build your program

env.execute();

Collection data sources can be used as follows:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

// Create a DataStream from a list of elements
DataStream<Integer> myInts = env.fromElements(1, 2, 3, 4, 5);

// Create a DataStream from any Java collection
List<Tuple2<String, Integer>> data = ...
DataStream<Tuple2<String, Integer>> myTuples = env.fromCollection(data);

// Create a DataStream from an Iterator
Iterator<Long> longIt = ...
DataStream<Long> myLongs = env.fromCollection(longIt, Long.class);

Flink also provides a sink to collect DataStream results for testing and debugging purposes. It can be used as follows:

import org.apache.flink.contrib.streaming.DataStreamUtils

DataStream<Tuple2<String, Integer>> myResult = ...
Iterator<Tuple2<String, Integer>> myOutput = DataStreamUtils.collect(myResult)

Windows

Working with Time

3種時間，

Processing time，真正的處理時間

Event time，事件真正發生的時間

Ingestion time，數據進入flink時間，在data source

env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

默認是用processing 時間，

如果要用event time，you need to follow four steps:

Set env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
Use DataStream.assignTimestamps(...) in order to tell Flink how timestamps relate to events (e.g., which record field is the timestamp)
Set enableTimestamps(), as well the interval for watermark emission (setAutoWatermarkInterval(long milliseconds)) inExecutionConfig.

For example, assume that we have a data stream of tuples, in which the first field is the timestamp (assigned by the system that generates these data streams), and we know that the lag between the current processing time and the timestamp of an event is never more than 1 second:

DataStream<Tuple4<Long,Integer,Double,String>> stream = //...
stream.assignTimestamps(new TimestampExtractor<Tuple4<Long,Integer,Double,String>>{
    @Override
    public long extractTimestamp(Tuple4<Long,Integer,Double,String> element, long currentTimestamp) {
        return element.f0;
    }

    @Override
    public long extractWatermark(Tuple4<Long,Integer,Double,String> element, long currentTimestamp) {
        return element.f0 - 1000;
    }

    @Override
    public long getCurrentWatermark() {
        return Long.MIN_VALUE;
    }
});

Basic Window Constructs

Tumbling time window，非滑動
KeyedStream → WindowedStream

Defines a window of 5 seconds, that "tumbles".

keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS));

Sliding time window，滑動
KeyedStream → WindowedStream

Defines a window of 5 seconds, that "slides" by 1 seconds.

keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS));

Tumbling count window
KeyedStream → WindowedStream

keyedStream.countWindow(1000);

Sliding count window
KeyedStream → WindowedStream

keyedStream.countWindow(1000, 100)

Advanced Window Constructs

The general recipe for building a custom window is to specify (1) a WindowAssigner, (2) a Trigger (optionally), and (3) anEvictor (optionally).

上麵的如timeWindow，是封裝好的，而如果用advanced構建方式，需要3步，

1. 首先是WindowAssigner，主要是滑動和非滑動兩類，解決主要的是where的問題

Global window
KeyedStream → WindowedStream

All incoming elements of a given key are assigned to the same window. The window does not contain a default trigger, hence it will never be triggered if a trigger is not explicitly specified.

stream.window(GlobalWindows.create());

用於count window

Tumbling time windows
KeyedStream → WindowedStream

stream.window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)));

The window comes with a default trigger. For event/ingestion time, a window is triggered when a watermark with value higher than its end-value is received, 

whereas for processing time when the current processing time exceeds its current end value.

默認的trigger，
先理解watermark的含義：當我收到一個watermark時，表示我不可能收到event time 小於該water mark的數據
所以我收到的water mark都大於我window的結束時間，說明，window的數據已經到齊了，可以觸發trigger

Sliding time windows
KeyedStream → WindowedStream

stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)));

默認的trigger與上同，

2. 第二步，是定義trigger，何時觸發，解決的是when的問題

The Trigger specifies when the function that comes after the window clause (e.g., sum, count) is evaluated (“fires”) for each window.
If a trigger is not specified, a default trigger for each window type is used (that is part of the definition of theWindowAssigner).

Processing time trigger

A window is fired when the current processing time exceeds its end-value. The elements on the triggered window are henceforth discarded.

windowedStream.trigger(ProcessingTimeTrigger.create());

Watermark trigger

A window is fired when a watermark with value that exceeds the window's end-value has been received. The elements on the triggered window are henceforth discarded.

windowedStream.trigger(EventTimeTrigger.create());

Continuous processing time trigger

A window is periodically considered for being fired (every 5 seconds in the example). The window is actually fired only when the current processing time exceeds its end-value. The elements on the triggered window are retained.

windowedStream.trigger(ContinuousProcessingTimeTrigger.of(Time.of(5, TimeUnit.SECONDS)));

Continuous watermark time trigger

A window is periodically considered for being fired (every 5 seconds in the example). A window is actually fired when a watermark with value that exceeds the window's end-value has been received. The elements on the triggered window are retained.

windowedStream.trigger(ContinuousEventTimeTrigger.of(Time.of(5, TimeUnit.SECONDS)));

這個和上麵的不同，在於，window在觸發後，不會被discard，而是會保留，並且每隔一段時間會反複的觸發

Count trigger

A window is fired when it has more than a certain number of elements (1000 below). The elements of the triggered window are retained.

windowedStream.trigger(CountTrigger.of(1000));

按count觸發，window會被保留

Purging trigger

Takes any trigger as an argument and forces the triggered window elements to be "purged" (discarded) after triggering.

windowedStream.trigger(PurgingTrigger.of(CountTrigger.of(1000)));

上麵有些trigger是會retain數據的，如果你想discard，怎麼搞？用PurgingTrigger

Delta trigger

A window is periodically considered for being fired (every 5000 milliseconds in the example). A window is actually fired when the value of the last added element exceeds the value of the first element inserted in the window according to a `DeltaFunction`.

windowedStream.trigger(new DeltaTrigger.of(5000.0, new DeltaFunction<Double>() {
    @Override
    public double getDelta (Double old, Double new) {
        return (new - old > 0.01);
    }
}));

Delta trigger，即，每次會通過getDelta比較新來的值和舊值的delta，當delta大於定義的閾值時，就會fire

3. 最後，指定Evictor

After the trigger fires, and before the function (e.g., sum, count) is applied to the window contents, an optional Evictorremoves some elements from the beginning of the window before the remaining elements are passed on to the function.

說白了，當windows被觸發時，我們可以選取部分數據進行處理，

evictor，清除者，即清除部分數據，保留你想要的

Time evictor

Evict all elements from the beginning of the window, so that elements from end-value - 1 second until end-value are retained (the resulting window size is 1 second).

triggeredStream.evictor(TimeEvictor.of(Time.of(1, TimeUnit.SECONDS)));

Count evictor

Retain 1000 elements from the end of the window backwards, evicting all others.

triggeredStream.evictor(CountEvictor.of(1000));

邏輯是保留，而不是清除，比如CountEvictor.of(1000)是保留最後1000個，有點不好理解

Delta evictor

Starting from the beginning of the window, evict elements until an element with value lower than the value of the last element is found (by a threshold and a DeltaFunction).

triggeredStream.evictor(DeltaEvictor.of(5000, new DeltaFunction<Double>() {
  public double (Double oldValue, Double newValue) {
      return newValue - oldValue;
  }
}));

Recipes for Building Windows

下麵給出一些window定義的例子，理解一下，例子給的太簡單

Windows on Unkeyed Data Streams

window，也可以用於unkeyed的數據流，

不同，是在window後麵加上all，

Tumbling time window all
DataStream → WindowedStream

Defines a window of 5 seconds, that "tumbles". This means that elements are grouped according to their timestamp in groups of 5 second duration, and every element belongs to exactly one window. The notion of time used is controlled by the StreamExecutionEnvironment.

nonKeyedStream.timeWindowAll(Time.of(5, TimeUnit.SECONDS));

Sliding time window all
DataStream → WindowedStream

Defines a window of 5 seconds, that "slides" by 1 seconds. This means that elements are grouped according to their timestamp in groups of 5 second duration, and elements can belong to more than one window (since windows overlap by at least 4 seconds) The notion of time used is controlled by the StreamExecutionEnvironment.

nonKeyedStream.timeWindowAll(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS));

Tumbling count window all
DataStream → WindowedStream

Defines a window of 1000 elements, that "tumbles". This means that elements are grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements, and every element belongs to exactly one window.

nonKeyedStream.countWindowAll(1000)

Sliding count window all
DataStream → WindowedStream

Defines a window of 1000 elements, that "slides" every 100 elements. This means that elements are grouped according to their arrival time (equivalent to processing time) in groups of 1000 elements, and every element can belong to more than one window (as windows overlap by at least 900 elements).

nonKeyedStream.countWindowAll(1000, 100)

Working with State

All transformations in Flink may look like functions (in the functional processing terminology), but are in fact stateful operators.

You can make everytransformation (map, filter, etc) stateful by declaring local variables or using Flink’s state interface.

You can register any local variable as managedstate by implementing an interface.

In this case, and also in the case of using Flink’s native state interface, Flink will automatically take consistent snapshots of your state periodically, and restore its value in the case of a failure.

The end effect is that updates to any form of state are the same under failure-free execution and execution under failures.

First, we look at how to make local variables consistent under failures, and then we look at Flink’s state interface.

By default state checkpoints will be stored in-memory at the JobManager. For proper persistence of large state, Flink supports storing the checkpoints on file systems (HDFS, S3, or any mounted POSIX file system), which can be configured in the flink-conf.yaml or viaStreamExecutionEnvironment.setStateBackend(…).

這塊是Flink流式處理的核心價值，可以方便的checkpoint的local state，有幾種方式，後麵會具體說；

默認情況下，這些checkpoints 是存儲在JobManager的內存中的，當然也可以配置checkpoint到文件係統

Checkpointing Local Variables

這個比較好理解

Local variables can be checkpointed by using the Checkpointed interface.

When the user-defined function implements the Checkpointed interface, the snapshotState(…) and restoreState(…) methods will be executed to draw and restore function state.

public class CounterSum implements ReduceFunction<Long>, Checkpointed<Long> {

    // persistent counter
    private long counter = 0;

    @Override
    public Long reduce(Long value1, Long value2) {
        counter++;
        return value1 + value2;
    }

    // regularly persists state during normal operation
    @Override
    public Serializable snapshotState(long checkpointId, long checkpointTimestamp) {
        return counter;
    }

    // restores state on recovery from failure
    @Override
    public void restoreState(Long state) {
        counter = state;
    }
}

如上，隻是實現snapshotState和restoreState，就可以對local變量counter實現checkpoint，這個很好理解

n addition to that, user functions can also implement the CheckpointNotifier interface to receive notifications on completed checkpoints via thenotifyCheckpointComplete(long checkpointId) method. Note that there is no guarantee for the user function to receive a notification if a failure happens between checkpoint completion and notification. The notifications should hence be treated in a way that notifications from later checkpoints can subsume missing notifications.、

除此，還能實現CheckpointNotifier ，這樣當完成checkpoints時，會調用notifyCheckpointComplete，但不能保證一定觸發

Using the Key/Value State Interface

這個是顯式調用state interface

The state interface gives access to key/value states, which are a collection of key/value pairs.
Because the state is partitioned by the keys (distributed accross workers), it can only be used on the KeyedStream, created via stream.keyBy(…) (which means also that it is usable in all types of functions on keyed windows).

The handle to the state can be obtained from the function’s RuntimeContext.
The state handle will then give access to the value mapped under the key of the current record or window - each key consequently has its own value.

The following code sample shows how to use the key/value state inside a reduce function.
When creating the state handle, one needs to supply a name for that state (a function can have multiple states of different types), the type of the state (used to create efficient serializers), and the default value (returned as a value for keys that do not yet have a value associated).

public class CounterSum implements RichReduceFunction<Long> {

    /** The state handle */
    private OperatorState<Long> counter;

    @Override
    public Long reduce(Long value1, Long value2) {
        counter.update(counter.value() + 1);
        return value1 + value2;
    }

    @Override
    public void open(Configuration config) {
        counter = getRuntimeContext().getKeyValueState("myCounter", Long.class, 0L);
    }
}

State updated by this is usually kept locally inside the flink process (unless one configures explicitly an external state backend). This means that lookups and updates are process local and this very fast.

The important implication of having the keys set implicitly is that it forces programs to group the stream by key (via thekeyBy() function), making the key partitioning transparent to Flink. That allows the system to efficiently restore and redistribute keys and state.

The Scala API has shortcuts that for stateful map() or flatMap() functions on KeyedStream, which give the state of the current key as an option directly into the function, and return the result with a state update:

val stream: DataStream[(String, Int)] = ...

val counts: DataStream[(String, Int)] = stream
  .keyBy(_._1)
  .mapWithState((in: (String, Int), count: Option[Int]) =>
    count match {
      case Some(c) => ( (in._1, c), Some(c + in._2) )
      case None => ( (in._1, 0), Some(in._2) )
    })

State Checkpoints in Iterative Jobs

Flink currently only provides processing guarantees for jobs without iterations. Enabling checkpointing on an iterative job causes an exception. In order to force checkpointing on an iterative program the user needs to set a special flag when enabling checkpointing:env.enableCheckpointing(interval, force = true).

Please note that records in flight in the loop edges (and the state changes associated with them) will be lost during failure.

對於iterative，即有環的case，做checkpoint更加複雜點，並且恢複後，會丟失中間過程，比如n次迭代，執行到n-1次，失敗，還是要從1開始

Iterations

For example, here is program that continuously subtracts 1 from a series of integers until they reach zero:

DataStream<Long> someIntegers = env.generateSequence(0, 1000);
        
IterativeStream<Long> iteration = someIntegers.iterate();

DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
  @Override
  public Long map(Long value) throws Exception {
    return value - 1 ;
  }
});

DataStream<Long> stillGreaterThanZero = minusOne.filter(new FilterFunction<Long>() {
  @Override
  public boolean filter(Long value) throws Exception {
    return (value > 0);
  }
});

iteration.closeWith(stillGreaterThanZero);

DataStream<Long> lessThanZero = minusOne.filter(new FilterFunction<Long>() {
  @Override
  public boolean filter(Long value) throws Exception {
    return (value <= 0);
  }
});

這個直接看例子，

首先，someIntegers是一個由0到1000的DataStream

對於每個tuple，都需要迭代的執行一個map function，在這兒，會不斷減一

什麼時候結束，

根據iteration.closeWith，closeWith後麵是一個filter，如果filter返回為true，這個tuple就繼續iterate，如果返回為false，就close iterate

而最後的lessThanZero是someIntegers經過iterate後，最終產生的輸出DataStream

Connectors

Connectors provide code for interfacing with various third-party systems.

Currently these systems are supported:

Apache Kafka (sink/source)
Elasticsearch (sink)
Hadoop FileSystem (sink)
RabbitMQ (sink/source)
Twitter Streaming API (source)

To run an application using one of these connectors, additional third party components are usually required to be installed and launched, e.g. the servers for the message queues. Further instructions for these can be found in the corresponding subsections. Docker containers are also provided encapsulating these services to aid users getting started with connectors.

隻看下kafka，

Then, import the connector in your maven project:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka</artifactId>
  <version>0.10.2</version>
</dependency>

使用的例子，

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
DataStream<String> stream = env
    .addSource(new FlinkKafkaConsumer082<>("topic", new SimpleStringSchema(), properties))
    .print();

如何fault tolerance？

With Flink’s checkpointing enabled, the Flink Kafka Consumer will consume records from a topic and periodically checkpoint all its Kafka offsets, together with the state of other operations, in a consistent manner. In case of a job failure, Flink will restore the streaming program to the state of the latest checkpoint and re-consume the records from Kafka, starting from the offsets that where stored in the checkpoint.

原理就是會和其他state一起把所有的kafka partition的offset都checkpoint下來，這樣恢複的時候，可以從這些offset開始讀；

To use fault tolerant Kafka Consumers, checkpointing of the topology needs to be enabled at the execution environment:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // checkpoint every 5000 msecs

If checkpointing is not enabled, the Kafka consumer will periodically commit the offsets to Zookeeper.

由於用的是simple consumer，所以就算不開checkpoint，offset也要被記錄；這裏使用通常的做法把kafka的offset記錄到zookeeper

也可以把數據寫入kafka，FlinkKafkaProducer

The FlinkKafkaProducer writes data to a Kafka topic. The producer can specify a custom partitioner that assigns recors to partitions.

tream.addSink(new FlinkKafkaProducer<String>("localhost:9092", "my-topic", new SimpleStringSchema()));

最後更新：2017-04-07 21:05:50

Flink DataStream API Programming Guide

Example Program

Transformations

Physical partitioning

Task chaining and resource groups

Execution Configuration

Debugging

Windows

Working with State

Iterations

Connectors

上一篇： HybridTime - Accessible Global Consistency with High Clock Uncertainty

下一篇： Docker

相關內容

熱門內容

最新內容