
The Right Way to Use the Flume Log4j Appender

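We use Flume-ng's LoadBalancingLog4jAppender to ship logs from our online services to the log servers in real time, where they are handed to the alerting system and stored in HDFS.
Flume's Log4j appender must be wrapped in Log4j's asynchronous appender (AsyncAppender); otherwise, as soon as a log server goes down, the application servers go down with it.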

Pitfalls we hit along the way

Problem 1: even with AsyncAppender, a log server outage blocks the business system

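After reading Flume's RPC source code and the implementation of LoadBalancingLog4jAppender, I found that the problem actually lies in Log4j's asynchronous appender, AsyncAppender.
The root cause: when the log server is down, the consumer side cannot drain events fast enough, and once the buffer is full AsyncAppender blocks the calling thread. Setting Blocking=false fixes this.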
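To make the difference between a blocking and a non-blocking bounded buffer concrete, here is a small sketch using only the JDK. It is an analogy, not AsyncAppender's actual implementation: the class `BoundedBufferDemo` and its methods are hypothetical, and the capacity of 128 is chosen to mirror AsyncAppender's default buffer size. No consumer ever drains the queue, which stands in for a dead log server.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Analogy only: a bounded buffer with no consumer, i.e. a "dead log server". */
class BoundedBufferDemo {

    /** Non-blocking style: never waits; returns how many events were dropped. */
    static int appendNonBlocking(BlockingQueue<String> buffer, int events) {
        int dropped = 0;
        for (int i = 0; i < events; i++) {
            // offer() returns false immediately when the buffer is full
            if (!buffer.offer("event-" + i)) {
                dropped++;
            }
        }
        return dropped;
    }

    /** Blocking style: returns true if the caller stalled waiting for free space. */
    static boolean appendBlocking(BlockingQueue<String> buffer, long waitMillis) {
        try {
            // A truly blocking append would wait forever; the timeout keeps the demo finite.
            return !buffer.offer("event", waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return true;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(128);
        System.out.println("dropped = " + appendNonBlocking(buffer, 200)); // dropped = 72
        System.out.println("stalled = " + appendBlocking(buffer, 100));    // stalled = true
    }
}
```

The trade-off is visible in the two results: the non-blocking path keeps the application thread moving but silently loses events once the buffer is full, while the blocking path keeps every event but stalls the caller for as long as nothing drains the buffer.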

Problem 2: Flume Log4j's reconnect-on-failure strategy misbehaves

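When one of the log servers goes down, the remaining log servers keep receiving connection-exception logs without pause. Clearly the retry interval is too short. In LoadBalancingRpcClient: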

    while (it.hasNext()) {
      HostInfo host = it.next();
      try {
        RpcClient client = getClient(host);
        client.append(event); 
        eventSent = true;
        break;
      } catch (Exception ex) {
        selector.informFailure(host); // flag this host as failed when it is down
        LOGGER.warn("Failed to send event to host " + host, ex);
      }
    }

Flume does not enable back off by default, which means the selector.informFailure(host) call does nothing at all. A nasty trap. OrderSelector.java:

  public void informFailure(T failedObject) {
    //If there is no backoff this method is a no-op.
    if (!shouldBackOff) {
      return;
    }
    //Temporarily remove this host from the list of available hosts
    ...

So the fix is to configure max back off.
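As a rough illustration of what enabling back off can look like, the sketch below builds the property set for Flume's load-balancing RPC client by hand. The keys (`client.type`, `hosts`, `hosts.<alias>`, `host-selector`, `backoff`, `maxBackoff`) follow the load-balancing client section of the Flume developer guide as far as I know, so check them against the version you run. The class `LoadBalancingClientProps`, the host names and the 30 s cap are placeholders.

```java
import java.util.Properties;

/** Hypothetical helper: assembles load-balancing RPC client properties with back off enabled. */
class LoadBalancingClientProps {

    static Properties build(String[] hostPorts, long maxBackoffMillis) {
        Properties props = new Properties();
        props.setProperty("client.type", "default_loadbalance");

        // Each host gets an alias (h1, h2, ...) listed in "hosts" and resolved via "hosts.<alias>".
        StringBuilder aliases = new StringBuilder();
        for (int i = 0; i < hostPorts.length; i++) {
            String alias = "h" + (i + 1);
            if (i > 0) {
                aliases.append(' ');
            }
            aliases.append(alias);
            props.setProperty("hosts." + alias, hostPorts[i]);
        }
        props.setProperty("hosts", aliases.toString());
        props.setProperty("host-selector", "round_robin");

        // Without backoff=true, informFailure() is a no-op (see the excerpt above).
        props.setProperty("backoff", "true");
        props.setProperty("maxBackoff", String.valueOf(maxBackoffMillis));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder hosts, not real servers.
        String[] hosts = {"log1.example.org:41414", "log2.example.org:41414"};
        build(hosts, 30_000L).list(System.out);
    }
}
```

With the Flume SDK on the classpath, a property set like this should be usable with `RpcClientFactory.getInstance(props)`. When going through the Log4j appender instead, the Flume user guide lists a `MaxBackoff` appender property that serves the same purpose.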

Problem 3: Flume Log4j's reconnect back off stays stuck at 2000 ms

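The symptom: even with max back off configured, the reconnect delay stayed at 2000 ms every time. I looked at the algorithm, which is exponential back off. It lives in the informFailure function of OrderSelector.java.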

  public void informFailure(T failedObject) {
    //If there is no backoff this method is a no-op.
    if (!shouldBackOff) {
      return;
    }
    FailureState state = stateMap.get(failedObject);
    long now = System.currentTimeMillis();
    long delta = now - state.lastFail;
    long lastBackoffLength = Math.min(maxTimeout, 1000 * (1 << state.sequentialFails));
    long allowableDiff = lastBackoffLength + CONSIDER_SEQUENTIAL_RANGE;
    if (allowableDiff > delta) {
      if (state.sequentialFails < EXP_BACKOFF_COUNTER_LIMIT) {
        state.sequentialFails++;
      }
    } else {
      state.sequentialFails = 1;
    }
    state.lastFail = now;
    //Depending on the number of sequential failures this component had, delay
    //its restore time. Each time it fails, delay the restore by 1000 ms,
    //until the maxTimeOut is reached.
    state.restoreTime = now + Math.min(maxTimeout, 1000 * (1 << state.sequentialFails));
  }

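The resulting restoreTime is the time of the next retry. I had not set the avro connect timeout or request timeout; both default to 20 s, which is on the long side. Under this algorithm, delta is therefore always greater than 40 s, while allowableDiff is only ever 3 s or 4 s, so sequentialFails is reset to 1 on every failure. I simply changed the condition to allowableDiff < delta, and after that the back off grew as expected. One problem remains: sequentialFails is never reset after a period of time.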
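The arithmetic can be replayed with a simulated clock. The sketch below copies the decision logic from the informFailure excerpt above into a hypothetical class `BackoffSim`. Two values are assumptions: CONSIDER_SEQUENTIAL_RANGE = 2000 ms (consistent with the 3 s and 4 s allowableDiff values reported here) and EXP_BACKOFF_COUNTER_LIMIT = 16. Failures are spaced 40 s apart to model the two 20 s timeouts, with a 30 s maxTimeout as a placeholder.

```java
import java.util.Arrays;

/** Hypothetical replay of the informFailure() logic with a simulated clock. */
class BackoffSim {
    // Assumed constants, see the note above the code.
    static final long CONSIDER_SEQUENTIAL_RANGE = 2000;
    static final int EXP_BACKOFF_COUNTER_LIMIT = 16;

    /**
     * Replays `failures` failures spaced `intervalMillis` apart and returns the
     * back off (in ms) computed after each one. `flipped` selects the modified
     * condition (allowableDiff < delta) instead of the original one.
     */
    static long[] simulate(int failures, long intervalMillis, long maxTimeout, boolean flipped) {
        long[] backoffs = new long[failures];
        long lastFail = 0;
        int sequentialFails = 0;
        long now = 1_000_000; // simulated clock, stands in for System.currentTimeMillis()
        for (int i = 0; i < failures; i++) {
            long delta = now - lastFail;
            long lastBackoffLength = Math.min(maxTimeout, 1000L * (1 << sequentialFails));
            long allowableDiff = lastBackoffLength + CONSIDER_SEQUENTIAL_RANGE;
            boolean countsAsSequential = flipped ? allowableDiff < delta : allowableDiff > delta;
            if (countsAsSequential) {
                if (sequentialFails < EXP_BACKOFF_COUNTER_LIMIT) {
                    sequentialFails++;
                }
            } else {
                sequentialFails = 1;
            }
            lastFail = now;
            backoffs[i] = Math.min(maxTimeout, 1000L * (1 << sequentialFails));
            now += intervalMillis;
        }
        return backoffs;
    }

    public static void main(String[] args) {
        // Original condition, failures 40 s apart: [2000, 2000, 2000, 2000, 2000]
        System.out.println(Arrays.toString(simulate(5, 40_000, 30_000, false)));
        // Flipped condition, failures 40 s apart: [2000, 4000, 8000, 16000, 30000]
        System.out.println(Arrays.toString(simulate(5, 40_000, 30_000, true)));
    }
}
```

With the original condition the back off only grows when failures arrive within allowableDiff of each other, for example every 500 ms. With 40 s between failures that never happens, which matches the constant 2000 ms observed. The flipped condition does grow the delay up to maxTimeout, but as noted it has no path that resets sequentialFails once failures are far apart.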

Problem 4: Log4j's AsyncAppender loses log data

AsyncAppender's default buffer size is 128, and once it is full, data is lost. Increase the buffer size; the avro connect timeout and request timeout also need to be tuned accordingly.

In addition, because our alerting system must receive alert logs in chronological order, I wrote a FlumeFailoverAppender:

https://github.com/EdwardsBean/FlumeFailoverLog4jAppender

This blog has moved to: https://edwardsbean.github.io

Last updated: 2017-04-03 12:54:48
