
Verifying Approximate Histogram Computation in Hive

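A histogram shows how data is distributed more intuitively, and with one in hand, execution parameters and execution plans can be tuned in a more targeted way. Computing an exact histogram, however, takes an enormous amount of computation, so an approximation that is reasonably accurate becomes very valuable.
Hive currently implements approximate histogram computation for numeric data. The implementation notes for NumericHistogram read as follows: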

/**
 * A generic, re-usable histogram class that supports partial aggregations.
 * The algorithm is a heuristic adapted from the following paper:
 * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
 * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
 * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
 * of histogram bins.
 */

Interested readers can consult the paper, "A streaming parallel decision tree algorithm".
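The core idea in that paper can be sketched in a few lines: every new value enters as its own bin, and whenever the bin count exceeds the limit, the two closest neighboring bins are merged into their weighted average. The class below is an illustration written for this post, not Hive's NumericHistogram source; the names StreamingHistogramSketch, Bin, and trim are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; NOT Hive's NumericHistogram source. */
final class StreamingHistogramSketch {

    /** One bin: a centroid x and the number of points y it stands for. */
    static final class Bin {
        double x;
        double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;                         // assumed to be >= 1
    private final List<Bin> bins = new ArrayList<>();  // kept sorted by centroid

    StreamingHistogramSketch(int maxBins) {
        this.maxBins = maxBins;
    }

    /** Insert a value as its own bin, then shrink back to maxBins. */
    void add(double v) {
        int pos = 0;
        while (pos < bins.size() && bins.get(pos).x < v) {
            pos++;
        }
        if (pos < bins.size() && bins.get(pos).x == v) {
            bins.get(pos).y += 1;           // same centroid: just bump the count
        } else {
            bins.add(pos, new Bin(v, 1));   // new centroid: insert in sorted order
        }
        trim();
    }

    /** Merge the two closest neighbors until at most maxBins remain. */
    private void trim() {
        while (bins.size() > maxBins) {
            int best = 0;
            double bestGap = Double.MAX_VALUE;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1).x - bins.get(i).x;
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            Bin a = bins.get(best);
            Bin b = bins.remove(best + 1);
            a.x = (a.x * a.y + b.x * b.y) / (a.y + b.y);  // weighted centroid
            a.y += b.y;
        }
    }

    List<Bin> bins() {
        return bins;
    }

    public static void main(String[] args) {
        StreamingHistogramSketch h = new StreamingHistogramSketch(10);
        for (int i = 1; i <= 50; i++) {
            h.add(i);
        }
        for (Bin b : h.bins()) {
            System.out.println(b.x + " -> " + b.y);
        }
    }
}
```

Merging by weighted average keeps the total count and the overall mean unchanged, which is why the summary stays usable even though individual values are lost.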

I ran a simple test:

package sunwg.test;

// NumericHistogram ships in hive-exec under this package; hive-exec is assumed to be on the classpath.
import org.apache.hadoop.hive.ql.udf.generic.NumericHistogram;

public class testHis {

    public static void main(String[] args) {

        // Build an approximate histogram with 10 bins.
        NumericHistogram numericHistogram = new NumericHistogram();
        numericHistogram.allocate(10);

        // Feed in the values 1.0 through 50.0.
        for (double i = 1.0; i <= 50.0; i++) {
            numericHistogram.add(i);
        }

        // Print the approximate quantiles at 0.1, 0.2, ..., 1.0.
        for (int i = 1; i <= 10; i++) {
            System.out.println(Math.round(numericHistogram.quantile(i / 10.0)));
        }
    }
}

The results are as follows:

On the whole the results are quite reliable. To improve accuracy, increase num_bins, which is the 10 in the code above:

numericHistogram.allocate(10);

In addition, NumericHistogram supports merging multiple partial histograms.
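To make the merge idea concrete, here is a minimal sketch of how two partial histograms can be combined: pool the bins from both sides, sort them by centroid, then repeatedly merge the closest pair until the bin limit is met. This is a standalone illustration, not Hive's merge method; the class HistogramMergeSketch and the {centroid, count} array layout are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; NOT Hive's NumericHistogram.merge. */
final class HistogramMergeSketch {

    /**
     * Combine two partial histograms into one with at most maxBins bins.
     * Each bin is a double[]{centroid, count}; this layout is an assumption.
     */
    static List<double[]> merge(List<double[]> left, List<double[]> right, int maxBins) {
        List<double[]> all = new ArrayList<>();
        for (double[] b : left) {
            all.add(b.clone());   // copy so the inputs are not modified
        }
        for (double[] b : right) {
            all.add(b.clone());
        }
        all.sort((a, b) -> Double.compare(a[0], b[0]));

        while (all.size() > maxBins) {
            // Find the pair of neighbors whose centroids are closest.
            int best = 0;
            double bestGap = Double.MAX_VALUE;
            for (int i = 0; i + 1 < all.size(); i++) {
                double gap = all.get(i + 1)[0] - all.get(i)[0];
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            // Replace the pair with one bin at the weighted centroid.
            double[] a = all.get(best);
            double[] b = all.remove(best + 1);
            a[0] = (a[0] * a[1] + b[0] * b[1]) / (a[1] + b[1]);
            a[1] += b[1];
        }
        return all;
    }

    public static void main(String[] args) {
        List<double[]> left = new ArrayList<>();
        left.add(new double[] {1.0, 2.0});
        left.add(new double[] {2.0, 2.0});
        List<double[]> right = new ArrayList<>();
        right.add(new double[] {1.5, 4.0});
        right.add(new double[] {10.0, 1.0});
        for (double[] bin : merge(left, right, 2)) {
            System.out.println(bin[0] + " -> " + bin[1]);
        }
    }
}
```

Because a partial histogram is just a short list of bins, each mapper can build one locally and only these small summaries need to be shipped and combined.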

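The reason for looking into this is the hope that data integration can study the data, learn its characteristics, and choose a more suitable splitpk, so that jobs are split more evenly, long-tail tasks are reduced, and users no longer have to do this tuning themselves.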
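One way this could work, sketched under assumptions: if the split column's quantiles are available, the boundaries between tasks can be placed at evenly spaced quantiles so that each range holds roughly the same number of rows. SplitPointSketch and splitPoints are hypothetical names; the quantile function is passed in, and in practice it might be something like numericHistogram::quantile from the test above.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative sketch only; the class and method names are hypothetical. */
final class SplitPointSketch {

    /**
     * Return parts - 1 boundaries placed at quantiles 1/parts, 2/parts, ...
     * so that each of the resulting ranges covers about the same row count.
     * Assumes parts >= 1 and that quantile accepts values in (0, 1).
     */
    static double[] splitPoints(DoubleUnaryOperator quantile, int parts) {
        double[] bounds = new double[parts - 1];
        for (int i = 1; i < parts; i++) {
            bounds[i - 1] = quantile.applyAsDouble((double) i / parts);
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Stand-in quantile function for values spread evenly over [0, 100].
        DoubleUnaryOperator uniform = q -> 100.0 * q;
        for (double bound : splitPoints(uniform, 4)) {
            System.out.println(bound);
        }
    }
}
```

With skewed data the boundaries bunch up where the rows are dense, which is exactly what evens out the work per task compared with cutting the key range into equal-width pieces.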

Last updated: 2017-08-13 22:26:23
