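Verifying approximate histogram computation in Hive
A histogram gives a more intuitive view of how data is distributed. With a histogram, execution parameters and execution plans can be tuned in a more targeted way. Computing an exact histogram, however, takes a huge amount of computation, so being able to obtain a reasonably accurate approximate histogram is valuable.
Hive currently implements approximate histogram computation for numeric values. The class comment on NumericHistogram describes the implementation: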
/**
* A generic, re-usable histogram class that supports partial aggregations.
* The algorithm is a heuristic adapted from the following paper:
* Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
* J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
* guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
* of histogram bins.
*/
If you are interested, see the paper "A streaming parallel decision tree algorithm".
I ran a quick test:
package sunwg.test;

// Assumes Hive's NumericHistogram (hive-exec jar) is on the classpath;
// drop this import if the class was copied into sunwg.test instead.
import org.apache.hadoop.hive.ql.udf.generic.NumericHistogram;

public class testHis {
    public static void main(String[] args) {
        NumericHistogram numericHistogram = new NumericHistogram();
        // Use 10 histogram bins (num_bins).
        numericHistogram.allocate(10);
        // Feed the values 1.0 through 50.0 into the histogram.
        for (double i = 1.0; i <= 50.0; i++) {
            numericHistogram.add(i);
        }
        // Print the estimated deciles, rounded to the nearest integer.
        System.out.println(Math.round(numericHistogram.quantile(0.1)));
        System.out.println(Math.round(numericHistogram.quantile(0.2)));
        System.out.println(Math.round(numericHistogram.quantile(0.3)));
        System.out.println(Math.round(numericHistogram.quantile(0.4)));
        System.out.println(Math.round(numericHistogram.quantile(0.5)));
        System.out.println(Math.round(numericHistogram.quantile(0.6)));
        System.out.println(Math.round(numericHistogram.quantile(0.7)));
        System.out.println(Math.round(numericHistogram.quantile(0.8)));
        System.out.println(Math.round(numericHistogram.quantile(0.9)));
        System.out.println(Math.round(numericHistogram.quantile(1.0)));
    }
}
The results are as follows:
3
8
12
18
24
29
33
38
42
48
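To put a number on how close these estimates are, the printed deciles can be compared with the exact deciles of the values 1 to 50. The following is a minimal sketch, not part of Hive: the class name DecileError is hypothetical, and "exact decile" is assumed to mean the smallest value v such that at least k/10 of the values are less than or equal to v.

```java
final class DecileError {
    // Deciles printed by the test run, copied from the output listing.
    static final int[] APPROX = {3, 8, 12, 18, 24, 29, 33, 38, 42, 48};

    // Exact k-th decile of the integers 1..n under the definition assumed
    // in this sketch: the smallest value v such that at least k/10 of the
    // values are <= v, which equals ceil(k * n / 10).
    static int exactDecile(int k, int n) {
        return (k * n + 9) / 10;
    }

    // Largest absolute gap between the printed and the exact deciles.
    static int maxAbsError() {
        int worst = 0;
        for (int k = 1; k <= APPROX.length; k++) {
            worst = Math.max(worst, Math.abs(APPROX[k - 1] - exactDecile(k, 50)));
        }
        return worst;
    }

    public static void main(String[] args) {
        for (int k = 1; k <= APPROX.length; k++) {
            System.out.println("q=" + k / 10.0 + " approx=" + APPROX[k - 1]
                    + " exact=" + exactDecile(k, 50));
        }
        System.out.println("max abs error = " + maxAbsError());
    }
}
```

Under that definition the exact deciles are 5, 10, ..., 50, and the largest gap to the printed values is 3.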
On the whole this is fairly reliable. To improve accuracy, increase num_bins, which is the 10 passed to allocate in the test:
numericHistogram.allocate(10);
In addition, NumericHistogram supports merging multiple partial histograms.
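To illustrate how partial histograms can be combined, here is a minimal, self-contained sketch in the spirit of the Ben-Haim and Tom-Tov paper. It is not Hive's NumericHistogram code and does not use its API: the class SketchHistogram and all of its methods are hypothetical names chosen for this example. Each bin stores a centroid and a count; whenever there are more bins than allowed, the two closest adjacent bins are replaced by their weighted average.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a mergeable streaming histogram (not Hive's class). */
final class SketchHistogram {
    /** One bin: centroid x and the number of values y that it represents. */
    static final class Bin {
        double x;
        double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by x

    SketchHistogram(int maxBins) {
        if (maxBins < 1) throw new IllegalArgumentException("maxBins must be >= 1");
        this.maxBins = maxBins;
    }

    /** Adds one value as a bin of weight 1, then shrinks if needed. */
    void add(double v) {
        insert(v, 1.0);
        trim();
    }

    /** Folds another partial histogram into this one. */
    void merge(SketchHistogram other) {
        for (Bin b : other.bins) insert(b.x, b.y);
        trim();
    }

    /** Inserts a bin at its sorted position, or adds to an equal centroid. */
    private void insert(double x, double y) {
        int i = 0;
        while (i < bins.size() && bins.get(i).x < x) i++;
        if (i < bins.size() && bins.get(i).x == x) bins.get(i).y += y;
        else bins.add(i, new Bin(x, y));
    }

    /** Merges the two closest adjacent bins until at most maxBins remain. */
    private void trim() {
        while (bins.size() > maxBins) {
            int best = 0;
            double bestGap = Double.MAX_VALUE;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1).x - bins.get(i).x;
                if (gap < bestGap) { bestGap = gap; best = i; }
            }
            Bin a = bins.get(best);
            Bin b = bins.remove(best + 1);
            a.x = (a.x * a.y + b.x * b.y) / (a.y + b.y); // weighted centroid
            a.y += b.y;
        }
    }

    int binCount() { return bins.size(); }

    double total() {
        double sum = 0;
        for (Bin b : bins) sum += b.y;
        return sum;
    }

    double mean() {
        double sum = 0;
        for (Bin b : bins) sum += b.x * b.y;
        return sum / total();
    }

    public static void main(String[] args) {
        // Two partial histograms, as two separate tasks might build them.
        SketchHistogram left = new SketchHistogram(10);
        SketchHistogram right = new SketchHistogram(10);
        for (int i = 1; i <= 25; i++) left.add(i);
        for (int i = 26; i <= 50; i++) right.add(i);
        left.merge(right);
        System.out.println("bins=" + left.binCount()
                + " total=" + left.total() + " mean=" + left.mean());
    }
}
```

Merging bins by weighted average keeps the total count and the overall mean unchanged (up to floating-point rounding), which is what allows partial results built on separate data slices to be combined afterwards.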
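The reason for looking into this is the hope that data integration can study the data, learn its characteristics, and choose a more suitable splitpk, so that tasks are split more evenly, long-tail tasks are reduced, and users no longer have to do this tuning themselves.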
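The effect of distribution-aware splitting can be shown with a small, purely illustrative sketch. Everything here is an assumption made for the example: the class SplitBalance, its methods, and the skewed key set are hypothetical, and the quantile bounds are read from the fully sorted keys, whereas in practice they would come from an approximate histogram. The sketch compares equal-width key ranges with ranges cut at quantiles and reports the size of the largest task in each case.

```java
import java.util.Arrays;

/** Hypothetical comparison of equal-width and quantile-based range splits. */
final class SplitBalance {
    /** Skewed sample: 90 keys packed into 1..90, 10 keys spread up to 1000. */
    static long[] skewedKeys() {
        long[] keys = new long[100];
        for (int i = 0; i < 90; i++) keys[i] = i + 1;
        for (int i = 0; i < 10; i++) keys[90 + i] = (i + 1) * 100L;
        return keys; // already sorted ascending
    }

    /** Inclusive upper bounds that cut [min, max] into equal-width ranges. */
    static long[] equalWidthBounds(long min, long max, int tasks) {
        long[] bounds = new long[tasks];
        for (int i = 0; i < tasks; i++) {
            bounds[i] = min + (max - min) * (i + 1) / tasks;
        }
        return bounds;
    }

    /** Inclusive upper bounds taken at the (i+1)/tasks quantiles of the keys. */
    static long[] quantileBounds(long[] sortedKeys, int tasks) {
        long[] bounds = new long[tasks];
        for (int i = 0; i < tasks; i++) {
            bounds[i] = sortedKeys[(i + 1) * sortedKeys.length / tasks - 1];
        }
        return bounds;
    }

    /** Rows per task; the last bound must be >= the largest key. */
    static int[] rowsPerTask(long[] keys, long[] upperBounds) {
        int[] counts = new int[upperBounds.length];
        for (long k : keys) {
            int t = 0;
            while (k > upperBounds[t]) t++;
            counts[t]++;
        }
        return counts;
    }

    static int maxOf(int[] counts) {
        return Arrays.stream(counts).max().getAsInt();
    }

    public static void main(String[] args) {
        long[] keys = skewedKeys();
        int[] equalWidth = rowsPerTask(keys, equalWidthBounds(1, 1000, 4));
        int[] byQuantile = rowsPerTask(keys, quantileBounds(keys, 4));
        System.out.println("equal width: " + Arrays.toString(equalWidth));
        System.out.println("quantiles:   " + Arrays.toString(byQuantile));
    }
}
```

With this sample, equal-width ranges put 92 of the 100 rows into the first task, while the quantile-based ranges give each of the four tasks 25 rows.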
Last updated: 2017-08-13 22:26:23