Hadoop Study Notes
1. History
Doug Cutting was inspired by the paper MapReduce: Simplified Data
Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat of Google Labs to develop an open-source implementation of the MapReduce framework. He named it Hadoop, after
his son's toy elephant.
2. What is MapReduce?
Apache Tutorial: https://hadoop.apache.org/common/docs/current/mapred_tutorial.html
A good Chinese-language introduction: https://www.cnblogs.com/forfuture1978/archive/2010/11/14/1877086.html
Mapper:
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
The map output is not written directly to disk; it is first written to an in-memory buffer.
When the data in the buffer reaches a threshold size, a background thread starts spilling it to disk.
Before the spill, the partitioner divides the in-memory data into partitions. The total number of partitions is the same as the number of reduce tasks for the job.
Within each partition, the background thread sorts the data by key in memory.
Each flush from memory to disk produces a new spill file.
Before the map task finishes, all spill files are merged into a single partitioned and sorted output file.
Reducers fetch the map output files over HTTP; tracker.http.threads sets the number of HTTP server threads.
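The partitioning step above can be sketched without Hadoop itself: Hadoop's default HashPartitioner assigns each key to a partition by taking the key's hash modulo the number of reduce tasks. Below is a minimal stand-alone sketch of that idea (the class and variable names are illustrative, not Hadoop's actual API):

```java
import java.util.*;

public class PartitionSketch {
    // Mirrors the logic of Hadoop's default HashPartitioner:
    // mask out the sign bit, then take the hash modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        // Bucket some intermediate keys into per-partition lists,
        // then sort each bucket by key, as the spill thread does in memory.
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "apple", "date");
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String k : keys) {
            partitions.computeIfAbsent(getPartition(k, numReduceTasks),
                                       p -> new ArrayList<>()).add(k);
        }
        partitions.values().forEach(Collections::sort);
        // Every occurrence of the same key lands in the same partition,
        // so a single reduce task sees all values for that key.
        System.out.println(partitions);
    }
}
```

Because the partition is a pure function of the key, repeated keys from different map tasks always reach the same reducer.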
Reducer:
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
When a map task finishes, it notifies its TaskTracker, which notifies the JobTracker.
For a given job, the JobTracker knows the mapping between TaskTrackers and map outputs.
A thread in the reducer periodically asks the JobTracker for the locations of map outputs until it has retrieved them all.
A reduce task needs the output for its partition from every map task.
The copy phase of a reduce task starts fetching a map task's output as soon as that task finishes, since different map tasks complete at different times.
A reduce task runs multiple copy threads, so map outputs can be copied in parallel.
This copy process is also called the shuffle (in this phase the framework fetches the relevant partition of the output of all the mappers via HTTP), followed by the sort (in this stage the framework groups reducer inputs by key, since different mappers may have output the same key).
The shuffle and sort phases actually overlap: while map outputs are being fetched, they are merged.
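The grouping done by the sort phase can be illustrated with plain Java collections. The sketch below merges (key, value) pairs fetched from several mappers and groups the values by key, which is the shape each reduce() call receives; the names are illustrative, and this is a simulation rather than Hadoop's actual merge implementation:

```java
import java.util.*;

public class ShuffleSortSketch {
    // Merge intermediate pairs from several map tasks and group values by key,
    // mimicking what the reduce side's sort/merge phase hands to the reducer.
    static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> mapOutputs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutputs) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Outputs of two hypothetical map tasks counting words.
        List<Map.Entry<String, Integer>> fetched = new ArrayList<>();
        fetched.add(Map.entry("hadoop", 1));
        fetched.add(Map.entry("map", 1));
        fetched.add(Map.entry("hadoop", 1)); // same key emitted by a different mapper
        // A reducer would then combine each key's value list,
        // e.g. summing counts for a word-count job.
        System.out.println(group(fetched));
    }
}
```

A TreeMap keeps the keys in sorted order, matching the sorted, grouped input a reducer iterates over.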
3. Install hadoop
https://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html
4. Troubleshooting
4.1 How to fix the java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName problem:
https://www.blogjava.net/snoics/archive/2011/03/10/333408.html
Last updated: 2017-04-02 22:16:40