Understanding reduce-side merging in Hadoop

Anton Kulev

I have trouble understanding the file-merging process on the reduce side in Hadoop as it is described in "Hadoop: The Definitive Guide" (Tom White). Quoting it:

When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map’s merge), there would be five rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files. Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.

The number of files merged in each round is actually more subtle than this example suggests. The goal is to merge the minimum number of files to get to the merge factor for the final round. So if there were 40 files, the merge would not merge 10 files in each of the four rounds to get 4 files. Instead, the first round would merge only 4 files, and the subsequent three rounds would merge the full 10 files. The 4 merged files and the 6 (as yet unmerged) files make a total of 10 files for the final round. The process is illustrated in Figure 6-7. Note that this does not change the number of rounds; it’s just an optimization to minimize the amount of data that is written to disk, since the final round always merges directly into the reduce.

In the second example (with 40 files) we really do reach the merge factor for the final round: in the 5th round the 10 files are not written to disk, they go directly to reduce. But in the first example there are really 6 rounds, not 5: in each of the first five rounds 10 files are merged and written to disk, and then in the 6th round we have 5 files (not 10!) that go directly to reduce. Why? If we adhere to "The goal is to merge the minimum number of files to get to the merge factor for the final round", then for these 50 files we should merge 5 files in the first round, then 10 files in each of the subsequent 4 rounds, and then we would reach the merge factor of 10 for the final, 6th round.

Keep in mind that we can't merge more than 10 files in any round (io.sort.factor is 10 in both examples).
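To make my expectation concrete, here is a rough sketch (my own code, not Hadoop's; the class name and the firstRoundSize rule are just my reading of that "minimum number of files" sentence) that applies the rule to both examples:

    // Hypothetical sketch: simulate the merge schedule I would expect
    // if the first round merges only enough files so that every later
    // round can merge exactly `factor` files and land on `factor`
    // segments for the final round.
    public class MergeSchedule {
        public static void main(String[] args) {
            simulate(50, 10);
            simulate(40, 10);
        }

        static void simulate(int segments, int factor) {
            System.out.println("Segments: " + segments + ", factor: " + factor);
            int round = 1;
            while (segments > factor) {
                int toMerge = (round == 1) ? firstRoundSize(segments, factor) : factor;
                segments = segments - toMerge + 1;   // merged output is one new segment
                System.out.println("  round " + round + ": merge " + toMerge
                        + " -> " + segments + " segments left");
                round++;
            }
            System.out.println("  final round: feed " + segments + " segments to reduce");
        }

        static int firstRoundSize(int segments, int factor) {
            int mod = (segments - 1) % (factor - 1);
            return (mod == 0) ? factor : mod + 1;
        }
    }

For 40 segments this reproduces the book's 4/10/10/10 schedule with 10 segments left for the final round; for 50 segments it gives 5/10/10/10/10, which is exactly the schedule I would expect instead of five rounds of 10.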

What am I misunderstanding in the first example, where 50 files are merged?

Manjunath Ballur

This is what I understood. If you read carefully, the important thing to remember is:

Note that this does not change the number of rounds; it’s just an optimization to minimize the amount of data that is written to disk, since the final round always merges directly into the reduce.

With or without the optimization, the number of on-disk merge rounds remains the same (5 in the first case and 4 in the second):

  • First case: 50 files are merged down to a final 5, and those 5 are fed directly into the "reduce" phase (total rounds: 5 + 1 = 6).
  • Second case: 34 files are merged down to a final 4, and the remaining 6 are read directly from memory and fed into the "reduce" phase (total rounds: 4 + 1 = 5).

In both cases, the number of merge rounds is determined by the configuration property mapreduce.task.io.sort.factor, which is set to 10.
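For reference, here is a minimal sketch of how you could raise or inspect that factor on a job. The property name mapreduce.task.io.sort.factor is the current one; older releases use io.sort.factor, as in the book. The class name and the value 25 are just for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SortFactorExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "merge-factor-demo");

            // Default is 10; a larger value means fewer merge rounds
            // (and fewer spills), at the cost of more open file handles.
            job.getConfiguration().setInt("mapreduce.task.io.sort.factor", 25);

            int factor = job.getConfiguration()
                            .getInt("mapreduce.task.io.sort.factor", 10);
            System.out.println("merge factor = " + factor);
        }
    }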

So the number of merge rounds does not change (whether the optimization is applied or not). But the number of files merged in each round can change, because the framework can reduce the number of merges and hence the number of spills to disk.

So, in the first case, without the optimization, the contents of all 50 files (merged into a final 5 files) are spilled to disk, and those files are read back from disk during the "reduce" phase.

In the second case, with the optimization, the contents of 34 files (merged into a final 4 files) are spilled to disk and read back from disk, while the remaining 6 un-merged files are read directly from the in-memory buffer during the "reduce" phase.
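To put numbers on it, here is a small back-of-the-envelope sketch (my own, using only the figures above) comparing how many segments go through an on-disk merge in each case:

    // Rough sketch: count how many map-output segments pass through
    // an on-disk merge before the reduce phase.
    public class SpillCount {
        public static void main(String[] args) {
            // First case (the book's simplified 50-file example):
            // 5 rounds of 10, every segment is merged and written to disk.
            int spilledFirstCase = 5 * 10;            // 50

            // Second case (40 files, optimized): rounds of 4, 10, 10, 10
            // hit disk; the remaining 6 segments skip the disk merge.
            int spilledSecondCase = 4 + 10 + 10 + 10; // 34
            int inMemorySecondCase = 40 - spilledSecondCase; // 6

            System.out.println("case 1: " + spilledFirstCase + " segments spilled");
            System.out.println("case 2: " + spilledSecondCase + " segments spilled, "
                    + inMemorySecondCase + " fed directly to reduce");
        }
    }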

The idea of the optimization is to minimize merging and spilling to disk.
