Map Reduce Joins

A huge amount of data is being created around the world, and we need to analyze it. Doing that calls for the usual data operations: joins, indexes and so much more! In Hadoop, when two or more data sets are joined and one of them is much smaller than the others, the smaller one can be distributed to every node in the cluster, so that each node joins it locally against its share of the larger data set.

There are two types of joins in map reduce –

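  1. Map side join – The join is performed in the map phase, so no data has to be sent to a reducer for joining. Each mapper loads the smaller data set and uses it to look up matches for the records it reads from the larger data set. If neither data set fits in memory, a map side join is still possible when both inputs are partitioned in the same way and sorted on the join key; in that case the join is performed while the input is read, before the records are consumed by the map function.
  2. Reduce side join – These joins are simpler to set up than map side joins because they place no size or ordering requirements on the input, but all of the data has to pass through the shuffle. The mapper reads each record and emits a key-value pair whose key is the join key, usually tagging the value with the data set it came from. The shuffle and sort phases group all values that belong to the same key and send them to the same reducer, which then performs the join.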
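The two joins described above can be sketched in plain Python. This is only a simulation of the data flow, not Hadoop API code: the sample data sets `users` and `orders`, the function names and the `"L"`/`"R"` source tags are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: (user_id, name) is the small data set,
# (user_id, item) is the large one.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "mug")]


def map_side_join(small, large):
    """Inner join done entirely in the mapper, with no shuffle."""
    # Every mapper holds its own in-memory copy of the small data set.
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # Each record of the large data set is joined as it is read.
    joined = []
    for key, value in large:
        for small_value in lookup.get(key, []):
            joined.append((key, small_value, value))
    return joined


def reduce_side_join(left, right):
    """Inner join done in the reducer, after shuffle and sort."""
    # Map phase: emit (key, (tag, value)) so the reducer knows the source.
    mapped = [(k, ("L", v)) for k, v in left]
    mapped += [(k, ("R", v)) for k, v in right]
    # Shuffle and sort: group all values that share a key.
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    # Reduce phase: pair every left value with every right value per key.
    joined = []
    for key in sorted(groups):
        lefts = [v for tag, v in groups[key] if tag == "L"]
        rights = [v for tag, v in groups[key] if tag == "R"]
        joined.extend((key, l, r) for l in lefts for r in rights)
    return joined


map_result = map_side_join(users, orders)
reduce_result = reduce_side_join(users, orders)
print(map_result)
print(reduce_result)
```

Both functions return the same joined rows, only in a different order: the map side version follows the order of the large data set, while the reduce side version comes out grouped by key because of the sort step. The order with user 3 is dropped in both cases since this sketch implements an inner join.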

Happy Learning 🙂


