Understanding Map Reduce

There are two main components of Hadoop –

  1. Hadoop Storage (HDFS)
  2. Map Reduce (Processing Unit)

What is Map Reduce?

It is the processing unit of Hadoop and is used to process data in parallel. Data in Hadoop is stored as chunks called data blocks. Map Reduce can process the data right where those data blocks reside, a capability that a traditional single-machine program (for example, a plain Java application) does not have.

Applications of Map Reduce

Map Reduce is used for indexing and searching, for building classifiers, and for recommendation engines and analytics. It is a programming model with built-in support for parallel processing. Two functions get executed –

  1. Map Function
  2. Reduce Function

Joins, summarization, and similar operations can all be done using the Map Reduce framework!
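To make the two functions concrete, here is the classic word-count job sketched in plain Python (purely illustrative, no Hadoop required): a map function that emits (word, 1) pairs, and a reduce function that sums the values for each word.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Map function: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce function: sum all the counts emitted for one word."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: apply map_fn to every input record.
mapped = [pair for line in lines for pair in map_fn(line)]

# Shuffle phase: sort and group the intermediate pairs by key.
mapped.sort(key=itemgetter(0))
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

# Reduce phase: one reduce_fn call per distinct key.
result = dict(reduce_fn(k, vs) for k, vs in grouped.items())
print(result)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In a real Hadoop job the map calls run on different machines at the same time; here they simply run in a loop, but the division of work is the same.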

Advantages of Map reduce –

  1. Parallel Processing – Since the data is processed in parallel, jobs finish faster. (YARN, which manages cluster resources for this, was introduced in Hadoop 2.0.)
  2. Data Locality – It means that you can process your data at the place where it is present. The big data is broken down into chunks and stored on the slave machines in Hadoop, and each slave machine processes the blocks it holds. When the slave machines are done with the processing, their partial results are aggregated together and the final result is sent to the client machine!

Who decides which data will be sent to which data node?

The client submits the job to the resource manager, which picks the nearest data node holding the data so that not much network bandwidth is taken away!

Resource manager is the master of processing!

TRADITIONAL VS MAP REDUCE WAY

Now, let us understand the difference between the traditional and the Map Reduce way of doing things! So, let us suppose there is an election going on and there are five booths in which people are coming and voting. There is also a central server, where the final counting will be done.

Traditional – In this way, all the votes are moved to the center and counted there. This is a costly affair: the result center gets overburdened, and counting takes a long time!

Map Reduce – In this way, the counting is done at the respective booths in parallel. Once the votes are counted, only the results are sent to the center. This way, declaring the result becomes very easy.
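The election analogy can be sketched in a few lines of Python (a toy simulation with made-up ballots): each booth tallies its own votes locally, and only the small per-booth tallies travel to the center, where they are merged.

```python
from collections import Counter
from functools import reduce

# Hypothetical ballots cast at five booths (booth -> list of votes).
booths = {
    "booth1": ["A", "B", "A"],
    "booth2": ["B", "B"],
    "booth3": ["A"],
    "booth4": ["C", "A", "C"],
    "booth5": ["B", "A"],
}

# "Map" step: each booth counts its own ballots locally. In a real
# cluster these tallies would be computed in parallel, one per node.
local_tallies = {booth: Counter(votes) for booth, votes in booths.items()}

# "Reduce" step: the center merges the small tallies into a final result,
# instead of receiving and counting every raw ballot itself.
final_result = reduce(lambda a, b: a + b, local_tallies.values(), Counter())
print(final_result)  # Counter({'A': 5, 'B': 4, 'C': 2})
```

Note how the center only ever sees five small counters, not every individual vote: that is exactly the bandwidth saving the analogy is making.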

Understanding Map Reduce

Let there be an input, and let it be operated on by the map functions. Every input at every booth is counted by a map function. The aggregation is then done by the reducer, and the final output is produced.

Map Task – The output of the map task is sent to the reduce task.

Reduce Task – The output of the reduce task is sent to the client.


Map Reduce works on key–value pairs. The map function takes a key and a value and returns a key and a value. In between map and reduce there is a shuffling phase, where the intermediate pairs are grouped by key!
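The shuffling phase in isolation can be sketched like this (an illustrative snippet, with hand-written intermediate pairs): every value is grouped under its key so that each reduce call receives one key together with the complete list of its values.

```python
from collections import defaultdict

# Intermediate (key, value) pairs as several map tasks might emit them.
mapped_pairs = [("fox", 1), ("the", 1), ("fox", 1), ("dog", 1), ("the", 1)]

# Shuffle: group every value under its key, so each reducer sees
# one key together with ALL the values emitted for it.
shuffled = defaultdict(list)
for key, value in mapped_pairs:
    shuffled[key].append(value)

# Reduce: here, simply sum the grouped values per key.
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)  # {'fox': 2, 'the': 2, 'dog': 1}
```

Hadoop performs this grouping for you between the map and reduce phases; the snippet just makes the hidden step visible.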

YARN in Map Reduce

YARN stands for Yet Another Resource Negotiator. It allocates the resources in the Hadoop ecosystem. Different applications like Spark, Storm, or HBase are able to connect with Hadoop using YARN.

Daemons present in Hadoop 2.x

The resource manager is the master daemon which runs on the master machine. It allocates cluster-level resources to a particular job! It is the master that manages everything.

The client is the one who submits the job to Map Reduce. A client could be a Java application!

The node manager is the slave daemon that runs on every data node. It monitors the resources of its own data node and reports them to the resource manager.

Job History Server – It maintains information about Map Reduce jobs after the application master terminates. It also keeps the logs of all executions.

Application Master – It is a per-job process, executed on a slave machine, that manages and executes the job.

Container – It is created by the node manager when requested and allocates a certain amount of resources (CPU and memory) on a slave node.

YARN Application Workflow in Map Reduce

Resource manager has got two components –

  1. Scheduler
  2. Application Manager

Application master and Application manager

The Application Master is the one which executes all the tasks and asks for resources, while the Application Manager ensures that applications are admitted and that their masters are started properly. When a job is submitted to the resource manager, the scheduler schedules it and the application manager creates a container on one of the data nodes. Within that container, the application master is started; it registers with the resource manager and requests containers to execute the tasks. As soon as a container is allocated, the application master connects to the node manager of that node and asks it to launch the container. When the containers are launched, the application master executes the tasks in them and sends the results back to the client.

  1. Client submits the application.
  2. Resource manager allocates a container to start the application master.
  3. Application master registers with the resource manager.
  4. Application master asks for containers from the resource manager.
  5. Application master notifies the node manager to launch the containers.
  6. Application code is executed in the containers.
  7. Client contacts the resource manager or application master to monitor the application's status.
  8. Application master unregisters from the resource manager and the results are sent back to the client.
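The steps above can be sketched as a toy, single-process simulation (all class and method names here are illustrative, not the real Hadoop YARN API):

```python
class Container:
    """Step 6: application code runs inside a container on a slave node."""
    def __init__(self, cid):
        self.cid = cid

    def execute(self, task):
        return f"{task} ran in container {self.cid}"

class ResourceManager:
    def start_application_master(self):
        # Step 2: the RM launches the ApplicationMaster for the job.
        return ApplicationMaster(self)

    def register(self, master):
        # Step 3: the ApplicationMaster registers itself with the RM.
        print("ApplicationMaster registered")

    def allocate_containers(self, count):
        # Step 4: containers are granted on slave nodes.
        return [Container(i) for i in range(count)]

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm
        rm.register(self)

    def run_job(self, tasks):
        containers = self.rm.allocate_containers(len(tasks))
        # Step 5: notify node managers to launch the containers,
        # then (step 6) run one task in each of them.
        return [c.execute(t) for c, t in zip(containers, tasks)]

# Step 1: the client submits the application.
rm = ResourceManager()
master = rm.start_application_master()
results = master.run_job(["map task", "reduce task"])
# Step 8: the results come back to the client.
print(results)  # ['map task ran in container 0', 'reduce task ran in container 1']
```

The real system does all of this across machines with RPCs and heartbeats, but the order of interactions is the same as the numbered list above.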

Happy Learning 🙂
