Hadoop is a framework that allows you to store big data in a distributed environment so that it can be processed in parallel. It has two components –

  1. HDFS – It allows you to store your data across a cluster of machines.
  2. Map Reduce – It allows the parallel processing of the data stored in HDFS.

Exponentially growing data can be stored in HDFS. Unstructured data in particular can be stored easily, because Hadoop does not require any schema or validation up front!

What is HDFS?

HDFS appears as a single unit for storing data, but the data is actually spread across many systems. The name node contains the metadata about the data, and the data nodes contain the actual data. Since the data nodes are commodity machines rather than the high-end systems you might expect, failures do happen, and data could be lost. To prevent that, Hadoop replicates data across the data nodes.

By default, the replication factor is 3. In HDFS, there is no pre-write data validation. This simply means that you can simply dump your data without any validations or schema! HDFS also scales horizontally rather than vertically: instead of adding more resources to the existing data nodes, more data nodes are added.
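The replication idea above can be sketched in a few lines. This is a hypothetical illustration, not HDFS's actual placement policy (which is rack-aware); the function name `place_replicas` is made up for this example:

```python
# Hypothetical sketch: placing 3 copies of each block across data nodes.
# Real HDFS placement is rack-aware; this round-robin version only shows
# the idea that every block lives on `replication_factor` distinct nodes.
from itertools import cycle, islice

def place_replicas(blocks, data_nodes, replication_factor=3):
    """Assign each block to `replication_factor` distinct data nodes."""
    placement = {}
    nodes = cycle(data_nodes)
    for block in blocks:
        placement[block] = list(islice(nodes, replication_factor))
    return placement

placement = place_replicas(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
# blk_1 lands on dn1, dn2, dn3; blk_2 on dn4, dn1, dn2 – if one node
# fails, two other copies of each block survive.
```

With 4 data nodes and a replication factor of 3, losing any single node still leaves at least two copies of every block.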

How Are Data Blocks Created?

When a user dumps a file (say, an XML file) into Hadoop, it gets split up and stored across various data blocks. The default block size is 128 MB, and it can be configured in whichever way one wants!
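The splitting can be sketched as simple arithmetic. This is a minimal illustration; the function name `split_into_blocks` is invented for this example (the real setting is the `dfs.blocksize` configuration property):

```python
BLOCK_SIZE_MB = 128  # the HDFS default, configurable via dfs.blocksize

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file would be split into."""
    full_blocks = file_size_mb // block_size_mb
    sizes = [block_size_mb] * full_blocks
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes.append(remainder)  # the last block only occupies what it needs
    return sizes

print(split_into_blocks(300))  # [128, 128, 44] -> three blocks
```

Note that the last block is not padded: a 300 MB file uses two full 128 MB blocks plus one 44 MB block.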

Components of HDFS

Name Node – It is the master node and stores the information about the data nodes. It records the metadata of the files stored in HDFS. There are two files stored with the metadata –

  1. fsimage – it contains the complete state of the file system.
  2. edit logs – it contains the recent modifications not yet merged into the fsimage.

The name node also has to check whether each data node is active or not!

Data Node – these are the slave nodes and run on inexpensive commodity hardware. The blocks they hold are replicated, and they perform the low-level read and write requests from clients. The data nodes serve these requests as directed by the name node.

By default, each data node sends a heartbeat to the name node every 3 seconds.
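Here is a minimal sketch of how the name node might use those heartbeats to decide which data nodes are alive. The function name and the miss threshold are illustrative assumptions, not the real HDFS dead-node formula:

```python
# Hypothetical sketch of heartbeat bookkeeping on the name node side.
HEARTBEAT_INTERVAL = 3    # seconds, the HDFS default
DEAD_AFTER_MISSED = 10    # illustrative threshold, not HDFS's actual rule

def live_data_nodes(last_heartbeat, now):
    """Return the data nodes whose last heartbeat is recent enough."""
    limit = HEARTBEAT_INTERVAL * DEAD_AFTER_MISSED
    return [node for node, ts in last_heartbeat.items() if now - ts <= limit]

beats = {"dn1": 100.0, "dn2": 70.0}
print(live_data_nodes(beats, now=105.0))  # ['dn1'] – dn2 missed too many
```

Once a data node is considered dead, the name node can schedule re-replication of its blocks from the surviving copies.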

Secondary Name Node – Apart from the two daemons above, the secondary name node works alongside the name node as a helper! It is not a backup name node! It constantly reads the metadata from the RAM of the name node and writes it to the hard disk or the file system. It is also responsible for combining the edit logs with the fsimage (checkpointing).

CHECKPOINT PROCESS

The name node contains the fsimage, and the fsimage contains the complete state of the file system. Any changes land in the edit logs, which record the recent modifications to the files. The secondary name node pulls the earlier fsimage and the edit logs, combines them, and creates a new fsimage. This is the main function of the secondary name node.
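The checkpoint can be sketched as a merge of two data structures. This is a toy model assuming the fsimage is a dictionary of paths and the edit log is a list of operations; the function name `checkpoint` and the operation names are invented for this example:

```python
# Hypothetical sketch of checkpointing: fold the edit log into the fsimage.
def checkpoint(fsimage, edit_log):
    """Apply each logged operation to a copy of the fsimage, producing a new one."""
    new_image = dict(fsimage)  # never mutate the old image in place
    for op, path in edit_log:
        if op == "create":
            new_image[path] = {}
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

fsimage = {"/data/a.xml": {}}
edits = [("create", "/data/b.xml"), ("delete", "/data/a.xml")]
print(checkpoint(fsimage, edits))  # {'/data/b.xml': {}}
```

After the merge, the edit log can be truncated, which keeps name node restarts fast: replaying a short log is much cheaper than replaying every change since the cluster started.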

YARN – Yet Another Resource Negotiator

It can be considered the brain of the Hadoop ecosystem. It performs all the processing activities, including allocating resources and scheduling tasks.

Resource Manager – It is the master node on the processing side. The client sends the processing request to the resource manager. The actual processing takes place in the node managers, which are configured on the same machines as the data nodes.

Node Manager – It runs on every data node and manages the containers executing on that machine.

ARCHITECTURE OF YARN

The client sends the processing request to the resource manager, and the resource manager splits the request into parts and allocates them to the node managers. On the node manager side there are two parts –

  1. App Master – It runs inside a container.
  2. Container – It monitors all the resources it is using. It then sends the response to the scheduler. The scheduler is part of the resource manager, and it partitions the cluster's resources among various queues, applications etc.

Several jobs may be running across the node managers at once; the scheduler ensures that their results are combined only once the jobs involved are complete.

The app master requests resources from the app manager, and once they are allocated, it uses them for processing.
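The allocation flow described above can be sketched as a small simulation. This is a toy model, not the YARN API: the class and method names are invented, and real YARN allocation also considers queues, memory, and locality:

```python
# Hypothetical sketch of the YARN flow: the app master asks the resource
# manager for containers, which are granted on node managers with capacity.
class ResourceManager:
    def __init__(self, node_capacity):
        self.free = dict(node_capacity)  # node manager -> free container slots

    def allocate(self, needed):
        """Grant up to `needed` containers from node managers with capacity."""
        granted = []
        for node, free in self.free.items():
            while free and len(granted) < needed:
                granted.append(node)
                free -= 1
            self.free[node] = free
        return granted

rm = ResourceManager({"nm1": 2, "nm2": 1})
print(rm.allocate(3))  # ['nm1', 'nm1', 'nm2'] – containers spread over nodes
```

A second request against the same resource manager would come back empty until containers are released, which is exactly why the scheduler has to arbitrate between competing applications.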

Resource Manager

App Manager – It accepts the job submission. It negotiates the first container for executing the application-specific Application Master.

Application Master – They reside on the data nodes. It communicates with the containers and monitors the execution of the application on the data node.

MAP REDUCE

It is a programming framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment. In the map part, a portion of the data is read and turned into key-value pairs; the output of the mapper is the input to the reducer. The reducer aggregates the key-value pairs from the mappers and produces a single consolidated output!
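The classic way to see the two phases is a word count. This is a minimal in-memory sketch of the idea, not Hadoop's Java MapReduce API; the `mapper` and `reducer` functions here are illustrative stand-ins:

```python
# Minimal in-memory sketch of the map and reduce phases (word count).
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (key, value) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    """Reduce phase: aggregate the values for each key into one total."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data big cluster", "big data"]
intermediate = [pair for line in lines for pair in mapper(line)]
print(reducer(intermediate))  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster, each mapper runs on the node that already holds its block of data, and a shuffle step groups the pairs by key before the reducers aggregate them.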

Happy Learning 🙂
