Hadoop architecture is an important topic in Data Mining & Business Intelligence (DMBI), a core subject in computer science.
Hadoop System Architecture
Hadoop, developed in 2005 and now an open-source platform managed by the Apache Software Foundation, is built around a programming model known as MapReduce, which is composed of two separate functions.
The Map step takes input data and breaks it down for processing across nodes within a Hadoop instance. These “worker” nodes may, in turn, break the data down further for processing. In the Reduce step, the processed data is then collected back together and assembled into a format based on the original query being performed.
To cope with truly massive-scale data analysis, Hadoop’s developers implemented a scale-out architecture, based on many low-cost physical servers with distributed processing of data queries during the Map operation.
Their logic was to build a Hadoop system capable of processing many parts of a query in parallel, reducing execution times as much as possible.
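The two phases can be sketched in plain Python with a classic word-count job. This is a single-process illustration of the model, not Hadoop's actual API: the function names and the in-memory "shuffle" step are assumptions made for clarity.

```python
from collections import defaultdict

def map_step(document):
    """Map: break the input into intermediate (word, 1) pairs."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_step(key, values):
    """Reduce: collect the intermediate values for one key back
    together into a single result."""
    return (key, sum(values))

def map_reduce(documents):
    # Shuffle: group intermediate pairs by key, as the framework
    # does between the Map and Reduce phases.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_step(doc):
            grouped[key].append(value)
    return dict(reduce_step(k, v) for k, v in grouped.items())

counts = map_reduce(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2
```

In a real cluster, each Map call runs on a different worker node and the shuffle moves data across the network; only the overall shape of the computation is the same.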
This can be contrasted with the legacy structured-database design, which looks to scale up within a single server by using faster processors, more memory, and fast shared storage.
Looking at the storage layer, the design aim for Hadoop is to execute the distributed processing with the minimum latency possible. This is achieved by executing Map processing on the node that stores the data, a concept known as data locality.
As a result, Hadoop implementations can use SATA drives directly connected to the server, thereby keeping the overall cost of the system as low as possible.
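The data-locality preference can be sketched as a tiny scheduling rule. The function and node names here are hypothetical; real Hadoop schedulers are rack-aware and far more elaborate, but the core preference is the same:

```python
def pick_node(block_replicas, available_nodes):
    """Prefer running a Map task on a node that already holds a
    replica of the block (data locality); otherwise fall back to
    any available node and accept a network transfer."""
    for node in available_nodes:
        if node in block_replicas:
            return node
    return available_nodes[0]

# The block is replicated on nodes A and C; nodes B and C have free
# task slots, so the scheduler prefers C and the data stays local.
chosen = pick_node({"A", "C"}, ["B", "C"])
# chosen == "C"
```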
To implement the data storage layer, Hadoop uses a feature known as HDFS, the Hadoop Distributed File System. HDFS is not a file system in the traditional sense and isn’t usually mounted directly for a user to view (although there are some tools available to achieve this), which can sometimes make the concept difficult to understand; it’s perhaps better to think of it simply as a Hadoop data store.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. Moreover, HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
NameNode and DataNodes: Hadoop Architecture
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
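The NameNode's metadata role can be illustrated with a toy model. The class name, the 128 MB block size, and the replication factor of 3 match common Hadoop defaults, but the round-robin replica placement is a deliberate simplification (real placement is rack-aware), and nothing here is Hadoop's actual internal code:

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default
REPLICATION = 3                 # default replication factor

class NameNode:
    """Toy model of the NameNode: it records which blocks make up a
    file and which DataNodes hold each block, but the data itself
    never flows through it."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}  # (filename, block index) -> replica list

    def allocate(self, filename, file_size):
        num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
        rotation = itertools.cycle(self.datanodes)
        for i in range(num_blocks):
            # Assign REPLICATION DataNodes per block; round-robin
            # placement is illustrative only.
            replicas = [next(rotation) for _ in range(REPLICATION)]
            self.block_map[(filename, i)] = replicas
        return num_blocks

nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
blocks = nn.allocate("logs.txt", 300 * 1024 * 1024)  # a 300 MB file
# 300 MB at 128 MB per block -> 3 blocks, each with 3 replicas
```

Clients would then read or write the blocks by contacting the listed DataNodes directly, which is why user data never passes through the NameNode.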
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.