Description:
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It is designed to scale from a single server to thousands of machines, each offering local computation and storage, so that vast amounts of data can be processed quickly and efficiently, which is essential in the era of big data.
Components
HDFS:
NameNode: Manages metadata (file structure and locations of data blocks).
DataNode: Stores actual data blocks across the cluster.
Secondary NameNode: Periodically checkpoints the NameNode's metadata by merging the edit log into the fsimage, which keeps recovery fast; despite its name, it is not a live backup or failover node (see the client-side sketch below).
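As a rough sketch of this division of labor, the standard Java FileSystem API below asks the NameNode for a file's metadata and block locations; the path /data/events.log is a hypothetical example, and cluster settings are assumed to come from the usual core-site.xml/hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataSketch {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/data/events.log");   // hypothetical path
        FileStatus status = fs.getFileStatus(file); // metadata request answered by the NameNode
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // each entry names the DataNodes that hold a replica of that block
            System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}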
HBase:
HMaster: Manages the RegionServers and coordinates HBase operations.
RegionServer: Handles reading and writing of data and stores the data in regions.
ZooKeeper: Provides coordination and synchronization between HBase components.
WAL (Write-Ahead Log): Ensures durability by logging changes before committing to storage.
HFile: The on-disk file format in which HBase stores the actual data (the files themselves live on HDFS).
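To make the read/write path concrete, here is a minimal HBase Java client sketch: the client locates the owning RegionServer through ZooKeeper, and that server handles the Put and Get. The table name users and column family info are illustrative assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // the ZooKeeper quorum is read from hbase-site.xml on the classpath
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put); // served by the RegionServer that owns this row's region

            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}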
Architecture
HDFS: HDFS is designed for large, sequential file storage. Its architecture focuses on storing and accessing large files in a distributed manner across multiple nodes.
HBase: HBase is a NoSQL database built for real-time read/write access to data. Its architecture is designed for random access to smaller, structured data units (rows in tables), with high availability and low-latency operations.
Reading Pattern
HDFS: HDFS is optimized for sequential reading of large files. Data is read in a streaming manner, ideal for batch processing tasks where entire files need to be processed at once.
HBase: HBase allows random read/write access to individual rows in a table. It’s optimized for low-latency, real-time access to smaller data items (e.g., rows in a table).
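The contrast shows up directly in client code. A sketch, assuming the fs and table handles (and imports) from the component sketches above; FSDataInputStream is in org.apache.hadoop.fs:

// HDFS: sequential, streaming read of an entire (hypothetical) file
try (FSDataInputStream in = fs.open(new Path("/data/events.log"))) {
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
        // process buffer[0..n); data streams in block by block from the DataNodes
    }
}

// HBase: random, low-latency read of one row by key
Result row = table.get(new Get(Bytes.toBytes("row-42")));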
Writing Pattern
HDFS: Data is written to HDFS in large sequential chunks (blocks). Once a file is written, it is typically immutable, meaning it is not modified after it is written.
HBase: Data in HBase is written row-by-row in a random-access manner. It uses a Write-Ahead Log (WAL) to ensure durability before changes are persisted to storage (HFiles).
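A sketch of both write paths under the same assumptions (open fs and table handles, illustrative names; FSDataOutputStream is in org.apache.hadoop.fs, Durability in org.apache.hadoop.hbase.client):

// HDFS: written once, sequentially; the file is effectively immutable after close
try (FSDataOutputStream out = fs.create(new Path("/data/new-events.log"))) { // hypothetical path
    out.writeBytes("records are appended sequentially\n");
}

// HBase: row-level write, recorded in the WAL before it reaches the MemStore/HFiles
Put put = new Put(Bytes.toBytes("row-42"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("active"));
put.setDurability(Durability.SYNC_WAL); // force a WAL sync before the write is acknowledged
table.put(put);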
Structured Storage
HDFS: HDFS is primarily designed to store large, unstructured data (such as log files, images, and videos). It doesn’t organize data in a table-like structure; instead, it stores it as large files distributed across the cluster in blocks.
HBase: HBase stores data in a structured format as tables with rows and columns (like a NoSQL database). Data is stored in column families, and it supports a schema, making it suitable for real-time applications requiring structured storage.
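That structure is declared up front. A sketch using the HBase Admin API (conn as in the client sketch above; table and family names are illustrative; Admin, TableDescriptorBuilder, and ColumnFamilyDescriptorBuilder live in org.apache.hadoop.hbase.client):

try (Admin admin = conn.getAdmin()) {
    admin.createTable(
        TableDescriptorBuilder.newBuilder(TableName.valueOf("users"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))    // column families are the
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("metrics")) // fixed part of the schema
            .build());
}

Only the column families form the fixed schema; individual columns within a family can be created on the fly at write time.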
Maximum Data Size
HDFS: HDFS is capable of storing huge amounts of data, often in the petabyte range, since it can handle files of very large sizes (up to several terabytes per file). It's optimized for storing large, unstructured datasets.
HBase: HBase is designed around smaller data items than HDFS, though it still scales to huge total volumes (its HFiles are stored on HDFS). Individual rows can be quite large, but HBase is optimized for random access to smaller, structured pieces of data (rows and cells) rather than monolithic files.
Dynamic Changes
HDFS: HDFS is designed for static data. Once data is written to HDFS, it is typically immutable. Modifications to data (like updates or deletions) are not supported efficiently, and files are generally appended to rather than modified.
HBase: HBase supports dynamic changes to data. It allows real-time updates, inserts, and deletions to individual rows in the table. This makes HBase well-suited for applications requiring frequent updates to data.
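A sketch of an in-place update and a delete against the hypothetical users table (Delete is in org.apache.hadoop.hbase.client):

// "update": a Put to an existing row writes a newer version of the cell
Put update = new Put(Bytes.toBytes("row-42"));
update.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Bob"));
table.put(update);

// remove the latest version of one column; new Delete(rowKey) alone removes the whole row
Delete delete = new Delete(Bytes.toBytes("row-42"));
delete.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
table.delete(delete);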
Data Distribution
HDFS: In HDFS, large files are split into fixed-size blocks (typically 128MB or 256MB), which are distributed across multiple nodes in the cluster. The system ensures fault tolerance by replicating these blocks (usually 3 replicas) across different nodes.
HBase: HBase stores data in tables, which are divided into regions. These regions are distributed across RegionServers. As data grows, regions can be split and distributed to different RegionServers dynamically, ensuring scalability.
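Block size and replication are ordinary HDFS settings (dfs.blocksize, dfs.replication). A sketch of setting them client-side, reusing the imports from the first sketch and a hypothetical path:

Configuration conf = new Configuration();
conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB blocks for files this client creates
conf.setInt("dfs.replication", 3);                 // 3 replicas per block
FileSystem fs = FileSystem.get(conf);

// replication can also be changed for an existing file
fs.setReplication(new Path("/data/events.log"), (short) 2);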
Data Storage
HDFS: HDFS stores data as large files that are split into blocks (typically 128MB or 256MB in size). The data is stored in an unstructured format, and once written, the data is typically immutable, meaning it is not changed or updated after storage.
HBase: HBase stores data in a structured format, organized in tables with rows and columns grouped into column families. The schema is flexible: new columns within a family can be added on the fly, while the families themselves are declared up front. This makes it suitable for real-time storage of, and operations on, smaller, structured data items.
Operations
HDFS: HDFS is optimized for sequential read/write operations. It is designed for batch processing tasks where large files need to be processed in their entirety, making it less efficient for real-time data manipulation or small, frequent reads and writes.
HBase: HBase is optimized for random read/write operations. It allows for real-time updates, insertions, and deletions to individual rows, making it suitable for applications that require frequent changes or real-time access to data.
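As a final illustration, HBase also supports bounded range scans over row keys, which sit between single-row random access and a full batch pass. A sketch, reusing the table handle and illustrative names above (Scan and ResultScanner are in org.apache.hadoop.hbase.client):

Scan scan = new Scan()
    .withStartRow(Bytes.toBytes("row-100"))
    .withStopRow(Bytes.toBytes("row-200")); // the stop row is exclusive by default
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow())); // print each row key in the range
    }
}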