Availability: Open Source
It is a big data framework used for storing and large-scale processing of datasets on clusters of machines, using the MapReduce programming model (a word-count sketch follows this list).
It is used for batch processing: data is collected first and processed afterward, rather than in real time.
It can process terabytes and even petabytes of data.
Ordinary computers with sufficient processing capacity can be used as participating nodes in a Hadoop cluster; no specialized hardware is required for data processing.
It uses a distributed file system shared across all participating nodes. Data is broken into same-sized blocks and distributed to several nodes for parallel processing.
It does not have a built-in security model to check and validate incoming data; it processes whatever data is submitted to it.
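To make the MapReduce model concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API. The input and output paths are taken from the command line; this follows the stock WordCount pattern rather than any application-specific logic.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input, shuffling intermediate (word, count) pairs to reducers, and re-running failed tasks; the programmer supplies only the map and reduce functions.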
Apache Hive is built on top of Hadoop and is used to query and manage datasets in distributed storage. It provides ETL tools, SQL-like query execution via MapReduce, and the ability to plug in custom mappers and reducers.
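For a rough illustration, the sketch below submits a HiveQL query through HiveServer2's JDBC interface. The connection URL, credentials, and the employees table are assumptions of the example, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the Hive JDBC driver

    // Hypothetical HiveServer2 endpoint: jdbc:hive2://<host>:<port>/<database>
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         // SQL-like HiveQL; Hive compiles the query into MapReduce jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```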
The Apache Pig platform analyzes large datasets using its own high-level, SQL-like language called Pig Latin, which lets users express MapReduce tasks without writing Java directly.
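A minimal sketch of driving Pig Latin from Java through the PigServer API is shown below; the input file access.log and its two-column layout are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // LOCAL runs in-process; ExecType.MAPREDUCE would submit to a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Pig Latin: load, group, and count. Pig translates these statements
    // into one or more MapReduce jobs.
    pig.registerQuery("logs = LOAD 'access.log' AS (user:chararray, url:chararray);");
    pig.registerQuery("by_user = GROUP logs BY user;");
    pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(logs);");

    pig.store("counts", "user_counts_out"); // write results to an output directory
  }
}
```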
Hadoop Common: consists of libraries and utilities which are required by other Hadoop modules
Hadoop Distributed File System (HDFS): stores data on commodity machines and provides very high aggregate bandwidth across the Hadoop cluster (see the client sketch after this list)
Hadoop YARN: responsible for managing the computational resources in clusters and for scheduling those resources to user applications
Hadoop MapReduce: supports large-scale data processing
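To show how an application talks to HDFS, the sketch below writes a file and reads it back through the FileSystem client API. The NameNode URI hdfs://localhost:9000 and the file path are assumptions; block splitting and replication happen transparently behind these calls.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the (hypothetical) NameNode endpoint.
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

    Path path = new Path("/tmp/hello.txt");

    // Write: HDFS splits the stream into blocks and replicates them
    // across DataNodes automatically.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back.
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```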
Open Source: the framework is freely available and its source code can be modified
Distributed Processing: data is processed in parallel across the nodes of the cluster
Reliability: data blocks are replicated across machines, so data is not lost when a node fails
Fault Tolerance: failed tasks are automatically re-executed on healthy nodes
Scalability and High Availability: the cluster grows by simply adding commodity nodes, and data remains accessible despite node failures
Easy to use: the framework handles distribution and parallelism, so developers can focus on the processing logic
Data Locality: computation is moved to the nodes where the data resides, minimizing network traffic