Description:
MapReduce is a programming model and processing framework for large-scale data processing across distributed systems such as Hadoop. It divides a job into smaller tasks, runs them in parallel across the cluster, and aggregates the results.
Steps in MapReduce
Objective:
Count the frequency of words in a dataset.
Input:
Two lines of text:
Line 1: Hadoop MapReduce
Line 2: MapReduce Processes Big Data
Steps:
Step 1: Input Splitting
Input data is split into chunks based on the InputFormat (default: TextInputFormat). Each chunk is processed independently by a mapper.
Input splits (with the default TextInputFormat, each record is one line of text):
Split 1: "Hadoop MapReduce"
Split 2: "MapReduce Processes Big Data"
Step 2: Mapping
Each input split is processed by the Mapper class, which tokenizes its line and emits an intermediate ("word", 1) key-value pair for each word. These intermediate pairs are written to the mapper's local disk, not to HDFS.
Mapper output:
Split 1: ("Hadoop", 1), ("MapReduce", 1)
Split 2: ("MapReduce", 1), ("Processes", 1), ("Big", 1), ("Data", 1)
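The split and map phases can be sketched in plain Python. This is an illustrative simulation, not the Hadoop API; the two-line dataset is the example input above:

```python
# Simulate input splitting and mapping for the word-count example.
# TextInputFormat splits the input into lines; each line goes to a mapper.
lines = ["Hadoop MapReduce", "MapReduce Processes Big Data"]

def mapper(line):
    """Emit an intermediate ("word", 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

intermediate = [pair for line in lines for pair in mapper(line)]
print(intermediate)
# [('Hadoop', 1), ('MapReduce', 1), ('MapReduce', 1), ('Processes', 1), ('Big', 1), ('Data', 1)]
```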
Step 3: Shuffling and Sorting
Intermediate pairs are fetched by the reducers, grouped by key, and sorted by key.
Grouped data (key-sorted):
Big: [1]
Data: [1]
Hadoop: [1]
MapReduce: [1, 1]
Processes: [1]
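The shuffle-and-sort step can be simulated with a dictionary keyed by word. This is a single-process sketch; real Hadoop partitions pairs across reducers and merge-sorts spill files:

```python
from collections import defaultdict

# Intermediate pairs produced by the map phase.
intermediate = [("Hadoop", 1), ("MapReduce", 1), ("MapReduce", 1),
                ("Processes", 1), ("Big", 1), ("Data", 1)]

# Group values by key, then sort by key, mirroring Hadoop's shuffle and sort.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)
grouped = dict(sorted(groups.items()))
print(grouped)
# {'Big': [1], 'Data': [1], 'Hadoop': [1], 'MapReduce': [1, 1], 'Processes': [1]}
```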
Step 4: Reducing
Each reducer sums the list of values for each key, producing the final (word, count) pairs.
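For word count, the reduce function just sums the value list for each key. A minimal sketch, continuing from the grouped data above:

```python
# Grouped, key-sorted intermediate data from the shuffle phase.
grouped = {"Big": [1], "Data": [1], "Hadoop": [1],
           "MapReduce": [1, 1], "Processes": [1]}

def reducer(key, values):
    """Aggregate all counts observed for one word."""
    return (key, sum(values))

output = [reducer(key, values) for key, values in grouped.items()]
print(output)
# [('Big', 1), ('Data', 1), ('Hadoop', 1), ('MapReduce', 2), ('Processes', 1)]
```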
Step 5: Output Storage
The final key-value pairs are written to HDFS by the job's OutputFormat (default: TextOutputFormat), with one part file per reducer.
Output:
("Big", 1), ("Data", 1), ("Hadoop", 1), ("MapReduce", 2), ("Processes", 1)
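Putting the five steps together, the whole word-count job can be simulated end to end in a few lines. This is illustrative only; a real job runs distributed map and reduce tasks and writes part files to HDFS:

```python
from collections import defaultdict

def mapreduce_wordcount(lines):
    # Map: emit ("word", 1) for each word in each split (line).
    intermediate = [(w, 1) for line in lines for w in line.split()]
    # Shuffle/sort: group values by key, sorted by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: sum the counts for each word.
    return {key: sum(values) for key, values in sorted(groups.items())}

result = mapreduce_wordcount(["Hadoop MapReduce", "MapReduce Processes Big Data"])
print(result)
# {'Big': 1, 'Data': 1, 'Hadoop': 1, 'MapReduce': 2, 'Processes': 1}
```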