How to Execute an End-to-End MapReduce Program on a Single-Node Hadoop Setup on an AWS EC2 Instance, from Instance Creation to Running the WordCount Job and Retrieving the Output?
Description: This guide demonstrates the complete process of running a MapReduce program on a single-node Hadoop setup on an AWS EC2 instance. It begins with launching a free-tier EC2 server and opening the ports needed for SSH, HDFS, and YARN access. After connecting to the instance, Java is installed as a prerequisite for Hadoop. Hadoop is then downloaded and extracted, environment variables are configured, and pseudo-distributed mode is set up by updating core-site.xml and hdfs-site.xml. The HDFS namenode is formatted and passphrase-less SSH is configured so the Hadoop daemons can start properly. Once the daemons are running, input data is created and uploaded to HDFS. The built-in WordCount MapReduce job is executed to process the data, and the results are retrieved from the output directory. Optional web interfaces, the HDFS NameNode UI and the YARN ResourceManager UI, provide visual access to cluster status, completing a full end-to-end MapReduce workflow on EC2.
Steps
Step 1: Launch EC2 (Free Tier)
Instance: t2.micro
OS: Amazon Linux 2023 or Amazon Linux 2
Security Group: open the following ports
Port 22: SSH
Port 9000: HDFS
Port 8088: YARN UI (optional)
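If you prefer to script this step instead of using the console, the same setup can be done with the AWS CLI. This is a minimal sketch; the security-group ID, AMI ID, and key-pair name are placeholders to substitute with your own values.

# Open the three ports on an existing security group (placeholder sg- ID)
aws ec2 authorize-security-group-ingress --group-id sg-PLACEHOLDER --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-PLACEHOLDER --protocol tcp --port 9000 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-PLACEHOLDER --protocol tcp --port 8088 --cidr 0.0.0.0/0
# Launch a free-tier t2.micro (placeholder AMI ID for Amazon Linux)
aws ec2 run-instances --image-id ami-PLACEHOLDER --instance-type t2.micro --key-name your-key --security-group-ids sg-PLACEHOLDER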
Step 2: SSH into the server
ssh -i your-key.pem ec2-user@YOUR_PUBLIC_IP
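Steps 3 through 9 (installing Java, downloading and configuring Hadoop, formatting the namenode, enabling passphrase-less SSH, starting the daemons, and loading input data) are not reproduced here. The sketch below reconstructs them from the description above; the Java package name, the Hadoop download URL, and the sample input text are assumptions, with the input chosen to match the word counts shown in Step 11.

# Step 3 (sketch): install Java, a Hadoop prerequisite
sudo yum install -y java-11-amazon-corretto-devel   # package name assumed for Amazon Linux 2023

# Step 4 (sketch): download and extract Hadoop 3.3.6 (matches the examples jar in Step 10)
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz

# Step 5 (sketch): environment variables (add to ~/.bashrc to persist)
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export HADOOP_HOME=$HOME/hadoop-3.3.6
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh   # daemons launched over SSH need JAVA_HOME set here

# Step 6 (sketch): pseudo-distributed configuration
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# Step 7 (sketch): passphrase-less SSH to localhost, then format HDFS
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
hdfs namenode -format

# Step 8 (sketch): start the Hadoop daemons
start-dfs.sh
start-yarn.sh
jps   # should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager

# Step 9 (sketch): create input data and upload it to HDFS
echo "hello world hello hadoop hello mapreduce" > input.txt
hdfs dfs -mkdir -p /input
hdfs dfs -put input.txt /input/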
Step 10: Run WordCount MapReduce Job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar \
wordcount /input /output
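Note: MapReduce refuses to write to an existing output directory (it fails with a FileAlreadyExistsException), so if you rerun the job, remove /output first:

hdfs dfs -rm -r /output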
Step 11: View Output
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
Output will be:
hadoop 1
hello 3
mapreduce 1
world 1
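To copy the results out of HDFS onto the instance's local filesystem (the local directory name here is illustrative):

hdfs dfs -get /output ./wordcount-output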
Optional UIs you can open in a browser:
HDFS NameNode UI: http://YOUR_PUBLIC_IP:9870 (Hadoop 3.x default; open port 9870 in the security group first)
YARN ResourceManager UI: http://YOUR_PUBLIC_IP:8088