How to Execute an End-to-End MapReduce Program on a Single-Node Hadoop Setup on an AWS EC2 Instance, from Instance Creation to Running the WordCount Job and Retrieving the Output?
Description: This guide demonstrates the complete process of running a MapReduce program on a single-node Hadoop setup on an AWS EC2 instance. It begins with launching a free-tier EC2 server and opening the ports needed for SSH, HDFS, and YARN access. After connecting to the instance, Java is installed as a prerequisite for Hadoop. Hadoop is then downloaded and extracted, environment variables are configured, and pseudo-distributed mode is set up by updating core-site.xml and hdfs-site.xml. The HDFS namenode is formatted and passphrase-less SSH is configured so the Hadoop daemons can start properly. Once the daemons are running, input data is created and uploaded to HDFS. The built-in WordCount MapReduce job is executed to process the data, and the results are retrieved from the output directory. Optional web interfaces, the HDFS NameNode UI and the YARN ResourceManager UI, provide visual access to cluster status, completing a full end-to-end MapReduce workflow on EC2.
Steps
Step 1: Launch EC2 (Free Tier)
Instance: t2.micro
OS: Amazon Linux 2023 or Amazon Linux 2
Security Group: open the following ports
Port 22: SSH
Port 9000: HDFS
Port 8088: YARN UI (optional)
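If you prefer to script this step instead of using the console, the same setup can be done with the AWS CLI. This is a minimal sketch; the security-group ID, AMI ID, and key-pair name are placeholders to substitute with your own values.

# Open the three ports on an existing security group (placeholder sg- ID)
aws ec2 authorize-security-group-ingress --group-id sg-PLACEHOLDER --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-PLACEHOLDER --protocol tcp --port 9000 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-PLACEHOLDER --protocol tcp --port 8088 --cidr 0.0.0.0/0
# Launch a free-tier t2.micro (placeholder AMI ID for Amazon Linux)
aws ec2 run-instances --image-id ami-PLACEHOLDER --instance-type t2.micro --key-name your-key --security-group-ids sg-PLACEHOLDER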
Step 2: SSH into the server
ssh -i your-key.pem ec2-user@YOUR_PUBLIC_IP
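Steps 3 through 9 (installing Java, downloading and configuring Hadoop, formatting the namenode, enabling passphrase-less SSH, starting the daemons, and loading input data) are not reproduced here. The sketch below reconstructs them from the description above; the Java package name, the Hadoop download URL, and the sample input text are assumptions, with the input chosen to match the word counts shown in Step 11.

# Step 3 (sketch): install Java, a Hadoop prerequisite
sudo yum install -y java-11-amazon-corretto-devel   # package name assumed for Amazon Linux 2023

# Step 4 (sketch): download and extract Hadoop 3.3.6 (matches the examples jar in Step 10)
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz

# Step 5 (sketch): environment variables (add to ~/.bashrc to persist)
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export HADOOP_HOME=$HOME/hadoop-3.3.6
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh   # daemons launched over SSH need JAVA_HOME set here

# Step 6 (sketch): pseudo-distributed configuration
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# Step 7 (sketch): passphrase-less SSH to localhost, then format HDFS
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
hdfs namenode -format

# Step 8 (sketch): start the Hadoop daemons
start-dfs.sh
start-yarn.sh
jps   # should list NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager

# Step 9 (sketch): create input data and upload it to HDFS
echo "hello world hello hadoop hello mapreduce" > input.txt
hdfs dfs -mkdir -p /input
hdfs dfs -put input.txt /input/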
Step 10: Run WordCount MapReduce Job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar \
wordcount /input /output
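Note: MapReduce refuses to write to an existing output directory (it fails with a FileAlreadyExistsException), so if you rerun the job, remove /output first:

hdfs dfs -rm -r /output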
Step 11: View Output
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
Output will be:
hadoop 1
hello 3
mapreduce 1
world 1
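To copy the results out of HDFS onto the instance's local filesystem (the local directory name here is illustrative):

hdfs dfs -get /output ./wordcount-output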
Optional UIs you can open in a browser:
HDFS NameNode UI: http://YOUR_PUBLIC_IP:9870 (Hadoop 3.x default; open port 9870 in the security group first)
YARN ResourceManager UI: http://YOUR_PUBLIC_IP:8088