Apache Hadoop MapReduce (Pseudo-Distributed Mode) - Part 2

In this article we will run the example from Part-1 in pseudo-distributed, single-server mode. Most of the configuration details are clearly laid out on the Hadoop site under Setting Up Single Server Pseudo-Distributed mode. For the sake of additional clarity I will note them here and also run our previous job from Part-1 against the new cluster. I assume that you already have Hadoop downloaded and set up from the previous article.

Note: Updated to Hadoop 2.4.1 and re-published from the original Sept 28th, 2011 blog post.

But first, let's cover how this works!

It is worth discussing how the overall MapReduce job works in the context of HDFS. Large data files no longer live on a single disk; each file is split into blocks that are distributed across the HDFS cluster. A typical HDFS block is 64MB, so a 192MB file would be spread across 3 nodes, with each node holding one 64MB block.
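Once data is in HDFS (as it will be in the steps below), you can check the configured block size and see how a particular file was split up. For example, using the PatientExperiences.csv file that we copy into HDFS later in this article:

$ bin/hdfs getconf -confKey dfs.blocksize
$ bin/hdfs fsck /user/mthomas/PatientExperiences.csv -files -blocks -locations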

Whereas in Hadoop 1.x resource and job management (JobTracker/TaskTracker) was part of the MapReduce framework itself, Hadoop 2.x splits that out into a global ResourceManager (RM) and a per-application ApplicationMaster (AM), with a NodeManager (NM) running on each node in the cluster. This new, generic cluster management engine is called YARN. YARN allows processing frameworks other than MapReduce (such as Pig, Storm, Spark, Hive, etc.) to run on Hadoop. An application is a job (type) that you wish to run on the Hadoop cluster, such as a MapReduce job.
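As written, the steps below only start HDFS, so the example job runs through Hadoop's local job runner. If you want to see YARN in action instead, the Hadoop single-node setup guide enables it with two small configuration files and one extra start script:

${HADOOP}/etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

${HADOOP}/etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

$ sbin/start-yarn.sh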

MapReduce works in two stages: the Map phase and the Reduce phase. In the example above, the Map phase runs on each of the three nodes since each node holds a block of the data. This ensures parallel processing and a faster response time compared to processing the entire file on a single node. This data-local processing is key to the performance of the overall MapReduce job.
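As a reminder of what those two phases look like in code, here is a minimal sketch of a Hadoop 2.x Mapper and Reducer. The class names and the CSV handling are illustrative only, not the actual Part-1 job:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs in parallel, once per input split, emitting (key, 1) pairs.
public class ExampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Key each record by its first CSV column (the real Part-1 job filters on a survey measure).
        context.write(new Text(line.toString().split(",")[0]), ONE);
    }
}

// Reduce phase: receives all values emitted for a given key and aggregates them.
class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}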

Environment

${HADOOP} in the rest of the blog indicates the root folder where you have extracted/installed Hadoop (such as hadoop-2.4.1).

Ensure that JAVA_HOME is set in ${HADOOP}/etc/hadoop/hadoop-env.sh. Set the HADOOP_HOME environment variable and add HADOOP_HOME/bin to your PATH.
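For example (the JDK path and install location below are placeholders; adjust them for your machine):

In ${HADOOP}/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

In your shell profile (~/.bashrc or ~/.bash_profile):
export HADOOP_HOME=~/hadoop-2.4.1
export PATH=$PATH:$HADOOP_HOME/bin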

Configuration Files

Update the configuration in ${HADOOP}/etc/hadoop/core-site.xml and hdfs-site.xml as shown below.

core-site.xml:
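This is the standard single-node setting from the Hadoop setup guide; fs.defaultFS tells HDFS clients where the NameNode listens (port 9000 is the conventional choice):

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>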

hdfs-site.xml:
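Also from the setup guide; dfs.replication is set to 1 because a pseudo-distributed cluster has only a single DataNode:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>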

Step-By-Step Instructions to Start Pseudo-Distributed Hadoop

In single server mode, Hadoop starts the NameNode and DataNode on the same server. The start-up scripts use SSH to remote into the NameNode and DataNode machines; in single server mode the "remote" server is really localhost, which makes single server mode a special case of the full cluster mode. The list of all the slave nodes is in ${HADOOP}/etc/hadoop/slaves; by default it contains only localhost.

Try ssh localhost to check whether you can already SSH in without a password. If not, do the following to enable it:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Once this is done you should be able to SSH to localhost without specifying a password.

Format the HDFS filesystem
$ bin/hdfs namenode -format

Start namenode and datanode daemons
$ sbin/start-dfs.sh
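To verify that the daemons came up, jps should list a NameNode, a DataNode, and a SecondaryNameNode; the NameNode web UI is also available at http://localhost:50070 in Hadoop 2.x:

$ jps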

Create the folders
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/mthomas

Copy the input file to the HDFS filesystem
$ bin/hdfs dfs -put ~/github_projects/hadoop-patient-experience/src/main/resources/PatientExperiences.csv /user/mthomas/PatientExperiences.csv

List the files in the distributed filesystem
$ bin/hdfs dfs -ls /user/mthomas

Package the source jar for the MapReduce job
$ mvn clean package

Run the MapReduce job
$ bin/hadoop jar /Users/mathew/github_projects/hadoop-patient-experience/target/hadoop-patient-experience-1.0.jar doop1.InsufficientDoctorCommunication /user/mthomas/PatientExperiences.csv output

List the generated output file
$ bin/hdfs dfs -cat /user/mthomas/output/*
Or you can copy it to your local filesystem using
$ bin/hdfs dfs -get /user/mthomas/output output

Stop the daemons
$ sbin/stop-dfs.sh

That's it!