The Quick Guide to Configuring Hadoop

Hadoop runs in three different modes. They are:

local/standalone mode

This is the default configuration for Hadoop out of the box. In standalone mode, Hadoop runs as a single process on your machine.

pseudo-distributed mode

In this mode, Hadoop runs each daemon as a separate Java process. This mimics a distributed implementation while running on a single machine.

fully distributed mode

This is a production-level implementation that runs on a cluster of at least two machines.

For this tutorial, we will be implementing Hadoop in pseudo-distributed mode. This will allow you to practice a distributed implementation without the physical hardware needed to run a fully distributed cluster.

Configuring Hadoop

If you followed Hadoop Environment Setup, you should already have Java and Hadoop installed. To configure Hadoop for pseudo-distributed mode, you'll need to edit the following files, all located in /usr/local/hadoop/etc/hadoop.
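It helps to change into that directory first, since each of the files below is edited in place (mapred-site.xml only exists as a template for now; we'll create it when we get to it):

$ cd /usr/local/hadoop/etc/hadoop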

core-site.xml

This file defines the port number for the Hadoop instance, the memory allocated for the file system, memory limits, and the size of the read/write buffers used by Hadoop. Find this file in the etc/hadoop directory and give it the following contents:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

This sets the URI for all filesystem requests in Hadoop.
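Once the hadoop user's PATH is set up (covered later in this guide), you can sanity-check that Hadoop picks this value up with the hdfs getconf command. fs.defaultFS is the newer property name that the deprecated fs.default.name maps to:

$ hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000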

hdfs-site.xml

This is the main configuration file for HDFS. It defines the namenode and datanode paths as well as the replication factor. Find this file in the etc/hadoop directory and replace its contents with the following:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>

Notice how we set the replication factor via the dfs.replication property. We point the namenode path dfs.name.dir to an hdfs directory under the hadoop user's home folder, and the datanode path dfs.data.dir to a similar destination.

It's important that the paths we define for the namenode and datanode live under the user we created for Hadoop. This keeps HDFS isolated within the context of the hadoop user and also ensures the hadoop user has read/write access to the paths it needs to create.
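If those directories don't exist yet, create them before formatting HDFS later in this guide. The chown is only needed if you create them as a user other than hadoop:

$ mkdir -p /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode
$ chown -R hadoop:hadoop /home/hadoop/hdfs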

yarn-site.xml

Yarn is a resource management platform for Hadoop. To configure Yarn, find the yarn-site.xml file in the same etc/hadoop directory and replace its contents with the following:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

mapred-site.xml

This file specifies which framework Hadoop uses for MapReduce. Hadoop provides a mapred-site.xml.template file out of the box, so first copy it to a new mapred-site.xml file via:

cp mapred-site.xml.template mapred-site.xml

Now replace the contents of mapred-site.xml with the following:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
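All four XML files have now been edited, so this is a good point to catch any typos. If you have xmllint installed (it ships with libxml2 and is available on most Linux distributions), it will flag malformed XML; no output means the files are well formed:

$ cd /usr/local/hadoop/etc/hadoop
$ xmllint --noout core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml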

Configuring the Hadoop User Environment

Now that you've configured your Hadoop instance for pseudo-distributed mode, it's time to configure the hadoop user environment.

Log in as the hadoop user you created in Hadoop Environment Setup via:

su hadoop

As the Hadoop user, add the following to your ~/.bashrc profile:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

This adds all of the required environment variables to your profile so you can execute Hadoop commands and scripts from anywhere. To apply the changes to your current session, run:

$ source ~/.bashrc
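If the profile loaded correctly, the Hadoop binaries should now resolve from anywhere. A quick check (the exact version string depends on the release you installed):

$ echo $HADOOP_HOME
/usr/local/hadoop
$ hadoop version
Hadoop 2.4.1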

Configuring Java for Hadoop

To use Java with Hadoop, you must set the JAVA_HOME environment variable in hadoop-env.sh. Find the hadoop-env.sh file in the same etc/hadoop directory and add the following:

export JAVA_HOME=/usr/local/jdk1.7.0_71

This points Hadoop to your Java installation from Hadoop Environment Setup. You don't need to run the source command here; just update and save the file.
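If your JDK lives somewhere other than /usr/local/jdk1.7.0_71, you can usually recover the correct path from the java binary itself and use that value instead, stripping the trailing /bin/java:

$ readlink -f $(which java)
/usr/local/jdk1.7.0_71/bin/java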

Verify Hadoop Configuration

You should be all set to start working with HDFS. To make sure everything is configured properly, navigate to the home directory for the hadoop user and run:

$ hdfs namenode -format

This formats the namenode storage directory you defined in hdfs-site.xml and initializes HDFS. If everything is configured correctly, you should see something similar to this:

17/08/27 18:27:30 INFO util.ExitUtil: Exiting with status 0
17/08/27 18:27:30 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.1.1
************************************************************/

Verify Yarn

To start Yarn, run the following:

$ start-yarn.sh

If Yarn is configured properly, you should see output similar to the following:

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
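You can also confirm that the NodeManager registered with the ResourceManager. On a pseudo-distributed setup, yarn node -list should report a single running node (the hostname and port will vary):

$ yarn node -list
Total Nodes:1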

Verify HDFS

To ensure HDFS is working properly, start dfs by running:

$ start-dfs.sh

If dfs starts successfully, you won't see any stack-trace errors and should see something similar to the output below:

17/08/27 18:27:30
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
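Another quick way to confirm everything is running is jps, which ships with the JDK and lists the Java processes for the current user. With both dfs and Yarn started you should see all five Hadoop daemons (the process IDs shown here are just examples):

$ jps
2451 NameNode
2557 DataNode
2725 SecondaryNameNode
2886 ResourceManager
2991 NodeManager
3320 Jps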

Conclusion

Hadoop should now be properly configured for pseudo-distributed mode. You can verify things are working through the browser as well: visit http://localhost:50070/ to see the NameNode web interface and current HDFS status, and http://localhost:8088/ to see the ResourceManager's list of all applications running on the cluster.
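As a final smoke test, you can create a home directory for the hadoop user in HDFS and list it back, for example:

$ hdfs dfs -mkdir -p /user/hadoop
$ hdfs dfs -ls /user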

Next we'll look at HDFS including basic architecture and commands.
