The Quick Guide to Configuring Hadoop
Hadoop runs in three different modes:
standalone mode
This is the default configuration for Hadoop out of the box. In standalone mode, Hadoop runs as a single process on your machine.
pseudo-distributed mode
In this mode, Hadoop runs each daemon as a separate Java process. This mimics a distributed implementation while running on a single machine.
fully distributed mode
This is a production-level implementation that runs on at least two machines.
For this tutorial, we will be implementing Hadoop in pseudo-distributed mode. This will allow you to practice a distributed implementation without the physical hardware needed to run a fully distributed cluster.
If you followed Hadoop Environment Setup, you should already have Java and Hadoop installed. To configure Hadoop for pseudo-distributed mode, you'll need to edit the following files located in /usr/local/hadoop/etc/hadoop: core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml.
core-site.xml
This file defines the port number, memory allocation, memory limits, and size of the read/write buffers used by Hadoop. Find this file in the etc/hadoop directory and give it the following contents:
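As a minimal sketch of such contents, the snippet below writes a one-property core-site.xml. The value hdfs://localhost:9000 is an assumption (any free port should work), and CONF_DIR is a hypothetical variable that defaults to a scratch folder, so nothing under /usr/local/hadoop/etc/hadoop is overwritten until you copy the file there yourself.

```shell
# CONF_DIR is a hypothetical variable; it defaults to a scratch folder.
CONF_DIR="${CONF_DIR:-$PWD/hadoop-conf-sketch}"
mkdir -p "$CONF_DIR"

# The quoted 'EOF' keeps the shell from expanding anything inside the XML.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <!-- Assumed value: NameNode on localhost, port 9000 -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```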
This sets the URI for all filesystem requests in Hadoop.
hdfs-site.xml
This is the main configuration file for HDFS. It defines the namenode and datanode paths as well as the replication factor. Find this file in the etc/hadoop/ directory and replace its contents with the following:
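A minimal sketch of an hdfs-site.xml for a single-machine setup might look like the following. The replication factor of 1 and the two directories under /home/hadoop/hdfs are assumptions, as is the hypothetical CONF_DIR scratch folder the file is written to; adjust them to match your own hadoop user's home directory.

```shell
# CONF_DIR is a hypothetical variable; it defaults to a scratch folder.
CONF_DIR="${CONF_DIR:-$PWD/hadoop-conf-sketch}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <!-- One copy of each block is enough with a single datanode -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- Assumed path; newer releases name this dfs.namenode.name.dir -->
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <!-- Assumed path; newer releases name this dfs.datanode.data.dir -->
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/hdfs-site.xml"
```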
Notice how we set the replication factor via the dfs.replication property. We define the namenode path dfs.name.dir to point to an hdfs directory under the hadoop user's home folder, and we point the datanode path dfs.data.dir to a similar destination.
It's important to remember that the paths we define for the namenode and datanode should be under the user we created for Hadoop. This keeps HDFS isolated within the context of the hadoop user and also ensures the hadoop user has read/write access to the file paths it needs to create.
yarn-site.xml
YARN is a resource management platform for Hadoop. To configure YARN, find the yarn-site.xml file in the etc/hadoop/ directory and replace its contents with the following:
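One plausible minimal yarn-site.xml is sketched below. It assumes the only setting needed is the shuffle auxiliary service that MapReduce jobs rely on; CONF_DIR is again a hypothetical scratch location rather than the live configuration directory.

```shell
# CONF_DIR is a hypothetical variable; it defaults to a scratch folder.
CONF_DIR="${CONF_DIR:-$PWD/hadoop-conf-sketch}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <!-- Lets each NodeManager serve map output to reducers -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/yarn-site.xml"
```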
mapred-site.xml
This file specifies which framework Hadoop uses to run MapReduce jobs. Hadoop provides a mapred-site.xml.template file out of the box, so first copy it to a new mapred-site.xml file from within the etc/hadoop directory:
$ cp mapred-site.xml.template mapred-site.xml
Now replace the contents of mapred-site.xml with the following:
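A minimal sketch, assuming MapReduce jobs should run on YARN, which is the usual choice for this kind of setup. The file is written to the hypothetical CONF_DIR scratch folder so you can review it before copying it over the real mapred-site.xml.

```shell
# CONF_DIR is a hypothetical variable; it defaults to a scratch folder.
CONF_DIR="${CONF_DIR:-$PWD/hadoop-conf-sketch}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <!-- Run MapReduce jobs on YARN rather than the local runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/mapred-site.xml"
```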
Configuring the Hadoop User Environment
Now that you've configured your Hadoop instance for pseudo-distributed mode, it's time to configure the hadoop user environment.
Log in as the hadoop user you created in Hadoop Environment Setup via:
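A small sketch of that step, assuming the account is literally named hadoop (HADOOP_USER is a hypothetical variable; change it if you chose a different name). Because su is interactive, the snippet only checks that the account exists and prints the command to run.

```shell
# Assumption: the account is named "hadoop"; adjust if yours differs.
HADOOP_USER="hadoop"

if id "$HADOOP_USER" >/dev/null 2>&1; then
  # su prompts for a password, so print the command instead of running it
  echo "run: su - $HADOOP_USER"
else
  echo "no user named $HADOOP_USER; create it before continuing"
fi
```

The dash in su - hadoop requests a login shell, so you start in the hadoop user's home directory with that user's environment.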
As the hadoop user, add the following to your ~/.bashrc file:
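A minimal sketch of such entries, assuming Hadoop is installed under /usr/local/hadoop (the parent of the configuration directory named earlier). The exact set of variables is an assumption; these are the ones Hadoop 2.x scripts commonly read.

```shell
# Assumed install location, inferred from /usr/local/hadoop/etc/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"

# Point each Hadoop component at the same installation
export HADOOP_COMMON_HOME="$HADOOP_HOME"
export HADOOP_HDFS_HOME="$HADOOP_HOME"
export HADOOP_MAPRED_HOME="$HADOOP_HOME"
export YARN_HOME="$HADOOP_HOME"

# bin holds hdfs and hadoop; sbin holds the start and stop scripts
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```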
This will add all of the required path variables to your profile so you can execute Hadoop commands and scripts. To register the changes to your profile, run:
$ source ~/.bashrc
Configuring Java for Hadoop
To use Java with Hadoop, you must set the JAVA_HOME environment variable in hadoop-env.sh. Find the hadoop-env.sh file in the same etc/hadoop/ directory and add the following:
This points Hadoop to your Java installation from Hadoop Environment Setup. You don't need to run the source command; just update and save the file.
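The right value depends on where Java lives on your machine, so the sketch below derives a candidate from the java binary on your PATH and prints the export line to paste into hadoop-env.sh. The helper java_home_from_bin is hypothetical, and the approach assumes a Linux system where readlink -f is available.

```shell
# Hypothetical helper: strip the trailing /bin/java from a binary path
java_home_from_bin() {
  printf '%s\n' "${1%/bin/java}"
}

if command -v java >/dev/null 2>&1; then
  # readlink -f follows symlinks such as /usr/bin/java to the real binary
  JAVA_BIN="$(readlink -f "$(command -v java)")"
  echo "export JAVA_HOME=$(java_home_from_bin "$JAVA_BIN")"
else
  echo "java not found on PATH; install it before editing hadoop-env.sh"
fi
```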
Verify Hadoop Configuration
You should be all set to start working with HDFS. To make sure everything is configured properly, navigate to the home directory for the hadoop user and run:
$ hdfs namenode -format
This will set up the namenode for HDFS. If everything is configured correctly, you should see something similar to this:
17/8/27 18:27:30 INFO util.ExitUtil: Exiting with status 0
17/8/27 18:27:30 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.1.1
To start YARN, run the following:
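In a standard Hadoop 2.x layout the script for this is start-yarn.sh, found in Hadoop's sbin directory. The sketch below assumes that directory is on your PATH and guards against the case where it is not; YARN_STATUS is a hypothetical variable used only to record the outcome.

```shell
if command -v start-yarn.sh >/dev/null 2>&1; then
  # Launches the ResourceManager and NodeManager daemons
  start-yarn.sh
  YARN_STATUS=$?
else
  echo "start-yarn.sh not found; is Hadoop's sbin directory on your PATH?"
  YARN_STATUS=127
fi
```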
If YARN is configured properly, you should see something similar to the following output:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop
localhost: starting nodemanager, logging to /home/hadoop/hadoop
To ensure HDFS is working properly, run the following command to start it:
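The matching script in a standard Hadoop 2.x layout is start-dfs.sh, also in Hadoop's sbin directory. As a sketch under the same PATH assumption, with DFS_STATUS as a hypothetical variable recording the outcome:

```shell
if command -v start-dfs.sh >/dev/null 2>&1; then
  # Launches the namenode, datanode, and secondary namenode daemons
  start-dfs.sh
  DFS_STATUS=$?
else
  echo "start-dfs.sh not found; is Hadoop's sbin directory on your PATH?"
  DFS_STATUS=127
fi
```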
If dfs starts successfully, you won't see any stack-trace errors and should see something similar to the output below:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop
localhost: starting datanode, logging to /home/hadoop/hadoop
Starting secondary namenodes [0.0.0.0]
Hadoop should now be properly configured for pseudo-distributed mode. You can verify things are working through the browser as well. Visit http://localhost:50070/ to see the currently running Hadoop services and http://localhost:8088/ to see a list of all applications running on the cluster.
Next we'll look at HDFS including basic architecture and commands.