Hadoop Environment Setup

Hadoop installation on Linux

This tutorial demonstrates how to install Hadoop in a Linux environment. Hadoop can also run on Windows or Mac, but it officially supports only GNU/Linux and Windows. Linux is preferred for its configuration flexibility and the amount of existing documentation.

If you are using something other than Linux, we recommend installing a Linux VM to follow this tutorial. If you don't want to use a VM, understand that the following configuration steps are similar for other OS implementations.

Adding a user

It's best practice to create a separate user for your Hadoop instance. This isolates Hadoop's file system from other file systems on the machine. Run the following commands in the Linux terminal:

$ sudo su
# useradd hadoop
# passwd hadoop

As the root user, you can add a user with useradd <username>. You can then set that user's password with passwd <username>.
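If you repeat these steps, useradd fails when the account already exists. A minimal sketch of a guard, assuming the standard id utility is available; the helper name user_exists is our own and not part of any tool:

```shell
# Hypothetical helper: succeeds if the named account already exists.
# $1 = user name to look up
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Example (run as root): create the hadoop user only when it is missing.
# user_exists hadoop || useradd hadoop
```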

Configuring SSH

Hadoop requires SSH to perform operations on a cluster of shared server nodes. The Hadoop user needs password-less login capabilities for accessing these nodes.

For these reasons, you must generate a public/private key pair that is then shared across the cluster.

To generate the key:

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

This generates an SSH public/private key pair, appends the public key to the authorized_keys file, and restricts that file's permissions so that only its owner can read and write it.
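Running the cat command above a second time adds the same key to authorized_keys twice. The following is a minimal sketch of an idempotent variant, assuming a public key file that contains a single line; the helper name authorize_key is hypothetical:

```shell
# Hypothetical helper: append a public key to authorized_keys only once
# and apply the conventional permissions for the SSH directory and file.
# $1 = public key file, $2 = SSH directory (for example ~/.ssh)
authorize_key() {
    pubkey_file=$1
    ssh_dir=$2
    mkdir -p "$ssh_dir"
    chmod 0700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    # -x matches the whole line, -F treats the key as a literal string
    if ! grep -qxF -e "$(cat "$pubkey_file")" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
    chmod 0600 "$ssh_dir/authorized_keys"
}

# Example:
# authorize_key ~/.ssh/id_rsa.pub ~/.ssh
```

Once the key is authorized, ssh localhost should log in without asking for a password, provided an SSH server is running on the machine and the key was created without a passphrase.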

Downloading Java

Java is required for Hadoop. You'll want to find the latest stable version of the Java JDK and install it on your machine.

1. Download Java

To download the JDK, visit Oracle's website. Find the latest stable JDK release, download it, and extract the archive.

2. Move JDK to /usr/local

After extracting, move the JDK folder to /usr/local. This makes Java available to all users on the system.

3. Update Path

Finally, you'll want to update your PATH variable. To update this through the command line, run the following:

$ echo 'export JAVA_HOME=/usr/local/jdk1.7.0_71' >> ~/.bashrc
$ echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc

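This sets the JAVA_HOME environment variable and appends its bin directory to PATH, so the java command is available in any new shell. If your extracted JDK folder is not named jdk1.7.0_71, use its actual name instead.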
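Because >> appends every time it runs, repeating the echo commands leaves duplicate export lines in the file. A minimal sketch of a helper that adds a line only when it is missing; the name append_once is our own invention:

```shell
# Hypothetical helper: append a line to a file unless an identical line
# is already present.
# $1 = line to add, $2 = target file
append_once() {
    line=$1
    file=$2
    touch "$file"
    # -x matches the whole line, -F disables regex interpretation
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
# append_once 'export JAVA_HOME=/usr/local/jdk1.7.0_71' ~/.bashrc
# append_once 'export PATH=$PATH:$JAVA_HOME/bin' ~/.bashrc
```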

4. Apply changes

To apply the changes, run:

$ source ~/.bashrc

5. Confirm Installation

To make sure everything is working, run java -version and you should see similar output:

$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) Client VM (build 25.144-b01, mixed mode)
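If java -version reports that the command is not found, the usual cause is a JAVA_HOME that does not point at the extracted JDK folder. A minimal sketch of a sanity check, assuming a JDK layout with the launcher at bin/java; the helper name is_java_home is hypothetical:

```shell
# Hypothetical helper: succeeds if the given directory looks like a JDK,
# that is, it contains an executable file at bin/java.
# $1 = candidate JAVA_HOME directory
is_java_home() {
    [ -n "$1" ] && [ -f "$1/bin/java" ] && [ -x "$1/bin/java" ]
}

# Example:
# is_java_home "$JAVA_HOME" || echo "JAVA_HOME is not set correctly" >&2
```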

Download Hadoop

You will follow a similar process for downloading Hadoop. The steps below use wget to easily retrieve the latest tar.gz file.

1. Navigate to /usr/local

$ cd /usr/local

Like Java, you want to install Hadoop in the /usr/local directory. You can now download and extract Hadoop from this directory.

2. Download Hadoop

Run wget to download the latest stable version of Hadoop.

# wget http://apache.claz.org/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
# tar xzf hadoop-2.7.4.tar.gz
# mv hadoop-2.7.4 hadoop

This retrieves Hadoop v2.7.4 and extracts it with tar. Notice that you've also moved everything into a directory named hadoop for easy reference.
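As an alternative to extracting and then moving, tar can unpack the archive straight into the target directory. This is a minimal sketch, assuming GNU tar (for the --strip-components option) and an archive with a single top-level folder; the helper name extract_flat is our own:

```shell
# Hypothetical helper: extract a tar.gz archive into a target directory,
# dropping the archive's single top-level folder.
# $1 = tar.gz archive, $2 = target directory
extract_flat() {
    mkdir -p "$2"
    tar xzf "$1" -C "$2" --strip-components=1
}

# Example (as root, from /usr/local):
# extract_flat hadoop-2.7.4.tar.gz hadoop
```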

3. Update Path

Update your ~/.bashrc file with the Hadoop environment variables using the same technique:

$ echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc

4. Apply changes

To apply the changes, run:

$ source ~/.bashrc

5. Confirm Installation

To make sure everything is working correctly, run hadoop version and you should see something like this:

Hadoop 2.7.4
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1523454
Compiled by hortonmu on 2017-08-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum
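If the hadoop command is not found instead, check that the Hadoop bin directory actually made it into PATH. A minimal sketch of such a check; the helper name in_path is hypothetical:

```shell
# Hypothetical helper: succeeds if a directory is listed in PATH.
# $1 = directory to look for
in_path() {
    # Wrap PATH in colons so the first and last entries match too
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# in_path "$HADOOP_HOME/bin" || echo "$HADOOP_HOME/bin is missing from PATH" >&2
```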

Conclusion

Your Linux environment is now configured for Hadoop. Next, we'll look at configuring Hadoop and HDFS.
