Hadoop: Getting Started With HDFS

Now that you have Hadoop properly configured, it's time to start exploring Hadoop's main storage system, HDFS.

The Hadoop Distributed File System (HDFS) stores large amounts of data across a cluster of nodes. Data written to HDFS is split into smaller blocks, which are then replicated and distributed across the cluster. Replication protects against data loss, and splitting large data sets into manageable blocks improves performance. The main components of HDFS are summarized below:

NameNode

One server in a Hadoop cluster acts as the dedicated NameNode. The NameNode manages the cluster's metadata, including the file system namespace and permissions, and coordinates operations such as opening and closing files.

DataNode

The other servers in the cluster function as DataNodes. DataNodes store the actual data and perform the reads and writes on the file system.
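
Once HDFS is up and running (covered below), you can see which DataNodes have registered with the NameNode by asking for a cluster report. This prints overall capacity along with per-node usage:

$ hdfs dfsadmin -report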

Blocks

Large data sets are broken down into separate blocks, which are then stored across the cluster in a distributed fashion. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later, and it's a configurable value.
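
To check the block size your own installation uses, you can read it straight from the configuration. The value is reported in bytes; on a Hadoop 2.x install you'd typically see 134217728, i.e. 128 MB:

$ hdfs getconf -confKey dfs.blocksize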

Starting HDFS

To start using HDFS, you must first format the NameNode. Run the following:

$ hdfs namenode -format

This formats the NameNode's storage directory for HDFS, taking the path you defined in hdfs-site.xml and creating it if it doesn't already exist. (DataNode directories are initialized later, when the DataNodes first start up and register with the NameNode.) On older Hadoop 1.x installs, the equivalent command is hadoop namenode -format.
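
If you're not sure which path that is on your machine, you can ask Hadoop directly. The key below is the Hadoop 2.x property name, and the example output is just an illustration; your path depends on what's in your hdfs-site.xml:

$ hdfs getconf -confKey dfs.namenode.name.dir
file:///usr/local/hadoop/hdfs/namenode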

Once formatted, you can start HDFS via:

$ start-dfs.sh
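
A quick way to verify everything came up is jps, which ships with the JDK and lists running Java processes. On a typical single-node setup you should see something like the following (the process IDs will differ):

$ jps
21043 NameNode
21234 DataNode
21442 SecondaryNameNode
21684 Jps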

HDFS Basic Commands

Interacting with HDFS is similar to working with any file system. For example, to list directories in HDFS:

$ hdfs dfs -ls <args>

This behaves like the regular ls command, listing the files and directories under the specified path.
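
For example, listing the root of a fresh single-node install might look something like this (the user, dates, and entries are illustrative):

$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2015-06-01 12:00 /user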

To create a new directory in HDFS, run:

$ hdfs dfs -mkdir test/

This creates a new test/ directory in HDFS. Note that relative paths like this one are resolved against your HDFS home directory, /user/<username>.
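
If you need nested directories, -mkdir takes a -p flag that creates parent directories as needed, just like the local mkdir:

$ hdfs dfs -mkdir -p test/nested/dirs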

To write to the newly created directory, you can run something like:

$ hdfs dfs -put testfile.txt test/

This copies the local file testfile.txt into the newly created test/ directory in HDFS.
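
The counterpart to -put is -get, which copies a file out of HDFS back to the local file system. To pull the file we just wrote back down (the local destination name here is arbitrary):

$ hdfs dfs -get test/testfile.txt ./testfile-copy.txt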

Reading data from HDFS

You can use cat to print file contents to the console:

$ hdfs dfs -cat test/testfile.txt
# the file's contents print here
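
For large files, printing the whole thing to the console is rarely what you want. -tail shows just the last kilobyte of a file, which is handy for quick spot checks:

$ hdfs dfs -tail test/testfile.txt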

Stopping HDFS

You can stop HDFS with:

$ stop-dfs.sh

Other useful commands

You can see a full list of commands available for HDFS by running:

$ hdfs dfs

Running this with no arguments lists all of the commands available in HDFS. You'll notice these are similar to the file system commands you already use (-tail, -mv, -ls, -cat, etc.).
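
To get usage details for any individual command, pass its name to -help:

$ hdfs dfs -help ls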

Conclusion

At the end of the day, HDFS is just another file system. Once you have Hadoop properly configured, reading and writing to HDFS is much like using any other command line interface for file system management.

Your thoughts?