What is Apache Spark?

Apache Spark is an open-source cluster-computing framework built for fast, large-scale data processing. It can run on top of Hadoop or on its own with other storage systems. Spark is easy to use and popular for both batch and near-real-time data processing. In this article, we discuss the basics of Spark, including how Spark works and the misleading "Spark vs Hadoop" argument.

Why Spark?

Spark was developed to improve data processing speed. It uses in-memory processing, which the Spark project says can make some workloads run up to 100x faster than traditional Hadoop MapReduce. While Spark still relies on a storage system (like HDFS), it can be used with or without Hadoop to process big data.

Spark is easy to learn. It provides an interactive shell and APIs for several languages, including Java, Scala, and Python.

How Spark works

Spark is a cluster-computing framework. It splits a dataset into partitions and spreads them across a shared set of server nodes, which process the data in parallel.
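To make the idea of parallel processing concrete, here is a minimal sketch in plain Python. It is not Spark code and uses no Spark API: the function names are made up for illustration, and each worker thread merely stands in for a server node in a cluster. The data is split into chunks, each chunk is counted independently, and the partial results are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Work done independently on one partition of the data."""
    counts = {}
    for line in partition:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(partials):
    """Combine the per-partition results into one final answer."""
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total


def parallel_word_count(lines, num_workers=3):
    partitions = split_into_partitions(lines, num_workers)
    # Assumption for illustration: one thread plays the role of one node.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, partitions))
    return merge_counts(partials)
```

Calling parallel_word_count on a list of text lines returns a dictionary of word counts. A real cluster adds scheduling, data movement between machines, and failure recovery on top of this basic split, process, and merge pattern.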

Unlike Hadoop MapReduce, which writes intermediate results to disk between processing steps, Spark keeps intermediate data in memory where it can. This reduces the number of disk reads and writes, and it is the main reason Spark is typically faster than Hadoop's own MapReduce.
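The difference can be illustrated with a toy two-step job in plain Python. This is only a sketch under simple assumptions, not how either engine is implemented: the function names are hypothetical, and a temporary JSON file stands in for the disk handoff between steps. Both versions produce the same result, but one makes a round trip to disk and the other does not.

```python
import json
import os
import tempfile


def stage_one(numbers):
    """First processing step: square every number."""
    return [n * n for n in numbers]


def stage_two(squares):
    """Second processing step: add up the squares."""
    return sum(squares)


def run_with_disk_handoff(numbers):
    """Write stage-one output to a file, then read it back for stage two."""
    disk_ops = 0
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        with open(path, "w") as f:
            json.dump(stage_one(numbers), f)
        disk_ops += 1  # one write of the intermediate data
        with open(path) as f:
            intermediate = json.load(f)
        disk_ops += 1  # one read of the intermediate data
    finally:
        os.remove(path)
    return stage_two(intermediate), disk_ops


def run_in_memory(numbers):
    """Hand stage-one output straight to stage two without touching disk."""
    intermediate = stage_one(numbers)
    return stage_two(intermediate), 0
```

Each function returns the job result together with the number of disk operations it performed on the intermediate data. With only two steps the saving is small, but a job with many steps pays the disk cost at every handoff, which is where keeping data in memory helps most.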

Spark vs Hadoop

The argument over whether Spark is better than Hadoop is misleading. While Spark may outperform Hadoop MapReduce at data processing, it has no storage mechanism of its own. Since Spark relies on external data storage, it actually complements Hadoop's HDFS quite well.

Hadoop provides the infrastructure for distributed, fault-tolerant, and scalable data storage through HDFS. Hadoop's own MapReduce handles parallel data processing well, but Spark does the same job with a much faster, memory-based approach. Just like MapReduce, Spark can read from and write to HDFS directly. For these reasons, it's said that Spark works well "on top" of Hadoop.

Conclusion

Spark complements Hadoop rather than competing with it. While Spark can be used independently of Hadoop as a data processing engine, it works even better in conjunction with the rest of Hadoop's ecosystem. Where Spark lacks data storage, Hadoop excels with HDFS. Where Hadoop MapReduce lacks performance, Spark provides speed.

For these reasons, Spark and Hadoop remain one of the most popular combinations in big data systems.
