Hadoop is at the centre of Big Data storage. Its framework enables users to store big data in a distributed environment so that they can process it in parallel. Apache Spark, on the other hand, is a framework designed for real-time data analytics in a distributed computing environment. The popularity of these technologies has sparked the demand for both Hadoop training and Spark training today.
In this article we will explore the key differences between Hadoop 2 and Hadoop 3, and also draw a general comparison between Hadoop and Apache Spark.
Level Of Difficulty & Abstraction
A key difference between Hadoop and Spark is the level of abstraction, which is high for Spark and low for Hadoop. Hadoop is therefore harder to learn and use, since it requires developers to code basic operations themselves. And because Hadoop is essentially a core engine, using its advanced functionality means plugging in other components, which makes the system more complex.
Compared to Hadoop, Apache Spark comes fully equipped for data analytics, with many useful libraries in its stack. For instance, MLlib enables machine learning, and Spark SQL can be used to run SQL queries. Spark also provides several built-in high-level functions that operate on the Resilient Distributed Dataset (RDD).
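To give a feel for that high-level, functional style, here is a minimal sketch in plain Python of the map/filter/reduce chaining that Spark's RDD API exposes. The data and operations are made up for illustration; in PySpark the same chain would be written against an RDD and executed across a cluster.

```python
from functools import reduce

# Plain-Python analogue of an RDD transformation chain; in PySpark the
# equivalent would be rdd.map(...).filter(...).reduce(...).
numbers = [1, 2, 3, 4, 5, 6]

squared = map(lambda x: x * x, numbers)          # like rdd.map(...)
evens = filter(lambda x: x % 2 == 0, squared)    # like rdd.filter(...)
total = reduce(lambda a, b: a + b, evens)        # like rdd.reduce(...)

print(total)  # 4 + 16 + 36 = 56
```

The point is that each step is a short, declarative function rather than hand-written MapReduce plumbing, which is what makes Spark code easier to write and read.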
Hadoop is more secure than Spark thanks to its support for Access Control Lists (ACLs) and its use of Kerberos, a computer network authentication protocol. Spark, on the other hand, relies on a shared secret password for authentication.
Cost & Hardware
Hadoop does not require much RAM to operate because it works from disk, which can be more economical than provisioning large amounts of memory. Hadoop 3 also needs far less disk space than Hadoop 2 because of changes to its fault-tolerance mechanism. Spark, on the other hand, requires a lot of RAM to operate in its in-memory mode, which means the total cost of running Spark is considerably higher than running Hadoop.
Fault tolerance in Hadoop 2 is provided by replication. Each block of data is copied twice, so three copies exist in total. Hadoop 2 therefore consumes three times the disk space of the raw data, which wastes a great deal of storage.
Fault tolerance in Hadoop 3 is provided by erasure coding. This technique uses a parity block together with the surviving blocks to recover a lost block of data: one parity block is created for every two blocks of data. The level of fault tolerance thus remains the same in Hadoop 3, but with far less disk space needed.
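A toy example makes the trade-off concrete. The sketch below uses simple XOR parity over two data blocks, which matches the "one parity block per two data blocks" scheme described above; note that real HDFS erasure coding uses Reed-Solomon codes (such as RS(6,3)), so this is only an illustration of the idea, not of the actual algorithm.

```python
# Toy erasure-coding demo: one XOR parity block per two data blocks.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

block1 = b"hello   "
block2 = b"world!  "
parity = xor_blocks(block1, block2)   # stored alongside the data

# Suppose block1 is lost: XOR the survivors to rebuild it.
recovered = xor_blocks(block2, parity)
assert recovered == block1

# Storage cost: 3 stored blocks for 2 blocks of data (1.5x),
# versus 6 stored blocks under Hadoop 2's 3-way replication (3x).
print(recovered)
```

Either scheme tolerates the loss of one block, but erasure coding pays 50% storage overhead where replication pays 200%.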
Spark can recover information by recomputing the DAG (Directed Acyclic Graph), which consists of vertices and edges: vertices represent RDDs and edges represent the operations on them. If part of the data is lost, Spark recovers it by replaying the sequence of operations on the RDDs. Since recomputing an RDD from scratch means redoing all the required calculations, Spark also supports checkpoints to limit the work needed after a failure.
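The lineage idea can be sketched in a few lines of plain Python. The `LineageRDD` class below is a hypothetical stand-in, not Spark's actual implementation: it simply remembers its source data and the ordered list of operations (the DAG edges), so a lost partition can be rebuilt by replaying them.

```python
# Minimal sketch of lineage-based recovery: each "RDD" remembers its
# source data and the operations used to derive it, so lost data can
# be rebuilt by replaying those operations.
class LineageRDD:
    def __init__(self, source, ops=()):
        self.source = source               # original input data
        self.ops = list(ops)               # ordered transformations (lineage)
        self.partitions = self._compute()

    def _compute(self):
        data = list(self.source)
        for op in self.ops:
            data = [op(x) for x in data]
        return data

    def map(self, fn):
        # A transformation returns a new RDD with an extended lineage.
        return LineageRDD(self.source, self.ops + [fn])

    def lose_partition(self, i):
        self.partitions[i] = None          # simulate a node failure

    def recover(self):
        self.partitions = self._compute()  # replay the lineage

rdd = LineageRDD(range(4)).map(lambda x: x + 1).map(lambda x: x * 10)
rdd.lose_partition(2)
rdd.recover()
print(rdd.partitions)  # [10, 20, 30, 40]
```

This also shows why checkpoints matter: without one, `recover()` must replay the entire chain of operations from the original source every time.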
When it comes to speed, Spark is considerably faster than Hadoop, which works from disk; Hadoop cannot cache data in memory. Processing in memory, Spark can be up to 100 times faster than Hadoop, and up to 10 times faster even when working from disk.
Hadoop 3 is about 30% faster than Hadoop 2 thanks to the addition of a native Java implementation of the map output collector to MapReduce.
YARN (Yet Another Resource Negotiator) is Hadoop’s resource management and job scheduling layer. It manages available resources such as CPU, memory and disk. Hadoop 2 uses version 1 of YARN, while Hadoop 3 updates YARN to version 2, which introduces various changes that enhance scalability and usability. YARN 2 also supports flows (logical groups of YARN applications) and aggregates metrics at the flow level. Apache HBase is the primary backing storage in YARN 2.
Spark, in contrast, can run standalone or on top of Mesos or YARN.
Programming Languages Support
Through Hadoop Streaming, both Hadoop 2 and 3 support various programming languages, with Java as the primary language. Spark 2.x supports Java, Scala, R and Python.
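Hadoop Streaming works by running any executable as the mapper or reducer and piping records between them as tab-separated lines over stdin/stdout. The sketch below shows the classic word-count pair in Python; the stream logic is written as plain functions (an illustrative simplification, since a real Streaming job reads `sys.stdin` directly) so that it can be tried outside a cluster.

```python
# Word count in the Hadoop Streaming style: mapper emits "word\t1"
# lines, Hadoop sorts them by key, and the reducer sums per-word counts.

def mapper(lines):
    """Emit a 'word\t1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum counts per word; input must be sorted by key, as Hadoop
    guarantees between the map and reduce phases."""
    current, count = None, 0
    for pair in sorted_pairs:
        word, n = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

mapped = sorted(mapper(["big data", "big deal"]))  # sort stands in for the shuffle
print(list(reducer(mapped)))  # ['big\t2', 'data\t1', 'deal\t1']
```

Because the contract is just lines on stdin/stdout, the same job could be written in any language, which is how Hadoop supports languages beyond Java.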
The primary differences between Hadoop 2 and 3 come down to improvements in the latter’s architecture, usability and optimization.
Hadoop and Spark, on the other hand, differ chiefly in their level of abstraction. Hadoop’s low level of abstraction lets users perform complex manipulations, but it can make the system harder to learn and manage. Spark, in contrast, is faster and simpler, with many high-level functions that make work easier. All things considered, Spark is the better option if the work doesn’t need Hadoop’s special features.