Understanding Big Data And Hadoop

Big Data and Hadoop are often used interchangeably, and many people, including IT professionals, fail to grasp the difference between the two. The growing popularity of both terms has only compounded the confusion. In reality, the two complement each other and cannot be directly compared.

In this article, let us take a brief look at these terms, understand their key differences and see how they are connected.


Big Data

Big Data refers to the large volumes of data that are generated every day. Big Data has no significance until it is processed to derive meaningful insights.

The following are the defining characteristics of Big Data:

  • Volume: Volume refers to the quantity of data, which is growing day by day at a very fast pace
  • Velocity: Velocity refers to the speed with which various sources generate data
  • Variety: Variety refers to the different formats in which data is generated, such as structured, semi-structured and unstructured
  • Veracity: Veracity refers to the uncertainty or doubt regarding the available data because of its incompleteness and inconsistency
  • Value: Value refers to the usefulness of the data to the individual/organisation that utilizes it

Storing and processing Big Data is a major challenge. Traditional technologies such as RDBMS can store and process only structured data. This is where Apache Hadoop comes in.

Apache Hadoop

Apache Hadoop is a framework that can handle these large volumes of data. Not only does it make Big Data more meaningful, it also makes storing and accessing that data easier. Apache Hadoop is open-source software that supports storing data and running applications on clusters of commodity hardware. It enables the distributed processing of large datasets across computer clusters using simple programming models. Apache Hadoop can store and process all formats of data alike: structured, semi-structured and unstructured. One remarkable thing about Apache Hadoop is its low cost, which makes it accessible to companies of every size.
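
To make this concrete, here is a minimal sketch (not from the article) of an application writing a file into Hadoop's distributed storage through the standard org.apache.hadoop.fs.FileSystem API. The NameNode address and the file path are illustrative placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed cluster address; in practice this usually comes from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
                // HDFS splits the file into blocks and spreads them across DataNodes
                out.writeUTF("Hadoop stores this file in blocks across the cluster");
            }
        }
    }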

Why is Apache Hadoop important?

  • It can store large volumes of data quickly
  • Its distributed computing model enables it to process data quickly: the more computing nodes used, the more processing power available
  • Fault tolerance – Data and application processing are safeguarded against hardware failure. In case any node goes down, jobs are automatically redirected to other nodes to ensure that distributed computing does not fail
  • Several copies of data are automatically stored (see the replication sketch after this list)
  • It is flexible. Users can store large volumes of data, including unstructured data such as text, images and videos, and decide how to use it later
  • It does not incur much cost
  • Scalability. Users can enable their system to handle more data simply by adding nodes. Very little administration is needed. 
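
As a hedged illustration of the fault-tolerance and replication points above, the sketch below asks HDFS to keep three replicas of a file's blocks so that losing a node loses no data. The file path and the replication factor of 3 are assumptions made for the example, not values from the article.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            // Connects to the cluster configured in core-site.xml / hdfs-site.xml
            try (FileSystem fs = FileSystem.get(new Configuration())) {
                Path file = new Path("/data/important.csv"); // illustrative path
                // Keep three copies of each block on different DataNodes
                boolean applied = fs.setReplication(file, (short) 3);
                System.out.println("Replication request applied: " + applied);
            }
        }
    }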

Apache Hadoop framework has two important components:

  • Hadoop Distributed File System (HDFS): Responsible for data storage
  • YARN (Yet Another Resource Negotiator): Responsible for resource management and job scheduling, which enables parallel processing of the data stored in HDFS (see the sketch after this list)
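
To show where YARN sits in the stack, here is a small sketch, again an illustration under assumptions rather than part of the original article, that uses the YarnClient API to list the applications currently running on a cluster. It highlights that YARN, not HDFS, manages cluster resources and running jobs.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            // YARN is the resource manager; applications (e.g. MapReduce jobs) run on it
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.println(app.getName() + " -> " + app.getYarnApplicationState());
            }
            yarnClient.stop();
        }
    }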

Difference Between Big Data and Hadoop

It’s easier to understand the differences between the two when Big Data is viewed as a problem and Hadoop as the solution to that problem.

The three common problems when dealing with Big Data are:

  • The problem of storing the huge volume of data. Using traditional storage systems is inefficient because the storage is limited to a single system while the data is increasing at a rapid pace
  • The second problem pertains to the storage of heterogeneous data. It is paramount to ensure that there is a system to store data in all formats, such as structured, semi-structured and unstructured data
  • The third problem pertains to access and processing speed. Even though hard disk capacity has increased, the disk transfer (access) speed has not increased at a comparable rate, which is problematic

Now let’s look at how Hadoop solves all the Big Data problems discussed above.

  • HDFS is the solution to the first problem, the storage of Big Data. It provides distributed storage: the data is stored in blocks on DataNodes, and the user can specify the size of each block. HDFS also replicates the data blocks across different DataNodes. Thus, because commodity hardware is used, storage is not a problem

  • HDFS is also a viable solution to the second problem, the storage of data in all formats. With HDFS, users can store all types of data, be it structured, semi-structured or unstructured. There is no schema validation before data is dumped into HDFS. It also follows a write-once, read-many model, which lets the user write the data once and read it multiple times to find insights
  • The problem of processing the data at a fast rate is solved by YARN. YARN moves the processing to the data rather than the data to the processing: the processing logic is sent to the different slave nodes, and the data is processed in parallel across those nodes. The results are then sent to the master node, where they are combined, and the response is sent back to the user (see the word-count sketch after this list)
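
To ground the idea of sending processing to the data, below is a hedged sketch of the classic word-count MapReduce job: the Mapper runs in parallel on the nodes holding the input blocks, and the Reducer combines the partial counts. The class names and HDFS paths are illustrative, not taken from the article.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Runs on the nodes that store the input blocks: emits (word, 1) pairs
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Combines the partial results for each word into a final count
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/input"));   // assumed HDFS paths
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }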

Now you know that the two cannot really be compared, as comparisons are made between things that are similar in nature. Big Data and Hadoop are complementary to each other; the former is the problem, while the latter is the solution to that problem.

About Natasha

Natasha Manuel is an information analyst at SpringPeople. In her 7+ years of experience in the edu-tech industry, Natasha has led many corporate learning projects and delivered several high-impact training programs. She is passionate about technology and its role in effective learning solutions.

