As demand for big data technologies grows rapidly, Apache Hadoop, the open source software framework, is at the heart of the revolution. Hadoop is often described as the next-generation platform for data processing because of its scalable processing capabilities and low cost.
Since its 1.0 release in 2011, Hadoop has rapidly become popular, and a strong ecosystem of consultants, distributors, and vendors has grown to support the technology across industries. Hadoop is an open source system at its core, which means that it is essentially free for anyone to use.
However, the need for organizations to align Hadoop with their business needs has fuelled the emergence of commercial distributions. Commercial Hadoop distributions are usually packaged with features designed to streamline the deployment of Hadoop while allowing users to build additional security, analytics, or data handling into their framework.
Commercial Hadoop distributions provide different combinations of open source components from the Apache Software Foundation and elsewhere. Each distribution integrates these components into a single product, offering a ready-made solution and saving businesses the hassle of assembling their own set of integrated components. Additionally, the vendors offer proprietary software, support, consulting services, and training.
Types of Commercial Hadoop Distribution Models
To meet the need of enterprises to deploy Hadoop for taming Big Data, several companies came up with commercial distribution models for Hadoop. Commercial Hadoop distributions fall into three broad categories:
- Distributions that provide paid support and training for Apache Hadoop (e.g. Cloudera, Hortonworks, MapR, IBM, etc.).
- Distributions that alternatively offer a set of supporting tools for deploying and managing Apache Hadoop (e.g. Cloudera, Hortonworks, MapR).
- Distributions that add vendor-specific features and code, offered as paid enhancements, to customize the Apache Hadoop deployment and align it with business needs (e.g. Cloudera, Hortonworks, MapR, IBM, etc.).
Now the big question is, how do you choose a Hadoop distribution from the numerous options available in the market? Let’s take a look at some of the criteria that can guide you to a suitable Hadoop distribution.
How To Choose The Best Commercial Hadoop Distribution
Before you decide upon a particular Hadoop distribution, remember not all distributions have the same components and not all components in one particular distribution are compatible with other distributions. However, all of them do possess Hadoop’s core capabilities.
One of the most important factors to consider when choosing a commercial Hadoop distribution is whether you require an on-premises or cloud-based solution. An on-site solution theoretically offers the highest level of security and complete control and ownership of your data. However, cloud-based solutions are comparatively less expensive, more flexible, and easier to scale.
One way of comparing commercial Hadoop distributions is to create a feature matrix, which is a table that lays out a detailed list of the specifications and features of each distribution under consideration. You can then choose the distribution whose features and specs best suit your requirements and solve your specific business problems.
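As a rough illustration, a feature matrix can also be kept as a small data structure and scored against your priorities. The sketch below is only one possible approach: the distribution names, feature names, and weights are hypothetical placeholders, not facts about any real vendor, and the weighted sum is simply an assumed way to turn the table into a ranking.

```python
# Hypothetical feature matrix: one row per candidate distribution,
# one column per feature. All names and values are placeholders.
FEATURE_MATRIX = {
    "Distribution A": {"on_premises": True, "cloud": False,
                       "paid_support": True, "management_tools": True},
    "Distribution B": {"on_premises": False, "cloud": True,
                       "paid_support": True, "management_tools": False},
    "Distribution C": {"on_premises": True, "cloud": True,
                       "paid_support": False, "management_tools": True},
}

# Hypothetical weights: a higher number means the feature matters more
# to your business.
REQUIREMENT_WEIGHTS = {
    "on_premises": 3,
    "cloud": 1,
    "paid_support": 2,
    "management_tools": 2,
}


def score(features, weights):
    """Sum the weights of every weighted feature the distribution provides."""
    return sum(w for name, w in weights.items() if features.get(name, False))


def rank(matrix, weights):
    """Return (name, score) pairs sorted from best fit to worst fit."""
    scored = [(name, score(features, weights))
              for name, features in matrix.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, total in rank(FEATURE_MATRIX, REQUIREMENT_WEIGHTS):
        print(f"{name}: {total}")
```

With these placeholder values, the distribution that covers the most heavily weighted features ranks first. In practice you would replace the rows with the distributions you are evaluating and the weights with your own priorities.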
If you are prototyping or experimenting, the latest official Apache Hadoop release is likely the best fit for your requirements, since the latest releases contain the newest features.
However, if your requirement is stability, the better choice is an older release that has been available long enough to have received incremental updates, such as bug fixes and minor feature upgrades.
You can also opt for open source Hadoop distributions. The open source approach enables engagement with the Apache Hadoop community, which provides the opportunity to tap into a deeper pool of innovations and skills to enrich your Hadoop experience.
Whatever the provider may be, the majority of Hadoop distributions include some kind of proprietary code, which comes in the form of installers and a set of management tools. Competition in this market is growing fiercer. Moreover, the growing popularity of Spark has led many vendors to dedicate more resources to Spark deployment and support.
The Best Hadoop Distributions In 2017
All of the top commercial Hadoop distributions have subtle differences by virtue of which they become more or less suitable for your business. Here, we have tried to provide a non-exhaustive guide to some of the most popular distributions in the market today.
Cloudera
Cloudera, the US-based software and solutions provider for Apache Hadoop technology and the first vendor to offer Hadoop as a package, continues to be a market leader in Hadoop distributions. Cloudera’s CDH, which comprises all the open source components, targets enterprise-class deployments and is one of the most popular commercial Hadoop distributions.
Known for its innovations, Cloudera was the first to offer SQL-for-Hadoop with its Impala query engine. Cloudera’s other additions include security features, a user interface, and interfaces for integration with third-party applications. Cloudera supports its distribution through the Cloudera Enterprise subscription service.
Hortonworks
Hortonworks develops and supports Apache Hadoop for the distributed processing of large data sets across computer clusters. The Hortonworks Data Platform (HDP) is an entirely open source platform designed to manage data from many sources and formats. The platform includes various Hadoop technologies, such as the Hadoop Distributed File System, MapReduce, ZooKeeper, HBase, Pig, and Hive, along with additional components.
Hortonworks is known for acquiring other companies with useful code and releasing that code to the open source community. A new trend towards consolidation in the market has resulted in the growing popularity of Hortonworks’ product. Recently, both Amazon and IBM have started offering Hortonworks as an option on their own platforms alongside their own Hadoop distributions. HDP also serves as the core of the Open Data Platform Initiative, a group aiming to simplify and standardize specifications in the Big Data ecosystem.
MapR
Unlike managed service providers such as Amazon and Microsoft, MapR is a platform-focused Hadoop solutions provider, just like Hortonworks and Cloudera. MapR integrates its own database system, known as MapR-DB, into its Hadoop distribution. MapR-DB is claimed to be four to seven times faster than the stock Hadoop database, HBase, running on other distributions. Thanks to its speed, MapR is often seen as a preferred choice for large Big Data projects.
Amazon Elastic MapReduce
Amazon offers a pay-as-you-go model on a cloud-only platform. It provides a Hadoop-as-a-Service platform through its Amazon Web Services arm. The key advantage of the pay-as-you-go model is scalability: it allows you to scale up or down as demands change. Amazon Elastic MapReduce also connects seamlessly with Amazon’s other cloud services, such as Amazon S3 and DynamoDB for storage, EC2 for cloud processing, and AWS IoT for collecting data from Internet of Things-enabled devices.
Microsoft Azure HDInsight
Microsoft also offers a cloud-only service in the form of the Azure HDInsight platform, which offers managed installations of several open source Hadoop distributions, including Cloudera, Hortonworks, and MapR. HDInsight integrates different Hadoop distributions with its own Azure Data Lake platform to provide a complete solution for cloud-based storage and analytics. Additionally, HDInsight provides Hive, Spark, Kafka, and Storm cloud services along with its own cloud security framework.
Choosing the right Hadoop distribution depends entirely on the obstacles and problems an organization faces in implementing Hadoop in the enterprise. Each commercial Hadoop distribution has its own pros and cons. It is therefore imperative to weigh the risk and cost of each Hadoop distribution against the additional value it offers, so that the distribution you choose proves beneficial for your business needs.