What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.  It is designed to scale from single servers to thousands of machines, each offering local computation and storage.  Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Why is Hadoop important?

  • Ability to store and process vast amounts of data quickly.  With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
  • Computing power. Hadoop’s distributed computing model processes big data fast.  The more computing nodes you use, the more processing power you have.
  • Fault tolerance.  Data and application processing are protected against hardware failure.  If a node goes down, jobs are automatically redirected to other nodes to ensure the distributed computing does not fail. In addition, multiple copies of all data are stored automatically.
  • Flexibility.  Unlike traditional relational databases, you don’t have to preprocess data before storing it. Instead, you can store as much data as you want and decide how to use it later.  That includes unstructured data like text, images, and videos.
  • Low cost.  The free, open-source framework uses commodity hardware to store large quantities of data.
  • Scalability.  You can quickly grow your system to handle more data simply by adding nodes.  Little administration is required.

What are the challenges of using Hadoop?

MapReduce programming is not a good match for all problems.  It's suitable for simple information requests and problems that can be divided into independent units, but it's inefficient for iterative and interactive analytic tasks.  MapReduce is also file-intensive.  Because the nodes don't intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases.  This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing.

There's a widely acknowledged talent gap.  It can be difficult to find entry-level programmers with sufficient Java skills to be productive with MapReduce.  That's one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop: it is much easier to find programmers with SQL skills than programmers with MapReduce skills.  In addition, Hadoop administration seems to be part art and part science, requiring low-level knowledge of operating systems, hardware, and Hadoop kernel settings.

Data security.  Another challenge centers on fragmented data security, though new tools and technologies are surfacing.  The Kerberos authentication protocol is an excellent step toward making Hadoop environments secure.

Full-fledged data management and governance.  Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, governance, and metadata.  Especially lacking are tools for data quality and standardization.

Hadoop-related Products

Here is a condensed list of products built on Hadoop.

Hortonworks

  • Hortonworks Data Platform (HDP) is an open-source framework for distributed storage and processing of large, multi-source data sets.  HDP modernizes IT infrastructure and keeps data secure, in the cloud or on-premises, while helping to drive new revenue streams, improve customer experience, and control costs.  Cloudera acquired Hortonworks in January 2019.

MapR

  • MapR software provides access to various data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark, a distributed file system, a multi-model database management system, and event stream processing, combining real-time analytics with operational applications.  HPE acquired MapR's business assets in August 2019.

Cloudera

  • CDH is Cloudera's 100% open-source platform distribution, which includes Apache Hadoop and is built specifically to meet enterprise demands.  CDH delivers everything needed for enterprise use right out of the box.  By integrating Hadoop with more than a dozen other critical open-source projects, Cloudera has created a functionally advanced system that helps you perform end-to-end Big Data workflows.

Amazon EMR

  • Amazon EMR is a cloud-native big data platform for processing vast amounts of data quickly at scale.  Using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi (Incubating), and Presto, coupled with the scalability of Amazon EC2 and the scalable storage of Amazon S3, EMR gives analytical teams the engines and elasticity to run petabyte-scale analysis.

There are many more products, but to keep this post brief, I have listed only the most prominent names.

How does Hadoop work?

Hadoop has two central systems:

Hadoop Distributed File System (HDFS) is the storage system for Hadoop that is spread out over multiple machines as a means to increase reliability.

The MapReduce engine is the processing framework that filters, sorts, and then uses the data in some way.

How does HDFS work?

HDFS transfers data very rapidly to MapReduce.

When HDFS takes in data, it breaks the information down into separate blocks and distributes them to different nodes in the cluster, so tasks can run on the data in parallel and work more efficiently.

The Hadoop Distributed File System is specifically designed to be highly fault-tolerant.  The file system replicates each piece of data multiple times (the number of copies is known as the replication factor) and distributes the copies to individual nodes, placing at least one copy on a different server rack than the others.  If a node holding valuable data crashes, that data can still be read from one of its replicas on another node.

HDFS uses a controller/agent architecture.  Each Hadoop cluster initially consisted of a single NameNode that managed file system operations and supporting DataNodes that handled data storage on individual compute nodes.  Together, these HDFS elements support applications with large data sets.
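
To make this concrete, here is a minimal sketch (assuming a cluster whose core-site.xml and hdfs-site.xml are on the classpath) that uses the Hadoop FileSystem Java API to inspect how HDFS has split and replicated a file.  The path /data/sample.txt is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster's default file system (typically HDFS,
        // as configured in core-site.xml).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used for illustration only.
        Path file = new Path("/data/sample.txt");

        // FileStatus reports the block size and replication factor that
        // HDFS applied to this file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // Each BlockLocation lists the DataNodes holding a replica of a block,
        // which is how the NameNode exposes where the copies live.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```

Run against a multi-node cluster, this would show each block reported on several hosts, which is the replication behavior described above.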

How does MapReduce work?

Apache Hadoop MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster.  Data analysis uses two steps:

1.  Map Process

2.  Reduce Process

The top-level unit of work in MapReduce is a job.  A job usually has a map phase and a reduce phase, though the reduce phase can be omitted.  For example, consider a MapReduce job that counts the number of times each word is used across a set of documents.  The map phase counts the words in each document, then the reduce phase aggregates the per-document data into word counts spanning the entire collection.
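
Below is a minimal sketch of that word-count job written against the Hadoop MapReduce Java API.  The class names (WordCount, TokenizerMapper, SumReducer) are illustrative choices for this post, not part of any particular distribution.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every word in each input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the per-document counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable total = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }
}
```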

During the map phase, the input data is divided into input splits for analysis by map tasks running in parallel across the Hadoop cluster.  By default, the MapReduce framework gets input data from the Hadoop Distributed File System (HDFS).

The reduce phase uses results from map tasks as input to a set of parallel reduce tasks.  The reduce tasks consolidate the data into final results.  By default, the MapReduce framework stores results in HDFS.
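
A driver program ties the two phases together and points the job at HDFS paths for input and output.  This sketch reuses the hypothetical WordCount.TokenizerMapper and WordCount.SumReducer classes from the example above and takes the input and output paths as command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer from the sketch above (illustrative class names).
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class); // optional local aggregation
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits are read from HDFS, and final results are written back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Setting the reducer as a combiner is an optional optimization: it pre-aggregates counts on each node before the shuffle, which works here because summing word counts is associative.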

Summary

We have just scratched the surface with this Hadoop overview.  This will give you a 1000-foot view of what Hadoop is and how it can benefit your business.