Basics of Hadoop Ecosystem – Part 1

Blog | Jul 10, 2017

Developed back in 2005, Hadoop is an open-source framework, written in Java, for storing and processing very large data sets in a distributed computing environment. Doug Cutting and Mike Cafarella are the original developers of Hadoop.
Designed to run on commodity hardware, Hadoop works on the basic assumption that hardware failures are common. The Hadoop framework handles these failures itself.

Introduction:

In this blog, I will discuss Big Data, its characteristics, the different sources of Big Data and some key components of the Hadoop Framework.

In this two-part blog series, I will cover the basics of the Hadoop Ecosystem.

Let us start with Big Data and its importance in the Hadoop Framework. Ethics, privacy and security measures are very important and need to be taken care of while dealing with the challenges of Big Data.

Big Data: When the data itself becomes part of the problem.

Data is crucial for all organizations, and it has to be stored for future use. The term Big Data refers to data that is beyond the storage capacity and the processing power of an organization.

What are the sources of this huge data?

There are many sources that generate data on a massive scale, such as social networks, CCTV cameras, sensors, online shopping portals, hospitality data, GPS and the automobile industry.

Big Data can be characterized as:

  • The Volume of the data
  • The Velocity of the data
  • The Variety of the data being processed

Volume of Data → Data is growing rapidly, from GB to TB to PB and beyond, and requires huge amounts of disk space to store.

Velocity of Data → Data is generated and has to be moved at high speed. Huge volumes of data are stored in data centres to cater to organizational needs, and high-speed data processors are needed to get that data to a local workstation.

Variety of Data → Data can be broadly classified into the following types: structured, unstructured and semi-structured.

Big Data = (Volume + Velocity + Variety) of Data

Source: http://whatis.techtarget.com/definition/3Vs

What is the Hadoop Ecosystem?
The Hadoop Ecosystem refers to the various components of the Apache Hadoop software library. It is a set of tools and accessories that address particular needs in processing Big Data.

In other words, a set of different modules interacting together forms the Hadoop Ecosystem.

Below is an overview of the applications, tools, modules and interfaces currently available in the Hadoop Ecosystem, discussed one component at a time.

Let us start with the core components of the Hadoop Framework:

DISTRIBUTED STORAGE:

HDFS

  • It stands for Hadoop Distributed File System.
  • It is a distributed file system for redundant storage.
  • Designed to store data reliably on commodity hardware.
  • Built to expect hardware failures.
  • Intended for large files and batch inserts (write once, read many times).
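
To make the idea of redundant storage concrete, here is a minimal sketch in Python. It is a toy model, not the HDFS API: the function names, the tiny block size and the round-robin placement are all assumptions made for illustration (real HDFS uses much larger blocks and a rack-aware placement policy). It cuts a file into blocks, puts three copies of each block on different nodes, and checks whether the file is still readable after some nodes fail.

```python
# Toy model of HDFS-style block storage. All names here are hypothetical;
# this is not the real HDFS client API.

BLOCK_SIZE = 8     # bytes, kept tiny for the demo; real HDFS blocks are far larger
REPLICATION = 3    # three copies per block, matching HDFS's default replication factor


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int = REPLICATION) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [nodes[(start + k) % len(nodes)]
                               for k in range(replication)]
    return placement


def readable_after_failure(placement: dict[int, list[str]], failed: set[str]) -> bool:
    """A file stays readable if every block has at least one replica on a live node."""
    return all(any(node not in failed for node in replicas)
               for replicas in placement.values())


blocks = split_into_blocks(b"write once, read many times")
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

print(len(blocks), "blocks")
print("readable with two nodes down:",
      readable_after_failure(placement, {"node1", "node2"}))
```

With four nodes and three replicas per block, any two nodes can fail and every block still has a live copy. That is the sense in which the file system is built to expect hardware failures.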

Source: http://www.tdprojecthope.com

HBase (NoSQL Database)
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
• Storage of large data volumes (billions of rows) atop clusters of commodity hardware.
• Bulk storage of logs, documents, real-time activity feeds and raw imported data.
• Consistent performance of reads/writes to data used by Hadoop applications.
• Allows the data store to be aggregated or processed using MapReduce functionality.
• Data platform for analytics and machine learning.
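
The difference between a point query and a range scan is easier to see in code. The sketch below is an in-memory stand-in written in Python, not the HBase client API. The class name `ToyTable`, its methods and the sample rows are all made up for illustration. It assumes the usual HBase-style layout: each row key maps to a set of "family:qualifier" columns, and rows are kept sorted by key.

```python
import bisect


class ToyTable:
    """In-memory stand-in for an HBase-style table (hypothetical, for illustration).

    Each row key maps to a dict of "family:qualifier" -> value.
    """

    def __init__(self):
        self._rows = {}    # row key -> {column: value}
        self._keys = []    # row keys kept in sorted order

    def put(self, row_key, column, value):
        """Write one cell, creating the row if needed."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point query (random read) by row key; empty dict if the row is absent."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield (key, columns) for rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


# Made-up sample data.
table = ToyTable()
table.put("user#002", "info:name", "Asha")
table.put("user#001", "info:name", "Ravi")
table.put("user#001", "activity:last_login", "2017-07-10")
table.put("video#001", "info:title", "Intro")

print(table.get("user#001"))
print([key for key, _ in table.scan("user#", "user$")])
```

Because the keys are sorted, all rows sharing a prefix sit next to each other, so the scan reads one contiguous range rather than the whole table.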

HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce and Hive) to read and write data in tabular form rather than as files.

• Centralized location of storage for data used by Hadoop applications.
• Reusable data store for sequenced and iterated Hadoop processes.
• Storage of data in a relational abstraction.
• Metadata Management.

Once data is stored, we want to examine it and derive insights from it.

DISTRIBUTED PROCESSING:

MapReduce

A distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce algorithm, which breaks every operation down into Map and Reduce functions.
• Aggregation (counting, sorting and filtering) on large and disparate data sets.
• Scalable parallelism of Map or Reduce tasks.
• Distributed task execution.
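
The classic way to see Map and Reduce at work is a word count. The sketch below runs in a single Python process and does not use the Hadoop API; the function names are my own. It only shows the shape of the model: the map step emits (word, 1) pairs, a shuffle step groups the pairs by key, and the reduce step sums each group. On a real cluster the map and reduce calls would be spread over many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Each call to `map_phase` looks at one line only, and each call to `reduce_phase` looks at one key only. That independence is what makes the parallelism scale.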

YARN
Yet Another Resource Negotiator (YARN) is the cluster and resource management layer for the Apache Hadoop ecosystem. It is one of the main features of the second generation of the Hadoop framework.
• YARN schedules applications in order to prioritize tasks and keep big data analytics systems running.
• As one part of a greater architecture, YARN separates resource management from data processing, so that processing engines such as MapReduce can share one cluster.
• It allocates resources to particular applications and handles other resource-monitoring tasks.
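
As a rough illustration of what allocating resources to applications means, here is a toy scheduler in Python. It is only a sketch under my own assumptions and not how YARN is implemented: the `Request` class, the priority convention (lower number first) and the greedy first-fit rule are all hypothetical. Real YARN offers pluggable schedulers with considerably richer policies.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app: str          # application name (made up for the demo)
    priority: int     # lower number is served first (an assumption of this sketch)
    memory_mb: int    # memory the application asks for


def allocate(requests, node_memory_mb):
    """Greedy first-fit: serve requests in priority order, placing each one
    on the first node that still has enough free memory."""
    free = dict(node_memory_mb)    # copy, so the caller's dict is left untouched
    granted, pending = {}, []
    for req in sorted(requests, key=lambda r: r.priority):
        node = next((n for n, mem in free.items() if mem >= req.memory_mb), None)
        if node is None:
            pending.append(req.app)    # no node has room; the app has to wait
        else:
            free[node] -= req.memory_mb
            granted[req.app] = node
    return granted, pending, free


requests = [
    Request("etl-job", priority=2, memory_mb=3072),
    Request("report", priority=1, memory_mb=2048),
    Request("ml-train", priority=3, memory_mb=4096),
]
granted, pending, free = allocate(requests, {"node1": 4096, "node2": 4096})

print("granted:", granted)
print("pending:", pending)
print("free memory:", free)
```

In this run the lowest-priority request does not fit on either node once the other two are placed, so it stays pending until resources are released.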

MACHINE LEARNING

Mahout
Apache Mahout is an open-source project used primarily for creating scalable machine learning algorithms. Mahout is a data-mining framework that normally runs with the Hadoop infrastructure in the background to manage huge volumes of data.
• Mahout offers the coder a ready-to-use framework for doing data-mining tasks on large volumes of data.
• Mahout's algorithms are written on top of Hadoop, so they work well in a distributed environment.
• Mahout lets applications analyse large sets of data effectively and quickly.
• Comes with distributed fitness function capabilities for evolutionary programming, and includes matrix and vector libraries.

WORKFLOW MONITORING & SCHEDULING

Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. It runs workflows of dependent jobs, and allows users to create Directed Acyclic Graphs (DAGs) of workflows whose jobs run in parallel or sequentially in Hadoop.

• Oozie is also very flexible: one can easily start, stop, suspend and rerun jobs.
• It makes it very easy to rerun failed workflows.
• Oozie is scalable and can manage timely execution of thousands of workflows (each consisting of dozens of jobs) in a Hadoop cluster.
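
To show how a DAG decides which jobs run sequentially and which can run in parallel, here is a small Python sketch using the standard library's `graphlib`. It is not Oozie itself, and the job names are made up. The workflow is just a dictionary mapping each job to the jobs it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: job -> set of jobs it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "index": {"clean"},
    "report": {"aggregate", "index"},
}


def execution_waves(dag):
    """Group jobs into waves: every job in a wave has all its dependencies
    finished, so the jobs within one wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()    # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())    # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)                   # mark the whole wave as finished
    return waves


waves = execution_waves(workflow)
for number, jobs in enumerate(waves, start=1):
    print(f"wave {number}: {jobs}")
```

Here "aggregate" and "index" land in the same wave because both depend only on "clean", while "report" has to wait for both of them.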

SCRIPTING:

Pig
We can use Apache Pig for scripting in Hadoop. Pig provides a SQL-like scripting language, Pig Latin, and an execution environment for creating complex MapReduce transformations. Scripts are written in Pig Latin and then translated into executable MapReduce jobs.
Pig also allows the user to create user-defined functions (UDFs) using Java.
• Scripting environment to execute ETL tasks/procedures on raw data in HDFS.
• SQL-like language for creating and running complex MapReduce functions.
• Data processing, stitching and schematizing on large and disparate data sets.
• Pig Latin is a high-level data flow language.
• It abstracts away the low-level details and allows you to focus on data processing.
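
A data flow language describes processing as a chain of steps, each one feeding the next. The Python sketch below mimics that style with plain functions; it is not Pig, and the sample records and function names are made up. The comments show roughly what the equivalent Pig Latin statements could look like, as an approximation rather than a tested script.

```python
from collections import defaultdict

# Made-up sample rows: (user, bytes transferred).
records = [
    ("alice", 120), ("bob", 80), ("alice", 300), ("carol", 150), ("bob", 500),
]


def filter_rows(rows, min_bytes):
    """Keep rows above a threshold.
    Roughly: big = FILTER logs BY bytes > 100;"""
    return [row for row in rows if row[1] > min_bytes]


def group_by_user(rows):
    """Collect the byte counts for each user.
    Roughly: by_user = GROUP big BY user;"""
    groups = defaultdict(list)
    for user, size in rows:
        groups[user].append(size)
    return groups


def total_per_user(groups):
    """Sum each user's group.
    Roughly: totals = FOREACH by_user GENERATE group, SUM(big.bytes);"""
    return {user: sum(sizes) for user, sizes in groups.items()}


# The pipeline reads as one flow: filter, then group, then aggregate.
totals = total_per_user(group_by_user(filter_rows(records, 100)))
print(totals)
```

The script says what should happen to the data at each step and leaves the question of how to run it on the cluster to the execution environment.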

Conclusion:
Hadoop and the MapReduce framework already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis. Hadoop provides the robust, fault-tolerant Hadoop Distributed File System (HDFS). HBase adds a distributed, fault-tolerant, scalable database, built on top of HDFS, with random real-time read/write access to data. Mahout is an Apache project for building scalable machine learning libraries, with most algorithms built on top of Hadoop. Pig is designed for batch processing of data. In part two, I will discuss some more components of the Hadoop Ecosystem. For any questions on the above, click below.
Ask Pavan