Hadoop Ecosystem in Big Data

TjMan 30/Mar/2019 Bigdata
Introduction:

The Hadoop ecosystem is basically a combination of many different components designed for different use cases. In Hadoop version 1, there was MapReduce as a processing engine on top of a data storage layer called HDFS (Hadoop Distributed File System). Because the MapReduce framework was written entirely in Java, the only way to process data was to write the data-processing logic as Java MapReduce code, which was a very difficult task. For this reason, when Hadoop version 2 was released, the developers introduced a general-purpose processing engine called YARN (Yet Another Resource Negotiator), which can run not only MapReduce but also code in other supported frameworks such as Spark with Scala, R or Python. YARN also addressed many limitations related to job execution. In the picture below you can see all the primary Hadoop components, which we will classify based on their purpose and discuss their use cases.

Component classification based on purpose:

  • Storage Layer
  • Processing Layer
  • Ingestion Layer
  • Administration Layer
  • Analytics Layer and BI tools
[Figure: Hadoop Ecosystem components]
Storage Layer:

HDFS:

As we are talking about big data, imagine that you have 1TB of space in your computer but need to process 2TB of data. What would your approach be? Probably you would take 1TB of the data first, process it, save the result somewhere, then take the remaining 1TB, process it, and finally combine both results. Hadoop does the same, but in a different way. Each Hadoop cluster consists of many nodes; let's assume we have a cluster of 4 nodes, each with 1TB of space (a node is a computer, and a cluster is a group of computers working together). Before we can process the 2TB file we need to store it in the cluster, so the HDFS framework divides the file into smaller blocks and stores each block on some node in the cluster. What we just described is called a distributed file system. The default block size in HDFS is 128MB for Hadoop 2 (64MB for Hadoop 1), so in our example HDFS will split the 2TB file into 16384 blocks (2TB / 128MB) and spread them across the cluster.

Now we have distributed data, but if we have any number of such large or small files in the cluster, who keeps track of which block belongs to which file? HDFS has two components (daemons/processes) called the name node and the data node. The name node keeps track, at cluster level, of which block belongs to which file and where it is stored. The data node process maintains the data at node level. A cluster has only one name node and many data nodes (one per node). I will discuss this in greater detail in a dedicated HDFS architecture article.
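
To make the name node's role a bit more concrete, here is a minimal sketch (not part of the original example) that uses the standard Hadoop FileSystem API to ask the name node how a file is split into blocks and which data nodes hold each block. The class name and the path passed as an argument are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS etc. from the cluster's core-site.xml/hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The file to inspect is passed as the first argument, e.g. /data/bigfile.txt (illustrative).
        Path file = new Path(args[0]);
        FileStatus status = fs.getFileStatus(file);

        // The name node answers this query: every block of the file and the data nodes holding it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("File size  : " + status.getLen() + " bytes");
        System.out.println("Block count: " + blocks.length);
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}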

Processing Layer:

YARN:

As a distributed processing engine, YARN also has two processes: one that coordinates all the jobs in the cluster, called the resource manager, and another on each node that manages the tasks running on that node (the node manager). A task is basically a sub-part of a job. Let's continue with the example from the HDFS section: we have 2TB of data stored in a distributed fashion across the cluster. Now suppose I want to know how many lines the file contains, so I write a MapReduce program and submit it to YARN for processing. The resource manager takes the request and starts a job. To count the lines in the file, we need to count the lines in each of the 16384 blocks and then sum the results, so the job will consist of 16384 such tasks plus a final step to combine their outputs. These tasks are executed in parallel on the four nodes, so it is very fast. That is how YARN works at a very high level; I will go into YARN's architecture in another article.
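
As a rough back-of-the-envelope illustration (using the assumed 4-node cluster and the default 128MB block size from above), this small snippet computes how many map tasks the job would need and how many would land on each node on average; in reality YARN schedules tasks in waves based on available resources rather than a fixed per-node quota.

public class TaskMath {
    public static void main(String[] args) {
        long fileSize  = 2L * 1024 * 1024 * 1024 * 1024; // 2TB in bytes
        long blockSize = 128L * 1024 * 1024;             // 128MB HDFS block
        long mapTasks  = fileSize / blockSize;           // one map task per block
        long nodes     = 4;                              // assumed cluster size
        System.out.println("Map tasks in the job: " + mapTasks);         // 16384
        System.out.println("Avg tasks per node  : " + mapTasks / nodes); // 4096
    }
}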

MapReduce:

MapReduce is a programming framework for Hadoop, largely based on Java. As the name suggests, it has two stages called 'Map' and 'Reduce'. To continue with the example: we want to know the number of lines in the 2TB file, so first, for each block, we count its lines, which we can do in the Map stage. Then, in the Reduce stage, we sum the per-block counts to get the overall count for the file. From YARN's perspective, whatever code we write with MapReduce has to be packaged into a job file (for Java, a JAR). When the job is submitted to YARN, it creates a mapper for each block and stores the intermediate line counts in local storage. Once all the mappers have finished, YARN starts the reducers to sum all the line counts; the user can specify the number of reducers. These mappers and reducers are nothing but YARN tasks. I will write more articles on MapReduce for a deeper understanding.

Code:

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class LineCount {

    // Mapper: emits ("Total Lines", 1) once for every input line (one map task per HDFS block).
    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text("Total Lines");

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            output.collect(word, one);
        }
    }

    // Reducer (also used as the combiner): sums the 1s to produce the total line count.
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LineCount.class);
        conf.setJobName("LineCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // HDFS input path
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // HDFS output directory
        JobClient.runJob(conf);
    }
}
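
To run this on a cluster, you would typically compile the class, package it into a JAR and submit it with the standard launcher, for example: hadoop jar LineCount.jar LineCount <input path> <output path> (the JAR name and paths are placeholders). YARN then starts one map task per HDFS block of the input and the configured number of reduce tasks to produce the final count.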

Hive:

As you can see from the example above, it is quite difficult to write code in MapReduce. Most companies faced the same problem until Hive was released: it was really time-consuming to write such complex code to answer a simple question, so Facebook started working on a Hadoop solution that could process SQL-like queries. Hive is similar to SQL and supports most SQL-style queries for analyzing data; Hive's SQL dialect is known as HQL. Basically, the Hive server takes an HQL query and converts it into a MapReduce job to process the data. Hive is also an interactive component: you fire a query and get the result on screen. The most useful feature of Hive is its metastore, where Hive keeps all of its metadata. So what is a metastore? Suppose I have many 2TB files in the cluster; the name node knows where the files and their blocks are, but that information is for internal use only. From a user and business perspective, we need to organize and categorize the files and define tables on top of them, and Hive helps you do exactly that for your business data. Initially you can think of Hive as a database, although it is very different underneath. It is an important topic which I will discuss in depth in upcoming articles.
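
To show how much simpler this is, here is a hedged sketch of the same line-count problem expressed through Hive. It assumes a HiveServer2 instance reachable at localhost:10000, the Hive JDBC driver on the classpath, and a hypothetical table big_file defined over the 2TB file (for example with one string column per line); none of these names come from the original article.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveLineCount {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port and database for your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement()) {
            // One HQL statement replaces the whole MapReduce program above;
            // Hive compiles it into a MapReduce job behind the scenes.
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM big_file");
            if (rs.next()) {
                System.out.println("Total lines: " + rs.getLong(1));
            }
        }
    }
}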

PIG:

Yahoo initiated the development of PIG; the purpose was to build a scripting framework that is syntactically much simpler than MapReduce but still versatile. Pig Latin is the programming language created for this purpose. PIG is not very popular anymore, as better alternatives like Spark are available in the ecosystem, and many people speculate that PIG will be discontinued with Hadoop 3.

Spark:

Spark is a distributed cluster-computing framework that runs on resource managers such as YARN and Mesos, and it is much faster than MapReduce because it is designed for in-memory processing. Thanks to this execution speed, and unlike MapReduce, Spark can be used for near-real-time processing. Spark supports batch processing, real-time stream processing, machine learning and Spark SQL (faster than Hive). I will write a series of articles on Spark to cover it in greater detail.
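
For comparison, here is a minimal sketch of the same line count written against Spark's Java API (assuming a Spark installation that can submit to YARN; the class name and the input path argument are illustrative).

import org.apache.spark.sql.SparkSession;

public class SparkLineCount {
    public static void main(String[] args) {
        // Build (or reuse) a SparkSession; the master and queue are supplied by spark-submit.
        SparkSession spark = SparkSession.builder().appName("SparkLineCount").getOrCreate();

        // Read the file from HDFS and count its lines in memory across the cluster.
        long lines = spark.read().textFile(args[0]).count();
        System.out.println("Total lines: " + lines);

        spark.stop();
    }
}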


To be continued in part 2...
