Monday, June 10, 2013

Introduction to Hadoop


Hadoop is a platform that is well suited to deal with semi-structured & unstructured data, as well as when a data discovery process is needed. That isn’t to say that Hadoop can’t be used for structured dara that is readily available in a raw format; because it can.
Traditionally, data goes through a lot of rigor to make it into the warehouse. This data is cleaned up via various cleansing, enrichment, modeling, master data management and other services before it is ready for analysis; which is expensive process. Because of that expense, its clear that data that lands in warehouse is not just high value, but has a broad purpose; it is used to generate reports & dash-board where the accuracy is the key.
In contrast, Big Data repositories very rarely undergo the full quality control versions of data injected into a warehouse, Hadoop is built for the purpose of handling larger volumes of data, so prepping data and processing it should be cost prohibitive.
I say Hadoop as a system designed for processing mind-boggling amounts of data
Two main components of Hadoop:
1. Map – Reduce = Computation
2. HDFS = Storage
Hadoop Distributed file system (HDFS):
hdfs arch
Let’s discuss about the Hadoop cluster components before getting into details of HDFS.
A typical Hadoop environment consists of a master node, worker nodes with specialized software components.
Master node: There will be multiple master nodes to avoid single point of failure in any environment. The elements of master node are
1. Job Tracker
2. Task tracker
3. Name tracker
Job Tracker: Job tracker interacts with client applications. It is mainly responsible for distributing Map. Reduce tasks to particular nodes with in a cluster.
Task tracker: This process receives the tasks from a job tracker like Map, Reduce and shuffle.
Name node: All these processes are charged with storing a directory free of all files in the HDFS. They also keep track of where the file data is kept within the cluster. Client applications contact name nodes when they need to locate a file, or add, copy as delete a file.
Data Node: Data nodes stores data in the HDFS, it is responsible for replicating data across clusters. These interact with client apps and Name node supplied the data node’s address.
Worker Nodes: These are the commodity servers for processing the data that is coming through. Each worker node includes a data node and a task tracker
Scenario to better understand how “stuff” works:
1. Let’s say we have a 300mb file
2. By default we make it as 128mb blocks 
300mb= 128mb + 128mb + 44mb 
3. So HDFS splits 300mb into blocks as above
4. HDFS will keep 3 copies of each block
5. All these blocks are stored on data nodes
Bottom line is, Name node tracks blocks & data nodes and pays attention to all nodes in cluster. It do not save any data and no data goes through it.
• When a Data node (DN) fails it makes sure the copies are copied to another node and can handle upto 2 DN’s failure.
• Name node (NN) is a single point of failure.
• DN’s continuously runs check sums, if any block is corrupted then it will be process from other DN’s replicas.
There is lot more to discuss but let’s move on to M-R for now.
Map Reduce (M-R)
Google invented this. The main characteristics of M-R are:
1. Sort/merge is the primate
2. Batch oriented
3. Ad hoc queries (no schema)
4. Distribution handled by frame work
Let’s make it simple to understand, we get TB’s & PB’s of data to get processed & analyzed. So to handle it we use MR which basically has two major phases map & reduce.
Map: MR uses key/value pairs. Any data that comes in will be Splitted by HDFS into blocks and then we process it through M-R where we assign a value to every key.
Example: “Gurukulindia is the best site to learn big data”
Just to list network view & logical view and to make good view:
1. Input step: Load data into HDFS by splitting & load to DN’S. The blocks are replicated to overcome failures. The NN keeps track of blocks & DN’s.
2. Job step: Submit the MR Job & its details to the Job tracker.
3. Job init step: The Job tracker interacts with task tracker on each DN to schedule MR tasks.
4. Map step: Mapper process data blocks and generates a list of key value pairs
5. Sort step: Mapper sorts list of key value pair
6. Shuffle: Transfers mapped output to the reducers in sorted fashion.
7. Reduce: Reduces merge list of key value pairs to generate final result.
The results of Reduces are finally stored in HDFS replicated as per the configuration and then clients will be able to read from HDFS.