Monday, June 10, 2013

Unix Administration Learning MAP

Big Data – What is it???


Most of the technology geeks may have heard the recent buzz about Big Data; in recent times many of my colleagues and friends were asking several questions. So I thought I should write a blog post to better answer their questions.
What is Big Data?
Big Data is defined ‘n’ number of ways in the industry, so instead of trying to find the actual definition lets try to understand the concepts and idea behind it.
As the name says “big data” you may think it is all about size – but that is not just that. There is a lot more to deal with and enormous number of use cases in multiple domains. One of the ways to explain BigData is “V3
V – Volume V – Variety V – Velocity


Picture Source:
Another approach of this is “ABC
A – Analytics B – Bandwidth C – Capacity
One can say “V3” or “ABC” both as the characteristics of big data rather than as a definition.
Let’s get into more details of the V’s
V’ – Volume of Data:
The sheer “volume” of data being stored is exploding. 90% of the world’s data is generated from last 4 years. We expect this number to reach 35 Zettabytes (ZB) by 2020. Companies like Facebook, Twitter, CERN generates Terabytes of data every single day. Interestingly 80% of the world’s data is unstructured, businesses today do not have enough resources or technology to store this and turn it as “useful” data or in other words its is very hard to get information out of the available data.
One of the well-observed phenomena is the data available to an organization is “raising” where as the percent of data an organization can process is declining, this is kind of depressing as a technology lover. But don’t feel bad, we have Hadoop to take fix this ☺
V’ – Variety of Data:
With the growing amounts of data we now have a new challenge to deal with: its variety. With growing variety of sources we have seen “variety of data” to deal with; sensors, social networking, feeds, smart devices, location info, and many more. This has left us in a complex situation, because it not only has traditional relational data (very less percent) but majority of it is raw, semi structured & unstructured data from web logs, click-stream data, search indexes, email, photo videos and soon.
For us to handle this kind of data on a traditional system is impossible “0”. We need a fundamental shift in analysis requirement from traditional structured data to include “variety” of data. But as traditional analytic platforms can’t handle variety due to the nature of its built for supporting relational kind that’s neatly formatted and fits nicely into the strict schemas.
As we know the “truth” about 80% data is left unprocessed we now have a need to build a system to efficiently store and process non-relational data and here by perform required analytics and generate report via the Business Intelligence (BI) tools and make real value to the business and to its growth.
‘V’ – Velocity of Data:
Just as the sheer volume and variety of data we collect and store has changed, so, too, has the “velocity” at which it is generated and needs to be handled. As we know the growth rates associated with data repositories is very high with growth in the number of source.
Rather than confining the idea of velocity the above mentioned we could intercept it as “data in motion”: ‘The speed at which data is flowing’:
Two most important challenges to deal with are:
1. Data in motion
2. Data in rest
Dealing effectively with Big Data requires us to perform analytics against the volume and variety of data while it is “still in motion”, not just after it is at “rest”.
Consider a fraud prevention at real time use case: Lets say a credit card is cloned and used at two different locations at the same time, with our existing ‘traditional’ systems we have lag involved to detect this. But imagine if we have “real time” data processing and analyzing technology to prevent this. Its just wonderful as it sounds.
Why Big Data?
• To analyze not only raw structured data, but semi structured, unstructured data from a variety of sources.
• To effectively process and analyze larger set of data instead of analyzing sample of the data.
• To solve information challenges that don’t natively fit within a traditional relational data base approach for handling the v3.
• To improve “intelligence in business” and to take quicker actions by developing “real” B.I tools and reaching the customer needs like never before.
• To develop business patterns and trends in “real time”
• To improve the quality of business in various sector like e-health, retail, IT , call centers, agriculture & so on.
“To handle, process the data and do magical things that were never imagined by anyone”
Working With Big data:
Google in its initial days was successfully able to download the Internet and index the available data when it was small. But when data started growing and new sources started increasing everyday things became complex to handle. So Google come with up solution internally to process this growing volume in a completely different way.
In that process they have started developing GFS – Google File System and also something called Map-Reduce (M to efficiently manage this growing data. But Google has kept this for their internal use and has not open sourced it. They have published a paper in 2004 called “Map-Reduce” to explain what and how this data is processed to make the internet searches possible.
Using that paper people in the industry started thinking in a different way. A guy named “Doug” has started developing a repository to handle the growing and unstructured data which is named as “Hadoop”, this is a open source project and is been actively developed and highly contributed by “Yahoo”.

Introduction to Hadoop


Hadoop is a platform that is well suited to deal with semi-structured & unstructured data, as well as when a data discovery process is needed. That isn’t to say that Hadoop can’t be used for structured dara that is readily available in a raw format; because it can.
Traditionally, data goes through a lot of rigor to make it into the warehouse. This data is cleaned up via various cleansing, enrichment, modeling, master data management and other services before it is ready for analysis; which is expensive process. Because of that expense, its clear that data that lands in warehouse is not just high value, but has a broad purpose; it is used to generate reports & dash-board where the accuracy is the key.
In contrast, Big Data repositories very rarely undergo the full quality control versions of data injected into a warehouse, Hadoop is built for the purpose of handling larger volumes of data, so prepping data and processing it should be cost prohibitive.
I say Hadoop as a system designed for processing mind-boggling amounts of data
Two main components of Hadoop:
1. Map – Reduce = Computation
2. HDFS = Storage
Hadoop Distributed file system (HDFS):
hdfs arch
Let’s discuss about the Hadoop cluster components before getting into details of HDFS.
A typical Hadoop environment consists of a master node, worker nodes with specialized software components.
Master node: There will be multiple master nodes to avoid single point of failure in any environment. The elements of master node are
1. Job Tracker
2. Task tracker
3. Name tracker
Job Tracker: Job tracker interacts with client applications. It is mainly responsible for distributing Map. Reduce tasks to particular nodes with in a cluster.
Task tracker: This process receives the tasks from a job tracker like Map, Reduce and shuffle.
Name node: All these processes are charged with storing a directory free of all files in the HDFS. They also keep track of where the file data is kept within the cluster. Client applications contact name nodes when they need to locate a file, or add, copy as delete a file.
Data Node: Data nodes stores data in the HDFS, it is responsible for replicating data across clusters. These interact with client apps and Name node supplied the data node’s address.
Worker Nodes: These are the commodity servers for processing the data that is coming through. Each worker node includes a data node and a task tracker
Scenario to better understand how “stuff” works:
1. Let’s say we have a 300mb file
2. By default we make it as 128mb blocks 
300mb= 128mb + 128mb + 44mb 
3. So HDFS splits 300mb into blocks as above
4. HDFS will keep 3 copies of each block
5. All these blocks are stored on data nodes
Bottom line is, Name node tracks blocks & data nodes and pays attention to all nodes in cluster. It do not save any data and no data goes through it.
• When a Data node (DN) fails it makes sure the copies are copied to another node and can handle upto 2 DN’s failure.
• Name node (NN) is a single point of failure.
• DN’s continuously runs check sums, if any block is corrupted then it will be process from other DN’s replicas.
There is lot more to discuss but let’s move on to M-R for now.
Map Reduce (M-R)
Google invented this. The main characteristics of M-R are:
1. Sort/merge is the primate
2. Batch oriented
3. Ad hoc queries (no schema)
4. Distribution handled by frame work
Let’s make it simple to understand, we get TB’s & PB’s of data to get processed & analyzed. So to handle it we use MR which basically has two major phases map & reduce.
Map: MR uses key/value pairs. Any data that comes in will be Splitted by HDFS into blocks and then we process it through M-R where we assign a value to every key.
Example: “Gurukulindia is the best site to learn big data”
Just to list network view & logical view and to make good view:
1. Input step: Load data into HDFS by splitting & load to DN’S. The blocks are replicated to overcome failures. The NN keeps track of blocks & DN’s.
2. Job step: Submit the MR Job & its details to the Job tracker.
3. Job init step: The Job tracker interacts with task tracker on each DN to schedule MR tasks.
4. Map step: Mapper process data blocks and generates a list of key value pairs
5. Sort step: Mapper sorts list of key value pair
6. Shuffle: Transfers mapped output to the reducers in sorted fashion.
7. Reduce: Reduces merge list of key value pairs to generate final result.
The results of Reduces are finally stored in HDFS replicated as per the configuration and then clients will be able to read from HDFS.