Suresh Kumar Pakalapati's Linux Administration: Big Data

Monday, June 10, 2013

Big Data – What is it???

Ref: http://gurkulindia.com/main/2013/06/bigdata/

Most technology geeks have probably heard the recent buzz about Big Data, and lately many of my colleagues and friends have been asking me questions about it. So I thought I should write a blog post to answer them properly.

What is Big Data?

Big Data is defined in ‘n’ number of ways in the industry, so instead of hunting for the one true definition, let’s try to understand the concepts and ideas behind it.

Going by the name “big data”, you may think it is all about size, but that is not the whole story. There is a lot more to deal with, and an enormous number of use cases across multiple domains. One way to explain Big Data is “V3”:

V – Volume, V – Variety, V – Velocity

Picture Source: Wired.com

Another way to describe it is “ABC”:

A – Analytics, B – Bandwidth, C – Capacity

Both “V3” and “ABC” are better read as characteristics of big data than as a definition.

Let’s get into more details of the V’s

‘V’ – Volume of Data:

The sheer “volume” of data being stored is exploding. An often-quoted figure is that 90% of the world’s data was generated in the last 4 years, and the total is expected to reach 35 zettabytes (ZB) by 2020. Organizations like Facebook, Twitter and CERN generate terabytes of data every single day. Interestingly, about 80% of the world’s data is unstructured, and businesses today do not have enough resources or technology to store it and turn it into “useful” data. In other words, it is very hard to get information out of the available data.

One well-observed phenomenon is that the data available to an organization is rising, whereas the percentage of that data the organization can process is declining. That is kind of depressing for a technology lover. But don’t feel bad, we have Hadoop to fix this ☺

‘V’ – Variety of Data:

With the growing amount of data we now have a new challenge to deal with: its variety. As the variety of sources grows, so does the variety of data: sensors, social networking, feeds, smart devices, location info, and many more. This leaves us in a complex situation, because only a very small percentage of it is traditional relational data; the majority is raw, semi-structured and unstructured data from web logs, click-stream data, search indexes, email, photos, videos and so on.

Handling this kind of data on a traditional system is practically impossible. We need a fundamental shift in analysis requirements, from traditional structured data alone to a “variety” of data. Traditional analytic platforms can’t handle variety because they were built to support relational data that is neatly formatted and fits nicely into strict schemas.

Since roughly 80% of data is left unprocessed, we need to build systems that efficiently store and process non-relational data, perform the required analytics on it, and generate reports via Business Intelligence (BI) tools, so that the data brings real value to the business and its growth.

‘V’ – Velocity of Data:

Just as the sheer volume and variety of data we collect and store has changed, so, too, has the “velocity” at which it is generated and needs to be handled. The growth rate of data repositories is very high, and it keeps rising with the number of sources.
Rather than confining the idea of velocity to growth rates alone, we could interpret it as “data in motion”, that is, the speed at which data is flowing.

The two most important challenges to deal with are:

1. Data in motion
2. Data at rest

Dealing effectively with Big Data requires us to perform analytics against the volume and variety of data while it is “still in motion”, not just after it is at “rest”.

Consider real-time fraud prevention as a use case. Let’s say a credit card is cloned and used at two different locations at the same time. With our existing ‘traditional’ systems there is a lag before this is detected. But imagine having “real time” data processing and analysis technology that can catch it as it happens. It is just as wonderful as it sounds.
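To make the cloned-card example a little more concrete, here is a minimal sketch in Python. It is only an illustration, not how any real fraud system works: the function name `detect_cloned_cards`, the 5-minute window, the tuple layout and the sample data are all assumptions made up for this post, and the sketch assumes the transactions arrive in time order.

```python
from datetime import datetime, timedelta


def detect_cloned_cards(transactions, window=timedelta(minutes=5)):
    """Yield (card_id, earlier_location, later_location) whenever one card
    is used at two different locations within `window` of each other.

    `transactions` is any iterable of (card_id, location, timestamp) tuples,
    assumed to arrive in timestamp order (hypothetical layout).
    """
    last_seen = {}  # card_id -> (location, timestamp) of the latest use
    for card_id, location, timestamp in transactions:
        previous = last_seen.get(card_id)
        if previous is not None:
            prev_location, prev_time = previous
            # Same card, different place, too close together in time.
            if prev_location != location and timestamp - prev_time <= window:
                yield (card_id, prev_location, location)
        last_seen[card_id] = (location, timestamp)


# Made-up sample stream: card-1 shows up in two cities one minute apart.
stream = [
    ("card-1", "Hyderabad", datetime(2013, 6, 10, 10, 0, 0)),
    ("card-2", "Chennai", datetime(2013, 6, 10, 10, 0, 30)),
    ("card-1", "London", datetime(2013, 6, 10, 10, 1, 0)),
    ("card-2", "Chennai", datetime(2013, 6, 10, 10, 2, 0)),
]

alerts = list(detect_cloned_cards(stream))
print(alerts)
```

Because the function is a generator, each alert is produced as soon as the suspicious transaction is seen, which is the “data in motion” idea: the check happens while the stream is flowing, not in a batch job afterwards.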

Why Big Data?

• To analyze not only raw structured data, but also semi-structured and unstructured data from a variety of sources.
• To effectively process and analyze larger sets of data instead of analyzing just a sample of the data.
• To solve information challenges that don’t natively fit within a traditional relational database approach for handling the V3.
• To improve “intelligence in business” and to take quicker actions by developing “real” BI tools and meeting customer needs like never before.
• To discover business patterns and trends in “real time”.
• To improve the quality of business in various sectors like e-health, retail, IT, call centers, agriculture and so on.

“To handle, process the data and do magical things that were never imagined by anyone”

Working With Big Data:

In its initial days, Google was able to download the Internet and index the available data while it was still small. But as the data kept growing and new sources appeared every day, things became too complex to handle. So Google came up with an internal solution to process this growing volume in a completely different way.

In that process they started developing GFS (the Google File System) and something called Map-Reduce to manage this growing data efficiently. Google kept these for internal use and did not open source them. It did, however, publish a paper on “Map-Reduce” in 2004 to explain how this data is processed to make Internet searches possible.
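The Map-Reduce idea can be sketched in a few lines of plain Python using the classic word-count example. This is only a single-machine illustration of the map, shuffle and reduce steps, not Google’s or Hadoop’s implementation; the function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are assumptions made up for this post.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


# Made-up input; in a real cluster each document could sit on a different machine.
documents = ["big data is big", "data in motion"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The point of splitting the work this way is that each map call looks at only one document and each reduce call looks at only one key, so in principle both steps can be spread over many machines.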

Using that paper, people in the industry started thinking in a different way. A developer named Doug Cutting started building a framework to handle this growing, unstructured data, which he named “Hadoop”. It is an open source project that is actively developed, with heavy contributions from Yahoo.