What is Big Data?
“Big Data” is a catchphrase that has been bubbling up from the high-performance computing niche of the IT market. Increasingly, suppliers of processing and storage virtualization software have begun to flog “Big Data” in their presentations. What, exactly, does this phrase mean?
“Big data” is data that becomes large enough that it cannot be processed using conventional methods.
Web search engines, social networks, mobile phones, sensors, and science contribute petabytes of data on a daily basis. Scientists, intelligence analysts, governments, meteorologists, air traffic controllers, architects, civil engineers: nearly every industry and profession is experiencing the era of big data. Add the fact that the democratization of IT has made everyone a (sort of) data expert, familiar with searches and queries, and we are seeing a huge burst of awareness of big data.
An example often cited is how much weather data is collected on a daily basis by the U.S. National Oceanic and Atmospheric Administration (NOAA) to aid in climate, ecosystem, weather and commercial research. Add that to the masses of data collected by the U.S. National Aeronautics and Space Administration (NASA) for its research and the numbers get pretty big.
Much of this data has multifaceted and undiscovered relationships; it does not fit simply into relational models.
Practical examples of big data processing include:
·Discovering “People You May Know” and similar recommendations
·Member and Company Derived Data
·User’s network statistics
·Who Viewed My Profile?
·User’s History Service
·Natural Language Processing
·Mobile Social Network Hacking
·Web crawlers/page scraping
·Text to Speech
·Machine generated Audio & Video with remixing
·Automatic PDF creation & IR
·Batch-processing large RDF datasets for indexing. RDF extends the linking structure of the Web, using URIs to name the relationship between things as well as the two ends of the link.
·Executing long-running offline SPARQL queries
D. GumGum – In-image ad network
·GumGum is an analytics and monetization platform for online content.
·Image and advertising analytics
E. Lineberger Comprehensive Cancer Center – Bioinformatics Group
·Accumulating and analyzing next-generation sequencing data produced for the Cancer Genome Atlas project and other groups.
F. Pharm2Phork Project – Agricultural Traceability
·Processing of observation messages generated by RFID/barcode readers as items move through the supply chain.
·Analysis of BPEL-generated log files for monitoring and tuning of workflow processes.
Why it is important for enterprises to look into this
Human-generated data fits well into relational tables or arrays; examples are conventional transactions: purchase/sale, inventory/manufacturing, employment status changes, and so on.
Another type is machine-generated data. Machines produce unstoppable streams of big data:
·Satellite telemetry (espionage or science)
·Temperature and environmental sensors
·Video from security cameras
·Outputs from medical devices
·Seismic and geophysical sensors
Big data that doesn’t conform to known models is discarded or sent to archive un-analyzed. As a result, enterprises miss information, insight, and opportunities to extract new value.
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies being applied to big data include massively parallel processing (MPP) databases, data-mining frameworks such as Apache Hadoop, distributed file systems, distributed databases, MapReduce algorithms, cloud computing platforms, the Internet, and archival storage systems.
MapReduce is a programming model and an associated implementation for processing and generating big data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Computational processing can take place on data stored either in a file system (unstructured) or within a database (structured). Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
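The map/shuffle/reduce cycle just described can be sketched in a few lines of Python. This is an in-process simulation for illustration only, not Hadoop itself, and the function names are made up:

```python
# Minimal sketch of the MapReduce programming model (illustrative names;
# real Hadoop jobs are written against its Java API and run on a cluster).
from collections import defaultdict

def map_fn(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share a key."""
    return (word, sum(counts))

def run_mapreduce(documents, map_fn, reduce_fn):
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["big data is big", "data at scale"], map_fn, reduce_fn)
# counts == {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'scale': 1}
```

In a real deployment the shuffle happens over the network between mapper and reducer machines, which is what makes the model scale across a cluster.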
[Figure: MapReduce data flow (Source: Google)]
There are two ways to process big data with MapReduce: 1) HPC, and 2) cloud computing.
HPC encompasses advanced computing, communications, and information technologies: scientific workstations, supercomputer systems, high-speed networks, and special-purpose and experimental systems. For big data processing, a new generation of large-scale parallel systems is used, with application and system software components tightly integrated and linked over a high-speed network.
The second way to process big data is with cloud computing, which may prove a key breakthrough in data processing because of its benefits:
·Easy and inexpensive set-up because hardware, application and bandwidth costs are covered by the provider
·Scalability to meet needs.
·No wasted resources because you pay for what you use.
There are different ways to implement big data processing in the cloud, such as 1) Hive, 2) Pig, and 3) Hadoop.
Hive provides a rich set of tools, in multiple languages, to perform SQL-like data analysis on data stored in HDFS. Pig is used for writing SQL-like operations that apply to datasets; the Pig project provides a compiler that produces MapReduce jobs from a Pig Latin script. Our main focus here is on Hadoop for big data processing in the cloud.
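To illustrate what an SQL-like operation compiles down to, here is a hedged Python sketch of a GROUP BY/COUNT expressed as a single map/shuffle/reduce pass. The rows and column names are invented for the example; Hive and Pig emit real Hadoop jobs rather than this in-process simulation:

```python
# Sketch of how "SELECT page, COUNT(*) ... GROUP BY page" can be lowered
# to one map/reduce pass (illustrative data; not Hive or Pig output).
from collections import defaultdict

rows = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/pricing"},
]

# Map: project the grouping key and emit (key, 1) pairs, grouped by key.
intermediate = defaultdict(list)
for row in rows:
    intermediate[row["page"]].append(1)

# Reduce: aggregate per key, exactly what COUNT(*) requires.
page_counts = {page: sum(ones) for page, ones in intermediate.items()}
# page_counts == {'/home': 2, '/pricing': 1}
```

The appeal of Hive and Pig is that analysts write the query-level description and the compiler takes care of generating and scheduling the equivalent MapReduce jobs.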
Apache Hadoop is a software framework inspired by Google’s MapReduce and Google File System (GFS) papers.
Hadoop and its Use Cases
Hadoop MapReduce is a programming model for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. 
Hadoop processes and analyzes a variety of new and older data to extract meaningful business operations intelligence. Traditionally, data moves to the computation node; in Hadoop, data is processed where it resides. The types of questions Hadoop helps answer are:
·Event analytics: what series of steps leads to a purchase or registration?
·Large scale web click stream analytics
·Revenue assurance and price optimizations
·Financial risk management, affinity engines, etc.
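As a hedged illustration of the event-analytics question above, the sketch below groups clickstream events per user and counts the step sequences that end in a purchase. The event data and field layout are invented for the example; a real job would run over HDFS log files:

```python
# Event analytics sketch: which paths of steps precede a purchase?
# (Invented sample data; a production job would read clickstream logs.)
from collections import Counter

events = [  # (user, timestamp, event)
    ("u1", 1, "search"), ("u1", 2, "view_item"), ("u1", 3, "purchase"),
    ("u2", 1, "search"), ("u2", 2, "view_item"),
    ("u3", 1, "ad_click"), ("u3", 2, "view_item"), ("u3", 3, "purchase"),
]

# Map/shuffle: group events by user, ordered by timestamp.
sessions = {}
for user, ts, event in sorted(events, key=lambda e: (e[0], e[1])):
    sessions.setdefault(user, []).append(event)

# Reduce: count each path of steps that ends in a purchase.
paths = Counter(
    " -> ".join(s[:-1]) for s in sessions.values() if s and s[-1] == "purchase"
)
# paths == Counter({'search -> view_item': 1, 'ad_click -> view_item': 1})
```

At scale, the grouping step is Hadoop's shuffle and the path counting is the reduce, so the same logic spreads naturally across a cluster.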
How does cloud computing come into the picture?
In cloud computing, a few options are available for Hadoop implementation: 1) Amazon IaaS, 2) Amazon Elastic MapReduce, and 3) Cloudera.
Amazon Elastic Compute Cloud (Amazon EC2 / IaaS) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. If you run Hadoop on Amazon EC2, you might consider using Amazon S3 for accessing job data (data transfer between S3 and EC2 instances is free). Initial input can be read from S3 when a cluster is launched, and the final output can be written back to S3 before the cluster is decommissioned. Intermediate, temporary data, needed only between MapReduce passes, is more efficiently stored in Hadoop's DFS. This became a popular way to process big data, which led to the emergence of another service, Amazon Elastic MapReduce.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.
[Figure: Amazon Elastic MapReduce]
It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon S3. In a nutshell, the Elastic MapReduce service runs a hosted Hadoop instance on an EC2 instance (the master). It can instantly provision other pre-configured EC2 instances (slave nodes) to distribute the MapReduce work. All nodes are terminated once the MapReduce tasks complete.
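The provision-run-terminate cycle can be captured in a cluster request like the following sketch. All concrete values here (bucket paths, instance types, release label, script names) are placeholder assumptions, not values from the original text; with the boto3 SDK such a dict would be passed to the EMR client's run_job_flow call:

```python
# Hedged sketch of an Elastic MapReduce cluster request (placeholder values).
job_flow = {
    "Name": "wordcount-demo",
    "ReleaseLabel": "emr-5.36.0",  # assumed EMR release
    "Instances": {
        "MasterInstanceType": "m5.xlarge",  # the hosted Hadoop master
        "SlaveInstanceType": "m5.xlarge",   # pre-configured slave nodes
        "InstanceCount": 3,                 # 1 master + 2 slaves
        # Terminate all nodes once the MapReduce steps complete:
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-input", "s3://my-bucket/input/",    # initial input from S3
                "-output", "s3://my-bucket/output/",  # final output back to S3
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
            ],
        },
    }],
}
```

The `KeepJobFlowAliveWhenNoSteps: False` setting is what gives the service its pay-for-what-you-use character: the cluster exists only for the duration of the job.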
Cloudera has two products: Cloudera’s Distribution for Hadoop (CDH) and Cloudera Enterprise. CDH is a data management platform (incorporates HDFS, Hadoop MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, Zookeeper and Hue). It is available free under an Apache license.
Cloudera Enterprise is a package which includes Cloudera’s Distribution for Hadoop, production support and tools designed to make it easier to run Hadoop in a production environment. Cloudera offers services including support, consulting services and training (both public and private).
The Cloudera's Distribution for Hadoop (CDH) cloud scripts enable you to run Hadoop on cloud providers' clusters. There is no need to install the RPMs for CDH or do any configuration; a working cluster starts immediately with one command. Cloudera supports Amazon EC2 only, providing Amazon Machine Images and associated launch scripts that make it easy to run CDH on EC2. CDH itself, being open source, is free; the management services are paid.