NOSQL and Cloud Computing


Introduction

Cloud computing is moving from being an IT buzzword to a reasonable and reliable way of deploying applications on the Internet. IT managers are already considering deploying some of their companies' applications in the cloud. A cloud-related trend that developers have been paying attention to is "NoSQL", a set of operational-data technologies based on non-relational concepts. NoSQL represents a sea change: it asks us to consider data storage options beyond the traditional SQL-based relational database.


Accordingly, a new set of open-source distributed databases is springing up to leverage the facilities and services provided by cloud architectures. Web applications and databases in the cloud are thus undergoing major architectural changes to take advantage of the scalability the cloud provides. This article is intended to provide insight into NoSQL in the context of cloud computing.

Face off ~ SQL, NOSQL & Cloud Computing

A key disadvantage of SQL databases is that they operate at a high level of abstraction: to execute a single statement, SQL often requires the data to be processed multiple times, which costs time and performance. A 'join', for instance, triggers multiple passes over the data. Cloud computing environments need high-performing and highly scalable databases.

NoSQL databases are built without relations: no joins, just scalability. But is it really that good to give up relations? NoSQL databases typically emphasize horizontal scalability via partitioning, which puts them in a good position to leverage the elastic provisioning capabilities of the cloud.

The general definition of a NOSQL data store is that it manages data that is not strictly tabular and relational, so it does not make sense to use SQL for the creation and retrieval of the data. NOSQL data stores are usually non-relational, distributed, open-source, and horizontally scalable.

If we look at the big platforms on the web, such as Facebook or Twitter, there are datasets that do not need any relations. The challenge for NoSQL databases is to keep the data consistent. Imagine that a user deletes his or her account: if the data is hosted in a NoSQL database, every table (or collection) has to be checked for data the user has produced in the past, and with NoSQL this clean-up has to be done in application code.
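As a rough illustration of that application-level clean-up, here is a minimal Java sketch. The store, bucket names, and delete logic are all hypothetical stand-ins (plain in-memory maps), not the API of any particular NoSQL product:

import java.util.*;
import java.util.concurrent.*;

// Hypothetical illustration: without foreign keys or cascading deletes,
// the application itself must remove everything a departing user produced.
public class AccountCleanup {

    // Stand-ins for NoSQL "buckets": userId -> list of post ids, postId -> payload
    static final Map<String, List<String>> postsByUser = new ConcurrentHashMap<>();
    static final Map<String, String> posts = new ConcurrentHashMap<>();
    static final Map<String, String> profiles = new ConcurrentHashMap<>();

    static void deleteAccount(String userId) {
        // 1. Remove every piece of data the user has produced in the past
        for (String postId : postsByUser.getOrDefault(userId, Collections.emptyList())) {
            posts.remove(postId);
        }
        postsByUser.remove(userId);
        // 2. Finally remove the account itself
        profiles.remove(userId);
    }

    public static void main(String[] args) {
        profiles.put("u1", "Alice");
        posts.put("p1", "hello");
        postsByUser.put("u1", new ArrayList<>(List.of("p1")));
        deleteAccount("u1");
        System.out.println(posts.isEmpty() && profiles.isEmpty()); // true
    }
}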

A major advantage of NoSQL databases is that data replication can be done more easily than with SQL databases.

As there are no relations, tables do not necessarily have to reside on the same servers. Again, this allows better scaling than SQL databases, and scaling is one of the key aspects of cloud computing environments.

Another disadvantage of SQL databases is that there is always a schema involved. Over time, requirements will change, and the database somehow has to support these new requirements. This can lead to serious problems: just imagine that an application suddenly needs two extra fields to store data. Solving this with a SQL database can get very hard, whereas NoSQL databases support a changing data environment and are a better solution in this case as well (see the sketch below).
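A minimal Java sketch of the "two extra fields" scenario, using a plain map as a stand-in for a schema-free document; the field names are made up for illustration:

import java.util.HashMap;
import java.util.Map;

public class SchemaFreeExample {
    public static void main(String[] args) {
        // An "old" document written before the requirements changed
        Map<String, Object> oldOrder = new HashMap<>();
        oldOrder.put("id", 1001);
        oldOrder.put("amount", 49.90);

        // A "new" document simply carries the two extra fields;
        // no ALTER TABLE, no migration of existing documents
        Map<String, Object> newOrder = new HashMap<>();
        newOrder.put("id", 1002);
        newOrder.put("amount", 12.50);
        newOrder.put("currency", "EUR");        // extra field #1
        newOrder.put("discountCode", "SPRING"); // extra field #2

        // Readers just have to cope with fields that may be absent
        System.out.println(oldOrder.getOrDefault("currency", "USD"));
    }
}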

SQL databases do have one advantage over NoSQL databases: better support for business intelligence.

Cloud Computing Platforms are made for a great number of people and potential customers. This means that there will be millions of queries over various tables, millions or even billions of read and write operations within seconds. SQL Databases are built to serve another market: the “business intelligence” one, where fewer queries are executed.

This implies that the way forward for many developers is a hybrid approach, with large sets of data stored in, ideally, cloud-scale NoSQL storage, and smaller specialized data remaining in relational databases. While this would seem to amplify management overhead, reducing the size and complexity of the relational side can drastically simplify things.

Ultimately, the use case determines whether a NoSQL approach is the right choice or whether you are better off staying with SQL.

“NOSQL” Databases for Cloud

The NoSQL (or “not only SQL”) movement is defined by a simple premise: Use the solution that best suits the problem and objectives.

If the data structure is more appropriately accessed through key-value pairs, then the best solution is likely a dedicated key value pair database.

If the objective is to quickly find connections within data containing objects and relationships, then the best solution is a graph database that can get results without any need for translation (O/R mapping).
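To make "finding connections without O/R mapping" concrete, here is a minimal Java sketch that stores the data directly as a graph (an adjacency map) and walks it breadth-first. It is a conceptual stand-in, not the API of any particular graph database:

import java.util.*;

public class ConnectionFinder {
    // Adjacency list: node -> set of directly connected nodes
    static final Map<String, Set<String>> graph = new HashMap<>();

    static void connect(String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search: is "to" reachable from "from", and in how many hops?
    static int hops(String from, String to) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (current.equals(to)) return dist.get(current);
            for (String next : graph.getOrDefault(current, Set.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(current) + 1);
                    queue.add(next);
                }
            }
        }
        return -1; // not connected
    }

    public static void main(String[] args) {
        connect("alice", "bob");
        connect("bob", "carol");
        System.out.println(hops("alice", "carol")); // 2
    }
}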

The availability today of numerous technologies that support this simple premise is helping to simplify the application environment and enable solutions that actually exceed the requirements, while also supporting performance and scalability objectives far into the future. Many cloud web applications have expanded beyond the sweet spot of relational database technologies, and many demand availability, speed, and fault tolerance over consistency.

Although the original emergence of NOSQL data stores was motivated by web-scale data, the movement has grown to encompass a wide variety of data stores that just happen to not use SQL as their processing language. There is no general agreement on the taxonomy of NOSQL data stores, but the categories below capture much of the landscape.

Tabular / Columnar Data Stores

Storing sparse tabular data, these stores look most like traditional tabular databases. Their primary data retrieval paradigm utilizes column filters, generally leveraging hand-coded map-reduce algorithms.

BigTable is a compressed, high-performance, proprietary database system built on the Google File System (GFS), Chubby Lock Service, and a few other Google programs.

HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java. It runs on top of HDFS, providing a fault-tolerant way of storing large quantities of sparse data.

Hypertable is an open source database inspired by publications on the design of Google’s BigTable. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++ for performance.

VoltDB is an in-memory, ACID-compliant RDBMS that uses a shared-nothing architecture. It is based on the academic H-Store project. VoltDB is a relational database that supports SQL access from within pre-compiled Java stored procedures.

Google Fusion Tables is a free service for sharing and visualizing data online. It allows you to upload and share data, merge data from multiple tables into interesting derived tables, and see the most up-to-date data from all sources.

Document Stores

These NOSQL data sources store unstructured (e.g., text) or semi-structured (e.g., XML) documents. Their data retrieval paradigms vary widely, but documents can always be retrieved by a unique handle. XML data sources leverage XQuery; text documents are indexed, facilitating keyword-search-like retrieval.

Apache CouchDB, commonly referred to as CouchDB, is an open source document-oriented database written in the Erlang programming language. It is designed for local replication and to scale vertically across a wide range of devices.

MongoDB is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language.

Terrastore is a distributed, scalable, and consistent document store supporting single-cluster and multi-cluster deployments. It provides advanced scalability and elasticity features without sacrificing consistency at the data level.

Graph Databases

These NOSQL sources store graph-oriented data with nodes, edges, and properties and are commonly used to store associations in social networks.

Neo4j is an open-source graph database implemented in Java. It is an "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs".

AllegroGraph is a Graph database. It considers each stored item to have any number of relationships. These relationships can be viewed as links, which together form a network, or graph.

FlockDB is an open source distributed, fault-tolerant graph database for managing data at webscale. It was initially used by Twitter to build its database of users and manage their relationships to one another. It scales horizontally and is designed for on-line, low-latency, high throughput environments such as websites.

VertexDB is a high-performance graph database server that supports automatic garbage collection. It uses the HTTP protocol for requests and JSON as its response data format, and its API is inspired by the FUSE filesystem API, plus a few extra methods for queries and queues.

Key/Value Stores

These sources store simple key/value pairs like a traditional hash table. Their data retrieval paradigm is simple: given a key, return the value.
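A minimal Java sketch of that paradigm, using an in-memory map as a stand-in for a key/value store (a real store distributes and persists this same put/get contract across nodes):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class KeyValueDemo {
    public static void main(String[] args) {
        // A key/value store boils down to two operations: put(key, value) and get(key)
        ConcurrentMap<String, String> store = new ConcurrentHashMap<>();
        store.put("session:42", "{\"user\":\"alice\",\"cart\":3}");

        // Given a key, return the value; no query planner, no joins
        String value = store.get("session:42");
        System.out.println(value);
    }
}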

Dynamo is a highly available, proprietary key-value structured storage system. It has properties of both databases and distributed hash tables (DHTs). It is not directly exposed as a web service, but is used to power parts of other Amazon Web Services.

Memcached is a general-purpose distributed memory caching system. It is often used to speed up dynamic database-driven websites by caching data and objects in RAM to reduce the number of times an external data source must be read.

Cassandra is an open source distributed database management system. It is designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powers their Inbox Search feature.

Amazon SimpleDB is a distributed database written in Erlang by Amazon.com. It is used as a web service in concert with EC2 and S3 and is part of Amazon Web Services.

Voldemort is a distributed key-value storage system. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient.

Kyoto Cabinet is a library of routines for managing a database. The database is a simple data file containing records; each record is a pair of a key and a value. There is no concept of data tables or data types. Records are organized in a hash table or a B+ tree.

Scalaris is a scalable, transactional, distributed key-value store. It can be used for building scalable Web 2.0 services.

Riak is a Dynamo-inspired database that is being used in production by companies like Mozilla.

Object and Multi-value Databases

These types of stores preceded the NOSQL movement, but they have found new life as part of the movement. Object databases store objects (as in object-oriented programming). Multi-value databases store tabular data, but individual cells can store multiple values. Examples include Objectivity, GemStone and Unidata. Proprietary query languages are used.

Miscellaneous NOSQL Sources

Several other data stores can be classified as NOSQL stores, but they don’t fit into any of the categories above. Examples include: GT.M, IBM Lotus/Domino, and the ISIS family.

Sources for further Reading

http://news.cnet.com/8301-13846_3-10412528-62.html
http://cloudcomputing.blogspot.com/2010/03/nosql-is-not-sql-and-thats-problem.html
http://www.readwriteweb.com/cloud/2010/07/cassandra-predicting-the-futur.php
http://cloudvane.wordpress.com/tag/nosql/
http://www.rackspacecloud.com/blog/2010/02/25/should-you-switch-to-nosql-too/
http://pro.gigaom.com/2010/03/what-cloud-computing-can-learn-from-nosql/
http://www.drdobbs.com/database/224900500
http://cloudcomputing.blogspot.com/2010/04/disruptive-cloud-computing-startups-at.html
http://www.informationweek.com/cloud-computing/blog/archives/2010/04/nosql_needed_fo.html
http://www.elance.com/s/cloudcomputing/
http://www.thesavvyguideto.com/gridblog/2009/11/a-look-at-nosql-and-nosql-patterns/
http://blogs.forrester.com/application_development/2010/02/nosql.html
http://www.yafla.com/dforbes/Getting_Real_about_NoSQL_and_the_SQL_Isnt_Scalable_Lie/
http://arstechnica.com/business/data-centers/2010/02/-since-the-rise-of.ars/2


Platform as a Service Comparison: Force.com vs AWS Elastic Beanstalk vs Google App Engine vs Windows Azure


Platform as a Service Comparison

Description
  • Force.com: A cloud computing platform as a service from Salesforce.com that developers use to build multi-tenant applications hosted on Salesforce.com's servers as a service.
  • AWS Elastic Beanstalk: An even easier way to quickly deploy and manage applications in the AWS cloud. You simply upload your application, and Elastic Beanstalk automatically handles the deployment details of capacity provisioning, load balancing, auto-scaling, and application health monitoring.
  • Google App Engine: Lets you run web applications on Google's infrastructure. App Engine applications are easy to build, easy to maintain, and easy to scale as your traffic and data storage needs grow. There are no servers to maintain: you just upload your application, and it's ready to serve your users.
  • Windows Azure: The Windows Azure Platform is a Microsoft cloud platform used to build, host and scale web applications through Microsoft data centers.

Web-Site
  • Force.com: https://www.force.com
  • AWS Elastic Beanstalk: http://aws.amazon.com/elasticbeanstalk/
  • Google App Engine: http://code.google.com/appengine/
  • Windows Azure: http://www.windowsazure.com/en-us/

Development Status
  • Production for all four platforms.

Technology
  • Force.com: Java, .NET, Ruby, Objective-C and PHP
  • AWS Elastic Beanstalk: Java
  • Google App Engine: Java, Python
  • Windows Azure: C#, Java, PHP, Ruby

Open Source
  • No for all four platforms.

Database
  • Force.com: Force.com database
  • AWS Elastic Beanstalk: Microsoft SQL Server, Oracle, or other relational databases running on EC2
  • Google App Engine: Google Cloud SQL; GAE doesn't support external databases
  • Windows Azure: SQL Azure

Database as a Service
  • Force.com: Yes
  • AWS Elastic Beanstalk: Amazon RDS, Amazon SimpleDB
  • Google App Engine: Yes
  • Windows Azure: Yes

API
  • Force.com: REST API
  • AWS Elastic Beanstalk: AWS Elastic Beanstalk API or AWS SDKs
  • Google App Engine: Datastore, Blobstore, Email, XMPP, Channel, Memcache, Files API
  • Windows Azure: REST API and a managed API for working with the storage services

Pricing
  • Force.com: http://www.force.com/platform-edition/index.jsp?d=70130000000rza8
  • AWS Elastic Beanstalk: http://aws.amazon.com/elasticbeanstalk/#pricing
  • Google App Engine: http://www.google.com/enterprise/cloud/appengine/pricing.html
  • Windows Azure: http://www.windowsazure.com/en-us/pricing/details/

DevTest Environment
  • Force.com: Development
  • AWS Elastic Beanstalk: Development
  • Google App Engine: ?

Free Trial
  • Force.com: 30-day free trial.
  • AWS Elastic Beanstalk: New AWS customers who are eligible for the AWS free usage tier can deploy an application in Elastic Beanstalk for free, as the default settings allow a low-traffic application to run within the free tier without incurring charges.
  • Google App Engine: Free up to a certain level of consumed resources.
  • Windows Azure: With the new Spending Limit feature, customers who sign up for a new 3-Month Free Trial, MSDN or Cloud Essentials offer can use Windows Azure without any fear of getting charged, as long as they keep the Spending Limit feature turned on.

Real-time web monitoring and analytics
  • AWS Elastic Beanstalk: CloudWatch monitoring metrics
  • Google App Engine: Third-party tools can be used
  • Windows Azure: Windows Azure Application Monitoring Management Pack

Continuous Integration (CI) server
  • Force.com: CruiseControl is a continuous integration tool
  • AWS Elastic Beanstalk: Hudson, using the API to push new releases/versions to Elastic Beanstalk
  • Google App Engine: Hudson – Continuous Integration for a Google App Engine application
  • Windows Azure: ?

Eclipse Plugin
  • Force.com: Force.com IDE on Eclipse 3.5
  • AWS Elastic Beanstalk: AWS Toolkit for Eclipse
  • Google App Engine: Yes
  • Windows Azure: ?

Cost Benefits of Cloud Computing


Cloud computing has been a newsworthy term in the IT industry in recent times, and it is here to stay!

Cloud Computing is not a technology, or even a set of technologies — it’s an idea. Cloud Computing is not a standard defined by any standards organization.

Basic understanding of the cloud: the "cloud" represents the Internet. Instead of using applications installed on your computer or saving data to your hard drive, you work and store data on the Web. Data is kept on servers run by the service you use, and tasks are performed in your browser through an interface or console provided by the service.

A credit card and Internet access are all you need to make an investment in technology. Businesses will find it easier than ever to provision technology services without the involvement of IT.

There are many definitions of cloud computing in the market, but we have aligned ours with the NIST publication and with our own understanding. NIST defines cloud computing by describing five essential characteristics, three cloud service models, and four cloud deployment models.

NIST's Architecture of Cloud Computing

Ref: NIST

"Cloud computing is a self-service which is on-demand, elastic, measured, multi-tenant, pay-per-use, cost-effective and efficient." It is the access of data, software applications, and computer processing power through a "cloud": a group of many on-demand online resources. Tasks are assigned to a combination of connections, software, and services accessed over a network. This network of servers and connections is collectively known as "the cloud."

Cloud service delivery is divided among three fundamental classifications, referred to as the "SPI model": Software, Platform, and Infrastructure as a Service.

Cloud Service Models – IaaS, PaaS, SaaS

Software as a Service (SaaS) is a model of software deployment where an application is hosted as a service provided to customers across the Internet. By eliminating the need to install and run the application on the customer's own computer, SaaS alleviates the customer's burden of software maintenance, ongoing operation, and support. Salesforce is a very popular Customer Relationship Management (CRM) product that is offered only as a service.

The PaaS model makes all of the facilities required to support the complete life cycle of building and delivering web applications and services available entirely from the Internet. Google App Engine (GAE) is an example of PaaS: it provides a Python environment within which you can build, test, and then deploy your applications.

Infrastructure as a Service (IaaS) is the delivery of computer infrastructure as a service. Rather than purchasing servers, software, data center space or network equipment, clients instead buy those resources as a fully outsourced service. Amazon Web Services (AWS) is one of the pioneers of such an offering. AWS’ Elastic Compute Cloud (EC2) is “a web service that provides resizable compute capacity”.

There are four deployment models for cloud services regardless of the service model utilized (SPI).

Public clouds refer to shared cloud services that are made available to a broad base of users. Although many organizations use public clouds for private business benefit, they don’t control how those cloud services are operated, accessed or secured. Popular examples of public clouds include Amazon EC2, Google Apps and Salesforce.com.

Private cloud describes an IT infrastructure in which a shared pool of computing resources—servers, networks, storage, applications and software services—can be rapidly provisioned, dynamically allocated and operated for the benefit of a single organization.

Hybrid Cloud represents composition of two or more cloud deployment models (private, community, or public) that remain unique but are bound together by uniform or proprietary technology that enables data and application portability.

Community Cloud represents infrastructure that is shared by several organizations and supports a specific community with shared concerns. For example, FDA compliance requires specific controls whose audit requirements cannot be met by the other deployment models.

Cloud computing brings efficiencies and savings. The diverse benefits of cloud computing are undoubtedly worth pursuing. Cost-cutting is at the top of most companies’ lists of priorities in these challenging economic times. Having turned from revolutionary possibility into increasingly well-established custom, the cost of ‘outsourcing to the cloud’ is now falling dramatically.

Because you pay only for the resources you use, operating costs can be reduced; after all, in-house data centres typically leave 85%-90% of available capacity idle. Cloud computing can lead to energy savings too, removing from individual companies the costly burden of running a data centre plus generator back-up and uninterruptible power supplies. The result is a reduction in both CAPEX and OPEX.

Cloud computing is in its formative years, but expect it to grow up quickly. The prospects of cloud computing are mind-boggling, and the technology and business options will increase exponentially.

Still, the question remains: how are clouds beneficial to enterprises?

  • Focus on core business
  • Cloud computing increases the profitability by improving resource utilization. Pooling resources into large clouds drives down costs and increases utilization by delivering resources only for as long as those resources are needed.
  • Cloud computing is particularly valuable to small and medium businesses, where effective and affordable IT tools are critical to helping them become more productive without spending lots of money on in-house resources and technical equipment.
  • Cost savings
  • Remote access
  • Ease of availability
  • Real-time collaboration capabilities
  • Gain access to latest technologies
  • We can leverage the sheer processing power of the cloud to do things that traditional productivity applications cannot do. For instance, users can instantly search over 25 GB worth of e-mail online, which is nearly impossible to do on a desktop.
  • To take another example, each document created through Google Apps is easily turned into a living information source, capable of pulling the latest data from external applications, databases and the Web. This revolutionizes processes as simple as creating a Google spreadsheet to compare stock prices from vendors over time, because the cells can be populated and updated as the prices change in real time.
  • Cloud computing offers almost unlimited computing power and collaboration at a massive scale for enterprises of all sizes.
  • “Salesforce.com has 1.2m users on its platform. If that’s not scalable show me something that is. Gmail is SaaS and how many users are on that?”
  • Multi-tenancy enables sharing of resources and costs among a large pool of users, allowing for:
  1. Centralization of infrastructure in areas with lower costs (such as real estate, electricity, etc.)
  2. Peak-load capacity increases (users need not engineer for highest possible load-levels)
  3. Utilization and efficiency improvements for systems that are often only 10-20% utilized.
  • Sustainability comes about through improved resource utilization, more efficient systems, and carbon neutrality.

But, are there any issues with Cloud Computing?

  • The benefits of cloud computing will not be realized if businesses are not convinced that it is secure. Trust is at the centre of success and providers have to prove themselves worthy of that trust if hosted services are going to work.
  • CIA (Confidentiality, Integrity, Availability)
  • Application performance
  • IT Security Standards – There are multiple standards for security protocol for IT systems that have yet to be implemented into cloud computing.
  • Regulatory compliance— the vendor will be required to participate in internal and external audits. They will need to find a way to accommodate auditors from all firms using their service. [FDA Compliance is not feasible yet.]

Let's consider some facts and figures before jumping into the finer details of cloud computing. Compare the annual cost of Amazon EC2 with an equivalent deployment in co-located and on-site data centers by entering a few basic inputs (Ref: Amazon EC2 Cost Comparison Calculator).

Cost Component – Amazon EC2 (Public Cloud), Co-Location, On-Site

Configuration:

High-CPU Instances: Instances of this family have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.

20 High-CPU Extra Large Instances (75% utilization) and 5 peak instances with 10% annual utilization

  • 7 GB of memory
  • 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
  • 1690 GB of instance storage
  • 64-bit platform
  • I/O Performance: High
  • Avg. Monthly Data Transfer “In” Per Instance (GB) -10 GB
  • Avg. Monthly Data Transfer “Out” Per Instance (GB)- 20 GB
  • Region: US-East
  • OS: Linux

Cost Details:

On-site

TCO On-Site

Co-Location

TCO Co-Location

Amazon EC2(Cloud Computing)

TCO Amazon EC2

Annual Total Cost of Ownership (TCO) Summary


Hence … Proved 🙂


Tutorial on Hadoop with VMware Player



Map Reduce (Source: google)


Functional Programming
According to Wikipedia, functional programming is a programming paradigm in computer science that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state. Since there is no hidden dependency (via shared state), functions in a directed acyclic graph (DAG) of computations can run anywhere in parallel as long as one is not an ancestor of the other. In other words, analyzing the parallelism is much easier when there is no hidden dependency on shared state.

Map/reduce is a special form of such a directed acyclic graph that is applicable to a wide range of use cases. It is organized around a "map" function that transforms a piece of data into some number of key/value pairs. Each of these elements is then sorted by key and routed to the same node, where a "reduce" function merges the values of the same key into a single result.
Map Reduce

Map/reduce is a way to take a big task and divide it into discrete tasks that can be done in parallel. Map/reduce is just a pair of functions operating over a list of data.

MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers.

The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms.
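Before turning to Hadoop, the map/reduce idea itself fits in a few lines of plain Java. The sketch below counts words by mapping each word to a (word, 1) element and then reducing (counting) the elements that share a key; it illustrates only the programming model, not the Hadoop API:

import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        String[] lines = { "to be or not to be", "to see or not to see" };

        Map<String, Long> counts = Arrays.stream(lines)
            // map: transform each line into word elements, conceptually (word, 1) pairs
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // shuffle + reduce: group elements by key and merge the values per key
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        System.out.println(counts); // e.g. {not=2, to=4, be=2, or=2, see=2}
    }
}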
Hadoop
Hadoop is a large-scale batch data processing system.

It uses MapReduce for computation and HDFS for storage.

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

It is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system that, like Hadoop itself, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

Hadoop is an open-source Java implementation of Google's MapReduce algorithm along with an infrastructure to support distributing it over multiple machines. This includes its own filesystem, HDFS (the Hadoop Distributed File System, based on the Google File System), which is specifically tailored for dealing with large files. When thinking about Hadoop, it is important to keep in mind that this infrastructure is a huge part of it. Implementing MapReduce is simple; implementing a system that can intelligently manage the distribution of processing and of your files, and break those files down into more manageable chunks for efficient processing, is not.
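The wordcount job run later in this tutorial is built on exactly this model. Below is a hedged sketch of what such a job looks like against the Hadoop 0.20 "new" API (org.apache.hadoop.mapreduce); the bundled hadoop-0.20.2-examples.jar ships a comparable implementation, so treat this as an approximation rather than the exact shipped source:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: (offset, line) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");     // Job(conf, name) constructor used in 0.20.x
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. "examples" in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. "example-output"
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}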

HDFS breaks files down into blocks, which can be replicated across its network (how many times a block is replicated is determined by your application and can be specified on a per-file basis). This is one of the most important performance features and, according to the docs, "…is a feature that needs a lot of tuning and experience." You really don't want 50 machines all trying to pull from a 1 TB file on a single data node at the same time, but you also don't want to replicate a 1 TB file out to 50 machines. So it's a balancing act.
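Per-file replication can also be adjusted after the fact. A small sketch using Hadoop's Java FileSystem API is shown below; the file path is just an example for this tutorial's data, not a required location:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Example path only: ask HDFS to keep 3 copies of this (hypothetical) large input file,
        // overriding the cluster-wide dfs.replication default for just this file
        Path bigInput = new Path("/user/root/examples/pg20417.txt");
        fs.setReplication(bigInput, (short) 3);

        fs.close();
    }
}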

A Hadoop installation is broken into three types of nodes.

  • The NameNode acts as the HDFS master, managing all decisions regarding data replication.
  • The JobTracker manages the MapReduce work. It "…is the central location for submitting and tracking MR jobs in a network environment."
  • The TaskTracker and DataNode do the grunt work.

Hadoop – NameNode, DataNode, JobTracker, TaskTracker

The JobTracker first determines the number of splits (split size is configurable, roughly 16-64 MB) from the input path and selects TaskTrackers based on their network proximity to the data sources; it then sends the task requests to those selected TaskTrackers.

Each TaskTracker starts the map phase by extracting the input data from its splits. For each record parsed by the "InputFormat", it invokes the user-provided "map" function, which emits a number of key/value pairs into a memory buffer. A periodic background process sorts the memory buffer and partitions it by reducer node, invoking the "combine" function, where provided, to pre-merge values. The key/value pairs are sorted into one of R local files (assuming there are R reducer nodes).

When the map task completes (all splits are done), the TaskTracker will notify the JobTracker. When all the TaskTrackers are done, the JobTracker will notify the selected TaskTrackers for the reduce phase.

Each TaskTracker then reads the region files remotely. It sorts the key/value pairs and, for each key, invokes the "reduce" function, which writes the key and its aggregated value into the output file (one per reducer node).

The map/reduce framework is resilient to the crash of any of its components. The JobTracker keeps track of the progress of each phase and periodically pings the TaskTrackers for their health status. When a map-phase TaskTracker crashes, the JobTracker reassigns its map tasks to a different TaskTracker node, which reruns all the assigned splits. If a reduce-phase TaskTracker crashes, the JobTracker reruns the reduce on a different TaskTracker.
Let’s try Hands on Hadoop
The objective of this tutorial is to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux, with the help of VMware Player.

Hadoop and VMware Player

Installations / Configurations Needed:

Physical Machine

Laptop with 60 GB HDD, 2 GB RAM, 32-bit support, OS – Ubuntu 10.04 LTS (the Lucid Lynx)

IP Address: 192.168.1.3 [used in configuration files]

Virtual Machine

See the VMware Player subsection.

Download Ubuntu ISO file

The Ubuntu 10.04 LTS (Lucid Lynx) ISO file is needed to install the OS on the virtual machine created by VMware Player, in order to set up the multi-node Hadoop cluster.

Download Ubuntu Desktop Edition

http://www.ubuntu.com/desktop/get-ubuntu/download

Note: Log in as the "root" user to avoid any kind of permission issues (on both your machine and the virtual machine).

Update the Ubuntu packages: sudo apt-get update

VMware Player [Freeware]

Download it from http://downloads.vmware.com/d/info/desktop_downloads/vmware_player/3_0

Download VMware Player
Select VMware Player to Download
VMware Player Free Product Download

Install VMware Player on your physical machine with the use of the downloaded bundle.

VMware Player – Ready to install
VMware Player – installing

Now create a virtual machine with VMware Player, install Ubuntu 10.04 LTS on it from the ISO file, and apply the appropriate configuration to the virtual machine.

Browse Ubuntu ISO

Proceed with instructions and let the set up finish.

Virtual Machine in VMware Player

Once the setup has finished successfully, select "Play virtual machine".

Start Virtual Machine in VMware Player

Open Terminal (Command prompt in Ubuntu) and check the IP address of the Virtual Machine.

NOTE: The IP address may change, so if the virtual machine cannot be reached over SSH from the physical machine, check its IP address first.

Ubuntu Virtual Machine – ifconfig

Apply the following configuration on both the physical and the virtual machine for the Java 6 and Hadoop installation.

Installing Java 6

sudo apt-get install sun-java6-jdk

sudo update-java-alternatives -s java-6-sun [Verify Java Version]

Setting up Hadoop  0.20.2

Download Hadoop from http://www.apache.org/dyn/closer.cgi/hadoop/core and place under /usr/local/hadoop

HADOOP Configurations

Hadoop requires SSH access to manage its nodes, i.e. remote machines (in our case, the virtual machine) plus your local machine if you want to use Hadoop on it as well.

On Physical Machine

Generate an SSH key


Enable SSH access to your local machine with this newly created key.


Or you can copy it from $HOME/.ssh/id_rsa.pub to $HOME/.ssh/authorized_keys manually.

Test the SSH setup by connecting to your local machine with the root  user.


Use ssh 192.168.1.3 from the physical machine as well; it will give the same result.

On Virtual Machine

The root user account on the slave (Virtual Machine) should be able to access physical machine via a password-less SSH login.

Add the physical machine's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of the virtual machine (in that user's $HOME/.ssh/ directory). You can do this manually:

(Physical Machine)$HOME/.ssh/id_rsa.pub -> (VM)$HOME/.ssh/authorized_keys

The SSH key may look like this (yours will not be identical, of course):

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwjhqJ7MyXGnn5Ly+0iOwnHETAR6Y3Lh3UUKb
aCIP2/0FsVOWhBvcSLMEgT1ewrRPKk9IGoegMCMdHDGDfabzO4tUsfCdfvvb9KFRcB
U3pKdq+yVvCVxXtoD7lNnMtckUwSz5F1d04Z+MDPbDixn6IAu/GeX9aE2mrJRBq1Pz
n3iB4GpjnSPoLwQvEO835EMchq4AI92+glrySptpx2MGporxs5LvDaX87yMsPyF5tutu
Q+WwRiLfAW34OfrYsZ/Iqdak5agE51vlV/SESYJ7OqdD3+aTQghlmPYE4ILivCsqc7w
xT+XtPwR1B9jpOSkpvjOknPgZ0wNi8LD5zyEQ3w== root@mitesh-laptop

Use ssh 192.168.1.3 from the virtual machine to verify SSH access and to get a feel for how SSH works.

For further verification, ping 192.168.1.3 and 192.168.28.136 from each machine.

For detailed information on network settings, visit http://www.vmware.com/support/ws55/doc/ws_net_configurations_common.html (written for VMware Workstation; VMware Player has similar concepts).

Using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of the Ubuntu box.

To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

#disable ipv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1

Ubuntu – Disable IPv6

 <HADOOP_INSTALL>/conf/hadoop-env.sh -> set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

 

# The java implementation to use.  Required.

export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.20

 

<HADOOP_INSTALL>/conf/core-site.xml ->

Configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop's Distributed File System, HDFS, even though our little "cluster" only contains our single local machine.

Hadoop – core-site.xml

<property>

  <name>hadoop.tmp.dir</name>

  <value>/usr/local/hadoop/tmp/dir/hadoop-${user.name}</value>

</property>

 <HADOOP_INSTALL>/conf/mapred-site.xml ->

<property>

  <name>mapred.job.tracker</name>

  <value>192.168.1.3:54311</value>

</property>

Hadoop – mapred-site.xml

 <HADOOP_INSTALL>/conf/hdfs-site.xml

 

<property>

  <name>dfs.replication</name>

  <value>2</value>

</property>

Physical Machine vs Virtual Machine (Master/Slave) Settings on Physical Machine only

<HADOOP_INSTALL>/conf/masters

The conf/masters file defines the namenodes of our multi-node cluster. In our case, this is just the master machine.

192.168.1.3

<HADOOP_INSTALL>/conf/slaves

 This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.

192.168.1.3

192.168.28.136

NOTE: Here 192.168.1.3 and 192.168.28.136 are the IP addresses of the physical machine and the virtual machine respectively, which may vary in your case. Just enter the IP addresses in the files and you are done!

Let’s enjoy the ride with Hadoop:

All Set for having “HANDS ON HADOOP”.

Formatting the name node

ON Physical Machine and Virtual Machine

The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your "cluster" (here, the physical and the virtual machine). You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased.

hadoop namenode -format

Starting the multi-node cluster

1.    Start HDFS daemons

Run the command <HADOOP_INSTALL>/bin/start-dfs.sh on the machine you want the (primary) namenode to run on. This will bring up HDFS with the namenode running on that machine, and datanodes on the machines listed in the conf/slaves file.

Physical Machine

Hadoop – start-dfs.sh

VM

Hadoop – DataNode on Slave Machine

2.    Start MapReduce daemons

Run the command <HADOOP_INSTALL>/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on that machine, and tasktrackers on the machines listed in the conf/slaves file.

Physical Machine

Hadoop – Start MapReduce daemons

VM

TaskTracker in Hadoop

Running a MapReduce job

Here’s the example input data I have used for the multi-node cluster setup described in this tutorial.

All ebooks should be in plain text us-ascii encoding.

http://www.gutenberg.org/etext/20417

http://www.gutenberg.org/etext/5000

http://www.gutenberg.org/etext/4300

http://www.gutenberg.org/etext/132

http://www.gutenberg.org/etext/1661

http://www.gutenberg.org/etext/972

http://www.gutenberg.org/etext/19699

Download the above ebooks and store them in the local file system.

Copy local example data to HDFS

Hadoop – Copy local example data to HDFS

Run the MapReduce job

hadoop-0.20.2/bin/hadoop jar hadoop-0.20.2-examples.jar wordcount examples example-output

Failed Hadoop Job

Retrieve the job result from HDFS

You can read the output file directly from HDFS without copying it to the local file system; in this tutorial, though, we will copy the results to the local file system.

mkdir /tmp/example-output-final

bin/hadoop dfs -getmerge example-output /tmp/example-output-final

Hadoop – Word count example

Hadoop – MapReduce Administration
Hadoop – Running and Completed Job

Task Tracker Web Interface

Hadoop – Task Tracker Web Interface

Hadoop – NameNode Cluster Summary

References

http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)

http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python

http://java.dzone.com/articles/how-hadoop-mapreduce-works

http://ayende.com/Blog/archive/2010/03/14/map-reduce-ndash-a-visual-explanation.aspx

http://www.youtube.com/watch?v=Aq0x2z69syM

http://www.gridgainsystems.com/wiki/display/GG15UG/MapReduce+Overview

http://map-reduce.wikispaces.asu.edu/

http://blogs.sun.com/fifors/entry/map_reduce

http://www.vmware.com/support/ws55/doc/ws_net_configurations_common.html

http://www.ibm.com/developerworks/aix/library/au-cloud_apache/

 
