NOSQL and Cloud Computing


Introduction

Cloud computing is moving from being an “IT buzzword” to a reasonable and reliable way of deploying applications on the Internet, and IT managers are considering deploying some of their applications in the cloud. A cloud-related trend that developers have been paying attention to is “NoSQL”, a set of operational-data technologies based on non-relational concepts. NoSQL represents a sea change: it invites developers to consider data storage options beyond the traditional SQL-based relational database.

Accordingly, a new set of open source distributed databases is cropping up to leverage the facilities and services provided by cloud architectures. Web applications and databases in the cloud are therefore undergoing major architectural changes to take advantage of the scalability the cloud provides. This article is intended to provide insight into NoSQL in the context of cloud computing.

Face off ~ SQL, NOSQL & Cloud Computing

A key disadvantage of SQL databases is that they operate at a high level of abstraction. To execute a single statement, the database often has to process the data multiple times, which costs time and performance. A ‘Join’ operation, for instance, has to read and combine data from several tables. Cloud computing environments, by contrast, need high-performing and highly scalable databases.

NoSQL databases are built without relations. But is it really that “good” to go for NoSQL databases? A world without relations, no joins and pure scalability! NoSQL databases typically emphasize horizontal scalability via partitioning, putting them in a good position to leverage the elastic provisioning capabilities of the cloud.
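To make “horizontal scalability via partitioning” concrete, here is a minimal sketch of consistent hashing, one common way of spreading keys over servers. It does not reproduce the code of any particular NoSQL product; the class name `HashRing`, the node names and the choice of 64 virtual nodes per server are assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each server is placed on the ring many times to even out the load.
        for i in range(self.vnodes):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        if not self._positions:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # elastic scale-out: one more server joins
after = {k: ring.node_for(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved after adding a node")
```

The point of the design is that adding a server only moves the keys that the new server takes over; all other keys stay where they were, which is what makes growing a cluster on demand practical.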

The general definition of a NOSQL data store is that it manages data that is not strictly tabular and relational, so it does not make sense to use SQL for the creation and retrieval of the data. NOSQL data stores are usually non-relational, distributed, open-source, and horizontally scalable.

If we look at big web platforms like Facebook or Twitter, some of their datasets do not need any relations. The challenge for NoSQL databases is keeping the data consistent. Imagine that a user deletes his or her account. If the data is hosted in a NoSQL database, every table has to be checked for data the user produced in the past, and because the database does not enforce such relationships, this has to be done in application code.
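The account-deletion case can be sketched as follows. The three collections (`users`, `posts`, `comments`) and their fields are hypothetical and are modelled as plain Python dictionaries rather than a real data store; the sketch only shows the kind of clean-up logic that moves into the application.

```python
# Hypothetical collections, each modelled as a dict keyed by record id.
users = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
posts = {
    "p1": {"author": "u1", "text": "Hello"},
    "p2": {"author": "u2", "text": "Hi"},
}
comments = {
    "c1": {"author": "u1", "post": "p2", "text": "Nice"},
    "c2": {"author": "u2", "post": "p1", "text": "Thanks"},
}


def delete_user(user_id: str) -> None:
    """Remove a user and every record that refers to that user.

    A relational database can do this declaratively with foreign keys
    and ON DELETE CASCADE; here the application walks each collection.
    """
    # Posts written by the user.
    doomed_posts = {pid for pid, p in posts.items() if p["author"] == user_id}
    for pid in doomed_posts:
        del posts[pid]

    # Comments written by the user, or attached to one of the deleted posts.
    doomed_comments = [
        cid for cid, c in comments.items()
        if c["author"] == user_id or c["post"] in doomed_posts
    ]
    for cid in doomed_comments:
        del comments[cid]

    users.pop(user_id, None)


delete_user("u1")
print(sorted(users), sorted(posts), sorted(comments))
```

Every new collection that stores user data means another step in `delete_user`, which is exactly the consistency burden described above.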

A major advantage of NoSQL databases is that data replication can be done more easily than with SQL databases.

As there are no relations, tables don’t necessarily have to be on the same server. Again, this allows better scaling than with SQL databases. Don’t forget: scaling is one of the key aspects of cloud computing environments.

Another disadvantage of SQL databases is that there is always a schema involved. Over time, requirements will change, and the database somehow has to support the new requirements. This can lead to serious problems. Just imagine that an application needs two extra fields to store data. Solving this with a SQL database can be hard, since the schema has to be changed and existing data migrated. NoSQL databases support a changing data environment and are a better solution in this case as well.
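As a rough illustration of the “two extra fields” case, the sketch below models a schema-free collection as a list of Python dictionaries. The collection name `customers` and the fields `twitter` and `newsletter` are made up for the example.

```python
# A hypothetical "customers" collection: each document is a plain dict.
customers = [
    {"_id": 1, "name": "Alice"},
    {"_id": 2, "name": "Bob"},
]

# New requirement: store two extra fields. New documents simply carry them;
# no schema change is needed and the older documents are left untouched.
customers.append(
    {"_id": 3, "name": "Carol", "twitter": "@carol", "newsletter": True}
)


def newsletter_recipients(docs):
    """Readers must tolerate documents written before the fields existed."""
    return [d["name"] for d in docs if d.get("newsletter", False)]


print(newsletter_recipients(customers))
```

The flexibility is not free: the knowledge of which fields may be missing now lives in the application code (the `d.get(...)` default) instead of in the database schema.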

SQL databases have the advantage over NoSQL databases of better support for “Business Intelligence”.

Cloud Computing Platforms are made for a great number of people and potential customers. This means that there will be millions of queries over various tables, millions or even billions of read and write operations within seconds. SQL Databases are built to serve another market: the “business intelligence” one, where fewer queries are executed.

This implies that the way forward for many developers is a hybrid approach, with large sets of data stored in, ideally, cloud-scale NoSQL storage, and smaller specialized data remaining in relational databases. While this would seem to amplify management overhead, reducing the size and complexity of the relational side can drastically simplify things.

However, it is the use case that determines whether a NoSQL approach is appropriate or whether you are better off staying with SQL.

“NOSQL” Databases for Cloud

The NoSQL (or “not only SQL”) movement is defined by a simple premise: Use the solution that best suits the problem and objectives.

If the data structure is more appropriately accessed through key-value pairs, then the best solution is likely a dedicated key-value database.

If the objective is to quickly find connections within data containing objects and relationships, then the best solution is a graph database that can get results without any need for translation (O/R mapping).

The availability today of numerous technologies that finally support this simple premise is helping to simplify the application environment and enable solutions that exceed the requirements while supporting performance and scalability objectives far into the future. Many cloud web applications have expanded beyond the sweet spot of relational database technologies; they demand availability, speed, and fault tolerance over consistency.

Although the original emergence of NOSQL data stores was motivated by web-scale data, the movement has grown to encompass a wide variety of data stores that just happen to not use SQL as their processing language. There is no general agreement on the taxonomy of NOSQL data stores, but the categories below capture much of the landscape.

Tabular / Columnar Data Stores

Storing sparse tabular data, these stores look most like traditional tabular databases. Their primary data retrieval paradigm utilizes column filters, generally leveraging hand-coded map-reduce algorithms.

BigTable is a compressed, high-performance, proprietary database system built on Google File System (GFS), Chubby Lock Service, and a few other Google programs.

HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It runs on top of HDFS, providing a fault-tolerant way of storing large quantities of sparse data.

Hypertable is an open source database inspired by publications on the design of Google’s BigTable. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++ for performance.

VoltDB is an in-memory database. It is an ACID-compliant RDBMS that uses a shared-nothing architecture and is based on the academic H-Store project. VoltDB supports SQL access from within pre-compiled Java stored procedures.

Google Fusion Tables is a free service for sharing and visualizing data online. It allows you to upload and share data, merge data from multiple tables into interesting derived tables, and see the most up-to-date data from all sources.

Document Stores

These NOSQL data sources store unstructured (e.g., text) or semi-structured (e.g., XML) documents. Their data retrieval paradigm varies widely, but documents can always be retrieved by a unique handle. XML data sources leverage XQuery. Text documents are indexed, facilitating keyword search-like retrieval.

Apache CouchDB, commonly referred to as CouchDB, is an open source document-oriented database written in the Erlang programming language. It is designed for local replication and to scale horizontally across a wide range of devices.

MongoDB is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language.

Terrastore is a distributed, scalable and consistent document store supporting single-cluster and multi-cluster deployments. It provides advanced scalability and elasticity features without loosening consistency at the data level.

Graph Databases

These NOSQL sources store graph-oriented data with nodes, edges, and properties and are commonly used to store associations in social networks.

Neo4j is an open-source graph database, implemented in Java. It is an “embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs”.

AllegroGraph is a Graph database. It considers each stored item to have any number of relationships. These relationships can be viewed as links, which together form a network, or graph.

FlockDB is an open source distributed, fault-tolerant graph database for managing data at webscale. It was initially used by Twitter to build its database of users and manage their relationships to one another. It scales horizontally and is designed for on-line, low-latency, high throughput environments such as websites.

VertexDB is a high-performance graph database server that supports automatic garbage collection. It uses the HTTP protocol for requests and JSON as its response data format. Its API is inspired by the FUSE file system API, plus a few extra methods for queries and queues.

Key/Value Stores

These sources store simple key/value pairs like a traditional hash table. Their data retrieval paradigm is simple; given a key, return the value.

Dynamo is a highly available, proprietary key-value structured storage system. It has properties of both databases and distributed hash tables (DHTs). It is not directly exposed as a web service, but is used to power parts of other Amazon Web Services.

Memcached is a general-purpose distributed memory caching system. It is often used to speed up dynamic database-driven websites by caching data and objects in RAM to reduce the number of times an external data source must be read.

Cassandra is an open source distributed database management system. It is designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powers their Inbox Search feature.

Amazon SimpleDB is a distributed database written in Erlang by Amazon.com. It is used as a web service in concert with EC2 and S3 and is part of Amazon Web Services.

Voldemort is a distributed key-value storage system. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient.

Kyoto Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each of which is a pair of a key and a value. There is no concept of data tables or data types. Records are organized in a hash table or a B+ tree.

Scalaris is a scalable, transactional, distributed key-value store. It can be used for building scalable Web 2.0 services.

Riak is a Dynamo-inspired database that is being used in production by companies like Mozilla.

Object and Multi-value Databases

These types of stores preceded the NOSQL movement, but they have found new life as part of the movement. Object databases store objects (as in object-oriented programming). Multi-value databases store tabular data, but individual cells can store multiple values. Examples include Objectivity, GemStone and Unidata. Proprietary query languages are used.

Miscellaneous NOSQL Sources

Several other data stores can be classified as NOSQL stores, but they don’t fit into any of the categories above. Examples include: GT.M, IBM Lotus/Domino, and the ISIS family.

What is Voldemort


Voldemort is a distributed key-value data store used at LinkedIn for high-scalability storage problems where simple functional partitioning is not sufficient.

It is named after Lord Voldemort, the fictional villain of the Harry Potter series. Voldemort combines in-memory caching with the storage system, so a separate caching tier is not needed. It supports horizontal scalability for both reads and writes. In essence, it is a fault-tolerant hash table.

Features:

  • Horizontal scalability and high availability for applications that can use an O/R mapper such as Hibernate or ActiveRecord
  • Support for distribution across geographically distant data centers through pluggable data placement strategies
  • Automatic data replication over large number of servers
  • Versioned data items to maintain and maximize data integrity
  • Transparent failure handling
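One widely used way to version data items in replicated key-value stores is the vector clock, and Voldemort's design documentation describes this approach. The sketch below is a generic illustration of the idea, not Voldemort's implementation; the function names and the server names are assumptions.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Classify version a relative to b.

    Returns 'equal', 'before', 'after' or 'concurrent'.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


v1 = increment({}, "server-1")   # first write
v2 = increment(v1, "server-1")   # later write handled by the same server
v3 = increment(v1, "server-2")   # write made on another server, starting from v1

print(compare(v1, v2))  # v1 is an ancestor of v2
print(compare(v2, v3))  # neither descends from the other
```

When two versions are concurrent, neither can be silently discarded; the store has to keep both and let the client (or a merge rule) reconcile them.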

What is Cassandra


Cassandra is an open source distributed database management system and an Apache Software Foundation project released under the Apache License, version 2.0.

It is designed to handle enormous amounts of data spread across many commodity servers, in a traditional or a cloud environment, while providing a highly available service with no single point of failure. It is a NoSQL solution that was developed by Facebook and is now used by companies with large, active data sets, such as eBay, Twitter, Reddit, Cisco, OpenX, and Digg.

[Figure: Data model in Cassandra]

Features

  • Scalability
  • Fault tolerance
  • MapReduce support
  • Decentralized architecture

FlockDB Definition


What is FlockDB?

FlockDB is an open source, fault-tolerant, distributed graph database, licensed under the Apache license, for managing data at web scale. Twitter created it to store its user graph, manage the relationships between users, and support relationship-related analytics. It is suited to high-throughput, low-latency environments. FlockDB stores graph data and is optimized for very large adjacency lists and for fast reads and writes, but not for graph traversal operations.

In FlockDB, graphs are stored as sets of edges between nodes, which are identified by 64-bit integers. Each edge is also marked with a 64-bit position that is used for sorting. In a social graph, the node IDs are user IDs, while in a graph of favorite tweets the destination is a tweet ID.
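The edge model described above can be mimicked with two in-memory adjacency lists. This is only a sketch of the data layout, not FlockDB's API: the function names are invented, keeping a second index by destination is an assumption of the sketch, and using a timestamp as the position is just one plausible choice.

```python
from collections import defaultdict

# forward[source_id] holds (position, destination_id) pairs.
forward = defaultdict(list)
# backward[destination_id] holds (position, source_id) pairs, so that
# "who points at this node?" is as cheap to answer as the forward query.
backward = defaultdict(list)


def add_edge(source_id: int, destination_id: int, position: int) -> None:
    """Store one edge; all three values are treated as 64-bit integers."""
    forward[source_id].append((position, destination_id))
    backward[destination_id].append((position, source_id))


def destinations(source_id: int) -> list:
    """Destinations of a node, ordered by edge position."""
    return [dest for _, dest in sorted(forward.get(source_id, []))]


def sources(destination_id: int) -> list:
    """Sources pointing at a node, ordered by edge position."""
    return [src for _, src in sorted(backward.get(destination_id, []))]


# "Follows" graph: user 1 follows users 20 and 30, user 2 follows user 20.
add_edge(1, 20, position=1300000200)
add_edge(1, 30, position=1300000100)
add_edge(2, 20, position=1300000300)

print(destinations(1))  # whom user 1 follows, oldest edge first
print(sources(20))      # followers of user 20
```

Both queries are single adjacency-list reads, which matches the stated focus on large adjacency lists rather than multi-hop graph traversal.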

Neo4j – graph database


What is Neo4j

Neo4j is an open source property graph database implemented in Java. It stores data structured as graphs, and the graph-based model makes it agile and fast. It scales to several billion nodes and is highly available when distributed across multiple machines. It can be embedded in an application by including the Neo4j library JARs in the build.

In high availability mode, a cluster has a single master and zero or more slaves. Write requests can be handled on any machine, so there is no need to redirect them to the master. A slave handles a write by synchronizing with the master to maintain consistency. Updates then propagate from the master to the other slaves in due course, so a write made on one slave may not be immediately visible on all the others.
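The visibility behaviour described here can be illustrated with a toy model. The classes below are hypothetical and have nothing to do with Neo4j's actual API or replication protocol; they only mimic the described flow, in which a slave commits its write through the master and the other slaves catch up later.

```python
class Master:
    """Holds the authoritative, ordered log of committed writes."""

    def __init__(self):
        self.log = []  # list of (key, value) pairs in commit order

    def commit(self, key, value) -> None:
        self.log.append((key, value))


class Slave:
    """Keeps a local copy that may lag behind the master."""

    def __init__(self, master: Master):
        self.master = master
        self.data = {}
        self.applied = 0  # number of master log entries applied locally

    def pull_updates(self) -> None:
        """Apply every master log entry not yet seen by this slave."""
        for key, value in self.master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.master.log)

    def write(self, key, value) -> None:
        """Accept a write locally by synchronizing with the master."""
        self.pull_updates()              # catch up first
        self.master.commit(key, value)   # the master orders the write
        self.pull_updates()              # the writer sees its own write


master = Master()
slave_a, slave_b = Slave(master), Slave(master)

slave_a.write("name", "Ann")
print(slave_b.data.get("name"))  # slave_b has not pulled yet, so no value

slave_b.pull_updates()           # propagation "in due course"
print(slave_b.data.get("name"))
```

An application that writes through one slave and immediately reads from another therefore has to be prepared to see stale data for a short time.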

MongoDB definition


What is MongoDB

MongoDB is an open source, scalable, high-performance, document-oriented database written in C++ and optimized for highly transient data. It provides a RESTful API, and a free cloud-based monitoring service is available for MongoDB deployments. It supports searches by field, by range query, and by regular expression. Master-slave replication is supported: the master performs read and write operations, while slaves serve reads or are used for backups. MongoDB scales horizontally through sharding. It can also be used as an efficient file store that benefits from load balancing and data replication.
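The three kinds of search mentioned above (by field, by range, by regular expression) can be imitated over a list of dictionaries in plain Python. This is not the MongoDB driver API; the helper names and the sample `products` documents are made up. In MongoDB itself, such conditions are written as query documents using operators such as `$gte`, `$lte` and `$regex`.

```python
import re

# Hypothetical documents; note that they do not all share the same fields.
products = [
    {"name": "laptop", "price": 900, "tags": ["computer"]},
    {"name": "lamp", "price": 25},
    {"name": "phone", "price": 400},
]


def by_field(field, value):
    """Match documents whose field equals the given value."""
    return lambda doc: doc.get(field) == value


def by_range(field, low, high):
    """Match documents whose field lies in the inclusive range [low, high]."""
    return lambda doc: field in doc and low <= doc[field] <= high


def by_regex(field, pattern):
    """Match documents whose string field matches the regular expression."""
    rx = re.compile(pattern)
    return lambda doc: isinstance(doc.get(field), str) and bool(rx.search(doc[field]))


def find(docs, *predicates):
    """Return the documents that satisfy every predicate."""
    return [doc for doc in docs if all(p(doc) for p in predicates)]


cheap = find(products, by_range("price", 20, 500))
starts_with_la = find(products, by_regex("name", r"^la"))
both = find(products, by_regex("name", r"^la"), by_range("price", 20, 500))
print([d["name"] for d in cheap])
print([d["name"] for d in starts_with_la])
print([d["name"] for d in both])
```

A real database answers such queries with indexes rather than by scanning every document; the sketch only shows what the queries mean.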

Use-cases:

  • Flexible schemas make it a good fit for document and content management systems
  • Good fit alongside an RDBMS for e-commerce infrastructure
  • Good fit for gaming due to its high-performance reads and writes
  • Very efficient for the server-side infrastructure of mobile applications

Apache CouchDB


What is Apache CouchDB

Apache CouchDB is an open source NoSQL database. It stores data as JSON (JavaScript Object Notation, a lightweight data-interchange format) and uses JavaScript as its query language. CouchDB became an Apache Software Foundation project in 2008. Each CouchDB database is a collection of independent documents, and each document carries its own data and metadata (a self-contained schema). Thanks to its replication and synchronization capabilities, CouchDB is well suited to situations where a network connection is not guaranteed. The BBC uses it for its dynamic content platforms. It can be used in applications such as CRM and CMS, where data changes only occasionally and versioning is crucial. Cloudant is an enterprise software company that provides an open source distributed database service based on the Apache CouchDB project.

Features:

  • CouchDB provides ACID semantics by implementing a form of Multi-Version Concurrency Control (MVCC), which supports a high volume of concurrent readers and writers without conflict.
  • CouchDB supports bi-directional replication (or synchronization) and off-line operation.
  • Each document has a unique URI that is exposed via HTTP. The REST API uses the POST, GET, PUT, and DELETE HTTP methods for the four CRUD operations.
  • It offers eventual consistency (a consistency model used in distributed systems) so that it can provide both availability and partition tolerance.
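The revision check behind this style of concurrency control can be sketched in a few lines. The `DocumentStore` class below is an in-memory stand-in, not CouchDB's HTTP API: it uses integer revisions for brevity, whereas CouchDB revision identifiers are strings and a stale update is rejected with an HTTP 409 Conflict response.

```python
class ConflictError(Exception):
    """Raised when an update is based on an outdated revision."""


class DocumentStore:
    """Toy document store with optimistic, revision-based updates."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        """Return the current (revision, body) of a document."""
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        """Create a document (rev=None) or update it from a known revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(
                f"{doc_id}: update based on revision {rev}, "
                f"but the current revision is {current_rev}"
            )
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


db = DocumentStore()
db.put("invoice-1", {"total": 100})

# Two clients read the same revision of the document.
rev_a, doc_a = db.get("invoice-1")
rev_b, doc_b = db.get("invoice-1")

doc_a["total"] = 120
db.put("invoice-1", doc_a, rev=rev_a)      # first writer succeeds

doc_b["total"] = 90
try:
    db.put("invoice-1", doc_b, rev=rev_b)  # second writer holds a stale revision
except ConflictError as exc:
    print("conflict:", exc)
```

Readers are never blocked: they always see the last committed revision, and a writer that loses the race has to re-read the document and retry.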