Wednesday, 25 July 2018

A Real Comparison Of NoSQL Databases HBase, Cassandra & MongoDB

What is NoSQL?

NoSQL refers to a family of data management technologies designed to handle the increasing volume, velocity, and variety of data. These systems store and retrieve data modeled in ways other than the tabular relations used in relational databases. NoSQL is often read as "Not only SQL" to emphasize that such systems may also support SQL-like query languages.

Why do we need NoSQL?

The Relational Databases have the following challenges:
  • Not good for large volumes (petabytes) of data with a variety of data types (e.g. images, videos, text)
  • Cannot scale for large data volumes
  • Cannot scale up, being limited by memory and CPU capabilities
  • Cannot scale out, being limited by cache-dependent read and write operations
  • Sharding (breaking the database into pieces and storing them on different nodes) causes operational problems (e.g. managing a shard failure)
  • Complex RDBMS model
  • Consistency limits the scalability in RDBMS
Compared to relational databases, NoSQL databases are more scalable and provide superior performance. NoSQL databases address the challenges that the relational model does not by providing the following solutions:
  • A scale-out, shared-nothing architecture, capable of running on a large number of nodes
  • A non-locking concurrency control mechanism so that real-time reads do not conflict with writes
  • Scalable replication and distribution – thousands of machines with distributed data
  • An architecture providing higher performance per node than RDBMS
  • Schema-less data model

HBase:

Wide-column store based on Apache Hadoop and on concepts of BigTable.
Apache HBase is a NoSQL key/value store which runs on top of HDFS. Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs. HBase is partitioned into tables, and tables are further split into column families. Column families, which must be declared in the schema, group together a certain set of columns (the columns themselves do not require a schema definition). For example, the "message" column family may include the columns: "to", "from", "date", "subject", and "body". Each key/value pair in HBase is defined as a cell, and each key consists of row-key, column family, column, and timestamp. A row in HBase is a grouping of key/value mappings identified by the row-key. HBase builds on Hadoop's infrastructure and scales horizontally using off-the-shelf servers.
HBase stores data as key/value pairs. It supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return the cells of a specified row, and delete to remove rows, columns, or column versions from the table. Versioning is available so that previous values of the data can be fetched (the history can be trimmed periodically to reclaim space via HBase compactions). Although HBase includes tables, a schema is only required for tables and column families, not for columns, and it includes increment/counter functionality.
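To make these operations concrete, here is a minimal sketch using the HBase Java client. It assumes an HBase cluster reachable through the local hbase-site.xml and a hypothetical "messages" table with a "message" column family, matching the example above; the table and row-key layout are illustrative, not prescribed by HBase.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseMessagesExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml / ZooKeeper quorum
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("messages"))) {

            // put: add or update a row; each cell is keyed by (row-key, column family, column, timestamp)
            Put put = new Put(Bytes.toBytes("user1#20180725"));
            put.addColumn(Bytes.toBytes("message"), Bytes.toBytes("subject"), Bytes.toBytes("Hello"));
            put.addColumn(Bytes.toBytes("message"), Bytes.toBytes("body"), Bytes.toBytes("A short note"));
            table.put(put);

            // get: return the cells of a single row identified by its row-key
            Result row = table.get(new Get(Bytes.toBytes("user1#20180725")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("message"), Bytes.toBytes("subject"))));

            // delete: remove the row (or individual columns / versions)
            table.delete(new Delete(Bytes.toBytes("user1#20180725")));
        }
    }
}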
HBase is queried through its own API and shell commands rather than a general-purpose query language, which takes some learning. SQL-like functionality can be achieved via Apache Phoenix, though it comes at the price of maintaining a schema. Furthermore, HBase isn't fully ACID compliant, although it does support certain ACID properties. Last but not least, running HBase requires ZooKeeper, a server for distributed coordination tasks such as configuration, maintenance, and naming.
HBase is well suited to real-time querying of Big Data. Facebook uses it for messaging and real-time analytics. They may even be using it to count Facebook likes.
HBase has a centralized architecture: the Master server is responsible for monitoring all RegionServer instances (which serve and manage regions) in the cluster, and it is the interface for all metadata changes. HBase provides CP (Consistency, Partition tolerance) from the CAP theorem.

HBase is optimized for reads, backed by a single-write master and the resulting strict consistency model, as well as by Ordered Partitioning, which supports row scans. HBase is well suited to range-based scans.
Linear Scalability for large tables and range scans -
Due to Ordered Partitioning, HBase easily scales horizontally while still supporting row-key range scans.
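As an illustration only, here is a sketch of such a row-key range scan, continuing the hypothetical "messages" table and Table handle from the earlier sketch (same imports). It uses the HBase 2.x Scan API; older 1.x clients use setStartRow/setStopRow instead of withStartRow/withStopRow, and the "<user>#<yyyyMMdd>" key shape is an assumption chosen so one user's rows sort contiguously.

// Scan a contiguous slice of row-keys, e.g. all 2018 messages for "user1".
// The stop row is exclusive.
Scan scan = new Scan()
        .withStartRow(Bytes.toBytes("user1#20180101"))
        .withStopRow(Bytes.toBytes("user1#20190101"));
scan.addFamily(Bytes.toBytes("message"));
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
    }
}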
Secondary Indexes - 
HBase does not natively support secondary indexes, but one use case of coprocessor triggers is that a trigger on a put can automatically keep a secondary index up to date, thereby not putting the burden on the application (client).
Simple Aggregation -
HBase coprocessors support simple out-of-the-box aggregations (SUM, MIN, MAX, AVG, STD). Other aggregations can be built by defining Java classes to perform the aggregation.
Real Usages: Facebook Messenger

Cassandra:

Wide-column store based on ideas from Google's Bigtable and Amazon's Dynamo
Apache Cassandra is a leading distributed NoSQL database management system. It drives many modern business applications by offering continuous availability, high scalability and performance, strong security, and operational simplicity, while lowering the overall cost of ownership.
Cassandra has a decentralized architecture; any node can perform any operation. It provides AP (Availability, Partition tolerance) from the CAP theorem.
Cassandra has excellent single-row read performance as long as eventual-consistency semantics are sufficient for the use case. Cassandra quorum reads, which are required for strict consistency, will naturally be slower than HBase reads. Cassandra does not support range-based row scans, which may be limiting in certain use cases. Cassandra is well suited to single-row queries, or to selecting multiple rows based on a column-value index.
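To illustrate the consistency/latency trade-off, here is a minimal sketch using the DataStax Java driver 3.x API against a hypothetical demo.users table (the table, keyspace, and contact point are assumptions; the newer 4.x driver uses CqlSession instead of Cluster/Session).

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class CassandraQuorumReadExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // A QUORUM read waits for a majority of replicas: stricter consistency,
            // higher latency. The driver's default consistency level (often ONE) is
            // faster but only eventually consistent.
            SimpleStatement read =
                    new SimpleStatement("SELECT name FROM demo.users WHERE user_id = ?", 42);
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);

            Row row = session.execute(read).one();
            if (row != null) {
                System.out.println(row.getString("name"));
            }
        }
    }
}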
If data is stored in columns in Cassandra to support range scans, the practical limit on row size in Cassandra is tens of megabytes. Rows larger than that cause problems with compaction overhead and time.
Cassandra supports secondary indexes on column families where the column name is known (not on dynamic columns).
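A small sketch of creating and querying such an index in CQL, executed through the same hypothetical session as in the sketch above (the users_by_city index name and city column are illustrative assumptions):

// Index a known, statically defined column so it can be queried by value.
session.execute("CREATE INDEX IF NOT EXISTS users_by_city ON demo.users (city)");

// Select multiple rows based on the indexed column's value.
for (Row r : session.execute("SELECT user_id, name FROM demo.users WHERE city = 'Austin'")) {
    System.out.println(r.getInt("user_id") + " " + r.getString("name"));
}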
Aggregations in Cassandra are not supported by the Cassandra nodes; the client must perform aggregations itself. When the aggregation requirement spans multiple rows, random partitioning makes aggregations very difficult for the client. The recommendation is to use Storm or Hadoop for aggregations.
Real Usages: Twitter

MongoDB:

One of the most popular document stores.
It is a document-oriented database. All data in MongoDB is handled in JSON/BSON format. It is a schema-less database that can handle terabytes of data. It also supports master-slave replication to keep multiple copies of data across servers, making the integration of data in certain types of applications easier and faster.
MongoDB combines the best of relational databases with the innovations of NoSQL technologies, enabling engineers to build modern applications.
MongoDB retains some of the most valuable features of relational databases: strong consistency, an expressive query language, and secondary indexes. As a result, developers can build highly functional applications faster than with other NoSQL databases.
MongoDB provides the data model flexibility, elastic scalability and high performance of NoSQL databases. As a result, engineers can continuously enhance applications, and deliver them at almost unlimited scale on commodity hardware.
Full index support for high performance
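A brief sketch with the MongoDB Java driver (the MongoClients API introduced in driver 3.7), using a hypothetical checkins collection to show the schema-less document model, a secondary index, and a document-based query; the database name, collection, and fields are assumptions for illustration.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MongoCheckinExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("demo");
            MongoCollection<Document> checkins = db.getCollection("checkins");

            // Schema-less: documents are JSON/BSON, and fields may differ between documents.
            checkins.insertOne(new Document("user", "alice")
                    .append("venue", "Blue Bottle")
                    .append("tags", java.util.Arrays.asList("coffee", "downtown")));

            // Secondary index on a field for fast lookups.
            checkins.createIndex(Indexes.ascending("user"));

            // Rich, document-based query.
            for (Document d : checkins.find(Filters.eq("user", "alice"))) {
                System.out.println(d.toJson());
            }
        }
    }
}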
Real Usages: Foursquare

Comparison Of NoSQL Databases HBase, Cassandra & MongoDB:

HBase:

Key characteristics:
·        Distributed and scalable big data store
·        Strong consistency
·        Built on top of Hadoop HDFS
·        CP on CAP
Good for:
·        Optimized for read
·        Well suited for range based scan
·        Strict consistency
·        Fast read and write with scalability
Not good for:
·        Classic transactional applications or even relational analytics
·        Applications that need full table scans
·        Data that must be aggregated, rolled up, or analyzed across rows
Usage Case: Facebook Messenger

Cassandra:

Key characteristics:
·        High availability
·        Incremental scalability
·        Eventually consistent
·        Trade-offs between consistency and latency
·        Minimal administration
·        No SPOF (single point of failure) – all nodes are equal in Cassandra
·        AP on CAP
Good for:
·        Simple setup and maintenance
·        Fast random reads/writes
·        Flexible, wide-column data requirements
·        Cases where multiple secondary indexes are not needed
Not good for:
·        Secondary index
·        Relational data
·        Transactional operations (Rollback, Commit)
·        Primary and financial records
·        Stringent security and authorization requirements on data
·        Dynamic queries/searching on column data
·        Low latency
Usage Case: Twitter, Travel portal

MongoDB:

Key characteristics:
·        Schemas that can change as applications evolve (schema-free)
·        Full index support for high performance
·        Replication and failover for high availability
·        Auto Sharding for easy Scalability
·        Rich document based queries for easy readability
·        Master-slave model
·        CP on CAP
Good for:
·        RDBMS replacement for web applications
·        Semi-structured content management
·        Real-time analytics and high-speed logging, caching and high scalability
·        Web 2.0, media, SaaS, gaming
Not good for:
·        Highly transactional systems
·        Applications with traditional database requirements such as foreign key constraints
Usage Case: Craigslist, Foursquare
