Distributed NoSQL: Cassandra

In previous posts Distributed NoSQL: HBase and Accumulo and Distributed NoSQL: Riak, we explored two very different designs of key-value pair databases. In this post we will learn about Apache Cassandra, a hybrid of BigTable’s data model and Dynamo’s system design. With BigTable-like column/column family in mind, Cassandra provides a more flexible data model than Riak. Modeled after Dynamo’s system design, Cassandra has linear scalability and proven fault-tolerance on commodity hardware. Besides, Cassandra’s support for replicating across multiple datacenters is best-in-class. Since many features of Cassandra were already covered in previous posts as they are shared with HBase/Accumulo and Riak, we will focus on the additional unique features in what follows.

Data Model

Cassandra provides a two-dimensional row-column view to the data contained in a keyspace (i.e. table in HBase). Keyspaces are used to group column families together. If you need higher dimension to organize application data, there is the concept of super columns, which are columns that contain columns. However, super column is deprecated because of performance issues. Instead, developers are encouraged to use composite columns that was introduced in version 0.8.1. Before jumping into composite columns, we need to understand column sorting. Just like other key-value pair databases, the data type of keys and values in Cassandra are byte arrays. More interestingly, we can specify how column names will be compared for sort order when results are returned to the client. But why would I want to sort column names? This especially sounds strange to a relational database developer. In a relational database, we usually have tall tables, i.e. millions skinny rows with a handful columns. We could still follow this design in Cassandra although different rows don’t have to share same column set. On the other hand, one wide row could have millions columns in BigTable-like database, actually up to 2 billion columns in Cassandra. In this case, column names are usually part of the data (one may even go with valueless columns, i.e. column names are data themselves and the values are not really meaningful), rather than purely schema. For example, we can build inverted index with terms as the keys, document ids as the column names, and frequency as the value. With wide-row design, it is necessary to compare column names sometimes, for instance, each row is the time series of stock price in a day and the column names are time points. You can use compare_with attribute on a column family to tell Cassandra how to sort the columns. The default is BytesType, which is a straightforward lexical comparison of the bytes in each column. Other options are AsciiType, UTF8Type, LexicalUUIDType, TimeUUIDType, and LongType. You can also specify the fully-qualified class name to a class extending org.apache.cassandra.db.marshal.AbstractType. Again, we are sorting column names, not values. However, sorting column name providing a way to build second index, which is very useful in real world. Now come back to composite columns, which are arbitrary dimensional column names that can have types like CompositeType(UTF8Type, ReversedType(TimeUUIDType), LongType)). It’s also really simple: it is implemented as a comparator so adds very little complexity.

Storage

Because of the data model, Cassandra has the concepts of SSTable and MemTable, borrowed from Google BigTable. The details can be found in previous post Distributed NoSQL: HBase and Accumulo. On the other hand, Cassandra stores data in the native file system like Riak. Besides, Cassandra doesn’t require ZooKeeper (or any other third-party components). So it is very easy to configure and run. Actually Cassandra starts only one JVM per node, which brings a lot of simplicity to operation and maintenance. Remember how many moving parts in HBase/Accumulo?

Architecture

Same as Riak, Cassandra employs a ring topology but with more partition options. You can provide any IPartitioner implementation to distribute data on nodes. Out of the box, Cassandra provides RandomPartitioner, OrderPreservingPartitioner, ByteOrderedPartitioner, and CollatingOrderPreservingPartitioner. The default is RandomPartitioner to force equal spacing of tokens around the (MD5) hash space, especially for clusters with a small number of nodes. With OrderPreservingPartitioner the keys themselves are used to place on the ring. It brings data locality but also potential bottleneck on hot spots.

Beyond partitions, Cassandra also supports pluggable replication strategies through IReplicaPlacementStrategy to ensure reliability and fault tolerance. Out of the box, Cassandra provides SimpleStrategy (rack unaware), LocalStrategy (rack aware) and NetworkTopologyStrategy (datacenter aware). In addition to setting the number of replicas, the strategy sets the distribution of the replicas across the nodes in the cluster depending on the cluster’s topology. We are particularly interested in NetworkTopologyStrategy. With it, we can deploy the cluster across multiple data centers and specify how many replicas we want in each data center. If configured properly, Cassandra is able to read locally without incurring cross-datacenter latency, and handles failures nicely. The NetworkTopologyStrategy determines replica placement independently within each data center as follows:

The first replica is placed according to the partitioner
Additional replicas are placed by walking the ring clockwise until a node in a different rack is found. If no such node exists, additional replicas are placed in different nodes in the same rack.

To achieve this, we need a snitch maps IPs to racks and data centers. It defines how the nodes are grouped together within the overall network topology.

Vector Clock and Last-Write-Wins

Like Riak, Cassandra also has read repairs and active anti-entropy to resolve some consistency issues. However, Cassandra doesn’t have vector clocks and simply uses the last-write-wins approach. For sure, last-write-wins is simple but has problems with updates-based-on-existing-values issues. On the other hand, vector clocks are not adequate either in this case unless the data structure is CRDTs (consistency without concurrency control). It is debatable which approach is better. Meanwhile, Cassandra 2.0 provides compare and set based on Paxos consensus protocol, but misleadlingly labels it as “lightweight transaction”.

Summary

Overall, Cassandra is a very nice system with flexible data model, linear scalability and high availability. A Cassandra cluster is also much easier to set up and run than HBase/Accumulo. Besides, Cassandra offers secondary indexes to allow querying by value, which we didn’t discuss in detail in this post.

ALSO IN THIS SERIES:

Choosing a NoSQL

In the first installment of his Understanding NoSQL series, guest contributor Haifeng Li gives us a crucial advice for choosing a NoSQL database. Tips include: ignore benchmarking, and look into which problem the developers were originally trying to solve.

Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.