In the previous post, Distributed NoSQL: HBase and Accumulo, we explored two BigTable-like open source solutions. In this post we will learn about Riak, a highly available key-value store modeled after Amazon.com’s Dynamo. As we saw, HBase and Accumulo provide strong consistency because a region/tablet is served by only one RegionServer/TabletServer at a time. However, this also introduces an availability problem: if a RegionServer fails, the corresponding regions are unavailable during the detection and recovery period. In contrast, Dynamo and Riak were designed to provide an “always-on” experience while sacrificing consistency under certain scenarios. Indeed, the famous CAP theorem tells us that it is impossible for a distributed system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response indicating whether it succeeded or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
Data Model
Riak has a “schemaless” design. Objects consist of key/value pairs, which are stored in flat namespaces called buckets. Riak treats both the key and the value supplied by the client as an opaque byte array. No operation (get or put) spans multiple data items. Applications should be able to derive the keys “for free”, without having to perform any queries to discover them, e.g. a user id or session id. Since all values are stored on disk as binaries, small binaries such as images and PDF documents can be stored directly as binary blobs. However, Riak does not recommend storing objects over 50 MB for performance reasons. Because the store is schemaless, structured data (e.g. relational tables) should be denormalized and is usually stored as JSON or XML.
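As a rough illustration of this key/value model, here is how an application might store and fetch a denormalized JSON record. This is a minimal sketch assuming the official Riak Python client and a single local node on the default protocol buffers port; the bucket and key names are made up for the example.

```python
import riak

# Connect to a single local Riak node (assumed defaults; a real
# deployment would point at several nodes of the cluster).
client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

# Buckets are flat namespaces; nothing needs to be declared up front.
users = client.bucket('users')

# Store a denormalized user record as JSON under a key the application
# already knows how to construct (e.g. from the user id).
obj = users.new('user-1001', data={
    'name': 'Alice',
    'email': 'alice@example.com',
    'orders': [42, 57]   # embedded instead of joined from another table
})
obj.store()

# Fetch it back by key -- the only access path Riak guarantees.
fetched = users.get('user-1001')
print(fetched.data['email'])
```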
Storage
In contrast to HBase and Accumulo, which rely on the complicated Hadoop HDFS, Riak stores data in the native file system. Moreover, Riak supports pluggable storage backends, just like MySQL. If one needs maximum throughput, Bitcask is a good choice (though its keyspace is memory-bounded, since all keys are kept in RAM). On the other hand, if one needs to store a very large number of keys or needs secondary indexes, then LevelDB is the better backend.
Architecture
All nodes in a Riak cluster are equal. Each node is fully capable of serving any client request. There is no “master”. This uniformity provides the basis for Riak’s fault-tolerance and scalability. This symmetric architecture is based on consistent hashing to distribute data around the cluster. In consistent hashing, the output range of a hash function is treated as a ring. Riak uses the SHA-1 hash function to map the keys of data items to a 160-bit integer space, which is divided into equally sized partitions. Each virtual node (vnode) claims a partition on the ring, and each physical node runs roughly an equal number of vnodes. Consistent hashing ensures data is evenly distributed around the cluster.
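The sketch below is a simplified stand-in for Riak’s actual ring implementation, but it shows the basic mechanics: hash the bucket/key pair with SHA-1 onto a 160-bit ring that is split into equal partitions, and the partition the hash lands in determines which vnode owns the key. The partition count of 64 is just an illustrative ring size.

```python
import hashlib

RING_SIZE = 2 ** 160            # output space of SHA-1
NUM_PARTITIONS = 64             # illustrative ring size
PARTITION_LEN = RING_SIZE // NUM_PARTITIONS

def partition_for(bucket, key):
    """Hash a bucket/key pair onto the ring and return its partition index."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).hexdigest()
    point = int(digest, 16)           # position on the 160-bit ring
    return point // PARTITION_LEN     # the equal-sized partition that claims it

# Each partition is claimed by one vnode; vnodes are spread across the
# physical nodes, so keys end up roughly evenly distributed.
print(partition_for("users", "user-1001"))
```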
Nodes can be added and removed from the cluster dynamically and Riak will redistribute the data accordingly. The ring state is shared around the cluster by a gossip protocol. Whenever a node changes its claim on the ring, it announces its change via this protocol. It also periodically re-announces what it knows about the ring, in case any nodes missed previous updates.
Riak automatically replicates data to N (default N=3) separate partitions on the Riak ring. Note that N=3 simply means that three different partitions/vnodes will receive copies of the data. There is no guarantee that the three replicas will land on three different physical nodes, although Riak attempts to distribute the data evenly. When a value is stored in the cluster, any node may act as the coordinator for the request. The coordinating node consults the ring state to determine which vnode owns the partition to which the value’s key belongs, then sends the put request to that vnode as well as to the vnodes responsible for the next N-1 partitions on the ring. The put request may also specify that at least W (<= N) of those vnodes reply with success, and that DW (<= W) reply with success only after durably storing the value. A get request operates similarly: it is sent to the vnode/partition in which the key resides as well as to the next N-1 partitions, and specifies R (<= N), the number of vnodes that must reply before a response is returned.
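In terms of the toy ring above, the set of replicas for a key (its preference list) is just the owning partition plus the next N-1 partitions clockwise. The helper below is a hypothetical sketch, not Riak’s internals, but it captures how the coordinator chooses where to send a request.

```python
NUM_PARTITIONS = 64   # must match the ring size used for hashing

def preference_list(primary_partition, n_val=3):
    """The owning partition plus the next N-1 partitions clockwise on the ring."""
    return [(primary_partition + i) % NUM_PARTITIONS for i in range(n_val)]

# A put is sent to every vnode in this list; the coordinator reports
# success once W of them acknowledge (and DW of those have written
# durably). A get queries the same list and waits for R replies.
print(preference_list(41))   # [41, 42, 43]
```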
By default, W and R are set to quorum to provide eventual consistency. A value of quorum means a majority of the N replicas (N/2+1, i.e. 2 for the default N of 3). Consider a failed node that has just recovered but does not yet have the requested key-value pair (or holds an old copy), or a client that reads a value immediately after a successful write, before replication has finished. Because W=2 and R=2, the coordinating node will receive at least one response with the latest value, which is returned to the client. Meanwhile a read repair process forces the errant nodes to update their copies based on the value of the successful read. With R=1, the client gets a faster response but risks receiving an old copy. In general, one can ensure that a read always reflects the most recent write by setting W+R>N. Note that this does not guarantee consistency under concurrent writes, which will be discussed in the section “Vector Clock” below.
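For the default N=3, the numbers work out as follows; this snippet just spells out the arithmetic behind “quorum” and the W+R>N overlap rule.

```python
N = 3
quorum = N // 2 + 1          # 2 for N=3: a majority of the replicas

W, R = quorum, quorum        # defaults: quorum writes and quorum reads
# With W=2 and R=2, any read overlaps the last successful write on at
# least one replica, so the coordinator can return the latest value
# (and trigger read repair on the stale copies).
assert W + R > N

# Trading consistency for latency: R=1 returns after the first reply,
# but that reply may come from a replica that missed the last write.
R_fast = 1
print(W + R_fast > N)        # False -> stale reads are possible
```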
Read repair is a passive process that is only triggered when data is read. Riak also has an automatic background process called active anti-entropy (AAE) that compares and repairs any divergent, missing, or corrupted replicas.
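Conceptually, AAE keeps a hash tree per partition and exchanges digests between replicas so that only divergent keys need to be compared. The sketch below flattens that idea into a single-level digest comparison; it is far simpler than Riak’s persistent hash trees, and the replica contents are invented for the example, but it shows the principle of finding out-of-sync keys without shipping whole values around.

```python
import hashlib

def digest(replica):
    """Hash every key/value in a replica so divergence can be detected cheaply."""
    return {k: hashlib.sha1(v.encode()).hexdigest() for k, v in replica.items()}

def divergent_keys(replica_a, replica_b):
    da, db = digest(replica_a), digest(replica_b)
    return [k for k in set(da) | set(db) if da.get(k) != db.get(k)]

a = {"user-1001": '{"name": "Alice"}', "user-1002": '{"name": "Bob"}'}
b = {"user-1001": '{"name": "Alice"}'}              # missing a key
print(divergent_keys(a, b))                         # ['user-1002']
```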
Another notable Riak feature is hinted handoff, built on the concept of a “sloppy quorum”. If a node fails or is partitioned from the rest of the cluster, a neighboring node takes responsibility for serving its requests. When the failed node returns, the accumulated updates are handed back to it. This keeps writes available during failures, and it happens automatically.
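A rough sketch of the sloppy-quorum idea follows; the helper is hypothetical rather than Riak’s implementation. When a primary replica is unreachable, the write is accepted by the next reachable vnode along the ring together with a “hint” naming the intended owner, so the data can be handed back when that owner recovers.

```python
def sloppy_quorum_targets(pref_list, is_up, num_partitions):
    """Replace unreachable primaries with the next reachable vnodes on the ring."""
    targets, hints = [], {}
    cursor = max(pref_list) + 1
    for primary in pref_list:
        if is_up(primary):
            targets.append(primary)
        else:
            # Walk clockwise until a live fallback vnode is found.
            while not is_up(cursor % num_partitions):
                cursor += 1
            fallback = cursor % num_partitions
            cursor += 1
            targets.append(fallback)
            hints[fallback] = primary   # hand the data back when it recovers
    return targets, hints

up = {41: False, 42: True, 43: True, 44: True}
print(sloppy_quorum_targets([41, 42, 43], lambda p: up.get(p, True), 64))
# ([44, 42, 43], {44: 41})
```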
Vector Clock
A vector clock is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations. A vector clock for a system of N processes is a vector of N logical clocks, one per process. When a key-value pair is added to a bucket, it is tagged with a vector clock as its initial version. The vector clock is then extended on each update, so that two versioned replicas can be compared to determine:
- Whether one object is a direct descendant of the other
- Whether the objects are direct descendants of a common parent
- Whether the objects are unrelated in recent heritage
With vector clocks, each replica node can auto-repair out-of-sync data when feasible. If more than one client reads a key-value pair concurrently and writes it back, Riak cannot reconcile the versions automatically and simply accepts both writes. When a read later arrives for the same key, Riak returns all the versions for that key and lets the client do the reconciliation, as sketched below.
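As a rough sketch (not Riak’s internal representation, which tags entries with client/vnode ids), a vector clock can be kept as a map from actor id to counter. Comparing two clocks tells us whether one version descends from the other or whether they are concurrent siblings that the client must reconcile.

```python
def increment(clock, actor):
    """Return a new clock with this actor's counter bumped by one."""
    clock = dict(clock)
    clock[actor] = clock.get(actor, 0) + 1
    return clock

def descends(a, b):
    """True if version a has seen everything version b has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def concurrent(a, b):
    """Neither descends from the other: siblings the client must reconcile."""
    return not descends(a, b) and not descends(b, a)

v0 = {}                              # initial version when the key is created
v1 = increment(v0, "client-a")       # {'client-a': 1}
v2 = increment(v1, "client-b")       # a direct descendant of v1
v3 = increment(v1, "client-c")       # a concurrent update based on the same v1

print(descends(v2, v1))    # True  -- one version descends from the other
print(concurrent(v2, v3))  # True  -- both are kept and returned as siblings
```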
Summary
Unlike CP systems such as HBase, Riak belongs to the school of AP systems, providing an “always-on” experience. In the next post, I will cover Apache Cassandra, a hybrid of BigTable’s data model and Dynamo’s system design.
Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.
(Image Credit: Garrett Coakley)