We have explored several interesting distributed key-value databases, including HBase, Accumulo, Riak, and Cassandra. Although key-value pairs are very flexible, it is tedious to map them to application objects. In this post we will learn about MongoDB, a popular document-oriented database. MongoDB uses JSON-like documents with dynamic schemas, making data integration easier and faster for certain types of applications. Beyond lookup by key, MongoDB supports search by value, range queries, and text search. Any field in a document can be indexed (with B-tree indexes, similar to those in an RDBMS).
Data Model
Documents are addressed in the database via a unique key. A document contains one or more fields, and each field contains a value of a specific data type (which may itself be an object or an array). Documents that tend to share a similar structure are organized into collections. Compared to relational databases, collections are analogous to tables and documents analogous to records. However, every record in a table has the same sequence of fields, while documents in a collection may have completely different fields. MongoDB stores documents in a binary representation called BSON (Binary JSON). The BSON encoding extends JSON with additional types such as date, int, long, double, byte array, etc.
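To make the dynamic schema concrete, here is a minimal sketch with PyMongo (the "mydb" database and "people" collection names are hypothetical): two documents with different fields coexist in one collection, and any field can be indexed and queried by value or range.

```python
# Minimal PyMongo sketch; database and collection names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client.mydb.people

# No fixed schema: these two documents have different fields.
people.insert_one({"name": "Alice", "age": 30})
people.insert_one({"name": "Bob", "emails": ["bob@example.com"], "vip": True})

# Any field can carry a B-tree index and be queried by value or range.
people.create_index("name")
for doc in people.find({"age": {"$gt": 25}}):
    print(doc)
```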
Storage
In MongoDB, each database has a namespace file holding entries, each of which points to (the first extent of) a collection or index in the data files. Data files are broken into runs of contiguous disk space called extents (which grow exponentially in size, up to 2GB) that hold documents or B-tree index nodes. Data files (and journals) are aggressively pre-allocated. MongoDB stores each document (plus some extra padding bytes as a growth buffer) in an extent as a contiguous block. Documents are updated in place, and a document is moved to a bigger space if an update grows it beyond its currently allocated space. All data files are memory-mapped into the virtual memory of mongod, the primary daemon process of the MongoDB system. MongoDB reads from and writes to RAM directly, and the OS takes care of the rest. This greatly simplifies the cache and file access logic in MongoDB and leverages the OS's LRU (least recently used) cache behavior. On the other hand, it incurs fragmentation management and read-ahead overhead. It also limits the data size MongoDB can handle on 32-bit systems because of the limited virtual address space.
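As a toy illustration of the underlying OS mechanism (not MongoDB's actual code), the following Python sketch memory-maps a file and reads and writes it as if it were an in-memory byte array, with the OS handling paging; it assumes a pre-existing, non-empty file named data.0.

```python
# Toy illustration of memory-mapped file I/O, the OS mechanism MongoDB
# relies on. Assumes a pre-existing, non-empty file named "data.0".
import mmap

with open("data.0", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)  # map the whole file into virtual memory
    header = mm[:16]               # reads touch RAM; the OS pages data in as needed
    mm[0:1] = b"\x01"              # in-place update of the mapped bytes
    mm.flush()                     # the OS also flushes dirty pages on its own
    mm.close()
```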
As shown in the above diagram, data and space are mainly organized through doubly linked lists. Each collection of data is organized in a linked list of extents. Each extent points to the head/tail of another linked list of documents (up to 16MB each). Data files may get fragmented over time because of updates and deletes, especially if documents vary in size. Fragmentation wastes memory and disk space and also makes writes scattered and slower. MongoDB provides the compact command for defragmentation.
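A hedged sketch of invoking that command from PyMongo (the "mydb" database and "events" collection names are hypothetical); note that compact blocks operations on the database while it runs:

```python
# Run the "compact" database command against a hypothetical collection
# to defragment its extents. Blocks operations while running.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
result = client.mydb.command("compact", "events")
print(result)
```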
Changes in the memory-mapped files are flushed to disk every 60 seconds. MongoDB uses write-ahead logging to an on-disk journal to guarantee write durability and to provide crash resiliency. The write-ahead log is committed every 100 milliseconds by default (configurable with --journalCommitInterval), so the maximum data loss on a hard crash is 100 ms. To achieve durability (i.e., data is written to the on-disk journal before the write is acknowledged), you can use the "j" option in the write concern (discussed in detail later).
MongoDB uses a readers-writer lock that allows concurrent read access to a database but gives exclusive access to a single write operation. Before version 2.2, this lock was implemented on a per-mongod basis. Since version 2.2, the lock has been implemented at the database level. In a properly designed schema, a write holds the lock for approximately 10 microseconds. If a slow-running operation is predicted (e.g., a document or an index entry will need to be paged in from disk), the operation yields the write lock.
Cluster Architecture
Although MongoDB can run as a single instance, it is often deployed as a cluster to provide scalability and high availability. High availability is achieved in MongoDB via replica sets, which provide data redundancy across multiple physical servers: a single primary plus multiple secondaries that replicate the primary's oplog and apply the operations to their own data sets. The primary accepts all write operations from clients and therefore provides strict consistency. By default, clients also read from the primary. To improve read throughput, clients can specify a read preference to send read operations to secondaries. With the "nearest" read preference, the client driver periodically pings the members and favors issuing queries to the one with the lowest latency. Note that a read request is issued to only one node; there are no quorum reads or reads from multiple nodes. With read preferences, however, the results may not reflect the latest writes because replication is asynchronous.

MongoDB also allows users to specify write availability in the system, which is called the write concern. A write concern can include the w option to specify the required number of acknowledgments from the replica set before returning, the j option to require the write to reach the journal before returning, and the wtimeout option to set a time limit that prevents write operations from blocking indefinitely. Prior to November 2012, MongoDB's client drivers returned as soon as the write had entered the client's outgoing queue. Now the default write concern acknowledges writes received by the primary (but before they are written to the journal), allowing the client to catch network exceptions and duplicate key errors.
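A hedged PyMongo sketch of these options (the hosts, replica set, database, and collection names are hypothetical): the write below waits for two members plus the journal, while the read is routed to the nearest member.

```python
# Minimal PyMongo sketch; hosts and names are hypothetical.
from pymongo import MongoClient, WriteConcern
from pymongo.read_preferences import ReadPreference

client = MongoClient("mongodb://host1:27017,host2:27017/?replicaSet=rs0")

# Write concern: acknowledgment from two members (w=2), write must reach
# the journal (j=True), and give up after 5 seconds (wtimeout).
events = client.mydb.get_collection(
    "events",
    write_concern=WriteConcern(w=2, j=True, wtimeout=5000),
)
events.insert_one({"type": "click", "user_id": 42})

# Read preference: query the lowest-latency member, accepting possibly
# stale results because replication is asynchronous.
nearest = client.mydb.get_collection(
    "events", read_preference=ReadPreference.NEAREST
)
doc = nearest.find_one({"type": "click"})
```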
Within a replica set, members exchange heartbeat messages with each other. When the primary does not communicate with the other members of the set for more than 10 seconds, the replica set attempts to elect another member as the new primary. The first secondary that receives a majority of the votes becomes the primary. Because replication is asynchronous, the newly elected primary does not necessarily have all the latest updates.
Note that a new primary may be elected even if the old one didn't crash, for example because of a network partition or simply an overloaded primary. In these situations, the old primary may have accepted write operations that had not been replicated to the secondaries when the failover occurred. When the former primary rejoins the replica set and attempts to continue replication as a secondary, it must roll back these operations to maintain database consistency across the replica set.
Although a replica set provides data redundancy and can load-balance reads (with eventual consistency), it doesn't provide linear scalability for writes, since all updates still have to go through the single primary. To load-balance writes, MongoDB provides auto-sharding, including range-based, hash-based, and tag-aware sharding. With sharding, a collection is partitioned into chunks, and the chunks are distributed across multiple shards.
A sharded MongoDB cluster consists of the shards, config servers, and routing instances. A shard is a MongoDB instance that holds a subset of a collection's data. Each shard is usually a replica set in production, although it can be a single mongod instance. Each config server is a mongod instance that holds metadata about the cluster; the metadata maps chunks to shards. Each router is a mongos instance that routes the reads and writes from applications to the shards. Applications do not access the shards directly.
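A hedged PyMongo sketch of setting up hash-based sharding through a mongos router (the address, the "mydb.events" namespace, and the user_id shard key are hypothetical):

```python
# Minimal sketch; assumes a mongos router at localhost:27017 and a
# hypothetical "mydb.events" collection sharded on "user_id".
from pymongo import MongoClient

mongos = MongoClient("mongodb://localhost:27017")  # connect to a mongos
mongos.admin.command("enableSharding", "mydb")     # allow mydb to be sharded
mongos.admin.command(
    "shardCollection", "mydb.events",
    key={"user_id": "hashed"},                     # hash-based partition key
)
```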
For writes, the routing server forwards the request to the primary of the shard hosting the chunk whose key range covers the partition key of the document. For reads, the routing server examines whether the partition key is part of the selection criteria and, if so, routes the request only to the corresponding shard. If the partition key is not part of the selection criteria, the routing server forwards the request to every shard; each shard performs a local search, and the results are gathered at the routing server and returned to the client.
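Continuing the hypothetical sketch above, the first query below includes the partition key and can be targeted to a single shard, while the second must be scattered to all shards and gathered at the mongos:

```python
# Continues the hypothetical "mydb.events" collection sharded on "user_id".
from pymongo import MongoClient

mongos = MongoClient("mongodb://localhost:27017")
events = mongos.mydb.events

targeted = events.find({"user_id": 42, "type": "click"})  # routed to one shard
scattered = events.find({"type": "click"})                # broadcast to all shards
```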
As chunks grow beyond the specified chunk size, a mongos instance will attempt to split the chunk in half. Splits may lead to an uneven distribution of the chunks for a collection across the shards. In such cases, the mongos instances will initiate a round of migrations to redistribute chunks evenly across shards.
Summary
MongoDB is an agile database that allows schemas to change quickly as applications evolve. With careful setup, MongoDB clusters can also provide scalability and high availability.
ALSO IN THIS SERIES:
Choosing a NoSQL
In the first installment of his Understanding NoSQL series, guest contributor Haifeng Li gives us crucial advice on choosing a NoSQL database. Tips include: ignore benchmarks, and look into which problem the developers were originally trying to solve.
Haifeng Li is the Chief Data Scientist at ADP. He has a proven history of delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with a deep understanding of computer theory and emerging technologies. He also has a diverse academic background, with research in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.
(Image Credit: 10gen)