Three signs you might be experiencing a NoSQL hangover

Selecting a database technology to build your new application on is often a complex and even stressful process. While the business use case for the application is pretty straightforward, the nuances of the data platform that will power it are often much less clear, with the decision on what technology to use left in the hands of budget-starved development teams. The pressure to select not only the best database technology for the job, but also the technology with the lowest sticker price, makes NoSQL technologies, particularly open source, a tempting choice. However, while going with a NoSQL solution can seem like a smart decision in the moment, much like the Saturday-night decision to stay for one more drink, it can leave you with a similar kind of regret the morning after. Something we call the “NoSQL Hangover.”

That’s not to say NoSQL is never the right choice. There are many applications where NoSQL works well, but if you don’t fully understand both the current AND future needs of your business applications, then you may be wishing you hadn’t finished off that bottle of Barolo before making that NoSQL decision. For instance, will batch processing meet the real-time data ingestion, analysis and action needs of your application, even at a large scale? What about the consistency of transactions? If accuracy and speed matter, or will matter in the future, having a better understanding of the strengths and weaknesses of NoSQL offerings can help you make the right decision now and also provide some insurance for the future when your boss asks “can we add this new feature by next week?”

For years SQL was the go-to language, and then new systems like NoSQL began to emerge as new data management requirements evolved. For many, the divide between SQL and NoSQL is becoming even less clear, as developers integrate additional technologies into the data platform stack. Below are three indications that your choice to go NoSQL while building applications for the digital economy may not have been the right one.

Your application is available but not necessarily reliable

Looking at availability – being constantly responsive, even during network partitions – NoSQL systems appear to be an attractive option. In fact, you could even say that NoSQL was originally created for availability. When faced with network partitions, it is impossible for a distributed system to have both perfect availability and consistency, so each system makes a choice of favoring one or the other. Those choosing availability will always provide a fast answer, but it may not be the most accurate or current answer – in other words, it may be wrong. Depending on your use case and your SLAs, 100 percent of the answers you get from a NoSQL system may be inaccurate or out of date. And while this is not a big deal in some industries, the “almost” correct output can be crippling for those in financial services, healthcare, adtech and telecommunications. For applications where consistency is an absolute must, such as billing, trade verification, fraud detection, bid and offer management, sensor management in IoT, operations support (telco), SLA management and more, going NoSQL will likely lead to problems.
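
To make the trade-off concrete, here is a minimal sketch using the Cassandra Python driver as one example of an availability-oriented NoSQL system; the contact points, keyspace, table and column names are all hypothetical. The same read can be issued with an availability-leaning or a consistency-leaning setting, and the difference is exactly the one described above.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # hypothetical contact points
session = cluster.connect("billing")          # hypothetical keyspace

# Availability-leaning: any single replica may answer, so the read succeeds
# even during a partition, but the value returned can be stale.
fast_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE account_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# Consistency-leaning: a majority of replicas must agree, so the value is
# current, but the read fails if a partition leaves no quorum reachable.
safe_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE account_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

row = session.execute(safe_read, ["acct-42"]).one()
```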

NewSQL systems tend to favor consistency over availability in network failure scenarios, which ensures that the results returned are correct and current. Most of these systems have highly available features as well, such as data replication within and across clusters, so that there is no disruption to your application or business should a cluster “break.”

The speed you bought suddenly isn’t fast enough

While NoSQL offers datastore speed, it struggles to deliver consistent, reliable transactions at scale. For example, placing a trade at the best offer price, detecting a fraudulent card swipe before the transaction is approved, and allowing a mobile call to connect while verifying the caller’s balance all require scalable, consistent transactions.

These types of transactions touch many industries, including online gaming, financial services, telco and adtech, and they occur all over the world, sometimes millions of times per second. All of these industries need to handle the fluctuating velocity and volume of these events and require transactional consistency, or they risk basing critical decisions on incorrect or outdated data.
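
As a rough illustration of the “check, then act” pattern behind these examples, here is a minimal single-node sketch using Python’s built-in sqlite3 module; the table and column names are assumptions. A distributed NewSQL system provides the same atomic guarantee across a cluster rather than a single file, which is what makes the pattern usable at the scale described above.

```python
import sqlite3

conn = sqlite3.connect("billing.db", isolation_level=None)  # autocommit; transactions managed explicitly
conn.execute(
    "CREATE TABLE IF NOT EXISTS accounts (caller_id TEXT PRIMARY KEY, balance_cents INTEGER)"
)

def approve_call(caller_id: str, cost_cents: int) -> bool:
    """Atomically verify the caller's balance and debit it, or reject the call."""
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")  # take a write lock so the check and the debit happen together
    try:
        row = cur.execute(
            "SELECT balance_cents FROM accounts WHERE caller_id = ?", (caller_id,)
        ).fetchone()
        if row is None or row[0] < cost_cents:
            cur.execute("ROLLBACK")
            return False  # reject: unknown caller or insufficient balance
        cur.execute(
            "UPDATE accounts SET balance_cents = balance_cents - ? WHERE caller_id = ?",
            (cost_cents, caller_id),
        )
        cur.execute("COMMIT")
        return True
    except Exception:
        cur.execute("ROLLBACK")
        raise
```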

Speed is also a characteristic of application development – you need to be able to develop new applications and modify existing ones quickly and easily. Agile devops is the way to go, particularly when you are running your application as a service that needs to stay up and be updated regularly. Some data management systems make that easy and some do not. The ones that do provide more capabilities within the platform itself, so app developers can take advantage of them in their applications without adding complexity through more code.

Your system can scale but there are difficulties

When it comes to scalability, NoSQL offerings can seem attractive, especially to companies that are struggling with large amounts of incoming, unstructured data from multiple sources like mobile devices, user status updates, web clicks, etc. As a result, these databases have to scale out, and NoSQL allows you to easily scale applications on inexpensive hardware.

Despite this benefit, NoSQL still has scalability challenges. In particular, not all NoSQL databases handle automatic sharding – the process of partitioning a database across nodes – well. And if a database is unable to shard automatically, it cannot scale automatically in response to varying demand.
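
For readers new to the term, sharding boils down to routing each record to one of several nodes, typically by hashing its key. The sketch below, with hypothetical node names, shows the simplest possible version; the hard part that “automatic” sharding has to solve is rebalancing data when nodes join or leave, which naive modulo hashing handles poorly.

```python
import hashlib

# Hypothetical cluster members.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    """Map a record key to the node responsible for storing it."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(shard_for("user:1001"))  # the same key always lands on the same node
```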

Issues also begin to arise when you want to take action on streams of incoming data in real-time. Working on a static set of data is one thing – it is always there in case you need to access it again. But streaming data is always changing. If you have a hiccup in your processing or your system crashes, you need your system to recover quickly and gracefully and to be able to remember enough of the data for your analytics and actions to be valid, all while handling the newly arriving data. When you are dealing with complex stacks of components, that graceful recovery becomes much more challenging, particularly at scale. Solutions with fewer components are much easier to test for and manage through all possible failure scenarios.

NewSQL systems were built to address these same scalability challenges while also allowing you to perform real-time analytics and assign immediate actions from your database. If you want to dump data into a data lake for future analysis, NoSQL scales perfectly well. If you want to benefit from that same data by taking action on it the moment it arrives, when it is most relevant, then it’s time to take another Advil because NoSQL systems have difficulty blending speed and scale.

When it comes to choosing your next database technology, NoSQL may be the right choice for you and leave you regret-free. But in today’s world where real-time analytics, predictable high performance, trustworthy data and reliable scalability are becoming increasingly critical, it’s always a good idea to fully understand what your business and applications require in a data platform. Considering the symptoms above may help you make the right decision and avoid any future NoSQL hangovers.

 

Big Data’s Potential For Disruptive Innovation

An innovation that creates a new market and value network, and disrupts an existing market and value network by displacing the leading, highly established alliances, products and firms, is known as disruptive innovation. Clayton M. Christensen and his coworkers defined and analyzed this phenomenon in 1995. But not every revolutionary innovation is disruptive: an innovation is considered disruptive only when it creates disorder in the current marketplace.

The term ‘disruptive innovation’ has become very popular over the past few years. Despite many differences in how it is applied, many agree on the following.

Compared with their existing counterparts in the market, disruptive innovations are:

  • More accessible (with respect to distribution or usability)
  • Cheaper (from a customer perspective)
  • Built on a business model with structural cost advantages (relative to existing solutions)

These three characteristics of disruption matter because, when all of them are present, it is very difficult for an existing business to remain competitive. Whether an organization is saddled with an outmoded distribution system, highly trained specialist employees or a fixed infrastructure, adapting quickly to new environments is challenging when one or all of those things become outdated. Writing off billions of dollars of investment, upsetting the distribution partners of your core business, firing hundreds of employees – these are things managers are understandably reluctant to contemplate.

Every day, new technologies emerge, but market incumbents get shaken only when a technological innovation is extremely powerful. Big Data technologies such as NoSQL and Hadoop can be seen as catalysts for this type of innovation. It is worth remembering that big data by itself is just raw data; the disruptive innovation comes from the big data analytics processes and technologies built on top of it.

In the marketplace, big data is a disruptive force. It means that people require more than new skills, technologies and tools. They need an open mind to rethink the processes they have followed for a long time and to transform the way they operate. However, it is not particularly easy to force this type of change on long-time employees.

Big data can also be viewed differently: many people see it as a disruptive opportunity. Instead of the challenges stated above, consider two positive aspects:

  • There is an opportunity to gain advantage from the flux occurring in the market, market changes and disruptions.
  • Opportunity abhors a vacuum. If you don’t take advantage of the opportunities, you should expect that others will.

In his seminal work, The Innovator’s Dilemma, Clayton M. Christensen lays out a path forward for new, disruptive innovations in the following four phases:

Phase 1 – Performance

At this stage there are many new market entrants and a great deal of chaos, and customers focus mainly on emerging feature sets and functionality. When a technology first arrives in the market, people look for advanced features and high product performance, while making sure it actually does the new thing they expect.

Phase 2 – Reliability

When the market reaches this stage, people have accepted the feature set and they now want reliability and stability in the products. There is a shift in focus from ‘does this product do what we expected’ to ‘how reliable is this product.’

Phase 3 – Convenience

For big data, convenience means making the software accessible on mobile devices, for example as iPhone apps. Instead of command-line-driven products, appealing UIs become the norm, and customers begin demanding them.

Phase 4 – Price   

Once the above three phases have been completed, all market players are on a roughly equal footing and start competing on price. When the other criteria are satisfied and the product turns into a commodity, price becomes the only differentiator.

With Big Data, I think we are still early in this lifecycle. Most of the products are in Phase 1, and some are entering Phase 2. Consider Hadoop: in spite of the amount of hype, relatively few organizations are actually using it in production. Many want to utilize it as part of their Enterprise Data Warehouse, or as part of a Data Lake, but Hadoop needs additional features to make it reliable enough for the enterprise before that becomes a reality. It is getting there, because active users of Hadoop are working on this, as are its vendors, such as Pivotal and Cloudera. Expect a similar evolution along this continuum for Hadoop vendors and other relatively established Big Data technologies, as they add reliability and begin to think about convenience. YARN is one example of such an emerging technology in the Hadoop ecosystem.

With information at the centre of most modern disruptions, there are new opportunities to attack industries from various angles. In a fragmented limo market, Uber built a platform that let it move into the broader logistics and transportation market. Through streaming video, Netflix grabbed users’ attention and then used the data it had gathered to shake up the content production process. With its web mapping service, Google Maps, Google mapped the world and then took its understanding of street layouts and traffic patterns to build autonomous cars.

There is little doubt that disruption is in progress here. The products created by these players are more accessible and cheaper than those of their peers. The disruption is coming from orthogonal industries with strong information synergy, and it does not necessarily start at the low end of the market. It begins where the source of data is, and then builds an information-enabled system to attack an incumbent industry.

It is time for innovators, entrepreneurs and executives to stop arguing over whether something satisfies the traditional path of disruption. The disruption enabled by data may present an anomaly to the existing theory, but it is here and it is here to stay. The new questions must be:

  • How can you adapt in the face of this new type of competition?
  • When data is a critical piece of any new disruption, what capabilities do you need and where do you get them?
  • How do you assess new threats?

In order to succeed in this new environment, businesses require a thoughtful approach to recognize the potential threats combined with the will to make the right long-term investments — in spite of short-term profit incentives.

In spite of the various wild predictions made about big data, the reality is that big data is disruptive and that it follows an established path. Businesses need to know which disruption phase they are in and make sure they are meeting the requirements of the current phase as well as the next phase in the progression. Understanding this is essential to defining and implementing a big data strategy successfully and meeting needs proactively.

 

The Modern Face of Master Data

MDM still has the potential to meet many critical needs in this data-rich environment, but only if it can keep up with changing times.

Let’s start with the obvious: We’re already drowning in data, and it just keeps coming. We need to gather, store, collate and analyze it, because we know it contains intelligence that can guide ongoing and future initiatives. One invaluable tool in that effort is Master Data Management (MDM): It brings together different policies, standards, mandates, processes and even technologies underpinning the core data to provide a single point of reference. That makes it ideal for companies submerged in data and looking to make better use of it.

So why isn’t that happening? Why does MDM seem like a legacy discipline? And what can we do about it?

One obstacle is the sheer dynamism of the industry. Exciting as it is, the constant emergence of new technologies, each enabling more data in different formats, makes it inevitable that some traditional technologies and processes supporting MDM just can’t keep pace. MDM can then actually have the opposite of its intended effect: it devours IT resources and hurts other business initiatives. That’s one reason why many organizations have deployed multiple MDM programs, leading to exactly the kind of information silos that this advance was intended to eliminate.

Meanwhile, the data (and the number of data formats) keeps mounting, which means the critical task of developing actionable intelligence from different data sources—such as, for example, identifying key relationships between different customer subsets—gets more difficult each passing day.

Consumer-Enterprise: Bridging the Divide

Here’s a different perspective on the problem, and perhaps a path forward.

MDM originated in the enterprise arena, with extensions for different verticals, like life sciences. This is a world where untangling relationships between healthcare professionals and organizations—each with a complex web of plans and players—can provide a significant competitive advantage. However, mastering that massive dataset to gain a single view of the customer or product just isn’t enough: It requires comprehensive access to data from all the different areas of the organization, and a holistic view of the entire business to support multi-channel or omni-channel strategies. Meanwhile, there are other complications. We’ve got regulatory mandates to worry about, and data assets represent a potential revenue stream, to name two.

This is just one reason why so many MDM-only tools can’t do the job. They need to be supplemented with software that enables data quality and governance, supports self-service BI and analytics, and so on. All that is very enterprise, but now consider the consumer angle. It’s easy to dismiss any similarity between the relative trivia of, say, LinkedIn or Facebook and the complexity inherent in industry-specific MDM systems, but that’s really the point.

Those services and others like them provide effortless access to – and management of – all types of data, not just master data: profile information as well as transaction, interaction and social data, all within the same application.

In fact, they uncover rich relationships and connections across people, products, and organizations, while using the information to predict and recommend the best course of action.

There are no SQL queries, and no need to understand the underlying data model or data structures. Business users get relevant insights, and recommended actions before even asking the question, directly from a single application.

And again, before dismissing it as trivia, let’s remember that these applications blend both analytical and operational capabilities, and continuously scale to handle millions of records in real time. In fact, the data volumes involved are staggering. They also deliver new capabilities seamlessly to improve user experience and productivity on a regular basis across a wide range of devices.

That’s modern data management. It encompasses the traditional role of Master Data Management but also incorporates Big Data, social media, event data and a whole lot more.

Back to the Enterprise

This is the future of MDM: It should adhere to the Facebook and LinkedIn paradigm for B2B data management and data-driven applications. It’s a component of a wider modern data management platform, and it’s directly integrated with data-driven applications.

It builds on core MDM strengths such as address cleansing, match, and merge and applies them dynamically to data from internal and external sources, specifically as a component of an enterprise data-driven application. That’s how it can ensure reliable data, governance, role-level security, visibility, and more. Indeed, as with the consumer world, social and collaborative curation and feedback increases the value and use of the data.
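
As a toy illustration of that match-and-merge step, here is a sketch built on Python’s standard-library difflib; the field names, weights, survivorship rule and the 0.85 threshold are all assumptions, and production MDM platforms apply far richer matching and stewardship logic.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(rec1: dict, rec2: dict) -> float:
    """Weighted match score across name and address."""
    return 0.6 * similarity(rec1["name"], rec2["name"]) + \
           0.4 * similarity(rec1["address"], rec2["address"])

def merge(rec1: dict, rec2: dict) -> dict:
    """Survivorship rule: prefer the most recently updated non-empty value."""
    newer, older = sorted([rec1, rec2], key=lambda r: r["updated"], reverse=True)
    return {k: newer.get(k) or older.get(k) for k in set(newer) | set(older)}

a = {"name": "Acme GmbH", "address": "1 Main St, Berlin", "updated": "2016-01-10"}
b = {"name": "ACME Gmbh", "address": "1 Main Street, Berlin", "updated": "2016-02-02"}
golden_record = merge(a, b) if match_score(a, b) > 0.85 else None
```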

Of course, the fundamental purpose of this discipline is to guide business initiatives. Appropriate context must be built in to offer business teams the insights and recommendations they need. In particular scenarios, they can add to data generation by submitting their own updates and ratings. (That, in essence, is the equivalent of a Facebook like or a LinkedIn recommendation.)

Visibility – in the literal sense, the key to making the findings easier for more users to understand – is equally important. For example, one common technology thread in those consumer-facing applications is graph databases. Graphs highlight complex and evolving relationships between many different types of entities, and make them more comprehensible than traditional relational databases ever did. For the record, most off-the-shelf graph databases are unable to handle the data volume, variety and velocity involved. However, the most scalable enterprise data management platforms use a blend of columnar and graph NoSQL hybrid technology.

Remember, it’s been a long time since MDM entered the mainstream, and it gained attention by promising a 360-degree view of corporate data. Yet after all this time, most of the offerings on the market focus on, and manage, master data only. Many use them for a single customer domain, with product and other entities managed separately, if at all. The creation of a 360-degree view, with the accompanying interpretation of complex relationships and affiliations, remains a separate effort.

Moving forward, MDM can offer tremendous benefits as a component of a modern data management platform, fully integrated with data-driven applications and delivering fast time-to-value across the enterprise. But without a high level of modernization that adapts freely from consumer equivalents, it’s headed toward legacy status.

“I’d like to see NoSQL become much easier to adopt” – Interview with Basho’s John Musser

John Musser is an expert in software development, specifically in building highly scalable and resilient software and currently serves as Vice President of Technology at Basho. In his role at Basho, Musser helps drive product strategy and direction for the Basho family of products, which simplify enterprises’ most critical data management challenges. Prior to joining Basho in 2016, Musser founded and led API Science and the ProgrammableWeb.


You have a long standing history as an expert in API strategies. Why do you think API strategies are so valuable to a company and what are your tips for others who want to become an expert in this field?

APIs are so valuable because they enable companies to become platforms, and platforms today provide a tremendous competitive opportunity and strategic advantage. Just look at Apple and Google’s mobile platforms, Salesforce.com’s enterprise SaaS platforms, Amazon’s AWS cloud platforms and so on. All these companies have a platform strategy powered by APIs. To become an expert in this field one should look at the playbook of both these largest API players, but also a new generation of API-first companies like Twilio and Stripe that have disrupted markets through APIs.

Which use cases do you think can benefit most from NoSQL and related distributed computing?

You should start thinking about NoSQL solutions anytime you’re looking at a data-intensive project that is “at scale” — that is, a project with key requirements around high volume, high throughput and high availability. If you look at the use cases that gave rise to this market you’ll see how all those companies needed tools to handle data-at-scale: social networking, eCommerce, telecommunications and so on. As more and more enterprises find themselves with high-volume data needs, whether from customer data, sensor data, or anywhere else, they should evaluate NoSQL as a way to successfully manage and derive value from this data.

What trends do you see gaining steam over the coming year that will impact the use of NoSQL?

The first big trend driving use of NoSQL today is the rapid rise of the Internet of Things (IoT). Devices of all sorts, from connected cars to wearables to new connected industrial equipment, all generate tremendous volumes of data. Gartner, Inc. forecasts that 6.4 billion connected things will be in use worldwide in 2016, up 30 percent from 2015, and will reach 20.8 billion things by 2020. In 2016, 5.5 million new things will get connected every day. Both the volume and nature of this data is a natural fit for NoSQL.

The other big trend on the NoSQL horizon is the movement toward integrated data workflows optimized for data processing and analysis at scale. This means NoSQL will become a core component of packaged “stacks” of application components — distributed storage, message queueing, analytics, visualization — which, taken together, accelerate and simplify deploying and managing this class of big data, IoT and analytics applications. This maturation of NoSQL will bring greater value to businesses by allowing them to focus at a higher level, one that’s more results-oriented, rather than spending time and resources on the underlying data plumbing.

What one piece of advice would you like to share with companies that are trying build out big data applications?

The era of monolithic solutions is over and they should choose the right tool for the job at hand. For some companies this can be hard because it often means change; it means moving beyond past assumptions, and introducing new tools, technologies and processes. For example, many of the best NoSQL platforms come from the open-source world, and companies should embrace this approach not fear it.

How do you hope to see the use of NoSQL change?

I’d like to see NoSQL become much easier to adopt. As we talked about earlier, companies should be able to focus on how to get real value from all this data they’re collecting rather than worrying about all the infrastructure and integration logistics needed just to make it work. Today companies are cobbling together data management, analytics and visualization toolsets as part of their own custom data supply chains, but it’s too much work. Just as the idea of well integrated sets of tools like the LAMP stack really accelerated web development, this is happening in the world of NoSQL as well.

Are there any industries you think need to adopt NoSQL over others? Why?

Data plays a bigger and bigger role in driving our economy each year, and every industry that I can think of needs to be leveraging data to improve management of their assets, develop products and improve the quality of their interactions with their customers. At Basho we see how far this has spread, with customers spanning healthcare, telecommunications, ecommerce, gaming, manufacturing, utilities, security and software so I can’t single out any one industry, but those industries looking to leverage IoT data will need to adopt NoSQL and quickly.

Top 10 Big Data Videos on Youtube

Whether you’re entirely new to the field of big data or looking to expand your machine learning knowledge; whether you have 3 hours or 3 minutes; whether you want to know more about the technology or the high-level applications – this list is a sample of the best Youtube has to offer on big data. Hit play, and enjoy a time- and cost-effective way to continue your big data exploration.

(A note on selecting the videos: we took the most viewed videos in the “Science and Tech” and “Education” categories on Youtube on the subjects of big data, data science, machine learning and analytics. We then used a formula that weighs rates of engagement (likes and dislikes) against views. Got a great video that you think is missing from the list below? Let us know in the comments.)

1. Kenneth Cukier: Big Data is Better Data


It’s no surprise that a video from the ludicrously popular TED Talks has topped our list. This high-level talk on the nature of big data was filmed in Dataconomy’s hometown of Berlin, and explains why big data is better. As Cukier says, “More data allows us to see new. It allows us to see better. It allows us to see different”. It also introduces some applications, such as cancer cell detection and predictive policing- a compelling and easily digestible introduction for the uninitiated.

2. The Future of Robotics and Artificial Intelligence (Andrew Ng, Stanford University, STAN 2011)


Machine learning expert, Stanford Professor and Coursera Founder Andrew Ng discusses his research in robotics, covering fields of research from computer vision, reinforcement learning (including an autonomous helicopter), and the dream of building a robot who could clean your house.

3. Google I/O 2012 – SQL vs NoSQL: Battle of the Backends


In this video, Google employees Ken Ashcraft and Alfred Fuller battle it out to decide whether SQL or NoSQL is better for your application. Topics covered include queries, transactions, consistency, scalability and schema management. Recommended viewing if you’re unsure whether relational or non-relational models would be better for your application.

4. Introduction to NoSQL by Martin Fowler


Still intrigued by the prospect of NoSQL databases? This 50-minute talk by Martin Fowler of ThoughtWorks offers a rapid and comprehensive overview of NoSQL. This talk takes us through where NoSQL databases came from, why you should consider using them, and why SQL vs. NoSQL is not a ‘death or glory’ battle.

5. Big Data and Hadoop- Hadoop Tutorial for Beginners


This video is a free recording of a tutorial from a monetised, interactive course by Edureka! If you’re looking to delve deeper than the high-level talks and actually learn about the mechanisms of Hadoop, this video is a pretty good place to start. You’ll learn about nodes in Hadoop, how to read from and write to HDFS, and data replication. If you like what you see, Edureka! have hours of free Hadoop tutorials available here.

6. Explaining Big Data Lecture 1 | Machine Learning (Stanford)


This video is the first part of Stanford’s incredibly popular machine learning video lecture course. These lectures formed the basis of Andrew Ng’s Coursera course on machine learning, and feature extra content which was omitted from the 10-week Coursera tutorial for the sake of brevity. If you want to look into machine learning without spending a penny, this is the video series for you.

7. Brains, Sex, and Machine Learning


Co-inventor of Boltzmann machines, backpropagation, and contrastive divergence and all-round machine learning god Geoffrey Hinton is one of the best people in the world to guide you through the world of neural networks. This talk focuses on how machine learning provides explanations for puzzling phenomena in biology- including the field of sexual reproduction, hence the intriguing title.

8. What is Hadoop?


Intricity attempts to answer in three minutes a question which has made people outside the realm of data science tremble in fear for years – what exactly is Hadoop? If you find this video enlightening, you may want to check out their other short videos, which demystify terms such as “data governance”, “OLAP” and “Metadata Management”.

9. Big Ideas: How Big is Big Data?


This list is certainly dominated by clips for beginners- and it doesn’t get more back-to-basics than this video from EMC. If you think “big data” is defined by size, hit play and prepare to be enlightened.

10. Big Data Analytics: The Revolution Has Just Begun


SAS presents a talk from Dr. Will Hakes of Link Analytics, giving an extensive, high-level talk on the way big data analytics is changing- and will continue to change- the business intelligence landscape.

If you have another great big data video tip, be sure to let us know in the comments!

(Featured image credit: TED)

SQL vs. NoSQL vs. NewSQL: Finding the Right Solution

With SQL now invading the NoSQL camp (see here), how should an organization choose between a traditional SQL database, a NoSQL data store and a NewSQL database? 2014 Turing Award winner Mike Stonebraker said it best: “one size does not fit all.” The idea that a single database product can satisfy any (or all) use cases simply isn’t true these days.





If you are happy with the performance, scalability, and high availability of your current traditional SQL database system (the likes of Oracle, SQL Server, MySQL), then there is no reason to read further. However, if you have growing pains in any of these areas, then a NoSQL or NewSQL offering may be right for you. So how do you choose between them?

Choosing the right tool for the job at hand is 80% of getting to a solution; the other 20% is really understanding the problem you’re trying to solve. Here’s a rundown of the advantages and disadvantages of traditional SQL databases (affectionately called OldSQL in this article), NoSQL and NewSQL that can help you focus your data store choices.

OldSQL

Traditional SQL databases have been around for decades and serve as the core foundation of nearly every application we use today. If you have a deployed application and it is behaving and performing acceptably, that is fantastic; there is no need to change your data store. Realize that porting or replacing your database not only introduces work, it introduces risk. It is rare that a database replacement candidate is feature-for-feature, as well as bug-for-bug, compatible with the database you are replacing. Beware any vendor who suggests otherwise! Here are some of the typical patterns of a traditional database application:

The OldSQL advantages:

  • Proven disk-based database technology and standard SQL support, hardened over several decades.
  • Compatible with ORM layers, such as Hibernate or ActiveRecord.
  • Rich client-side transactions.
  • Ad hoc query and reporting.
  • An established market and ecosystem with vast amounts of standards-based tooling.

The OldSQL disadvantages:

  • Not a scale-out architecture. Transactional throughput is often gated by the capacity of a single machine. Scaling out requires application-defined and managed sharding (or partitioning) of the data.
  • Traditional SQL systems were built for “one size fits all.” They are good for general purpose applications with modest performance requirements, but struggle as needs grow.
  • Complex tuning parameters often require deep expertise to get the best balance between performance, data safety, and resource use.

NoSQL

First, realize that the term NoSQL is about as descriptive as categorizing dogs and horses as “NoCats”. In truth, NoSQL is a broad category collecting disparate technologies beneath an ambiguous umbrella. The term offers little help to the developer trying to decide on the right tool for the right job.

So let’s break it down with an eye on what we really care about as software engineers: what problems can I solve with NoSQL? Equally important, where is NoSQL a bad fit? Where do the different technologies show their strengths?

Some NoSQL Systems Put Availability First

Say you have gigabytes to petabytes of data. New data is added regularly and, once added, is relatively static. A database that archives sensor readings or ad impression displays is a good example. You want to store this in a cloud and are willing to tolerate the programming challenges of eventual consistency (made easier because most updates are idempotent anyway) for distributed access, multi-datacenter replication, and the highest possible availability.

Your application-to-database interactions are simple “CREATE” and “GET” patterns that don’t require traditional transactions. The most important consideration is that the database is always available to accept new content and can always provide content when queried, even if that content is not the most recent version written. Such systems include DynamoDB, Riak and Cassandra.
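
A minimal sketch of that CREATE/GET pattern against DynamoDB via boto3 looks like the following; the table name and attributes are assumptions. Reads are eventually consistent by default, which is precisely the availability-first behaviour described above, and a strongly consistent read can be requested at extra cost.

```python
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb", region_name="eu-central-1")
table = dynamodb.Table("sensor_readings")  # hypothetical table

# CREATE: append a new, effectively immutable reading.
table.put_item(Item={"sensor_id": "s-42", "ts": 1690000000, "reading_c": Decimal("21.7")})

# GET: eventually consistent by default -- fast and highly available, but
# possibly not the latest value. Pass ConsistentRead=True to trade some
# availability and latency for a current answer.
resp = table.get_item(Key={"sensor_id": "s-42", "ts": 1690000000})
print(resp.get("Item"))
```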

Some NoSQL Systems Focus on Flexibility

Made famous by MongoDB and CouchDB, the Document Model expands upon the traditional key-value store by replacing the values with JSON-structured documents, each able to contain sub-keys and sub-values, arrays of values, or hierarchies of all of the above. Often described as schemaless, these systems don’t enforce a premeditated or consistent schema across all of the stored documents. This makes managing schema different… less rigid, but also much messier. The benefits of this approach are likely most applicable to smaller development teams with simpler data needs.
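
A short sketch of that document model with PyMongo is shown below; the database, collection and field names are made up. Each document carries its own nested structure, and queries can reach into sub-keys without any table definition.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # hypothetical database and collection

# Documents need not share a schema: this one nests a customer and an item list.
orders.insert_one({
    "customer": {"name": "Ada", "city": "Berlin"},
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
})

# Query into sub-keys with dot notation.
for doc in orders.find({"customer.city": "Berlin"}):
    print(doc["_id"], doc["items"])
```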

Other systems expand upon key-value stores with organizational features. Redis, for example, is popular for maintaining sorted collections of data for easy ranking and leaderboards. By adding richer operations for ordering data and computing statistics over it, Redis provides functionality that is key to its specific use cases.
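
The leaderboard use case, for example, maps onto Redis sorted sets in a few lines; the key and member names below are hypothetical.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Scores are kept ordered on write, so ranking is cheap to read back.
r.zadd("game:leaderboard", {"alice": 3120, "bob": 2890, "carol": 3310})
r.zincrby("game:leaderboard", 150, "bob")  # bump a player's score

top_three = r.zrevrange("game:leaderboard", 0, 2, withscores=True)
print(top_three)  # highest scores first
```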

Some NoSQL Systems Focus on Alternative Data Models or Special Use Cases

The most common examples are systems tuned for graph processing, such as Neo4j. Array databases are another such example; SciDB uses Python and R to access MPP array data for scientific research. Accumulo is a variation on the wide-column-store model popularized by Cassandra and BigTable, but with a focus on cell-level security. Systems like etcd are distributed datastores with a focus on storing configuration data for other services. Elasticsearch is a popular system for implementing text search within applications.

The NoSQL advantages:

  • Eventual-consistency algorithms allow implementations to deliver the highest availability across multiple data centers.
  • Eventual-consistency based systems scale update workloads better than traditional OLAP RDBMSs, while also scaling to very large datasets.
  • Many NoSQL systems are optimized to support non-relational data, such as log messages, XML and JSON documents, as well as unstructured documents, allowing you to skip specifying schema-on-write, and allowing you to specify a schema-on-read.

The NoSQL disadvantages:

  • These systems are fundamentally not transactional (ACID). If they advertise otherwise, beware the over-reaching claim.
  • OLAP-style queries require a lot of application code. While the write-scaling advantages are appealing vs. OLAP stores (such as Vertica or GreenPlum), you sacrifice declarative ad hoc queries – important to historical analytical exploration.

NewSQL

The term NewSQL is not quite as broad as NoSQL. NewSQL systems all start with the relational data model and the SQL query language, and they all try to address some of the same sorts of scalability, inflexibility or lack-of-focus that has driven the NoSQL movement. Many offer stronger consistency guarantees.

But within this group there are many differences. HANA was created to be a business reporting powerhouse that could also handle a modest transactional workload, a perfect fit for SAP deployments. Hekaton adds sophisticated in-memory processing to the more traditional Microsoft SQL Server. Both systems are non-clustering for now, and both are designed to replace or enhance OldSQL deployments directly.

NuoDB set out to be a cluster-first SQL database with a focus on cloud-ops: run on many nodes across many datacenters and let the underlying system manage data locality and consistency for you. This comes at a cost in performance and consistency for arbitrary workloads. For workloads that are closer to key-value, global data management is a more tractable problem. NuoDB is the closest to being called eventually consistent of the NewSQL systems.

Other systems focus on clustered analytics, such as MemSQL. Distributed, with MySQL compatibility, MemSQL often offers faster OLAP analytics than all-in-one OldSQL systems, with higher concurrency and the ability to update data as it’s being analyzed.

VoltDB, the most mature of these systems, combines streaming analytics, strong ACID guarantees and native clustering. This allows VoltDB to be the system-of-record for data-intensive applications, while offering an integrated high-throughput, low-latency ingestion engine. It’s a great choice for policy enforcement, fraud/anomaly detection, or other fast-decisioning apps.

The Need for Speed: Fast in-memory SQL

Perhaps you have gigabytes to terabytes of data that needs high-speed transactional access. You have an incoming event stream (think sensors, mobile phones, network access points) and need per-event transactions to compute responses and analytics in real time. Your problem follows a pattern of “ingest, analyze, decide,” where the analytics and the decisions must be calculated per-request and not post-hoc in batch processing. NewSQL systems that offer the scale of NoSQL with stronger consistency may be the right choice.

The NewSQL advantages:

  • Minimized application complexity, thanks to stronger consistency and often full transactional support.
  • Familiar SQL and standard tooling.
  • Richer analytics leveraging SQL and extensions.
  • Many systems offer NoSQL-style clustering with more traditional data and query models.

The NewSQL disadvantages:

  • No NewSQL systems are as general-purpose as traditional SQL systems set out to be.
  • In-memory architectures may be inappropriate for volumes exceeding a few terabytes.
  • Offers only partial access to the rich tooling of traditional SQL systems.

Summing it up

As a general rule of thumb, consider evaluating NoSQL offerings if you favor availability or have special data model needs. Consider NewSQL if you’d like the at-scale speed of NoSQL, but with stronger consistency and the familiar and powerful SQL query language.

The froth in the data management space is substantial – and our tendency to talk in terms of categories (SQL, NoSQL, NewSQL) vs. problems makes it hard for software developers to understand what’s in the toolbox. The current offerings of new databases are not all alike – and recognizing how the DNA behind each helps or hinders problem solvers is the key to choosing the best solution.


John is the Vice President of Engineering at VoltDB, and has expertise growing and managing Agile development teams in early to mid-stage companies, specializing in highly-scalable Database, Web, Search/NLP, B2C and Enterprise, customer-facing product offerings.

Prior to joining VoltDB, John served as the Vice President of Web Engineering at EasyAsk Inc, where he was responsible for developing EasyAsk’s precision search and dynamic navigation solutions. Following the acquisition of EasyAsk by Progress Software, John became their Director of OpenEdge Database Division, where he oversaw three product lines – OpenEdge Database, ObjectStore, and Orbix.


(image credit: Scott McLeod)

VoltDB Secures $9.8 Million to Expand Streaming Analytics & IoT Capability

VoltDB today announced that it has raised a $9.8 million round of funding to extend its SQL in-memory database to power the next wave of real-time, fast data-driven applications. The round was led by strategic investors with participation from existing investors Kepha Partners and Sigma Prime Ventures.

Innovations in today’s enterprises are occurring at the intersection of fast data and Big Data through in-memory computing, streaming analytics and operational database technology. The rapid growth of mobility and the Internet of Things (IoT) drives the need for real-time data analysis and solutions that capture and process fast data to produce faster insight and action.

“The world of Big Data is evolving into one of actionable data, with organizations of all sizes searching for ways to capture, analyze and gain a competitive advantage on information the moment it enters the enterprise,” said John Mandile, Partner, Sigma Prime Ventures. “VoltDB has both best-in-class in-memory technology and the management team to empower organizations with the fast data-enabled applications required to compete in today’s data-driven economy.”

Data is fast before it is big, and as a result, the IoT has necessitated faster, real-time data analysis, intelligence and action. Fast data is generated by the massive flows of real-time data created by mobile devices, sensor networks, social media, machine-to-machine communications and connected devices. The value of fast data exists in the moment, and can be the difference between gaining or losing a customer, allowing unauthorized activity, or missing the opportunity to prevent a costly problem.

“We are committed to building the fastest in-memory database platform – with the performance and scale to deliver streaming analytics with real-time operational data interaction,” said Bruce Reading, CEO and President, VoltDB.  “VoltDB is the fast front end of today’s data pipeline. We provide organizations in financial services, telco, gaming, advertising and energy and utilities with the ability to analyze streams of data in motion and enable real-time decisions with millisecond latency.”

With the addition of this investment, VoltDB has raised a total of $31.3 million in funding. VoltDB’s recently released Version 5.4 of its in-memory, scale-out operational database is available for download at: http://voltdb.com/download/software.

(image credit: Rhys A)

Coming Full Circle: Why SQL now powers the NoSQL Craze

Data that can’t be easily queried or read is like a book with the pages glued together: not very useful.

For decades, SQL was the established language used to interact with databases. Then data management requirements necessitated by cloud environments and big data use cases led to new systems. These new systems, collectively termed “NoSQL,” focused first on scalability and availability. They often sacrificed the interactivity and queryability of SQL as a false shortcut to scale.

Giving up SQL was not a necessary trade-off. In fact, this design choice created systems that were harder to use, narrowed their use case profile, and forced users to write complex programs to replicate what would have been simple SQL statements. For example, if you talk to enterprises that were early Hadoop adopters, many complain that using the new technology required the expertise of exclusive, highly skilled programmers, while legacy technologies that were SQL based were more accessible to analysts, data scientists, managers and marketers. The short-term choice to give up SQL caused a lot of pain for users, led to incompatible and proprietary query languages, and in some cases resulted in systems that were severely limited by weak interactivity and queryability.

Understanding the modern database landscape is difficult. There are a large variety of systems, many vendors, and many different and overlapping technologies. It is interesting, though, that across this landscape, almost all have learned that SQL is table stakes. The original NoSQL systems posited that SQL didn’t scale, that SQL didn’t work without relational (table-based) schema, and that SQL interfaces were inappropriate to “modern” use cases. All of these claims have been debunked. Users are driving vendors to embrace SQL. We now see NoSQL vendors advertising their SQL capabilities as competitive differentiators. It is easy to predict that over the next 12 to 18 months we will begin to see marketing literature comparing SQL compatibility across different NoSQL systems. Try explaining that one to your manager.

Let’s look at a few very different systems that have adopted SQL despite different storage designs, different primary use cases, and different data models.

“The system must provide full SQL query support and other functionality users expect from a SQL database. Features like indexes and ad hoc query are not just nice to have, but absolute requirements for our business.” (Google F1 paper). Google implemented a SQL system (F1) to manage its core AdWords business. As of 2012, F1 was reported to manage hundreds of terabytes and process hundreds of thousands of queries per second – a clear demonstration that SQL scales and that SQL features are critical to satisfying business users.

The headline feature of Couchbase 4.0, long a vocal proponent of NoSQL, is a SQL interface. Couchbase brands its version of SQL “N1QL.” But N1QL is nearly indistinguishable from SQL. Couchbase touts SQL benefits alongside its NoSQL messaging: N1QL enables you to query a document database without limitations and without compromise – sort, filter, transform, group, and combine data with a query. That’s right. You can combine data from multiple documents with a JOIN. That flexible data model you were promised? This is it. You’re no longer limited to “single table” and “table per query” data models. (www.couchbase.com). The strawman that users were ever limited to single table and table per query models is certainly frustrating to vendors that have seen and embraced the SQL requirement from day one – but the sentiment is certainly easy to understand.

Apache Cassandra is another prominent “NoSQL” system that has been pushed to deliver SQL as the primary database interface. The project brands its flavor of SQL “CQL” but goes to considerable length to explain that CQL is really just SQL. “CQL v3 offers a model very close to SQL in the sense that data is put in tables containing rows of columns. For that reason, when used in this document, these terms (tables, rows and columns) have the same definition than (sic) they have in SQL.” (https://cassandra.apache.org/doc/cql3/CQL.html#Preamble)

Google first brought the MapReduce model to the marketplace. Behind the scenes, the company quickly realized the shortcomings of MapReduce as a query model and built a large-scale analytic SQL-based system named “Dremel.” Dremel maps a SQL query model onto a document-based storage model at massive scale. Again, Google proves that SQL scales across volume, velocity and problem space. The Hadoop ecosystem has followed the same path, launching multiple SQL-on-Hadoop projects, including Impala, Presto, SparkSQL and Drill.

Why SQL? Data management systems must make data interactive. Interactivity is core to deriving value from the data that is being stored. All of the critical design choices made by database designers are motivated by interactivity. Vendors that put other priorities first are quickly realizing the corner they’ve backed into and are grafting SQL query processors into their core products.

If you have to support queries, why use SQL?

SQL is a Standard: It’s easy to find people who know SQL. It’s easy to connect to standard tools. A multitude of resources are available to learn SQL. Bonus: your editor already has syntax highlighting.

SQL is Declarative: With SQL, the query is written by specifying the form of the results declaratively. It is the job of the database software to understand the best way to access data, operate on it, and turn it into results. Declarative queries insulate the query-author from the underlying physical schema of the data. Compared to non-declarative processing, applications are much less fragile, and tolerate changes to the underlying schema, such as added columns or indexes, with no changes to the query itself.
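
The contrast is easy to see with Python’s built-in sqlite3 module: the declarative query states the desired result once and lets the engine choose the access path, while the hand-rolled version has to spell out the scan, the grouping and the sort itself (the table and column names here are purely illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                 [("EU", 120.0), ("EU", 80.0), ("US", 350.0)])

# Declarative: describe the result; the database decides how to produce it.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY 2 DESC"
).fetchall()

# Imperative equivalent: scan, group and sort by hand.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM orders"):
    totals[region] = totals.get(region, 0) + amount
by_hand = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

assert rows == by_hand
```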

SQL Scales: “SQL doesn’t scale” was oft cited as a key reason NoSQL exploded. Also frequently heard is, “To solve internet-scale problems, you need to abandon SQL.” Today, both Facebook and Google have publicly sung the praises of their SQL systems. Many NoSQL stores have added SQL or SQL-like query languages without crippling performance. New SQL systems such as VoltDB have shown that millions of SQL operations per second are within the reach of anyone with an Amazon Web Services (AWS) account.

SQL is Flexible: While there are a number of SQL standards, vendors and open-source projects have liberally extended SQL. VoltDB supports UPSERT functionality, JSON extensions and other non-standard SQL our customers have requested, all while supporting all of the standard SQL operations familiar to developers.

In sum, SQL is a proven technology. It is the easiest way to write queries. It is the most familiar way to write queries. It is the most compatible way to write queries. It is the most powerful way to write queries. The ability to query data is at the heart of the database value proposition.


Ryan Betts is CTO at VoltDB. He was one of the initial developers of VoltDB’s commercial product, and values his unique opportunity to closely collaborate with customers, partners and prospects to understand their data management needs and help them to realize the business value of VoltDB and related technologies.

Prior to joining VoltDB in 2008, Ryan was a software engineer at IBM. During a four-and-a-half year tenure, he was responsible for implementing device configuration and monitoring as well as Web service management. Before IBM, Ryan was a software engineer at Lucent Technologies for five years. In that role, he played an integral part in the implementation of an automation framework for acceptance testing of Frame Relay, ATM and IP services, as well as a high-availability upgrade capability and several internal components related to device provisioning. Ryan is an alumnus of Worcester Polytechnic Institute.


(image credit: Ishrona)

]]>
https://dataconomy.ru/2015/07/06/coming-full-circle-why-sql-now-powers-the-nosql-craze/feed/ 13
Meet Azure DocumentDB- Microsoft’s NoSQL Document Database Service https://dataconomy.ru/2015/04/13/meet-azure-documentdb-microsofts-nosql-document-database-service/ https://dataconomy.ru/2015/04/13/meet-azure-documentdb-microsofts-nosql-document-database-service/#respond Mon, 13 Apr 2015 10:50:31 +0000 https://dataconomy.ru/?p=12608 Azure DocumentDB, Microsoft’s fully managed NoSQL document database service that was first previewed mid last year, is now generally available, the software giant revealed Wednesday. The latest offering addresses the growing demands for mobile first, cloud first application development, wrote the Director of Program Management, DocumentDB, John Macintyre, in a blog post making the announcement. […]]]>

Azure DocumentDB, Microsoft’s fully managed NoSQL document database service that was first previewed mid last year, is now generally available, the software giant revealed Wednesday.

The latest offering addresses the growing demands of mobile-first, cloud-first application development, wrote John Macintyre, Director of Program Management for DocumentDB, in a blog post announcing the release.

“NoSQL databases are becoming the tool of choice for many developers however running and managing these databases can be complicated and costly, especially at scale. DocumentDB is delivered as a fully managed database-as-a-service (DBaaS) with built in high availability, SQL query over indexed JSON and multi-document transaction processing,” he added.

DocumentDB allows applications to query and process JSON data at scale and comes with code samples, a query playground and lots of documentation, to enable ease of use. The DocumentDB Data Migration tool, an open source solution, helps import data from a variety of sources, including JSON files, CSV files, SQL Server, MongoDB and existing DocumentDB collections, Macintyre says. The migration tool source code is available on GitHub.

The release is available for purchase in three tiers, each of which comes with “reserved performance, hourly billing and 99.95% availability.”

Photo credit: jovike / Foter / CC BY-NC

]]>
https://dataconomy.ru/2015/04/13/meet-azure-documentdb-microsofts-nosql-document-database-service/feed/ 0
Scale Up is the New Scale Out https://dataconomy.ru/2015/04/07/scale-up-is-the-new-scale-out/ https://dataconomy.ru/2015/04/07/scale-up-is-the-new-scale-out/#respond Tue, 07 Apr 2015 12:01:24 +0000 https://dataconomy.ru/?p=12544 Scale up or scale out?  As we develop better tools and strategies for treating the whole data center- real or virtual- as a “server,” the answer seems obvious: it’s all about scaling out.  But that’s the big picture. While we’re scaling up, to more and more boxes, inside the box the server is becoming a […]]]>

Scale up or scale out?  As we develop better tools and strategies for treating the whole data center- real or virtual- as a “server,” the answer seems obvious: it’s all about scaling out.  But that’s the big picture. While we’re scaling out to more and more boxes, inside each box the server is becoming a miniature data center.  And that change is bringing an esoteric set of programming challenges to more and more data store developers.

It’s not a question of “scale up” or “scale out” any more. Like most either/or questions in IT, the answer is: neither, but also both. “Scale up” is just “scale out” in a box.

From an IT operations point of view, it’s all good news. You can plan further in advance without knowing the specifics of what hardware you’re likely to use. Will you find it less expensive to acquire more, cheaper nodes (physical or virtual) or fewer, larger nodes as your project grows?  Better not to have to worry about it early on. The extra work, though, falls on the programmers creating the next generation of scalable data stores.

First-generation scaling

When the industry began the first move from scale up to scale out, we did it in the obvious way. The individual nodes that form today’s first generation of scalable data stores are designed like old-school multi-threaded software. Complexity of scaling exists at the data center level, but within the node, the software design of today’s NoSQL servers would be familiar to a developer of the 1980s: classic, old-school threads.

MongoDB is an excellent example.  At the big picture level, it’s scalable across many nodes. But dig into the documentation that covers individual nodes, and you find that on each individual node, it’s a conventional multi-threaded program. What’s wrong with that?  Nothing, back in the day, when CPU cores were few.  But classic multi-threaded programming requires locking in order to enable communication among threads. Threads carry out tasks and must share data. In order to protect the consistency of this shared data, the developers of multi-threaded server software have developed a variety of locking methods. When one thread is modifying data, other threads are locked out.

Multi-threaded programming works fine at small scales. But as the number of cores grows, the amount of CPU time spent on managing locks can outgrow the time spent on real work. Today’s NoSQL clusters can grow, because they’re designed for it- but the individual nodes can’t. Inside the node, we’re facing a version of the same problem that we solve by going from a master RDBMS with a replica, up to a resilient multi-master NoSQL system.
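
As a minimal sketch of the pattern being described (this is an illustration, not MongoDB’s actual code), the Java class below guards a shared counter with a single lock. It is perfectly correct, but every increment serializes on that lock, and that serialization is exactly the cost that grows as the core count rises.

    // A shared counter guarded by one lock: the classic multi-threaded pattern.
    // With a few cores the wait is negligible; with dozens of cores the lock
    // itself becomes the bottleneck, no matter how fast each core is.
    public class LockedCounter {
        private long count = 0;
        private final Object lock = new Object();

        public void increment() {
            synchronized (lock) {   // only one core makes progress here at a time
                count++;
            }
        }

        public long get() {
            synchronized (lock) {
                return count;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            LockedCounter counter = new LockedCounter();
            Thread[] threads = new Thread[8];
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new Thread(() -> {
                    for (int j = 0; j < 1_000_000; j++) {
                        counter.increment();
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            System.out.println(counter.get()); // 8,000,000: correct, but fully serialized on the lock
        }
    }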

Next-generation scaling: new servers are like little data centers

The hardware on which the next generation of NoSQL systems will run has some important differences from the hardware that developers have been used to.  Our software and mental tools are based on some assumptions that no longer hold.  For example, not only are numbers of cores going up, but non-uniform memory access (NUMA) is the standard design for server hardware.  To a conventionally written program, the OS hides the hardware design, and memory is memory from the data store point of view. But secret performance penalties lurk when you try to access memory that’s hooked up to another CPU.

Fortunately, old-school threading and locking isn’t the only programming paradigm we have available. Advanced computer science research can do for the individual node what NoSQL systems do for the data center.  “Shared-nothing” designs allow all those multitudes of cores to run, doing real work, without waiting for locks. All shared data exchanged between cores must be explicitly passed as a message. The problem, though, is making new designs workable for real projects.  Developers need to be able to reason about data flow, to write tests, and to troubleshoot errors. As we move from threads and locks to more efficient constructs, how will we be able to cope?

An example of a way to do it is Seastar, a new, open-source development framework designed specifically for high-performance workloads on multicore. Seastar is built around “scale out” within the server, so that as server capacity grows, projects will be able to “scale up.”  Seastar reinvents server-side programming around several important concepts:

  • Shared-nothing design: Seastar uses a shared-nothing model to minimize coordination costs across cores.
  • High-performance networking: Seastar uses DPDK for fast user-space networking, without the complexity of the full-featured kernel network stack.
  • Futures and promises: programmers can obtain both high performance and the ability to create testable, debuggable, high-quality code (a short sketch of this style follows the list).
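
Seastar itself is C++, but the futures-and-promises style it advocates can be sketched in Java with CompletableFuture: each step registers what to do when a result arrives instead of blocking a thread on a lock. The operations below are stand-ins invented for the example; the point is the composition style, not any particular API.

    import java.util.concurrent.CompletableFuture;

    public class FuturesSketch {
        // Stand-ins for non-blocking operations; in a real engine these would be
        // asynchronous disk and network reads that complete at some later time.
        static CompletableFuture<String> readRecord(String key) {
            return CompletableFuture.supplyAsync(() -> "value-for-" + key);
        }

        static CompletableFuture<Integer> parseLength(String value) {
            return CompletableFuture.supplyAsync(value::length);
        }

        public static void main(String[] args) {
            // Each step declares what to do when the previous result is ready,
            // so no thread blocks while waiting and the chain is easy to test in isolation.
            CompletableFuture<Integer> pipeline =
                    readRecord("user:42")
                            .thenCompose(FuturesSketch::parseLength)
                            .exceptionally(ex -> -1); // errors propagate through the chain

            System.out.println(pipeline.join());
        }
    }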

The next generation of NoSQL will require a foundation that works as well for the increasing complexity of workloads on the server as the first generation worked for increasing numbers of nodes in the data center. The kind of development techniques that elite high-performance computing projects have used are now, thanks to multicore, going to be basic tools for production data stores where quality, maintainability, and extensibility are as important as performance.


Don Marti is Technical Marketing Manager for Cloudius Systems. Cloudius is a young and restless startup company that develops OSv, the next-generation cloud operating system. Cloudius Systems is an open-source-centric company, led by the originators of the KVM hypervisor, and employs superstar virtualization and OS veterans.


Photo credit: TheDesignInspiration

]]>
https://dataconomy.ru/2015/04/07/scale-up-is-the-new-scale-out/feed/ 0
SQL for Documents is the Next Frontier for NoSQL Startup Couchbase https://dataconomy.ru/2015/03/25/sql-for-documents-is-the-next-frontier-for-nosql-startup-couchbase/ https://dataconomy.ru/2015/03/25/sql-for-documents-is-the-next-frontier-for-nosql-startup-couchbase/#respond Wed, 25 Mar 2015 11:31:33 +0000 https://dataconomy.ru/?p=12476 Couchbase Server, the open source, distributed NoSQL document-oriented database has unveiled a new query tool compatible with SQL. The new offering, has already been at the disposal of developers as N1QL since over a year now. It will henceforth be shipped as part of Couchbase Server under the name SQL for Documents, Couchbase explained president and […]]]>

Couchbase Server, the open source, distributed NoSQL document-oriented database, has unveiled a new query tool compatible with SQL.

The new offering has already been at the disposal of developers as N1QL for over a year. It will henceforth be shipped as part of Couchbase Server under the name SQL for Documents, explained Couchbase president and chief executive Bob Wiederhold.

With this offering, Couchbase stands out among other NoSQL outfits like MongoDB and DataStax that don’t offer an SQL-based query language, Wiederhold said.


VentureBeat points out that SQL query languages becoming the standard is evidence that “database companies are looking to expand their relevance beyond developers who build applications and need to store and serve up data. Many business analysts know SQL well and could end up feeling comfortable enough to run queries on data in Couchbase.”

Couchbase’s intention to gain a footing in enterprise IT infrastructures is made clear with the unveiling of the “Multi-Dimensional Scaling”-capable Couchbase Server 4.0, slated for a summer release.

“As a result, you’re going to get dramatically higher performance for indexing and querying,” Wiederhold said.

Image credit: Couchbase

]]>
https://dataconomy.ru/2015/03/25/sql-for-documents-is-the-next-frontier-for-nosql-startup-couchbase/feed/ 0
Aerospike’s New CEO John Dillon on the 2015 Roadmap for Database Tech: Less Hype, Better Technology https://dataconomy.ru/2015/02/05/aerospikes-new-ceo-john-dillon-on-the-2015-roadmap-for-database-tech-less-hype-better-technology/ https://dataconomy.ru/2015/02/05/aerospikes-new-ceo-john-dillon-on-the-2015-roadmap-for-database-tech-less-hype-better-technology/#respond Thu, 05 Feb 2015 12:49:10 +0000 https://dataconomy.ru/?p=11881 2015 is certainly an exciting time for NoSQL database vendors Aerospike. 2014 saw them open source their technology, post groundbreaking benchmarks and add a whole raft of updates to their platform. Today, they’ve announced that Silicon Valley veteran John Dillon will be stepping into the breach as their new CEO. Dillon has 30 years of […]]]>

2015 is certainly an exciting time for NoSQL database vendor Aerospike. 2014 saw them open source their technology, post groundbreaking benchmarks and add a whole raft of updates to their platform. Today, they’ve announced that Silicon Valley veteran John Dillon will be stepping into the breach as their new CEO. Dillon has 30 years of experience in the database tech field under his belt, having previously been the CEO at Salesforce, EngineYard and Hyperion. His resume also boasts a leadership position at Oracle, during the period when the company grew from multi-million to multi-billion dollar revenues. Needless to say, he has big plans for this promising company.

Speaking to John earlier in the week, one of the things that stuck out most was his commitment to better marketing. “The biggest challenge I’m going to put on the marketing department is to make sure we’re not a company that’s full of hype,” he said. “There’s sort of the notion in Silicon Valley and tech communities around the world that if you brag a lot and put out a lot of press releases, everybody is going to think you’ve won the battle. One of the reasons why I joined Aerospike is this is about better technology, and I wanted to join the company that I thought had the best potential and the longest term opportunity to tackle.”

Even the most casual followers of the database tech market will have noticed a marked trend towards news announcements and press releases which claim any given technology is the best, the biggest, and the fastest. But when you have every player in the market making these grand claims, the power of these assertions is lost. This is something Dillon is keenly aware of- “Everybody is saying “We’re really fast, we’re really scaleable, we’re really fault-tolerant, we’re really great- and everybody in the world is adopting our technology”. Well, that can’t be true if all of the companies are saying exactly the same thing.”

“What we have at Aerospike is a technology that works really, really well and the customers love it because we support them and we provide kind of a TLC that’s necessary.” Indeed, Aerospike have a rich history of giving back to the developer community; their last raft of updates was focused around giving back to the developer, boasting an impressive roster of new clients– as well as 33 key updates that the open source community gave back to them in return. “I think that’s a better approach these days than spending all the money on marketing,” Dillon noted. “Developers want vendors that are helpful, that have technology that works, that have support and resources that are willing to engage to help them tackle problems that they couldn’t tackle with other technologies. That’s who we are and I think that’s probably a marked difference from what I’ve seen in terms of reviewing some of what I call “First Generation” NoSQL Database Vendors. They raise money at ridiculous valuations and then they spend it on marketing. I told Monica [Pal, Aerospike’s CMO] that we’re going to spend our money on building great technology and servicing our clients. It’s a little bit of a different notion but we think long term it’s probably a better strategy.”

So, if Aerospike’s focus moving forward isn’t going to be on breaking the bank with the marketing budget, what is it going to be? “Well we’re experiencing rapid growth and my focus is going to be to maintain a high degree of customer intimacy and customer support and satisfaction,” Dillon explained. “In other words we’re going to succeed 100% with every customer- that’s probably the core value, it’s already here in the company, but I intend to reinforce it. What we will also do is gradually tackle additional market places that have the characteristics that require higher data volumes, unbelievable scalability requirements and low latency and reliability and we’re just going to tackle one market after the other, do those really, really well.

“We have a pretty darn good database, but we just don’t think that trying to get everyone to adopt it at once is the right kind of focus. So I enjoy the press releases where other NoSQL vendors are saying “we’re opening offices all over the world”- I think that’s a great way to spend a whole bunch of money and have very little to show for it. We tend to tackle sort of a market at a time. We’d rather do fewer customers and do them well than try to spread ourselves too thin.”

Overall, the plan is not world domination- yet. It’s continuing to do what they do very well, and gradually trying to get more people to adopt their user-centric approach. “We’re not really trying to save the world, but there’s something satisfying in being able to provide technology that actually meets user requirements and takes a headache off their plate.”


]]>
https://dataconomy.ru/2015/02/05/aerospikes-new-ceo-john-dillon-on-the-2015-roadmap-for-database-tech-less-hype-better-technology/feed/ 0
Basho, Creators of Riak Database, Muster $25m in Fresh Funding https://dataconomy.ru/2015/01/19/basho-creators-of-riak-database-muster-25m-in-fresh-funding/ https://dataconomy.ru/2015/01/19/basho-creators-of-riak-database-muster-25m-in-fresh-funding/#respond Mon, 19 Jan 2015 10:18:55 +0000 https://dataconomy.ru/?p=11480 The distributed systems outfit Basho Technologies have secured a $25 million Series G funding that will see expansion in development and marketing activities. The company also charted record sales growth in 2014, it revealed earlier last week. Existing investor Georgetown Partners spearheaded the round. Basho’s open source distributed NoSQL database, Riak and cloud storage software, […]]]>

The distributed systems outfit Basho Technologies has secured $25 million in Series G funding that will see expansion of its development and marketing activities. The company also charted record sales growth in 2014, it revealed earlier last week.

Existing investor Georgetown Partners spearheaded the round.

Basho’s open source distributed NoSQL database, Riak, and cloud storage software, Riak CS, claim to be highly available, fault-tolerant and easy to operate at scale, and apparently are used by ‘fast growing Web businesses and by one third of the Fortune 50 to power their critical Web, mobile and social applications, and their public and private cloud platforms.’

“The new Basho management team has made strong progress in positioning the company to capitalize on growth opportunities for solutions that enable enterprises to extract value from the massive amounts of data they generate,” said Chester Davenport, the managing director of Georgetown Partners who is also the chairman of Basho Technologies.

“Riak and Riak CS software have extremely strong product roadmaps for 2015 and sales momentum is impressive. With Series G funding secured, I have confidence Basho will establish itself as a leading unstructured data solutions provider in 2015,” he added.

Founded in 2008, the company has garnered almost $60 million through equity and debt financing.

GigaOm points out that in an interview on Monday, newly appointed CEO Adam Wray cited an 89 percent annual increase in bookings, tens of millions in annual revenue and accounts at some of the world’s largest companies. With the growth projected for non-relational databases, Basho and peers such as MongoDB, DataStax and Couchbase stand a good chance of greater success.

Read more here.


(Image credit: Garrett Coakley, via Flickr)


]]>
https://dataconomy.ru/2015/01/19/basho-creators-of-riak-database-muster-25m-in-fresh-funding/feed/ 0
Neo Technology Proves Demand and Mainstream Adoption of Neo4j with Latest $20M Funding https://dataconomy.ru/2015/01/16/neo-technology-proves-demand-and-mainstream-adoption-of-neo4j-with-latest-20m-funding/ https://dataconomy.ru/2015/01/16/neo-technology-proves-demand-and-mainstream-adoption-of-neo4j-with-latest-20m-funding/#respond Fri, 16 Jan 2015 10:07:30 +0000 https://dataconomy.ru/?p=11446 Graph database innovator Neo Technology has raised $20 million in Series C funding, it announced yesterday. The fresh funding will help the outfit to work on the further development of their graph database Neo4j, as well as fuel growth in the open source community all the while expand outreach globally to address the ‘increasing demand […]]]>

Graph database innovator Neo Technology has raised $20 million in Series C funding, it announced yesterday.

The fresh funding will help the outfit work on further development of its graph database Neo4j, fuel growth in the open source community and expand outreach globally to address the ‘increasing demand for graph databases,’ explains their news release.

“We have followed Emil and the Neo team over many years as they have built their leading position in graph databases. We are thrilled and honored to now be on board this exciting journey,” explains Johan Brenner, General Partner at Creandum.

“There are two strong forces propelling our growth: one is the overall market’s increasing adoption of graph databases in the enterprise. The other is proven market validation of Neo4j to support mission-critical operational applications across a wide range of industries and functions,” said Emil Eifrem, Neo Technology’s CEO and co-founder.

The company believes that demand for its graph database product has increased, as evidenced by over 500,000 downloads since the launch of Neo4j 2.0 last year, along with “thousands of production deployments, a thriving community of developers worldwide and record turnout for Neo’s GraphConnect San Francisco 2014 conference,” the firm said.

The company counts both market veterans and startups among its users: Walmart, eBay, Earthlink, CenturyLink and Cisco, as well as startups such as Medium, CrunchBase, Polyvore and Zephyr Health, all incorporate Neo4j.

The investment was headed by Creandum with Dawn Capital and saw participation from existing investors like Fidelity Growth Partners Europe, Sunstone Capital and Conor Venture Partners. Neo Technology also pointed out that veteran venture capital investor Johan Brenner of Creandum will join Neo Technology’s Board of Directors.


(Image credit: Screenshot of Neo4j’s graph viz by Karsten Schmidt)

]]>
https://dataconomy.ru/2015/01/16/neo-technology-proves-demand-and-mainstream-adoption-of-neo4j-with-latest-20m-funding/feed/ 0
Fresh Round of $80m Funding Takes MongoDB’s Valuation to 1.6 Billion https://dataconomy.ru/2015/01/15/fresh-round-of-80m-funding-takes-mongodbs-valuation-to-1-6-billion/ https://dataconomy.ru/2015/01/15/fresh-round-of-80m-funding-takes-mongodbs-valuation-to-1-6-billion/#respond Thu, 15 Jan 2015 10:02:40 +0000 https://dataconomy.ru/?p=11427 Next-gen database innovator MongoDB revealed earlier this week that it had secured $80 million in fresh Series G funding. The round was led by an undisclosed sovereign wealth fund, with participation from Goldman Sachs and existing investors Altimeter Capital, NEA, Sequoia and funds managed by T. Rowe Price Associates. The round brings their  total financing received […]]]>

Next-gen database innovator MongoDB revealed earlier this week that it had secured $80 million in fresh Series G funding. The round was led by an undisclosed sovereign wealth fund, with participation from Goldman Sachs and existing investors Altimeter Capital, NEA, Sequoia and funds managed by T. Rowe Price Associates. The round brings their total financing received to date up to $311 million.

MongoDB’s President and Chief Executive, Dev Ittycheria said: “There’s a massive secular trend going on in the marketplace with the scale of data being consumed, the unstructured nature of data, the requirements to always be on and never offline.”

“Most developers and IT organizations today realize that modern applications can’t be built on a [database] architecture that was built over 40 years ago,” he added.

Wall Street Journal further reveals- citing knowledgeable sources- that MongoDB is now valued by venture capitalists at $1.6 billion after raising the new funding. Funding will be used for business development as the company shows no signs of going public.

With intentions to race past competition from peer startups as well as industry veterans, MongoDB claims 2,000 customers, including 34 of the Fortune 100, and a workforce of 400. The management team has also expanded, with Carlos Delatorre joining as Chief Revenue Officer from Clearside.

Read more here.


(Image credit: Garrett Heath, via Flickr)


]]>
https://dataconomy.ru/2015/01/15/fresh-round-of-80m-funding-takes-mongodbs-valuation-to-1-6-billion/feed/ 0
MongoDB and HCL Infosystems Hatch Global Alliance to Expand Big Data Influence https://dataconomy.ru/2014/12/19/mongodb-and-hcl-infosystems-hatch-global-alliance-to-expand-big-data-influence/ https://dataconomy.ru/2014/12/19/mongodb-and-hcl-infosystems-hatch-global-alliance-to-expand-big-data-influence/#respond Fri, 19 Dec 2014 11:33:47 +0000 https://dataconomy.ru/?p=11132 MongoDB, the cross-platform document-oriented database developer and HCL Infosystems, India’s Premier Distribution and IT Services and Solutions Company, have signed a global partnership in a bid to enable HCL Infosystems gain further grounding with its offerings in the emerging big data segment by developing services around MongoDB. “MongoDB provides immense value through its proven functionality […]]]>

MongoDB, the cross-platform document-oriented database developer, and HCL Infosystems, India’s Premier Distribution and IT Services and Solutions Company, have signed a global partnership in a bid to enable HCL Infosystems to gain further grounding with its offerings in the emerging big data segment by developing services around MongoDB.

“MongoDB provides immense value through its proven functionality for processing and analyzing large data sets and offers a scalable solution to customers. Our collaboration will enable us to broaden our market reach in the fast growing Big Data segment,” explained APS Bedi, President – Enterprise Business, HCL Infotech Ltd.

Kamal Brar, Vice President APAC at MongoDB, said: “Big Data is more than just addressing increasing volume; organizations need solutions that can manage increasing data variety and velocity. Today, businesses need to quickly ingest, store, and access useful information from massive pools of multi-structured data.”

“This is where MongoDB excels. HCL Infosystems will offer innovative solutions and services for organizations to gain value from their data and make smarter and faster business decisions. Partnering with HCL will further strengthen our footprint globally,” he further said explaining their strategy.

This is one of many partnerships that MongoDB has entered into globally, as a result of which it now has a thriving global community. “Common MongoDB use cases include Single View, Internet of Things, Mobile, Personalization, Content Management, Real-Time Analytics and Catalog,” reports a news release.

Read more here.

(Image credit: David Martin)

]]>
https://dataconomy.ru/2014/12/19/mongodb-and-hcl-infosystems-hatch-global-alliance-to-expand-big-data-influence/feed/ 0
Three Key New Features from Aerospike’s Extensive Upgrade https://dataconomy.ru/2014/12/15/three-key-new-features-from-aerospikes-extensive-upgrade/ https://dataconomy.ru/2014/12/15/three-key-new-features-from-aerospikes-extensive-upgrade/#comments Mon, 15 Dec 2014 16:16:04 +0000 https://dataconomy.ru/?p=11022 Back in August, Aerospike announced they were open-sourcing their signature platform. At the beginning of December, they were back again with news of a record-breaking Google Compute Engine Benchmark. Now, to round off what has been an exceptional year for the flash-optimised, in-memory NoSQL database, they’ve released a whole raft of updates to their database. […]]]>

Back in August, Aerospike announced they were open-sourcing their signature platform. At the beginning of December, they were back again with news of a record-breaking Google Compute Engine Benchmark. Now, to round off what has been an exceptional year for the flash-optimised, in-memory NoSQL database, they’ve released a whole raft of updates to their database.

The full announcement is, in a word, extensive- you can read the whole developer’s Christmas wishlist of updates here. We recently discussed the announcements with Aerospike’s CTO Brian Bulkowski; he highlighted three key developments which he believes are instrumental in fuelling the real-time applications of tomorrow.

1. A Whole Host of New Clients

One of the key headlines from the announcement is new clients for Python, PHP, Go and Ruby, plus a whole host of upgrades for Aerospike’s Java, Node.js, .NET / C# clients. Bulkowski highlighted that the vast range of clients was a demonstration of Aerospike’s commitment to their community.

“In the polyglot language world, you can’t just have a world-class database; you have to be in the community, you have to have different connectors, you have to be giving back to your community,” he explained. “Monica [Pal, Aerospike CMO]’s catchphrase for this release is ‘It’s about what we’re giving to developers, and what developers are giving to us’. It’s about having a great Ruby client, having a great PHP client, having a great Go client- we’re having a huge amount of success with Go, which is a little underappreciated as a language. On paper, it looks a little like a laundry list- but what it means to our communities is that they know we have them covered.”

2. Hadoop Integration

The integration of InputFormat allows Hadoop tools to analyse data stored in Aerospike without first performing ETL into HDFS. Aerospike’s Indexed MapReduce also allows you to analyse specific subsets of data, eliminating the need to scan an entire data lake, as is typical with HDFS.

As Bulkowski explained, “The nature of analytics is, once you’ve written your analysis, you don’t want to have to do it again- it’s troublesome, it’s error-prone- you want to use the tools you have. By backing the Hadoop tooling- even though we know our own tooling is better, faster and more capable- being compatible when it comes to analytics is really a benefit. Our key value comes from being row-oriented, rather than a streaming store like HDFS- so what we looked at was, what can we give to an analytics community as a row-oriented database? The benefit is not having to ETL- being able to run those jobs directly on your database in a safe and sane fashion.”
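
A rough sketch of what that looks like on the Hadoop side is below: a standard MapReduce job whose InputFormat is swapped for the connector’s class, so that mappers read records directly from Aerospike rather than from HDFS files. The connector class name and its key/value types are assumptions made for illustration; check the aerospike-hadoop connector documentation for the real ones.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AerospikeBackedJob {

        // Input key/value types depend on the connector; Object keeps the sketch generic.
        public static class CountingMapper extends Mapper<Object, Object, Text, LongWritable> {
            @Override
            protected void map(Object key, Object record, Context ctx)
                    throws IOException, InterruptedException {
                // Each record arrives straight from an Aerospike namespace/set, not from an HDFS file.
                ctx.write(new Text("records"), new LongWritable(1));
            }
        }

        @SuppressWarnings("unchecked")
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connector-specific settings (hosts, namespace, set) would go into conf here.

            Job job = Job.getInstance(conf, "analyze-aerospike-data");
            job.setJarByClass(AerospikeBackedJob.class);

            // The essential step: use the connector's InputFormat instead of an HDFS-based one.
            // The class name below is assumed for illustration only.
            Class<? extends InputFormat> aerospikeInput = (Class<? extends InputFormat>)
                    Class.forName("com.aerospike.hadoop.mapreduce.AerospikeInputFormat");
            job.setInputFormatClass(aerospikeInput);

            job.setMapperClass(CountingMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0])); // output directory from the command line

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }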

3. Docker Containers

As we have previously reported, taking a developer-centric approach in 2014 has become synonymous with Docker support. Bulkowski was certainly excited by the rapid development of containerisation, but was keen to stress the technology has a long way to go before it reaches maturation. “The containerisation of both the data centres and also the cloud are still in their infancy,” he remarked. “From a technical perspective, I think it’s a great idea. Containerisation definitely has the potential to reset the game in terms of high-IO applications like us. In terms of being in the community, having a Docker image that people can start playing with is great- but it’s still early days.”

These three headlines only really scratch the surface of this extensive announcement; Aerospike have also announced enhancements to their enterprise security, a raft of storage and performance improvements, and 33 key community contributions- just three months after going open source. I would urge you to give the full list a read here, and to keep your eyes peeled for further Aerospike innovation in 2015.


(Image credit: Aerospike)

]]>
https://dataconomy.ru/2014/12/15/three-key-new-features-from-aerospikes-extensive-upgrade/feed/ 1
In-Memory Data Grid Outfit Hazelcast Pick Up $11m to Gain Ground as NoSQL Tech Provider https://dataconomy.ru/2014/09/22/in-memory-data-grid-outfit-hazelcast-pick-up-11m-to-gain-ground-as-nosql-tech-provider/ https://dataconomy.ru/2014/09/22/in-memory-data-grid-outfit-hazelcast-pick-up-11m-to-gain-ground-as-nosql-tech-provider/#comments Mon, 22 Sep 2014 08:02:02 +0000 https://dataconomy.ru/?p=9342 Open source In-Memory Data Grid Hazelcast has secured an $11 million Series B Venture Capital round last week. Roland Manger from Earlybird Venture Capital, who will be joining the Board of Directors at Hazelcast, said, “Over the last three years, we have seen Hazelcast making significant progress in its path from an open source in-memory […]]]>

Open source In-Memory Data Grid Hazelcast has secured an $11 million Series B Venture Capital round last week.

Roland Manger from Earlybird Venture Capital, who will be joining the Board of Directors at Hazelcast, said, “Over the last three years, we have seen Hazelcast making significant progress in its path from an open source in-memory data grid project to a full-featured enterprise-class In-memory computing company.”

“Today Hazelcast already provides a superior NoSQL solution, and is about to release a unique large scale caching product; We are confident that they will continue to lead the industry as Enterprise In-Memory Computing goes mainstream,” he added.

Spearheaded by Earlybird Venture Capital, the round also saw participation from existing investors Ali Kutay (seed), Rod Johnson and Bain Capital Ventures, which, with approximately $5M, acted as a key benefactor, raising the total to $13.5M.

Owing to its extreme In-Memory Computing performance and elastic scalability, Hazelcast is fast becoming a top choice in NoSQL. Independent experts point out that Hazelcast is 10x faster than Cassandra on reads and faster on writes. The company provides speed-intensive apps with “an elastically scalable Key-Value store with extreme in-memory performance” under the Apache 2 open source software license, which Cassandra uses as well, reports Hazelcast in a statement.

Read more here.


(Image Credit: Hazelcast)

]]>
https://dataconomy.ru/2014/09/22/in-memory-data-grid-outfit-hazelcast-pick-up-11m-to-gain-ground-as-nosql-tech-provider/feed/ 1
22 October, 2014- GraphConnect 2014, San Francisco https://dataconomy.ru/2014/09/09/october-22-2014-graphconnect-2014-san-francisco/ https://dataconomy.ru/2014/09/09/october-22-2014-graphconnect-2014-san-francisco/#respond Tue, 09 Sep 2014 11:26:42 +0000 https://dataconomy.ru/?p=8969 GraphConnect is the only conference that focuses on the rapidly growing world of graph databases and applications, and features Neo4j, the world’s leading graph database. Join the hundreds of graphistas from startups to Global 2000 companies, and see how they are leveraging the power of the graph to solve their most critical connected data issues. […]]]>

GraphConnect is the only conference that focuses on the rapidly growing world of graph databases and applications, and features Neo4j, the world’s leading graph database.

Join the hundreds of graphistas from startups to Global 2000 companies, and see how they are leveraging the power of the graph to solve their most critical connected data issues.

Held at SF Jazz Centre, Graph Connect will feature talks from ebay, Polyvore and CrunchBase representatives, as well as training workshops for Neo4J.

A full programme and ticket information can be found here.

]]>
https://dataconomy.ru/2014/09/09/october-22-2014-graphconnect-2014-san-francisco/feed/ 0
Distributed NoSQL: Cassandra https://dataconomy.ru/2014/08/25/distributed-nosql-cassandra/ https://dataconomy.ru/2014/08/25/distributed-nosql-cassandra/#respond Mon, 25 Aug 2014 11:10:39 +0000 https://dataconomy.ru/?p=8645 In previous posts Distributed NoSQL: HBase and Accumulo and Distributed NoSQL: Riak, we explored two very different designs of key-value pair databases. In this post we will learn about Apache Cassandra, a hybrid of BigTable’s data model and Dynamo’s system design. With BigTable-like column/column family in mind, Cassandra provides a more flexible data model than […]]]>


In previous posts Distributed NoSQL: HBase and Accumulo and Distributed NoSQL: Riak, we explored two very different designs of key-value pair databases. In this post we will learn about Apache Cassandra, a hybrid of BigTable’s data model and Dynamo’s system design. With BigTable-like column/column family in mind, Cassandra provides a more flexible data model than Riak. Modeled after Dynamo’s system design, Cassandra has linear scalability and proven fault-tolerance on commodity hardware. Besides, Cassandra’s support for replicating across multiple datacenters is best-in-class. Since many features of Cassandra were already covered in previous posts as they are shared with HBase/Accumulo and Riak, we will focus on the additional unique features in what follows.

Data Model

Cassandra provides a two-dimensional row-column view of the data contained in a keyspace (i.e. table in HBase). Keyspaces are used to group column families together. If you need a higher dimension to organize application data, there is the concept of super columns, which are columns that contain columns. However, super columns are deprecated because of performance issues. Instead, developers are encouraged to use composite columns, which were introduced in version 0.8.1.

Before jumping into composite columns, we need to understand column sorting. Just like in other key-value pair databases, the data type of keys and values in Cassandra is the byte array. More interestingly, we can specify how column names will be compared for sort order when results are returned to the client. But why would I want to sort column names? This sounds especially strange to a relational database developer. In a relational database, we usually have tall tables, i.e. millions of skinny rows with a handful of columns. We could still follow this design in Cassandra, although different rows don’t have to share the same column set. On the other hand, one wide row could have millions of columns in a BigTable-like database (actually up to 2 billion columns in Cassandra). In this case, column names are usually part of the data (one may even go with valueless columns, i.e. the column names are data themselves and the values are not really meaningful), rather than purely schema. For example, we can build an inverted index with terms as the keys, document ids as the column names, and frequencies as the values.

With a wide-row design, it is sometimes necessary to compare column names, for instance when each row is the time series of a stock price over a day and the column names are time points. You can use the compare_with attribute on a column family to tell Cassandra how to sort the columns. The default is BytesType, which is a straightforward lexical comparison of the bytes in each column. Other options are AsciiType, UTF8Type, LexicalUUIDType, TimeUUIDType, and LongType. You can also specify the fully-qualified class name of a class extending org.apache.cassandra.db.marshal.AbstractType. Again, we are sorting column names, not values. However, sorting column names provides a way to build a secondary index, which is very useful in the real world.

Now come back to composite columns, which are arbitrary-dimensional column names that can have types like CompositeType(UTF8Type, ReversedType(TimeUUIDType), LongType). It is also really simple: it is implemented as a comparator, so it adds very little complexity.
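
As a toy analogy (this is an illustration, not Cassandra code), the Java sketch below models one wide row of a stock’s intraday prices as a sorted map: the map keys play the role of sorted column names, so a slice over a time range is just a contiguous scan, which is exactly the property the comparator configuration buys you.

    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class WideRowSketch {
        public static void main(String[] args) {
            // One "wide row": the row key would be a stock symbol plus a trading day, and each
            // column name is a time point. The map keeps column names sorted, just as Cassandra
            // keeps them sorted by the configured comparator (here: longs, ascending).
            NavigableMap<Long, Double> row = new TreeMap<>();
            row.put(93000000L, 101.2);   // 09:30:00.000
            row.put(93001000L, 101.3);   // 09:30:01.000
            row.put(100000000L, 102.0);  // 10:00:00.000
            row.put(153000000L, 99.8);   // 15:30:00.000

            // Because column names are sorted, a slice over the morning session is a
            // contiguous range scan rather than a query needing a secondary index.
            NavigableMap<Long, Double> morning = row.subMap(93000000L, true, 100000000L, true);
            morning.forEach((time, price) -> System.out.println(time + " -> " + price));
        }
    }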

Storage

Because of its data model, Cassandra has the concepts of SSTable and MemTable, borrowed from Google BigTable. The details can be found in the previous post Distributed NoSQL: HBase and Accumulo. On the other hand, Cassandra stores data in the native file system like Riak. Besides, Cassandra doesn’t require ZooKeeper (or any other third-party components), so it is very easy to configure and run. In fact, Cassandra starts only one JVM per node, which brings a lot of simplicity to operation and maintenance. Remember how many moving parts there are in HBase/Accumulo?

Architecture

Same as Riak, Cassandra employs a ring topology, but with more partition options. You can provide any IPartitioner implementation to distribute data across nodes. Out of the box, Cassandra provides RandomPartitioner, OrderPreservingPartitioner, ByteOrderedPartitioner, and CollatingOrderPreservingPartitioner. The default is RandomPartitioner, which forces equal spacing of tokens around the (MD5) hash space; this matters especially for clusters with a small number of nodes. With OrderPreservingPartitioner, the keys themselves are used to place data on the ring. This brings data locality, but also potential bottlenecks on hot spots.

Beyond partitions, Cassandra also supports pluggable replication strategies through IReplicaPlacementStrategy to ensure reliability and fault tolerance. Out of the box, Cassandra provides SimpleStrategy (rack unaware), LocalStrategy (rack aware) and NetworkTopologyStrategy (datacenter aware). In addition to setting the number of replicas, the strategy sets the distribution of the replicas across the nodes in the cluster depending on the cluster’s topology. We are particularly interested in NetworkTopologyStrategy. With it, we can deploy the cluster across multiple data centers and specify how many replicas we want in each data center. If configured properly, Cassandra is able to read locally without incurring cross-datacenter latency, and handles failures nicely. The NetworkTopologyStrategy determines replica placement independently within each data center as follows:

  • The first replica is placed according to the partitioner
  • Additional replicas are placed by walking the ring clockwise until a node in a different rack is found. If no such node exists, additional replicas are placed in different nodes in the same rack.

To achieve this, we need a snitch that maps IPs to racks and data centers. It defines how the nodes are grouped together within the overall network topology.
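
Assuming a cluster whose snitch reports two data centers named DC1 and DC2, a keyspace that keeps three replicas in one and two in the other can be declared as in this minimal sketch (DataStax Java driver; the keyspace and data center names are illustrative):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class MultiDcKeyspace {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // Three replicas in DC1, two in DC2; the snitch must report these data center names.
            session.execute("CREATE KEYSPACE IF NOT EXISTS orders WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2}");

            session.close();
            cluster.close();
        }
    }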

Vector Clock and Last-Write-Wins

Like Riak, Cassandra also has read repair and active anti-entropy to resolve some consistency issues. However, Cassandra doesn’t have vector clocks and simply uses the last-write-wins approach. For sure, last-write-wins is simple, but it has problems with updates that are based on existing values. On the other hand, vector clocks are not adequate in this case either, unless the data structure is a CRDT (consistency without concurrency control). It is debatable which approach is better. Meanwhile, Cassandra 2.0 provides compare-and-set based on the Paxos consensus protocol, but misleadingly labels it a “lightweight transaction”.

Summary

Overall, Cassandra is a very nice system with flexible data model, linear scalability and high availability. A Cassandra cluster is also much easier to set up and run than HBase/Accumulo. Besides, Cassandra offers secondary indexes to allow querying by value, which we didn’t discuss in detail in this post.

ALSO IN THIS SERIES:


Choosing a NoSQL

In the first installment of his Understanding NoSQL series, guest contributor Haifeng Li gives us a crucial advice for choosing a NoSQL database. Tips include: ignore benchmarking, and look into which problem the developers were originally trying to solve.
 


Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.


(Image credit: Yuko Honda)

]]>
https://dataconomy.ru/2014/08/25/distributed-nosql-cassandra/feed/ 0
Distributed NoSQL: Riak https://dataconomy.ru/2014/08/21/distributed-nosql-riak/ https://dataconomy.ru/2014/08/21/distributed-nosql-riak/#comments Thu, 21 Aug 2014 10:43:25 +0000 https://dataconomy.ru/?p=8569 In previous post Distributed NoSQL: HBase and Accumulo, we explored two BigTable-like open source solutions. In this post we will learn about Riak, a highly available key-value store modeled after Amazon.com’s Dynamo. As we know, HBase and Accumulo provide strong consistency as a region/tablet is served by only one RegionServer/TabletServer at a time. However, this also introduces the availability problem. […]]]>

In the previous post Distributed NoSQL: HBase and Accumulo, we explored two BigTable-like open source solutions. In this post we will learn about Riak, a highly available key-value store modeled after Amazon.com’s Dynamo. As we know, HBase and Accumulo provide strong consistency, as a region/tablet is served by only one RegionServer/TabletServer at a time. However, this also introduces an availability problem: if a RegionServer fails, the corresponding regions will not be available during the detection and recovery period. In contrast, Dynamo and Riak were designed to provide an “always-on” experience while sacrificing consistency under certain scenarios. Indeed, the famous CAP theorem tells us that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it was successful or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

Data Model

Riak has a “schemaless” design. Objects are comprised of key/value pairs, which are stored in flat namespaces called buckets. Riak treats both the key and value supplied by the client as an opaque byte array. In Riak, no operations (get and put) span multiple data items. Applications should be able to get the keys “for free”, without having to perform any queries to discover them, e.g. user id, session id, etc. Since all values are stored on disk as binaries, small binaries such as images and PDF documents can be stored directly as binary blobs. However, Riak does not recommend storing objects over 50MB for performance reasons. Because it is schemaless, structured data (e.g. relational tables) should be denormalized and usually stored as JSON or XML.
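
Because keys and values are opaque, writing to Riak amounts to putting bytes at a bucket/key address. The sketch below uses plain Java HTTP against Riak’s HTTP interface; the port and path layout are assumptions based on a default single-node install, so adjust them for a real cluster.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RiakPutExample {
        public static void main(String[] args) throws Exception {
            // Assumed endpoint layout: Riak's HTTP interface on localhost:8098,
            // bucket "users", key "u42".
            URL url = new URL("http://127.0.0.1:8098/buckets/users/keys/u42");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");

            // The value is opaque to Riak; denormalized JSON is a common choice.
            byte[] body = "{\"name\":\"Ada\",\"city\":\"Berlin\"}".getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            System.out.println("HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }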

Storage

In contrast to HBase and Accumulo relying on the complicated Hadoop HDFS, Riak stores data in the native file system. Moreover, Riak supports pluggable storage backends, just like MySQL. If one needs maximum throughput, Bitcask is a good choice (but with a memory-bounded keyspace). On the other hand, if one needs to store a large number of keys or needs secondary indexes, then LevelDB would be a better backend recommendation.

Architecture

All nodes in a Riak cluster are equal. Each node is fully capable of serving any client request. There is no “master”. This uniformity provides the basis for Riak’s fault-tolerance and scalability. This symmetric architecture is based on consistent hashing to distribute data around the cluster. In consistent hashing, the output range of a hash function is treated as a ring. Riak uses the SHA1 hash function to map the keys of data items to a 160-bit integer space which is divided into equally-sized partitions. Each virtual node (vnode) will claim a partition on the ring. The physical nodes each attempt to run roughly an equal number of vnodes. Consistent hashing ensures data is evenly distributed around the cluster.
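
The placement rule itself is simple enough to sketch: hash the key into the 160-bit space and see which equally sized slice of the ring it falls into. The Java toy below (not Riak’s implementation, which is written in Erlang) illustrates the idea.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class RingSketch {
        // Map a key onto one of `partitions` equally sized slices of the 160-bit SHA-1 space,
        // mimicking how Riak assigns a bucket/key pair to a vnode on its ring.
        static int partitionFor(String bucketAndKey, int partitions) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(bucketAndKey.getBytes(StandardCharsets.UTF_8));
            BigInteger hash = new BigInteger(1, digest);          // unsigned 160-bit value
            BigInteger ringSize = BigInteger.ONE.shiftLeft(160);  // size of the whole ring
            // hash / ringSize is a fraction of the way around the ring; scale it to a slice index.
            return hash.multiply(BigInteger.valueOf(partitions)).divide(ringSize).intValue();
        }

        public static void main(String[] args) throws Exception {
            int partitions = 64; // the ring is pre-split into a fixed number of partitions
            for (String key : new String[] {"users/u42", "users/u43", "sessions/abc"}) {
                System.out.println(key + " -> partition " + partitionFor(key, partitions));
            }
        }
    }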

(Figure: the Riak Ring)

Nodes can be added and removed from the cluster dynamically and Riak will redistribute the data accordingly. The ring state is shared around the cluster by a gossip protocol. Whenever a node changes its claim on the ring, it announces its change via this protocol. It also periodically re-announces what it knows about the ring, in case any nodes missed previous updates.

Riak automatically replicates data to N (default N=3) separate partitions on the Riak Ring. Note that N=3 simply means that three different partitions/vnodes will receive copies of the data. There are no guarantees that the three replicas will go to three different physical nodes although Riak attempts to distribute the data evenly. When a value is being stored in the cluster, any node may participate as the coordinator for the request. The coordinating node consults the ring state to determine which vnode owns the partition which the value’s key belongs to, then sends the put request to that vnode, as well as the vnodes responsible for the next N-1 partitions in the ring. The put request may also specify that at least W (<= N) of those vnodes reply with success, and that DW (<= W) reply with success only after durably storing the value. A get request operates similarly, sending requests to the vnode/partition in which the key resides, as well as to the next N-1 partitions. The request also specifies R (<= N), the number of vnodes that must reply before a response is returned.

(Figure: Riak Replication)

By default, W and R are set as quorum to provide eventual consistency. A value of quorum indicates a majority of the N value (N/2+1, or 2 for the default N value of 3). Consider that a failed node just recovered but doesn’t have the requested key-value pair or has an old copy, or that the client reads the value immediately after a successful write such that the replication process is not finished yet. Because W=2 and R=2, the coordinating node will receive at least one response with the latest value, which will be returned to the client. Meanwhile a read repair process will occur to force the errant nodes to update their object values based on the value of the successful read. With R=1, the client will get a faster response but takes a chance of receiving an old copy. In general, one can ensure that a read always reflects the most recent write by setting W+R>N. Note that this doesn’t guarantee consistency if there are concurrent writes after the read, which will be discussed in the next section, “Vector Clock”.

Read repair is a passive process that is only triggered when data is read. Riak also has an automatic background process called active anti-entropy (AAE) that compares and repairs any divergent, missing, or corrupted replicas.

Another notable Riak feature is hinted handoff with the concept of “sloppy quorum”. If a node fails or is partitioned from the rest of the cluster, a neighbor server takes responsibility for serving its requests. When the failed node returns, the updates are handed back to it. This ensures availability for writes, and happens automatically.

Vector Clock

A vector clock is an algorithm for generating a partial ordering of events in a distributed system and detecting causality violations. A vector clock of a system of N processes is a vector of N logical clocks, one clock per process. When a key-value pair is added to a bucket, it is tagged with a vector clock as the initial version. Later, the vector clock is extended for each update so that two versioned replicas can be compared to determine:

  • Whether one object is a direct descendant of the other
  • Whether the objects are direct descendants of a common parent
  • Whether the objects are unrelated in recent heritage

With vector clocks, each replica can auto-repair out-of-sync data when feasible. If more than one client reads a key-value pair concurrently and writes it back, Riak cannot reconcile the versions automatically and simply accepts both writes. When a read comes for the same key, Riak sends all the versions for that key and lets the client do manual reconciliation.
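
A minimal Java sketch of the bookkeeping involved is below: each version carries one counter per node that has coordinated an update, and comparing two versions tells you whether one descends from the other or whether they are concurrent siblings that the client must reconcile. This illustrates the general algorithm, not Riak’s internal representation.

    import java.util.HashMap;
    import java.util.Map;

    public class VectorClock {
        private final Map<String, Long> counters = new HashMap<>();

        // Record an update coordinated by the given node.
        void increment(String nodeId) {
            counters.merge(nodeId, 1L, Long::sum);
        }

        // True if this version is an ancestor of (or equal to) `other`:
        // every update recorded here is also recorded in `other`.
        boolean isAncestorOf(VectorClock other) {
            for (Map.Entry<String, Long> e : counters.entrySet()) {
                if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L)) {
                    return false;
                }
            }
            return true;
        }

        static VectorClock copy(VectorClock source) {
            VectorClock c = new VectorClock();
            c.counters.putAll(source.counters);
            return c;
        }

        public static void main(String[] args) {
            VectorClock original = new VectorClock();
            original.increment("nodeA");           // first write, coordinated by node A

            VectorClock replica1 = copy(original);
            replica1.increment("nodeB");           // one client updates via node B

            VectorClock replica2 = copy(original);
            replica2.increment("nodeC");           // another client updates via node C, concurrently

            System.out.println(original.isAncestorOf(replica1)); // true  -> replica1 supersedes it
            System.out.println(replica1.isAncestorOf(replica2)); // false -> neither descends from the other:
            System.out.println(replica2.isAncestorOf(replica1)); // false    siblings, client must reconcile
        }
    }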

Summary

Unlike CP systems such as HBase, Riak is in the school of AP systems, providing an “always-on” experience. In the next post, I will cover Apache Cassandra, a hybrid of BigTable’s data model and Dynamo’s system design.


Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.


(Image Credit: Garrett Coakley)

]]>
https://dataconomy.ru/2014/08/21/distributed-nosql-riak/feed/ 2
MongoDB Offers Paid Support for Free Community Edition to Uplift Revenues https://dataconomy.ru/2014/08/21/mongodb-offers-paid-support-for-free-community-edition-to-uplift-revenues/ https://dataconomy.ru/2014/08/21/mongodb-offers-paid-support-for-free-community-edition-to-uplift-revenues/#respond Thu, 21 Aug 2014 08:08:18 +0000 https://dataconomy.ru/?p=8551 MongoDB, the leading NoSQL database, announced the availability of Production Support for MongoDB for the free-to-download Community Edition. This allows users to get expert guidance from the same team that builds MongoDB, citing strong demand from community edition users for greater support options, without investing in the whole enterprise package. The Production Support offering is […]]]>

MongoDB, the leading NoSQL database, has announced the availability of Production Support for the free-to-download Community Edition. This allows users to get expert guidance from the same team that builds MongoDB, and responds to strong demand from Community Edition users for greater support options without investing in the whole enterprise package. The Production Support offering is now available as a standalone service, separate from the MongoDB Enterprise software.

“(The) Production Support goes beyond the typical “break/fix” scenario. Our highly-experienced support team offers consultative, proactive assistance on topics ranging from schema design, index optimization, performance testing and scaling out,” the MongoDB blog said. “We support thousands of MongoDB systems with simple and complex deployment topologies. We can help ensure that you are following best practices and getting the best possible performance from MongoDB.”

Prior to this announcement, MongoDB’s open source community had to look to forums for support and troubleshooting, as is the case with many “freemium”-model databases.

Another open-source company offering paid support may not be standalone headline news, but this story hints at changing dynamics within the NoSQL market. Competitors such as Couchbase have been raising venture capital and winning new customers, and increasing numbers of database vendors have been encroaching on MongoDB’s territory by offering support for the JSON file type.

It has been rumoured that MongoDB hasn’t been gathering as many paid users as it would like, but their non-paying userbase remains strong. If they can convert even a small percentage of their open-source users into paid-support customers, it could have a serious impact on MongoDB’s revenue stream.

Read more here.


(Image credit: Ian Whalen)

]]>
https://dataconomy.ru/2014/08/21/mongodb-offers-paid-support-for-free-community-edition-to-uplift-revenues/feed/ 0
Distributed NoSQL: HBase and Accumulo https://dataconomy.ru/2014/08/18/distributed-nosql-hbase-and-accumulo/ https://dataconomy.ru/2014/08/18/distributed-nosql-hbase-and-accumulo/#comments Mon, 18 Aug 2014 12:27:24 +0000 https://dataconomy.ru/?p=8478 NoSQL (Not Only SQL) database, departing from relational model, is a hot term nowadays although the name is kind of misleading. The data model (e.g., key-value, document, or graph) is surely very different from the tabular relations in the RDBMS. However, these non-relational data models are actually not new. For example, BerkeleyDB, a key-value store, […]]]>


NoSQL (Not Only SQL) databases, departing from the relational model, are a hot topic nowadays, although the name is kind of misleading. The data model (e.g., key-value, document, or graph) is surely very different from the tabular relations of the RDBMS. However, these non-relational data models are actually not new. For example, BerkeleyDB, a key-value store, was initially released in 1994 (20 years ago). In the web and social network era, the motivations of the (distributed) NoSQL movement are mainly horizontal scaling and high availability. By playing with the CAP theorem, many NoSQL stores compromise consistency in favor of availability and partition tolerance, which also brings simplicity of design. Note that a distributed database system doesn’t have to drop consistency. For instance, TeraData and Google’s F1 are ACID-compliant (Atomicity, Consistency, Isolation, Durability). However, this makes the systems much more complicated and also imposes a high performance overhead.

In this series, I will look into several popular open source NoSQL solutions. This post covers Apache HBase and Apache Accumulo. Both are modeled after Google's BigTable, implemented in Java, and run on top of Apache Hadoop. Overall, HBase and Accumulo are very similar in architecture and features (especially now that HBase 0.98 supports cell-level security, previously a unique offering of Accumulo). In what follows, we will mainly discuss the design of HBase and also point out where Accumulo differs.

Data Model

In BigTable-like stores, data is stored in tables, which are made of rows and columns. Columns are grouped into column families. A column name is made of its column family prefix and a qualifier. The column family prefix must be composed of printable characters, while column qualifiers can be made of arbitrary bytes. In HBase, column families must be declared up front at schema definition time, whereas new columns can be added to any column family without pre-announcing them. In contrast, column families are not static in Accumulo and can be created on the fly. The only way to get a complete set of columns that exist for a column family is to process all the rows.

Table row keys are uninterpreted byte arrays, and rows are lexicographically sorted by row key. In HBase, the empty byte array is used to denote both the start and end of a table's namespace, while null is used for this purpose in Accumulo.

A cell's content is an uninterpreted array of bytes, and table cells are versioned. A (row, column, version) tuple exactly specifies a cell. The version is specified as a long integer, typically a timestamp. By default, a get returns the cell whose version has the largest value. It is possible to return more than one version with Get.setMaxVersions() or to return versions other than the latest with Get.setTimeRange() (here we use the HBase Get class as an example). Without a specified version, Put always creates a new version of a cell using the server's currentTimeMillis, but the user may specify the version on a per-column level. The user-provided version may be a time in the past or the future, or a non-time long value. To overwrite an existing value, an exact version should be provided. A Delete can target a specific version of a cell or all versions. To save space, HBase also cleans up old or expired versions; to declare how much data to retain, one may define the number of versions or the time to live (TTL) per column family.
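
To make the versioning behaviour concrete, here is a minimal sketch using the classic HBase client API of the 0.94/0.98 era; the table name ("events"), column family ("d"), qualifier and row key are made up for the example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VersionedCellExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "events");  // hypothetical table with column family "d"

            // Put with an explicit version; omitting the long uses the server's currentTimeMillis.
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("status"), 42L, Bytes.toBytes("ok"));
            table.put(put);

            // Get up to three versions within a time range; the newest one wins by default.
            Get get = new Get(Bytes.toBytes("row1"));
            get.setMaxVersions(3);
            get.setTimeRange(0L, Long.MAX_VALUE);
            Result result = table.get(get);
            byte[] latest = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"));
            for (Cell cell : result.rawCells()) {  // all versions that were returned
                System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
            }

            // Delete a single version of a column (deleteColumns() would tombstone all versions).
            Delete delete = new Delete(Bytes.toBytes("row1"));
            delete.deleteColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), 42L);
            table.delete(delete);
            table.close();
        }
    }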

Deletes work by creating tombstone markers. Once a tombstone marker is set, the "deleted" cells become effectively invisible to Get and Scan operations but are not immediately removed from store files. There is a snag with the tombstone approach, namely that "deletes mask puts": once a tombstone marker is set, even puts issued after the delete will be masked by it. Performing such a put will not fail, but a subsequent get will not see it; the put only becomes visible after the next major compaction, which actually removes the deleted cells and the tombstone markers (see below for details).

Storage

Physically, both HBase and Accumulo use HDFS to store data. Empty cells are not stored, since tables usually have a large number of columns and are very sparse. In HBase, tables are stored on a per-column-family basis: all members of a column family are stored together on HDFS. Accumulo also supports storing sets of column families separately on disk to avoid scanning over column families that are not requested, but by default a table places all column families into the same "default" locality group. Additional locality groups can be configured at any time.
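
As an illustration of that last point, here is a minimal sketch of configuring Accumulo locality groups with the Java client; the instance name, ZooKeeper address, credentials, table and family names are all placeholders:

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.hadoop.io.Text;

    public class LocalityGroupExample {
        public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("accumulo", "zk1.example.com:2181")
                    .getConnector("admin", new PasswordToken("secret"));

            // Put the "page" family in its own group, and "attr" + "link" together, so scans
            // that only need one group can skip the other on disk. The new grouping applies
            // to newly written files and takes full effect after compaction.
            Map<String, Set<Text>> groups = new HashMap<>();
            groups.put("content", Collections.singleton(new Text("page")));
            groups.put("metadata", new HashSet<>(Arrays.asList(new Text("attr"), new Text("link"))));
            conn.tableOperations().setLocalityGroups("webtable", groups);
        }
    }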

Recall that HDFS (modeled after Google's GFS) is a write-once file system (with appends supported since 0.20). It is very efficient for reading large portions of big files but is not designed for random access. So how does HBase provide random, real-time read/write access on top of HDFS (which is exactly the reason HBase was built)? Here we come to the concept of a Store. In HBase, a Store corresponds to a column family in a Region (see the next section for details). A Store hosts a MemStore and a set of zero or more StoreFiles. The MemStore holds in-memory modifications to the Store. When the MemStore reaches a certain size, or the total size of all MemStores reaches an upper limit (both are configurable), the sorted key-value pairs in the MemStore are flushed into an HDFS file called a StoreFile, written in the HFile format (based on the SSTable file in BigTable).

Because HDFS is write-once, a Store may have multiple StoreFiles that are created for each flush. In order to reduce the number of StoreFiles per Store, a process called compaction can be executed. There are two types of compactions: Minor and Major. Minor compactions pick up a couple of smaller adjacent StoreFiles and rewrite them as one. Minor compactions do not drop deletes and expired cells. In contrast, Major compactions pick up all the StoreFiles in the Store and generate a single StoreFile per Store that removes deletes and expired cells.

Sharding

Any serious distributed database needs a sharding strategy. HBase and Accumulo support auto-sharding, which means that tables are dynamically partitioned by rows and distributed by the system. The basic unit of sharding is called a Region in HBase (or a Tablet in Accumulo). A region is a contiguous, sorted range of rows of a table stored together on disk. Initially, there is only one region for a table. However, when a region becomes too large, it is split into two at the middle key (recall that rows are lexicographically sorted by row keys). Regions are served by RegionServers (TabletServers in Accumulo). Each RegionServer is responsible for a set of regions, but each region is served by exactly one RegionServer. Because of this design, it is easy for HBase/Accumulo to provide row-level consistency.
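
A common way to avoid the single-region bottleneck of a brand-new table is to pre-split it at creation time. The sketch below uses the 0.98-era HBase admin API; the table name, column family settings and split keys are made up for the example (the column family also shows how the version count and TTL retention mentioned in the Data Model section are declared):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTableExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Keep at most 3 versions per cell and expire cells after 7 days.
            HColumnDescriptor family = new HColumnDescriptor("d");
            family.setMaxVersions(3);
            family.setTimeToLive(7 * 24 * 3600);

            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("events"));
            table.addFamily(family);

            // Pre-split into four regions so writes are spread across RegionServers
            // from the start, instead of waiting for automatic splits.
            byte[][] splitKeys = {Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")};
            admin.createTable(table, splitKeys);
            admin.close();
        }
    }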

Here we have an assignment problem: given a region, which RegionServer should serve it? This coordination work (and other administrative operations) is done by the HMaster and recorded in ZooKeeper. Each region is assigned to a RegionServer on startup, but the Master may decide to move a region from one RegionServer to another for load balancing. The Master also handles RegionServer failures by reassigning the affected regions to other RegionServers.

In general, HBase is designed to run with a small number (20-200) of relatively large (5-20 GB) regions per RegionServer. A large number of regions per RegionServer causes a lot of memory overhead and possibly too many flushes and compactions. A large number of regions also takes the Master a long time to assign and move, because of the heavy usage of ZooKeeper. When there are too many regions, one can consolidate them with the Merge utility.

As we are talking about RegionServer and Master, let’s dig into the architecture of HBase.

Architecture

HBase Architecture (diagram)
So far, we have discussed all the moving parts in HBase shown in the above diagram. But how does a client get or put data from/into HBase? From the diagram, one might think that clients need to contact the HMaster to find out which RegionServer to talk to for a given row. Actually, they don't. The client-side HTable contains the logic for finding the server responsible for a particular region, and it communicates with RegionServers directly to write and retrieve key-value pairs. It does this by querying the system table .META., which keeps track of regions using (table name, region start key, region id) as keys and server information as values. The region location information is cached in the client so that subsequent requests need not go through the lookup process. In case of a region split, or a reassignment due to load balancing or RegionServer failure, the client receives an exception and then refreshes its cache by querying the updated information. But how does the client find the .META. table in the first place? The location of .META. is kept in ZooKeeper by the HMaster, and the client reads it from ZooKeeper directly.
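
In practice this means a client needs nothing more than the ZooKeeper quorum address; it never talks to the HMaster on the read/write path. A minimal sketch, with placeholder hostnames and the hypothetical "events" table from earlier:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ClientLookupExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // Only the ZooKeeper quorum is configured; region locations are resolved
            // via .META. and then cached inside the client.
            conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");

            HTable table = new HTable(conf, "events");
            Result row = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"))));
            table.close();
        }
    }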

Typically, an HBase deployment co-locates each RegionServer with an HDFS DataNode on the same physical node. When StoreFiles are written into HDFS, one copy is written locally and two are written to other nodes. As long as regions are not moved, there is good data locality. When a region is reassigned to a new RegionServer, that locality is lost, and the RegionServer must read the data over the network from remote DataNodes until the data is rewritten locally.

Fault Tolerance

One might think that the Master is a SPOF (single point of failure). Actually, we can set up multiple HMasters, although only one is active at a time. The HMasters use heartbeats to monitor each other; if the active Master shuts down or loses its lease in ZooKeeper, the remaining Masters compete to take over the Master role. Because clients talk directly to the RegionServers, the HBase cluster can still function in a steady state for a short period during Master failover. Note that Accumulo currently doesn't support multiple Masters, so its Master is a SPOF.

So what about RegionServers? It looks like we are safe, since there are multiple instances. However, recall that a region is managed by a single RegionServer at a time. If a RegionServer fails, the corresponding regions are unavailable until detection and recovery have completed. Each RegionServer is therefore effectively a SPOF for its own regions, even though there is no global failure in HBase.

To be resilient to node failures, all StoreFiles are written into HDFS, which replicates their blocks (3 times by default). Besides, HBase, just like any other durable database, uses a write-ahead log (WAL), which is also written into HDFS. To detect the silent death of RegionServers, HBase uses ZooKeeper: each RegionServer is connected to ZooKeeper and the Master watches these connections, while ZooKeeper itself employs heartbeats. On a timeout, the Master declares the RegionServer dead and starts the recovery process. During recovery, the regions are reassigned to other RegionServers, and each of them replays the WAL to recover the correct region state. This is a complicated process, and the mean time to recovery (MTTR) of HBase is often around 10 minutes when a DataNode crashes with default settings, although it can be reduced to less than 2 minutes with careful tuning.

Replication

The replication feature in HBase copies data between HBase deployments, which usually sit in geographically distant data centers, and thus provides a means of disaster recovery. HBase replication is master-push, just like MySQL's master/slave replication. Replication is done asynchronously, and each RegionServer replicates its own stream of WAL edits. Although similar work is in progress for Accumulo, it is not available yet.

Cell-Level Security

Cell-level security was long a unique feature of Accumulo because of its NSA roots. When mutations (the equivalent of puts in HBase) are applied, users can specify a security label for each cell by passing a ColumnVisibility object. Security labels consist of a set of user-defined tokens that are required to read the associated cell, and the label expression syntax supports boolean logic. When a client attempts to read data, any security labels present are checked against the set of authorizations passed to the Scanner. If the authorizations are insufficient to satisfy the security label, the cell is suppressed from the results. Each user has a set of associated security labels, which can be manipulated in the shell.
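
The sketch below shows the shape of this API with the Accumulo Java client; the connection details, table name, label expression and authorizations are placeholders, and the reading user must already have been granted matching authorizations (e.g. via the shell):

    import java.util.Map;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.accumulo.core.security.ColumnVisibility;
    import org.apache.hadoop.io.Text;

    public class CellVisibilityExample {
        public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("accumulo", "zk1.example.com:2181")
                    .getConnector("alice", new PasswordToken("secret"));

            // Write a cell readable only by users holding "analyst" AND ("uk" OR "us").
            BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());
            Mutation m = new Mutation("row1");
            m.put(new Text("cf"), new Text("status"),
                  new ColumnVisibility("analyst&(uk|us)"), new Value("ok".getBytes()));
            writer.addMutation(m);
            writer.close();

            // Scan with a set of authorizations; cells whose labels are not satisfied are silently omitted.
            Scanner scanner = conn.createScanner("records", new Authorizations("analyst", "us"));
            for (Map.Entry<Key, Value> entry : scanner) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }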

HBase 0.98 now provides a cell-level security feature as well, which requires the HFile v3 format.

HBase Coprocessor And Accumulo Iterator

Both HBase and Accumulo provide modular extension mechanisms (called coprocessors and iterators, respectively) for adding custom functionality. These allow users to efficiently summarize, filter, and aggregate data directly on the RegionServers/TabletServers. Compared to MapReduce, this gives a dramatic performance improvement by removing communication overhead. In fact, cell-level security and column fetching are implemented using iterators in Accumulo.

Summary

As BigTable clones, HBase and Accumulo provide a wide-column data model and random, real-time CRUD operations on top of HDFS. They can scale out horizontally to efficiently serve billions of rows and millions of columns thanks to auto-sharding. Because each region is served by only one RegionServer at a time, they also support strong consistency for reads and writes. Automatic failover of RegionServers/TabletServers is supported, although effort is needed to reduce the MTTR. With replication across multiple data centers, HBase adds further support for disaster recovery. In the next post, I will cover Riak, which follows a very different system design.

ALSO IN THIS SERIES:


Distributed NoSQL: MongoDB

In this installment of his Understanding NoSQL series, guest contributor Haifeng Li delves deeper into MongoDB, the fifth most popular database in the world. Examining the data model, storage and cluster architecture, Li aims to give us an in-depth understanding of MongoDB’s database technology.


Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.


(Featured Image: Data Science Labs)

]]>
https://dataconomy.ru/2014/08/18/distributed-nosql-hbase-and-accumulo/feed/ 2
Choosing a NoSQL https://dataconomy.ru/2014/08/12/choosing-nosql/ https://dataconomy.ru/2014/08/12/choosing-nosql/#comments Tue, 12 Aug 2014 12:48:24 +0000 https://dataconomy.ru/?p=8229 Many people ask my opinion on different NoSQL databases, and they also want to know the benchmark numbers. I guess that the readers of this post probably have similar questions. When you start building your next cool cloud application, there are dozens of NoSQL options to choose. It is natural to ask which one is […]]]>

Many people ask my opinion on different NoSQL databases, and they also want to know the benchmark numbers. I guess that the readers of this post probably have similar questions. When you start building your next cool cloud application, there are dozens of NoSQL options to choose from. It is natural to ask which one is the fastest. AP or CP? What's the sharding and fault tolerance strategy? And so on. But you won't find any comparison chart or benchmark numbers here. You may be yelling now: "We benchmarked Oracle, DB2, SQL Server, MySQL and PostgreSQL in the old days. And it was very meaningful and helpful!" Please let me explain.

Yes, it makes sense to benchmark relational databases, because SQL-based relational database products are largely indistinguishable: you are doing an apples-to-apples comparison, and the benchmarks do help us understand which implementation is suitable for which use case. But NoSQL solutions are different animals. They offer different data models (key-value, wide-columnar, objects/documents, graph, etc.), CP or AP designs, synchronous or asynchronous replication, in-memory or durable storage, strong or eventual consistency, and so on. When comparing them, we are comparing apples to oranges, and the various benchmarks with contradicting results may just confuse us further.

"But I have to make a choice!", you cry. Well, a quick (and cheating) way is to look at each NoSQL product's creators and their applications. All NoSQL products are great in the sense that they do what they're supposed to do for their creators. In the old days, we built applications against a database; today, it seems that people build databases against applications, because every company has such different requirements. That is why Google built BigTable, Amazon.com built Dynamo, and Facebook built Cassandra, and the list goes on. For example, 10gen originally built MongoDB for their own web platform. In other words, they built these NoSQL stores for themselves. If your business aligns with one of theirs, congratulations! Just go with the corresponding open source offering.

But what if your idea is truly innovative, and you are doing something wild that no existing solution seems to fit? In an environment of rapid technological advancement and ever-changing user requirements, it is not realistic to choose the "best" solution. It is better to think first about minimizing business risk, rather than about technical comparisons. When we embrace cutting-edge technologies, NoSQL included, we have to be careful that the cutting edge doesn't turn into the bleeding edge. It is always good to ask whether we have a plan B with minimal migration cost.

If we look back, history may teach us something helpful. Before relational databases, the world of DBMSs was somewhat similar to today: there were many different data models, systems, and interfaces. Why did relational databases replace those dinosaurs? There are many reasons. Let's look at it from a programmer's perspective. With relational databases, I, a software engineer, hardly care what the back end looks like. Whether it's MySQL, Oracle, or Teradata, all I face is the ubiquitous relational data model and all I use is SQL. Yes, there are always some small differences in data types and SQL syntax among them, but it doesn't take 10 years to migrate from one to another.

Based on this observation, we should probably first choose a data model that is flexible and expressive. More importantly, this data model should be supported by multiple major solutions from both the CP and AP schools. With this in mind, I am thinking of BigTable's wide-columnar data model. As we know, key-value pairs are the simplest yet most flexible data model. With the logical concept of columns/column families, the wide-columnar data model also lets us encapsulate document and graph models. Crucially, this data model is supported by both HBase (CP) and Cassandra (AP). Both HBase and Cassandra have very large communities and are used in large-scale, real-life systems. HBase provides strong consistency, tight integration with MapReduce, and in-database computation through coprocessors. Cassandra provides a simple, symmetric architecture and excellent multi-datacenter support. With some abstraction, we can easily switch from one to the other.
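
To make the "abstraction" idea concrete, here is a purely hypothetical sketch of the kind of narrow interface an application could code against, with one adapter per store; nothing below comes from an existing library:

    // Hypothetical minimal facade over a wide-column store. An HBase adapter would translate
    // these calls into Put/Get/Scan, while a Cassandra adapter would map them onto its own
    // client API; application code depends only on this interface.
    public interface WideColumnStore {

        void put(String table, byte[] row, String family, byte[] qualifier, byte[] value);

        byte[] get(String table, byte[] row, String family, byte[] qualifier);

        // Range scan over lexicographically sorted row keys, covering [startRow, stopRow).
        Iterable<byte[]> scanRowKeys(String table, byte[] startRow, byte[] stopRow);
    }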

This is my two cents. What’s your opinion? Please feel free to leave your comment below.

ALSO IN THIS SERIES:


Distributed NoSQL: MongoDB

In this installment of his Understanding NoSQL series, guest contributor Haifeng Li delves deeper into MongoDB, the fifth most popular database in the world. Examining the data model, storage and cluster architecture, Li aims to give us an in-depth understanding of MongoDB’s database technology.


Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.


]]>
https://dataconomy.ru/2014/08/12/choosing-nosql/feed/ 3
Connecting the Dots Between Hadoop, SQL and NoSQL: Oracle Big Data SQL https://dataconomy.ru/2014/07/16/oracle-connects-dots-hadoop-sql-nosql-oracle-big-data-sql/ https://dataconomy.ru/2014/07/16/oracle-connects-dots-hadoop-sql-nosql-oracle-big-data-sql/#respond Wed, 16 Jul 2014 08:38:04 +0000 https://dataconomy.ru/?p=7018 Oracle has unveiled their Oracle Big Data SQL, a tool which runs single SQL queries across Oracle’s own database, Hadoop and NoSQL. The software is a feature in Oracle’s Big Data Appliance, which includes Cloudera’s enterprise Hadoop product. In discussion with ZDNet, vice president of big data and advanced analytics at Oracle Neil Mendelson identified […]]]>

Oracle has unveiled their Oracle Big Data SQL, a tool which runs single SQL queries across Oracle’s own database, Hadoop and NoSQL. The software is a feature in Oracle’s Big Data Appliance, which includes Cloudera’s enterprise Hadoop product.

In discussion with ZDNet, vice president of big data and advanced analytics at Oracle Neil Mendelson identified three main problems enterprises face when managing big data: managing integration, finding the right people with new skill sets vs. training existing personnel, and security. He hopes that Oracle Big Data SQL will address these problems.

Connecting the dots between different big data solutions minimises data movement, which Oracle hope will lead to increased speed as well as ease of use. They promise the queries can be run over an array of structured and unstructured data. As well as querying over Hadoop and NoSQL, Oracle's enhanced security features will cover data used by these solutions.

In order to gain access to Oracle Big Data SQL, enterprises must be running Oracle Database 12c. Production is slated to start in Autumn, and cost will be announced when the product is generally available.

If you're running Oracle, you may also be interested to know they released 113 crucial patches across their product line last night – find out more about that here.

(Image credit: Flickr)

]]>
https://dataconomy.ru/2014/07/16/oracle-connects-dots-hadoop-sql-nosql-oracle-big-data-sql/feed/ 0
Aerospike Gain $20 Million Funding, Plan to Expand into New Markets https://dataconomy.ru/2014/06/25/aerospike-gain-20-million-funding-plan-expand-new-markets/ https://dataconomy.ru/2014/06/25/aerospike-gain-20-million-funding-plan-expand-new-markets/#comments Wed, 25 Jun 2014 08:30:43 +0000 https://dataconomy.ru/?p=6096 Aerospike have been making serious waves as a NoSQL solution for the advertising industry, due to their ability to process millions of transactions every second. It comes as little surprise, then, that they’ve managed to secure a $20 million series C round of venture capital. The round was led by NEA, with Columbus Nova Technology […]]]>

Aerospike have been making serious waves as a NoSQL solution for the advertising industry, due to their ability to process millions of transactions every second. It comes as little surprise, then, that they’ve managed to secure a $20 million series C round of venture capital. The round was led by NEA, with Columbus Nova Technology Partners, Alsop Louie Partners and Regis McKenna also contributing. Their battle plan with this new investment is two-fold; first, open source their technology. Second, use this new exposure to move into new markets where speed is vital, such as finance, retail and travel.

Aerospike have, up until this point, focused their efforts primarily on the advertising industry, but “there’s a moment when the time is right to go for broad adoption”, as CTO and Founder Brian Bulkowski noted. Now that they’ve built a robust platform, Aerospike is ripe for expansion.

They hope that open sourcing their product will build both trust and exposure. They're now following an "open core" business model similar to NoSQL competitor MongoDB, where the core technology is open-sourced but premium enterprise features are maintained for paying clientele.

They are, however, stepping into a crowded marketplace. There is a vast proliferation of NoSQL solutions out there offering the same focus on performance and low latency, which have been around for a while and already have established client bases. But Bulkowski is confident his team's technology will speak for itself. Moving forward, Aerospike are hoping to introduce connectors for other platforms such as Hadoop, and improvements to SQL query language support. Only time will tell if Aerospike manage to establish themselves outside the realm of advertising and marketing.

Read more here.
(Image credit: Aerospike)




]]>
https://dataconomy.ru/2014/06/25/aerospike-gain-20-million-funding-plan-expand-new-markets/feed/ 1
Understanding Big Data: Infrastructure https://dataconomy.ru/2014/06/19/understanding-big-data-infrastructure/ https://dataconomy.ru/2014/06/19/understanding-big-data-infrastructure/#comments Thu, 19 Jun 2014 10:54:19 +0000 https://dataconomy.ru/?p=5487 Infrastructure is the cornerstone of Big Data architecture. Possessing the right tools for storing, processing and analysing your data is crucial in any Big Data project. In last installment of “Understanding Big Data”, we provided a general overview of all of the technologies in the Big Data landscape. In this edition, we’ll be closely examining […]]]>

Infrastructure is the cornerstone of Big Data architecture. Possessing the right tools for storing, processing and analysing your data is crucial in any Big Data project. In the last installment of "Understanding Big Data", we provided a general overview of all of the technologies in the Big Data landscape. In this edition, we'll be closely examining infrastructural approaches – what they are, how they work, and what each approach is best used for.

Hadoop

To recap, Hadoop is essentially an open-source framework for processing, storing and analysing data. The fundamental principle behind Hadoop is that rather than tackling one monolithic block of data all in one go, it's more efficient to break the data up and distribute it into many parts, allowing different parts to be processed and analysed concurrently.

When hearing Hadoop discussed, it’s easy to think of Hadoop as one vast entity; this is a myth. In reality, Hadoop is a whole ecosystem of different products, largely presided over by the Apache Software Foundation. Some key components include:

  • HDFS – The default storage layer
  • MapReduce – Executes a wide range of analytic functions by analysing datasets in parallel before 'reducing' the results. The "Map" job distributes a query to different nodes, and the "Reduce" job gathers the results and resolves them into a single value (a minimal sketch follows this list).
  • YARN – Responsible for cluster management and scheduling user applications
  • Spark – Used on top of HDFS, and promises speeds up to 100 times faster than the two-step MapReduce function in certain applications. Allows data to be loaded in memory and queried repeatedly, making it particularly apt for machine learning algorithms
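
For readers who want to see what the Map and Reduce steps actually look like in code, here is the canonical word-count job in Java (input and output paths are supplied on the command line; the class and job names are arbitrary): the mapper emits (word, 1) pairs for its slice of the input, and the reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // "Map": each node tokenizes its portion of the input and emits (word, 1) pairs.
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // "Reduce": counts for the same word are gathered and resolved into a single total.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : values) {
                    sum += count.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory, must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }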

More information about Apache Hadoop add-on components can be found here.

The main advantages of Hadoop are its cost- and time-effectiveness. Cost, because as it’s open source, it’s free and available for anyone to use, and can run off cheap commodity hardware. Time, because it processes multiple ‘parts’ of the data set concurrently, making it a comparatively fast tool for retrospective, in-depth analysis. However, open source has its drawbacks. The Apache Software Foundation are constantly updating and developing the Hadoop ecosystem; but if you hit a snag with open-source technology, there’s no one go-to source for troubleshooting.

This is where Hadoop-on-Premium packages enter the picture. Hadoop-on-Premium services such as Cloudera, Hortonworks and Splice offer the Hadoop framework with greater security and support, with added system & data management tools and enterprise capabilities.

NoSQL

NoSQL, which stands for Not Only SQL, is a term used to cover a range of different database technologies. As mentioned in the previous article, unlike their relational predecessors, NoSQL databases are adept at processing dynamic, semi-structured data with low latency, making them better tailored to a Big Data environment.

The different strengths and uses of Hadoop and NoSQL are often described as “operational” and “analytical”. NoSQL is better suited for “operational” tasks; interactive workloads based on selective criteria where data can be processed in near real-time. Hadoop is better suited to high-throughput, in-depth analysis in retrospect, where the majority or all of the data is harnessed. Since they serve different purposes, Hadoop and NoSQL products are sometimes marketed concurrently. Some NoSQL databases, such as HBase, were primarily designed to work on top of Hadoop.

Some big names in the NoSQL field include Apache Cassandra, MongoDB, and Oracle NoSQL. Many of the most widely used NoSQL technologies are open source, meaning security and troubleshooting may be an issue. NoSQL also typically places less focus on atomicity and consistency than on performance and scalability. Premium packages of NoSQL databases (such as DataStax for Cassandra) work to address these issues.

Massively Parallel Processing (MPP)

As the name might suggest, MPP technologies process massive amounts of data in parallel. Hundreds (or potentially even thousands) of processors, each with their own operating system and memory, work on different parts of the same programme.

As mentioned in the previous article, MPP usually runs on expensive data warehouse appliances, whereas Hadoop is most often run on cheap commodity hardware (allowing for inexpensive horizontal scale out). MPP uses SQL, and Hadoop uses Java as default (although the Apache Foundation developed Hive, a language used in Hadoop similar to SQL, to make using Hadoop slightly easier and less specialist). As with all technologies in this article, MPP has crossovers with the other technologies; Teradata, an MPP technology, has an ongoing partnership with Hortonworks (a Hadoop-on-Premium service).

Many of the major players in the MPP market have been acquired by technology vendor behemoths; Netezza, for instance, is owned by IBM, Vertica is owned by HP and Greenplum is owned by EMC.

Cloud

Cloud computing refers to a broad set of products that are sold as a service and delivered over a network. With other infrastructural approaches, when setting up your big data architecture you need to buy hardware and software for each person involved in processing and analysing your data. In cloud computing, your analysts only require access to one application – a web-based service where all of the necessary resources and programmes are hosted. Up-front costs are minimal, as you typically only pay for what you use and scale out from there; Amazon Redshift, for instance, allows you to get started for as little as 25 cents an hour. As well as cost, cloud computing also has an advantage in terms of delivering faster insights.

Of course, having your data hosted by a third party can raise questions about security; many choose to host their confidential information in-house, and use the cloud for less private data.

A lot of big names in IT offer cloud computing solutions. Google has a whole host of cloud computing products, including BigQuery, specifically designed for the processing and management of Big Data; Amazon Web Services also has a wide range, including EMR for Hadoop, RDS for MySQL and DynamoDB for NoSQL. There are also vendors such as Infochimps and Mortar specifically dedicated to offering cloud computing solutions.

As you can see, these different technologies are by no means direct competitors; each has its own particular uses and capabilities, and complex architectures will make use of combinations of all of these approaches, and more. In the next “Understanding Big Data”, we will be moving beyond processing data and into the realm of advanced analytics; programmes specifically designed to help you harness your data and glean insights from it.

(Image credit: NATS Press Office)



Eileen McNulty-Holmes – Editor


Eileen has five years’ experience in journalism and editing for a range of online publications. She has a degree in English Literature from the University of Exeter, and is particularly interested in big data’s application in humanities. She is a native of Shropshire, United Kingdom.

Email: eileen@dataconomy.ru



 

]]>
https://dataconomy.ru/2014/06/19/understanding-big-data-infrastructure/feed/ 3
Cloudera and MongoDB Announce Product Partnership https://dataconomy.ru/2014/04/30/cloudera-mongodb-announce-product-partnership/ https://dataconomy.ru/2014/04/30/cloudera-mongodb-announce-product-partnership/#comments Wed, 30 Apr 2014 14:58:19 +0000 https://dataconomy.ru/?post_type=news&p=2345 Hadoop store software company Cloudera and leading NoSQL database MongoDB have announced a strategic all-round product partnership to better serve their converging clienteles. MongoDB’s Product Marketing Director Kelly Stirman said in a statement: “We have an enormous overlap in our customer bases”. The first fruit of the partnership is the certification of the MongoDB Connector […]]]>

Hadoop software company Cloudera and leading NoSQL database MongoDB have announced a strategic all-round product partnership to better serve their converging clienteles.

MongoDB’s Product Marketing Director Kelly Stirman said in a statement: “We have an enormous overlap in our customer bases”.

The first fruit of the partnership is the certification of the MongoDB Connector for Hadoop to run on the latest version of Cloudera’s Hadoop distribution. However, Stirman added that users could expect some meaningful co-developed technology improvements and customer support deals, and he underlined the importance of the new engineering opportunities to arise out of the partnership. Operations, security and data management were listed as potential areas of focus.

This partnership is only the most recent development in the ongoing battle between Cloudera and Hortonworks over the Hadoop market, and as the former strengthens its ties to IBM, Intel and MongoDB, and the latter to Red Hat, Microsoft, and Teradata, the focus is on the “ecosystems shaping up around each”, and not only their technological differences.

Read more on the story here

(Image credit: John Nunemaker)

]]>
https://dataconomy.ru/2014/04/30/cloudera-mongodb-announce-product-partnership/feed/ 1
Big Data Attracts Biggest Salaries https://dataconomy.ru/2014/04/23/big-data-attracts-biggest-salaries-4/ https://dataconomy.ru/2014/04/23/big-data-attracts-biggest-salaries-4/#respond Wed, 23 Apr 2014 10:52:41 +0000 https://dataconomy.ru/?post_type=news&p=2153 Professionals working in the tech sector – and especially those working with big data – have seen their salaries increase once again, this year by almost three percent, to $87,811 from $86,619 the year before.  According to employment researcher Dice Tech, 45% of individuals surveyed reported receiving a merit pay increase.  Across the entire tech […]]]>

Professionals working in the tech sector – and especially those working with big data – have seen their salaries increase once again, this year by almost three percent, to $87,811 from $86,619 the year before.  According to employment researcher Dice Tech, 45% of individuals surveyed reported receiving a merit pay increase.  Across the entire tech sector, bonuses and salary hikes are increasingly being used to hold on to valued employees with one to five years’ experience.

According to the survey, “employers are using selective and strategic increases in compensation to hold onto experienced tech talent….While the overall average salary increase was smaller than the previous year’s historic jump of more than five percent, employers offered more frequent merit increases.”  34% of those asked reported receiving a bonus, with bonuses averaging north of $9000.  

While salaries continue to rise in one of the most well paid industries, a growing number of survey respondents felt dissatisfied with their level of compensation, down three percentage points to 54%.  This may well be linked to the large amount of work demanded with continuously tighter budgets set aside for IT.  Of those surveyed, a vast majority felt they could find another job in the sector where unemployment remains far lower than the national average.

According to the survey, of the skills found most valuable, big data topped the list with nine of the top ten salaries connected to big data technologies.  The largest salaries went to those working with big data related programming languages as well as those with database experience.  “Companies are betting big that harnessing data can play a major role in their competitive plans and that is leading to high pay for critical skills,” said Dice President Shravan Goli.  “Technology professionals should be volunteering for big data projects, which makes them more valuable to their current employer and more marketable to other employers.”

 

Read more here

(Image Credit:  Christian Rondeau)

]]>
https://dataconomy.ru/2014/04/23/big-data-attracts-biggest-salaries-4/feed/ 0
FoundationDB – NoSQL Database with ACID Transactions https://dataconomy.ru/2014/04/14/foundationdb-nosql-database-with-acid-transactions/ https://dataconomy.ru/2014/04/14/foundationdb-nosql-database-with-acid-transactions/#respond Mon, 14 Apr 2014 14:10:25 +0000 https://dataconomy.ru/?p=1869 We spoke with Dave Rosenthal, CEO of FoundationDB, over a conference call last week. Dave  gave us an overview of FoundationDB and its business activities. FoundationDB is a NoSQL database with a shared nothing architecture.The product is designed around a “core” database, with additional features supplied in “layers.” The core database exposes an ordered key-value store with transactions. The transactions are able to read […]]]>

We spoke with Dave Rosenthal, CEO of FoundationDB, over a conference call last week. Dave gave us an overview of FoundationDB and its business activities. FoundationDB is a NoSQL database with a shared-nothing architecture. The product is designed around a "core" database, with additional features supplied in "layers." The core database exposes an ordered key-value store with transactions. The transactions are able to read or write multiple keys stored on any machine in the cluster while fully supporting ACID properties. Transactions are used to implement a variety of data models via layers.
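
As a rough illustration of what an ordered key-value store with ACID transactions looks like from application code, here is a minimal sketch using FoundationDB's current open-source Java bindings (package names and API version differ from the 2014-era client, and the keys and values are made up):

    import com.apple.foundationdb.Database;
    import com.apple.foundationdb.FDB;
    import com.apple.foundationdb.tuple.Tuple;

    public class FdbSketch {
        public static void main(String[] args) {
            FDB fdb = FDB.selectAPIVersion(630);
            try (Database db = fdb.open()) {
                // One ACID transaction touching two keys; the keys may live on different
                // machines in the cluster, and the block commits (or retries) as a unit.
                db.run(tr -> {
                    tr.set(Tuple.from("account", "alice").pack(), Tuple.from(90L).pack());
                    tr.set(Tuple.from("account", "bob").pack(), Tuple.from(110L).pack());
                    return null;
                });

                // Reads also run inside transactions and see a consistent snapshot.
                long bob = db.run(tr ->
                        Tuple.fromBytes(tr.get(Tuple.from("account", "bob").pack()).join()).getLong(0));
                System.out.println("bob = " + bob);
            }
        }
    }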

Who are you and what do you do?

We're FoundationDB and we're a database software startup tackling fundamental computing problems in the big data scene. We're not a data science company per se – we've built a database itself – the core storage substrate that can be used in an operational capacity to generate the data that feeds big interactive data analysis platforms and applications. We are especially concerned with the types of applications that require a lot of random writes and reads, and random database access, versus just doing batch processing and scanning through data.

We think of ourselves as an operational database, rather than an analytical database, though due to its design FoundationDB is good at both.

How did you come up with this idea?     

We previously started a company called Visual Sciences whose primary product was a high performance data analytics tool. It got its first major adoption in the field of web analytics, and it had a very rich graphical data analysis interface. What ended up happening was that Visual Sciences went through a series of acquisitions and was ultimately acquired by Adobe. Essentially, the product that my co-founder and I (along with the rest of the team) built for Visual Sciences is called “Adobe Insight.” It’s basically a high performance data analysis platform for ad-hoc queries on big data sets, and is used by some of the largest companies in the world. So, we were coming off of that start-up that we had worked at for almost 10 years, and then started looking into building a new company. One of the things we asked ourselves was what database should we use for our new products?

At this time, 2009 or so, NoSQL was starting to become a buzzword and we began looking into a lot of the NoSQL databases out there. We became pretty excited by their distributed designs and the fault tolerance of many of these systems. Unfortunately, when we really started to use them, we realized that a lot of them made sacrifices that we didn't want to make. The problem that we saw was that all of the NoSQL databases had given up on the transactions and ACID properties that traditional databases had used for decades. Having observed this problem, we wanted to build a NoSQL database that had those capabilities.

One thing led to another and we really thought that a good next startup would be to tackle this big problem in the NoSQL database market.

What is your business model?                                        

 It’s pretty simple. We want to build the best NoSQL database in the world and get everybody out there using it. Our license model is on a “per process basis”, which means that when you run a FoundationDB database process on a server, depending on how many of them you are using in your cluster each month, the price changes. The other thing we are excited about is that we allow people to use up to 6 server processes even in a production environment for free, and unlimited usage in non-production environments.

What makes FoundationDB unique?

What makes us unique is really the fact that there is no other NoSQL database out on the market that has these transactional guarantees that we offer. There are a couple of companies that have some transactional capabilities – MarkLogic, VoltDB for example – but we’re really the only NoSQL database on the market that is designed to be used all the time using transactions across the entire database.

The technical differentiator is that by supporting ACID transactions, it allows us to deliver a variety of data models and languages on top of a common engine. For example, we will be launching a SQL engine this year that allows you to have a true ANSI SQL database on top of our NoSQL database. That’s something that nobody in the industry can do because they don’t have that transactional capability. That is what is special about our product.

Are there any case studies you can talk about?

The one that comes to mind is a company called Locality. Locality is an online marketplace that tries to collect all of the local service businesses in one place, where users can search based on a number of criteria. If, for example, you wanted to find a full list of your local dry cleaners or hairdressers, you would go onto Locality and it would give you comprehensive pricing as well as customer satisfaction information.

At the time we started talking with them, Locality’s lead back-end engineers were looking for new database tools to reach their performance and reliability goals. The engineers at Locality started searching for a NoSQL database that would meet all the database requirements they had — flexibility, scalability, fault tolerance, etc — but they thought they had to compromise on other key features like transactional capabilities. There was nothing out there that met all of their requirements. It was then that they found us and started playing around with what we had built.

Today Locality’s entire product is running on FoundationDB! They came to us because we were the only company that was able to match all the things that were needed in NoSQL, but also offered additional features that didn’t force them to compromise.

Are you planning on expanding into Europe or Germany in particular?

We actually have a few companies that we’re working with. There’s a large, Swedish dating website company, a financial services company that’s based in London and a software vendor based in France. We have plans for Germany and will be rolling out there soon. We’re attending the NoSQL Matters conference in Cologne in late April, also.


FoundationDB is a NoSQL database with a shared-nothing architecture. The product is designed around a "core" database, with additional features supplied in "layers." The core database exposes an ordered key-value store with transactions. The transactions are able to read or write multiple keys stored on any machine in the cluster while fully supporting ACID properties. Transactions are used to implement a variety of data models via layers.


]]>
https://dataconomy.ru/2014/04/14/foundationdb-nosql-database-with-acid-transactions/feed/ 0
Sqrrl Collaborates with Macmillan Education https://dataconomy.ru/2014/04/10/sqrrl-collaborates-macmillan-education-4/ https://dataconomy.ru/2014/04/10/sqrrl-collaborates-macmillan-education-4/#respond Thu, 10 Apr 2014 09:54:20 +0000 https://dataconomy.ru/?post_type=news&p=1806 Sqrrl, the company that develops secure NoSQL database software for Big Data applications, announced that it would be collaborating with Macmillan Education Australia to help them power a next generation education portal. Sqrrl Enterprise, the company’s NoSQL database, will allow Macmillan to secure and protect huge amounts of data that can only be accessed in […]]]>

Sqrrl, the company that develops secure NoSQL database software for Big Data applications, announced that it would be collaborating with Macmillan Education Australia to help them power a next generation education portal. Sqrrl Enterprise, the company’s NoSQL database, will allow Macmillan to secure and protect huge amounts of data that can only be accessed in authorized ways.

“We are very excited to be working with Macmillan in Australia,” says Sqrrl CEO Mark Terenzoni. “Big Data apps for education have strong security requirements, and Sqrrl brings unique Data-Centric Security capabilities to help make these apps possible.”

Sue McNab, the platform technical lead at Macmillan Education Australia, also commented on the partnership.

“For Macmillan Connect, we require a NoSQL database that can support interactive queries, process very large amounts of multi-structured data, and provide fine-grained access controls and encryption…we looked at a number of options, and only Sqrrl could provide these combined capabilities.”

 

]]>
https://dataconomy.ru/2014/04/10/sqrrl-collaborates-macmillan-education-4/feed/ 0
SQL vs NoSQL: Weighing the advantages https://dataconomy.ru/2014/03/14/sql-vs-nosql/ https://dataconomy.ru/2014/03/14/sql-vs-nosql/#comments Fri, 14 Mar 2014 15:04:02 +0000 https://dataconomy.ru/?p=1156 SQL vs NoSQL Since both languages have equal number of proponents when deciding which language to use, Network World invited two influencers to share their views on SQL vs NoSQL. Ryan Betts, CTO of VoltDB, is a stark proponent of SQL and will therefore take the structured side. Bob Wiederhold, CEO of CouchDB, on the […]]]>

SQL vs NoSQL

Since both approaches have an equal number of proponents when it comes to deciding which to use, Network World invited two influencers to share their views on SQL vs NoSQL. Ryan Betts, CTO of VoltDB, is a staunch proponent of SQL and will therefore take the structured side. Bob Wiederhold, CEO of Couchbase, on the other hand, is of the strong opinion that, when evaluating SQL vs NoSQL in light of Big Data, NoSQL is clearly the better choice. We will try to sum up the points on both sides.

SQL

SQL is an incredibly well-established and long-running technology that is now deployed with the likes of Google, Facebook and Cloudera, all companies with a lot of clout in the Big Data sphere. In addition to simply pointing out that SQL is the tried and tested language, Betts provides four reasons for why it is a more appropriate choice for dealing with Big Data.

SQL opens up the insight potential of big data to a much wider community of people, who will not necessarily have a software development background. Since the user types just the commands and leaves the decision of how to most efficiently perform the query up to the database engine, analysts, managers and other employees can run large-scale queries without understanding the underlying computational processes.

One of the main reasons against SQL was its lack of scalability but Betts strongly disagrees with this, noting that companies such as Facebook would not be using SQL if it was not able to handle their petabytes of data.

NoSQL

NoSQL is an umbrella term for databases built around non-relational data models and query interfaces. A NoSQL database typically uses a distributed file system* and is able to handle data coming in non-standard shapes and sizes. NoSQL allows multiple users to access the information at the same time, which in turn means that the size of the dataset being worked with can be immense without causing any issues.

Scalability without NoSQL might be possible, but it is unnecessarily costly since ever more expensive hardware is required. NoSQL on the other hand can be run across a large number of cheap nodes that, when combined, offer the same power at a much lower cost. Adding further space to the network is therefore easily done and light on the wallet.

NoSQL does not try to squeeze information into rows and columns that, in turn, are identified by further rows and columns, all of which need to be accessed and collated during each read/write operation. This does not make a crucial difference when working with smaller datasets, but as they grow, these operations take up more and more time. Its distributed nature makes NoSQL much faster. It may duplicate data in the process, but since storage is comparatively cheap, the extra storage cost is negligible compared with the speed gained.

Wiederhold's clearest and most convincing argument is that most of the data collected today is unstructured, and as a result only a NoSQL database (like Couchbase) is able to deal with it.

Join in on the debate of SQL vs NoSQL here.

*see here for a definition of what a distributed file system is and here for how one would work

Image Credits: owenjell / Flickr

]]>
https://dataconomy.ru/2014/03/14/sql-vs-nosql/feed/ 2