10 Essential Database Ingredients
https://dataconomy.ru/2015/09/14/10-essential-database-ingredients/
Mon, 14 Sep 2015

Whenever you ask a successful company why they wrote their own large distributed system, or put a lot of work into gluing multiple systems together, the reason almost always boils down to something like, “Well, XYZ did nearly everything we needed, except…”. There are a couple of stellar write-ups from experienced systems builders about what they want to see in the future. Having examined many real-world systems, I think there are 10 essential features any database will need in order to keep being considered a serious contender.

Massive scalability: If the success of companies like Google, Amazon, and Facebook has taught us anything, it’s the power of big piles of itty-bitty boxes working in concert to complete a task. This means the ability to function on hundreds or thousands of machines, automatic failover handling, and robust replication to multiple datacenters.

Massive parallelism: A distributed system does you no good if you can only do a few things at a time. This is an active area of innovation, but the trend is clearly toward lock-free data structures and multi-version concurrency control (MVCC). The goal is to be able to use all of the CPUs at once to service thousands or millions of transactions per second without having one part of the system waiting on a global lock.
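
To make the MVCC idea concrete, here is a toy Python sketch of a multi-version register (purely illustrative, not any particular database’s implementation) in which a reader’s snapshot stays consistent even while a writer installs a new version:

```python
import threading

class MVCCRegister:
    """Toy multi-version register: writers append versions, and readers
    pick the newest version at or below their snapshot timestamp, so
    reads never block on writes."""

    def __init__(self, initial):
        self._versions = [(0, initial)]   # append-only list of (commit_ts, value)
        self._ts = 0
        self._lock = threading.Lock()     # serializes writers only

    def begin_read(self):
        return self._ts                   # snapshot timestamp for this reader

    def read(self, snapshot_ts):
        # Newest version visible to this snapshot; no lock needed because
        # the version list is append-only.
        for ts, value in reversed(self._versions):
            if ts <= snapshot_ts:
                return value

    def write(self, value):
        with self._lock:                  # writers serialize briefly
            self._ts += 1
            self._versions.append((self._ts, value))

reg = MVCCRegister("v1")
snap = reg.begin_read()
reg.write("v2")                            # a concurrent writer commits
assert reg.read(snap) == "v1"              # old snapshot still sees "v1"
assert reg.read(reg.begin_read()) == "v2"  # a fresh snapshot sees the write
```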

Flexibility: It’s critical to be able to deploy databases on commodity hardware with as few special requirements as possible: direct-attached storage instead of centralized (and oversubscribed) SANs, standard Linux as the operating system, and standard network hardware. The “run anywhere” idea can sometimes be taken too far, though. With easy virtualization and containerization, it’s not clear that you even need a Mac OS port, let alone a Windows port. The standard chipset these days is 64-bit x86; ARM is still over the horizon as far as mainstream datacenters are concerned.

Extensibility: User Defined Functions and scripting in general. The imagination of your userbase is always going to be greater than yours. Giving users a way to extend the database in ways you can’t imagine is a powerful tool.
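
As a small illustration of how low the barrier can be, SQLite’s Python bindings (standing in here for any extensible database) let you register a user-defined function in one line and then call it from plain SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT)")
conn.execute("INSERT INTO events VALUES ('hello'), ('world')")

# Register a Python callable as a SQL UDF: name, argument count, function.
conn.create_function("reverse_text", 1, lambda s: s[::-1])

# The UDF is now available to any query, including ones the database
# authors never anticipated.
for (reversed_payload,) in conn.execute(
        "SELECT reverse_text(payload) FROM events"):
    print(reversed_payload)   # olleh, dlrow
```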

Real-time capabilities: Too often, users are given a false choice between storing data efficiently and being able to query it immediately. Immediate access to data as it happens is increasingly valuable.

Ad-hoc analysis: Relatedly, the kinds of real-time queries you can run shouldn’t be limited. Many traditional “real-time” systems pre-aggregate metrics in order to drive graphs. That’s fine, except you literally have to be psychic to set up all the queries you might need before you know you need them. Exploratory, ad-hoc analysis should be just as easy as pre-baked reports. It’s through that exploration that you discover what queries should become pre-baked.
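
A toy example makes the contrast clear. If the raw events are kept queryable, a question nobody anticipated is just another query rather than a dashboard change request (a generic sketch using SQLite and an invented events table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, country TEXT, latency_ms REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("2015-09-14T10:00", "DE", 12.0),
    ("2015-09-14T10:01", "US", 48.0),
    ("2015-09-14T10:02", "DE", 15.0),
])

# A pre-baked dashboard might only have stored counts per minute.
# Because the raw events survive, the question "which country has the
# worst latency?" is answerable the moment someone thinks to ask it:
for row in conn.execute("""
        SELECT country, AVG(latency_ms) AS avg_latency
        FROM events
        GROUP BY country
        ORDER BY avg_latency DESC"""):
    print(row)   # ('US', 48.0), ('DE', 13.5)
```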

Mature monitoring: Too many database systems, especially the new ones, lack mature tools for monitoring cluster health, system load, data layout, and status.

Compatibility: No database works in a vacuum, not even the ones in space. A database should integrate easily with other datastores, analysis tools, libraries, and existing applications.

Low barrier to entry: Databases should be easy to use. This means both a simple interface and familiarity to existing skill sets. Of all the new APIs and query languages that have been pushed onto the market over the last 10 years, I’ve yet to see one that’s strictly better than standards like SQL. A more flexible data model shouldn’t require a complete rewrite of your applications, and you shouldn’t have to learn another language just because the data is over here instead of over there.

Real SQL / relational features: As various NoSQL databases matured over the last 5 years or so, a curious thing happened to their APIs: they started looking more like SQL. It’s not because SQL is the ultimate language. It’s because SQL is based on some pretty fundamental math called relational algebra. Math has a funny habit of being true and useful no matter how many mean things people say about it on their blogs. A database without true relational features built in is at best a fancy filesystem.
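
A short sketch of what that algebra buys you in practice: selection, projection, join, and aggregation compose into one declarative statement, and the database (not the application) decides how to execute it. The schema below is invented purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 1.5);
""")

# Join, selection, projection, and aggregation expressed declaratively;
# the query optimizer picks the execution strategy.
query = """
    SELECT u.name, SUM(o.amount) AS total
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    HAVING SUM(o.amount) > 2.0
"""
print(conn.execute(query).fetchall())   # [('ada', 15.0)]
```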

We’re not all there yet as an industry, but we’re getting closer. A lot of hype that used to obscure the landscape is falling away as more companies acquire first-hand experience with the reality of real-time distributed analysis.

(image credit: Andrea Goh, CC2.0)

10 Big Data Stories You Shouldn’t Miss this Week
https://dataconomy.ru/2015/01/09/10-big-data-stories-you-shouldnt-miss-this-week-8/
Fri, 09 Jan 2015

TOP DATACONOMY ARTICLES

Python Packages For Data Mining

“Just because you have a ‘hammer’, doesn’t mean that every problem you come across will be a ‘nail’. The intelligent thing is to use the same hammer to solve whatever problem you come across. In the same way, when we set out to solve a data mining problem we will face many issues, but we can solve them by using Python intelligently.”

How to Interview a Data Scientist

You know the feeling. You’re sat across the table from an interviewee and the conversation starts to run away from you. If this happens most when you’re conducting technical interviews, it could be that the person you’re interviewing is talking in a language unknown to you, or that you simply don’t have the right questions to cut through the techno-babble and get to the point of why they’re sat there in front of you.

How Self-Service BI Is Going to Revolutionise the Enterprise

Chris Neumann stands at the forefront of the fastest-moving trend in the technology industry: Cloud BI. We recently spoke to Chris about DataHero’s journey, the importance of UX, and how self-service BI is going to change the game for enterprises big and small.

TOP DATACONOMY NEWS

Meet Apache Samza – LinkedIn’s Stream Processing Framework

With the advent of Big Data and the rapidly growing scale of web applications, monolithic relational databases were replaced by scalable, partitioned NoSQL databases and HDFS; individual queries to relational databases were replaced by the likes of Hive and Pig. This growing scale, and the partitioned consumption model these systems brought about, also created the need for smooth processing of “streams of events” at scale. That’s when LinkedIn came up with Samza.

MemSQL Reveal New Productivity Tool to Dramatically Improve Data Ingest

MemSQL, the San Francisco, CA based Big Data analytics outfit, has today rolled out a new data loading tool that increases ‘data ingest from popular data stores like Amazon Web Services S3 and the Hadoop Distributed File System (HDFS)’.

VMware Incorporates Workday’s Prediction Tech to Better Handle Probable Employee Resignations

Human capital management software vendor Workday has developed a new prediction technology that flags moments when employees might quit, giving employers a heads-up to resolve issues. VMware, the US-based cloud and virtualization software and services provider, is testing Workday’s latest offering, reveals Bloomberg.

TOP UPCOMING EVENTS

22-23 January, 2015 – Business Analytics Summit, Las Vegas

The Business Analytics Innovation Summit provides 30+ industry case studies and over 20 hours of networking opportunities across 2 days. Make sure to check back regularly for schedule additions and changes.

14-16 January, 2015 – NetSci Rio de Janeiro

NetSci-X is the first Network Science Conference outside the USA-Europe axis. It will bring together leading researchers and practitioners working in the emerging area of network science.

TOP DATACONOMY JOBS

Pricing Manager / Analyst, Wayfair

As a Pricing Analyst you will be responsible for pricing every product that appears on the website. You will manage the daily operational pricing functions while continually seeking to optimize procedures and test strategies to increase gross profit. If you love diving into deep data sets to identify areas for improvement, and are even more enthusiastic about solving those problems, do not hesitate to apply!

Business Intelligence Analyst, CupoNation

We are currently looking for new talent to join our highly professional and dynamic Business Intelligence team. In this role you will be responsible for designing end-to-end solutions that meet our company’s Business Intelligence requirements. This covers the definition and implementation of technical requirements for ETL jobs, the creation of new data layers, and the optimization and enhancement of the current data warehouse infrastructure.

MemSQL Reveal New Productivity Tool to Dramatically Improve Data Ingest
https://dataconomy.ru/2015/01/05/memsql-reveal-new-productivity-tool-to-dramatically-improve-data-ingest/
Mon, 05 Jan 2015

MemSQL, the San Francisco, CA-based Big Data analytics outfit, has today rolled out a new data loading tool that increases ‘data ingest from popular data stores like Amazon Web Services S3 and the Hadoop Distributed File System (HDFS)’.

“The MemSQL Loader is another innovation of simplifying MemSQL implementations with production data workflows,” explains Chief Technology Officer and co-founder of MemSQL, Nikita Shamgunov.

“After working with customers during their MemSQL deployments, we found a simple way to eliminate steps in data pipelines, saving them time and reducing complexity,” he further added. “By streaming directly from popular data stores like Amazon S3 and HDFS, we offer customers an easy way to get started, and an efficient way to integrate the real-time transactions and analytics of MemSQL into existing environments.”

An announcement made earlier today reveals that, in contrast to typical data loading, which often requires multiple steps, the MemSQL Loader enables direct streaming from the originating datastore in a single transfer. Available as open source, the MemSQL Loader supports ‘multiple parallel input streams’, saving time by eliminating repetitive operations and increasing performance.
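
MemSQL has not detailed the Loader’s internals beyond the announcement, but the general pattern of parallel input streams is straightforward to sketch. In the hypothetical Python sketch below (illustrative only, not the actual MemSQL Loader, with invented file names), each source file gets its own worker that parses rows into batches; in a real loader each batch would be handed to the database’s bulk-insert path:

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def stream_file(path):
    """One input stream: parse a CSV file into batches of rows.
    In a real loader each batch would go to the database's
    bulk-insert path instead of just being counted."""
    total = 0
    with open(path, newline="") as f:
        batch = []
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) >= 10000:   # flush a full batch
                total += len(batch)
                batch = []
        total += len(batch)           # flush the final partial batch
    return total

# Hypothetical source partitions (e.g. pieces of an S3 or HDFS export).
paths = ["part-0000.csv", "part-0001.csv", "part-0002.csv"]

with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    # Each file is consumed by its own stream, so ingest is limited by
    # the slowest source rather than by one long sequential pass.
    total_rows = sum(pool.map(stream_file, paths))

print("loaded %d rows" % total_rows)
```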

A Y Combinator company, MemSQL has been helping companies merge real-time and historical Big Data analytics with its distributed in-memory database, and has garnered funding from Accel Partners, Khosla Ventures, First Round Capital, and Data Collective.

For further reading into the nuances of MemSQL Loader, the technical blog post can be found here.


(Image credit: MemSQL)
