To get value out of today's big and fast data, organizations must evolve beyond traditional analytic cycles that are heavy with data transformation and schema management. The Hadoop revolution is about merging business analytics and production operations to create the 'as-it-happens' business. It's not a matter of running a few queries to inform the next business decision; it's a matter of changing the organization's fundamental metabolic rate. That requires a data-centric approach to infrastructure, one that provides flexible, real-time data access, collapses data silos, and automates data-to-action for immediate operational benefit.
Lesson #1 – Journey to the Real-Time Data-Centric Enterprise
There are two types of companies in the big data space: those that were born in big data and deliver a competitive edge through the software or processes that enable it, and the rest, which have a mandate to simultaneously cut IT and storage costs and create a platform for innovation.
A few MapR customers are worth highlighting because they have been very successful on their journeys to becoming Real-Time Data-Centric Enterprises; they are great examples that this transformation is real and completely possible.
Urban Airship sends over 180B messages per month. Some of their users are sports fans who want information as it happens; waiting hours is not good enough. Urban Airship delivers actionable information so that brands like Starbucks can ensure the best possible experience for their customers.
Rubicon Project pioneered technology that created a new model for the advertising industry, much as NASDAQ did for stock trading. Its automated advertising platform processes over 100B ad auctions each day and provides the most extensive reach in the industry, touching 96% of internet users in the US.
Machine Zone, creators of Game of War, ran operations in isolation from analytics, which limited their ability to deliver actionable data. They reengineered their platform to run operations and analytics together, and they now support over 40M users and more than 300k events per second.
If you would like to read about more use cases for creating an as-it-happens business, I recommend the new, freely downloadable O'Reilly book Real-World Hadoop by Ted Dunning and Ellen Friedman.
Lesson #2 – Replatforming the Enterprise
Real-time doesn't require just big or just fast, or big in one cluster and fast in another. It requires big and fast together, working harmoniously. The question to ask now is whether there is an overall architectural approach that brings the two together and can be followed to ensure success.
To answer this question, we should step back and understand where we have been and where we are going. Applications have historically dictated the data format, but data should freely support varied compute engines. ETL is a heavy, complicated process that often consumes an extensive amount of time and resources, so we would be better served by handling fast-growing data sources with a load-and-go philosophy. Finally, wherever systems carry a tremendous amount of latency, we must continually work to remove it.
With this new architectural approach, we need to consider all of the hardware in our data centers, not just the hardware set aside for Hadoop. We need to build on top of all of it and change our way of thinking: away from static partitioning of hardware and toward a more dynamic approach. On top of all the hardware, we need to manage all resources globally and be able to elastically expand and contract them on demand. One example is a hybrid approach that uses Project Myriad to let Apache Mesos manage YARN; 451 Research wrote a research brief about MapR co-launching the Myriad project to unite Apache YARN with Apache Mesos.
On top of global resource management, we need a cohesive set of data storage services that any application can use, including a distributed file system that can deliver against all of our business continuity needs, plus real-time storage and processing. To enable the most value from this new architectural approach, MapR announced as part of the 4.1 release two new pieces of functionality for MapR-DB:
- First is Multi-Master Replication between clusters. You will now be able to deploy multiple active (read/write) clusters across globally distributed data centers, placing live data closer to end users. MapR-DB maintains global consistency via synchronous replication, or asynchronous replication when the network can't sustain synchronous, and its transaction-level replication minimizes data loss in the case of a data center failure.
- Second is the new C API for MapR-DB, which lets you write native C/C++ applications against MapR-DB. You can also call this C API from more than 20 other languages by using a wrapper framework such as SWIG. The sketch after this list shows the same style of table access through MapR-DB's HBase-compatible API.
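To make the table model concrete, here is a minimal sketch of writing and reading a MapR-DB table through the HBase-compatible Java API that MapR-DB exposes (the C API covers the same kind of table operations natively). The table path /tables/events, the column family cf, and the row contents are hypothetical, and the HBase 0.98-era client API of that time is assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class EventTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // MapR-DB tables are addressed by filesystem path rather than a
        // flat table name; /tables/events is a hypothetical example.
        HTable table = new HTable(conf, "/tables/events");
        try {
            // Write one cell: row key "user-42", family "cf", qualifier "clicks".
            Put put = new Put(Bytes.toBytes("user-42"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("clicks"), Bytes.toBytes("17"));
            table.put(put);

            // Read the cell back.
            Get get = new Get(Bytes.toBytes("user-42"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("clicks"));
            System.out.println("clicks = " + Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}
```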
Finally, in order to complete the replatforming, we need the distributed applications that drive our businesses to play nicely with these enabling technologies. With MapR, any application that can read and write to an NFS mount is already capable of plugging directly into this architecture.
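As a small illustration of that point, here is a sketch of ordinary Java file I/O appending a record to a file on a MapR NFS mount. It assumes the cluster is mounted at the conventional /mapr/&lt;cluster-name&gt; path; the cluster name my.cluster.com and the file path are placeholders:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class NfsAppendExample {
    public static void main(String[] args) throws IOException {
        // Standard file I/O: nothing Hadoop-specific is needed, because the
        // MapR cluster is exposed as a regular POSIX path over NFS.
        String path = "/mapr/my.cluster.com/apps/logs/events.log";
        try (PrintWriter out = new PrintWriter(new FileWriter(path, true))) {
            out.println(System.currentTimeMillis() + ",user-42,click");
        }
        // Data written this way is then available to MapReduce jobs, Drill
        // queries, and any other consumer reading from the cluster.
    }
}
```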
Lesson #3 – Data Agility
The real-time data-centric enterprise is driven less by the data-to-insight cycle than by the data-to-action cycle. Insights are great, but being able to act on the information makes all the difference in an as-it-happens business.
If we want operations and analytics on the same platform, we need to think about how we get data in and how we handle it at scale to deliver real-time, actionable analytics. We must think about managing, processing, exploring, and leveraging the data. To shorten the data-to-action cycle, we have to remove steps from legacy processes: we no longer want to create and maintain schemas, or to transform or copy data, because those steps take time and deliver no value in a data-to-action cycle. What we want moving forward is low latency, scalability, and integration with ubiquitous technologies like SQL and the BI tools already in use.
This is where Apache Drill comes in. Drill is a SQL query engine that supports self-service data exploration without the need to predefine a schema. It is ANSI SQL:2003 compliant and plugs into the BI tools you are accustomed to using through industry-standard interfaces. With Drill you simply query your data in place; there is no need to perform ETL or to move your data. After all, if you currently use business intelligence tools, you should still be able to use them.
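For instance, here is a minimal sketch of querying a raw JSON file in place through Drill's JDBC driver, with no schema definition or ETL step beforehand; the ZooKeeper address and the file path are placeholder assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.drill.jdbc.Driver");
        // Connect via the ZooKeeper quorum that tracks the Drillbits
        // (zkhost:2181 is a placeholder for your quorum).
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=zkhost:2181");
             Statement stmt = conn.createStatement()) {
            // Query the JSON file directly; Drill discovers the schema at
            // read time, so no DDL or data movement is required first.
            ResultSet rs = stmt.executeQuery(
                "SELECT name, city FROM dfs.`/data/users.json` WHERE city = 'Chicago'");
            while (rs.next()) {
                System.out.println(rs.getString("name") + " lives in " + rs.getString("city"));
            }
        }
    }
}
```

The same query text would work from any JDBC-aware BI tool, which is the point: exploration happens against the data as it landed, not against a curated copy.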
Lesson #4 – Good Enough?
"Good enough" won't stay good enough once your business users start experiencing the value of real-time big data; they will only want more, faster. No matter which use case starts your journey, the journey will require real-time, enterprise-grade capabilities.
It is paramount to remember that an as-it-happens business is as much about business process reinvention as it is about the technology that runs the business. That doesn't mean replacing all the tools you use; it means rethinking how you operate and augmenting your tools where necessary to impact your business as it happens.
About the author: Jim Scott – Director of Enterprise Strategy and Architecture, MapR
Jim has held positions running Operations, Engineering, Architecture and QA teams. Jim is the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for the past 4 years. Jim has worked in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day. Jim’s work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.