Real-Time – Dataconomy
https://dataconomy.ru – Bridging the gap between technology and business

Better Allies than Enemies: Why Spark Won’t Kill Hadoop
https://dataconomy.ru/2015/05/13/better-allies-than-enemies-why-spark-wont-kill-hadoop/
Wed, 13 May 2015 09:40:31 +0000

Fans and supporters of Hadoop have no reason to fear; Hadoop isn’t going away anytime soon. There’s been a great deal of consternation about the future of Hadoop, most of it stemming from the growing popularity of Apache Spark. Some big data experts have even gone so far as to say Apache Spark will one day soon replace Hadoop in the realm of big data analytics. So are the Spark supporters correct in this assessment? Not necessarily. Apache Spark may be a new technology that’s getting a lot of attention, and the number of Apache Spark users is growing at a considerable pace, but that doesn’t make it Hadoop’s successor. The two technologies certainly have similarities, but their differences set them apart, and the right platform really depends on the task at hand. To say Spark is on its way to dethroning Hadoop is simply premature. If anything, the two look to be complementary in the work they do.

Every discussion surrounding Hadoop should include talk about MapReduce, the parallel processing framework in which jobs are run to process and analyze large sets of data. If an enterprise needs to analyze big data offline, Hadoop is usually the preferred choice. That’s what drew so many businesses and industries to Hadoop in the first place: the capability to store and analyze big data inexpensively. As Matt Asay of InfoWorld Tech Watch puts it, Hadoop essentially “democratized big data.” Suddenly, businesses had access to information the likes of which they never had before, and they could put that big data to good use, creating a large number of big data use cases. Hadoop’s batch-processing technology was revolutionary and is still used often today. When it comes to data warehousing and offline analysis jobs that may take hours to complete, it’s tough to go wrong with Hadoop.
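
To make that batch model concrete, here is a minimal, hedged sketch of a MapReduce job in Java. The input layout (one “storeId,amount” line per sale) and the class names are assumptions made for this illustration; it is not code from any of the vendors discussed.

```java
// Nightly batch job: sum sales per store from CSV lines ("storeId,amount").
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesTotals {

  // Map: one input line -> (storeId, amount)
  public static class SalesMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 2) return;  // skip malformed lines
      ctx.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
    }
  }

  // Reduce: sum all amounts seen for a store
  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text storeId, Iterable<DoubleWritable> amounts, Context ctx)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable a : amounts) total += a.get();
      ctx.write(storeId, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "nightly-sales-totals");
    job.setJarByClass(SalesTotals.class);
    job.setMapperClass(SalesMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```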

Apache Spark, which was developed as a project independent from Hadoop, offers its own advantages that have made many organizations sit up and take notice. Many supporters say Spark represents a technological evolution of Hadoop’s capabilities. There are several categories where Apache Spark excels. The first and most touted is speed. When processing data in a Hadoop cluster, Spark can run applications much more quickly — anywhere from ten to a hundred times faster in some cases. This capability has basically ushered in the era of real-time big data analytics, sometimes referred to as streaming data. Beyond speed, Spark is also relatively easy to use, particularly for developers. Writing applications in a familiar programming language, like Java or Python, makes the process of building apps that much easier. Spark is also quite versatile, able to run on a myriad of platforms, including Hadoop and the cloud. Spark can also access a wide variety of data sources, among them Amazon Web Services’ S3, Cassandra, and Hadoop’s own data store, HDFS.
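
A small sketch of what that looks like from Spark’s Java API: the same job can point at HDFS, S3 or local files simply by changing the path. The path, the CSV layout and the threshold are assumptions for illustration, not a definitive implementation.

```java
// Count "large" orders in a CSV dataset wherever it lives (hdfs://, s3a://, file://).
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class QuickCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("quick-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Hypothetical input location; pass a different URI to read from S3 or disk
    String path = args.length > 0 ? args[0] : "hdfs:///data/sales/*.csv";
    JavaRDD<String> lines = sc.textFile(path);

    long bigOrders = lines
        .map(line -> line.split(","))                       // "storeId,amount"
        .filter(f -> f.length > 1 && Double.parseDouble(f[1]) > 1000.0)
        .count();

    System.out.println("Orders over 1000: " + bigOrders);
    sc.stop();
  }
}
```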

With Spark’s capabilities in mind, some may wonder why any organization should stick with Hadoop at all. After all, Spark appears able to run more complex and sophisticated workloads more quickly. Who wouldn’t want real-time analytics? But the truth is that Hadoop and Spark may in fact work better together. If anything, Spark loses some of its effectiveness without Hadoop, since it was designed to run on top of it. Hadoop can support both the traditional batch-processing model and the real-time analytics model. Think of Spark as an added feature that can go with Hadoop. For interactive data mining, machine learning, and stream processing, Spark is the way to go. For businesses requiring more scalable infrastructure, enabling them to add servers for growing workloads, Hadoop and MapReduce are a better bet. Utilizing both at the same time in a complementary approach gets organizations the best that both have to offer.

Talk of the death of Hadoop always seemed a little hasty, no matter how impressive Spark’s capabilities have been. There’s no denying the advantages that Spark brings to the table, but Hadoop isn’t going to just disappear. Spark was never designed to replace Hadoop anyway. When the two are used in tandem, businesses gain the advantages of both, effectively increasing the benefits they receive. Even as the industry moves toward real-time analytics, Hadoop will remain necessary and readily available for all companies.


Rick Delgado – I’ve been blessed to have a successful career and have recently taken a step back to pursue my passion for freelance writing. I love to write about new technologies and keeping ourselves secure in a changing digital landscape. I occasionally write articles for several companies, including Dell.


Photo credit: Ben K Adams / Photo / CC BY-NC-ND

Challenging the Lambda Architecture: Building Apps for Fast Data with VoltDB v5.0
https://dataconomy.ru/2015/04/07/challenging-the-lambda-architecture-building-apps-for-fast-data-with-voltdb-v5-0/
Tue, 07 Apr 2015 13:15:59 +0000

Traditional databases historically haven’t been fast enough to act on real-time data, forcing developers to go to great effort to write applications that capture and process fast data. Some developers have turned to tools like Storm or Spark Streaming; others have patched together a set of open source projects in the style of the Lambda Architecture.  These complex, multi-tier systems often require the use of a dozen or more servers.

What we’ve learned developing the world’s fastest in-memory database is that OLTP at scale starts to look a lot like a streaming application. Whether you view your problem as OLTP with real-time analysis or as stream processing, VoltDB is the only system that combines ingestion, speed, robustness, and strong consistency to make developing and supporting real-time apps easier than ever.

In our latest release, VoltDB v5.0, we provide – in one system – a unified Fast Data/Big Data pipeline. On the streaming front-end, VoltDB connects to message queue tools like Kafka. On the backend, VoltDB efficiently exports streams of data to Hadoop/HDFS, or to another OLAP system of choice for historical archiving. In between, VoltDB v5.0 provides high velocity, transactional ingestion of data and events; real-time analytics on windows of streaming data; and also provides low-latency, per-event decision-making: the ability to react and apply business decisions to individual events.

Challenging Lambda

VoltDB is an ideal alternative to the Lambda Architecture’s speed layer. It offers horizontal scaling and high per-machine throughput. It can easily ingest and process millions of tuples per second with redundancy, while using fewer resources than alternative solutions. VoltDB requires an order of magnitude fewer nodes to achieve the scale and speed of the Lambda speed layer. As a benefit, substantially smaller clusters are cheaper to build and run, and easier to manage.

VoltDB also offers strong consistency, a strong development model, and can be directly queried using industry-standard SQL. These features enable developers to focus on business logic when building distributed systems at scale. If you can express your logic in single-threaded Java code and SQL, with VoltDB you can scale that logic to millions of operations per second.

Finally, VoltDB is operationally simpler than competing solutions. Within a VoltDB cluster, all nodes are the same; there are no leader nodes, no agreement nodes; there are no tiers or processing graphs. If a node fails, it can be replaced with any available hardware or cloud instance and that new node will assume the role of the failed node. Furthermore, by integrating processing and state, a Fast Data solution based on VoltDB requires fewer total systems to monitor and manage.

A Speed Layer Example

As part of the work that went into VoltDB v5.0, our developers prepared several sample applications that illustrate the power and flexibility of the in-memory, scale-out NewSQL database for developing fast data applications. One application in particular, the ‘Unique Devices’ example, demonstrates, in approximately 30 lines of code, how to handle volumes of fast streaming data while maintaining state and data consistency, and achieving the benefits of real-time analytics and near real-time decisions.

The VoltDB ‘Unique Devices’ example represents a typical Lambda speed layer application. This isn’t a contrived sample – it is based on a real-world application hosted by Twitter that is designed to help mobile developers understand how many people are using their app. Every time an end-user uses a smartphone app, a message is sent with an app identifier and a unique device id. This happens 800,000 times per second over thousands of apps. App developers pay to see how many unique users have used their app each day, with per-day history available for some amount of time going back.

This Twitter system was built using the Lambda Architecture. In the speed layer, Kafka was used for ingestion, Storm for processing, Cassandra for state and Zookeeper for distributed agreement. In the batch layer, tuples were loaded in batches into S3, then processed with Cascading and Amazon Elastic MapReduce. To reduce processing and storage load on the system, the HyperLogLog cardinality estimation algorithm was used as well.

Replicating these requirements using VoltDB involves simplifying the architecture considerably.  The VoltDB Unique Devices sample application has several components:

The Sample Data Generator

Generating fake but plausible data is often the hardest part of a sample app. Our generator produces (ApplicationID, DeviceID) tuples with non-linear distributions. This component is also easily changed to support more or fewer applications and devices, or different distributions. View the code on Github.
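
The actual generator lives in the GitHub example; a hedged sketch of the idea looks something like the following, where the counts and the shape of the skew are assumptions for illustration.

```java
// Emits (applicationId, deviceId) pairs with a skewed, non-linear distribution
// so that a handful of apps dominate the traffic, as real workloads tend to do.
import java.util.Random;

public class DeviceEventGenerator {
  private static final int APP_COUNT = 1_000;
  private static final long DEVICE_COUNT = 10_000_000L;
  private static final Random RAND = new Random();

  static int nextAppId() {
    double u = RAND.nextDouble();
    return (int) Math.floor(APP_COUNT * u * u);   // quadratic skew toward low ids
  }

  static long nextDeviceId() {
    return (long) (RAND.nextDouble() * DEVICE_COUNT);
  }

  public static void main(String[] args) {
    for (int i = 0; i < 1_000; i++) {
      System.out.println(nextAppId() + "," + nextDeviceId());
    }
  }
}
```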

The Client Code

The client code takes generated data and feeds it to VoltDB. The code looks like most VoltDB sample apps: it offers some configuration logic, some performance monitoring, and the connection code. View the code on Github.

The Ingest Logic

This is the key part of the application, and it’s what separates VoltDB from other solutions. The idea was to build a stored procedure to handle the incoming ApplicationID and DeviceID tuples, and then write any updated estimate for the given ApplicationID back to the relational state.

First we found a HyperLogLog implementation in Java on Github. Then we simply wrote down the straightforward logic to achieve the task, trying a few variations and using the performance monitoring in the client code to pick the best choice. The bulk of the logic is about 30 new lines of code (see below), which can easily scale to over a million ops/sec with full fault tolerance on a cluster of fewer than 10 commodity nodes. This is a significant win over alternative stacks. View the code on Github.

[Code screenshot: the ~30-line ingest stored procedure from the Unique Devices example]
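
For readers who can’t see the screenshot, the shape of such an ingest procedure looks roughly like the sketch below. The table layout, column names and estimator stubs are assumptions for illustration; the actual procedure (including the HyperLogLog calls) is in the GitHub example linked above.

```java
// VoltDB stored procedure: one call per (appId, deviceId) event.
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class CountDeviceForApp extends VoltProcedure {

  public final SQLStmt select = new SQLStmt(
      "SELECT devicecount, hll FROM estimates WHERE appid = ?;");
  public final SQLStmt insert = new SQLStmt(
      "INSERT INTO estimates (appid, devicecount, hll) VALUES (?, ?, ?);");
  public final SQLStmt update = new SQLStmt(
      "UPDATE estimates SET devicecount = ?, hll = ? WHERE appid = ?;");

  public long run(long appId, long deviceId) {
    // Read the current estimator state for this app (if any)
    voltQueueSQL(select, appId);
    VoltTable existing = voltExecuteSQL()[0];
    byte[] estimator = existing.advanceRow() ? existing.getVarbinary("hll") : null;

    // Placeholder for the cardinality-estimator step: the real example feeds
    // deviceId into an open-source Java HyperLogLog implementation and reads
    // back the current estimate of distinct devices.
    byte[] updated = updateEstimator(estimator, deviceId);
    long estimate = currentEstimate(updated);

    // Persist the new estimate and estimator state in the same transaction
    if (estimator == null) {
      voltQueueSQL(insert, appId, estimate, updated);
    } else {
      voltQueueSQL(update, estimate, updated, appId);
    }
    voltExecuteSQL(true);
    return estimate;
  }

  // Stubs standing in for the HyperLogLog library used by the real sample.
  private byte[] updateEstimator(byte[] serialized, long deviceId) {
    return serialized == null ? new byte[0] : serialized;
  }

  private long currentEstimate(byte[] serialized) {
    return 0;
  }
}
```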

Web Dashboard

Finally, the Unique Devices app includes an HTML file with accompanying Javascript source to query VoltDB using SQL over HTTP, displaying query results and statistics in a browser. The provided dashboard is very simple; it shows ingestion throughput and a top-ten list of most popular apps. Still, because the VoltDB ingestion logic hides most implementation complexity – such as the use of HyperLogLog – and boils the processing down to relational tuples, the top-ten query is something any SQL beginner could write. View the code on Github.
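
The top-ten query itself is plain SQL. The dashboard issues it over VoltDB’s JSON/HTTP interface; the sketch below runs the equivalent query ad hoc through the standard VoltDB Java client instead, with the table and column names assumed for illustration.

```java
// Fetch the ten most popular apps by estimated unique-device count.
import org.voltdb.VoltTable;
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;

public class TopTenApps {
  public static void main(String[] args) throws Exception {
    Client client = ClientFactory.createClient();
    client.createConnection("localhost");

    VoltTable top = client.callProcedure("@AdHoc",
        "SELECT appid, devicecount FROM estimates " +
        "ORDER BY devicecount DESC LIMIT 10;").getResults()[0];

    while (top.advanceRow()) {
      System.out.println(top.getLong("appid") + " -> " + top.getLong("devicecount"));
    }
    client.close();
  }
}
```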

It took a little more than a day to build the app and the data generator; some time was lost adding some optimizations to the HyperLogLog library used. It was not hard to build the app with VoltDB.

Example in hand, we then decided to push our luck. We created versions that don’t use a cardinality estimator, but use exact counts. As expected, performance is good while the data set is small, but degrades as it grows, unlike the HyperLogLog version. We also created a hybrid version that keeps exact counts until 1,000 devices have been seen, then switches to the estimating version. You can really see the power of strong consistency when it’s so trivial to add what would be a complex feature in another stack. Furthermore, changing this behavior required changing only one file, the ingestion code; the client code and dashboard code are unaffected. Both modifications are bundled with the example.

Finally, we made one more modification. We added a history table, and logic in the ingestion code that would check on each call if a time period had rolled over. If so, the code would copy the current estimate into a historical record and then reset the active estimate data. This allows the app to store daily or hourly history within VoltDB. We didn’t include this code in the example, as we decided too many options and the additional schema might muddy the message, but it’s quite doable to add.

VoltDB isn’t just a good platform for static Fast Data apps. The strong consistency, developer-friendly interfaces and standard SQL access make VoltDB apps really easy to enhance and evolve over time. It’s not just easier to start: it’s more nimble when it’s finished.


About the author: John Hugg is the Founding Software Engineer for VoltDB. VoltDB provides a fully durable, in-memory relational database that combines high-velocity data ingestion and real-time data analytics and decisioning to enable organizations to unleash a new generation of big data applications that deliver unprecedented business value.


Photo credit: xTom

MapR’s 4 Takeaway Lessons from the Hadoop Revolution
https://dataconomy.ru/2015/03/26/maprs-4-takeaway-lessons-from-the-hadoop-revolution/
Thu, 26 Mar 2015 14:29:02 +0000

To get value out of today’s big and fast data, organizations must evolve beyond traditional analytic cycles that are heavy with data transformation and schema management. The Hadoop revolution is about merging business analytics and production operations to create the ‘as-it-happens’ business. It’s not a matter of running a few queries to gain insight for the next business decision, but of changing the organization’s fundamental metabolic rate. It is essential to take a data-centric approach to infrastructure that provides flexible, real-time data access, collapses data silos and automates data-to-action for immediate operational benefits.

Lesson #1 – Journey to the Real-Time Data-Centric Enterprise

There are two types of companies in the big data space. The first were born in big data and deliver a competitive edge through the software or the processes that enable it. Then there are the rest: companies with a mandate to simultaneously cut IT and storage costs and to create a platform for innovation.

There are a few MapR customers I want to highlight as they have been very successful on their journey to become Real-Time Data-Centric Enterprises and are great examples that this is real and completely possible.

Urban Airship sends over 180B text messages per month. Some of their users are sports fans who require information as-it-happens, and waiting for hours is not good enough. They deliver actionable information so that brands like Starbucks can ensure the best experience possible for their customers.

Rubicon Project built pioneering technology that created a new model for the advertising industry – similar to what NASDAQ did for stock trading. Rubicon Project’s automated advertising platform processes over 100B ad auctions each day and provides the most extensive reach in the industry, touching 96% of internet users in the US.

Machine Zone, creators of Game of War, had their operations isolated from analytics and their ability to deliver actionable data was limited. They reengineered their platform to deliver operations and analytics on the same platform and now support over 40M users and more than 300k events per second.

If you would like to read about more use cases for creating an as-it-happens business I recommend reading this new and freely downloadable O’Reilly book Real-World Hadoop written by Ted Dunning and Ellen Friedman.

Lesson #2 – Replatforming the Enterprise

Real-time doesn’t require just big or just fast–or big in one cluster and fast in another. It requires big and fast together, working harmoniously. What we should be thinking about now is whether or not there is an overall architectural approach to bring the two together that can be followed to ensure success.

To answer this question we should step back and understand where we have been and where we are going. Applications have dictated the data format, but now we can see that the data should freely support varied compute engines. ETL is a heavy, complicated process that often takes an extensive amount of time and resources, so we would be better served by handling fast-growing data sources with a load-and-go philosophy. Finally, where systems have carried a tremendous amount of latency, we must continually figure out how to remove it.

With this new architectural approach, we need to consider all of the hardware we have in our data centers. Not just the hardware set aside for Hadoop, but everything. We need to build on top of all the hardware and change our way of thinking. We need to get away from static partitioning of hardware and onto a more dynamic approach. On top of all the hardware we need to be able to manage all resources globally; we must be able to elastically expand and contract resources on-demand. An example of this would be a hybrid approach by using Project Myriad to allow Apache Mesos to manage YARN. 451 Research wrote a research brief about MapR co-launching the Myriad project to unite Apache YARN with Apache Mesos.

On top of global resource management we need a set of cohesive data storage services that any application can use, including a distributed file system that can deliver against all of our business continuity needs, plus real-time storage and processing. To enable the most value from this new architectural approach, MapR announced as part of the 4.1 release that two new pieces of functionality will be added to MapR-DB:

  • First is the Multi-Master Replication between clusters. This means you will now be able to deploy multiple active (read/write) clusters across globally distributed data centers so that you may place live data closer to end users. MapR-DB will maintain global consistency via synchronous or asynchronous (when the network can’t handle synchronous) replication. This will minimize data loss (in the case of data center failure) with transaction-level replication.
  • Second is the new C API for MapR-DB, which lets you write native C/C++ applications for MapR-DB. Additionally, you can leverage this C API from more than 20 different languages by utilizing a framework like SWIG.

Finally, in order to complete the replatforming we need the distributed applications that drive our businesses to play nicely with these enabling technologies. With MapR any application that can read/write to an NFS mount is already capable of plugging directly into this architecture.

Lesson #3 – Data Agility

The real-time data-centric enterprise is not driven by the data-to-insight cycle; it is driven by the data-to-action cycle. Insights are great, but being able to take action on the information makes all the difference in an as-it-happens business.

If we want to have operations and analytics on the same platform, we need to think about how we get data in and how we handle it at scale to deliver real-time and actionable analytics. We must think about managing, processing, exploring and leveraging the data. In order to shorten the data-to-action cycle we have to remove steps in legacy processes to move forward. We no longer want to create and maintain schemas, or to transform or copy data. These processes take time and they don’t deliver value in a data-to-action cycle. What we want moving forward is low latency, scalability and integration with ubiquitous technologies like SQL and the BI tools already in use.

This is where Apache Drill comes in. Drill is a SQL query engine that supports self-service data exploration without the need to predefine a schema. Drill is ANSI SQL 2003 compliant and plugs right into all of those BI tools you are accustomed to using, in an industry-standard way. With Drill you simply query your data in place; there is no need to perform ETL or to move your data. After all, if you currently use business intelligence tools, you should still be able to use them.
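
A hedged sketch of what “query in place” looks like from plain JDBC: a raw JSON file is queried directly, with no schema declared and no ETL. The connection string, file path and field names are assumptions for illustration.

```java
// Ad hoc query over a raw JSON file via the Drill JDBC driver.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillInPlaceQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.drill.jdbc.Driver");
    // Connect to a Drillbit running locally (adjust host or ZooKeeper quorum)
    try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement()) {
      ResultSet rs = stmt.executeQuery(
          "SELECT userId, COUNT(*) AS events " +
          "FROM dfs.`/data/clickstream/2015-03-26.json` " +
          "GROUP BY userId ORDER BY events DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("userId") + " -> " + rs.getLong("events"));
      }
    }
  }
}
```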

Lesson #4 – Good Enough?

“Good enough” won’t be good enough once your business users start experiencing the value of real-time big data. They will only want more and faster. No matter what use case you start with for your journey, your journey will require real-time, enterprise-grade capabilities.

It is paramount to remember that an as-it-happens business is as much about business process reinvention as it is about the technology that runs the business. Keep in mind that doesn’t mean replacing all the tools you use in your business. It means rethinking how you operate and augment your tools where necessary to impact your business as-it-happens.


About the author: Jim Scott – Director of Enterprise Strategy and Architecture, MapR
Jim has held positions running Operations, Engineering, Architecture and QA teams. Jim is the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for the past 4 years. Jim has worked in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day. Jim’s work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.


Image credit: MapR

eBay Open Sources Pulsar to Analyse User Data in Real Time
https://dataconomy.ru/2015/02/26/ebay-open-sources-pulsar-to-analyse-user-data-in-real-time/
Thu, 26 Feb 2015 11:39:29 +0000

E-commerce giant eBay has released an open-source, real-time analytics platform and stream processing framework, dubbed Pulsar. Through Pulsar, user and business events are collected and processed in real time, enabling superior interaction, to the tune of a million events per second with high availability.

Owing to the ever-increasing buyer and seller traffic eBay faces, new use cases keep emerging that require collection and processing in real time. Until now, user experience optimization and behaviour analysis were carried out using batch-oriented data platforms like Hadoop.

To enable better interaction, “derive actionable insights and generate signals for immediate action”, eBay decided to develop its own distributed CEP (complex event processing) framework. Pulsar CEP provides a Java-based framework as well as tooling to build, deploy, and manage CEP applications in a cloud environment, explains the blog post making the announcement.

The post points out that Pulsar CEP includes the following capabilities:

  • Declarative definition of processing logic in SQL
  • Hot deployment of SQL without restarting applications
  • Annotation plugin framework to extend SQL functionality
  • Pipeline flow routing using SQL
  • Dynamic creation of stream affinity using SQL
  • Declarative pipeline stitching using Spring IOC, thereby enabling dynamic topology changes at runtime
  • Clustering with elastic scaling
  • Cloud deployment
  • Publish-subscribe messaging with both push and pull models
  • Additional CEP capabilities through Esper integration

Built within Pulsar is a real-time analytics data pipeline that provides higher reliability and scalability through processes such as data enrichment, filtering and mutation, aggregation, and stateful processing.

The now open-sourced Pulsar has been deployed in production at eBay and is processing all user behavior events, the post says. A dashboard is under development which will make it easier to integrate with metrics stores like Cassandra and Druid.


(Image credit: Justin Sullivan, Getty Images)

Bottlenose Scoops $13.4m to Help Businesses Keep Abreast of Real-Time Trends & Threats
https://dataconomy.ru/2015/02/12/bottlenose-scoops-13-4m-to-help-businesses-keep-abreast-of-real-time-trends-threats/
Thu, 12 Feb 2015 10:46:10 +0000

Real-time Trend Intelligence provider Bottlenose has secured $13.4 million in Series B funding, led by KPMG Capital, according to an SEC filing.

Earlier, in December, KPMG Capital – a new funding wing set up by KPMG International, whose first investment fund was created to accelerate innovation in data and analytics – had revealed it was acquiring a substantial equity stake in Bottlenose Inc.

Existing investors, including Lerer Ventures, Transmedia Capital, ff Venture Capital, and others, were joined by new investor Origin Ventures, taking Bottlenose’s total funding to date to approximately $17 million. The round, however, is yet to close, with co-founder and CEO Nova Spivack expecting more in the form of venture debt.

The new investment will help expand the business, further develop its product line, and grow the existing workforce from 40 to 80, TechCrunch reports.

By analyzing real-time streaming data, the cloud-based trend intelligence solution enables enterprises to identify, anticipate and monitor the trends that drive their businesses.

As TechCrunch points out, while ‘Twitter currently sees around a billion messages a day on its service’, Spivack says that ‘Bottlenose is now analyzing 72 billion data records and messages daily.’

For clients like Pepsi, Warner Bros., DigitasLBI, Razorfish, FleishmanHillard, and others, Bottlenose analyzes live TV and radio data in real time using speech-to-text conversion; enterprise data sources such as sales data, log data, transaction data, and even firewall data are also available now.

In the wake of high-profile data breaches last year, Spivack explains how breach detection is an opportunity for Bottlenose as it makes sense of data from across sources: “If hackers are talking about a company, and we see also corresponding, correlated activity on that company’s firewall, that’s actually a good indicator that there’s actually something happening.”

(Image credit: Bong Grit, via Flickr)

Creating a Transparent Payment Marketplace with Big Data
https://dataconomy.ru/2015/01/16/creating-a-transparent-payment-marketplace-with-big-data/
Fri, 16 Jan 2015 14:00:51 +0000

Stevan is the Co-Founder and CTO of CurrencyTransfer.com – an international payments marketplace. Although many banks and currency brokers proudly boast “Zero fees!” on their advertising materials, hidden costs still proliferate. CurrencyTransfer.com have developed a marketplace, updated in real time, where you can compare fees and rates from a range of brokers. We spoke to Stevan Litobac, the Co-Founder and CTO of CurrencyTransfer, about his work and the challenges of building a transparent currency marketplace.


 

Why do you think it has taken until now for someone to provide a transparent comparison market for business foreign exchange?

In order to provide a truly transparent marketplace, we’ve had to overcome some real challenges. Some FCA-regulated brokers operate in the stone age, offering a phone-based service. Most don’t have APIs we can build on top of. Likewise, the ability to offer live rates (rather than guide rates) and to streamline the compliance and onboarding process for multi-broker aggregation has taken a lot of strategic negotiation with rate contributors, product development and investment.

However, it’s worth it. We passionately believe that in any industry where there is inefficiency, a marketplace will disrupt. We’ve seen it in flights and hotels, and now, with the launch of CurrencyTransfer.com – the ‘SkyScanner of Currency’ – international payments is simply next on the list.

The foreign exchange space has typically been very opaque and dominated by hidden fees and high transaction costs. In recent years, we’re seeing a huge explosion in FinTech startups who are looking to transform traditional personal and business banking services, and building a better customer experience than you would get from a bank or legacy broker direct.

What was the largest challenge you faced convincing currency specialists to give an honest account of their rates, or actually capturing streaming data?

Streaming data and breadth of choice is an ongoing challenge. Strategically, we want to build out the sell side of the marketplace, and there are a limited number of FCA authorized and regulated currency specialists who have the capability to contribute rates via API. Luckily, the ones who do are typically top tier and the brokers we want to aggregate rates from. Our infrastructure partnership with The Currency Cloud has certainly helped.

We also definitely had our work cut out telling currency specialists they need to effectively pull down their trousers, open their APIs and publish rates in competition on a third-party site. We do see a positive correlation, though, between being 10x cheaper, 10x faster, with a 10x slicker UI/UX, and a remarkably scalable business. The rate contributors, whilst kept 100% honest for our customers, get significant benefit, namely operational efficiency and client acquisition.

What technologies are involved in providing a real-time view of transfer rate from various providers?

Generally rates are pulled via a REST or SOAP API at set intervals, say every x minutes or hours. It’s also possible to get a ‘stream’ of rates, but this is generally used for speculative trading. We pull in the data via REST and SOAP from our providers when the user requests a quote.
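
In spirit, pulling a rate over REST at a set interval looks something like the hedged Java sketch below. The endpoint URL, parameters and response handling are hypothetical; real broker integrations are authenticated and vary by provider.

```java
// Poll an (assumed) rates endpoint once a minute; a user-triggered quote
// would simply call fetchQuote() directly.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RatePoller {
  private static final HttpClient HTTP = HttpClient.newBuilder()
      .connectTimeout(Duration.ofSeconds(5))
      .build();

  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> fetchQuote("GBP", "EUR"), 0, 1, TimeUnit.MINUTES);
  }

  static void fetchQuote(String from, String to) {
    // Hypothetical endpoint standing in for a broker's REST API
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.example-broker.com/v1/rates?from=" + from + "&to=" + to))
        .GET()
        .build();
    HTTP.sendAsync(request, HttpResponse.BodyHandlers.ofString())
        .thenAccept(response -> System.out.println(from + "/" + to + ": " + response.body()));
  }
}
```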

You combine a background in computer science and development with experience in growth, social media and UX. How have these three distinct skill areas complemented each other in your career?

Steve Jobs once said, “You can’t connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future.” I think that applies here. Take UX, for example: it’s all around us. Just noticing nuances in other apps (mobile and web) and taking time to consider and appreciate their creators’ work and decisions primes you for future UX considerations of your own.

Social media is trial and error – what do people react to best? It’s inherently psychological and the same thing applies to growth tactics.

Mixing all of these successfully into a cohesive story is the real trick, and your knowledge in these individual elements is quickly tested before you’re forced to learn new things.

These disciplines will be a lot more common, especially social media and growth, as it’s predicted there will be more freelancers in the future – so people will be naturally pushed to adapt, learn and compete for attention in those areas.

The rapid rise of data science has allowed startups to disrupt virtually every industry in one way or another with powerful value propositions. Do you see any other industries or areas where this is needed?

Everything will. One way I like to look at it is to imagine I’ve just woken up 30-50 years into the future. What does the world around me look like? The bedroom, kitchen, office, and so on.

I imagine windows will become screens that display data and adjust their dimness according to sunlight levels. Will paper still be popular for receipts and invoices? If not, what will replace them?

If I had to name 3 industries due for a refresh – I’d say:

  • Home appliances. More automation, more sensors.
  • Personal computing. Health trackers, wearable computing like Google Glass (but better ☺)
  • Travel. Self driving cars, already started. Tesla’s cars are more computer than mechanics. What can we build on top of them?

There are plenty more however – that doesn’t even scratch the surface!

Other than the service you offer, how are you using data to impact your business development?

We try to learn a lot from our customers’ behavior. What they do on the platform and how they interact with it is important for us to understand. Doing so can help us cut down the amount of time required to do what they commonly do on it. We don’t assume we know everything, so we try to let the data lead us in the right direction.

You’ve previously mentioned banks as your main competitor, has anyone else entered the scene with a similar offering to yours?

International money transfer is a very hot space at the moment, with a large amount of VC funding being ploughed in to clean up an inefficient and expensive industry. However, we haven’t seen anyone who offers live aggregation plus instant execution.

What are your goals for expansion and fundraising in 2015?

We would like to be live and trading with clients, corporate and private, virtually anywhere across the world by the end of the year. We’re actively working on integrations that will allow us to do so.

The business has been 100% bootstrapped to date by my Co-Founder Daniel and I. We are in regular conversation with top tier institutional investors, who are closely tracking our progress. Watch this space!



(Image credit: Francisco González, via Flickr)

Silver Spring Networks Goes into Agreement with Detectent for a Q1 Acquisition
https://dataconomy.ru/2015/01/09/silver-spring-networks-goes-into-agreement-with-detectent-for-a-q1-acquisition/
Fri, 09 Jan 2015 11:09:29 +0000

Silver Spring Networks, the Redwood City-based smart grid products innovator, is acquiring Detectent Inc., a privately held provider of utility data analytics solutions.

“As a recognized industry leader, Detectent is delivering innovative data analytics solutions to some of the leading U.S. utilities,” noted Scott Lang, Chairman, President and CEO, Silver Spring Networks.

“We are excited to enhance the Detectent analytics solutions with our SilverLink platform’s distributed intelligence and real-time data stream capabilities. We are also pleased to broaden our market reach by serving Detectent’s impressive customer lineup and extending these innovations to our global client base,” he further added.

With its SaaS subscription and software solutions, Detectent assists in ‘advanced metering infrastructure (AMI) and utility grid operations, ensure revenue protection all the while delivering enhanced customer engagement programs,’ explains the press release announcing the acquisition.

Detectent’s products currently analyze data from over 25 million water, gas, and electricity meters, in both AMI and legacy environments, across 20 leading utility customers in the United States.

The acquisition of the Escondido, CA-based outfit was announced on Wednesday. The two have been partners since May last year, with Detectent’s solutions offered through Silver Spring’s SilverLink™ Sensor Network apps catalogue.

“Energy theft is a multi-billion dollar challenge for the utility industry. Coupled with operational improvements and enhanced customer engagement, there is a growing opportunity to unlock billions of dollars in value by applying data analytics solutions,“ said Michael Madrazo, CEO of Detectent, about his expectations from the acquisition.

Excluding deal-related charges, Silver Spring expects the transaction to be neutral to non-GAAP earnings in 2015, and accretive beginning in 2016. The transaction is expected to close in the first quarter of 2015, subject to customary closing conditions.


(Image credit: Silver Spring Networks)

Three Key New Features from Aerospike’s Extensive Upgrade
https://dataconomy.ru/2014/12/15/three-key-new-features-from-aerospikes-extensive-upgrade/
Mon, 15 Dec 2014 16:16:04 +0000

Back in August, Aerospike announced they were open-sourcing their signature platform. At the beginning of December, they were back again with news of a record-breaking Google Compute Engine Benchmark. Now, to round off what has been an exceptional year for the flash-optimised, in-memory NoSQL database, they’ve released a whole raft of updates to their database.

The full announcement is, in a word, extensive – you can read the whole developer’s Christmas wishlist of updates here. We recently discussed the announcements with Aerospike’s CTO Brian Bulkowski; he highlighted three key developments which he believes are instrumental in fuelling the real-time applications of tomorrow.

1. A Whole Host of New Clients

One of the key headlines from the announcement is new clients for Python, PHP, Go and Ruby, plus a whole host of upgrades for Aerospike’s Java, Node.js, .NET / C# clients. Bulkowski highlighted that the vast range of clients was a demonstration of Aerospike’s commitment to their community.

“In the polyglot language world, you can’t just have a world-class database; you have to be in the community, you have to have different connectors, you have to be giving back to your community,” he explained. “Monica [Pal, Aerospike CMO]’s catchphrase for this release is ‘It’s about what we’re giving to developers, and what developers are giving to us’. It’s about having a great Ruby client, having a great PHP client, having a great Go client – we’re having a huge amount of success with Go, which is a little underappreciated as a language. On paper, it looks a little like a laundry list – but what it means to our communities is that they know we have them covered.”
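
To give a flavour of what those clients look like in practice, here is a short, hedged sketch using the Aerospike Java client: write a record, read it back. The namespace, set and bin names are illustrative assumptions, not part of the announcement.

```java
// Minimal put/get round trip against a local Aerospike node.
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class AerospikeQuickstart {
  public static void main(String[] args) {
    AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

    Key key = new Key("test", "users", "user-42");
    client.put(null, key,
        new Bin("name", "Ada"),
        new Bin("visits", 1));

    Record record = client.get(null, key);
    System.out.println(record.getString("name") + " visits=" + record.getInt("visits"));

    client.close();
  }
}
```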

2. Hadoop Integration

The integration of InputFormat allows Hadoop tools to analyse data stored in Aerospike without first performing ETL into HDFS. Aerospike’s Indexed MapReduce also allows you to analyse specific subsets of data, eliminating the need for the large data lake analysis typically offered by HDFS.

As Bulkowski explained, “The nature of analytics is, once you’ve written your analysis, you don’t want to have to do it again – it’s troublesome, it’s error-prone – you want to use the tools you have. By backing the Hadoop tooling – even though we know our own tooling is better, faster and more capable – being compatible when it comes to analytics is really a benefit. Our key value comes from being row-oriented, rather than a streaming store like HDFS – so what we looked at was, what can we give to an analytics community as a row-oriented database? The benefit is not having to ETL – being able to run those jobs directly on your database in a safe and sane fashion.”

3. Docker Containers

As we have previously reported, taking a developer-centric approach in 2014 has become synonymous with Docker support. Bulkowski was certainly excited by the rapid development of containerisation, but was keen to stress the technology has a long way to go before it reaches maturation. “The containerisation of both the data centres and also the cloud are still in their infancy,” he remarked. “From a technical perspective, I think it’s a great idea. Containerisation definitely has the potential to reset the game in terms of high-IO applications like us. In terms of being in the community, having a Docker image that people can start playing with is great- but it’s still early days.”

These three headlines only really scratch the surface of this extensive announcement; Aerospike have also announced enhancements to their enterprise security, a raft of storage and performance improvements, and 33 key community contributions – just three months after going open source. I would urge you to give the full list a read here, and to keep your eyes peeled for further Aerospike innovation in 2015.


(Image credit: Aerospike)

Real Time + Big Data = Move Elephant Move!
https://dataconomy.ru/2014/12/09/real-time-big-data-move-elephant-move/
Tue, 09 Dec 2014 13:46:36 +0000

Elephants in a Storm.

Oil tankers take ages to move. Even when there’s a storm ahead, complex machinery, pulleys, engines and switches have to be engaged to make the smallest change in direction. Organizations and Big Data are the same: they use masses of compute power to gain business insight in order to implement a modest change in strategy.

If you went into work today and one of the first things you did was look at overnight reports, chances are you work in an oil tanker organisation. Your servers churned away overnight and did all of those filters, aggregations and grouping operations that show you how many sales you did over the last day or week. Whilst you are grazing on your morning toast and looking at your KPI charts, think of this… just because you have terabytes of data, why can’t you analyse and respond to it in real time?

I Want it Now!

When organisations first encountered Big Data, the main driver was being able to process masses of data in a scaled-out and resilient way. This by and large went well. With frameworks like Hadoop, MPP databases and distributed filesystems you could process billions of lines of log files, sales transactions and multi-dimensional data. You could aggregate Twitter feeds with sales data and traffic information for route optimisation, and even churn through tonnes of open data provided by governments.

But the inherent architecture was the main limitation. The programming model relied on MapReduce working on “chunks” of larger files, and these technologies worked primarily by batching operations and making use of scheduled workflows. Load some large files in, clean and transform them, do some analytics and present at 9AM. Just in time for morning bagels.

The newer generation of Big Data technologies has pushed this paradigm out of the window. Why? Trades happen in real time, customers respond to trends on a minute and hourly basis, advertising is viral and reaches tipping points within seconds, traffic patterns change with the smallest crash or roadworks, betting odds spiral on the mere suggestion of a red card.

So what does the next generation of technology bring to the table and what makes it respond faster?

Full Stream Ahead

The idea is simple: rather than create technologies that work on bulk data, stream or load data in tiny chunks as it comes in. This means that any calculations done on the data happen in increments but repeatedly. Algorithms are designed to update statistics and analytics in increments, which are then presented as a complete picture. One example of this is using Spark + Cassandra. Spark micro-batches and streams data into the Cassandra NoSQL warehouse, which can index and allow queries on data as it comes in. At the same time you can pull out data from Cassandra and use Spark to apply machine learning to updated streaming results.
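
As a rough illustration of the micro-batch pattern, the hedged Java sketch below processes one-second batches and produces incremental counts; the Cassandra write step described above is omitted (in practice the spark-cassandra-connector would handle it), and the socket source and field layout are assumptions.

```java
// Spark Streaming: incremental counts per event type in one-second micro-batches.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class MicroBatchCounts {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("micro-batch-counts");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Assumed input: "eventType,payload" lines on a socket; a real pipeline
    // would use a Kafka or other receiver here.
    JavaDStream<String> events = ssc.socketTextStream("localhost", 9999);

    JavaPairDStream<String, Long> countsPerBatch = events
        .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1L))
        .reduceByKey((a, b) -> a + b);

    // Each micro-batch's counts would be upserted into Cassandra at this point.
    countsPerBatch.print();

    ssc.start();
    ssc.awaitTermination();
  }
}
```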

Sometimes this works well with simple datasets of consistently defined data like tweets, trades, clicks and so on. The problem becomes more complex when you want to do cleverer operations like recommendations, machine learning or large operations on matrices. Fear not though, this is what the clever people at Apache Spark are working on. They make use of distributed data models so that complex operations can happen in parallel on a cluster, all transparent to the programmer and business user.

Furthermore, these technologies favour in-memory techniques for processing data, often loading data into memory only once, as late as possible and only when it is needed, thereby allowing real-time throughput.

So What? How Does This Help My Business?

Simple: respond faster. Especially where your business domain is dependent on masses of data coming in from multiple sources and in multiple formats. What’s more, current technologies let you implement a series of real-time alerting mechanisms that let you take action based on conditions. There are even mechanisms to conduct real-time what-if scenarios that help you plan your strategy. Demand and Supply. While it happens.

Examples

Tier 1 Banking

There cannot be a better example of a real-time Big Data problem than finance and Tier 1 banking. Banks are exposed to market risk (Value at Risk – VaR, Basel III, expected shortfall) resulting from billions of trades on equities, commodities, CFDs and such. In order for a bank to be responsive to its exposure, calculations on market risk must happen frequently, taking into account sources from various front office trading systems and desks as well as external feeds.

Working with an enterprise MPP vendor we were able to perform in-database calculations in R, C++, Matlab and other mathematical tools to calculate risk from billion row datasets within a couple of minutes, often seconds – this compared to VaR/Basel III calculations that currently happen twice daily! The new approach streams the data into a transactional warehouse, runs maths on ingested data every few minutes and results are constantly streamed to reports using “never ending” queries.

What’s more, when breaches occur (breaches in agreed levels of risk), we now have the capability to run playback on real time data to see what caused the breach, how long the breach lasted and when it recovered. These Big Data technologies are resilient, fault tolerant and integrate with Hadoop as a backing store should there be any need for disaster recovery or historical analysis.

Real Time Digital Ad Buying

We worked with one client who was interested in real time ad buying and selling. This meant that any time there was an undervalued ad slot available on Google, Facebook, Outbrain and other ad serving platforms, analysts were able to detect, optimise and engage with their target audience faster than any of their competitors. In order to do this, multiple sources of data had to be ingested, transformed, cleaned up and standardised in order for any analyst to respond to the market.

We made use of Apache Spark and Kafka to implement a real-time messaging system for distributing billions of clicks, impressions and activities across multiple client sites. Within a few hundred milliseconds, we were able to evaluate TV advertising response by monitoring Twitter hashtags and Facebook posts in real time.
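
The producing side of such a pipeline is simple in principle; the hedged Java sketch below publishes a click/impression event to Kafka for downstream Spark consumers. The topic name, broker addresses and message format are assumptions, not details of the client project.

```java
// Publish one ad event to a Kafka topic; Spark Streaming jobs consume it downstream.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickEventPublisher {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092,broker2:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Key = site id (keeps one site's events ordered on a single partition);
      // value = a small JSON payload describing the click or impression.
      producer.send(new ProducerRecord<>(
          "ad-events", "site-123",
          "{\"type\":\"impression\",\"campaign\":\"tv-spot-7\",\"ts\":1418131200000}"));
    }
  }
}
```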

Logistics at Scale

Logistics companies need to optimise: optimal routes, maximising workloads and reducing fuel costs and waste. No matter how much planning you do with Google Maps, adapting a fleet of vehicles to the current demand, traffic and weather conditions requires on-the-spot processing of structured (relational) and unstructured data. Using multi-server geolocation-aware databases, we are now able to monitor the positions of millions of vehicles and action driver changes and alerts programmatically. Using frameworks like node.js and D3JS we are able to visualise millions of points (of planes, trains and automobiles) as they move through the world.

Gambling and Gaming.

Who doesn’t like a bit of Candy Crush? Or a quick game of Bingo? Whatever your favourite game is, bookmakers and gaming platforms need to be able to monitor their players’ payouts and keep track of odds offered and taken. Using distributed in-memory clusters we are able to read input from multiple gaming platforms and perform real-time odds calculations using an enterprise version of R that scales to hundreds of nodes. Tied in with a distributed in-memory cache, the central exchange is able to query every game status within tens of milliseconds to build a composite of the gaming market.

With the use of on-device mobile databases we also now have the capability to sync millions of devices to a central MPP database. That way, if the odds change, one update by the programmer or analyst automatically propagates to each registered mobile device.

The Real Promise is Here.

The lessons of Big Data have been learnt. Whereas before it took 30 seconds just to start a MapReduce job on the Hadoop platform, technologists are now creating responsive architectures that can ingest, process and perform complex mathematics on distributed architectures using in-memory MPP databases.

Due to the complexity of real-world Big Data, traditional methods for reporting and responding are becoming obsolete. If your daily report chart is not bobbing up and down with the latest data as you watch it, chances are you are responding to an old storm. Your oil tanker is slowly sloshing around, hoping to reach the shore faster than your competitors. It’s time to get real-time Big Data.


Dev Lakhani has a background in Software Engineering and Computational Statistics and is a founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion. Dev has been actively working with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon community.

(Image credit: Orihaus)

Storm Becomes an Apache Top-Level Project
https://dataconomy.ru/2014/09/24/storm-becomes-an-apache-top-level-project/
Wed, 24 Sep 2014 07:52:42 +0000

The Apache Software Foundation has declared Apache Storm a top-level project, marking a major step towards its maturity as a technology.

Designed by the team at BackType/Twitter to analyze the tweet stream in real time, Storm became an official Apache incubation project in September 2013. Ever since, Hortonworks engineering has been working to integrate Storm with Hadoop as a scalable, fault-tolerant system that can process data streams in real time.
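
For readers new to Storm, processing is expressed as a topology of spouts (stream sources) and bolts (processing steps). The hedged sketch below shows that model with a stand-in spout emitting random hashtags; the names are illustrative, and the package layout follows the later Apache releases (pre-graduation versions used the backtype.storm packages instead).

```java
// A tiny Storm topology: a fake spout feeding a running-count bolt, run in local mode.
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class HashtagCountTopology {

  // Stand-in spout emitting random hashtags; a real deployment would read from
  // Kafka or the Twitter firehose instead.
  public static class FakeTweetSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Random random = new Random();
    private final String[] tags = {"#bigdata", "#storm", "#hadoop", "#realtime"};

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(10);  // throttle the fake stream a little
      collector.emit(new Values(tags[random.nextInt(tags.length)]));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("hashtag"));
    }
  }

  // Keeps a running count per hashtag and emits the updated total downstream.
  public static class HashtagCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();

    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String tag = tuple.getStringByField("hashtag");
      long count = counts.merge(tag, 1L, Long::sum);
      collector.emit(new Values(tag, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("hashtag", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("tweets", new FakeTweetSpout());
    builder.setBolt("counts", new HashtagCountBolt(), 4)  // 4 parallel executors
           .shuffleGrouping("tweets");

    // Local mode (Storm 1.x API) for demonstration; production uses StormSubmitter.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("hashtag-counts", new Config(), builder.createTopology());
    Thread.sleep(10_000);
    cluster.shutdown();
  }
}
```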

Writing for Hortonworks, Taylor Goetz, Chair of the Project Management Committee (PMC) for Storm, explains what this ‘Graduation’ really signifies, “Ultimately, an open source project is only as good as the community that supports it. The community improves the quality of the project and adds features needed by end users.

“The ASF has been established to do exactly this: to set governance frameworks and build communities around open source software projects. Further, the ASF believes that the real life and future of a project lies in the vibrancy of its community of developers and users,” he wrote.

It is playing a key role in the real-time data processing architecture of various companies such as Yahoo!, Spotify, Cisco, Twitter, Xerox PARC, and WebMD, who have deployed Storm in production.

Utilized with other Hadoop components and YARN as the architectural center that manages resources across all of those components, writes Goetz, “Storm represents an integral part of any big data strategy.”

Backed by a sustainable developer community and the governance framework and processes of the ASF, Storm’s graduation to a top level project gives the users of Apache Hadoop more reason to adopt Apache Storm.

Read more here.


(Image credit: Flickr)
