Amazon Kinesis vs. Apache Kafka For Big Data Analysis

Data processing today is done in the form of pipelines, which include steps such as aggregation, sanitization and filtering, and finally generate insights by applying various statistical models. Amazon Kinesis is a platform for building pipelines that handle streaming data at the scale of terabytes per hour. Parts of the Kinesis platform are a direct competitor to the Apache Kafka project for big data analysis. The platform is divided into three separate products: Firehose, Streams, and Analytics. All three solve different problems, as discussed below:

How do you load huge amounts of data into the pipeline?

This problem is solved by Firehose, the entry point of data into the AWS ecosystem. Kafka does not have an equivalent to Firehose; this product is tied more closely to Amazon’s other offerings. Firehose can load incoming data from various sources, buffer it into larger chunks and forward it to other AWS services such as S3, Redshift, and Lambda. Firehose also solves backpressure problems. Backpressure arises when the input buffer of a service cannot keep up with the output buffer of another service that feeds data into it. Firehose automatically scales up or down to match its throughput to the data it has to work with. It can also batch, compress and encrypt the data before feeding it into other AWS services. Amazon recently added several other types of data transformations that run before the data is loaded.
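
To make the loading step concrete, here is a minimal sketch of pushing records into a Firehose delivery stream with boto3, the AWS SDK for Python. The stream name, region and record fields are placeholders invented for illustration, not anything prescribed above.

```python
import json
import boto3

# Minimal sketch: write one record into a Firehose delivery stream.
# "example-clickstream" and the event fields are hypothetical.
firehose = boto3.client("firehose", region_name="us-east-1")

def send_event(event: dict) -> None:
    firehose.put_record(
        DeliveryStreamName="example-clickstream",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"user_id": 42, "action": "page_view", "path": "/pricing"})
```

From there, Firehose takes care of buffering, batching and delivery to destinations like S3 or Redshift; for higher volumes, put_record_batch sends many records in a single call.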

How do you direct live input streams into the pipeline?

Amazon Kinesis Streams is very similar to Kafka in that it is built to work with live input streams. It stores the streams that are sent to it, and those streams can then be consumed by custom applications written with the Kinesis Client Library.

Kafka “topics” are roughly equivalent to Kinesis Streams. Both represent an ordered and immutable list of messages. Each message has a unique identifier and is appended to the list as it arrives. Kafka messages can be retrieved from the last known message onwards using an offset. To get the same functionality in Kinesis, the user has to build it themselves using the API and message sequence numbers.
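
As a hedged illustration of that do-it-yourself resume logic on the Kinesis side, the sketch below uses boto3 to fetch the records that arrived after a sequence number the application has checkpointed somewhere itself; the stream name, shard id and sequence number are assumptions.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def read_after(stream_name: str, shard_id: str, last_sequence_number: str):
    """Fetch records that arrived after a previously processed sequence number."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="AFTER_SEQUENCE_NUMBER",
        StartingSequenceNumber=last_sequence_number,
    )["ShardIterator"]
    return kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]

# Placeholder values; the sequence number would come from your own checkpoint store.
# read_after("example-stream", "shardId-000000000000", "49590338271490...")
```

The Kinesis Client Library automates exactly this kind of checkpointing, which is why it is the recommended way to build consumers.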

Topics in Kafka consist of one or more partitions, which are similar to the concept of shards in Kinesis Streams. Kafka is scaled by watching for hot partitions and adding partitions as needed; Kinesis Streams are scaled by splitting and merging shards. Because Kafka is self-managed rather than a hosted service, it carries extra overhead compared to Kinesis: managing the clusters, setting up monitoring and alerting, updating packages, tuning, and handling failover. The actual cost comparison, however, depends on other factors such as expected payload size, flow density and retention period. Kafka can be fine-tuned to reach latencies below one second, while Kinesis Streams typically have a latency of 1-3 seconds. Kinesis Streams are better suited when payload sizes are large and throughput is high, and latency matters less.
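
To illustrate the Kinesis side of that scaling story, a stream’s shard count can be changed with a single API call, and Kinesis performs the underlying shard splits or merges itself. The sketch below uses placeholder values for the stream name and target count; it shows the mechanism, not a tuning recommendation.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Reshard a stream: Kinesis splits (or merges) shards behind the scenes
# to reach the requested count. "example-stream" and the count are placeholders.
kinesis.update_shard_count(
    StreamName="example-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```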

Kafka provides durability by replicating data over multiple broker nodes, while Kinesis Streams does it by replicating the data across multiple availability zones. The user is not required to configure the replication strategy with Amazon’s offering, so they can focus more on the big data analysis itself.

A good use case for streaming data is analyzing clickstreams on websites and giving realtime recommendations based on the insights gained. Streams are very useful in financial trading as well. Fintech requires very high-throughput data processing capabilities that can generate insights to find patterns and exceptions on the fly. This can be used to build automated trading systems that make decisions from realtime data. Financial models rely on the analysis of previous data, and Streams lets the user update those models as the data arrives. The same fine-grained analysis is also required for generating realtime dashboards that monitor activity and uptime of critical services across various businesses. Streams are also useful in realtime billing and metrics systems operating at scales where accounting for individual unit consumption would otherwise not be feasible; anomalies can be spotted easily this way.

But what about sending the data to Streams? Kinesis Streams can take in input data from thousands of endpoints at once. The concept of producers is the same as in Kafka, although the implementations differ. The producer accepts records from higher-level applications, batches them, splits records per partition or shard, and forwards them to Kinesis Streams or Kafka. Some organizations might require more fine-grained control over the data producer. In that case a data aggregator like Segment can be used to pull in data from various sources such as mobile apps and web apps using the Mixpanel integration and pre-process it before sending it onward. Segment has an Amazon Kinesis plugin and a Heroku Kafka plugin for this specific use case.
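
For comparison, a bare-bones Kafka producer (using the kafka-python client) shows the same batching and partition-by-key responsibilities described above; the broker address, topic and payload are placeholders.

```python
import json
from kafka import KafkaProducer  # kafka-python package

# Sketch of a producer that batches records and lets the client route them
# to a partition based on the key. Broker, topic and payload are hypothetical.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    batch_size=16384,  # bytes buffered per partition before a send
    linger_ms=50,      # wait briefly so batches can fill up
)

producer.send("clickstream", key="user-42", value={"action": "page_view"})
producer.flush()
```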

How do you cherry-pick data from the pipeline (and run long-running queries)?

Kafka sits at the ingestion stage of the data processing pipeline and has no built-in analytics abilities. Amazon Kinesis Analytics takes its input from Firehose (data is injected once loaded) or from Streams (data is injected as it arrives). SQL queries can be run against these data sources to filter out the relevant data. The results can be processed further with an AWS service like Lambda, and the generated insights can then be routed to other AWS services such as DynamoDB for storage, or to services like Slack for notifications. Analytics supports streaming queries, so output is produced in realtime instead of arriving in bulk once a job finishes, as with traditional SQL queries.
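
As a rough sketch of that downstream processing step, here is what a Lambda function consuming records from a Kinesis stream might look like. The outer event shape (base64-encoded data under Records) is the standard Kinesis event source format for Lambda; the payload fields and the notification logic are purely hypothetical.

```python
import base64
import json

def handler(event, context):
    """Process a batch of Kinesis records delivered to Lambda."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("error_rate", 0) > 0.05:  # hypothetical filtered metric
            notify(payload)

def notify(payload):
    # Placeholder: in practice this might write to DynamoDB or post to Slack.
    print("anomaly detected:", payload)
```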
Pipeline-based data processing is becoming the norm in big data analysis. Because of the modular design of pipeline-based systems, additional processing points can easily be added to the same pipeline, or in parallel processing units to form non-linear pipelines. Ingestion systems like Kinesis Streams and Kafka provide a layer of asynchronous data flow between producers and consumers. Kinesis Firehose helps buffer and inject huge amounts of data into these systems, and Kinesis Analytics is finally used to draw insights. Choosing between Kinesis Streams and Kafka is an important decision, and it comes down to how well each matches the desired characteristics of your system.

 


The Year Data Streaming Becomes Mainstream

The rise of Big Data, and the industry’s IoT craze, are driving huge demand for streaming data analytics. There’s an impediment though: streaming data is hard to work with. 2016 will heighten the demand, and also the tension around the difficulty. It may also force a solution.

In the big-data era, businesses yearn to be data-driven in their decision-making processes. Rather than act on a hunch, they’d prefer to observe and measure, then make decisions accordingly. That’s a laudable standard – but it raises the bar. A culture driven by data, in turn, drives a desire for real-time, streaming data.

And if culture didn’t drive that desire, technology trends would. Analyzing web logs in real time can help drive multi-channel marketing initiatives with much more immediate impact. The extremely hyped IoT – the Internet of Things – is all about the observation of ephemeral sensor readings, the quintessential streaming data. But the value of that data is ephemeral as well, making real-time, streaming analytics a necessity. In the consumer packaged goods world, for example, using sensors to monitor temperature and other physical attributes of manufacturing equipment helps ensure things are running smoothly. These readings can also be fed into predictive models to prevent breakdowns before they happen and keep the assembly lines running.

It’s exciting. The use cases and the demand for streaming are in place, and 2016 is poised to be the year when streaming analytics crosses over from niche to mainstream.

Quid pro quo?

The rewards of being able to analyze data in real-time are huge, but the effort often has been as well. Working with streaming data isn’t like working with data at rest. It involves a paradigm shift, a different skill set, a different outlook, a change of mindset.

To understand why, we need only rewind a bit, to real-time data technology before the Big Data era. That category of software, known as complex event processing, or CEP, set the precedent for streaming data and difficulty going hand-in-hand.

Back then, data was smaller and accepted latencies were higher. That meant CEP was niche, allowing it to be difficult, expensive and inconsistent with other data technology.

The schism

Querying data at rest involves an architecture and approach that has been with us for more than 20 years: find a connector/driver/provider that can talk to your database, feed it a SQL query, and get your results back as a set of rows and columns. This is a pattern with which virtually every technologist is familiar. It’s a universal, standard, shared concept.
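
For readers who want that pattern spelled out, a few lines of Python against SQLite capture the whole data-at-rest workflow: connect through a driver, issue a SQL query, get rows and columns back. The table and its contents are invented for illustration.

```python
import sqlite3

# The classic data-at-rest pattern: driver, SQL query, rows and columns.
# Table and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER, path TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [(1, "/home"), (1, "/pricing"), (2, "/home")])

for user_id, visits in conn.execute(
    "SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id"
):
    print(user_id, visits)
conn.close()
```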

But with streaming data, you need to work with data structured as “messages.” Messaging systems work on the premise of “publishing” small bits of data to queues, to which other systems can “subscribe.” They are thus often referred to as pub/sub message buses. Message buses don’t work like databases. And queues, publishers and subscribers don’t work like tables, schemas and client drivers. The mechanics are all different. Without fighting over which model is better, the fact is that the message model is completely orthogonal to the SQL query/driver/database one that has been around for so long. And orthogonality inhibits adoption.
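
To see how different the mechanics are, compare the previous snippet with a pub/sub consumer. The sketch below uses the kafka-python client as one concrete example of a message bus; the broker, topic and group names are placeholders.

```python
from kafka import KafkaConsumer  # kafka-python; one concrete pub/sub client

# A subscriber: it joins a topic and receives messages as publishers emit them.
# There is no query, no schema, no result set -- just a stream of messages.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    group_id="dashboard",
    auto_offset_reset="earliest",
)

for message in consumer:  # blocks, yielding each message as it arrives
    print(message.offset, message.value)
```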

An imperfect unification

It gets worse, because despite the distinct models for working with streaming data and data at rest, things work best when both types of data can be used together. Cross-referencing one with the other adds value to each, and the whole is proverbially greater than the sum of the parts. A model based on blending the two, called the “Lambda Architecture,” is gathering steam as well.

But Lambda is premised on accepting the very mismatch between working with streaming and non-streaming data. Lambda has caved to the accident of CEP’s history: that queues and messages are structures that consumers of streaming data must explicitly contend with. Giving in to that demand makes things difficult, and while one might be tempted into a no-pain-no-gain outlook on this, the reality is it doesn’t need to be this way.

Will it blend?

The “physics” of streaming and conventional data are different. But that doesn’t mean they can’t be looked at through the same lens. The way we work with data at rest can, and should, be used as a metaphor – an abstraction layer – for the way we work with streaming data. We can still have database drivers. And IoT devices can present an interface that makes their data streams look like tables, based on a moving window of time.
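
A toy sketch of that “stream as a moving table” idea: keep only the readings from the last N seconds and expose them as rows. Everything here (class name, fields, window length) is invented to illustrate the abstraction, not any particular product’s API.

```python
import time
from collections import deque

class WindowedStream:
    """Expose the last `window_seconds` of a stream as table-like rows."""

    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        self._rows = deque()  # (timestamp, row) pairs, oldest first

    def append(self, row: dict) -> None:
        self._rows.append((time.time(), row))

    def rows(self) -> list:
        cutoff = time.time() - self.window_seconds
        while self._rows and self._rows[0][0] < cutoff:
            self._rows.popleft()  # expire readings that fell out of the window
        return [row for _, row in self._rows]

window = WindowedStream(window_seconds=10)
window.append({"sensor": "temp-1", "celsius": 21.4})
print(window.rows())  # queryable like a small, constantly moving table
```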

Doing so would allow conventional BI and data discovery tools to talk to streaming data sources without being massively re-tooled. That’s because they’d be “fooled” into thinking the streams were conventional databases. Eventually, these tools wouldn’t be fooled anymore, and they’d be updated to accommodate streaming concepts, like the length of the time window, and maybe the type of IoT device and the protocol it used.

A more enlightened path

But evolutionary improvements are very different from, and much better than, requiring a completely separate set of tools, and a context switch, every time analytics shifts from data at rest to streaming data. This unification of streaming and non-streaming should begin to materialize this year. Customer demand and customers’ constraints will combine into a forcing function.

When the two data genres come together, the real power of streaming and IoT will click into place. Streaming data will come to the customer, instead of the other way around. Friction will be eliminated, and casual experimentation with streaming data will begin.

Such facilitation is a prerequisite to any data technology’s broad adoption. This is already happening and will continue to happen, and all analytics will benefit, as will those conducting the analyses.

image credit: Marcin Ignac

