Stream Processing – Dataconomy https://dataconomy.ru Bridging the gap between technology and business Mon, 13 Mar 2017 11:37:29 +0000 en-US hourly 1 https://dataconomy.ru/wp-content/uploads/2022/12/cropped-DC-logo-emblem_multicolor-32x32.png Stream Processing – Dataconomy https://dataconomy.ru 32 32 If you care about Big Data, you care about Stream Processing https://dataconomy.ru/2017/03/13/care-big-data-care-stream-processing/ https://dataconomy.ru/2017/03/13/care-big-data-care-stream-processing/#respond Mon, 13 Mar 2017 11:37:29 +0000 https://dataconomy.ru/?p=17495 As the scale of data grows across organizations with terabytes and petabytes coming into systems every day, running ad hoc queries across the entire dataset to generate important metrics and intelligence is no longer feasible. Once the quantum of data crosses a threshold, even simple questions such as what is the distribution of request latencies […]]]>

As the scale of data grows across organizations with terabytes and petabytes coming into systems every day, running ad hoc queries across the entire dataset to generate important metrics and intelligence is no longer feasible. Once the quantum of data crosses a threshold, even simple questions such as what is the distribution of request latencies becomes infuriatingly slow with the usual sql and database model. Imagine running such a request on the Facebook request data, the query would take days to complete.

Stream processing offers the solution for anyone looking to manage an ever increasing volume of data.

What is stream processing

Stream processing is the handling of units of data on a record-by-record basis or over sliding time windows. This methodology limits the amount of data processed and offers a very contrasting way of looking at the data and the analytics. Complexity is traded in for speed. The nature of the analytics with stream processing are usually filtering, correlations, sampling and aggregations over the data.

Examples of the diverse applications of stream processing include:

  • Alerting on sensor data on IoT devices
  • Log analysis and statistics on web traffic
  • Risk analysis with movements of money and orders in Fintech
  • Click stream analytics on Ad Networks

 

Comparison to Batch Processing:

Batch processing Stream processing
Scope Queries over all or most of the data in the dataset. Queries or processing over data within a rolling time window, or on just the most recent data record.
Size Large numbers of records Individual records or micro batches consisting of a few records.
Performance Latencies in minutes to hours. Requires latency in the order of seconds or milliseconds.
Analyses Complex analytics. Simple response functions, aggregates, and rolling metrics.


Be truly real-time

Stream processing is the only way applications can truly be real time with their data. As soon as we talk about aggregations over the universe of data, any batch process job will continue to take longer as the amount of data increases. If you have an application that relies on real time metrics, without a stream processing solution, your view of the world is always going to be delayed.

Flow data over queries

To harness the power of stream processing, engineers and data scientists need to evolve their model from one where they run queries over their data to one where the data runs over the queries. This is a powerful shift in the way to look at the data applications.

The key shift is from tables to streams and doing operations on those streams such as

  • filtering
  • joining
  • aggregating
  • windowing

Lets compare the contrast in the batch processing world vs the stream processing world in the example application of Calculating the distribution of number of user sessions per user in a day on the website.

In batch processing: a complicated query would define what events to include in a user session, what the maximum delay between events should be for a session and computing a unique count over them. Since events in tables can occur out of order, even if a new table is maintained   every record in the table would need to be traversed for an answer

In stream processing: A stream of events would be simply be filtered to produce a stream of relevant events, this stream would then be windowed over a delay window to produce a stream of unique sessions. Finally this stream would be aggregated upon for unique counts per user with the relevant number simply being appended to a key value store.

As this example demonstrates, a less computationally intense and more real-time answer can be obtained simply be switching the way the data is perceived

Easier than ever before

Managing stream processing applications is not easy with all the moving pieces of managing issues such as fault-tolerance, partitioning and scaling and durability of the data. Over the last 7 years a lot of systems have emerged; many of which are open source and handle a lot of these issues entirely. This leaves the developer to only worry about implementing the application logic.

Some of the most exciting examples of have only launched in the last couple of years: Spark Streaming (2015), Apache Beam (2016) and Kafka Streams (2016). A full comparison of the various popular options are in figure 2.

Figure2

Any person developing applications with big data should keep the models of stream processing their mind as they choose the architecture and stack for their system. While stream processing is not a panacea for all the tasks around large scale data processing, it offers a new and important way to look at data that allows scalable real-time

 

Credits:

Figure 1: Information obtained from AWS Documentation

Figure 2: Image Courtesy: Ian Hellström on Twitter

 

Like this article? Subscribe to our weekly newsletter to never miss out!

]]>
https://dataconomy.ru/2017/03/13/care-big-data-care-stream-processing/feed/ 0
Stream Processing Myths Debunked https://dataconomy.ru/2017/02/03/stream-processing-myths-debunked/ https://dataconomy.ru/2017/02/03/stream-processing-myths-debunked/#comments Fri, 03 Feb 2017 09:00:48 +0000 https://dataconomy.ru/?p=17312 This post appeared originally in the dataArtisans blog Six Common Streaming Misconceptions Needless to say, we here at data Artisans spend a lot of time thinking about stream processing. Even cooler: we spend a lot of time helping others think about stream processing and how to apply streaming to data problems in their organizations. A […]]]>

This post appeared originally in the dataArtisans blog

Six Common Streaming Misconceptions

Needless to say, we here at data Artisans spend a lot of time thinking about stream processing. Even cooler: we spend a lot of time helping others think about stream processing and how to apply streaming to data problems in their organizations.

A good first step in this process is understanding misconceptions about the modern stream processing space (and as a rapidly-changing space high in its hype cycle, there are many misconceptions worth talking about).

We’ve selected six of them to walk through in this post, and since Apache Flink® is the open-source stream processing framework that we’re most familiar with, we’ll provide examples in the context of Flink.

Myth 1: There’s no streaming without batch (the Lambda Architecture)
Myth 2: Latency and Throughput: Choose One
Myth 3: Micro-batching means better throughput
Myth 4: Exactly once? Completely impossible.
Myth 5: Streaming only applies to “real-time”
Myth 6: So what? Streaming is too hard anyway.

Myth 1: There’s no streaming without batch (the Lambda Architecture)

The “Lambda Architecture” was a useful and popular design pattern in the early days of Apache Storm and stream processing in general. The architecture consists of a “fast stream layer” that is augmented by a “batch layer”.

Lambda-Architecture-Diagram

The reason for these two separate layers was that stream processing in the Lambda Architecture could only compute approximate results (for example, if there was a failure, the results could not be trusted) and could handle only a relatively low volume of events.

While these limitations existed in early versions of Apache Storm, they are no longer relevant in modern open-source stream processing–many stream processing frameworks are now fault tolerant, producing accurate results under failures, and are still capable of high-throughput computing. And so there’s no need to maintain a multi-layer architecture to for the sake of both “fast” and “accurate” results. A modern stream processor (such as Flink) gets you both.

The good news: we don’t hear much about the Lambda Architecture any more–a sign that the stream processing space is maturing.

Myth 2: Latency and Throughput: Choose One

Early open-source stream processors were classified as either “high-throughput” or “low-latency”, and so the assumption has stuck that “a lot of data, quickly” isn’t an option in open-source streaming.

But Flink (and potentially others, too) offers both high-throughput and low latency; here’s one example of results from a benchmark to put some hard numbers on the assertion.

Let us examine this from a fundamental viewpoint, in particular at the hardware level, and let us consider a stream processing pipeline that is network-bottlenecked (which is the case, at least, for many pipelines we see that are using Flink). At the hardware level, there is no reason for such a tradeoff to exist. The network’s capacity is ultimately what dictates both the maximum attainable throughput and the lowest attainable latency.

A well-engineered software system will achieve the physical limits allowed by the network and not introduce bottlenecks itself. While there is always room to optimize the performance of Flink and bring it closer to what is attainable by hardware, at this point, Flink has been shown to handle 10s of millions of events per second in a 10-node cluster, scale to 1000s of nodes, and at the same time achieve latency in the tens of milliseconds. In our experience, this level of performance is more than sufficient for most practical deployments.

Myth 3: Micro-batching means better throughput

There’s another side to the performance discussion, and first, we should clarify two often-confused terms:

Micro-batching
An execution and programming model for data processing that is built on top of a traditional batch model. The “technique allows a process or task to treat a stream as a sequence of small batches or chunks of data.”

Buffering
An optimization technique for accessing networks, disks, caches, etc. Wikipedia provides a perfectly fine definition: “A region of a physical memory storage used to temporarily store data while it is being moved from one place to another.”

The myth, then, is that data processing frameworks that use micro-batching are capable of higher throughput than streaming frameworks that process event-at-a-time because micro-batching sends data across the network more efficiently.

What this myth overlooks, though, is that while a streaming framework will not rely on any sort of batching at the processing or programming model level, it will almost definitely buffer at the physical level.

And Flink absolutely buffers data, meaning that it sends collections of processed records over the network rather than sending one single record at a time. To not buffer data wouldn’t make sense from a performance standpoint–there’s no benefit to sending records over the network one-at-a-time. So we’ll concede that there’s no such thing as record-at-a-time at a physical level.

But buffering serves only as a performance optimization. And therefore, buffering:

  • Should be opaque to the user
  • Should not dictate system behavior in any other way
  • Should not impose artificial boundaries
  • Should not limit what you can do with the system

So while a user of Flink APIs can develop as though the program is processing each record independently, Flink uses buffering to achieve its performance under the hood.

In fact, micro-batching can introduce considerable overhead in the form of scheduling tasks–and when tuning for lower latency (with a higher volume of smaller batches), this overhead only increases! A stream processor can deliver all of the advantages of buffering with none of the task-scheduling overhead.

Myth 4: Exactly once? Completely impossible.

There are a few sub-variations of this myth, including:

  • Exactly once is impossible in nature
  • Exactly once is impossible end-to-end
  • Exactly once is never a real-world requirement anyway
  • Exactly once requires a significant performance trade-off

Let’s take a step back: we’re not thrilled to be getting into ‘exactly once’ in this post. Historically, ‘exactly once’ has referred only to ‘exactly once delivery’, but the phrase is now embedded in the stream processing lexicon and is used fairly loosely, making the two words more confusing and less descriptive than they used to be. The related concepts are still important, though, and so we won’t skip over it altogether.

To try to be as precise as possible, we’ll break out ‘exactly once state’ and ‘exactly once delivery’ as separate ideas. The confusion between state and delivery stems from prior uses of these terms. For example, while Apache Storm used the term ‘at least once’ implying delivery (Storm did not offer support for state), Apache Samza used the term ‘at least once’ implying application state guarantees.

Exactly-once state means that the state of the application after a failure will be as if no failure had occurred. For example, if we are maintaining a counter application and experience a failure, we will neither undercount nor overcount within the application. The reason for using the term ‘exactly-once’ in this context is that the application state is always as if each message was processed exactly once.

Exactly-once delivery means that after a failure, the recipient (some system outside of the application) will receive processed events as if there was no failure.

While the latter (delivery) is impossible for a stream processing framework to guarantee in every single scenario, the former (state) is both possible, and in the case of Flink, possible without a meaningful performance hit. And Flink is also capable of exactly once delivery to selected data sinks that cooperate with Flink’s checkpointing.

Flink’s checkpoints are periodic, asynchronous, and consistent snapshots of application state. This is how Flink is able to provide the aforementioned exactly-once state guarantees under failure: Flink periodically records (snapshots) the position it is reading in the input stream as well as the corresponding state of every operator. If there is a failure, Flink rolls back both the input stream and the operator state to a prior, consistent, global state and restarts the computation from there. Hence, even if records are replayed, the resulting state is as if the records were processed exactly once.

And end-to-end exactly once processing? It’s possible to build applications so that checkpoints double as a transaction coordination mechanism–in other words, source and sink operators can also take part in checkpoints. The result is exactly-once processing inside the framework and exactly-once or “effectively-once” processing end-to-end. For example, when using Flink with Kafka as source and a rolling file sink (e.g. HDFS), one can achieve end-to-end exactly once from Kafka to HDFS. Similarly, when using Kafka as the Flink source and Cassandra as the sink, one can get end-to-end exactly once when updates to Cassandra are idempotent.

effectively_once

It’s worth mentioning that checkpoints can also triple as a state versioning mechanism via Flink’s savepoints. With savepoints, it’s possible to ‘move around in time’ while maintaining state consistency. This enables easy code updates, maintenance, migration, debugging, and even A/B testing or ‘what-if?’ simulations.

savepoints

Myth 5: Streaming only applies to “real-time”

Variations of this myth include:

  • “I don’t have low-latency applications, and therefore I don’t need a stream processor.”
  • “Stream processing is only relevant for transient data before we move it to storage.”
  • “We need a batch processor to do heavy, offline computations.”

Now it’s time to think about types of datasets vs. types of execution models.

First, two types of datasets

  • Unbounded: Data that’s produced continuously without a defined end point
  • Bounded: Data that’s finite and should be considered complete

Many real-world datasets are, in fact, unbounded datasets. This is true whether the data is stored in a sequence of files or directories on HDFS or in a log-based system like Apache Kafka. Examples include but are not limited to:

  • End users interacting with mobile or web applications
  • Physical sensors providing measurements
  • Financial markets
  • Machine log data

In fact, it is quite hard to find a real-world example of a dataset in a modern business that is bounded, except perhaps the location of all of a company’s buildings and facilities (but this, too, evolves as the business grows).

Second, two types of execution models

  • Streaming: Processing that executes continuously as long as data is being produced
  • Batch: Processing that is executed and runs to completeness in a finite amount of time, releasing computing resources when finished

Let’s go one step further and identify two categories of unbounded datasets: streams with continuously-arriving data and streams with periodically-arriving data.

two_types_of_streams_alt

It’s possible, though not necessarily optimal, to process either type of dataset using either type of execution model–for instance, batch execution has long been applied to unbounded datasets, especially unbounded datasets with periodically-arriving data. In reality, most “batch” jobs are executed according to some schedule, analyzing unbounded datasets by only touching a small part of the stream at a time. This often means the unbounded nature of the stream becomes someone else’s problem (usually whoever’s working on the ingestion pipeline).

Batch processing gives the impression that it is stateless, in that the output only depends on the input. In reality, batch jobs keep state internally (e.g. a reducer always keeps state), but this state is confined within the boundaries of the batch, and there is no way to correlate events across batches.

And ‘state confined within the boundaries of the batch’ matters as soon as a user tries to implement something as simple as windowing with an ‘event time’ time characteristic–a very common approach when working with unbounded data.

It’s inevitable that a batch processor working on an unbounded dataset will have to deal with late-arriving events (because of an upstream delay, etc.), and when events arrive late, the data inside of a given batch is at risk of being incomplete–remember, we assume that we’re generating our windows based on event time because this is the time characteristic that provides the most accurate model of reality. In batch execution, late-arriving data presents a problem even with simple fixed windows (e.g. tumbling or sliding windows) and becomes more difficult to deal with when using session windows.

Because if all of the data required to complete a computation isn’t in the same batch, then it’s not possible to arrive at correct results on an unbounded dataset when using batch execution–at least, not without a lot of additional overhead to deal with late data and state between batches (delaying processing until you’re sure all data has arrived, re-processing a batch).

Flink has a built-in mechanism for dealing with late elements–late data is assumed to be a normal occurrence when working with unbounded data in the real world, and so a well-designed stream processor will offer simple tooling to handle late data.

Stateful stream processing is a more naturally-fitting model for unbounded datasets regardless of whether the dataset is ‘continuously-arriving’ or ‘periodically-arriving’. The potential to compute in real-time is just icing on the cake when working with a stream processor.

For more on the topic, these issues are well-documented in Tyler Akidau’s Streaming 101.

Myth 6: So what? Streaming is too hard anyway.

OK, we’re at the last stage. That stage where you think, “This sounds great on paper, but I’m still not gonna adopt a streaming technology because…”:

  • Streaming frameworks are hard to learn.
  • Streaming is hard to reason about: Windows? Event time? Triggers? Yikes!
  • Streaming needs to be coupled with batch, and I know batch already. So what’s the point?

We’re never going to advocate that you should adopt streaming just because we happen to think streaming is cool. Rather, we believe that the decision to use a streaming execution model should ultimately depend on the nature of your data and the nature of your code.

Let us ask again, “What type of dataset am I working with?”

  • Unbounded (e.g. user activity, logs, sensor data)
  • Bounded (e.g. ______?)

And then ask a different question: “What changes more frequently?”

  • My code changes more frequently than my data.
  • My data changes more frequently than my code.

If your data changes more frequently than your code–for example, a ‘standing’ query that remains more or less static and is updated occasionally for bug fixing, applied to an ever-changing set of data–we would argue that you have a streaming problem.

So before you write streaming off as ‘too hard’, consider that there’s a good chance you’re already solving a streaming problem in some capacity! You might be solving your streaming problem with hourly batch jobs, and someone else on your team might own the creation and management of the hourly batches (in which case you’re probably living with inaccurate results without realizing it due to batch’s issues with time handling and state that we mentioned above).

The Flink community has long worked to deliver higher-level APIs that abstract away many of the intricacies of time and state. For example, dealing with event time in Flink is as simple as defining a window of time and a function that extracts timestamps and watermarks (which only needs to be done once per stream). Dealing with state is as easy as defining variables in a Java program and registering them with Flink. And efforts such as Flink’s StreamSQL will allow you to run SQL queries on your never-ending streams.

To finish the thought exercise: what if your code changes more quickly than your data? In that case, we contend that you have an exploration problem. And using notebooks or other, similar tools for iteration might be a perfect fit for solving an exploration problem.

As soon as your code stabilizes, though, you’ll end up with a streaming problem. And we recommend that you start thinking about a long-term solution in a streaming context from the very beginning.

Stream processing’s next act

As the stream processing space matures and these myths become less and less prevalent in everyday discussion (such as how we almost never hear a reference to the Lambda architecture any more), we also see that streaming is evolving beyond its initial place in analytics applications.

As we discussed, most real-world data is produced continuously and arrives continuously. Traditionally, this continuous data flow had to be interrupted, and data had to be either gathered in a central location or cut in batches in order for applications to interact with it.

With the streaming paradigm and patterns such as CQRS becoming more popular, applications can be developed directly on continuously-flowing data, enabling benefits such as localized and correct state, better isolation of applications and teams, and better handling of time series data.

We believe that Flink, as it further evolves and is adopted by more and more enterprises, has the potential not only to streamline analytics pipelines, but also to introduce a new, far-reaching, and more powerful computational model.

Like this article? Subscribe to our weekly newsletter to never miss out!

]]>
https://dataconomy.ru/2017/02/03/stream-processing-myths-debunked/feed/ 1
Apache Samza Levels Up to a Top-Level Project; Witnesses Large Scale Adoption https://dataconomy.ru/2015/01/28/apache-samza-levels-up-to-apache-top-level-project-witnesses-large-scale-adoption/ https://dataconomy.ru/2015/01/28/apache-samza-levels-up-to-apache-top-level-project-witnesses-large-scale-adoption/#comments Wed, 28 Jan 2015 11:58:45 +0000 https://dataconomy.ru/?p=11721 Apache Samza, the distributed stream processing framework, has now graduated to become a Top-Level Project (TLP), the Apache Software Foundation has revealed. “The incubation process at Apache has been great. It has helped us cultivate a strong community, and provided us with the support and infrastructure to make Samza grow,” noted Chris Riccomini, VP at […]]]>

Apache Samza, the distributed stream processing framework, has now graduated to become a Top-Level Project (TLP), the Apache Software Foundation has revealed.

“The incubation process at Apache has been great. It has helped us cultivate a strong community, and provided us with the support and infrastructure to make Samza grow,” noted Chris Riccomini, VP at Apache Samza.

Designed to handle fault tolerance, stateful processing, message durability, and scalability, Samza uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management, thus enabling organizations to understand and react to their data in real-time.

Purpose-built as a production-grade system with operability and scalability in mind, to process tracking data, service log data, and for data ingestion pipelines for realtime services, Samza originated at LinkedIn, and was submitted to the Apache Incubator in July 2013.

Samza is witnessing strong adoption in business intelligence, financial services, healthcare, mobile applications, security, and software development, among other industries.

Read the announcement at the ASF blog here.


(Image credit: Tom Spinker, via Flickr)

]]>
https://dataconomy.ru/2015/01/28/apache-samza-levels-up-to-apache-top-level-project-witnesses-large-scale-adoption/feed/ 3
Meet Apache Samza – LinkedIn’s Stream Processing Framework https://dataconomy.ru/2015/01/09/meet-apache-samza-linkedins-stream-processing-framework/ https://dataconomy.ru/2015/01/09/meet-apache-samza-linkedins-stream-processing-framework/#comments Fri, 09 Jan 2015 10:59:56 +0000 https://dataconomy.ru/?p=11304 With the advent of Big Data and the rapidly growing scale of web-applications, monolithic relational databases were replaced by scalable, partitioned, NoSQL databases and HDFS; individual queries to relational databases were replaced by the likes of Hive and Pig. This growing scale and partitioned consumption model brought about by these systems, also put forth the […]]]>

With the advent of Big Data and the rapidly growing scale of web-applications, monolithic relational databases were replaced by scalable, partitioned, NoSQL databases and HDFS; individual queries to relational databases were replaced by the likes of Hive and Pig.

This growing scale and partitioned consumption model brought about by these systems, also put forth the need for smooth processing of “streams of events” at scale.

That’s when LinkedIn came up with Samza.

Apache Software Foundation’s incubation project since September 2013, Apache Samza is the distributed stream processing framework that incorporates Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

Navina Ramesh of LinkedIn explains, “we recognized that some of our use cases couldn’t be implemented in Hadoop due to the large turn-around time that batch processing needed.”

“Unlike Hadoop, which is optimized for throughput, Kafka is optimized for low-latency messaging.  We built a processing system on top of Kafka, allowing us to react to the messages — to join, filter, and count the messages. The new processing system, Apache Samza, solved our batch processing latency problem and has allowed us to process data in near real-time,” she explains in The New Stack article.

It essentially takes care of stream processing needs within LinkedIn, “to provide a lightweight framework for continuous data processing.” Originally open sourced, the framework enables building applications to process feeds of messages.

Read more here.


(Image credit: Jean and Fred, via Flickr)

 

]]>
https://dataconomy.ru/2015/01/09/meet-apache-samza-linkedins-stream-processing-framework/feed/ 3