Data Pipeline – Dataconomy
Bridging the gap between technology and business

How to make data lakes reliable

Data professionals across industries recognize they must effectively harness data for their businesses to innovate and gain competitive advantage. High quality, reliable data forms the backbone for all successful data endeavors, from reporting and analytics to machine learning.

Delta Lake is an open-source storage layer that addresses many of the concerns around data lakes and makes them reliable. It provides:

  • ACID transactions
  • Scalable metadata handling
  • Unified streaming and batch data processing

Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark™ APIs.
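To make this concrete, here is a minimal sketch of writing to and reading from a Delta table with PySpark. It assumes Apache Spark and the delta-spark package are installed locally, and the table path (/tmp/delta/events) is a placeholder; it illustrates the open-source API rather than reproducing code from the guide itself.

```python
from pyspark.sql import SparkSession

# Enable the Delta Lake extensions on a local Spark session.
spark = (
    SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Each write is an atomic, ACID commit to the table's transaction log.
events = spark.range(0, 100).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Append more data; concurrent readers never see a half-written commit.
spark.range(100, 200).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("append").save("/tmp/delta/events")

# The same table can be read in batch (below) or used as a streaming source.
df = spark.read.format("delta").load("/tmp/delta/events")
print(df.count())  # 200
```

Because every write is recorded as a transaction, readers see either the previous version of the table or the new one, never an intermediate state — which is what makes the data lake reliable for use cases like those listed below.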

In this guide, we will walk you through the application of Delta Lake to address four common industry use cases with approaches and reusable code samples. These can be repurposed to solve your own data challenges and empower downstream users with reliable data.

Learn how you can build data pipelines for:

  • Streaming financial stock data analysis that keeps legacy and streaming data transactionally consistent while both are processed concurrently
  • Genomic data analytics used for analyzing population-scale genomic data
  • Real-time display advertising attribution for delivering information on advertising spend effectiveness
  • Mobile gaming data event processing to enable fast metric calculations and responsive scaling
Download this free guide here.
Amazon Kinesis vs. Apache Kafka For Big Data Analysis

Data processing today is done in the form of pipelines, which include steps such as aggregation, sanitization and filtering, and finally generate insights by applying statistical models. Amazon Kinesis is a platform for building pipelines that handle streaming data at the scale of terabytes per hour. Parts of the Kinesis platform are a direct competitor to the Apache Kafka project for big data analysis. The platform is divided into three separate products: Firehose, Streams, and Analytics. Each solves a different problem, as discussed below:

How do you load huge amounts of data into the pipeline?

This problem is solved by Firehose, the entry point for data into the AWS ecosystem. Kafka does not have an equivalent to Firehose; this product is tightly tied to Amazon's other offerings. Firehose can load incoming data from many different sources, buffer it into larger chunks and forward it to other AWS services such as S3, Redshift and Lambda. Firehose also solves the problem of backpressure, which arises when the input buffer of a service cannot keep up with the output buffer of the service feeding data into it. Firehose automatically scales up or down to match its throughput to the volume of data it has to work with. It can also batch, compress and encrypt the data before feeding it into other AWS services, and Amazon recently added several other data transformations that run before the data is loaded.
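As a rough illustration of how little code ingestion takes, the boto3 sketch below pushes a batch of JSON events into an existing Firehose delivery stream. The stream name ("clickstream-to-s3") and the event shape are placeholders; creating the delivery stream and its destination is assumed to have been done separately.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Hypothetical click events; Firehose buffers, batches and delivers them
# to the configured destination (for example an S3 bucket) on our behalf.
events = [{"user_id": i, "action": "page_view"} for i in range(100)]

firehose.put_record_batch(
    DeliveryStreamName="clickstream-to-s3",  # placeholder delivery stream
    Records=[{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events],
)
```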

How do you direct live input streams into the pipeline?

Amazon Kinesis Streams is very similar to Kafka in that it is built to work with live input streams. It stores the streams that are sent to it, and those streams can then be consumed by custom applications written with the Kinesis Client Library.

Kafka “topics” are roughly equivalent to Kinesis streams. Both represent an ordered, immutable list of messages, and each message has a unique identifier and is appended to the list as it arrives. Kafka messages can be retrieved from the last known message onwards using an offset. To get the same functionality in Kinesis, users have to build it themselves using the API and message sequence numbers.
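A sketch of that do-it-yourself approach with boto3 is shown below: the consumer asks for a shard iterator starting after the last sequence number it checkpointed, then fetches records in a loop. The stream name, shard ID and checkpointed sequence number are all placeholders.

```python
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# The last sequence number this consumer checkpointed (stored by the
# application itself, e.g. in DynamoDB) -- a made-up value here.
checkpoint = "49546986683135544286507457936321625675700192471156785154"

iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",            # placeholder stream
    ShardId="shardId-000000000000",      # placeholder shard
    ShardIteratorType="AFTER_SEQUENCE_NUMBER",
    StartingSequenceNumber=checkpoint,
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    if not batch["Records"]:
        time.sleep(1)  # back off briefly when the shard has no new data
    iterator = batch["NextShardIterator"]
```

The Kinesis Client Library exists precisely to hide this bookkeeping (iterators, checkpoints, shard discovery), much as Kafka's consumer groups track offsets for you.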

Topics in Kafka consist of one or more partitions, which are similar to the concept of shards in Kinesis Streams. Kafka is scaled by watching for hot partitions and adding or removing partitions as needed, while Kinesis Streams are scaled by splitting and merging shards. Because Kafka is self-hosted, it carries extra operational overhead compared with Kinesis: managing the clusters, setting up monitoring and alerting, updating packages, tuning, and handling failover. The actual cost comparison, however, depends on other factors such as expected payload size, flow density and retention period. Kafka can be fine-tuned to achieve latencies below one second, while Kinesis Streams typically have one to three seconds of latency. Kinesis Streams are therefore better suited when payloads are large, throughput is high and latency matters less.
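For reference, resharding a Kinesis stream is a single API call. The hedged boto3 sketch below splits the hash key range of a stream's first shard down the middle; the stream and shard identifiers are placeholders.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Look up the hash key range of the shard we want to split.
shard = kinesis.list_shards(StreamName="clickstream")["Shards"][0]
low = int(shard["HashKeyRange"]["StartingHashKey"])
high = int(shard["HashKeyRange"]["EndingHashKey"])

# Split the hot shard in two, doubling its write capacity.
kinesis.split_shard(
    StreamName="clickstream",            # placeholder stream
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((low + high) // 2),
)

# The reverse operation merges two adjacent, underused shards:
# kinesis.merge_shards(StreamName="clickstream",
#                      ShardToMerge="shardId-000000000001",
#                      AdjacentShardToMerge="shardId-000000000002")
```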

Kafka provides durability by replicating data across multiple broker nodes, while Kinesis Streams replicates it across multiple availability zones. The user is not required to configure the replication strategy with Amazon's offering, so they can focus more on the big data analysis itself.

A good use case for streaming data is analyzing clickstreams on websites and giving real-time recommendations based on the insights gained. Streams are very useful in financial trading as well: fintech requires very high-throughput data processing that can surface patterns and exceptions on the fly, which can be used to build automated trading systems that make decisions on real-time data. Financial models rely on the analysis of historical data, and Streams lets users update their models as new data arrives. The same fine-grained analysis is also required for generating real-time dashboards that monitor the activity and uptime of critical services across businesses. Finally, Streams is useful in real-time billing and metering systems operating at scales where accounting for individual unit consumption by hand is not feasible, and anomalies can easily be spotted along the way.

But what about sending the data to Streams in the first place? Kinesis Streams can take in input data from thousands of endpoints at once. The concept of producers is the same as in Kafka, although the implementations differ: the producer accepts records from higher-level applications, batches them, breaks the records up per partition or shard, and forwards them to Kinesis Streams or Kafka. Some organizations might require more fine-grained control over the data producer. In that case a data aggregator like Segment can be used to pull in data from sources such as mobile and web apps (for example through its Mixpanel integration), pre-process it and send it on. Segment has an Amazon Kinesis plugin and a Heroku Kafka plugin for this specific use case.
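At its simplest, a producer writing directly to a stream just chooses a partition key so that related records land on the same shard. A hedged boto3 sketch (the stream name and event shape are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "user-42", "action": "add_to_cart", "sku": "B0123"}

# Records sharing a PartitionKey are routed to the same shard,
# which preserves per-user ordering.
response = kinesis.put_record(
    StreamName="clickstream",                # placeholder stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
print(response["ShardId"], response["SequenceNumber"])
```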

How do you cherry-pick data from the pipeline (and run long-running queries)?

Kafka sits at the ingestion stage of the data processing pipeline and has no built-in analytics abilities. Kinesis Analytics takes its input either from Firehose (data is injected after it has been loaded) or from Streams (data is injected as it arrives). SQL queries can be run against these sources to filter out the relevant data; the results can be processed further with an AWS service like Lambda, and the generated insights can be routed on to other AWS services such as DynamoDB for storage or Slack for notifications. Analytics supports streaming queries, so output arrives in real time instead of being pushed in bulk once a job finishes, as with traditional SQL queries.
Pipeline-based data processing is becoming the norm in big data analysis. Because of the modular design of pipeline-based systems, additional processing steps can easily be added to the same pipeline, or placed in parallel units to form non-linear pipelines. Ingestion systems like Kinesis Streams and Kafka provide a layer of asynchronous data flow between producers and consumers; Kinesis Firehose helps buffer and inject huge amounts of data into the system; and Kinesis Analytics is used to draw insights. Choosing between Kinesis Streams and Kafka is an important decision, and it comes down to which one best matches the characteristics your system needs.



The Data Pipeline – Analytics at the Speed of Business

Business leaders are growing weary of making further investments in business intelligence (BI) and big data analytics. Beyond the challenging technical components of data-driven projects, BI and analytics services have yet to live up to the hype.

Early adopters and proponents were quick to frame solutions as miraculous reservoirs of insight and functionality. However, big data has not met many C-level executives’ expectations. This disconnect has many executives delaying projects, filing end-to-end big data solutions under “perhaps, in the future.”

Increasing interest and investment in distributed computing, AI, machine learning and IoT are generating practical and user-friendly tools for ingesting, storing, processing, analyzing and visualizing data. Still, the necessary IT, data-science and development operations are time-consuming and often tie up substantial resources.

This is where data pipelines are uniquely fit to save the day. The data pipeline is an ideal mix of software technologies that automate the management, analysis and visualization of data from multiple sources, making it available for strategic use.

What Does a Data Pipeline Do?

The straightforward answer is “whatever you need it to do,” meaning that there are virtually endless and evolving ways of designing and using data pipelines. However, at a strategic business level, data pipelines have two fundamental use cases:

  • Data-enabled functionality – From automated customer targeting and financial fraud detection to robotic process automation (RPA) and even real-time medical care, data pipelines are a viable way to power product features regardless of industry. For example, adding data-enabled features to the shopping cart of an e-commerce platform has never been easier than with today’s streaming analytics technologies. The ability to easily create flexible, reliable and scalable data pipelines that integrate and leverage cutting-edge technologies paves the way for industry innovation and keeps data-driven businesses ahead of the curve.
  • BI and analytics – Data pipelines favor a modular approach to big data, allowing companies to bring their zest and know-how to the table. Data pipelines are designed with convenience in mind, tending to specific organizational needs. Stand-alone BI and analytics tools usually offer one-size-fits-all solutions that leave little room for personalization and optimization.

By developing and implementing data pipelines, data scientists and BI specialists benefit from multiple viable options regarding data preparation, management, processing, and data visualization. Data pipelines are an incredibly fruitful way of tackling technology experimentation and data exploration.

How Do I Build a Great Data Pipeline?

This insightful piece by Michael Li links the success of a data pipeline to three fundamental requirements. Meeting these three criteria alone does not guarantee good data pipelines, but it helps ensure that data and research results are reliable and useful to the business:

  • Reproducibility – As with any science, data science must be subjected to thorough testing and third-party validation. As Li puts it, “Science that cannot be reproduced by an external third party is just not science.” Furthermore, data scientists benefit from the existing tools of software engineering, which allow them to isolate all the dependencies of the analysis – the analysis code, the data sources and the algorithmic randomness – making the data pipelines reproducible.
  • Consistency – Having access to correctly formatted data, and to the tools required to preprocess incorrectly formatted data, is crucial for the success of your data pipeline. Whether that means checking all code and data into a revision-control repository, or placing code under source control while locking down the data sources of external pipelines, securing data sources is fundamental to consistent data and reproducible data pipelines.
  • A Common ETL – ETL refers to the Extract, Transform, and Load process, which is responsible for importing data from source systems and storing it in a data repository. With the proliferation of Hadoop, ETL has been modernized and now poses less of a challenge to deploying a great data pipeline. Nonetheless, sharing ETL code between research and production reduces errors and helps ensure reproducible results from the data pipeline (see the sketch after this list).
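As an illustration of what sharing a common ETL can look like, the hypothetical module below puts the extract, transform and load steps behind a single entry point that both a research notebook and a production job can import, so the two environments cannot drift apart. File paths and column names are placeholders, not anything prescribed by Li’s article.

```python
import csv
from pathlib import Path


def extract(source: Path) -> list[dict]:
    """Read raw rows from the source system (here: a CSV export)."""
    with source.open(newline="") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[dict]:
    """Apply the same cleaning rules in research and in production."""
    return [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")  # drop rows with a missing amount
    ]


def load(rows: list[dict], target: Path) -> None:
    """Store the cleaned data in the shared repository (here: a CSV file)."""
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)


def run_pipeline(source: Path, target: Path) -> None:
    # The analyst's notebook and the nightly production job both call this.
    load(transform(extract(source)), target)
```

Keeping this one module under version control, alongside the analysis code and locked-down data sources, is exactly the discipline the reproducibility and consistency requirements above describe.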

Apart from these three fundamental requirements, there is also a case to be made for efficient resource orchestration and collaboration capabilities. After all, why favor a modular approach to big data if not for the ability to streamline research and development operations and shrink time to market to the bare minimum?

What Else Do I Need to Know?

Whether the goal is using BI and analytics to drive decision making or delivering data-driven functionality to products and services, data pipelines are the solution of choice. By intelligently leveraging powerful big data and cloud technologies, businesses can now gain benefits that, only a few years ago, would have completely eluded them due to the rigid, resource-intensive and time-consuming conundrum that big data used to be.

Data pipelines are not miraculous insight and functionality machines either, but instead are the best end-to-end solution to meet the real-world expectations of business leaders. After gaining access to this technology, the only remaining concern is finding and hiring those elusive, yet game-changing developers and data scientists.

