DataTorrent – Dataconomy

The Arrival of Scalable, Fault Tolerant Big Data Ingestion (7 September 2015)

The concept of data ingestion has existed for quite some time, but it remains a challenge for many enterprises trying to move necessary data in and out of various dependent systems. The challenge becomes even more distinctive when ingesting data in and out of Hadoop: Hadoop ingestion demands processing power and unique specifications that no single solution on the market today can meet.

Given the variety of data sources that need to pump data into Hadoop, customers often end up creating one-off data ingestion jobs. These one-off jobs copy files using FTP and NFS mounts or try to use standalone tools like DistCp. Because these jobs stitch together multiple tools, they run into problems with manageability, failure recovery and the ability to scale to handle data skew.

So how do enterprises design an ingestion platform that not only addresses the scale needed today but also scales out to address the needs of tomorrow? Our solution is DataTorrent dtIngest, the industry’s first unified stream and batch data ingestion application for Hadoop.

At DataTorrent, we work with some of the world’s largest enterprises, including leaders in IoT and ad tech. These enterprises must ingest massive amounts of data with minimal latency from a variety of sources, and dtIngest enables them to establish a common pattern for ingestion across various domains. Take a look at the following diagram, for example.

[Diagram: the stages of the data ingestion pipeline]

Each of the blocks signifies a specific stage in the ingestion process:

  • Input – Discover and fetch the data for ingestion. The data may come from file systems, messaging queues, web services, sensors, databases or even the outputs of other ingestion apps.
  • Filter – Analyze the raw data and identify the interesting subset. The filter stage is typically used for quality control, or simply to sample or parse the dataset.
  • Enrich – Plug in the missing pieces of the data. This stage often involves talking to external data sources to fill in missing attributes. Sometimes this means transforming the data from its original form into one more suitable for downstream processes.
  • Process – This stage performs lightweight processing to either further enrich the event or transform it from one form into another. Unlike the enrich stage, which employs external systems, the process stage usually computes using the existing attributes of the data.
  • Segregate – Before the data is handed to downstream systems, it often makes sense to bundle similar data sets together. While this stage is not always necessary for compaction, segregating the data does make sense most of the time.
  • Output – With Project Apex, outputs are almost always mirrors of inputs in terms of what they can do, and are just as essential. While the input stage fetches the data, the output stage lands it, either on durable storage systems or in other processing systems.

These stages can occur in many different arrangements; the order, and even the number of instances required, depends on the specific ingestion application. One simple arrangement is sketched below.
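To make the pattern concrete, here is a small, self-contained Java sketch that chains the six stages over a toy event type. It illustrates the staging idea only; it is not the dtIngest implementation or the Project Apex operator API, and the Event record, its fields and the per-stage logic are assumptions made up for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class IngestionPipelineSketch {

  // Toy event type; a real ingestion app would carry much richer metadata.
  record Event(String source, String payload, String category) { }

  public static void run(List<Event> input) {
    // Filter: keep only the interesting subset (here: non-empty payloads).
    Predicate<Event> filter = e -> !e.payload().isEmpty();

    // Enrich: fill in or normalize attributes (stand-in: trim the payload).
    Function<Event, Event> enrich =
        e -> new Event(e.source(), e.payload().trim(), e.category());

    // Process: lightweight transform using the event's existing attributes.
    Function<Event, Event> process =
        e -> new Event(e.source(), e.payload().toUpperCase(), e.category());

    // Input -> Filter -> Enrich -> Process -> Segregate -> Output
    Map<String, List<Event>> segregated = input.stream()
        .filter(filter)
        .map(enrich)
        .map(process)
        .collect(Collectors.groupingBy(Event::category)); // Segregate: bundle similar data

    // Output: land each bundle on durable storage or hand it to another system
    // (here we just print a summary).
    segregated.forEach((category, events) ->
        System.out.println(category + " -> " + events.size() + " events"));
  }

  public static void main(String[] args) {
    run(List.of(
        new Event("sensor-1", " 42.0 ", "telemetry"),
        new Event("webhook", "", "clickstream"),   // dropped by the filter stage
        new Event("sensor-2", " 17.3 ", "telemetry")));
  }
}
```

In a production pipeline each stage would be a separately deployed and scaled operator rather than an in-process function, but the flow of data through the stages is the same.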

The DataTorrent dtIngest app, for example, simplifies the collection, aggregation and movement of large amounts of data to and from Hadoop for a more efficient data processing pipeline. The app was built for enterprise data stewards and aims to make their job of configuring and running Hadoop data ingestion and data distribution pipelines a point-and-click process.

Some sample use cases of dtIngest include:

  • Bulk or incremental data loading of large and small files into Hadoop
  • Distributing cleansed/normalized data from Hadoop
  • Ingesting change data from Kafka/JMS into Hadoop
  • Selectively replicating data from one Hadoop cluster to another
  • Ingesting streaming event data into Hadoop
  • Replaying log data stored in HDFS as Kafka/JMS streams

Future additions to dtIngest will include new sources and integration with data governance.

(image credit: Kevin Steinhardt)

DataTorrent Scoop $15m to Simplify & Speed Up the Big Data Pipeline (28 April 2015)

Over the past few years, DataTorrent have made a real name for themselves in the big data analytics space with their proprietary real-time streaming (RTS) on Hadoop technology. Today, they announce an extra $15 million in funding to smooth out the wrinkles in the big data pipeline.

This latest cash injection brings their total funding since their inception in 2012 to $23.8 million. The round was led by Singtel Innov8 and saw participation from all of their original Series A investors. The Managing Director of Singtel Innov8, Jeff Karras, will also be joining the company’s Board of Directors.

The latest round should come as no surprise to those who have been following DataTorrent. The past year has been monumental for the growing company, which unveiled the DataTorrent RTS platform and was officially anointed a “Cool Vendor” by the powers-that-be at Gartner. 2014 also saw not one but two of the DataTorrent team, Himanshu Bari and Thomas Weise, become Dataconomy contributors; an achievement we’re sure tops their list of personal highlights.

DataTorrent was established in 2012 by two former Yahoo employees, Phu Hoang and Amol Kekre, who were passionate about addressing the need for enterprises to access their data in real time. Their RTS platform, which can process 1 billion events a second, is an undoubtedly powerful tool with wide-ranging applications, from e-commerce to the Internet of Things.

“More and more organizations are looking towards big data analytics to provide valuable insights for their business operations, but lack easy to use, intuitive tools for big data to uncover these insights in a timely manner themselves,” Karras stated in a release. “The DataTorrent platform uniquely enables broad accessibility of analytics for both batch and stream processing allowing organizations glean faster insights and to unlock the value of their data.”

It’s not just Karras and the team at Singtel Innov8 who are impressed with DataTorrent. Yahoo co-founder Jerry Yang counts himself as an investor, and they’ve brokered major partnerships with the likes of Hortonworks and Pivotal.

Photo credit: smellslikeupdog / Source / CC BY-ND

Using Kafka and YARN for Stream Analytics on Hadoop (11 March 2015)

Understanding Big Data: Stream Analytics and YARN

Real-time stream processing is growing in importance, as businesses need to be able to react faster to events as they occur. Data that is valuable now may be worthless a few hours later. Use cases include sentiment analysis, monitoring and anomaly detection.

With cheap and infinitely scalable storage and compute infrastructure, more and more data flows into the Hadoop cluster. For the first time, the opportunity is ripe to fully leverage that infrastructure and bring real-time processing as close to the data in HDFS as possible, yet isolated from other workloads. This need has been a driver for Hadoop native streaming platforms and a key reason why other streaming solutions, like Storm, fall short.

This post looks at the critical infrastructure pieces needed to build mission-critical real-time streaming applications on Hadoop, in particular those required for end-to-end fault tolerance of the processing platform.

Hadoop YARN – The distributed OS

A new generation of Hadoop applications was enabled through YARN, allowing for processing paradigms other than MapReduce. Next to MapReduce, there are now many other applications and platforms running on YARN, including stream processing, interactive SQL, machine learning and graph processing.

Hadoop 2.x with YARN is becoming the distributed OS for the data center. Adoption is picking up speed, as all major Hadoop distributions have moved to 2.x. YARN applications benefit from:

  • Horizontal scalability with commodity hardware
  • Distributed file system with “unlimited” storage space
  • Central resource management with queues, limits and locality preferences
  • Framework for multi-tenancy, fault tolerance and security

Most applications running on YARN rely on other external services that today are not native to YARN, like message brokers, databases and web servers. These services are separate and have their own infrastructure, resources and operational procedures. This results in fragmentation, inefficiency, higher cost and adoption barriers. For example, while a single scheduler can improve resource utilization while ensuring isolation, especially for elastic workloads, this is not possible when separate clusters are carved out; fault tolerance suffers from the same fragmentation.

Fault Tolerant Processing by example

DataTorrent RTS is a real-time stream-processing platform. We set out in early 2012 to build the first YARN-native application besides MapReduce. All components of our platform and its entire architecture were built around YARN. Other solutions in the stream-processing space either live completely outside Hadoop or are bolted onto YARN in a way that has multiple downsides.

Today, as we work with customers building applications on top of DataTorrent RTS, we see the need to extend the idea of the “distributed OS” to the other peripheral systems those applications depend on, so that expertise and existing investments can be leveraged.

One of the key differentiators of DataTorrent RTS is fault tolerance. These capabilities would not be possible without support in YARN:

  • Detection of process failures that can be used to implement automatic HA
  • Ability to add and replace machines with no downtime

As a YARN-native application, RTS uses these basic building blocks to provide full HA with no data loss, no human intervention and minimal recovery time. This is a critical capability for real-time processing and low end-to-end latency requirements. No data loss implies that the application state is checkpointed. RTS provides this without the user having to write extra code for it, a capability made possible by tight HDFS integration.
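To illustrate what durable checkpointing involves, the sketch below writes serialized operator state to HDFS using the standard Hadoop FileSystem API. The class, the directory layout and the window-id naming are assumptions made for the example; this is not the actual RTS checkpointing code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical helper that persists operator state per checkpoint window. */
public class HdfsCheckpointWriter {

  private final FileSystem fs;
  private final Path checkpointDir;

  public HdfsCheckpointWriter(Configuration conf, String dir) throws IOException {
    this.fs = FileSystem.get(conf);
    this.checkpointDir = new Path(dir);
  }

  public void write(long windowId, byte[] serializedState) throws IOException {
    // Write to a temporary file first so a crash never leaves a partial checkpoint.
    Path tmp = new Path(checkpointDir, windowId + ".tmp");
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      out.write(serializedState);
      out.hsync(); // force the bytes to durable storage before publishing
    }
    // An atomic rename publishes the checkpoint for recovery.
    if (!fs.rename(tmp, new Path(checkpointDir, Long.toString(windowId)))) {
      throw new IOException("Could not publish checkpoint for window " + windowId);
    }
  }
}
```

The essential properties are the ones any platform-level checkpoint needs: state is forced to durable storage before it is published, and publication is atomic, so recovery always sees either the previous checkpoint or the new one.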

Data needs to move into the Hadoop cluster for processing. For stream processing, Kafka is an increasingly popular choice of message bus: it was built for scalability and meets low-latency requirements. Let’s consider Kafka as the message bus delivering data into a DataTorrent application running in the YARN cluster for processing:

[Diagram: Kafka delivering events into a DataTorrent application running on the YARN cluster]
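As a rough illustration of this data path, the hypothetical consumer loop below uses the standard Kafka Java client to pull events off a topic and hand each one to a processing step. The broker address, group id, topic name and the process() hook are assumptions for the example; it is not the DataTorrent Kafka input operator, and it uses the newer Kafka consumer API for brevity.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventIngestLoop {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
    props.put("group.id", "ingest-pipeline");         // assumed consumer group
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("events")); // assumed topic
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          process(record.value()); // hand each event to the processing stage
        }
      }
    }
  }

  /** Placeholder for the downstream processing stage. */
  private static void process(String event) {
    System.out.println(event);
  }
}
```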

For mission-critical applications, high availability is essential. In the above scenario, the RTS stream-processing application running on YARN is fully fault tolerant: any process failure in the cluster will be handled within the framework established by YARN, using the built-in support for recovery in RTS. In contrast, although Kafka is fault tolerant with replicated partitions and leader-broker failover, failures of its server processes are either not handled at all or are handled through a mechanism that the user must provide.

This compromises the end-to-end fault tolerance proposition and leads to an acceptance problem. Not only can failures in the Kafka cluster lead to service interruption in the pipeline, they can also pose problems for sensitive data producers that have little tolerance for downtime of Kafka as their downstream buffer. So how can we make use of the great capabilities Kafka offers in a way that can be operationalized? Today’s primary users of Kafka have built their own teams and proprietary infrastructure to address this, but that can become an expensive hobby and is typically not what a customer wants to hear.

Kafka on YARN (KOYA)

Moving Kafka into the YARN cluster is a solution to the problem, especially in our context, where the user is running a YARN cluster anyway and has made the investment to operationalize it.

Previously, the user had to perform a number of steps to replace a failed Kafka broker. As long as a replica remains, there is no downtime. However, it is desirable that the path to recovery be predefined and automated, to avoid that alert in the middle of the night.

With YARN, the application master detects a process failure and can initiate recovery. In case of machine failure, the process can be replaced on a new machine. Since Kafka is sensitive to disk I/O, the YARN cluster administrator can reserve backup machines.

It makes sense to integrate Kafka with YARN, as existing investments and skills can be leveraged. Kafka running under the YARN umbrella can utilize the centrally managed pool of resources. The process monitoring and recovery features of YARN can be extended to provide complete HA for Kafka servers (Kafka provides replicated partitions, but it does not offer automation for dealing with failed brokers).
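As a rough sketch of what such automated broker recovery could look like, the hypothetical callback handler below uses YARN’s AMRMClientAsync API to request a replacement container whenever a container exits abnormally. The resource sizing, priority and broker launch step are assumptions for illustration; this is not KOYA’s actual application master.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

/** Hypothetical AM callback handler that keeps broker containers alive. */
public class BrokerRecoveryHandler implements AMRMClientAsync.CallbackHandler {

  private final AMRMClientAsync<ContainerRequest> amClient;

  public BrokerRecoveryHandler(AMRMClientAsync<ContainerRequest> amClient) {
    this.amClient = amClient;
  }

  @Override
  public void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() != ContainerExitStatus.SUCCESS) {
        // A broker process or its machine failed: ask the ResourceManager
        // for a replacement container so recovery needs no human intervention.
        Resource capability = Resource.newInstance(8192, 4); // assumed sizing
        amClient.addContainerRequest(
            new ContainerRequest(capability, null, null, Priority.newInstance(0)));
      }
    }
  }

  @Override
  public void onContainersAllocated(List<Container> containers) {
    // Launch (or relaunch) the Kafka broker in each allocated container;
    // the container launch context is omitted from this sketch.
  }

  @Override public void onShutdownRequest() { }
  @Override public void onNodesUpdated(List<NodeReport> updatedNodes) { }
  @Override public void onError(Throwable e) { amClient.stop(); }
  @Override public float getProgress() { return 0f; }
}
```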

DataTorrent announced a new initiative to integrate Kafka and YARN under the KOYA project. KOYA was proposed as KAFKA-1754 and was well received by the community. The goals of KOYA are listed below:

  • Automate broker recovery
  • Automate deployment of Kafka cluster
  • Central status of cluster
  • Ease of management
  • Support core Kafka as is, without modifications

Slider

We initially set out to build a new application master for KOYA from scratch. Why not? We had done it for RTS and have the required expertise. But considering the goals for KOYA, and that Kafka already provides most of the HA features, we evaluated Apache Slider. Slider was built to enable long-running services on YARN without making changes to the services themselves. We found Slider sufficient to bring Kafka to YARN, as it provides much of the infrastructure required for KOYA.

[Diagram: KOYA deployment, with Kafka broker containers running in a single YARN resource pool managed by the Slider application master]

With KOYA, there is only one pool of resources, with all machines running under YARN. The Slider application master is responsible for keeping the Kafka server containers running, each controlled by a Slider agent, which is written in Python.

[Figure: example resource configuration for the Kafka servers]

Using KOYA, the user can specify the resources for the Kafka servers in a configuration file. Today, YARN supports memory and CPU, with disk as a future consideration (YARN-2139). It is also possible to use the other parameters that YARN supports, such as node labels and locality. One feature that KOYA would benefit from but that isn’t available in YARN today is anti-affinity, which is needed to ensure that only a single broker runs on a given machine for optimal performance. The set of candidate machines can be restricted via labels. In the absence of direct anti-affinity support, requesting 51 percent of a machine’s available resources is a workaround.

Kafka relies on the local file system for storing its logs (topic partition data). Hence, it is important that the server remains on the same machine across restarts, unless this becomes impossible, due to a machine failure for example. Slider allows users to pin a component to the machine it was first allocated to and will add improved support to relax the affinity constraint when needed without user intervention (SLIDER-799). With the latter, the user will be able to allow an alternative machine to be used when needed and Kafka will be able to restore the partition replication.

What’s next?

KOYA is under development as open source and we are looking to take it forward in collaboration with the Kafka and YARN communities. We are targeting Q2 for the first release, and one of our objectives is to provide a dedicated admin web service for the Kafka cluster. We see this as a future part of Kafka that should be integrated as a Slider component and plan to work with the Kafka community on it. We have also identified a number of enhancements to Slider that we look forward to incorporating in future releases.


Thomas Weise is principal architect at DataTorrent and has developed and architected distributed systems, middleware and web applications since 1997. Before joining DataTorrent at its inception, he served as principal engineer in the Hadoop Services team at Yahoo! and contributed to several of the ecosystem projects, including Pig, Hive and HCatalog. Porting the MapReduce-oriented infrastructure to Hadoop 2.x also provided the motivation to explore alternative, more interactive and real-time processing capabilities on the new YARN architecture. Earlier, he worked on enterprise software for network/device management, e-commerce and search engine marketing.


Photo credit: Misha Dontsov / Foter / CC BY

Making the Most of the Internet of Things (27 January 2015)

Himanshu Bari is the Director of Product Management for DataTorrent. His prior experience includes working as a Senior Product Manager for Symantec and Hortonworks, a Strategy Consultant for Panasonic, and a Senior Systems Analyst for Goldman Sachs.


The number of connected devices related to the Internet of Things is rapidly growing, raising the question of whether enterprises are prepared for IoT. While many enterprises may be overwhelmed by the influx of IoT devices, early adopters are embracing it. IoT allows for remote monitoring, diagnostics and improved contextual awareness, with early adopters focused on verticals such as building and home automation, wearable devices, resource optimization and machine monitoring. Although we have seen many success stories from these specific industries, current data suggests that full adoption will take an additional five to eight years.

With IoT in the early stages, solutions need to account for three key qualities of the space at the moment:

Fragmentation: There is quite a bit of fragmentation of technologies involved in IoT solutions right now. At the highest level, fragmentation is driven by vertical use cases that require extremely specialized solutions. As a result, fragmentation spans across the chain, from the software to the device, to the sensor and even to the protocol level.

High variability: IoT applications naturally start with an experimentation phase, making it difficult for enterprises to understand and plan for exact infrastructure investments. This raises the need for infrastructure that is fungible, can be scaled incrementally and is available on demand.

Wide distribution: “Things” are scattered all over the place based on their deployment needs. This raises interesting challenges in terms of providing connectivity, analytics, and mobility of the device, regardless of the location.

Keeping these three qualities in mind, the architecture of a successful IoT solution must provide end-to-end coverage, customizable options and a ubiquitous presence. Enterprises are struggling to maintain solutions on their own due to the lack of standardization. Successful IoT application rollouts need vendor support across the entire value chain, all the way from the sensors and the SDK used to program them to the back-end infrastructure that gathers, stores and analyzes the data in real time.

Next, the connectivity, analytics and integration must be able to vary significantly based on specific verticals and use cases. Because the market is still nascent and evolving, solution architectures need high levels of customization in order to provide value.

Lastly, regardless of where a sensor is physically located, there must be a reliable connection to a gateway that links to the application brain. Basic services like security, reliability, fault tolerance, consistency and scalability should be available universally in the underlying framework across the entire IoT deployment. Tiering also needs to be supported so that deployments can scale out seamlessly without bottlenecks.

Companies of all sizes are aiming to provide platforms that enable IoT applications to gather, store, analyze and distribute data. They provide an SDK for IoT app development as well as the back-end cloud infrastructure to gather and analyze the data. Enterprises need to find a way to implement IoT solutions that adapt and adjust at the speed of specific situations and businesses.


(Image credit: Mike, via Flickr)

DataTorrent RTS Released, Processes 1 Billion Data Events a Second (4 June 2014)

Yesterday DataTorrent’s first product, DataTorrent RTS, became generally available. DataTorrent RTS is claimed to be the first platform on top of Hadoop capable of processing more than 1 billion data events per second.

Founders Phu Hoang and Amol Kekre noticed a gap in the market whilst they were both working for Yahoo! They observed a real demand for technologies that let you see what is happening in real time, rather than identifying problems only after they have happened and the analytics have caught up. “We are seeing increasing interest in stream-processing platforms for real-time analytics, as a complement to data warehouses and Apache Hadoop,” said Jason Stamper, Research Analyst, 451 Group. “Enterprise adoption of stream-processing requires fast, in-memory processing of a large volume of events at scale — and in many cases a fault-tolerant architecture too.”

As well as allowing for proactive management rather than management in hindsight, DataTorrent is, Hoang believes, prepared for the impending explosion of the Internet of Things. “By 2020, the number of smartphones, tablets, and personal computers in use will reach 7.3 billion units. Additionally, the number of Internet of Things (IoT) connected devices will grow to 26 billion units, a nearly 30-fold increase from 0.9 billion in 2009,” states the DataTorrent press release. Although the amount of human-generated data exploded relatively recently and companies are scrambling to process and harvest this influx, the rise of machine and sensor data has barely begun, and technologies must adapt to keep up with the relentless volume of generated data to come.

David Hornik of August Capital, whose firm has invested $8M in DataTorrent, acknowledges the ways in which DataTorrent will ameliorate existing Hadoop technology to cope with new performance and latency demands. “Everybody acknowledges big data matters, and Hadoop is a perfect container, but that doesn’t help run your business. How do you take advantage of these nodes distributed around the world?” he states. “When you can process a billion data points in a second, there are a lot of possibilities.”

(Photo credit: Peter Riou)



