Reza Sadri is the CEO of Levyx, the creators of high-performance processing technology for big-data applications. Prior to Levyx, Sadri was the CTO for software at sTec, which was acquired by Western Digital Corporation in 2013.
What is the potential of Spark? How far is the market from realizing that potential?
Spark has the potential to be as transformational in the computing landscape as the emergence of Linux back in the 1990s, thanks to strong buy-in, a growing developer community, and an application ecosystem that will help it reach that potential.
I am not the only one who holds this view. IBM has recently endorsed Spark and has committed to throw IBM-scale resources behind it. That endorsement, which called Spark the “most important new open source project in a decade,” will see IBM incorporate Spark into its platforms and cloud solution with the help of 3,500 IBM researchers, while the company also committed to educating 1 million data scientists and data engineers on Spark.
Beyond that, the Spark development community is growing very quickly, having already overtaken the Hadoop developer count — a precursor to broader adoption among end users.
In addition, a huge application ecosystem has been built around Spark, which will assuredly accelerate its growth. In contrast, Hadoop failed to gain much support from application builders, which may have hindered its ability to reach its full potential.
What pain points do organizations have to navigate in analyzing big data and, specifically, utilizing Spark?
Most enterprises are trying to do more with less. That is, organizations have to find a way to analyze mounting amounts of data using the same resources.
Typically this has led organizations to inefficiently scale out their infrastructure (i.e., throw hardware at the problem). However, as data collection grows, enterprises are finding it increasingly difficult to support this type of expansion.
Implementing Apache Spark as a big data platform is a major step in helping those companies, but it does not resolve the issue of cost. Apache Spark, while good at high-speed analysis of data on a large scale, still relies on memory to achieve its high performance. That memory (DRAM) is expensive and inefficiently used — particularly since many companies simply add costly servers to ameliorate the problem. The end result: deploying Spark at scale is often cost-prohibitive.
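For context, stock Spark already exposes settings that govern how much of its performance depends on DRAM, and it can spill cached data to local disk rather than holding everything in memory. A minimal, illustrative fragment using standard Spark configuration properties (the sizes and the SSD mount path are assumptions, not recommendations):

```
# spark-defaults.conf (illustrative values only)
spark.executor.memory     32g         # heap per executor; sized to the node, not the dataset
spark.memory.fraction     0.6         # share of heap for execution + storage (Spark's default)
spark.local.dir           /mnt/ssd1   # put spill and shuffle files on fast local flash
```

At the application level, persisting datasets with `StorageLevel.MEMORY_AND_DISK` instead of the memory-only default achieves a similar trade: data that no longer fits in DRAM falls back to local storage rather than forcing a larger cluster.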
At Levyx, we created a unique distribution of Spark (called LevyxSpark) that changes the way a compute node’s available resources, including memory, are applied to analytics workloads.
Analyzing big data requires both robust hardware and software solutions. How much of big data adoption is being driven by innovations in hardware versus software?
It is difficult to say which side is making advances faster, or whose advances are more meaningful — the hardware side or the software side. The reality is that engineers and scientists on both sides have made phenomenal advances in their respective fields. However, these development areas are often siloed off from one another to such an extent that applying their separate innovations in a common infrastructure leads to underutilization of the system’s resources.
What advice would you give to companies that want to start using Spark?
The benefits of Apache Spark are well-publicized and easy to understand – a faster, more manageable, full-featured open-source big data analytics platform.
But for companies that actually want to implement Spark, or are already in the early stages of using it, the key will be to understand the all-in cost of deploying it. Since Spark derives much of its real-time speed from relying heavily on memory, the conventional scale-out methods of deployment are usually cumbersome and complicated. Furthermore, after factoring in the associated costs of power, connectivity, networking, management/maintenance, and additional space, implementation costs are often underestimated and can prove prohibitive.
That’s why Levyx believes it has a complementary enabling technology for Apache Spark that makes deployments simpler — ultimately helping to stimulate adoption. By allowing each Spark node to process more data using less memory and more flash, we can shrink the node count (i.e., lower power, space, and total cost of ownership) and make deployments more affordable for both small- and large-scale customers.
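The node-count arithmetic behind that claim is easy to sketch. The following back-of-the-envelope calculation uses entirely hypothetical capacities and prices (they are not Levyx or vendor figures) just to show how letting each node hold more data on flash shrinks the cluster and its lifetime cost:

```python
def nodes_needed(dataset_gb, usable_gb_per_node):
    """Nodes required to hold the working set (ceiling division)."""
    return -(-dataset_gb // usable_gb_per_node)

def cluster_cost(node_count, node_price, yearly_opex_per_node, years=3):
    """Hardware plus operating cost over the deployment lifetime."""
    return node_count * (node_price + yearly_opex_per_node * years)

# A 10 TB working set on DRAM-heavy nodes (256 GB usable each)...
dram_nodes = nodes_needed(10_000, 256)      # 40 nodes
# ...versus flash-augmented nodes holding ~2 TB each.
flash_nodes = nodes_needed(10_000, 2_000)   # 5 nodes

dram_total = cluster_cost(dram_nodes, node_price=12_000, yearly_opex_per_node=3_000)
flash_total = cluster_cost(flash_nodes, node_price=15_000, yearly_opex_per_node=3_000)
```

Even with a higher per-node price for the flash-augmented configuration, the eightfold reduction in node count dominates the total — which is the "fewer nodes, lower TCO" argument in miniature.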
What industries are not using Spark, but should to optimize efficiency?
While most industries have only just started to adopt Spark, we would like to see key areas of the life sciences become more frequent users not only of Spark, but of other cutting-edge big data technologies. The processes by which huge amounts of data are analyzed and correlated in the biotech and healthcare industries could really benefit from a broad-based platform like Apache Spark.
And further, this could have an exciting impact on all of our lives. Imagine, for example, that all available health, fitness, biological, and genetic data could somehow be more readily interpreted and used to solve some of medicine’s toughest problems. We will leave the biotech firms to answer those questions, but we believe that leveraging the benefits of Spark in a cost-effective process could lead to huge breakthroughs.
What is the future of data among enterprises?
Data science is going to become more focused on response time. Many current applications are going to move toward real-time and interactive use cases. As people build applications to analyze data, they will find that the key differentiator is speed.
The great thing is that new improvements in hardware and system software are making it possible — and economically feasible — to achieve low-latency data processing even with very large amounts of data.
Therefore, I think that new entrants in the data science field can be more ambitious than ever before. It has become easy to adjust infrastructure to demand, opening the door to workloads that are both possible and affordable.
image credit: jonas maaloe jespersen