Better Allies than Enemies: Why Spark Won’t Kill Hadoop (13 May 2015)

Fans and supporters of Hadoop have no reason to fear; Hadoop isn’t going away anytime soon. There’s been a great deal of consternation about the future of Hadoop, most of it stemming from the growing popularity of Apache Spark. Some big data experts have even gone so far as to say Apache Spark will one day soon replace Hadoop in the realm of big data analytics. So are the Spark supporters correct in this assessment? Not necessarily. Apache Spark is a newer technology that’s getting a lot of attention, and the number of Apache Spark users is growing at a considerable pace, but that doesn’t make it Hadoop’s successor. The two technologies certainly have similarities, but their differences set them apart, and the right platform really depends on the task it will be used for. To say Spark is on its way to dethroning Hadoop is simply premature. If anything, the two look to be complementary in the work they do.

Every discussion surrounding Hadoop should include MapReduce, the parallel processing framework in which jobs are run to process and analyze large sets of data. If an enterprise needs to analyze big data offline, Hadoop is usually the preferred choice, and that’s what drew so many businesses and industries to Hadoop in the first place: the capability to store and analyze big data inexpensively. As Matt Asay of InfoWorld Tech Watch puts it, Hadoop essentially “democratized big data.” Suddenly, businesses had access to information the likes of which they had never had before, and they could put that big data to good use, creating a large number of big data use cases. Hadoop’s batch-processing technology was revolutionary and is still widely used today. When it comes to data warehousing and offline analysis jobs that may take hours to complete, it’s tough to go wrong with Hadoop.
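
As a minimal sketch of what this batch model looks like in code, the mapper and reducer below implement the classic word count, the kind of long-running offline job Hadoop handles well; the class names and the whitespace tokenization are illustrative choices of mine, not something taken from the article.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every whitespace-separated token in its input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;  // skip empty tokens caused by leading whitespace
            }
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the per-word counts produced by all the mappers.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```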

Apache Spark, which was developed as a project independent of Hadoop, offers its own advantages that have made many organizations sit up and take notice. Many supporters say Spark represents a technological evolution of Hadoop’s capabilities. There are several categories where Apache Spark excels. The first and most touted is speed. When processing data in a Hadoop cluster, Spark can run applications much more quickly — anywhere from ten to a hundred times faster in some cases. This capability has essentially ushered in the era of real-time big data analytics, often applied to streaming data. Beyond speed, Spark is also relatively easy to use, particularly for developers. Writing applications in a familiar programming language, like Java or Python, makes the process of building apps that much easier. Spark is also versatile: it can run on a number of different platforms, including Hadoop and the cloud, and it can access a wide variety of data sources, among them Amazon Web Services’ S3, Cassandra, and HDFS, Hadoop’s own data store.
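
To make the developer-friendliness point concrete, here is a rough sketch of a small Spark job written with the Java API mentioned above; the application name, file path and log format are invented for illustration, and the same logic could be written just as easily in Python or Scala.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkErrorCount {
    public static void main(String[] args) {
        // Local mode for illustration; the same code can run on a YARN/Hadoop cluster.
        SparkConf conf = new SparkConf().setAppName("error-count").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The path is a placeholder; Spark can read from HDFS, S3, local disk and more.
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");

        // cache() keeps the filtered dataset in memory, which is where much of
        // Spark's speed advantage over disk-based MapReduce comes from.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

        System.out.println("Errors: " + errors.count());
        System.out.println("Timeouts: " + errors.filter(line -> line.contains("timeout")).count());

        sc.stop();
    }
}
```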

With Spark’s capabilities in mind, some may wonder why any organization should stick with Hadoop at all. After all, Spark appears to run more complex and sophisticated workloads more quickly, and who wouldn’t want real-time analytics? But the truth is that Hadoop and Spark may in fact work better together. If anything, Spark loses some of its effectiveness without Hadoop, since it was designed to run on top of it. Hadoop can support both the traditional batch-processing model and the real-time analytics model; think of Spark as an added capability that complements Hadoop. For interactive data mining, machine learning, and stream processing, Spark is the way to go. For businesses requiring more scalable infrastructure, enabling them to add servers for growing workloads, Hadoop and MapReduce are the better bet. Using both in a complementary approach gets organizations the best that each has to offer.

Talk of the death of Hadoop always seemed a little hasty, no matter how impressive Spark’s capabilities have been. There’s no denying the advantages that Spark brings to the table, but Hadoop isn’t going to just disappear; Spark was never designed to replace it anyway. When the two are used in tandem, businesses gain the advantages of both. While the movement toward real-time analytics will continue, Hadoop will still be needed and readily available for all companies.


Rick Delgado: I’ve been blessed to have a successful career and have recently taken a step back to pursue my passion for freelance writing. I love to write about new technologies and keeping ourselves secure in a changing digital landscape. I occasionally write articles for several companies, including Dell.


Photo credit: Ben K Adams / Photo / CC BY-NC-ND

Three Key New Features from Aerospike’s Extensive Upgrade (15 December 2014)

Back in August, Aerospike announced they were open-sourcing their signature platform. At the beginning of December, they were back again with news of a record-breaking Google Compute Engine Benchmark. Now, to round off what has been an exceptional year for the flash-optimised, in-memory NoSQL database, they’ve released a whole raft of updates to their database.

The full announcement is, in a word, extensive; you can read the whole developer’s Christmas wish list of updates here. We recently discussed the announcement with Aerospike’s CTO Brian Bulkowski, who highlighted three key developments he believes are instrumental in fuelling the real-time applications of tomorrow.

1. A Whole Host of New Clients

One of the key headlines from the announcement is the set of new clients for Python, PHP, Go and Ruby, plus upgrades to Aerospike’s existing Java, Node.js and .NET/C# clients. Bulkowski highlighted that the broad range of clients demonstrates Aerospike’s commitment to their community.

“In the polyglot language world, you can’t just have a world-class database; you have to be in the community, you have to have different connectors, you have to be giving back to your community,” he explained. “Monica [Pal, Aerospike CMO]’s catchphrase for this release is ‘It’s about what we’re giving to developers, and what developers are giving to us’. It’s about having a great Ruby client, having a great PHP client, having a great Go client; we’re having a huge amount of success with Go, which is a little underappreciated as a language. On paper, it looks a little like a laundry list, but what it means to our communities is that they know we have them covered.”
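
For readers who haven’t tried Aerospike, the snippet below is a minimal sketch of a key-value round trip using the Java client, one of the clients upgraded in this release; the host address, namespace, set and bin names are placeholders of mine, not anything from the announcement.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class AerospikeHello {
    public static void main(String[] args) {
        // Connect to a local node; host, namespace and set names are placeholders.
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        Key key = new Key("test", "users", "user-42");
        Bin name = new Bin("name", "Ada");
        Bin visits = new Bin("visits", 7);

        // Write the record (null = default write policy), then read it back.
        client.put(null, key, name, visits);
        Record record = client.get(null, key);
        System.out.println(record.getString("name") + " / " + record.getInt("visits"));

        client.close();
    }
}
```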

2. Hadoop Integration

The integration of a Hadoop InputFormat allows Hadoop tools to analyse data stored in Aerospike directly, without first ETL-ing it into HDFS. Aerospike’s Indexed MapReduce also lets you analyse specific subsets of data, eliminating the need to scan an entire data lake as is typical with HDFS.

As Bulkowski explained, “The nature of analytics is, once you’ve written your analysis, you don’t want to have to do it again- it’s troublesome, it’s error-prone- you want to use the tools you have. By backing the Hadoop tooling- even though we know our own tooling is better, faster and more capable- being compatible when it comes to analytics is really a benefit. Our key value comes from being row-oriented, rather than a streaming store like HDFS- so what we looked at was, what can we give to an analytics community as a row-oriented database? The benefit is not having to ETL- being able to run those jobs directly on your database in a safe and sane fashion.”
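
As a rough idea of what running standard Hadoop tooling directly against Aerospike might look like, the sketch below wires a custom input format into an ordinary MapReduce job; note that AerospikeInputFormat and the configuration keys are illustrative stand-ins, since the announcement does not spell out the connector’s exact class or property names.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AerospikeAnalysisJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connection details for the Aerospike cluster. These property keys are
        // illustrative stand-ins, not the connector's documented names.
        conf.set("aerospike.input.host", "127.0.0.1");
        conf.set("aerospike.input.namespace", "test");

        Job job = Job.getInstance(conf, "analyse-aerospike-data");
        job.setJarByClass(AerospikeAnalysisJob.class);

        // Hypothetical class name: the point is that the connector supplies an
        // InputFormat, so the job reads records straight from Aerospike rather
        // than requiring an ETL step into HDFS first.
        job.setInputFormatClass(AerospikeInputFormat.class);

        // Plug in your Mapper/Reducer classes here exactly as you would for any
        // other MapReduce job, then point the output wherever you need it.
        FileOutputFormat.setOutputPath(job, new Path("/results/aerospike-report"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```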

3. Docker Containers

As we have previously reported, taking a developer-centric approach in 2014 has become synonymous with Docker support. Bulkowski was certainly excited by the rapid development of containerisation, but was keen to stress the technology has a long way to go before it reaches maturity. “The containerisation of both the data centres and also the cloud are still in their infancy,” he remarked. “From a technical perspective, I think it’s a great idea. Containerisation definitely has the potential to reset the game in terms of high-IO applications like us. In terms of being in the community, having a Docker image that people can start playing with is great, but it’s still early days.”

These three headlines only really scratch the surface of this extensive announcement; Aerospike have also announced enhancements to their enterprise security, a raft of storage and performance improvements, and 33 key community contributions, just three months after going open source. I would urge you to give the full list a read here, and to keep your eyes peeled for further Aerospike innovation in 2015.


(Image credit: Aerospike)

The One Language A Data Scientist Must Master (30 October 2014)

When business leaders read about (and tackle) Big Data, there is a lot to take in.

The field is developing so dynamically that many of the industry buzzwords did not exist until a few short years ago. Just a short list of programming languages is enough to make most business leaders dizzy. R, C, Python, Java, Julia, Scala, Ruby: just a few of the languages that our grandchildren might be learning at high school. There will be many others; you can be sure of that.

There is one language in which every Data Scientist should be fluent: Business

As recruiters, we of course assess our candidates for the hard, technical skills. We look at the projects they have completed and how they rate on Kaggle. We can run rigorous technical competency checks to ascertain their professional level. That is all measurable. You either have the knowledge and the skills or you don’t.

However, the difference between a good Data Scientist and a GREAT Data Scientist is often not found in their technical ability or their amazing mathematical genius. Nope. Data Science exists to provide a service to business and business is run by people. If Data Scientists cannot comfortably communicate with their non-expert colleagues and bosses, then their effectiveness is greatly reduced. They need to be able to speak easily with people, to understand, to interpret, to translate.

They have to understand the issues of their business and give guidance in providing the data to reach the best solutions. They have to be adept at facilitating a continuous dialogue loop – from business to the Data Science / Big Data teams and then back to the business. Great data scientists will not just address business problems; they will pick the right problems that can have the most value to the organization.

They have to be able to present their findings in a clear and simple way – in the language of their business. Not all people understand the technical jargon. The candidates who can explain what they have achieved without blowing my mind with jargon are those who usually go far. Accurate numbers and graphs are one thing, but only the data scientist understands them well enough to be able to draw the crucial business conclusions. They have to interpret and translate.

Many mid-level candidates struggle with this initially. They have not had much senior management interaction and have mostly been fairly insular in terms of their work circle within a company. The solution going forward is to give them more exposure to the business, and to introduce the value of Big Data to their respective mid-management colleagues across all departments.

The organizations making the most of Big Data are now integrating their Data Science teams far more closely with the rest of their business. They will grow up together as a team and learn to talk to each other more effectively.

They will learn to speak each other’s language.


Matt Reaney is the Founder and Director at Big Cloud. Big Cloud is a talent search firm focussing on all things Big Data and helps innovative organisations across Europe, APAC and the US find the talent they need to grow.


IEEE Ranks Programming Languages, Java Comes Out on Top (14 July 2014)


IEEE, the “world’s largest professional association for the language of technology”, have released an interactive ranking of programming languages. In the overall list, as well as many of the sub-rankings, Java emerges as the victor.

The IEEE blog details how they went about creating their definitive rankings.

“Starting from a list of over 150 programming languages gathered from GitHub, we looked at the volume of results found on Google when we searched for each one in the pattern “X programming” where “X” is the name of the language. We filtered out languages if they had a very low number of search results and then went through the list by hand to identify the most interesting languages. We labeled each language according to its use in Web, mobile, enterprise/desktop, or embedded environments.

Our final set of 49 languages includes names familiar to most computer users like Java, stalwarts like Cobol and Fortran, and more recent but niche languages like Erlang.”

From there they ranked the languages using 12 differently weighted metrics drawing on data from 10 sources, including Google Trends, GitHub, Hacker News and CareerBuilder. The interactive ranking (which you can explore for yourself here) also allows you to break down the results by language type (Web, Mobile, Enterprise, Embedded) and create custom filters to see which languages are prevalent in your industry.
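
As a toy illustration of how a multi-metric ranking like this can be computed, the sketch below combines a handful of normalised metric values with arbitrary weights; the metric names echo the sources listed above, but the numbers and weights are invented and do not reflect IEEE’s actual methodology.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy example only: the metric values and weights below are invented, not IEEE's.
public class WeightedLanguageScore {
    public static void main(String[] args) {
        // Each entry is {metric value normalised to 0..1, weight for that metric}.
        Map<String, double[]> metrics = new LinkedHashMap<>();
        metrics.put("Google Trends", new double[]{0.92, 0.10});
        metrics.put("GitHub activity", new double[]{0.88, 0.15});
        metrics.put("Hacker News mentions", new double[]{0.75, 0.05});
        metrics.put("CareerBuilder job ads", new double[]{0.81, 0.20});

        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (double[] valueAndWeight : metrics.values()) {
            weightedSum += valueAndWeight[0] * valueAndWeight[1];
            totalWeight += valueAndWeight[1];
        }

        // Normalise by the total weight so scores stay comparable if a custom
        // filter drops some of the metrics (as the interactive ranking allows).
        System.out.printf("Weighted score: %.3f%n", weightedSum / totalWeight);
    }
}
```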

In their overall ranking, their “Trending” ranking (which languages are growing rapidly), their “Jobs” ranking (the most in demand with employers) and their “Open” ranking (popularity on social media and open-source hubs), Java emerged as the victor. Hot on Java’s heels in each of the rankings were C and C++, with Python and JavaScript also performing well across the metrics.

The post has already received 265 comments from developers. The main sore point appears to be not the rankings themselves, but whether the rankings should say PERL or Perl. If you have a particularly strong view on the matter, feel free to delve into the discussion here.

(Image credit: IEEE Spectrum)
