Big Data is a big market, projected to top $16.9 billion by 2015, according to IDC. The Hadoop ecosystem alone is worth $1 billion per year, according to Forrester, and is set to explode by most accounts. What is less clear is for whom Big Data is big, and whether the workloads they’re currently running through Hadoop might not be best complemented (or, in some cases, replaced) with real-time analytics tools like Storm.
After all, given that 32 percent of Karmasphere’s survey participants are running Hadoop clusters smaller than 2 terabytes, and 55 percent are running clusters of 10 terabytes or less, the “Big” in “Big Data” really isn’t. Not yet, anyway. That will likely change as enterprises move from toe-dipping to diving into Hadoop and Big Data in a big way, but for now the workloads aren’t huge, and real-time tools like Storm might be ideal for managing them.
These workloads are also not necessarily being run behind the firewall. While both Cloudera and Hortonworks are booming due to enterprises keeping their Hadoop jobs running primarily in their data centers, Amazon is already managing in excess of one million Hadoop clusters with its Elastic MapReduce service. This is perhaps not surprising given that the majority of Big Data users tend to be business users, not hard-core IT people, according to Karmasphere’s survey. These people are apparently very comfortable having their data processed in the cloud.
Interestingly, while there are numerous great applications for Hadoop, the majority of users seem to be applying it to marketing-related functions.
All of which brings me back to the point I made earlier this week: some data are best analyzed in real time, not in batch. For many things, you’ll actually want both: a real-time view into what’s happening with your website, HR systems, etc., as well as a deeper, Hadoop-based analysis done in batch, after the fact.
Real-time analytics tools like Nodeable (based on the open-source Storm project) are not a replacement for Hadoop. They’re complements.
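To make the batch-versus-streaming distinction concrete, here is a minimal sketch (in Python, with hypothetical event names; neither Hadoop nor Storm is invoked): the batch job recomputes an aggregate over the full log after the fact, while the streaming counter updates the same aggregate the moment each event arrives. Both converge on the same answer; only the latency differs.

```python
from collections import Counter

# Hypothetical click-event log for an online retail site.
events = ["home", "search", "product", "search", "checkout", "product"]

def batch_count(log):
    """Batch analysis (Hadoop-style): recompute over the full log, after the fact."""
    return Counter(log)

class StreamingCounter:
    """Real-time analysis (Storm-style): update the aggregate as each event arrives."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, page):
        self.counts[page] += 1
        return self.counts[page]  # a fresh count is available immediately

stream = StreamingCounter()
for e in events:
    stream.on_event(e)

# Both views converge on the same totals; the streaming view is simply always current.
assert stream.counts == batch_count(events)
```

The point of the sketch is that the two approaches answer the same question on different clocks, which is why they complement rather than replace each other.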
Given that so much of the data being analyzed with Hadoop are still relatively small and marketing-focused, not to mention being analyzed in Amazon’s cloud, I’d argue that more of today’s data, not less, should be run through real-time analytics systems, and particularly hosted systems. After all, while it’s useful to know how aspects of your online retail site are working hours or days or months after the fact, you actually want the “next click” to reflect real-time analysis, as Yahoo CTO Raymie Stata argues:
With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes.…[I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.
Again, this isn’t to denigrate the importance of Hadoop. At all. It’s simply to suggest that for many applications, relying on batch-oriented Hadoop alone is an incomplete strategy. Real-time analysis is required for many applications, particularly those where Hadoop is being used today, and that real-time capability can be delivered through Storm-based tools like Nodeable or other real-time analytics systems.
It’s not Storm or Hadoop. It’s Storm and Hadoop.