Big Data is a Big Deal, and Hadoop is arguably the driving force in Big Data. But as awesome as Hadoop is – and it is quite awesome – it’s incomplete. For many things, Hadoop’s batch workflow is just too slow. You wouldn’t calculate trending topics for Twitter using Hadoop, nor would a hedge fund look for stock trends in real-time using Hadoop. Because Hadoop doesn’t do real-time. So the trick is to marry the powerful batch processing capabilities of Hadoop with a front-end preprocessing engine that works in real-time.
Like Storm, the project Twitter inherited when it acquired BackType in 2011.
At Nodeable we use Storm to surface real-time insights from system data, whether that system is GitHub or AWS or Salesforce.com or Twitter or any number of other data sources. We operate under the assumption that users need both real-time insights (Storm) and timely information (Hadoop). Crunching all of a business’ data in real-time would be impractical, and frankly not all that useful. So Hadoop’s batch approach to data mining remains great for a wide variety of jobs.
But not when you need to know something right now. As Metamarkets CEO Michael Driscoll noted at a recent Churchill Club event, “Hadoop is not like having a conversation with your data. Instead it’s like having a pen pal that you write from time to time.” For things like clickstream analysis, IT early-warning systems, and security and fraud detection, that cadence isn’t fast enough. So Storm is a great complement.
There are alternatives to Storm, of course. HStreaming, StreamBase, and Yahoo! S4 each offer real-time complements to Hadoop, though S4 is arguably the most like Storm. We opted for Storm for many of the reasons Dan Lynn highlights in a presentation he gave at Gluecon. It’s open source. It works really well. And it gives our engineering team a great deal of flexibility.
But whether you use Storm or something else, you likely do need to figure out how to complement Hadoop with real-time processing.
Nodeable can help. Nodeable provides real-time data streaming for Hadoop, which means that we provide front-end processing — summaries, counts, anomalies, status, trends — before data hits Hadoop, turning raw data into useful information. In other words, we not only give you real-time insight into your systems, but also normalize and enhance your data to make your Hadoop batch processing much more efficient.
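To make the idea of front-end processing concrete, here is a minimal sketch of the pattern: keep rolling counts over a window of recent events and flag keys whose counts spike, all before anything is written to batch storage. This is an illustration of the general technique only — the class and method names are hypothetical, and it is not Nodeable’s actual pipeline.

```python
from collections import Counter, deque

class StreamSummarizer:
    """Illustrative stream pre-processor: rolling counts plus a
    simple anomaly flag, computed in real time before events are
    handed off to batch storage. (A sketch, not a real product API.)"""

    def __init__(self, window=100, threshold=3.0):
        self.window = deque(maxlen=window)  # most recent events only
        self.totals = Counter()             # running all-time totals
        self.threshold = threshold          # anomaly multiplier

    def ingest(self, key):
        """Record one event keyed by source/type (e.g. 'login')."""
        self.window.append(key)
        self.totals[key] += 1

    def summary(self):
        """Per-key counts over the recent window (the real-time view)."""
        return Counter(self.window)

    def anomalies(self):
        """Keys whose windowed count exceeds `threshold` times the
        mean windowed count -- a crude spike detector."""
        win = self.summary()
        if not win:
            return []
        mean = sum(win.values()) / len(win)
        return [k for k, c in win.items() if c > self.threshold * mean]
```

For example, feeding in eight `"login"` events alongside one `"click"` and one `"view"` with `threshold=2.0` flags `"login"` as anomalous. In a Storm deployment the same logic would live in a bolt, with the summaries emitted downstream instead of returned from a method.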
Please sign up for beta access and tell us what you think.