Big Data is a Big Deal, and Hadoop is arguably the driving force in Big Data. But as awesome as Hadoop is – and it is quite awesome – it’s incomplete. For many things, Hadoop’s batch workflow is just too slow. You wouldn’t calculate trending topics for Twitter using Hadoop, nor would a hedge fund look for stock trends in real-time using Hadoop. Because Hadoop doesn’t do real-time. So the trick is to marry the powerful batch processing capabilities of Hadoop with a front-end preprocessing engine that works in real-time.
Like Storm, the project Twitter inherited when it acquired BackType in 2011.
At Nodeable we use Storm to surface real-time insights from system data, whether that system is GitHub or AWS or Salesforce.com or Twitter or any number of other data sources. We operate under the assumption that users need real-time insights (Storm) and timely information (Hadoop). It would be impossible to crunch all of a business's data in real time, and frankly not necessarily all that useful. So Hadoop’s batch approach to data mining is great for a wide variety of jobs.
But not when you need to know something right now. As Metamarkets CEO Michael Driscoll noted at a recent Churchill Club event, “Hadoop is not like having a conversation with your data. Instead it’s like having a pen pal that you write from time to time.” For things like clickstream analysis, IT early-warning systems, security and fraud detection, etc., that’s not fast enough. So Storm is a great complement.
There are alternatives to Storm, of course. Hstreaming, Streambase, and Yahoo S4 each offer real-time complements for Hadoop, though S4 is arguably the most like Storm. We opted for Storm for many of the reasons Dan Lynn highlights in a presentation he gave at Gluecon. It’s open source. It works really well. And it gives our engineering team a great deal of flexibility.
But whether you use Storm or something else, you likely do need to figure out how to complement Hadoop with real-time.
Nodeable can help. Nodeable provides real-time data streaming for Hadoop, which means that we do front-end processing (summaries, counts, anomalies, status, trends) before data hit Hadoop and turn those data into useful information. In other words, we not only give you real-time insight into your systems, but also normalize and enhance your data to make your Hadoop batch processing much more efficient.
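To make the idea of front-end counts and trends concrete, here is a minimal sketch in plain Python of a rolling-window counter that annotates each event before it is handed to a batch store. It is only an illustration under stated assumptions: the names `RollingCounter` and `summarize` are hypothetical, events are assumed to arrive in timestamp order, and nothing here is Nodeable's implementation or Storm's API.

```python
from collections import Counter, deque


class RollingCounter:
    """Counts events per key over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # live count per key inside the window

    def add(self, timestamp, key):
        """Record one event, then drop anything that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Assumption: events arrive in timestamp order, so the oldest is at the left.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n):
        """Return the n most frequent keys currently in the window."""
        return self.counts.most_common(n)


def summarize(stream, window_seconds=60):
    """Yield (raw_event, window_count) pairs for the batch layer to store."""
    counter = RollingCounter(window_seconds)
    for timestamp, key in stream:
        counter.add(timestamp, key)
        yield (timestamp, key), counter.counts[key]


if __name__ == "__main__":
    # Made-up sample events: (seconds, topic)
    sample = [(0, "a"), (10, "a"), (20, "b"), (70, "a")]
    for event, count in summarize(sample, window_seconds=60):
        print(event, count)
```

The point of the sketch is the division of labor: the streaming side keeps only a small window of state and answers "what is happening right now," while the enriched records still flow on to the batch side for deeper analysis later.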
Please sign up for beta access and tell us what you think.
Reblogged this on Data Science 101 and commented:
Nice discussion of Hadoop and Storm and where each one is most useful.
I love this topic area re: where Big Data meets Fast Data. Storm is an amazing tool. I’m also a big fan of Apache Flume. We’re using it as part of our implementations over at Infochimps to do a lot of real-time analytics / streaming data processing. Depending on the use case, we’re finding that sometimes Flume is the right near-real-time tool for the job, while in other instances Storm feels like a better fit.
Tim: We actually tried to use Flume at first but ran into a fair number of problems. I’ve heard the latest version is much better. Care to comment on how Flume compares? I’d love your feedback.
Flume is absolutely awesome for getting data around. I would say Flume is complementary to Storm; there's no comparison really. Flume and Storm play a massive role in our architecture. Generally, Flume gets data into HDFS for us. I’ve been working on a Flume spout for Storm.
I’d personally say Storm is a better fit for when you need to do immediate processing of your incoming data, e.g. checking analytics, rolling window analysis, etc. Flume is great for aggregating all your data to begin with.
I’m the Senior Backend Engineer @ FullContact; Dan Lynn is our CTO.
[...] of which brings me back to the point I made earlier this week: some data are best analyzed in real time, not batch. For many things, you’ll actually want both: a real-time view into what’s happening [...]
[...] noted elsewhere, Storm competes with Hstreaming, Streambase, and Yahoo S4. [...]
[...] not perfect. Hadoop can’t do real-time, for example, which is why Nodeable buttresses its fantastic batch-processing capabilities with the real-time computation heroics of Storm. Storm, of course, is also open [...]