
The gold in the Big Data hills is fueled by open-source software

Who knew that there could be so much money in analyzing data?  According to IDC, the analytics market will be worth $50.7 billion by 2016.  And that’s without breaking a sweat.

The value in analytics scales with the mountains of data we amass.  It’s akin to something I wrote about years ago for CNET: 21st Century businesses thrive by driving abundance and then selling minimization of complexity inherent in that abundance.  It’s what Red Hat does with open source, what Google does with search, and what Facebook does with social.  It’s also what companies like Nodeable do with all the data your marketing/IT operations/sales/etc. systems throw off.

There are a few interesting stories buried in the growth of analytics, but perhaps the biggest is Hadoop.  While the overall analytics market grew by 14 percent in 2011, IDC has the Hadoop market growing by 60 percent each year through 2016.  Admittedly, that’s off a small base, but at that pace the Hadoop ecosystem, which Forrester already sizes at $1 billion per year, will be very, very big.
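To put that pace in perspective, here is a back-of-the-envelope sketch of what 60 percent annual growth does to a $1 billion base. The 2012 starting year and the assumption of a constant rate are mine, not IDC's or Forrester's, so treat the output as an illustration rather than a forecast.

```python
def project_market(base_billions: float, annual_growth: float, years: int) -> list[float]:
    """Return the market size for years 0..years under constant compound growth."""
    return [base_billions * (1 + annual_growth) ** n for n in range(years + 1)]


# Assumptions (not from the analysts): a $1 billion base in 2012,
# growing 60 percent each year through 2016.
projection = project_market(1.0, 0.60, 4)
for year, size in zip(range(2012, 2017), projection):
    print(f"{year}: ${size:.2f} billion")
```

Under those assumptions the ecosystem would be worth roughly $6.5 billion by 2016, more than six times its starting size.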

Big money for Big Data.

And for such a comparative pittance.  Hadoop, as I’ve argued before, has democratized data.  Big Data analytics used to be the province of expensive data warehousing systems, complete with proprietary software, expensive, proprietary hardware, and a smiling salesperson with their palm out, waiting for you to mortgage your house.  Not anymore.  Hadoop is open source and is run on commodity hardware.  It’s a game even cash-strapped organizations can play.

It’s not perfect.  Hadoop can’t do real-time, for example, which is why Nodeable buttresses its fantastic batch-processing capabilities with the real-time computation heroics of Storm.  Storm, of course, is also open source.

Which is what is so fascinating in this Big Data gold rush: it’s being driven by free and open-source software.  No wonder the market is growing in such dramatic fashion: it’s not being gated by vendors anymore.  That’s good news for all of us…including vendors.


Are we trying to fit square Hadoop pegs into round real-time holes?

Big Data is a big market, projected to top $16.9 billion by 2015, according to IDC.  The Hadoop ecosystem, alone, is worth $1 billion per year, according to Forrester, and is set to explode by most accounts.  What is less clear is for whom Big Data is big, and whether the workloads they’re currently running through Hadoop might be better complemented (or, in some cases, replaced) by real-time analytics tools like Storm.

After all, given that 32 percent of Karmasphere’s survey participants are running Hadoop clusters smaller than 2 terabytes, and 55 percent are running clusters of 10 terabytes or less, the “Big” in “Big Data” really isn’t.  Not yet, anyway.  That will likely change as enterprises move from toe-dipping to diving into Hadoop and Big Data in a big way, but for now the workloads aren’t huge, and real-time tools like Storm might be ideal for managing them.

These workloads are also not necessarily being run behind the firewall.  While both Cloudera and Hortonworks are booming due to enterprises keeping their Hadoop jobs running primarily in their data centers, Amazon is already managing in excess of one million Hadoop clusters with its Elastic MapReduce service.  This is perhaps not surprising given that the majority of Big Data users tend to be business users, not hard-core IT people, according to Karmasphere’s survey.  These people are apparently very comfortable having their data processed in the cloud.

Interestingly, while there are numerous great applications for Hadoop, the majority seem to be using it for marketing-related functions.

All of which brings me back to the point I made earlier this week: some data are best analyzed in real time, not batch.  For many things, you’ll actually want both: a real-time view into what’s happening with your website, HR systems, etc., as well as a deeper, Hadoop-based analysis that is done in batch, after the fact.

Real-time analytics tools like Nodeable (based on the open-source Storm project) are not a replacement for Hadoop.  They’re complements.

Given that so much of the data being analyzed with Hadoop are still relatively small and marketing-focused, not to mention being analyzed in Amazon’s cloud, I’d argue that more of today’s data, not less, should be run through real-time analytics systems, and particularly hosted systems.  After all, while it’s useful to know how aspects of your online retail site are working hours or days or months after the fact, you actually want the “next click” to reflect real-time analysis, as Yahoo CTO Raymie Stata argues:

With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes.…[I]t will never be true real-time.  It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.

Again, this isn’t to denigrate the importance of Hadoop.  At all.  It’s simply to suggest that for many applications, relying on batch-oriented Hadoop alone is an incomplete strategy.  Real-time is required for many applications, particularly those where Hadoop is being used today, and that real-time capability is delivered through Storm-based Nodeable or other real-time analytics systems.

It’s not Storm or Hadoop.  It’s Storm and Hadoop.


When Hadoop isn’t fast enough: The Argument for Storm

Big Data is a Big Deal, and Hadoop is arguably the driving force in Big Data.  But as awesome as Hadoop is – and it is quite awesome – it’s incomplete.  For many things, Hadoop’s batch workflow is just too slow.  You wouldn’t calculate trending topics for Twitter using Hadoop, nor would a hedge fund look for stock trends in real-time using Hadoop.  Because Hadoop doesn’t do real-time.  So the trick is to marry the powerful batch processing capabilities of Hadoop with a front-end preprocessing engine that works in real-time.
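As a rough illustration of the difference, here is a minimal sketch of the kind of computation a real-time front end performs: a sliding-window counter that is updated on every event instead of waiting for a batch job to finish. It is plain Python, not Storm's API; the class name, the 60-second window, and the sample events are all hypothetical, and timestamps are assumed to arrive in order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts events seen in the last `window_seconds`, updated per event."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # live counts for keys still inside the window

    def add(self, timestamp: float, key: str) -> None:
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n: int) -> list[tuple[str, int]]:
        return self.counts.most_common(n)


window = SlidingWindowCounter(window_seconds=60)
for ts, topic in [(0, "hadoop"), (10, "storm"), (20, "storm"), (70, "storm"), (75, "nodeable")]:
    window.add(ts, topic)
print(window.top(2))
```

The answer is available the moment the latest event arrives, which is the property a batch job cannot offer. The trade-off is that the counter only knows about the current window; the deep historical view still belongs to the batch layer.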

Like Storm, the project Twitter inherited when it acquired BackType in 2011.

At Nodeable we use Storm to surface real-time insights from system data, whether that system is GitHub or AWS or Salesforce.com or Twitter or any number of other data sources.  We operate under the assumption that users need real-time insights (Storm) and timely information (Hadoop).  It would be impossible to crunch all of a business’s data in real-time, and frankly not necessarily all that useful.  So Hadoop’s batch approach to data mining is great for a wide variety of jobs.

But not when you need to know something right now.  As Metamarkets CEO Michael Driscoll noted at a recent Churchill Club event, “Hadoop is not like having a conversation with your data.  Instead it’s like having a pen pal that you write from time to time.”  For things like clickstream analysis, IT early-warning systems, security and fraud detection, etc., that’s not fast enough.  So Storm is a great complement.

There are alternatives to Storm, of course.  HStreaming, StreamBase, and Yahoo S4 each offer real-time complements for Hadoop, though S4 is arguably the most like Storm.  We opted for Storm for many of the reasons Dan Lynn highlights in a presentation he gave at Gluecon.  It’s open source.  It works really well.  And it gives our engineering team a great deal of flexibility.

But whether you use Storm or something else, you likely do need to figure out how to complement Hadoop with real-time.

Nodeable can help.  Nodeable provides real-time data streaming for Hadoop, which means that we provide front-end processing (summaries, counts, anomalies, status, trends) before data hit Hadoop, and turn those data into useful information.  In other words, we not only give you real-time insight into your systems, but also normalize and enhance your data to make your Hadoop batch processing much more efficient.
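What might that front-end processing look like?  The sketch below is one hypothetical way to do it, not Nodeable's actual implementation: it keeps a running mean and standard deviation (Welford's online algorithm), flags values that sit far from what has been seen so far, and hands a compact summary to the batch layer.  The function names, the threshold of three standard deviations, the five-sample warm-up, and the CPU readings are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean/variance, so no raw history has to be kept."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


def summarize(stream, threshold: float = 3.0):
    """Flag outliers as they arrive; return (summary, anomalies) for the batch layer."""
    stats = RunningStats()
    anomalies = []
    for value in stream:
        # Compare against what has been seen so far, before folding the value in.
        if stats.count >= 5 and stats.stddev > 0:
            if abs(value - stats.mean) / stats.stddev > threshold:
                anomalies.append(value)
        stats.update(value)
    summary = {"count": stats.count, "mean": stats.mean, "stddev": stats.stddev}
    return summary, anomalies


# Hypothetical CPU utilization readings, in percent.
cpu_percent = [41, 43, 40, 42, 44, 41, 97, 42]
summary, anomalies = summarize(cpu_percent)
print(summary["count"], anomalies)
```

Note that this simple version folds the outlier into the running statistics after flagging it, which inflates the standard deviation for later readings; a production system would likely treat flagged values more carefully.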

Please sign up for beta access and tell us what you think.


Hadoop and Storm are shifting the industry toward Big Data-enabled cloud applications

Dave and I were fortunate to attend a Churchill Club event on Hadoop Tuesday night.  Hadoop sits at the center of the burgeoning Big Data universe, and so one might be tempted to conclude that it’s basically a finished product.  Not so, said the esteemed panel, which included representatives from Cloudera, Facebook, Metamarkets, MapR, and Oracle.  In fact, arguably the biggest opportunity in Hadoop isn’t Hadoop at all: it’s the cloud applications built on top of Hadoop.

Dave summarized the panel discussion on CNET, and highlighted Cloudera CEO Mike Olson’s call to arms for Hadoop-based applications.  It’s something Olson has said before, including here on this blog, but it was particularly poignant against the backdrop of a deep, engaging discussion about Hadoop’s pros (powerful, open source) and cons (batch-oriented, complex, somewhat inefficient).

And it’s why I think Nodeable is a sign of the times.

We’re an application that depends upon Hadoop.  But we’re also a technology that improves Hadoop by front-ending it with Storm.  Hadoop is powerful but limited to batch-oriented processing of data.  Storm actually crunches data in real-time, in the stream.  The combination of the two is potent, and something that we only discovered while building out our application to ingest systems data and extrapolate insights via in-stream data analytics.

In the near future the back-end data processing via Storm and Hadoop will be managed behind the scenes by cloud applications, as Workday co-CEO Aneel Bhusri tweeted from the Churchill Club event.  For now, companies like Nodeable are helping to bridge the divide between complex infrastructure and simplified applications.


The problem with treating people like data: Learning from Autonomy’s mistakes

As much as we tout the importance of data in today’s fast-paced markets, Autonomy CEO Mike Lynch is a poignant reminder that people matter, too.  A lot.

HP bought Autonomy in late 2011 for $10 billion.  Autonomy was one of the UK’s brightest tech stars, but its CEO, Mike Lynch, was known to be something of a difficult personality.  How difficult?  So difficult that Autonomy employees gave Lynch a measly 20 percent approval rating.  If the pundits think President Obama has a tough road ahead of him with a nearly 50 percent approval rating, imagine Lynch’s likelihood of getting elected.

No.  Way.

In fact, as Wired reports, the only way HP could maximize the value of its $10 billion acquisition was to fire Lynch.  This is ironic, given that Autonomy’s business is to “make sense of and process unstructured, ‘human information,’ and draw real business value from that meaning.”  The company that enables others to glean meaningful information from unstructured data was hard-pressed to treat its employees as anything more than cogs in a machine, to be tightened and tweaked to force them to perform.

In other words, as much as we may want to boil business down to 1s and 0s, ultimately all business is about meeting human needs, not only as customers but also as employees.  Even Nodeable, which ingests machine data, processes it in real-time, and outputs useful insights is ultimately in the business of serving people, not machines.

Autonomy built a good business based on serving customer needs.  But it started to decline as its employees struggled under apparently tyrannical working conditions.  By showing Lynch the door, HP has taken the first step toward treating both its customers and its employees with respect, which turns out to be very good business.


Survey: CIOs are confused on prioritizing IT projects, and especially on how to pay for them

CIOs are an optimistic lot these days.  According to recent survey data from InformationWeek, 61 percent of those IT executives surveyed indicate that their IT budgets will remain the same or shrink.  Yet the vast majority claim that important new projects for cloud, Big Data, security, and more will come from “new money” rather than “savings.”

How does that math work?

It’s not as if the proposed projects aren’t useful.  As shown below, CIOs seem to have a good handle on where money should be spent:

What they lack, of course, is a grasp on reality in terms of funding all these projects, as shown here:

And while we at Nodeable don’t have the be-all, end-all answer for how to fund these projects, we can suggest one: optimize efficiency of existing resources.  This actually fits IT priorities, generally.  After all, four of the top-five projects identified are “block-and-tackle” projects that improve existing systems rather than introducing a gee-whiz new line of business system.


The difference, of course, is that one can introduce a system like Nodeable and not only bring down costs (by tuning cloud systems based on our trending data, anomalies we flag in how resources are being used, etc.), but also drive one’s business by analyzing how resources are being used at a macro level.

I can see, for example, who is most active in handling JIRA tickets.  I can see which of my developers show up most often in the GitHub activity stream.  And while I can of course track waste in my use of AWS, I can also benchmark how my company manages its storage and compute resources against how others do.

Ultimately, what needs to be done is to bring IT into better alignment with business goals.  The DevOps trend does this by reducing bureaucracy, allowing developers to get work done with a minimum of overhead.  This is the crowd Nodeable hopes to enable.

Otherwise, we end up with a mismatch of resources with goals, as InformationWeek points out:

What about hybrid clouds and cloud bursting, an activity that promises to dramatically change the face of IT spending and human resourcing as we know it? Marquee names like Zynga and DreamWorks are just two pioneers that have managed to optimize their internal infrastructure spend by balancing private and public cloud. Yet only 10% of our survey respondents identify private cloud as a top priority.

Worse, the project that came in at No.12 of 12, with a whopping 2%—launching or upgrading an enterprise social networking platform—is one that has the attention of non-IT partners….We guarantee you that if we had surveyed CMOs and their direct reports instead of CIOs and their reports, social would have been near the top.

Enterprises need to figure out how to do more with less, and that “less” means getting to more productivity with “less money,” which often will necessitate less cumbersome and costly bureaucracy.  Nodeable offers one way to accomplish this, and no doubt you can think of others.  It’s only as IT becomes more agile and joined-at-the-hip with business requirements that it’s going to be a hero in 2012.


Hadoop as a ‘data refinery’ – good overview of how Hadoop works

Hortonworks’ Shaun Connolly is a smart man.  I say this not because he’s able to use big words and dense technical explanations to confuse me.  Lots of people can do that.  No, what makes Shaun smart is that he’s able to boil down complex systems into easily understandable ideas.  At dinner earlier this week he demonstrated this by equating Hadoop with a “data refinery,” performing a similar function for data as oil refineries manage for crude oil.

The analogy was useful to me.  I don’t work for Chevron but understanding the oil refining process is relatively simple.  You take crude oil and through distillation (i.e., boiling the oil to separate out hydrocarbons based on their vaporization temperatures) or chemical processing, you’re able to refine it into a form that is useful to power cars, lawn mowers, or laptops. (!)

Hadoop is very similar, minus the boiling and chemical processing.  It’s a data-processing framework that can ingest huge quantities of unstructured data like tweets or credit card receipts, and output reports that average humans can understand. One of the best explanations I’ve seen for how Hadoop works was written by a software engineer named Matt.  (No relation.)
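For readers who want to see the refinery's moving parts, here is a toy, single-process imitation of the MapReduce pattern that Hadoop runs across many machines.  It is plain Python rather than Hadoop's API, and the function names and sample tweets are made up for illustration: a map step emits key/value pairs from raw records, a shuffle step groups them by key, and a reduce step boils each group down to one result.

```python
from collections import defaultdict


def map_phase(record: str):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word.strip(".,!?#@"), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        if key:  # skip tokens that were pure punctuation
            grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]):
    """Collapse all the values for one key into a single total."""
    return key, sum(values)


# Made-up input records standing in for "crude" unstructured data.
tweets = [
    "Hadoop is batch",
    "Storm is real-time",
    "Hadoop and Storm, together",
]
pairs = (pair for tweet in tweets for pair in map_phase(tweet))
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(word_counts)
```

The real thing differs mainly in scale: the map and reduce steps run in parallel on many servers, and the framework handles the shuffle, storage, and failures in between.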

The engineers at Nodeable probably don’t need any tutorials in the mechanics of Hadoop.  After all, we’ve built our data analytics service using Hadoop.

But I think it’s useful for our customers to understand the process by which their complex systems data gets turned into meaningful, actionable information.   We do a fair amount of pre-processing of inbound data to “massage” it into formats that Hadoop can more easily digest, which improves efficiency, among other things.

But ultimately we depend upon Hadoop to work its magic on the data, pulling in terabytes of data to yield insights, as shown at right (beta).

This is just one example of the “data refinery” magic that Hadoop can provide.  It’s also being used to optimize oil drilling, determine which spammy ads to send you on Facebook, enable governments to track your every move through video surveillance, and more.  In other words, our data refinery business is much more benign (and useful to you) than that of some others.  Best stick with us.


Is curation the future of Big Data?

I hate the word “curation.”  Or, rather, I hate that it’s one of the most overused, overhyped words in Silicon Valley these days.  Or maybe it was yesterday, before “actionable insight” became the term du jour.  But curation may be about to make a comeback.

Why?  Because it turns out that it’s really hard for machines to pull “operational insights” out of big piles of data.  This is why big enterprises are scrambling to hire data scientists, and are increasingly discovering there’s far more demand than supply.

In short, we can, as Forrester does, trumpet operational insight or actionable insight as top priorities, but achieving them is easier said than done:

Which is why I found a lunchtime conversation with my neighbor and friend, Chuck Sharp, interesting.  Chuck is the founder and CEO of Right Intel, a marketing data analytics company.  This is Chuck’s second analytics company.  His first, Sharp Analytics, was acquired by iCrossing in 2007.  As Chuck told me, one thing that he learned from his first attempt was that a dashboard-based approach to analytics doesn’t work.  People don’t reliably log into a dashboard service and, even when they do, they often struggle to understand what they should do with the data being presented.

Enter curation.

Right Intel provides a platform that makes it easy to amass and amalgamate different data sources (e.g., Twitter feeds, charts and graphs found in blog posts, etc.), but that’s only half the story.  Right Intel then connects with partners who in turn service big brands (e.g., Marriott Hotels).  Those partners (or a designee within the end user/customer) then sift through the pool of data to determine which highlights to pass on, and which actions to recommend based on the data.

It’s a pretty light-handed way to intervene to make sense of Big Data, and it might become much more common than we’d like.  As much as we’d like to assume all data can simply be crunched by machines to spit out insights, the reality is that human intelligence and intuition are going to remain relevant, and probably dispositive to getting great insight from data.

In the systems intelligence world, things are a bit easier, as a company like Nodeable, for example, can poll the APIs for AWS or GitHub and pull out somewhat structured data, and interpret that data without human intervention.  But machine data is the exception to the rule, and even here, someone needs to be looking at the output and determining the best course of action based on the data (though we do make suggestions).

Which, I suppose, is a long way of saying: it’s not too late to go back to school and get a degree in data science.  People are going to be important to Big Data for a long time.  Probably forever.


Amazon’s essential role in delivering scale and tackling Hadoop

The cloud has many virtues, but perhaps its biggest is scale.  Scale refers to the ability to throttle resources up or down to meet inbound demand on web applications or other infrastructure.  It’s a problem that most developers, whether at startups or Fortune 500 behemoths, can only dream of having.  But for the ill-prepared, scaling an application can be a nightmare.

Which is why Amazon Web Services have become such essential infrastructure for startups and enterprises alike.  As Ryan Park, operations and infrastructure leader at Pinterest, declared earlier this month at AWS Summit:

Imagine if we were running our own data centre, and we had to go through a process of capacity planning, and ordering hardware, and racking that hardware, and so on.  It just would not have been possible to scale fast enough – especially with such a small team. Until about a month ago, I was the only operations engineer at the whole company.

Think about that for a minute.  Here’s a web service with nearly 18 million visitors in February alone, which took the company just nine months to reach.  In a pre-AWS world, Pinterest would have employed scores of operations engineers to buy and manage hardware and the software to stitch it all together.  No more.

Of course, managing the infrastructure is only part of the equation.  Perhaps even more important is managing all the data that today’s businesses increasingly collect.  The lingua franca of this Big Data movement is Hadoop, which enables companies to crunch through massive piles of data to find actionable insights into how better to run one’s business.  Hadoop’s importance in our data-hungry world is perhaps best articulated by Cloudera CEO (and Nodeable board member) Mike Olson:

In the old days if you had a data problem you would write a big check for a massive piece of hardware and with any money left over you would buy some very expensive but powerful software. That box with software and data became your data temple and your analysis and conclusions were done there.

There are problems with that approach. Data are now growing so fast. It is now impossible for one box to have all your data. You must have your data across multiple servers and use software that can coordinate and operate across all those servers. Hadoop is the platform designed to do this. It is designed to solve the problems of today, not the problems of yesterday.

Critically, Hadoop, too, is increasingly a matter of the cloud and, particularly, of AWS.  By some estimates Hadoop jobs comprise the majority of all AWS processing.  With petabyte-scale data clusters increasingly common, shifting that burden of storage and processing to the cloud becomes essential.

As more infrastructure and data processing moves to AWS, it becomes more and more important to analyze your AWS instances to track real-time trends (“Is my CPU running hot?”), make comparisons (“We’re running memory 25 percent lower than most companies – should we look to optimize?”), discover anomalies, and so on.  That’s where Nodeable comes in.
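The comparison case is simple arithmetic once you have peer data.  Here is a minimal sketch, assuming you can obtain utilization figures for comparable companies; the function name and every number below are hypothetical.

```python
from statistics import median


def percent_vs_peers(own_value: float, peer_values: list[float]) -> float:
    """Percent difference between our metric and the peer median (negative = lower)."""
    peer_median = median(peer_values)
    return (own_value - peer_median) / peer_median * 100


# Hypothetical memory utilization (percent) for us and for five peers.
ours = 45.0
peers = [52.0, 60.0, 58.0, 71.0, 64.0]
print(f"{percent_vs_peers(ours, peers):+.0f}% vs. peer median")
```

Using the median rather than the mean keeps one unusually large or small peer from skewing the benchmark.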

It used to be that the cloud was where enterprises dumped non-critical applications.  Now the inverse is true.  The cloud is the hub for mission-critical data processing.  It’s where enterprises are running applications that need serious scale.  And it just so happens to be where Nodeable shines.  Nodeable surfaces actionable insights in an easy-to-grok, Twitter-like activity stream.  Search tools like Splunk are nice, but Nodeable prefers to reveal those insights while you sip your tea or watch your daughter’s soccer game.

Why not sign up for our beta and give it a try?


Channeling Dr. Seuss to explain real-time system analytics

Most companies have a UI/UX person.  Very few companies have a UI/UX person who doubles as a cartoonist and satirist.  Well, Nodeable does: Mike Evans.

On the downside, Mike rides a hipster bike and wears Bono-like sunglasses when he rides. He also has terrible recommendations on where to find good hot chocolate in San Francisco.

On the upside, he’s an award-winning film producer.  Not that Nodeable is in the habit of making movies.  But if we SERIOUSLY pivot, he’s the guy to make our My Own Private Utah movie.

Here’s one of his recent graphics for a presentation Dave is due to give in a few weeks.  Yes, it looks like it comes from a Dr. Seuss book.  But how many Dr. Seuss books do you know that deal with the hot topic of chewing through massive piles of system data to deliver actionable insights into how to optimize your infrastructure, and helping you resolve issues before they become problems?

Yes, that was a short infomercial.  But no, there are no such Dr. Seuss books.  Thankfully.
