Response to David Teich’s Critique

I was pleasantly surprised by a negative review of my recent TDWI webinar with Cray on the Evolution of Big Data, and I prepared a comment to post on that site. However, for some reason the reviewer's blog flagged my comment as spam and refused to post it, so I am happy to share my response here.

First, at the risk of pushing people to another website, here is the negative review.

My response:

Interesting feedback, and thanks for posting the review.

I am happy to reflect on your critique, particularly in relation to my experiences in talking to those rabid Hadoop adopters who can barely spell the word scalability, let alone understand what it truly means. For example, I have a customer who has embarked on a pilot project for Hadoop, focusing on loading a subset of their data into a cluster to test-drive its capabilities. However, they (like most large organizations) have limited understanding of the inner workings of their own systems. This means that they look to migrate their circa-1985 mainframe applications to Hadoop and expect order-of-magnitude speedups at a fraction of the cost. In reality they get minimal speedup at the same cost.

In the webinar I mentioned a presentation I had heard in which the presenter shared his experience in using MapReduce to monitor access counts for millions of URLs. When it dawned on the application development team that the lion’s share of the MapReduce application's time was spent shuffling the URL visit counts across their network, they determined that to get any reasonable performance they had to *sort* all of their data before they loaded it into Hadoop. OK, so the sort is now a preprocessing stage whose time is not accounted for on Hadoop. Their MapReduce job ran a lot faster, but clearly the overall execution time for the entire application was not significantly improved at all. Great case study.
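To make the shuffle point concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes each mapper combines counts locally for its own contiguous input split and emits one (URL, count) record per distinct URL; the names `shuffled_records` and `make_log`, the URL pattern, and the sizes are all hypothetical. The presentation did not describe the team's actual implementation, so treat this only as an illustration of why pre-sorted input can shrink the data crossing the network.

```python
from collections import Counter
import random


def shuffled_records(urls, split_size):
    """Count the (url, count) records that would leave the mappers if each
    mapper combined counts locally for its own contiguous input split."""
    total = 0
    for start in range(0, len(urls), split_size):
        split = urls[start:start + split_size]
        # One outgoing record per distinct URL seen in this split.
        total += len(Counter(split))
    return total


def make_log(num_urls, visits_per_url, seed=0):
    """Build a synthetic visit log in random order (hypothetical data)."""
    log = [f"/page/{i}" for i in range(num_urls) for _ in range(visits_per_url)]
    random.Random(seed).shuffle(log)
    return log


log = make_log(num_urls=100, visits_per_url=50)

# Unsorted input: almost every split sees almost every URL.
unsorted_volume = shuffled_records(log, split_size=500)

# Sorted input: each split holds only a handful of distinct URLs.
sorted_volume = shuffled_records(sorted(log), split_size=500)

print("records shuffled, unsorted input:", unsorted_volume)
print("records shuffled, sorted input:  ", sorted_volume)
```

The sketch deliberately leaves out the cost of `sorted(log)` itself, which is exactly the preprocessing time that went unaccounted for in the case study.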

The point about going heavy on memory follows accordingly: the scalability bottleneck is tied to data movement (both disk and network), so managing more data in-memory diminishes the impact. As you suggest, this is not new, and I agree: I worked on memory hierarchy optimizations 20 years ago when I was designing compiler optimizations for MPP systems. However, it is good to see that the big software vendors are now aware of this (e.g. SAP HANA). The moral of that point is that when you are configuring a system, balance your need (and fund your budget) for memory in relation to the types of applications and their corresponding performance requirements.

Next, your point about differentiating between Hadoop 1.0 and YARN: pretend you are a business person tasked with making a decision about big data, spend a few minutes reading about YARN, and see if you can easily understand what the difference is between 1.0 and YARN. If you apply the same critical eye that you used with the webinar, you’ll be sure to point out that not only is the difference clear only to a person with deep technical knowledge, there is little (or no) value proposition or justification presented, only a description of the differences in the new version. When you consider the TDWI audience (largely heritage data management/data warehousing practitioners plus a number of their business associates), you will understand that they are not typically literate around Hadoop and are happy to have these details spelled out.

I do think you are somewhat limiting in linking the term “data lake” solely to the concept of an ODS. I have at least two clients who are in the process of dumping *all* of their data onto Hadoop under the presumption that its “scalable storage” makes it a clear winner from the perspective of low-cost persistent storage. However, in both cases the use of HDFS as a “data lake” is more of a “data dump” for all data artifacts and has neither the structure nor the intended use of an operational data store, particularly in relation to data warehousing. This tells me that there is a desire to use Hadoop/HDFS as an archival dump more than anything else. One of those two clients said to me that they were dumping all of their data on Hadoop because they wanted to do analytics. When I asked what kind of analytics, they said “predictive analytics.” When I asked what they were hoping to accomplish using predictive analytics, they no longer had an answer. They cycled back to saying that they wanted to do text analysis and use that for predictive analytics.

On the other hand, the types of applications that are emerging on commodity-based high performance computing systems are expanding beyond the “data warehouse” and data analytics to more computation-based applications that use *data structures* (as opposed to databases). Examples include social network analysis (you want to have the graph in memory), protein structure prediction (you want to have the complex molecule data structures in memory), multidimensional nearest neighbor and other types of iterative data mining algorithms (which look to have their analyzed entity data structures in memory), cybersecurity, public protection, etc. Next, consider the ability to virtualize access to in-memory databases in ways that allow for simultaneous transaction processing and analytical processing, eliminating the need for a data warehouse (and consequently an ODS).

In general, the technology media do a good job of hyping new technology but are not as good at explaining its value or telling you how to determine when the new technology is better than using that old mainframe. That is what I have in mind when I do webinars like these.

I have read through some of the blog entries on your site, and the common theme seems to be criticism of one sort or another of presentations and presenters, be it a webinar or a presentation at BBBT. It is pretty easy to throw darts at others. Please let me know when your next webinar is coming up and I will be sure to attend. I’ll be happy to share my thoughts with you afterward.

Use Cases for Operational Synchronization

In my last post, I introduced the need for operational synchronization, focusing on the characteristics necessary for a reasonable methodology for implementation. In this post, it is worth examining some example use cases that demonstrate the utility of operational synchronization in a more concrete way.

Questions About the “Cost of Poor Data Quality”

July 25, 2011
Filed under: Data Quality, Performance Measures 

I was reading through one of Jim Harris’s blog entries about his reinterpretation of Pascal’s wager in terms of data quality, and the posting made reference to an email he had received from Gordon Hamilton about the estimated costs of poor information quality. I noted that Richard Ordowich’s comments from the LinkedIn group were incorporated regarding the claims that 15-45% of the operating expense of virtually all organizations is WASTED due to data quality issues. So I thought it would be worth spending a little time exploring the sources for some of the more popular claims about the costs of poor data quality.

But just in case you are interested, I spend a lot of time in my book talking about assessing the real opportunities for increased value from data quality improvement.

Business-Oriented MDM: An Experimental Workshop Idea

May 3, 2011
Filed under: Master Data, Performance Measures 

I have been participating in a series of events sponsored by DataFlux on strategies for long-term success for enterprise master data management projects. We are about halfway through the series, and so far I have noticed two common threads among the questions posed by the attendees. The first thread involves justifying the value of MDM knowing that there is significant upfront effort that might not lead to the commonly-noted benefits. The second is about herding the business managers together to have them discuss (and hopefully agree) about the impacts of replicated records and inconsistent semantics.

Download Updated Version of “The Analytics Revolution”

I recently updated a white paper I did for IBM called “The Analytics Revolution – Optimizing Reporting and Analytics to Make Actionable Intelligence Pervasive.” Click here to download this revised masterpiece.
