I was pleasantly surprised by a negative review of my recent TDWI webinar with Cray on Evolution of Big Data, and prepared a comment to be posted to that site. However, for some reason that person’s blog flagged my comment as spam and refused to post it, so I am happy to share my response here.
First, at the risk of pushing people to another website, here is the negative review.
Interesting feedback, and thanks for posting the review.
I am happy to reflect on your critique, particularly in relation to my experiences in talking to those rabid Hadoop adopters who can barely spell the word scalability, let alone understand what it truly means. For example, I have a customer who has embarked on a pilot project for Hadoop, focusing on loading a subset of their data into a cluster to test-drive its capabilities. However, they (like most large organizations) have limited understanding of the inner workings of their own systems. This means that they look to migrate their circa-1985 mainframe applications to Hadoop and expect that they will get order-of-magnitude speedups at a fraction of the cost. In reality they get minimal speedup at the same cost.
In the webinar I mentioned a presentation I had heard in which the presenter shared his experience in using MapReduce for monitoring access counts for millions of URLs. When it dawned on the application development team that the lion’s share of the time of the MapReduce application was spent shuffling the URL visit counts across their network, they determined that to get any reasonable performance they had to *sort* all of their data before they loaded it into Hadoop. OK, the sort time is now a preprocessing stage that is not accounted for on Hadoop, so their MapReduce ran a lot faster, but clearly the overall execution time for the entire application was not significantly improved at all. Great case study.
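To make the shuffle cost concrete, here is a minimal Python sketch. It is not the presenter's code, and it rests on an assumption the presentation did not confirm: that each mapper combines its counts locally before anything crosses the network. The URLs, the number of splits, and the function names are all hypothetical. The sketch counts how many (URL, partial count) pairs would leave the mappers when the input is interleaved versus when it has been sorted first.

```python
from collections import Counter

def split(records, num_splits):
    """Cut the input into contiguous chunks, one per hypothetical mapper."""
    size = -(-len(records) // num_splits)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def shuffled_records(records, num_splits):
    """Count the (url, partial_count) pairs leaving the mappers.

    Assumption: each mapper combines its own counts locally, so it
    emits one pair per distinct URL it has seen.
    """
    return sum(len(Counter(chunk)) for chunk in split(records, num_splits))

def total_counts(records, num_splits):
    """Merge the per-mapper partial counts, as the reducers would."""
    total = Counter()
    for chunk in split(records, num_splits):
        total.update(Counter(chunk))
    return total

# Hypothetical workload: 1000 URLs, 8 visits each, interleaved in the log.
urls = [f"/page/{i}" for i in range(1000)]
visits = urls * 8

unsorted_cost = shuffled_records(visits, 8)        # every mapper sees every URL
sorted_cost = shuffled_records(sorted(visits), 8)  # each URL lands in one mapper

print(unsorted_cost)  # 8000
print(sorted_cost)    # 1000
```

Under these assumptions the sorted input moves one eighth of the records, and the final counts are identical either way. What the sketch leaves out is the point of the story: the `sorted(visits)` call is work too, it just happens before the clock on the Hadoop job starts.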
The point about going heavy on memory follows accordingly: the scalability bottleneck is tied to data movement (both disk and network), so managing more data in-memory diminishes the impact. As you suggest, this is not new, and I agree: I worked on memory hierarchy optimizations 20 years ago when I was designing compiler optimizations for MPP systems. However, it is good to see that the big software vendors are now aware of this (e.g. SAP HANA). The moral of that point is that when you are configuring a system, balance your need (and fund your budget) for memory in relation to the types of applications and their corresponding performance requirements.
Next, your point in relation to differentiating between Hadoop 1.0 and YARN: Pretend you are a business person tasked with making a decision about big data and spend a few minutes reading about YARN and see if you can easily understand what the difference is between 1.0 and YARN. If you apply the same critical eye that you used with the webinar, you’ll be sure to point out that not only is the difference only clear to a person with deep technical knowledge, there is little (or no) value proposition or justification presented. Only a description of the differences in the new version. When you consider the TDWI audience (largely heritage data management/data warehousing practitioners plus a number of their business associates), you will understand that they are not typically literate around Hadoop and are happy to have these details spelled out.
I do think you are somewhat limiting in linking the term “data lake” solely to the concept of an ODS. I have at least two clients who are in the process of dumping *all* of their data onto Hadoop under the presumption that its “scalable storage” makes it a clear winner from a perspective of low-cost persistent storage. However, in both cases the use of HDFS as a “data lake” is more of a “data dump” for all data artifacts and has neither the structure nor the intent of use as an operational data store, particularly in relation to data warehousing. This tells me that there is a desire to use Hadoop/HDFS as an archival dump more than anything else. One of those two clients said to me that they were dumping all of their data on Hadoop because they wanted to do analytics. When I asked what kind of analytics, they said “predictive analytics.” When I asked what they were hoping to accomplish using predictive analytics, they no longer had an answer. They cycled back to saying that they wanted to do text analysis and use that for predictive analytics.
On the other hand, the types of applications that are emerging on commodity-based high performance computing systems are expanding beyond the “data warehouse” and data analytics to more computation-based applications that use *data structures* (as opposed to databases). Examples include social network analysis (you want to have the graph in memory), protein structure prediction (you want to have the complex molecule data structures in memory), multidimensional nearest neighbor and other types of iterative data mining algorithms (which look to having their analyzed entity data structures in memory), cybersecurity, public protection, etc. Next, consider the ability to virtualize access to in-memory databases in ways that allow for simultaneous transaction processing and analytical processing, eliminating the need for a data warehouse (and consequently an ODS).
In general, the technology media do a good job of hyping new technology but are not as good at explaining its value or telling you how to determine when the new technology is better than using that old mainframe. That is what I have in mind when I do webinars like these.
I have read through some of the blog entries on your site and the common theme seems to be criticism of one sort or another of presentations and presenters, be it a webinar or a presentation at BBBT. It is pretty easy to throw darts at others. Please let me know when your next webinar is coming up and I will be sure to attend. I’ll be happy to share my thoughts with you afterward.
After reading Jay Stanley’s ACLU article on “Eight Problems with Big Data,” it is worth reflecting on what could be construed as a fear-mongering indictment of the use of big data analytics and the implication that big data analytics and its implementation of data mining algorithms are tantamount to all-out invasion of privacy. What is interesting, though, is the presumption that privacy advocates have been “grappling” with data mining since “not long after 9/11,” yet data mining was already quite a mature discipline by that point in time, as was the general use of customer data for marketing, sales, and other business purposes. Raising an alarm about “big data” and “data mining” today is akin to shutting the barn door decades after the horses have bolted.
I have been assembling a slide deck for an upcoming TDWI web seminar on Strategic Planning and the World of Big Data, and I am finding that I might sometimes use two different terms (“data reuse” and “data repurposing,” in case you ignored the title of this post) interchangeably when in fact those two terms could have slightly different meanings or intents. So should I be cavalier and use them as synonyms?
When I thought about it, I did see some clarity in differentiating the definitions:
- “data reuse” means taking a data asset and using it more than once for the same purpose.
- “data repurposing” means taking a data asset previously used for one (or more) specific purpose(s) and using that data set for a completely different purpose.
Since it has been a while since I posted to this blog (busy, busy – but busy is good!), I decided to take a break this morning and log some ideas that basically relate quality information to customer visibility.
Last week I shared some thoughts about the popularity of local social network sites working with small businesses to issue coupons to drive new customer acquisition and in-store cross- and upselling. While I questioned the value, it is clearly a phenomenon that does seem to appeal to both the customer community and a number of businesses.
So my next question is about customer management: as a local business owner, how can you monitor the success of your coupon program in terms of measuring new customer acquisition and upselling?