Filed under: Analytics, Business Impacts, Business Intelligence, information strategy, Performance Measures
I was pleasantly surprised by a negative review of my recent TDWI webinar with Cray on Evolution of Big Data, and prepared a comment to be posted to that site. However, for some reason that person’s blog thought my comment was spam and refuse to post it, so I am happy to share my response here.
First, at the risk of pushing people to another website, here is the negative review.
Interesting feedback, and thanks for posting the review.
I am happy to reflect on your critique, particularly in relation to my experiences in talking to those rabid Hadoop adopters who can barely spell the word scalability, let alone understand what it truly means. For example, I have a customer who has embarked on a pilot project for Hadoop, focusing on loading a subset of their data into a cluster to test-drive its capabilities. However, they (like most large organizations) have limited understanding of the inner workings of their own systems. This means that they look to migrate their circa-1985 mainframe applications to Hadoop and expect that they will get order of magnitude speedups with a fraction of the cost. In reality they get minimal speedup and the same cost.
I mentioned in the webinar about a presentation I had heard in which the presenter shared his experience in using MapReduce for monitoring access counts for millions of URLs. When it dawned on the application development team that the lion’s share of the time of the MapReduce application was shuffling the URL visit counts across their network, they determined that to get any reasonable performance they had to *sort* all of their data before they loaded it into Hadoop. OK, sort time time is now a preprocessing stage that is not accounted for on Hadoop, their MapReduce ran a lot faster, but clearly the overall execution time for the entire application was not significantly improved at all. Great case study.
The point about going heavy on memory follows accordingly: the scalability bottleneck is tied to data movement (both disk and network), so managing more data in-memory diminishes the impact. As you suggest, this is not new, and I agree: I worked on memory hierarchy optimizations 20 years ago when I was designing compiler optimizations for MPP systems. However, it is good to see that the big software vendors are now aware of this (e.g. SAP HANA). The moral of that point is that when you are configuring a system, balance your need (and fund your budget) for memory in relation to the types of applications and their corresponding performance requirements.
Next, your point in relation to differentiating between Hadoop 1.0 and YARN: Pretend you are a business person tasked with making a decision about big data and spend a few minutes reading about YARN and see if you can easily understand what the difference is between 1.0 and YARN. If you apply the same critical eye that you used with the webinar, you’ll be sure to point out that not only is the difference only clear to a person with deep technical knowledge, there is little (or no) value proposition or justification presented. Only a description of the differences in the new version. When you consider the TDWI audience (largely heritage data management/data warehousing practitioners plus a number of their business associates), you will understand that they are not typically literate around Hadoop and are happy to have these details spelled out.
I do think you are somewhat limiting in linking the term “data lake” solely to the concept of an ODS. I have at least two clients who are in the process of dumping *all* of their data onto Hadoop under the presumption that its “scalable storage” makes it a clear winner from a perspective of low-cost persistent storage. However, in both cases the use of HDFS as a “data lake” is more of a “data dump” for all data artifacts and have neither the structure nor the intent of use as an operational data store, particularly in relation to data warehousing. This tells me that there is a desire to use Hadoop/HDFS more as an archival dump more than anything else. One of those two clients said to me that they were dumping all of their data on Hadoop because they wanted to do analytics. When I asked what kind of analytics, they said “predictive analytics.” When I asked what they were hoping to accomplish using predictive analytics, they no longer had an answer. They cycled back to saying that they wanted to do text analysis and use that for predictive analytics.
On the other hand, the types of applications that are emerging on commodity-based high performance computing systems are expanding beyond the “data warehouse” and data analytics to more computation-based applications that use *data structures* (as opposed to databases). Examples include social network analysis (you want to have the graph in memory), protein structure prediction (you want to have the complex molecule data structures in memory), multidimensional nearest neighbor and other types of iterative data mining algorithms (which look to having their analyzed entity data structures in memory), cybersecurity, public protection, etc. Next, consider the ability to virtualize access to in-memory databases in ways that allows for simultaneous transaction processing and analytical processing, eliminating the need for a data warehouse (and consequently and ODS).
In general, the technology media do a good job of hyping new technology but not as good at explaining its value or telling you how to determine when the new technology is better than using that old mainframe. That is what I have in mind when I do webinars like these.
I have read through some of the blog entries on your site and the common theme seems to be criticism of one sort or another of presentations and presenters be it a webinar or a presentation at BBBT. It is pretty easy to throw darts at others. Please let me know when your next webinar is coming up and I will be sure to attend. I’ll be happy to share my thoughts with you afterward.
My casual monitoring of data management buzz phrases suggests that, as an industry, we are beginning to transition our hysteria over “big data” to a new compulsion with what is referred to as the “Internet of Things,” (IoT). Informally, the IoT refers to the integration of communication capabilities within lots of different uniquely identified devices that effectively creates a massive network for the exchange of data. These devices can range from vending machines to sensors attached to jet engines to implanted medical devices.
Naturally, absorbing the data from the proliferation of interconnected devices that all are generating and communicating continuous streams of data is the natural next area of focus for all the big data people, especially when it comes to going beyond the acquisition of these numerous continuous streams. The next step would be to not just collect that data, but be able to make sense of the information that can be inferred from the data.
Consider automobile manufacturers, who have implanted numerous sensors within the cars they create. It is one thing to have a tire pressure sensor continuously monitor the pressure in each tire and alert the driver when the pressure is low. But here are two contrived examples. The first would integrate the sensor readings from all the tires as well as monitor weather conditions and current traffic conditions where the car is being driven to apply an algorithm to determine whether changes in tire pressure are real problems or if they can be attributed to fluctuations in outside temperature coupled with the way the car is being driven. This example streams data from sensors and other sources of data within algorithmic models to inform the driver of potential issues.
The second example goes a bit further – the manufacturer has the sensors in all of their cars that are on the road wirelessly report their readings back to the company on a regular basis. In turn, the company can monitor for issues, part failures, correlation between locations driven, external conditions, the owners’ maintenance behaviors, as well as other source of data to proactively identify potential issues and alert the owner (or maybe the car itself!) about how to mitigate any impending risks.
Both of these examples are indicative of the maturation of the big data thought process, suggesting new ideas of what to do with all of that big data you can collect. But in turn, recognize also that both of these examples (and others like them) are predicated on the ability to go beyond just collecting, storing, and processing that data. To achieve these benefits, you need to be able to align these variant (and sometimes less-than-reliable) data sets in ways so that they can be incorporated logically into the appropriate analytical models.
That requires data management and integration mechanisms that combine knowledge of structure with inferred knowledge about the actual content to drive harmonization. While we have tools that can contribute some of these capabilities, it appears that we are still close to the starting gate when it comes to universally being able to make sense of all this information.
That being said, some vendors seem to understand these challenges and have embarked on developing a roadmap that seeks to not only address the mechanical aspects of acquisition and composition, but also to fuse pattern analysis, machine learning, and predictive analytics techniques with the more mundane aspects of data profiling, scanning, parsing, cleansing, standardization, and harmonization, as well as governance aspects such as security and data protection.
An example is Informatica, whose user event I am currently attending. At this event, the management team has presented their vision for the next few years, and it speaks to a number of the concepts and challenges I have raised in this posting. Some specific aspects include evolving core capabilities for data quality and usability and ratcheting them up a notch to enable business users to make use of information without relying on the IT crutch. This vision includes data discovery and automated inventorying and classification that can adapt different methods for data preparation to encourage a greater level of business self-service, no matter where the data lives. At the same time, they are also attempting to address the issue of data protection, a challenge that only seems to be expanding. I am looking forward to monitor the actualization of this vision over the next 6-12 months.
Hi Folks, I recently published a new technical paper on the use of auxiliary processors on IBM System z class machines to support virtualization of mainframe data that allows you to bypass the need for extracting data prior to using that data for reporting and analysis. You can access the paper, which was sponsored by Rocket Software, via this link. Please email me or post comments and let me know what you think!
A few years ago I was working on configuring a test for comparing data transformation and loading into a variety of target platforms. Essentially I was hoping to assess the comparative performance of different data management schemes (open source relational databases, enterprise versions of relational databases, columnar data stores, and other NoSQL-style schemes). But to do this, I had two constraints that I needed to overcome. The first was the need for a data set that was massive enough to really push the envelope when it came to evaluating different aspects of performance. The second was a little subtler: I needed the data set to exhibit certain data error and inconsistency characteristics that simulated a real-life scenario.
Filed under: Business Intelligence, Data Analysis, Data Governance, Data Integration, information strategy, Metadata
Even if in reality the dividing lines for data management are not always well-defined, it is possible to organize different aspects of information management within a virtual stack that suggests the interfaces and dependencies across different functional layers, which we will examine from the bottom – up.