I would call myself a proponent of big data and, correspondingly, big data analytics. As a professional who has been involved in high-performance computing since the late 80s, I am glad to finally see the rapid adoption of commodity-based systems providing data distribution and parallel computing, such as what can be assembled and deployed using Hadoop.
One particularly curious innovation, at least to me, is the concept of the “data lake.” According to TechTarget, a data lake is “a large object-based storage repository that holds data in its native format until it is needed.” A data lake provides a place for collecting data sets in their original format, making those data sets available to different consumers, and allowing data users to consume that data in ways specific to their need. The benefits of the data lake concept include the ability to rapidly spin up a data repository, rapid ingestion of new data sources, and the direct data accessibility to analytical models and applications.
However, the ability to allow data users to consume the data in their own ways seems to hide a potential time-bomb in terms of ensuring consistency of interpretation of data and quality of results of analysis. The issue is that the subtext mantra of the data lake is the capture of data sets in their original state and the deferral of data standardization, validation, and organization until the point of consumption. I don’t object to providing degrees of flexibility to the user community to adapt the data in their own ways. My concern is that without instituting control over the management of consumption models, there is bound to be duplicated work, variant methods of applying similar constraints and standards, and generally a degree of entropy that casts doubt on the veracity of analytical results.
In fact, what seems to be emerging leaves me with a feeling of déjà vu – data users complaining about the elongated time to profile and validate data, the apparently self-organized data silos (driven by source, not function), and multiple users left to their own devices for data mapping and transformation. So I have to admit that when I was recently briefed by folks at Informatica about their latest release, I was relieved to see that at last there is a vendor who has not only recognized that these challenges are real, but also that they need to be addressed by integrating data governance, stewardship, and quality into a framework for end-to-end big data management.
I would call out two key features of this concept. The first is expanding the data integration capabilities of the tool suite to ingest a broad array of structured and unstructured data configurations, tunable to adapt to rapid changes in the sources (especially continuous and non-relational streams, whose formats are subject to change with little or no notice). Embedding logic within the information ingestion and flow processes allows for self-adjustment and dramatically reduces developer overhead as the formats evolve.
The second is the inclusion of metadata capabilities to facilitate governance of big data. The key aspects of big data governance include a shared enterprise business glossary that is ripe for collaborative discussion and analysis, profiling and discovery utilities for big data sets to inform data quality initiatives that can also be shared among data consumers, and end-to-end data lineage that enables monitoring of data flows to assess opportunities for optimization, reduce duplicative coding efforts, and evaluate impacts as data sources change over time.
I am confident that awareness of balancing the governance needs against the potential benefits of big data and the data lake will help narrow the overwhelming manual efforts that could explode as a result of increased data sources, volumes, and variability. As the prominent vendors like Informatica continue to call out the potential issues and provide solutions, the integrity of predictive and prescriptive models will enhance the creation of corporate value.
How prepared is your organization for technology transition? Are you aware of the different facets of the process for a strategic technology renovation plan? We would like to find out, and therefore we are conducting a survey of technology professionals and their management hierarchy on transitioning to new technology (bit.ly/decisionworxsurvey). All respondents are eligible for a copy of the survey report and will be entered into a drawing for a free copy of my book “Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph.”
Please help us by taking the survey and sharing the link with others!
Filed under: Business Intelligence, information strategy, Post sponsored by Rocket Software, Replatforming
Understanding the Challenges of the Analytics Architecture
The architecture of the venerable enterprise data warehouse, while deeply rooted in the need for performance, reflects the design decisions made at the dawn of the age of reporting and analytics. In the mid-1990s, ensuring the performance of production transaction processing systems and maintaining sub-second response time remained the highest priority for the analytics architecture. And while the desire for reporting and analysis led to the creation of alternate data organizations for the data warehouse, the potential drain on computing resources motivated early designers to segregate the data on a separate platform, with its own specialized data models and applications.
This decision, wise at the time, has created an entire ecosystem of ‘applicationware,’ hardware dependencies, and skills requirements to support the objectives for reporting and analytics. As the speed and efficiency of computing resources have improved over time, though, the performance drivers have changed as well, exposing a different set of challenges that need to be considered and addressed. These challenges can be divided into three key areas:
- Platform Challenges, which aside from the physical system segregation include physical limitations in data warehouse storage capacity, horizontal and extra-enterprise data dependencies, the existence of alternative architectures for reporting and analysis, and the need for data synchronization within narrowing time windows, all within the constraints of a decades-old design paradigm.
- New/Emerging Opportunities, associated with evolution of technology, data awareness, and the thirst for more powerful predictive and prescriptive analytics, such as discovery analytics (including interactive visualizations, event stream analytics, or collaborative interactions), the growing data distribution and diffusion as dependence on cloud computing grows, the role of the ubiquitous mobile devices and their rampant creation and injection of data, as well as the desire to capture and analyze Big Data.
- Environmental Challenges, comprising exploding data volumes, the diversity of forms in which data is generated, the various speeds at which information is streamed, and a more mature demand that organizations provide a real-time comprehensive view of actionable information.
Enterprise Data Warehouse vs. Hadoop
These challenges lead many organizational architects to consider abandoning their enterprise data warehouse while seeking greener pastures (with correspondingly green technology). And in some camps, there is a perception that the emergence of Big Data (in general) and Hadoop (in particular) is sounding the death knell for the enterprise data warehouse as we know it. With organizations aching to adopt Hadoop, it may seem that these enterprises are prepared to abandon their decades-long investment in infrastructure, software, staffing, and development.
As a replacement platform, Hadoop (as well as other high performance NoSQL tools) can be used to simplify the acquisition and storage of diverse data sources, whether structured, semi-structured (web logs, sensor feeds), or unstructured (social media, image, video, audio). In addition, data distribution and parallel processing can speed execution of algorithmic applications and analyses, and provide elastic augmentation to existing storage resources.
However, at the current level of system maturity Hadoop does not necessarily address our aforementioned challenges. While there is a promise of linear scalability, migrating reporting and analytics to a big data platform does not address data dependencies and synchronization requirements. Data sets will still need to be moved from their origination points to a separate analytics system. Re-platforming from an existing EDW to Hadoop may incur significant costs, especially in terms of reprogramming vast quantities of production-class SQL queries, end-user reporting tool configurations, and coded solutions for analytics.
So despite the apparent (and justified) benefits of the growing capabilities of the different big data platforms, a more reasoned and responsible approach would blend consideration of new technologies like Hadoop with new options for extending the value of the existing information architecture investment. Consider that:
- The production-hardened enterprise data warehouse in its various configurations still presents opportunities for significant value, especially in the context of the existence of tested queries and applications for accessing, organizing, and analyzing data.
- The emergence of production-class data federation and data virtualization tools extends data accessibility across the enterprise without sacrificing the effort in development of existing reports and analyses. At the same time, optimizations, in-memory computing, and caching reduce the data latency that originally motivated system segregation. Not only does this reduce the need for additional staging areas and costly ETL, it also enables reporting and analysis to be more tightly coupled to data sitting in its original source location, diminishing the synchronization challenge.
- Increasing mainframe utilization through data virtualization can amortize the per-user costs and prolong the lifetime of the EDW, as well as enhance the continued advantage of existing investments.
This raises the question: do you want to continue moving data from original sources and staging platforms to a segregated system, or do you want to examine ways of keeping the data sets where they are and redevelop around new services interfaces layered using data virtualization?
What is Data Virtualization?
The key to balancing the existing EDW’s value while incrementally adding new analytics componentry is data virtualization. Data virtualization tools enable independently designed and deployed data structures to be leveraged together as a single source, in real time, and with limited (or often no) data movement. According to noted data virtualization expert Rick van der Lans, “data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.”
Data virtualization tools specifically adapted to mainframe environments (such as the z class IBM mainframes) use a special mainframe processing engine (one example being the IBM System z Integrated Information Processor, or zIIP) to handle data transformation and facilitate access to the data store on the mainframe.
Not only does this eliminate a significant amount of mainframe processing, but it also provides a low latency method to satisfy the data requests for downstream business intelligence and visualization tools. At the same time, the data virtualization methodology uses federation techniques to access data on external platforms (in internal relational database management systems, web/mobile data, data in the cloud, and with varying degrees of imposed structure) to create composite views of the information that is not in the data warehouse.
Data virtualization provides an abstracted view of organized data potentially drawn from heterogeneous sources and, using the right tools, can be deployed on the mainframe’s integrated processors as long as it:
- Provides support for SQL queries
- Does not impact OLTP response time
- Does not incur additional costs for storage or processing
Virtualizing a data warehouse deployed on a mainframe using a specialty processing engine allows you to leave the mainframe data in place, avoiding the cost and complexity of data movement. The integrated processor uses the existing storage capacity of the mainframe, which reduces network bandwidth demand while providing real-time integration with transaction data. When the data virtualization tool can federate to big data storage environments like Hadoop/HDFS or NoSQL platforms, it enables programmers to use modern APIs such as MongoDB without demanding that the data be offloaded from the mainframe.
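To make the idea of a unified, abstracted view a little more concrete, here is a minimal sketch. It assumes two hypothetical sources: a relational table in an in-memory SQLite database (standing in for warehouse data that is left in place) and a list of document-style records (standing in for a non-relational store). The names `FederatedView`, `customers`, and `web_events` are invented for illustration and do not correspond to any vendor's API; a real data virtualization tool does far more (query pushdown, caching, security).

```python
import sqlite3

# Source 1 (hypothetical): relational data that stays where it is.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)

# Source 2 (hypothetical): document-style records from a non-relational store.
web_events = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 1, "page": "/docs"},
    {"customer_id": 2, "page": "/pricing"},
]


class FederatedView:
    """Presents both sources as one logical view without copying either."""

    def __init__(self, conn, events):
        self.conn = conn
        self.events = events

    def page_views_by_customer(self):
        # Count events per customer at query time, then join the counts to
        # the relational rows; nothing is persisted in a staging area.
        counts = {}
        for event in self.events:
            cid = event["customer_id"]
            counts[cid] = counts.get(cid, 0) + 1
        rows = self.conn.execute("SELECT id, name FROM customers ORDER BY id")
        return [(name, counts.get(cid, 0)) for cid, name in rows]


view = FederatedView(warehouse, web_events)
result = view.page_views_by_customer()
print(result)
```

The consumer asks one question of one object and never needs to know that the answer was composed from two differently structured stores, which is the essence of the "single source" idea described above.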
5 Data Virtualization Use Cases for the Enterprise Data Warehouse
In this section we will discuss data virtualization use cases for enhancing the existing data warehouse environment, including (but not necessarily limited to):
- Storage augmentation and system federation – By enabling a uniform method of accessing logical views of data sourced from different platforms in place (including Hadoop), data virtualization can help create composite views of data that are not persisted within the confines of the data warehouse.
- Streamlining extraction, transformation, and loading – Data virtualization provides two complementary benefits for loading data into the data warehouse. First, the way that federation enables access to other data sources reduces the need for bringing the data into the data warehouse before using the data to satisfy new business requests. Second, data virtualization reduces the hardware, software, and programming costs of data integration and loading by limiting network bandwidth contention, shrinking the costs of duplicated storage, and speeding execution time through the use of caches that use in-memory capabilities to eliminate data latency.
- Rapid prototyping for new development – Enabling access to heterogeneous data sources using data virtualization accelerates the assessment of data warehouse requirements and streamlines integration without having to load data first. This facilitates rapid prototyping of reports and analyses, and assessment of their business suitability, before doing the work needed to extract, transform, and load the data.
- Increase breadth of data accessibility – Under the right circumstances, data virtualization tools can enable access to both structured and unstructured data sources, as well as data in non-relational formats such as the various NoSQL data management schemas. This allows one to create composite representations of information that are not typically available in a relational data warehouse, as well as query a broad set of data sources in real-time.
- Substitution using virtual data marts – When the enterprise data warehouse is unavailable (either for routine maintenance or because of unscheduled down time), accessing composite sources using data virtualization can function as a substitute for reporting and analytics until the EDW is back up and running.
There is a common theme that flows across all these use cases – leveraging data virtualization as a strategic tool for enabling, extending, or continuing accessibility to an enterprise data warehouse. Over time, these use cases demonstrate how data virtualization enables the eventual incorporation of a wide variety of data assets for analytics. In some cases, data virtualization can help make the case for streaming data into the mainframe-based EDW when it is more efficient than migrating to a new platform.
Summary: Augment Enterprise Data Warehouse with Data Virtualization
There is no doubt that the attraction of emerging data management paradigms such as NoSQL and Hadoop will prove to be a strong motivating factor in corporate reengineering and re-platforming data warehouses from their heritage environments. However, at the same time, it would be irresponsible to abandon the resources and time invested in developing production systems that are more than adequate to address a healthy proportion of today’s business reporting and analytics needs. And as the vision for the future analytics environment takes shape, you will see that there is enough room for both emerging technologies and trusted heritage environments. The trick will be to balance the continued expanded use of the traditional systems with the design, development, and deployment of newer systems.
As we have seen, data virtualization provides a way to bridge these technologies. Data virtualization can help revitalize the data warehouse (particularly those involved with mainframe data) through a variety of hybrid approaches for data accessibility, thereby extending the useful life of existing platform investments. Data virtualization can be a key component of the strategy for continuing to extract value out of the years of underwritten costs. Adopting a strategy that retains the use of existing mainframe capabilities will preserve the investment in the development of SQL and code for reporting and analysis.
When it comes to organizations dependent on mainframe data, before disavowing the trusted data warehouse, consider some of these questions:
- What is your current volume of persisted data? Does that overburden the existing environment?
- What are your expectations for data volume growth?
- Has there been a significant investment in developing SQL queries for reports, ad hoc analyses, and other types of analytical applications?
- How much specialized code would have to be developed to replicate that functionality on a new environment such as Hadoop?
- Do you have measurable statistics for comparing the total costs of operation for both the existing and any proposed replacement environments?
- Are there ways of incrementally introducing new technologies like Hadoop for storage augmentation and pilot algorithmic analytics that dovetail with current mainframe reporting?
Each of these questions refers to dependencies on existing production systems, some of which are deployed on a mainframe, with an expectation for incorporation of emerging tools for enhancement and growth. That suggests that an effective strategy for saving your data warehouse investments in the near-term and medium-term will incorporate a hybrid architecture combining the mainframe and data virtualization to provide the transition environment for the future of reporting and analytics.
About the Author
David Loshin, president of Knowledge Integrity, Inc., (www.knowledge-integrity.com), is a recognized thought leader and expert consultant in the areas of data quality, master data management, and business intelligence. David is a prolific author regarding best practices for data management, business intelligence, and analytics, and has written numerous books and papers on these topics. Most recently, he is the author of “Big Data Analytics” (Morgan Kaufmann 2013). His book, “Business Intelligence: The Savvy Manager’s Guide” (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing, and how all of the pieces work together.” He is the author of “Master Data Management,” which has been endorsed by data management industry leaders, and the recently-released “The Practitioner’s Guide to Data Quality Improvement,” focusing on practical processes for improving information utility. Visit http://dataqualitybook.com for more insights on data management.
David can be reached at email@example.com.
About Rocket Software
Rocket Software is a leading global developer of software products that help corporations, government agencies and other organizations reach their technology and business goals. 1,100 Rocketeers on five continents are focused on building and delivering solutions for more than 10,000 customers and partners – and five million end users.
Rocket Data Virtualization enables mainframe relational and non-relational data to seamlessly integrate with Big Data, Analytics, and Web/Mobile initiatives; eliminating the need to move or replicate data, and with significantly reduced costs, complexity and risk.
- The industry’s only mainframe-resident data virtualization solution for real-time, universal access to data, regardless of location or format.
- Support for Data Providers – IBM Big Insights, Hadoop, MongoDB, DB2, Oracle, SQL Server, VSAM, IMS, Adabas, and others
- Support for Data Consumers – Cloud, Mobile, Analytics, Search, ETL, as well as ODBC, JDBC, REST, SOAP, JSON, HTTP, HTML, XML
- Reduced mainframe TCO – engineered to divert up to 99% of its integration related processing to the System z Integrated Information Processor (zIIP).
- Universal DB2 Support – applications using DB2 can now seamlessly integrate with any non-DB2 data source with the same ease of functionality
- Asymmetrical Request/Reply – any mainframe application (Batch, Started Task, IMS, IDMS, Natural) can interface with DV to request data for itself or another application
Our customers tell us that IBM System z—the mainframe—is still the best platform in the world for running their critical business applications. And those applications generate and access large data volumes—big data. Increasingly, those applications and data must connect with other applications within the enterprise and even outside the enterprise. Rocket has deep domain expertise and world-class technology to keep the data where it belongs and move the analytics closer to the data.
van der Lans, Rick F., “Data Virtualization for Business Intelligence Systems,” Morgan Kaufmann, 2012.
Filed under: Analytics, Business Impacts, Business Intelligence, information strategy, Performance Measures
I was pleasantly surprised by a negative review of my recent TDWI webinar with Cray on Evolution of Big Data, and prepared a comment to be posted to that site. However, for some reason that person’s blog thought my comment was spam and refused to post it, so I am happy to share my response here.
First, at the risk of pushing people to another website, here is the negative review.
Interesting feedback, and thanks for posting the review.
I am happy to reflect on your critique, particularly in relation to my experiences in talking to those rabid Hadoop adopters who can barely spell the word scalability, let alone understand what it truly means. For example, I have a customer who has embarked on a pilot project for Hadoop, focusing on loading a subset of their data into a cluster to test-drive its capabilities. However, they (like most large organizations) have limited understanding of the inner workings of their own systems. This means that they look to migrate their circa-1985 mainframe applications to Hadoop and expect that they will get order of magnitude speedups with a fraction of the cost. In reality they get minimal speedup and the same cost.
I mentioned in the webinar a presentation I had heard in which the presenter shared his experience in using MapReduce for monitoring access counts for millions of URLs. When it dawned on the application development team that the lion’s share of the time of the MapReduce application was spent shuffling the URL visit counts across their network, they determined that to get any reasonable performance they had to *sort* all of their data before they loaded it into Hadoop. OK, the sort is now a preprocessing stage whose time is not accounted for on Hadoop; their MapReduce ran a lot faster, but clearly the overall execution time for the entire application was not significantly improved at all. Great case study.
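For readers who want to see why pre-sorting shrinks the shuffle, here is a toy model in plain Python (not Hadoop code). It assumes each mapper receives a contiguous chunk of the input and combines counts locally before sending one (URL, count) pair per distinct URL to the reducers; the presenter did not describe the job at this level of detail, so the function `shuffled_records`, the data sizes, and the combiner assumption are all mine.

```python
from collections import Counter
import random


def shuffled_records(records, num_mappers):
    """Count the (url, count) pairs sent to reducers when each mapper
    pre-aggregates (combines) the contiguous chunk of input it is given."""
    chunk = (len(records) + num_mappers - 1) // num_mappers
    total = 0
    for start in range(0, len(records), chunk):
        local_counts = Counter(records[start:start + chunk])
        total += len(local_counts)  # one pair per distinct URL per mapper
    return total


random.seed(0)
urls = [f"/page/{i}" for i in range(1000)]
visits = [random.choice(urls) for _ in range(100_000)]

# Unsorted input: nearly every mapper sees nearly every URL.
unsorted_cost = shuffled_records(visits, num_mappers=10)
# Sorted input: each URL's visits sit together, so most URLs are
# reported by exactly one mapper.
sorted_cost = shuffled_records(sorted(visits), num_mappers=10)
print(unsorted_cost, sorted_cost)
```

In this model the sorted input sends roughly one tenth as many pairs across the network, but the call to `sorted` is itself work that has to be counted somewhere, which is exactly the accounting problem in the case study.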
The point about going heavy on memory follows accordingly: the scalability bottleneck is tied to data movement (both disk and network), so managing more data in-memory diminishes the impact. As you suggest, this is not new, and I agree: I worked on memory hierarchy optimizations 20 years ago when I was designing compiler optimizations for MPP systems. However, it is good to see that the big software vendors are now aware of this (e.g. SAP HANA). The moral of that point is that when you are configuring a system, balance your need (and fund your budget) for memory in relation to the types of applications and their corresponding performance requirements.
Next, your point in relation to differentiating between Hadoop 1.0 and YARN: Pretend you are a business person tasked with making a decision about big data and spend a few minutes reading about YARN and see if you can easily understand what the difference is between 1.0 and YARN. If you apply the same critical eye that you used with the webinar, you’ll be sure to point out that not only is the difference only clear to a person with deep technical knowledge, there is little (or no) value proposition or justification presented. Only a description of the differences in the new version. When you consider the TDWI audience (largely heritage data management/data warehousing practitioners plus a number of their business associates), you will understand that they are not typically literate around Hadoop and are happy to have these details spelled out.
I do think you are somewhat limiting in linking the term “data lake” solely to the concept of an ODS. I have at least two clients who are in the process of dumping *all* of their data onto Hadoop under the presumption that its “scalable storage” makes it a clear winner from a perspective of low-cost persistent storage. However, in both cases the use of HDFS as a “data lake” is more of a “data dump” for all data artifacts and has neither the structure nor the intent of use as an operational data store, particularly in relation to data warehousing. This tells me that there is a desire to use Hadoop/HDFS as an archival dump more than anything else. One of those two clients said to me that they were dumping all of their data on Hadoop because they wanted to do analytics. When I asked what kind of analytics, they said “predictive analytics.” When I asked what they were hoping to accomplish using predictive analytics, they no longer had an answer. They cycled back to saying that they wanted to do text analysis and use that for predictive analytics.
On the other hand, the types of applications that are emerging on commodity-based high performance computing systems are expanding beyond the “data warehouse” and data analytics to more computation-based applications that use *data structures* (as opposed to databases). Examples include social network analysis (you want to have the graph in memory), protein structure prediction (you want to have the complex molecule data structures in memory), multidimensional nearest neighbor and other types of iterative data mining algorithms (which look to having their analyzed entity data structures in memory), cybersecurity, public protection, etc. Next, consider the ability to virtualize access to in-memory databases in ways that allow for simultaneous transaction processing and analytical processing, eliminating the need for a data warehouse (and consequently an ODS).
In general, the technology media do a good job of hyping new technology but not as good at explaining its value or telling you how to determine when the new technology is better than using that old mainframe. That is what I have in mind when I do webinars like these.
I have read through some of the blog entries on your site and the common theme seems to be criticism of one sort or another of presentations and presenters be it a webinar or a presentation at BBBT. It is pretty easy to throw darts at others. Please let me know when your next webinar is coming up and I will be sure to attend. I’ll be happy to share my thoughts with you afterward.
My casual monitoring of data management buzz phrases suggests that, as an industry, we are beginning to transition our hysteria over “big data” to a new compulsion with what is referred to as the “Internet of Things,” (IoT). Informally, the IoT refers to the integration of communication capabilities within lots of different uniquely identified devices that effectively creates a massive network for the exchange of data. These devices can range from vending machines to sensors attached to jet engines to implanted medical devices.
Absorbing the data from the proliferation of interconnected devices, all of which generate and communicate continuous streams of data, is the natural next area of focus for all the big data people, especially when it comes to going beyond the acquisition of these numerous continuous streams. The next step would be not just to collect that data, but to be able to make sense of the information that can be inferred from it.
Consider automobile manufacturers, who have implanted numerous sensors within the cars they create. It is one thing to have a tire pressure sensor continuously monitor the pressure in each tire and alert the driver when the pressure is low. But here are two contrived examples. The first would integrate the sensor readings from all the tires as well as monitor weather conditions and current traffic conditions where the car is being driven to apply an algorithm to determine whether changes in tire pressure are real problems or if they can be attributed to fluctuations in outside temperature coupled with the way the car is being driven. This example streams data from sensors and other sources of data within algorithmic models to inform the driver of potential issues.
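The first example can be sketched with a little physics. Assuming the tire's volume is roughly constant and the air behaves as an ideal gas, absolute pressure scales with absolute temperature, so a reading can be compared against what the temperature change alone would predict. The function names and the 1.5 psi tolerance below are hypothetical, and the sketch deliberately ignores the traffic and driving-style inputs mentioned above.

```python
ATMOSPHERIC_PSI = 14.7  # approximate sea-level atmospheric pressure


def expected_gauge_psi(ref_gauge_psi, ref_temp_c, current_temp_c):
    """Gauge pressure expected from temperature change alone, assuming
    constant tire volume and ideal-gas behavior (P/T stays constant)."""
    ref_abs = ref_gauge_psi + ATMOSPHERIC_PSI
    ratio = (current_temp_c + 273.15) / (ref_temp_c + 273.15)
    return ref_abs * ratio - ATMOSPHERIC_PSI


def is_real_problem(measured_psi, ref_gauge_psi, ref_temp_c,
                    current_temp_c, tolerance_psi=1.5):
    """Flag a reading only if it is lower than temperature alone explains.
    The tolerance is an illustrative value, not a manufacturer figure."""
    expected = expected_gauge_psi(ref_gauge_psi, ref_temp_c, current_temp_c)
    return (expected - measured_psi) > tolerance_psi


# Tire set to 35 psi at 20 C, now read on a 0 C morning.
cold_morning = is_real_problem(31.5, 35.0, 20.0, 0.0)  # explained by the cold
slow_leak = is_real_problem(28.0, 35.0, 20.0, 0.0)     # lower than cold explains
print(cold_morning, slow_leak)
```

A drop of about 3.4 psi is what the cold alone would produce here, so the 31.5 psi reading is not flagged while the 28 psi reading is.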
The second example goes a bit further – the manufacturer has the sensors in all of their cars that are on the road wirelessly report their readings back to the company on a regular basis. In turn, the company can monitor for issues, part failures, correlation between locations driven, external conditions, the owners’ maintenance behaviors, as well as other sources of data to proactively identify potential issues and alert the owner (or maybe the car itself!) about how to mitigate any impending risks.
Both of these examples are indicative of the maturation of the big data thought process, suggesting new ideas of what to do with all of that big data you can collect. But in turn, recognize also that both of these examples (and others like them) are predicated on the ability to go beyond just collecting, storing, and processing that data. To achieve these benefits, you need to be able to align these variant (and sometimes less-than-reliable) data sets in ways so that they can be incorporated logically into the appropriate analytical models.
That requires data management and integration mechanisms that combine knowledge of structure with inferred knowledge about the actual content to drive harmonization. While we have tools that can contribute some of these capabilities, it appears that we are still close to the starting gate when it comes to universally being able to make sense of all this information.
That being said, some vendors seem to understand these challenges and have embarked on developing a roadmap that seeks to not only address the mechanical aspects of acquisition and composition, but also to fuse pattern analysis, machine learning, and predictive analytics techniques with the more mundane aspects of data profiling, scanning, parsing, cleansing, standardization, and harmonization, as well as governance aspects such as security and data protection.
An example is Informatica, whose user event I am currently attending. At this event, the management team has presented their vision for the next few years, and it speaks to a number of the concepts and challenges I have raised in this posting. Some specific aspects include evolving core capabilities for data quality and usability and ratcheting them up a notch to enable business users to make use of information without relying on the IT crutch. This vision includes data discovery and automated inventorying and classification that can adapt different methods for data preparation to encourage a greater level of business self-service, no matter where the data lives. At the same time, they are also attempting to address the issue of data protection, a challenge that only seems to be expanding. I am looking forward to monitoring the actualization of this vision over the next 6-12 months.