One particularly curious innovation, at least to me, is the concept of the “data lake.” According to TechTarget, a data lake is “a large object-based storage repository that holds data in its native format until it is needed.” A data lake provides a place for collecting data sets in their original format, making those data sets available to different consumers, and allowing data users to consume that data in ways specific to their need. The benefits of the data lake concept include the ability to rapidly spin up a data repository, rapid ingestion of new data sources, and the direct data accessibility to analytical models and applications.
However, the ability to allow data users to consume the data in their own ways seems to hide a potential time-bomb in terms of ensuring consistency of interpretation of data and quality of results of analysis. The issue is that the subtext mantra of the data lake is the capture of data sets in their original state and the deferral of data standardization, validation, and organization until the point of consumption. I don’t object to providing degrees of flexibility to the user community to adapt the data in their own ways. My concern is that without instituting control over the management of consumption models, there is bound to be duplicated work, variant methods of applying similar constraints and standards, and generally a degree of entropy that casts doubt on the veracity of analytical results.
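To make that concern concrete, here is a minimal Python sketch of two consumers applying slightly different standardization rules to the same raw records. The records, field names, and rules are all hypothetical, invented for illustration only; the point is simply that variant rules applied at the point of consumption can produce different answers to the same question.

```python
# Hypothetical raw records as they might land in a data lake, untouched.
RAW_RECORDS = [
    {"customer": "ACME Corp.", "amount": "1,200.50"},
    {"customer": "acme corp", "amount": "300"},
    {"customer": "Acme Corporation", "amount": "99.50"},
]


def consumer_a_key(name: str) -> str:
    """Consumer A: lowercase and strip periods only."""
    return name.lower().replace(".", "").strip()


def consumer_b_key(name: str) -> str:
    """Consumer B: same as A, but also folds 'corporation' into 'corp'."""
    return consumer_a_key(name).replace("corporation", "corp")


def total_by_customer(records, key_fn):
    """Sum amounts per customer using one consumer's standardization rule."""
    totals = {}
    for rec in records:
        key = key_fn(rec["customer"])
        amount = float(rec["amount"].replace(",", ""))
        totals[key] = totals.get(key, 0.0) + amount
    return totals


totals_a = total_by_customer(RAW_RECORDS, consumer_a_key)
totals_b = total_by_customer(RAW_RECORDS, consumer_b_key)
print(len(totals_a), "customers according to consumer A")
print(len(totals_b), "customers according to consumer B")
```

Both consumers did reasonable work, yet they disagree on how many customers exist, which is the kind of entropy a shared, governed set of rules is meant to prevent.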
In fact, what seems to be emerging leaves me with a feeling of déjà vu – data users complaining about the elongated time to profile and validate data, the apparently self-organized data silos (driven by source, not function), and multiple users left to their devices for data mapping and transformation. So I have to admit that when I was recently briefed by folks at Informatica about their latest release, I was relieved to see that at last there is a vendor who has not only recognized that these challenges are real, but also that they need to be addressed by integrating data governance, stewardship, and quality into a framework for end-to-end big data management.
I would call out two key features of this concept. The first is expanding the data integration capabilities of the tool suite to ingest a broad array of structured and unstructured data configurations, tunable to adapt to rapid changes in the sources (especially continuous and non-relational streams, whose formats are subject to change with little or no notice). Embedding logic within the information ingestion and flow processes allows for self-adjustment and dramatically reduces developer overhead as the formats evolve.
The second is the inclusion of metadata capabilities to facilitate governance of big data. The key aspects of big data governance include a shared enterprise business glossary that is ripe for collaborative discussion and analysis, profiling and discovery utilities for big data sets to inform data quality initiatives that can also be shared among data consumers, and end-to-end data lineage enabling monitoring of data flows to assess opportunities for optimization, reduce duplicative coding efforts, and to evaluate impacts as data sources change over time.
I am confident that awareness of balancing the governance needs against the potential benefits of big data and the data lake will help narrow the overwhelming manual efforts that could explode as a result of increased data sources, volumes, and variability. As the prominent vendors like Informatica continue to call out the potential issues and provide solutions, the integrity of predictive and prescriptive models will enhance the creation of corporate value.
The architecture of the venerable enterprise data warehouse, while deeply-rooted in the need for performance, reflects the design decisions made at the dawn of the age of reporting and analytics. In the mid-1990s, ensuring the performance of production transaction processing systems and maintaining sub-second response time remained the highest priority for the analytics architecture. And while the desire for reporting and analysis led to the creation of alternate data organizations for the data warehouse, the potential drain on computing resources motivated early designers to segregate the data on a separate platform, with its own specialized data models and applications.
This decision, wise at the time, has created an entire ecosystem of ‘applicationware,’ hardware dependencies, and skills requirements to support the objectives for reporting and analytics. As the speed and efficiency of computing resources have improved over time, though, the performance drivers have changed as well, exposing a different set of challenges that need to be considered and addressed. These challenges can be divided into three key areas:
These challenges lead many organizational architects to consider abandoning their enterprise data warehouse while seeking greener pastures (with correspondingly green technology). And in some camps, there is a perception that the emergence of Big Data (in general) and Hadoop (in particular) is sounding the death knell for the enterprise data warehouse as we know it. With organizations aching to adopt Hadoop, it may seem that these enterprises are prepared to abandon their decades-long investment in infrastructure, software, staffing, and development.
As a replacement platform, Hadoop (as well as other high performance NoSQL tools) can be used to simplify the acquisition and storage of diverse data sources, whether structured, semi-structured (web logs, sensor feeds), or unstructured (social media, image, video, audio). In addition, data distribution and parallel processing can speed execution of algorithmic applications and analyses, and provide elastic augmentation to existing storage resources.
However, at the current level of system maturity Hadoop does not necessarily address our aforementioned challenges. While there is a promise of linear scalability, migrating reporting and analytics to a big data platform does not address data dependencies and synchronization requirements. Data sets will still need to be moved from their origination points to a separate analytics system. Re-platforming from an existing EDW to Hadoop may incur significant costs, especially in terms of reprogramming vast quantities of production-class SQL queries, end-user reporting tool configurations, and coded solutions for analytics.
So despite the apparent (and justified) benefits of the growing capabilities of the different big data platforms, a more reasoned and responsible approach would blend consideration of new technologies like Hadoop with new options for extending the value of the existing information architecture investment. Consider that:
This raises the question: do you want to continue moving data from original sources and staging platforms to a segregated system, or do you want to examine ways of keeping the data sets where they are and redevelop around new services interfaces layered using data virtualization?
The key to balancing the existing EDW’s value while incrementally adding new analytics componentry is data virtualization. Data virtualization tools enable independently designed and deployed data structures to be leveraged together as a single source, in real time, and with limited (or often no) data movement. According to noted data virtualization expert Rick van der Lans, “data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.”
Data virtualization tools specifically adapted to mainframe environments (such as the z class IBM mainframes) use a special mainframe processing engine (one example being the IBM System z Integrated Information Processor, or zIIP) to handle data transformation and facilitate access to the data store on the mainframe.
Not only does this eliminate a significant amount of mainframe processing, but it also provides a low latency method to satisfy the data requests for downstream business intelligence and visualization tools. At the same time, the data virtualization methodology uses federation techniques to access data on external platforms (in internal relational database management systems, web/mobile data, data in the cloud, and with varying degrees of imposed structure) to create composite views of the information that is not in the data warehouse.
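As a rough illustration of the federation idea, the following Python sketch builds a composite view across two differently shaped sources without copying either one into a warehouse. It is only a sketch under stated assumptions: an in-memory SQLite table stands in for a relational store, a list of dictionaries stands in for a web/mobile document feed, and the table, field, and function names are hypothetical. It says nothing about how any particular data virtualization product is implemented.

```python
import sqlite3

# Stand-in for a relational store (hypothetical table and columns).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE accounts (account_id INTEGER, name TEXT)")
warehouse.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob")],
)

# Stand-in for a non-relational source, e.g. documents from a web/mobile feed.
web_events = [
    {"account_id": 1, "event": "login"},
    {"account_id": 1, "event": "purchase"},
    {"account_id": 2, "event": "login"},
]


def federated_activity_view():
    """Yield one composite row per account, reading each source in place."""
    counts = {}
    for doc in web_events:
        counts[doc["account_id"]] = counts.get(doc["account_id"], 0) + 1
    query = "SELECT account_id, name FROM accounts ORDER BY account_id"
    for account_id, name in warehouse.execute(query):
        yield {
            "account_id": account_id,
            "name": name,
            "event_count": counts.get(account_id, 0),
        }


view = list(federated_activity_view())
for row in view:
    print(row)
```

The consumer sees a single abstracted view; where each attribute physically lives stays hidden behind the function.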
Data virtualization provides an abstracted view of organized data potentially drawn from heterogeneous sources, and using the right tools, can be deployed on a mainframe’s integrated processors as long as it:
Virtualizing a data warehouse deployed on a mainframe using a specialty processing engine allows you to leave the mainframe data in place, avoiding the cost and complexity of data movement. The integrated processor uses the existing storage capacity of the mainframe, which reduces network bandwidth demand while providing real-time integration with transaction data. When the data virtualization tool can federate to big data storage environments like Hadoop/HDFS or NoSQL platforms, it enables programmers to use modern APIs such as MongoDB without demanding that the data be offloaded from the mainframe.
In this section we will discuss data virtualization use cases for enhancing the existing data warehouse environment, including (but not necessarily limited to):
There is a common theme that flows across all these use cases – leveraging data virtualization as a strategic tool for enabling, extending, or continuing accessibility to an enterprise data warehouse. Over time, these use cases demonstrate how data virtualization enables the eventual incorporation of a wide variety of data assets for analytics. In some cases, data virtualization can help make the case for streaming data into the mainframe-based EDW when it is more efficient than migrating to a new platform.
There is no doubt that the attraction of emerging data management paradigms such as NoSQL and Hadoop will prove to be a strong motivating factor in corporate reengineering and re-platforming data warehouses from their heritage environments. However, at the same time, it would be irresponsible to abandon the resources and time invested in developing production systems that are more than adequate to address a healthy proportion of today’s business reporting and analytics needs. And as the vision for the future analytics environment takes shape, you will see that there is enough room for both emerging technologies and trusted heritage environments. The trick will be to balance the continued expanded use of the traditional systems with the design, development, and deployment of newer systems.
As we have seen, data virtualization provides a way to bridge these technologies. Data virtualization can help revitalize the data warehouse (particularly those involved with mainframe data) through a variety of hybrid approaches for data accessibility, thereby extending the useful life of existing platform investments. Data virtualization can be a key component of the strategy for continuing to extract value out of the years of underwritten costs. Adopting a strategy that retains the use of existing mainframe capabilities will preserve the investment in the development of SQL and code for reporting and analysis.
When it comes to organizations dependent on mainframe data, before disavowing the trusted data warehouse, consider some of these questions:
Each of these questions refers to dependencies on existing production systems, some of which are deployed on a mainframe, with an expectation for incorporation of emerging tools for enhancement and growth. That suggests that an effective strategy for saving your data warehouse investments in the near-term and medium-term will incorporate a hybrid architecture combining the mainframe and data virtualization to provide the transition environment for the future of reporting and analytics.
David Loshin, president of Knowledge Integrity, Inc., (www.knowledge-integrity.com), is a recognized thought leader and expert consultant in the areas of data quality, master data management, and business intelligence. David is a prolific author regarding best practices for data management, business intelligence, and analytics, and has written numerous books and papers on these topics. Most recently, he is the author of “Big Data Analytics” (Morgan Kaufmann 2013). His book, “Business Intelligence: The Savvy Manager’s Guide” (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing, and how all of the pieces work together.” He is the author of “Master Data Management,” which has been endorsed by data management industry leaders, and the recently-released “The Practitioner’s Guide to Data Quality Improvement,” focusing on practical processes for improving information utility. Visit http://dataqualitybook.com for more insights on data management.
David can be reached at email@example.com.
Rocket Software is a leading global developer of software products that help corporations, government agencies and other organizations reach their technology and business goals. 1,100 Rocketeers on five continents are focused on building and delivering solutions for more than 10,000 customers and partners – and five million end users.
Rocket Data Virtualization enables mainframe relational and non-relational data to seamlessly integrate with Big Data, Analytics, and Web/Mobile initiatives; eliminating the need to move or replicate data, and with significantly reduced costs, complexity and risk.
IMS, IDMS, Natural) can interface with DV to request data for itself or another application
Our customers tell us that IBM System z—the mainframe—is still the best platform in the world for running their critical business applications. And those applications generate and access large data volumes—big data. Increasingly, those applications and data must connect with other applications within the enterprise and even outside the enterprise. Rocket has deep domain expertise and world-class technology to keep the data where it belongs and move the analytics closer to the data.
van der Lans, Rick F., “Data Virtualization for Business Intelligence Systems,” Morgan Kaufmann, 2012.
First, at the risk of pushing people to another website, here is the negative review.
Interesting feedback, and thanks for posting the review.
I am happy to reflect on your critique, particularly in relation to my experiences in talking to those rabid Hadoop adopters who can barely spell the word scalability, let alone understand what it truly means. For example, I have a customer who has embarked on a pilot project for Hadoop, focusing on loading a subset of their data into a cluster to test-drive its capabilities. However, they (like most large organizations) have limited understanding of the inner workings of their own systems. This means that they look to migrate their circa-1985 mainframe applications to Hadoop and expect that they will get order of magnitude speedups with a fraction of the cost. In reality they get minimal speedup and the same cost.
I mentioned in the webinar a presentation I had heard in which the presenter shared his experience in using MapReduce for monitoring access counts for millions of URLs. When it dawned on the application development team that the lion’s share of the time of the MapReduce application was spent shuffling the URL visit counts across their network, they determined that to get any reasonable performance they had to *sort* all of their data before they loaded it into Hadoop. OK, so the sort time is now a preprocessing stage that is not accounted for on Hadoop, and their MapReduce ran a lot faster, but clearly the overall execution time for the entire application was not significantly improved at all. Great case study.
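For readers who want to see why sorting would matter, here is a small Python simulation. It is my own sketch, not the presenter's code: it assumes each map task pre-aggregates its input split locally (a combiner) and simply counts how many (URL, count) pairs would then have to cross the network in the shuffle. The URL names, data volumes, and split size are made up.

```python
import random
from collections import Counter


def shuffle_records(visits, split_size):
    """Count the (url, count) pairs a map phase would emit to the shuffle
    when each input split is pre-aggregated locally (a combiner)."""
    emitted = 0
    for start in range(0, len(visits), split_size):
        split = visits[start:start + split_size]
        emitted += len(Counter(split))  # one pair per distinct URL in the split
    return emitted


rng = random.Random(42)
urls = [f"/page/{i}" for i in range(1000)]          # hypothetical URL space
visits = [rng.choice(urls) for _ in range(100_000)]  # hypothetical visit log

unsorted_pairs = shuffle_records(visits, split_size=10_000)
sorted_pairs = shuffle_records(sorted(visits), split_size=10_000)
print("pairs shuffled, unsorted input:", unsorted_pairs)
print("pairs shuffled, sorted input:  ", sorted_pairs)
```

With unsorted input nearly every URL shows up in every split, so each split emits a pair for almost every URL; with sorted input each URL is confined to one or two splits. Note what the simulation leaves out, which is exactly the point of the case study: the call to sorted() is work done outside the job, and it does not come for free.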
The point about going heavy on memory follows accordingly: the scalability bottleneck is tied to data movement (both disk and network), so managing more data in-memory diminishes the impact. As you suggest, this is not new, and I agree: I worked on memory hierarchy optimizations 20 years ago when I was designing compiler optimizations for MPP systems. However, it is good to see that the big software vendors are now aware of this (e.g. SAP HANA). The moral of that point is that when you are configuring a system, balance your need (and fund your budget) for memory in relation to the types of applications and their corresponding performance requirements.
Next, your point in relation to differentiating between Hadoop 1.0 and YARN: Pretend you are a business person tasked with making a decision about big data and spend a few minutes reading about YARN and see if you can easily understand what the difference is between 1.0 and YARN. If you apply the same critical eye that you used with the webinar, you’ll be sure to point out that not only is the difference only clear to a person with deep technical knowledge, there is little (or no) value proposition or justification presented. Only a description of the differences in the new version. When you consider the TDWI audience (largely heritage data management/data warehousing practitioners plus a number of their business associates), you will understand that they are not typically literate around Hadoop and are happy to have these details spelled out.
I do think you are somewhat limiting in linking the term “data lake” solely to the concept of an ODS. I have at least two clients who are in the process of dumping *all* of their data onto Hadoop under the presumption that its “scalable storage” makes it a clear winner from a perspective of low-cost persistent storage. However, in both cases the use of HDFS as a “data lake” is more of a “data dump” for all data artifacts and has neither the structure nor the intent of use as an operational data store, particularly in relation to data warehousing. This tells me that there is a desire to use Hadoop/HDFS as an archival dump more than anything else. One of those two clients said to me that they were dumping all of their data on Hadoop because they wanted to do analytics. When I asked what kind of analytics, they said “predictive analytics.” When I asked what they were hoping to accomplish using predictive analytics, they no longer had an answer. They cycled back to saying that they wanted to do text analysis and use that for predictive analytics.
On the other hand, the types of applications that are emerging on commodity-based high performance computing systems are expanding beyond the “data warehouse” and data analytics to more computation-based applications that use *data structures* (as opposed to databases). Examples include social network analysis (you want to have the graph in memory), protein structure prediction (you want to have the complex molecule data structures in memory), multidimensional nearest neighbor and other types of iterative data mining algorithms (which look to having their analyzed entity data structures in memory), cybersecurity, public protection, etc. Next, consider the ability to virtualize access to in-memory databases in ways that allow for simultaneous transaction processing and analytical processing, eliminating the need for a data warehouse (and consequently an ODS).
In general, the technology media do a good job of hyping new technology but are not as good at explaining its value or telling you how to determine when the new technology is better than using that old mainframe. That is what I have in mind when I do webinars like these.
I have read through some of the blog entries on your site and the common theme seems to be criticism of one sort or another of presentations and presenters, be it a webinar or a presentation at BBBT. It is pretty easy to throw darts at others. Please let me know when your next webinar is coming up and I will be sure to attend. I’ll be happy to share my thoughts with you afterward.
Absorbing the data from the proliferation of interconnected devices, all of which generate and communicate continuous streams of data, is the natural next area of focus for all the big data people, especially when it comes to going beyond the acquisition of these numerous continuous streams. The next step would be not just to collect that data, but to be able to make sense of the information that can be inferred from it.
Consider automobile manufacturers, who have implanted numerous sensors within the cars they create. It is one thing to have a tire pressure sensor continuously monitor the pressure in each tire and alert the driver when the pressure is low. But here are two contrived examples. The first would integrate the sensor readings from all the tires as well as monitor weather conditions and current traffic conditions where the car is being driven to apply an algorithm to determine whether changes in tire pressure are real problems or if they can be attributed to fluctuations in outside temperature coupled with the way the car is being driven. This example streams data from sensors and other sources of data within algorithmic models to inform the driver of potential issues.
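To give the first contrived example a bit of shape, here is a minimal Python sketch of the temperature part only. It treats the tire as a fixed volume of ideal gas, so absolute pressure scales with absolute temperature, and flags a reading only when it falls below that temperature-adjusted expectation. The atmospheric constant, the tolerance, and the function names are my assumptions; a real model would also fold in the traffic and driving-style inputs mentioned above, which this sketch ignores.

```python
ATMOSPHERIC_PSI = 14.7  # approximate sea-level pressure; an assumption


def expected_gauge_psi(ref_gauge_psi, ref_temp_c, current_temp_c):
    """Gauge pressure expected from temperature change alone, treating the
    tire as a fixed volume of ideal gas (pressure proportional to kelvin)."""
    ref_abs = ref_gauge_psi + ATMOSPHERIC_PSI
    ratio = (current_temp_c + 273.15) / (ref_temp_c + 273.15)
    return ref_abs * ratio - ATMOSPHERIC_PSI


def is_real_problem(measured_psi, ref_gauge_psi, ref_temp_c, current_temp_c,
                    tolerance_psi=1.5):
    """Flag a reading only if it falls below the temperature-adjusted
    expectation by more than the (hypothetical) tolerance."""
    expected = expected_gauge_psi(ref_gauge_psi, ref_temp_c, current_temp_c)
    return expected - measured_psi > tolerance_psi


# Tire set to 35 psi at 20 C, now read on a -5 C morning.
print(is_real_problem(31.0, 35.0, 20.0, -5.0))
print(is_real_problem(27.0, 35.0, 20.0, -5.0))
```

The first reading is a few psi under the set point but consistent with the cold weather; the second is well below what temperature alone would explain.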
The second example goes a bit further – the manufacturer has the sensors in all of their cars that are on the road wirelessly report their readings back to the company on a regular basis. In turn, the company can monitor for issues, part failures, correlation between locations driven, external conditions, the owners’ maintenance behaviors, as well as other sources of data to proactively identify potential issues and alert the owner (or maybe the car itself!) about how to mitigate any impending risks.
Both of these examples are indicative of the maturation of the big data thought process, suggesting new ideas of what to do with all of that big data you can collect. But in turn, recognize also that both of these examples (and others like them) are predicated on the ability to go beyond just collecting, storing, and processing that data. To achieve these benefits, you need to be able to align these variant (and sometimes less-than-reliable) data sets in ways so that they can be incorporated logically into the appropriate analytical models.
That requires data management and integration mechanisms that combine knowledge of structure with inferred knowledge about the actual content to drive harmonization. While we have tools that can contribute some of these capabilities, it appears that we are still close to the starting gate when it comes to universally being able to make sense of all this information.
That being said, some vendors seem to understand these challenges and have embarked on developing a roadmap that seeks to not only address the mechanical aspects of acquisition and composition, but also to fuse pattern analysis, machine learning, and predictive analytics techniques with the more mundane aspects of data profiling, scanning, parsing, cleansing, standardization, and harmonization, as well as governance aspects such as security and data protection.
An example is Informatica, whose user event I am currently attending. At this event, the management team has presented their vision for the next few years, and it speaks to a number of the concepts and challenges I have raised in this posting. Some specific aspects include evolving core capabilities for data quality and usability and ratcheting them up a notch to enable business users to make use of information without relying on the IT crutch. This vision includes data discovery and automated inventorying and classification that can adapt different methods for data preparation to encourage a greater level of business self-service, no matter where the data lives. At the same time, they are also attempting to address the issue of data protection, a challenge that only seems to be expanding. I am looking forward to monitoring the actualization of this vision over the next 6-12 months.
When I mentioned this to a colleague, he told me that he had programmed a small utility to generate millions of transaction records. And of course, I have enough programming experience to conjure up a similar engine that could spit out lots of randomized (and imaginary) transactions. While this addressed my first constraint, it did not touch upon the second one. The randomness is good for generating a lot of data, but becomes a drawback when you are looking for latent issues and dependencies that you’d like to test. The upshot is that simple approaches can help in generating a data set, but may not help in generating the data set that you really want to test.
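Here is roughly what such an engine might look like in Python, as a sketch only. The transaction fields, value ranges, and the two hand-written cases are hypothetical; the point is that the random generator never produces the conditions you most want to test unless you add them deliberately.

```python
import random
from datetime import date, timedelta


def random_transactions(n, seed=0):
    """Generate n imaginary transactions with uniformly random fields."""
    rng = random.Random(seed)
    start = date(2024, 1, 1)
    return [
        {
            "txn_id": i,
            "account": rng.randint(1, 500),
            "amount": round(rng.uniform(0.01, 1000.00), 2),
            "date": start + timedelta(days=rng.randrange(365)),
        }
        for i in range(n)
    ]


def targeted_transactions():
    """Hand-written cases aimed at conditions random data rarely produces."""
    return [
        {"txn_id": -1, "account": 1, "amount": 0.00, "date": date(2024, 2, 29)},
        {"txn_id": -2, "account": 1, "amount": -50.00, "date": date(2024, 12, 31)},
    ]


test_data = random_transactions(10_000) + targeted_transactions()
negatives = [t for t in test_data if t["amount"] < 0]
print(len(test_data), "records,", len(negatives), "with a negative amount")
```

Ten thousand random records give you volume, but the only negative amount in the set is the one that was planted by hand.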
And this gap becomes much more serious when you look at enterprise development projects in which the production data is not available for either the development process or for testing. And even in situations where production data is made available, when testing new product features, there may not be any production data to test. In other words, there is a need to go beyond a simplistic approach and use a more sophisticated approach for generating test data.
Therefore, what would you look for in an automated test data generation tool? If I sat down and thought about it for long, I could probably come up with a really long laundry list. However, even a few minutes of noodling yields what I might call a hefty list of demands, namely that an automated test generation tool must:
Fortunately, I had the opportunity to be briefed by Informatica about test data management in their upcoming 9.6 release. Apparently, they have been thinking the same thoughts about test data generation, since this release blends the types of capabilities I wished for in the above list with aspects of data protection employing encryption, masking, and data scrambling. In addition, I was told that the test data generator links with the PowerCenter metadata repository as well as the data profiling capabilities of their Data Quality tool. This means that the profiler can be used to accumulate knowledge about metadata within a data set to be modeled, as well as statistical information about data value distributions to guide test data generation. Lastly, data quality business rules can be used to guide the generation of specific instances that are to be subjected to testing and validation.
My perception is that with a tool like Informatica’s Test Data Management, it should be possible for enterprises either to augment existing test data that follows their business rules, with its ‘quirkiness’ and associated error conditions, or to generate test data from the ground up that simulates their production data.
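The general idea of letting profiled value distributions guide generation can be sketched in a few lines of Python. To be clear, this is my own illustration of the concept and not Informatica's API or implementation; the column sample and function names are hypothetical.

```python
import random
from collections import Counter


def profile_column(values):
    """Return the observed values and their frequencies for one column."""
    counts = Counter(values)
    return list(counts.keys()), list(counts.values())


def generate_like(values, n, seed=0):
    """Draw n synthetic values following the profiled frequency distribution."""
    distinct, weights = profile_column(values)
    return random.Random(seed).choices(distinct, weights=weights, k=n)


# Hypothetical sample of a production column (e.g. payment type).
sample = ["card"] * 70 + ["cash"] * 25 + ["check"] * 5
synthetic = generate_like(sample, n=1000)
print(Counter(synthetic).most_common())
```

The synthetic column contains no production values beyond the category labels themselves, yet it roughly preserves how often each one occurs, which is what makes downstream tests behave more like they would against real data.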
The lowest level in this diagram, file management, forms the basis for all information management activities, and is often handled intrinsically within the operating system. However, with the growing interest in big data and its dependence on distributed file structures, file management reemerges as a critical component of the information management stack. The ability to differentiate between structured and unstructured data then becomes valuable in the context of determining the optimal methods for data storage and file management. Metadata management is used for managing data standards, reference data, and standard data models.
Decisions about data organization for both structured and unstructured data will influence the decisions made for data management, such as the choice among what might be deemed “legacy” data management frameworks (such as IMS or VSAM files), traditional relational database management systems (RDBMS), and newer NoSQL methods, and the decision as to whether these management schemes are to be implemented on top of big data platforms.
Above those levels we can begin to consider aspects of content. The first layer is master data management, used to provide shared access to a unified view of core data domains such as customer and product. Business processes rely on database management systems; transaction systems use transaction-oriented models for RDBMS systems, while reporting and analytics systems will use alternate data schemas in data warehouses that are optimized for rapid access. Data integration methods are used to facilitate the movement and integration of information into the target systems, while data quality and enrichment methods and tools will ensure observance of the quality expectations for the user community. Lastly, enabling business applications to access the various data resources requires data access and control methods.
We can consider a variety of areas that are to be governed by policies, including:
Data governance is a practice that frames information use along these dimensions and guides the key stakeholders in program management and development of the policies and standards and corresponding processes.
A strategy is a plan and set of policies intended to help achieve specific objectives. An information strategy elucidates the way that principles for information use across the organization will help the organization achieve its intended goals. That information strategy spans all aspects of the information lifecycle, including acquisition, management, accessibility, sharing, and disposition in a way that satisfies all information consumer requirements.
In my upcoming posts I will explore some aspects of each of these.