I would call myself a proponent of big data and, correspondingly, big data analytics. As a professional who has been involved in high-performance computing since the late 80s, I am glad to finally see the rapid adoption of commodity-based systems providing data distribution and parallel computing, such as what can be assembled and deployed using Hadoop.
One particularly curious innovation, at least to me, is the concept of the “data lake.” According to TechTarget, a data lake is “a large object-based storage repository that holds data in its native format until it is needed.” A data lake provides a place for collecting data sets in their original format, making those data sets available to different consumers, and allowing data users to consume that data in ways specific to their need. The benefits of the data lake concept include the ability to rapidly spin up a data repository, rapid ingestion of new data sources, and the direct data accessibility to analytical models and applications.
However, ability to allow data users to consume the data in their own ways seems to hide a potential time-bomb in terms of ensuring consistency of interpretation of data and quality of results of analysis. The issue is that the subtext mantra of the data lake is the capture of data sets in their original state and the deferral of data standardization, validation, and organization until the point of consumption. I don’t object to providing degrees of flexibility to the user community to adapt the data in their own ways. My concern is that without instituting control over the management of consumption models, there is bound to be duplicated work, variant methods of applying similar constraints and standards, and generally a degree of entropy that casts doubt about the veracity of analytical results.
In fact, what seems to be emerging leaves me with a feeling of déjà vu – data users complaining about the elongated time to profile and validate data, the apparently self-organized data silos (driven by source, not function), and multiple users left to their devices for data mapping and transformation. So I have to admit that when I was recently briefed by folks at Informatica about their latest release, I was relieved to see that at last there is a vendor who has not only recognized that these challenges are real, but also that they need to be addressed by integrating data governance, stewardship, and quality into a framework for end-to-end big data management.
I would call out two key features of this concept. The first is expanding the data integration capabilities of the tool suite to ingest a broad array of structured and unstructured data configurations, tunable to adapt to rapid changes in the sources (especially continuous and non-relational streams, whose formats are subject to change with little or no notice). Embedding logic within the information ingestion and flow processes allows for self-adjustment and dramatically reduces developer overhead as the formats evolve.
The second is the inclusion of metadata capabilities to facilitate governance of big data. The key aspects of big data governance include a shared enterprise business glossary that is ripe for collaborative discussion and analysis, profiling and discovery utilities for big data sets to inform data quality initiatives that can also be shared among data consumers, and end-to-end data lineage enabling monitoring of data flows to assess opportunities for optimization, reduce duplicative coding efforts, and to evaluate impacts as data sources change over time.
I am confident that awareness of balancing the governance needs against the potential benefits of big data and the data lake will help narrow the overwhelming manual efforts that could explode as a result of increased data sources, volumes, and variability. As the prominent vendors like Informatica continue to call out the potential issues and provide solutions, the integrity of predictive and prescriptive models will enhance the creation of corporate value.
How prepared is your organization for technology transition? Are you aware of the different facets of the process for a strategic technology renovation plan? We would like to find out, and therefore we are conducting a survey of technology professionals and their management hierarchy on transitioning to new technology (bit.ly/decisionworxsurvey). All respondents are eligible for a copy of the survey report and will be entered into a drawing for a free copy of my book “Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph.”
Please help us by taking the survey and sharing the link with others!
My casual monitoring of data management buzz phrases suggests that, as an industry, we are beginning to transition our hysteria over “big data” to a new compulsion with what is referred to as the “Internet of Things,” (IoT). Informally, the IoT refers to the integration of communication capabilities within lots of different uniquely identified devices that effectively creates a massive network for the exchange of data. These devices can range from vending machines to sensors attached to jet engines to implanted medical devices.
Naturally, absorbing the data from the proliferation of interconnected devices that all are generating and communicating continuous streams of data is the natural next area of focus for all the big data people, especially when it comes to going beyond the acquisition of these numerous continuous streams. The next step would be to not just collect that data, but be able to make sense of the information that can be inferred from the data.
Consider automobile manufacturers, who have implanted numerous sensors within the cars they create. It is one thing to have a tire pressure sensor continuously monitor the pressure in each tire and alert the driver when the pressure is low. But here are two contrived examples. The first would integrate the sensor readings from all the tires as well as monitor weather conditions and current traffic conditions where the car is being driven to apply an algorithm to determine whether changes in tire pressure are real problems or if they can be attributed to fluctuations in outside temperature coupled with the way the car is being driven. This example streams data from sensors and other sources of data within algorithmic models to inform the driver of potential issues.
The second example goes a bit further – the manufacturer has the sensors in all of their cars that are on the road wirelessly report their readings back to the company on a regular basis. In turn, the company can monitor for issues, part failures, correlation between locations driven, external conditions, the owners’ maintenance behaviors, as well as other source of data to proactively identify potential issues and alert the owner (or maybe the car itself!) about how to mitigate any impending risks.
Both of these examples are indicative of the maturation of the big data thought process, suggesting new ideas of what to do with all of that big data you can collect. But in turn, recognize also that both of these examples (and others like them) are predicated on the ability to go beyond just collecting, storing, and processing that data. To achieve these benefits, you need to be able to align these variant (and sometimes less-than-reliable) data sets in ways so that they can be incorporated logically into the appropriate analytical models.
That requires data management and integration mechanisms that combine knowledge of structure with inferred knowledge about the actual content to drive harmonization. While we have tools that can contribute some of these capabilities, it appears that we are still close to the starting gate when it comes to universally being able to make sense of all this information.
That being said, some vendors seem to understand these challenges and have embarked on developing a roadmap that seeks to not only address the mechanical aspects of acquisition and composition, but also to fuse pattern analysis, machine learning, and predictive analytics techniques with the more mundane aspects of data profiling, scanning, parsing, cleansing, standardization, and harmonization, as well as governance aspects such as security and data protection.
An example is Informatica, whose user event I am currently attending. At this event, the management team has presented their vision for the next few years, and it speaks to a number of the concepts and challenges I have raised in this posting. Some specific aspects include evolving core capabilities for data quality and usability and ratcheting them up a notch to enable business users to make use of information without relying on the IT crutch. This vision includes data discovery and automated inventorying and classification that can adapt different methods for data preparation to encourage a greater level of business self-service, no matter where the data lives. At the same time, they are also attempting to address the issue of data protection, a challenge that only seems to be expanding. I am looking forward to monitor the actualization of this vision over the next 6-12 months.
Hi Folks, I recently published a new technical paper on the use of auxiliary processors on IBM System z class machines to support virtualization of mainframe data that allows you to bypass the need for extracting data prior to using that data for reporting and analysis. You can access the paper, which was sponsored by Rocket Software, via this link. Please email me or post comments and let me know what you think!
A few years ago I was working on configuring a test for comparing data transformation and loading into a variety of target platforms. Essentially I was hoping to assess the comparative performance of different data management schemes (open source relational databases, enterprise versions of relational databases, columnar data stores, and other NoSQL-style schemes). But to do this, I had two constraints that I needed to overcome. The first was the need for a data set that was massive enough to really push the envelope when it came to evaluating different aspects of performance. The second was a little subtler: I needed the data set to exhibit certain data error and inconsistency characteristics that simulated a real-life scenario.