Tools Vendors and Managing the Challenges of the Data Lake
I would call myself a proponent of big data and, correspondingly, big data analytics. As a professional who has been involved in high-performance computing since the late 80s, I am glad to finally see the rapid adoption of commodity-based systems providing data distribution and parallel computing, such as those that can be assembled and deployed using Hadoop.
One particularly curious innovation, at least to me, is the concept of the “data lake.” According to TechTarget, a data lake is “a large object-based storage repository that holds data in its native format until it is needed.” A data lake provides a place for collecting data sets in their original format, making those data sets available to different consumers, and allowing data users to consume that data in ways specific to their needs. The benefits of the data lake concept include the ability to rapidly spin up a data repository, rapid ingestion of new data sources, and direct access to the data for analytical models and applications.
However, the ability to let data users consume the data in their own ways seems to hide a potential time bomb when it comes to ensuring consistent interpretation of the data and the quality of analytical results. The issue is that the underlying mantra of the data lake is the capture of data sets in their original state and the deferral of data standardization, validation, and organization until the point of consumption. I don’t object to providing degrees of flexibility to the user community to adapt the data in their own ways. My concern is that without instituting control over the management of consumption models, there is bound to be duplicated work, variant methods of applying similar constraints and standards, and generally a degree of entropy that casts doubt on the veracity of analytical results.
In fact, what seems to be emerging leaves me with a feeling of déjà vu: data users complaining about the length of time it takes to profile and validate data, apparently self-organized data silos (driven by source, not function), and multiple users left to their own devices for data mapping and transformation. So I have to admit that when I was recently briefed by folks at Informatica about their latest release, I was relieved to see that at last there is a vendor who has recognized not only that these challenges are real, but also that they need to be addressed by integrating data governance, stewardship, and quality into a framework for end-to-end big data management.
I would call out two key features of this concept. The first is expanding the data integration capabilities of the tool suite to ingest a broad array of structured and unstructured data configurations, tunable to adapt to rapid changes in the sources (especially continuous and non-relational streams, whose formats are subject to change with little or no notice). Embedding logic within the information ingestion and flow processes allows for self-adjustment and dramatically reduces developer overhead as the formats evolve.
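To make the idea of ingestion logic that adjusts to changing formats a little more concrete, here is a minimal sketch in Python. It is my own illustration, not a description of how Informatica's tools work. The field names, the FIELD_ALIASES table, and the normalize_record function are all hypothetical. The sketch maps known variants of a field name to one canonical name and sets aside any field it does not recognize, so that a renamed or newly added field does not break the flow.

```python
from typing import Any, Dict

# Hypothetical alias table: source field names (including renamed variants)
# mapped to the canonical names that downstream consumers expect.
FIELD_ALIASES: Dict[str, str] = {
    "cust_id": "customer_id",
    "customerId": "customer_id",
    "customer_id": "customer_id",
    "ts": "event_time",
    "timestamp": "event_time",
    "event_time": "event_time",
}


def normalize_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map known field variants to canonical names; keep unknown fields aside."""
    normalized: Dict[str, Any] = {}
    unmapped: Dict[str, Any] = {}
    for name, value in raw.items():
        canonical = FIELD_ALIASES.get(name)
        if canonical is None:
            # Unrecognized field: preserve it for later review instead of failing.
            unmapped[name] = value
        else:
            normalized[canonical] = value
    normalized["_unmapped"] = unmapped
    return normalized
```

In this sketch, absorbing a source that renames a field means adding one entry to the alias table rather than rewriting each consumer's code, and the contents of "_unmapped" give a steward a place to look for new fields that still need a definition.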
The second is the inclusion of metadata capabilities to facilitate governance of big data. The key aspects of big data governance include a shared enterprise business glossary that is ripe for collaborative discussion and analysis; profiling and discovery utilities for big data sets, which inform data quality initiatives and whose results can be shared among data consumers; and end-to-end data lineage, which enables monitoring of data flows to assess opportunities for optimization, reduce duplicative coding efforts, and evaluate impacts as data sources change over time.
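As a rough illustration of what a shared profiling utility might compute, here is a small Python sketch. Again, this is my own example under simple assumptions (records are dictionaries with hashable values, and a value of None or an absent key counts as missing), and the profile_records function is hypothetical rather than part of any vendor's product.

```python
from typing import Any, Dict, List


def profile_records(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Return per-field counts of present, missing, and distinct values.

    Assumes field values are hashable; None or an absent key counts as missing.
    """
    # Collect every field name seen in any record.
    fields = set()
    for record in records:
        fields.update(record)

    profile: Dict[str, Dict[str, int]] = {}
    for field in sorted(fields):
        values = [record.get(field) for record in records]
        present = [value for value in values if value is not None]
        profile[field] = {
            "present": len(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return profile
```

If a profile like this is computed once and published alongside the data set, each consumer can start from the same picture of completeness and cardinality instead of rediscovering it independently.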
I am confident that weighing governance needs against the potential benefits of big data and the data lake will help contain the overwhelming manual effort that could otherwise explode as data sources, volumes, and variability increase. As prominent vendors like Informatica continue to call out the potential issues and provide solutions, the integrity of predictive and prescriptive models will enhance the creation of corporate value.