My casual monitoring of data management buzz phrases suggests that, as an industry, we are beginning to shift our hysteria over “big data” to a new fixation on what is referred to as the “Internet of Things” (IoT). Informally, the IoT refers to the integration of communication capabilities within lots of different uniquely identified devices, which effectively creates a massive network for the exchange of data. These devices can range from vending machines to sensors attached to jet engines to implanted medical devices.
Absorbing the data from this proliferation of interconnected devices, all of which generate and communicate continuous streams of data, is the natural next area of focus for the big data people, especially when it comes to going beyond the mere acquisition of those streams. The next step would be not just to collect that data, but to make sense of the information that can be inferred from it.
Consider automobile manufacturers, who have implanted numerous sensors within the cars they create. It is one thing to have a tire pressure sensor continuously monitor the pressure in each tire and alert the driver when the pressure is low. But here are two contrived examples. The first would integrate the sensor readings from all the tires, along with weather conditions and current traffic conditions where the car is being driven, and apply an algorithm to determine whether changes in tire pressure are real problems or can be attributed to fluctuations in outside temperature coupled with the way the car is being driven. This example streams data from sensors and other sources into algorithmic models to inform the driver of potential issues.
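To make the first example a little more concrete, here is a minimal Python sketch of how such a check might separate a temperature effect from a real loss of air. Everything in it is an assumption for illustration: the `TireReading` type, the function names, and the 7 kPa tolerance are hypothetical, and the physics is deliberately simplified to a constant-volume gas at the outside temperature, ignoring the heat generated by driving and the traffic conditions mentioned above.

```python
from dataclasses import dataclass

# Standard atmosphere, used to convert gauge pressure to absolute pressure.
ATMOSPHERIC_KPA = 101.325


@dataclass
class TireReading:
    gauge_kpa: float  # pressure reported by the sensor (gauge, not absolute)
    ambient_c: float  # outside temperature when the reading was taken


def expected_gauge_kpa(baseline: TireReading, ambient_c: float) -> float:
    """Gauge pressure expected at ambient_c if no air was lost.

    Assumes constant tire volume, so absolute pressure scales with
    absolute temperature (a simplification, not a tire model).
    """
    baseline_abs = baseline.gauge_kpa + ATMOSPHERIC_KPA
    ratio = (ambient_c + 273.15) / (baseline.ambient_c + 273.15)
    return baseline_abs * ratio - ATMOSPHERIC_KPA


def classify_drop(baseline: TireReading, current: TireReading,
                  tolerance_kpa: float = 7.0) -> str:
    """Label a reading by how much of the drop temperature cannot explain."""
    expected = expected_gauge_kpa(baseline, current.ambient_c)
    unexplained = expected - current.gauge_kpa
    # tolerance_kpa is an arbitrary illustrative threshold.
    return "possible leak" if unexplained > tolerance_kpa else "temperature effect"


if __name__ == "__main__":
    warm = TireReading(gauge_kpa=220.0, ambient_c=25.0)
    cold = TireReading(gauge_kpa=188.0, ambient_c=-5.0)
    print(classify_drop(warm, cold))
```

A real implementation would presumably compare all four tires against each other and fold in driving behavior, but even this crude version shows why the raw sensor value alone is not enough to raise a meaningful alert.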
The second example goes a bit further: the manufacturer has the sensors in all of its cars that are on the road wirelessly report their readings back to the company on a regular basis. In turn, the company can monitor for issues, part failures, and correlations among locations driven, external conditions, the owners’ maintenance behaviors, and other sources of data to proactively identify potential issues and alert the owner (or maybe the car itself!) about how to mitigate any impending risks.
Both of these examples are indicative of the maturation of the big data thought process, suggesting new ideas of what to do with all of that big data you can collect. But recognize also that both of these examples (and others like them) are predicated on the ability to go beyond just collecting, storing, and processing that data. To achieve these benefits, you need to be able to align these variant (and sometimes less-than-reliable) data sets so that they can be incorporated logically into the appropriate analytical models.
That requires data management and integration mechanisms that combine knowledge of structure with inferred knowledge about the actual content to drive harmonization. While we have tools that can contribute some of these capabilities, it appears that we are still close to the starting gate when it comes to universally being able to make sense of all this information.
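As a rough illustration of what combining knowledge of structure with inferred knowledge about content could look like, the sketch below harmonizes tire pressure records arriving under different field names and in different units. The alias table, the canonical field name `pressure_kpa`, and the "values under 100 are probably psi" heuristic are all assumptions made up for this example; they are not taken from any particular tool.

```python
PSI_TO_KPA = 6.894757

# Structural knowledge: hypothetical source field names mapped to one
# canonical name.
FIELD_ALIASES = {
    "tire_pressure": "pressure_kpa",
    "press": "pressure_kpa",
    "pressure_kpa": "pressure_kpa",
}


def harmonize(record: dict) -> dict:
    """Map known fields to canonical names and normalize units to kPa."""
    out = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key.strip().lower())
        if canonical is None:
            continue  # unknown fields are simply dropped in this sketch
        value = float(value)
        # Content-based inference (a heuristic, not a rule): passenger-car
        # pressures below 100 are assumed to be psi rather than kPa.
        if value < 100:
            value *= PSI_TO_KPA
        out[canonical] = round(value, 1)
    return out


if __name__ == "__main__":
    print(harmonize({"Press": "32"}))
    print(harmonize({"pressure_kpa": 221.0}))
```

The heuristic is exactly the kind of thing that breaks on less-than-reliable data, which is why the harder part of the problem is inferring such rules from the content rather than hard-coding them.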
That being said, some vendors seem to understand these challenges and have embarked on developing a roadmap that seeks to not only address the mechanical aspects of acquisition and composition, but also to fuse pattern analysis, machine learning, and predictive analytics techniques with the more mundane aspects of data profiling, scanning, parsing, cleansing, standardization, and harmonization, as well as governance aspects such as security and data protection.
An example is Informatica, whose user event I am currently attending. At this event, the management team has presented their vision for the next few years, and it speaks to a number of the concepts and challenges I have raised in this posting. Some specific aspects include evolving core capabilities for data quality and usability and ratcheting them up a notch to enable business users to make use of information without relying on the IT crutch. This vision includes data discovery and automated inventorying and classification that can adapt different methods for data preparation to encourage a greater level of business self-service, no matter where the data lives. At the same time, they are also attempting to address the issue of data protection, a challenge that only seems to be expanding. I am looking forward to monitoring the actualization of this vision over the next 6-12 months.
Hi Folks, I recently published a new technical paper on the use of auxiliary processors on IBM System z class machines to support virtualization of mainframe data that allows you to bypass the need for extracting data prior to using that data for reporting and analysis. You can access the paper, which was sponsored by Rocket Software, via this link. Please email me or post comments and let me know what you think!
A few years ago I was working on configuring a test for comparing data transformation and loading into a variety of target platforms. Essentially I was hoping to assess the comparative performance of different data management schemes (open source relational databases, enterprise versions of relational databases, columnar data stores, and other NoSQL-style schemes). But to do this, I had two constraints that I needed to overcome. The first was the need for a data set that was massive enough to really push the envelope when it came to evaluating different aspects of performance. The second was a little subtler: I needed the data set to exhibit certain data error and inconsistency characteristics that simulated a real-life scenario.
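The two constraints can be illustrated with a small generator that produces as many records as you ask for and injects defects at a controlled rate. This is only a sketch under stated assumptions: the record layout, the three defect types, and the `dirty` ground-truth flag are hypothetical choices for illustration, not a description of the data set actually used in the test.

```python
import random


def make_records(n: int, error_rate: float = 0.1, seed: int = 42) -> list:
    """Generate n synthetic customer-like records with injected defects."""
    rng = random.Random(seed)  # seeded so every platform loads identical data
    names = ["Alice Smith", "Bob Jones", "Carol Diaz"]
    records = []
    for i in range(n):
        rec = {
            "id": i,
            "name": rng.choice(names),
            "zip": f"{rng.randint(0, 99999):05d}",
            "dirty": False,  # ground-truth flag for scoring cleansing later
        }
        if rng.random() < error_rate:
            defect = rng.choice(["missing", "case", "truncated"])
            if defect == "missing":
                rec["name"] = ""                  # missing value
            elif defect == "case":
                rec["name"] = rec["name"].upper() # inconsistent formatting
            else:
                rec["zip"] = rec["zip"][:4]       # truncated field
            rec["dirty"] = True
        records.append(rec)
    return records


if __name__ == "__main__":
    sample = make_records(1000, error_rate=0.05)
    print(sum(r["dirty"] for r in sample), "of", len(sample), "records are dirty")
```

Fixing the seed matters for a comparison like this one, since each target platform should be loaded with exactly the same records and exactly the same errors.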
Come on, seriously?
Poor data quality and Hurricane Irene
Just to prove to a content aggregator that I am serious about social media, I am putting this code:
inside a blog post.