This morning I was looking at a spreadsheet documenting data validation scores for a number of data sets at a particular client. The report provided basic measures of quality based on validity and completeness rules applied to a variety of largely location-oriented data elements. What I found interesting was that the coding formula for the error codes incorporated a degree of ambiguity.
I have just finished a paper sponsored by Informatica titled “Understanding the Financial Value of Data Quality Improvement,” which looks at bridging the communications gap between technologists and business people regarding the value of data quality improvements. Here is a summary:
As opposed to the technical aspects of data validation and cleansing, often the biggest challenge in beginning a data quality program is effectively communicating the business value of data quality improvement. But using a well-defined process for considering the different types of costs and risks of low-quality data not only provides a framework for putting data quality expectations into a business context, it also enables the definition of clear metrics linking data quality to business performance. For example, it is easy to speculate that data errors impede up-selling and cross-selling, but to really justify the need for a data quality improvement effort, a more comprehensive quantification of the number of sales impacted or of the total dollar amount for the missed opportunity can be much more effective at showing the value gap.
This article looks at different classifications of financial impacts and corresponding performance measures that enables a process for evaluating the relationship between acceptable performance and quality information. This article is targeted to analysts looking to connect high quality information and optimal business performance to make a quantifiable case for data quality improvement.
After having led or participated in a number of data quality assessments, I continue to think about good ways to present results of the analysis that convey both the severity of speciifc issues while simultaneously allowing the reader to compare the different issues. I will admit that I am not a “visualization” person, nor do I advocate creating dashboards and scorecards as the end product of a data quality activity. Rather, the scorecard is the means to an end, which is the prioritzation of the issues so that most effective use of resources can get the maximum benefit.
That being said, I do think that radar charts are one good visualization paradigm. A radar chart allows you to map multiple variable in a 2-dimensional view that conveys comparative information. Here is an example:
This example portrays the measures of severity for four different value driver areas for a single data quality issue. By looking at this graph, you can quickly see that incomplete dates have a high financial impact, but relatively low risk and productivity impacts. I am still experimenting with these types of images, and tinkering with excel to figure out how to get multiple axes represented in a single graph so that I can overlay the impact dimension with a “remediation suitability” dimension that presents the time to value, cost to resolve, and staff effort. Together that would provide a summary of the severity of the issue and the feasibility of its resolution. If you have some suggestions, let me know, and when I figure it out I will post a follow up.