Find and Fix? A Question About Data Quality Metrics

Data profiling can be an excellent approach to identifying latent issues and errors hidden in your data. We have seen a number of clients use data profiling as the first step in defining data quality metrics and then report on those metrics via scorecards and dashboards.

And if I can identify a problem and define a rule for determining that the problem exists, should I not be able to fix the problem? Here is a question, though: once I fix the root cause of the problem, do I still need to keep checking whether it has occurred?

More on this in an upcoming post; contact me if you have thoughts…

Business Rules and Data Quality

November 5, 2010
Filed under: Business Rules, Data Quality, Metadata, Metrics

There are many different dimensions of data quality that can be “configured” to measure and monitor compliance with data consumer expectations. We could classify a subset of the data quality dimensions that can be mapped to assertions at different levels of data precision, such as the following (a short code sketch after the list illustrates a few of them):

  • Data value, in which a rule is used to validate a specific value. An example is a format specification for any ZIP code (no matter which data element is storing it) that says the value must be a character string that has 5 digits, a hyphen, then 4 digits.
  • Data element, in which a value is validated in the context of the assignment of a value domain to a data element. An example is an assertion that the value of the SEX field must be either M or F.
  • Record, in which the assertion refers to more than one data element within a record. An example would specify that the START_DATE must be earlier in time than the END_DATE.
  • Column, which is some qualitative measure of the collection of values in one column. An example would assert that no value appears more than 5% of the time across the entire column.
  • Table, which measures compliance over a collection of records. An example is a rule that says the table’s percentage of valid records must be greater than 85%.
  • Cross-table, which looks at the relationships across tables. An example could specify that there is a one-to-one relationship between each customer record and its primary address record.
  • Aggregate, which provides rules about aggregate functions. An example would apply a validation rule to averages and sums calculated in business intelligence reports.
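
To make these levels concrete, here is a minimal Python sketch (my own illustration, with hypothetical function names) of how a few of the example assertions above might be expressed as executable checks, using the SEX, START_DATE, and END_DATE fields from the examples:

```python
import re
from datetime import date

# Data value rule: ZIP+4 format -- 5 digits, a hyphen, then 4 digits,
# regardless of which data element stores the value.
ZIP_PLUS_4 = re.compile(r"^\d{5}-\d{4}$")

def valid_zip_plus_4(value: str) -> bool:
    return bool(ZIP_PLUS_4.match(value))

# Data element rule: the SEX field must draw its value from an assigned domain.
SEX_DOMAIN = {"M", "F"}

def valid_sex(record: dict) -> bool:
    return record.get("SEX") in SEX_DOMAIN

# Record rule: START_DATE must be earlier in time than END_DATE.
def valid_date_order(record: dict) -> bool:
    return record["START_DATE"] < record["END_DATE"]

# Table rule: the percentage of valid records must be greater than 85%.
def table_meets_threshold(records: list, threshold: float = 0.85) -> bool:
    valid = sum(1 for r in records if valid_sex(r) and valid_date_order(r))
    return valid / len(records) > threshold

records = [
    {"SEX": "F", "START_DATE": date(2010, 1, 1), "END_DATE": date(2010, 6, 30)},
    {"SEX": "X", "START_DATE": date(2010, 7, 1), "END_DATE": date(2010, 3, 1)},
]
print(valid_zip_plus_4("90210-1234"))   # True
print(table_meets_threshold(records))   # False: only 1 of 2 records is valid
```

The column, cross-table, and aggregate levels follow the same pattern, but operate over entire columns, relationships across tables, or computed results rather than individual values or records.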

I have been thinking about ways to map these rules to metadata concepts, to understand how a services model could be implemented and invoked at different points within the information production flow. For example, you could validate data values as they are created, but you would have to wait until many records have accumulated to validate a table rule. This suggests that value rules can be mapped to value domains, while table rules are mapped to entities. As this mapping gets fleshed out, I will begin to assemble a service model for data validation that ultimately links through the metadata to the original definitions associated with business policies. Given that model, we can spec out an operational governance framework to manage data quality as it pertains to those business policies.
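
As a way of thinking through that mapping, here is a speculative sketch (my own construction, not a finished service model) in which each rule carries a binding to a metadata concept, and that binding determines where in the information production flow a validation service would invoke it:

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable

# Hypothetical sketch: each rule is bound to a metadata concept, which
# determines where in the information production flow it can be evaluated.

class MetadataConcept(Enum):
    VALUE_DOMAIN = "value domain"   # value rules: checkable as each value is created
    DATA_ELEMENT = "data element"
    ENTITY = "entity"               # table rules: need a full collection of records

@dataclass
class ValidationRule:
    name: str
    binds_to: MetadataConcept
    check: Callable                 # the executable assertion
    business_policy: str            # link back through metadata to the policy definition

rules = [
    ValidationRule(
        name="ZIP+4 format",
        binds_to=MetadataConcept.VALUE_DOMAIN,
        check=lambda value: bool(re.match(r"^\d{5}-\d{4}$", value)),
        business_policy="postal address standardization",
    ),
    ValidationRule(
        name="valid-record percentage above 85%",
        binds_to=MetadataConcept.ENTITY,
        check=lambda valid_count, total: valid_count / total > 0.85,
        business_policy="customer record completeness",
    ),
]

# A validation service could then be invoked at the point appropriate to each binding.
for rule in rules:
    print(f"{rule.name}: invoke at the {rule.binds_to.value} level "
          f"(policy: {rule.business_policy})")
```

The business_policy attribute is the placeholder for the link back through the metadata to the original policy definitions mentioned above.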
