Sorry for the long hiatus – travel has gotten the best of me. But I have been thinking a lot about an issue and thought it would be worth sharing: predictability of errors.
Filed under: Business Rules, Data Profiling, Data Quality, Metrics
Yesterday our company was approached to provide a proposal for a data quality assessment project as part of a more comprehensive data quality assurance effort. When we get these types of requests, I am always amused by the fact that key pieces of information necessary for determining the amount of work are often missing. We typically ask some basic questions in order to scope the level of effort, including:
• What data sets are to be used as the basis for analysis?
• How many tables?
• How many data elements?
• How many records in each table?
• Are reference data sets available for the common value domains?
• How many business processes source data into the target data set?
• How many processes use the data in the target data set?
• What documentation is available for the data sets and the business processes?
• What tools are in place to analyze the data?
• Will the client provide access to the sources for analysis?
• How is the organization prepared to take actions based on the resultant findings?
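Several of these scoping questions (how many data elements, how many records, what value domains) boil down to basic column profiling. As a rough illustration, here is a minimal profiling sketch in Python; the sample "state" column and its values are invented for the example, not taken from any real engagement.

```python
# Minimal column-profiling sketch: the kind of summary statistics
# (record counts, null rates, distinct values) that feed a
# level-of-effort estimate for a data quality assessment.
from collections import Counter

def profile_column(values):
    """Summarize one data element: counts, nulls, and distinct values."""
    total = len(values)
    nulls = sum(1 for v in values if v is None or v == "")
    distinct = Counter(v for v in values if v not in (None, ""))
    return {
        "records": total,
        "null_count": nulls,
        "null_rate": nulls / total if total else 0.0,
        "distinct_values": len(distinct),
        "most_common": distinct.most_common(3),
    }

# A toy 'state' column with a missing value and a suspicious code
states = ["MD", "VA", "MD", "", "XX", "VA", "MD"]
summary = profile_column(states)
```

Even this much tells you whether reference data sets will be needed for the common value domains: a column with a handful of frequent values and a long tail of rarities usually signals a domain worth validating.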
In general, I like to think that my company is pretty good at doing these types of assessments – of course, I wrote the book (or at least, a book) on the topic ;-).
The other day I had a conversation with a prospective client who mentioned that the company is looking at replacing its key processing system and had been told by one of the potential vendors that the data would have to be cleaned up before it could be migrated into the new system. Intrigued by this comment, this person did a bunch of research about data cleansing and asked me whether it made sense. After a few questions, I learned that the vendor claimed that unless the data were “clean,” the new system would not work right. Of course, this comment piqued my curiosity, since in my opinion, before you “clean” (or rather, in this case, transform/normalize) the data for a target system, don’t you need to know which system you are planning to migrate to? And if they had not yet selected a vendor system, how would they know what they needed to “clean”?
This got me thinking about the link between data migration and data quality. Actually, in a number of client situations, the company is considering a large investment in a new system – a new contract administration system, a new pricing system, a new sales system – requiring a significant $$$ investment. And consequently, in each of these cases, the question of the quality of the legacy data is raised as a technical hurdle that must be jumped as opposed to a key component of making the new system meet the business needs of the organization. So this has triggered a few more questions about system replacement, data migration, and data cleansing:
• What is the intent of the new system?
• What features of the old system were inadequate? How were they related to the quality of the data?
• What are the features of the new system that are expected to alleviate those shortcomings? What are the dependencies on the existing data?
• What other business processes will derive value from the data created or modified within the new system?
• What is the target model? Is metadata available at the data element level?
• Who is assessing the target system data requirements?
• What process is in place for source to target mapping?
• What process is in place for programming the transformations?
• What do you do with data instances that do not transform properly? Is there a remediation process?
• What cleansing needs to be done? Is that different from transformation?
• What processes are in place for validating source data against target model expectations?
• What is the data migration plan?
• Will both systems need to run at the same time until the new system is validated?
Any thoughts of adding to the list? Please feel free to post additional questions by adding a comment…
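Two of the questions above — what do you do with data instances that do not transform properly, and how do you validate source data against target model expectations — can be made concrete with a small sketch. The target-model rules here (a non-empty identifier, a numeric price) are invented examples, not requirements of any particular system.

```python
# Sketch of source-to-target validation during migration: records that
# meet the target model's expectations are transformed and loaded; those
# that do not are routed to a remediation queue rather than silently fixed.

def validate(record):
    """Return the list of rule violations against target expectations."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    try:
        float(record.get("price", ""))
    except (TypeError, ValueError):
        problems.append("non-numeric price")
    return problems

def migrate(records):
    loaded, remediation = [], []
    for rec in records:
        issues = validate(rec)
        if issues:
            remediation.append((rec, issues))   # queued for steward review
        else:
            loaded.append({"id": rec["id"], "price": float(rec["price"])})
    return loaded, remediation

loaded, queue = migrate([
    {"id": "A1", "price": "19.99"},
    {"id": "",   "price": "5.00"},
    {"id": "A2", "price": "n/a"},
])
```

The design point is the remediation queue: it forces an answer to the "what do you do with records that don't transform properly?" question before the migration runs, instead of during it.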
In a number of places I have used the term “data quality control” in the context of a service that inspects the compliance of a data instance (at different levels of granularity, ranging from a data value to an entire table or file, for example) with a defined data quality rule. The objective of the control is to ensure that the expectations of the process consuming the data are validated prior to the exchange of data.
Sometimes when I have used this term, people respond by saying “oh yes, data edits.” But I think there is a difference between a data edit and a control. In most of the cases I have seen, an edit is intended to compare a value to some expectation and then change the value if it doesn’t agree with the intended target’s expectation. This is actually quite different from the control, which generates a notification or other type of event indicating a missed expectation. Another major difference is that the edit introduces inconsistency between the supplier of the data and the consumer of the data, and while the transformation may benefit the consumer in the short run, it may lead to issues at a later point when there is a sudden need for reconciliation between the target and the source.
A different consideration is that data edits are a surreptitious attempt to harmonize data to meet expectations, and they do so without the knowledge of the data supplier. The control at least notifies a data steward that there is a potential discrepancy and allows the data steward to invoke the appropriate policy as well as notify both the supplier and the consumer that some transformation needs to be made (or not).
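The edit-versus-control distinction can be shown in a few lines of code. This is only a sketch under an assumed rule (a known state-code domain); the point is the difference in behavior, not the rule itself.

```python
# Contrast between a data "edit" and a data quality "control":
# the edit silently rewrites a non-conforming value, while the control
# leaves the value alone and emits an event for a data steward.
VALID_STATES = {"MD", "VA", "DC"}

def edit(value, default="MD"):
    """Edit: force the value into the expected domain.
    The source and the consumer now hold different values."""
    return value if value in VALID_STATES else default

def control(value, events):
    """Control: validate only; record the missed expectation, keep the value."""
    if value not in VALID_STATES:
        events.append({"rule": "state in domain", "value": value})
    return value

events = []
edited = edit("XX")               # value silently changed
checked = control("XX", events)   # value unchanged, event raised
```

Note that the edit leaves no trace of the original value, which is exactly what makes later reconciliation between source and target painful; the control's event log is that trace.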
One other question often crops up about controls: how do they affect performance? This is a much different question, and in fact a recent client mentioned to me that they used to have controls in place, but the controls slowed the processing down and so they were removed. This means it boils down to a decision about what is more important: ensuring high quality data or ensuring high throughput. The corresponding questions to ask basically center on the cumulative business impacts. Processing-time service level agreements may be contractually imposed (with corresponding penalties for missing the SLA), which may even suggest that allowing some errors through and patching the process downstream is less impactful than incurring the SLA non-observance penalty costs.
In a recent discussion with a client, I was told about a situation in which automated data corrections flip-flop. One day a record is identified as having an error (as part of an identity resolution process); the matching records are compared, and a survival rule is applied that essentially deletes the old record and creates a new record. The next day, the new record is determined to be in error, again as part of a matching process, and a different survival rule is applied that, for all intents and purposes, reverts the record back to its original form.
This has become commonplace in the organization. So much so that the staff are already aware of these repeat offenders and can track how many corrections are being made for the first time and how many have been made before.
One might call the automation into question – how can it continue to go back and forth like that every day? I think there is a deeper issue involved having to do with the way the data is collected. For some reason a correction rule is triggered by some set of value combinations, but the rule-based correction has not been properly vetted. The result is that the corrected version still does not comply with some set of expectations.
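The flip-flop pattern described above can be surfaced mechanically by keeping a correction history per record and checking for oscillation. This is a minimal sketch with hypothetical record keys and states; a real implementation would live in the correction pipeline itself.

```python
# Sketch of tracking repeated corrections to surface "flip-flopping"
# records: if survival rules keep bouncing a record between the same
# two states, the history shows it, flagging the rules for re-vetting.
from collections import defaultdict

history = defaultdict(list)   # record key -> sequence of applied states

def apply_correction(key, new_state):
    history[key].append(new_state)

def flip_flopping(key):
    """True when the record oscillates between two states: A, B, A, ..."""
    states = history[key]
    return (len(states) >= 3
            and states[-1] == states[-3]
            and states[-1] != states[-2])

apply_correction("cust-42", "merged")
apply_correction("cust-42", "reverted")
apply_correction("cust-42", "merged")
```

A report built on such a history distinguishes first-time corrections from repeat offenders, which is exactly the metric the client's staff were already tracking by hand.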
Recognition of repetitive correction indicates opportunities for increasing the levels of maturity for data quality management. Relying on automation is good, but less so if checks and balances are not in place to validate the applied rules.