Questions About the “Cost of Poor Data Quality”
I was reading through one of Jim Harris’s blog entries about his reinterpretation of Pascal’s wager in terms of data quality, and the posting made reference to an email he had received from Gordon Hamilton about the estimated costs of poor information quality. The posting also incorporated Richard Ordowich’s comments from the LinkedIn group regarding the claim that 15-45% of the operating expense of virtually all organizations is WASTED due to data quality issues, and I thought I’d spend a little bit of time investigating the origins of some of these numbers.
In particular, it seemed worth tracking down the sources for some of the more popular claims about the costs of poor data quality.
First off, I came across a posting referring to Larry English’s recent book that included the quote “Poor information quality costs organizations 20-35% of operating revenue wasted in recovery from process failure and information scrap and rework.”
Actually, additional context is provided in this quote, but the emphasis is mine: “In the Costs of Poor Quality Information analyses that we have conducted, combined with the anecdotal evidence we have collected over the past twenty years, the evidence is clear. The Costs of Poor Quality Information as a percent of operating revenue or budget (for government and not-for-profit) is roughly equivalent to the costs of poor quality in the manufacturing and service sectors.”
In 2003, TDWI produced an often-quoted report estimating “that data quality problems cost U.S. businesses more than $600 billion a year.” When you read the report and look at the footnotes, though, you see them qualify that statement: “TDWI estimate based on cost-savings cited by survey respondents and others who have cleaned up name and address data, combined with Dunn (sic) & Bradstreet counts of U.S. businesses by number of employees.” So in fact, that figure is really an extrapolation from survey respondents, who may be “self-selected” to some extent; it might be interesting to go back and review the survey responses.
Further back in time (2002), we see Tom Redman claim that “Poor data quality costs the typical company up to twenty percent of revenue.”
Earlier, in 1999, Larry English’s previous book suggests that the costs are actually lower: “Based on numerous cost analyses, the typical organization may see from 15 to 25 percent of its revenue go to pay the costs of information scrap and rework.”
However, in an article published a year before that book was released, we see his claim that “If early data assessments are an indicator, the business costs of non-quality data, including non-recoverable costs, rework of products and services, workarounds, lost and missed revenue, may be as high as 10-20 percent of revenue or total budget of an organization.”
And Tom Redman’s 1998 article in Communications of the ACM comments that his “article would be enhanced with an estimate of the total cost of poor data quality, but studies to produce such estimates have proven difficult to perform.” However, he then notes that he is “aware of three proprietary studies that yielded estimates in the 8–12% of revenue range.”
So where are we? Here are some additional qualifying notes. First, I am a firm believer that the value gap attributable to poor data quality is real and can be estimated; see my series of articles, and here is a link to one specific paper. If effort is invested in understanding where value is impacted as a result of data issues, you can estimate the value of improvements.
Second, I do want to point out that I am not biasing my research here; I am quoting directly from published sources. Third, there are a lot of papers and articles (including my own) with less-refined methods that suggest scenarios in which one could estimate a cost but do not show hard numbers; I am definitely open to notes regarding actual costs, savings, and added value. Fourth, there are some more academic attempts to collect the various theories and provide a unified approach to estimating costs, if you are willing to invest the time to read through them.
From the sources I found, all written by people reputed to be experts in the data quality space, we can conclude the following:
1) There are few (if any) published papers on actual case studies providing tangible details about the cost of poor data quality.
2) The academic notes and books that do exist and attempt to quantify the costs of poor data quality base their numbers on estimates, “proprietary studies,” accumulations from survey responses, or extrapolation from other estimates of the “cost of quality.”
3) Even in the absence of tangible evidence of actual costs, according to the experts willing to state cost estimates, the costs seem to be rising, from a low of 8% of revenue in 1998 to 35% of operating revenue in 2009 (does that mean that the costs of poor data quality more than quadrupled over roughly a decade?).
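As a quick sanity check on that last question, the back-of-the-envelope sketch below (my own illustration, not drawn from any of the cited sources) compares the low 1998 estimate to the high 2009 estimate and computes the annual growth rate such a trend would imply:

```python
# Back-of-the-envelope check of the implied growth in published
# cost-of-poor-data-quality estimates (illustrative only).
low_1998 = 0.08   # 8% of revenue (Redman, 1998)
high_2009 = 0.35  # 35% of operating revenue (English, 2009)
years = 2009 - 1998  # eleven years, slightly more than a "ten year period"

# Relative increase from the low estimate to the high estimate
relative_increase = (high_2009 - low_1998) / low_1998

# Compound annual growth rate the two endpoints would imply
annual_rate = (high_2009 / low_1998) ** (1 / years) - 1

print(f"Relative increase: {relative_increase:.1%}")
print(f"Implied annual growth rate: {annual_rate:.1%}")
```

Taken literally, the endpoints imply the cost estimates more than quadrupled (a roughly 340% relative increase), which of course says more about the looseness of the estimates than about any real trend.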
As I mentioned before, I am very open to suggestions about actual case studies or reports that provide researchable numbers (that is, numbers that are published and can be reviewed) for evaluating the costs of poor data quality. Access to these types of articles and reports will enable people like me to refine our approaches to evaluating the value of data quality improvement and to develop a model that defines a clear return on your data quality investment.