Data Governance and Quality: Data Reuse vs. Data Repurposing
I have been assembling a slide deck for an upcoming TDWI web seminar on Strategic Planning and the World of Big Data, and I am finding that I sometimes use two different terms (“data reuse” and “data repurposing,” in case you ignored the title of this post) interchangeably when in fact those two terms could have slightly different meanings or intents. So should I be cavalier and use them as synonyms?
When I thought about it, I did see some clarity in differentiating the definitions:
- “data reuse” means taking a data asset and using it more than once for the same purpose.
- “data repurposing” means taking a data asset previously used for one (or more) specific purpose(s) and using that data set for a completely different purpose.
For example, if we have an application that uses the customer database to generate address labels for a marketing campaign mailing this morning, and then later in the day we use the same customer database to generate address labels for a second marketing campaign for the afternoon mail pickup, I would call that “reuse.” On the other hand, taking that same customer data set and combining it with sales transactions from the last month to classify customers by transaction and sales volume as part of an overall profiling algorithm would be an example of taking the same data but using it for a different purpose, which I would call “repurposing.”
The question boils down to the governance aspects of assessing data quality requirements. For multiple instances of reuse, are all the quality expectations going to be identical? Alternatively, when a data set is repurposed, whose responsibility is it to document data quality rules and acceptability thresholds, as well as to integrate validation of the data into upstream processes?
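To make "data quality rules and acceptability thresholds" a bit more concrete, here is a minimal Python sketch of what documenting them per purpose might look like. Everything in it is an assumption made for illustration: the `QualityRule` class, the `assess` function, the field names, the sample records, and the threshold values are all hypothetical and do not come from any particular tool or from a real customer database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class QualityRule:
    """One documented rule: a per-record check plus an acceptability threshold."""
    name: str
    check: Callable[[Record], bool]  # returns True when a record passes
    threshold: float                 # minimum acceptable pass rate, 0.0 to 1.0


def pass_rate(records: List[Record], rule: QualityRule) -> float:
    """Fraction of records that satisfy the rule (an empty set counts as passing)."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule.check(r)) / len(records)


def assess(records: List[Record], rules: List[QualityRule]) -> Dict[str, bool]:
    """For each rule, report whether the data set meets its threshold."""
    return {rule.name: pass_rate(records, rule) >= rule.threshold for rule in rules}


# Hypothetical customer data set, shared by both purposes.
customers = [
    {"id": "1", "name": "Ann", "postal_code": "20901", "email": "ann@example.com"},
    {"id": "2", "name": "Bob", "postal_code": "", "email": "bob@example.com"},
    {"id": "3", "name": "Cy", "postal_code": "10001", "email": ""},
    {"id": "4", "name": "Di", "postal_code": "94105", "email": "di@example.com"},
]

# Reuse: both mailings share one purpose, so they can share one rule set.
mailing_rules = [
    QualityRule("postal_code_present", lambda r: bool(r["postal_code"]), 0.75),
]

# Repurposing: profiling needs different fields and stricter thresholds.
profiling_rules = [
    QualityRule("id_present", lambda r: bool(r["id"]), 1.0),
    QualityRule("email_present", lambda r: bool(r["email"]), 0.9),
]

mailing_result = assess(customers, mailing_rules)
profiling_result = assess(customers, profiling_rules)
print("mailing:", mailing_result)
print("profiling:", profiling_result)
```

In this made-up example the same four records are acceptable for the mailing but fall short for profiling, because the repurposed use brings its own rule set. Who owns and maintains that second rule set is exactly the governance question.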
And even more of an issue: what does one do if the repurposing is very far from origination? If we grab a data set from a public web site that has been through a number of transformations, the information in the data set may be subject to very different interpretations than when the data instances in the sources were originally created. That makes the problem even more difficult: are we allowed to modify (AKA “correct”) data values that don’t meet our needs? Or are we constrained to use the data set as is because corrections alter the data, potentially affecting its repurposability (I think that is a new word I just invented)?
In either case, providing a definition for both terms distinguishes the usage scenarios, and at the very least allows me to use both terms in the same blog entry or presentation slide.