Hierarchy Data Completeness and Semantic Convergence

Yesterday, Henrik Liliendahl Sørensen posted an interesting entry about data profiling and the quality and completeness of the hierarchies associated with the domain values that populate a given data element. I’d like to jam along with that concept with respect to a conversation I had the other day about capturing and tracking spend data. The specific context was capturing and reporting the aggregate payments made by a pharmaceutical company (or other covered manufacturer) to specific practitioners.

The person I was talking to proposed using business rules within a data profiling tool to analyze the collected payments, aggregate them by type, and then generate reports from the tool. (Note: payments over $10 need to be reported.) While I did not disagree, I did suggest that the problem was much more complex than that (this is all based on the research I did for my DataFlux paper on the topic). The problem, I told him, was that the individuals tasked with documenting the various types of physician payments (typically sales and marketing people) had probably not committed to using a standardized hierarchy for documentation. In other words, what one representative refers to as “breakfast” might be called “bagels” by another and “brunch” by yet a third.

At the same time, there are different representations for the practitioner as well. For example, one representative might refer to the practitioner by her name, another by the name of the practice, and a third might use an abbreviated version. I have seen data sets in which two practitioners are not connected except for the fact that they share a telephone number. Would those two be considered a practice?

Because the conceptual and semantic terminology still needs to be resolved, even standard data quality techniques (approximate matching, for example) leave you exposed to noncompliance when summed values of similar (or the same) “exchanges of value” are not properly rolled up, either by item or by practitioner.
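To make the roll-up problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the synonym map, the practitioner alias map, the sample payments, and the function name `roll_up` are all hypothetical, and I am assuming the $10 threshold is applied to the rolled-up total rather than to each line item. In practice, building and agreeing on those two maps is the hard part.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical synonym map: free-text label -> canonical category.
CATEGORY_SYNONYMS = {
    "breakfast": "meals",
    "bagels": "meals",
    "brunch": "meals",
}

# Hypothetical alias map: practitioner as entered -> canonical identifier.
PRACTITIONER_ALIASES = {
    "dr. jane smith": "P001",
    "smith family practice": "P001",
    "j. smith": "P001",
}

# Threshold taken from the note in the post; applying it to the
# rolled-up total is an assumption made for this sketch.
REPORTING_THRESHOLD = Decimal("10.00")


def roll_up(payments):
    """Sum payments by (canonical practitioner, canonical category).

    Records that cannot be resolved on either dimension are returned
    separately so they can be reviewed instead of silently dropped.
    """
    totals = defaultdict(Decimal)
    unresolved = []
    for practitioner, label, amount in payments:
        pid = PRACTITIONER_ALIASES.get(practitioner.strip().lower())
        category = CATEGORY_SYNONYMS.get(label.strip().lower())
        if pid is None or category is None:
            unresolved.append((practitioner, label, amount))
            continue
        totals[(pid, category)] += amount
    return dict(totals), unresolved


# Made-up sample data: three reps describe the same kind of payment
# to the same practitioner in three different ways.
payments = [
    ("Dr. Jane Smith", "breakfast", Decimal("4.50")),
    ("Smith Family Practice", "Bagels", Decimal("3.75")),
    ("J. Smith", "brunch", Decimal("4.25")),
    ("Dr. Alan Jones", "lunch", Decimal("8.00")),
]

totals, unresolved = roll_up(payments)
reportable = {key: total for key, total in totals.items()
              if total > REPORTING_THRESHOLD}
```

None of the three Smith payments exceeds $10 on its own, but once the labels and names are resolved they sum to $12.50 and land in `reportable`. The Jones record ends up in `unresolved` because neither map knows about it, which is exactly the kind of gap a hierarchy completeness check should surface.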

I agree with Henrik that there is a need for hierarchy completeness, but you also need to introduce a semantic analysis flavor into that stew to make sure that what you end up with satisfies the specific needs of the end consumers. A few days’ worth of bagels might add up.

