Enhancing Data Profiling with “Sub-Profile”

November 22, 2010
Here is a scenario that I often hit while doing a data quality assessment using profiling: I have reviewed a column’s set of values, identified a specific outlier value, and have drilled through to select out all of the records that share that specific value. Now what?

Typically, the idea is to review those records to see if there are any invalidities or weirdnesses that might have contributed to the potential anomaly. However, in big data sets, the resulting drill-through record set may *still* be too big for an “eyeball-review.”

So this led to today’s idea: sub-profiling, which is subjecting the records in the drill-through record set to a subsequent round of profiling so that any root causes existing in those records might be brought to the forefront. Value frequencies that might get lost in the pack in a profile of the entire table may all of a sudden become obvious in the sub-profile.
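
To make the idea concrete, here is a minimal sketch in Python. It is only an illustration, not a description of how any profiling tool works: the function names `profile` and `sub_profile`, the column names, and the sample records are all made up, and "profiling" is reduced here to simple value-frequency counts per column.

```python
from collections import Counter


def profile(records, columns):
    """Return a value-frequency count for each column (a very reduced 'profile')."""
    return {col: Counter(rec[col] for rec in records) for col in columns}


def sub_profile(records, columns, column, value):
    """Drill through to the records where `column` equals `value`,
    then run the same profiling step on just those records."""
    subset = [rec for rec in records if rec[column] == value]
    return profile(subset, columns)


# Hypothetical sample data: "XX" is the outlier value in the "state" column.
records = [
    {"state": "NY", "source": "web", "zip": "10001"},
    {"state": "NY", "source": "web", "zip": "10002"},
    {"state": "XX", "source": "batch", "zip": "00000"},
    {"state": "XX", "source": "batch", "zip": "00000"},
    {"state": "CA", "source": "web", "zip": "94105"},
    {"state": "CA", "source": "batch", "zip": "94110"},
]
columns = ["state", "source", "zip"]

full = profile(records, columns)
sub = sub_profile(records, columns, "state", "XX")

print("full table, source:", dict(full["source"]))
print("sub-profile, source:", dict(sub["source"]))
print("sub-profile, zip:", dict(sub["zip"]))
```

In the full-table profile of this made-up data, "batch" is just one of two equally common values in the source column. In the sub-profile of the outlier records, every record comes from "batch" and carries the same placeholder ZIP code, which is exactly the kind of pattern that would be worth chasing down as a possible root cause.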

I once asked a vendor if their tool did this. He cautiously said “yes,” but he meant this: “Yes, but only if you run the profile, drill down, save the result set in another file, load the file back into the profiler as a table, and then profile that table.” True, that is doable, but not as efficient as my idea.

