Hierarchy Data Completeness and Semantic Convergence

Yesterday, Henrik Liliendahl Sørensen posted an interesting entry about data profiling and the quality and completeness of the hierarchies associated with the value domains behind a data element’s populated values. I’d like to jam along with that concept, drawing on a conversation I had the other day that was essentially about capturing and tracking spend data; the specific context was capturing and reporting the aggregate physician payments made by a pharmaceutical company (or other covered manufacturer) to specific practitioners.

Data Quality Profiling and Assessment – Some Questions for the Client

March 18, 2011
Filed under: Business Rules, Data Profiling, Data Quality, Metrics 

Yesterday our company was approached to provide a proposal for a data quality assessment project as part of a more comprehensive data quality assurance effort. When we get these types of requests, I am always amused by the fact that key pieces of information necessary for determining the amount of work are missing. We typically have some basic questions in order to scope the level of effort, including:

• What data sets are to be used as the basis for analysis?
• How many tables?
• How many data elements?
• How many records in each table?
• Are reference data sets available for the common value domains?
• How many business processes source data into the target data set?
• How many processes use the data in the target data set?
• What documentation is available for the data sets and the business processes?
• What tools are in place to analyze the data?
• Will the client provide access to the sources for analysis?
• How is the organization prepared to take actions based on the resultant findings?

In general, I like to think that my company is pretty good at doing these types of assessments – of course, I wrote the book (or at least, a book) on the topic ;-).

Upcoming Webinar (2011-02-09) – Visualizing the Power of Your Data

February 3, 2011
Filed under: Data Governance, Events 

I will be the invited speaker at an exciting upcoming web seminar on Feb 9, 2011 at 1:00PM EST on the topic of “Visualizing the Power of Your Data,” which is sponsored by CA Technologies. It should be an interesting discussion, with Donna Burbank from CA speaking as well. Click here to learn more and register.

Enhancing Data Profiling with “Sub-Profile”

November 22, 2010
Filed under: Data Governance, Data Profiling, Data Quality 

Here is a scenario that I often hit while doing a data quality assessment using profiling: I have reviewed a column’s set of values, identified a specific outlier value, and have drilled through to select out all of the records that share that specific value. Now what?

Typically, the idea is to review those records to see if there are any invalidities or weirdnesses that might have contributed to the potential anomaly. However, in big data sets, the resulting drill-through record set may *still* be too big for an “eyeball-review.”

So this led to today’s idea: sub-profiling, which is subjecting the records in the drill-through record set to a subsequent round of profiling so that any root causes existing in those records might be brought to the forefront. Value frequencies that might get lost in the pack in a profile of the entire table may all of a sudden become obvious in the sub-profile.

I once asked a vendor if their tool did this. He cautiously said “yes,” but he meant this: “Yes, but only if you run the profile, drill down, save the result set in another file, load the file back into the profiler as a table, and then profile that table.” True, that is doable, but not as efficient as my idea.
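To make the sub-profiling idea concrete, here is a minimal sketch in plain Python. It is not how any particular profiling tool works; the function names (`profile`, `drill_through`, `sub_profile`) and the sample records are hypothetical, made up purely for illustration. The point is that the drill-through set is profiled in place, without the save-and-reload round trip.

```python
from collections import Counter

def profile(records, columns=None):
    """Return a value-frequency Counter for each column of a list of dict records."""
    if columns is None:
        columns = sorted({key for record in records for key in record})
    return {col: Counter(record.get(col) for record in records) for col in columns}

def drill_through(records, column, value):
    """Select the records that share one specific value in one column."""
    return [record for record in records if record.get(column) == value]

def sub_profile(records, column, value):
    """Profile only the drill-through set, skipping the column we drilled on."""
    subset = drill_through(records, column, value)
    other_columns = sorted({key for record in subset for key in record if key != column})
    return profile(subset, other_columns)

# Hypothetical sample data: "XX" is the outlier value in the "state" column.
records = [
    {"state": "NY", "source": "web", "zip": "10001"},
    {"state": "NY", "source": "web", "zip": "10002"},
    {"state": "XX", "source": "batch_7", "zip": "00000"},
    {"state": "XX", "source": "batch_7", "zip": "00000"},
    {"state": "MD", "source": "web", "zip": "20901"},
]

full = profile(records)                      # profile of the entire table
sub = sub_profile(records, "state", "XX")    # profile of the outlier records only
print(sub["source"].most_common(1))
```

In the full profile, the `source` value `batch_7` is just one of several values; in the sub-profile it accounts for every outlier record, which is the kind of pattern that hints at a root cause.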

Have additional thoughts? Contact me and I will post your great ideas also!

Find and Fix? A Question About Data Quality Metrics

Data profiling can be an excellent approach to identifying latent issues and errors hidden in your data. We have seen a number of clients using data profiling as the first step in defining data quality metrics and using those metrics for reporting via scorecards and dashboards.

And if I can identify a problem and I can define a rule for determining that the problem exists, should I not be able to fix the problem? Here is a question, though: once I fix the root cause of the problem, do I still need to keep checking whether the problem has occurred?
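One way to picture the question: a rule that detects a problem can be written once as a predicate and scored against any batch of records, before or after a fix. The sketch below assumes a hypothetical rule (`zip_is_five_digits`) and hypothetical sample batches. It does not settle whether continued checking is worth the effort; it only shows what re-running the same metric looks like.

```python
def conformance(records, rule):
    """Fraction of records that satisfy the rule (1.0 for an empty batch)."""
    if not records:
        return 1.0
    passed = sum(1 for record in records if rule(record))
    return passed / len(records)

def zip_is_five_digits(record):
    """Hypothetical data quality rule: zip must be exactly five ASCII digits."""
    value = record.get("zip") or ""
    return len(value) == 5 and value.isascii() and value.isdigit()

# Hypothetical batches loaded before and after the root cause was fixed.
before_fix = [{"zip": "10001"}, {"zip": "1000"}, {"zip": None}, {"zip": "20901"}]
after_fix = [{"zip": "10001"}, {"zip": "20901"}]

print(conformance(before_fix, zip_is_five_digits))
print(conformance(after_fix, zip_is_five_digits))
```

If the score stays at 1.0 batch after batch, that is evidence the fix held; a later drop would suggest the problem has come back through some other path.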

More on this in an upcoming post; contact me if you have thoughts…