Data Quality Profiling and Assessment – Some Questions for the Client

March 18, 2011
Filed under: Business Rules, Data Profiling, Data Quality, Metrics 

Yesterday our company was approached to provide a proposal for a data quality assessment project as part of a more comprehensive data quality assurance effort. When we get these types of requests, I am always amused by the fact that the key pieces of information necessary for determining the amount of work are missing. We typically ask some basic questions in order to scope the level of effort, including the following (a quick profiling sketch that addresses several of them appears after the list):

• What data sets are to be used as the basis for analysis?
• How many tables?
• How many data elements?
• How many records in each table?
• Are reference data sets available for the common value domains?
• How many business processes source data into the target data set?
• How many processes use the data in the target data set?
• What documentation is available for the data sets and the business processes?
• What tools are in place to analyze the data?
• Will the client provide access to the sources for analysis?
• How is the organization prepared to take actions based on the resultant findings?
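
A quick way to get answers for several of these questions is to run a basic profile against each candidate table. The sketch below is a minimal illustration (it assumes pandas and uses hypothetical file names), not a substitute for a full profiling tool:

```python
# Minimal profiling sketch: record counts, distinct values, and completeness
# per data element. File names are hypothetical placeholders.
import pandas as pd

def profile_table(path: str) -> pd.DataFrame:
    """Summarize one table: record count, distinct values, percent populated per column."""
    df = pd.read_csv(path)
    return pd.DataFrame({
        "records": len(df),                              # broadcast to every column
        "distinct_values": df.nunique(),
        "percent_populated": (1 - df.isna().mean()) * 100,
    })

if __name__ == "__main__":
    for table in ["customers.csv", "addresses.csv"]:     # hypothetical data sets
        print(table)
        print(profile_table(table))
```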

In general, I like to think that my company is pretty good at doing these types of assessments – of course, I wrote the book (or at least, a book) on the topic ;-).

Data Validation, Data Validity Codes, and Mutual Exclusion

March 11, 2011
Filed under: Data Analysis, Data Governance, Data Quality 

This morning I was looking at a spreadsheet documenting data validation scores for a number of data sets at a particular client. The report provided basic measures of quality based on validity and completeness rules applied to a variety of largely location-oriented data elements. What I found interesting was that the coding formula for the error codes incorporated a degree of ambiguity.
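
Since the title refers to mutual exclusion, a plausible reading (my assumption, since the full post is not reproduced here) is that a single value could satisfy more than one error condition and so be assigned more than one code. A minimal sketch of a coding scheme that keeps the codes mutually exclusive by checking conditions in a fixed priority order, with hypothetical codes and rules:

```python
# Sketch: assign exactly one validity code per value by checking conditions in
# a fixed priority order. Codes and rules are hypothetical illustrations.
import re

def code_zip(value) -> str:
    """Return exactly one of MISSING, BAD_FORMAT, or VALID for a ZIP+4 value."""
    if value is None or str(value).strip() == "":
        return "MISSING"        # checked first, so a blank is never also BAD_FORMAT
    if not re.fullmatch(r"\d{5}-\d{4}", str(value)):
        return "BAD_FORMAT"
    return "VALID"

print([code_zip(v) for v in [None, "1234", "12345-6789"]])
# ['MISSING', 'BAD_FORMAT', 'VALID']
```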

Variant Approaches to Identity Resolution and Record Matching

January 18, 2011
Filed under: Data Analysis, Identity Resolution 

I had a set of discussions recently with representatives of different business functions and found an interesting phenomenon: although folks from almost every area of the business indicated a need for some degree of identity resolution and matching, there were different requirements, expectations, processes, and even tools/techniques in place. In some cases it seems that each group’s matching algorithm refers to different data elements, uses different scoring weights, applies different thresholds, and follows different processes for manual review of questionable matches. Altogether, the result is inconsistency in matching precision.

And it is reasonable for different business functions to have different levels of precision for matching. You don’t need as strict a set of scoring thresholds for matching individuals for the purpose of marketing as you might for assuring customer privacy. But when different tools and methods are used, there is bound to be duplicative work in implementing and managing the different matching processes and rules.
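
To make the inconsistency concrete, here is a minimal sketch of threshold-based matching over a weighted field-similarity score. The fields, weights, and thresholds are hypothetical, and real identity resolution tools use much richer comparison functions; the point is only that two business functions applying different thresholds to the same score will disagree about which pairs match:

```python
# Minimal weighted-scoring match sketch. Field names, weights, and thresholds
# are hypothetical illustrations, not any particular tool's configuration.
from difflib import SequenceMatcher

WEIGHTS = {"last_name": 0.4, "first_name": 0.2, "zip": 0.2, "birth_date": 0.2}

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec1: dict, rec2: dict) -> float:
    """Weighted sum of per-field similarities, ranging from 0.0 to 1.0."""
    return sum(w * similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

MARKETING_THRESHOLD = 0.80   # looser: an occasional false merge is tolerable
PRIVACY_THRESHOLD = 0.95     # stricter: a false merge risks exposing the wrong person

r1 = {"last_name": "Smith", "first_name": "Jon", "zip": "10001", "birth_date": "1970-01-01"}
r2 = {"last_name": "Smyth", "first_name": "Jonathan", "zip": "10001", "birth_date": "1970-01-01"}
score = match_score(r1, r2)
print(round(score, 2), score >= MARKETING_THRESHOLD, score >= PRIVACY_THRESHOLD)
```

With these particular numbers, the pair matches for marketing purposes but not under the stricter privacy threshold, and each group maintaining its own version of weights and thresholds like these is exactly the duplicative work in question.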

To address this, it might be worth considering whether the existing approaches serve the organization in the most appropriate way. This involves performing at least these steps:

1) Document the current state of matching/identity resolution
2) Profile the data sets to determine the best data attributes for matching
3) Document each business process’s matching requirements
4) Evaluate the existing solutions and determine whether the current situation is acceptable or whether there is an opportunity to standardize on one specific approach across the organization

Data Quality and Transitions in the Customer Life Cycle

November 30, 2010
Filed under: Business Impacts, Data Analysis, Data Quality, Performance Measures 

I have been doing some further research into the interdependence of business value drivers, their related data sets, and the corresponding financial impacts. One area of focus is customer retention, and I have been looking at a number of related performance measures. The one I would like to look at today involves maintaining the relationship with the customer at various points across the customer life cycle.

There are two aspects to the concept of the customer life cycle: events associated with the lifetime of a product once it has been purchased by the customer, and events specific to the customer’s own lifetime. An example of the first involves a product’s manufacturer warranty. The warranty is associated with some qualifying criteria, such as a time period or measurable wear (such as a “3 year/36,000 mile” automobile warranty). A product life cycle event could be associated with a customer touch point. For example, three months prior to the end of a product’s warranty period might be a good opportunity to contact the customer and propose an extension to the warranty. An example of a customer life cycle event is the purchase of a home, often registered as public data with a state registry. Customer life cycle events can also trigger touch point opportunities, such as contacting a new home buyer with a proposal for a new water filtering system.
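
As a small illustration of turning a product life cycle event into a touch point, the sketch below flags purchases whose warranty ends within the next three months. The field names, the three-year term, and the 90-day lead time are assumptions for illustration only:

```python
# Sketch: flag customers whose product warranty expires within a contact window.
# Warranty term, lead time, and dates are illustrative assumptions.
from datetime import date, timedelta

WARRANTY_YEARS = 3                   # e.g., the time component of a "3 year/36,000 mile" warranty
CONTACT_LEAD = timedelta(days=90)    # reach out roughly three months before expiry

def warranty_end(purchase_date: date) -> date:
    return purchase_date.replace(year=purchase_date.year + WARRANTY_YEARS)

def due_for_contact(purchase_date: date, today: date) -> bool:
    """True when the warranty expires within the contact lead window."""
    end = warranty_end(purchase_date)
    return today <= end <= today + CONTACT_LEAD

# Purchased 2008-04-15, so the warranty ends 2011-04-15: within 90 days of 2011-03-01.
print(due_for_contact(date(2008, 4, 15), today=date(2011, 3, 1)))   # True
```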

There is value in knowledge of customer life cycle events and transitions, especially in maintaining long-term relationships. A new mother who registered for diaper coupons in 2001 is probably going to be dealing with a toddler in 2003, a kindergartener in 2006, and a teenager learning to drive in 2017. An effective long-term marketing strategy may take these life cycle events into account as part of customer analytics modeling.

That being said, the question of the impact of data quality on customer acquisition or retention comes down to the degree to which data errors increase or decrease the probability of long-term customer retention and/or continued conversion at critical life cycle events. And this implies a strong command of many different data sets, their potential integration, and the corresponding analysis.

Let’s continue the example: the sale and installation of a water filtering system in a recently purchased home is but the first transaction in what should be an ongoing sequence of subsequent maintenance transactions. The filters will need to be replaced on a periodic basis (every 12 months or so?), and the entire system may need to be flushed and cleaned every few years. Therefore, it is to the water filter company’s benefit to maintain high quality information about customers and their transaction dates. But since the filter itself is associated with the property, that information needs to be managed separately as well.

So here, if the customer moves to a new location, that single life cycle event could trigger two touch points: one to contact the existing customer at the new location and begin the sales cycle from the start, and one to contact the new occupant at the existing site to establish a new maintenance relationship. The quality of the data is critical, since attempting to continue providing maintenance to the existing customer at the new site would not really make sense until a new filter is installed there.

But it is important to note that it is not just the quality of the data that is important: it is the business process scenarios in which the data is used. Without having specific tasks associated with the life cycle event trigger, the entire effort is wasted.

Business Rules and Data Quality

November 5, 2010
Filed under: Business Rules, Data Quality, Metadata, Metrics 

There are many different dimensions of data quality that can be “configured” to measure and monitor compliance with data consumer expectations. We could classify a subset of these dimensions by mapping them to assertions at different levels of data precision, such as the following (a code sketch covering several of these levels appears after the list):

  • Data value, in which a rule is used to validate a specific value. An example is a format specification for any ZIP code (no matter which data element is storing it) that says the value must be a character string that has 5 digits, a hyphen, then 4 digits.
  • Data element, in which a value is validated in the context of the assignment of a value domain to a data element. An example is an assertion that the value of the SEX field must be either M or F.
  • Record, in which the assertion refers to more than one data element within a record. An example would specify that the START_DATE must be earlier in time than the END_DATE.
  • Column, which is some qualitative measure of the collection of values in one column. An example would assert that no value appears more than 5% of the time across the entire column.
  • Table, which measures compliance over a collection of records. An example is a rule that says the table’s percentage of valid records must be greater than 85%.
  • Cross-table, which looks at relationships across tables. An example could specify that there is a one-to-one relationship between each customer record and its primary address record.
  • Aggregate, which provides rules about aggregate functions. An example would apply a validation rule to averages and sums calculated in business intelligence reports.
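
Here is a minimal sketch of what rules at a few of these levels might look like in code; the field names and data are hypothetical, and a real implementation would externalize the rule definitions rather than hard-code them:

```python
# Sketch of rules at the value, data element, record, and table levels.
# Field names and sample data are hypothetical illustrations.
import re

def valid_zip(value) -> bool:
    """Value-level rule: 5 digits, a hyphen, then 4 digits."""
    return bool(re.fullmatch(r"\d{5}-\d{4}", str(value)))

def valid_sex(value) -> bool:
    """Element-level rule: SEX must come from the {M, F} value domain."""
    return value in {"M", "F"}

def valid_record(rec: dict) -> bool:
    """Record-level rule: START_DATE must be earlier than END_DATE."""
    return rec["START_DATE"] < rec["END_DATE"]

def valid_table(records: list) -> bool:
    """Table-level rule: more than 85% of records must pass the record rule."""
    if not records:
        return True
    passed = sum(valid_record(r) for r in records)
    return passed / len(records) > 0.85

rows = [{"START_DATE": "2010-01-01", "END_DATE": "2010-12-31"},
        {"START_DATE": "2011-02-01", "END_DATE": "2010-12-31"}]
print(valid_zip("12345-6789"), valid_sex("M"), valid_table(rows))  # True True False
```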

I have been thinking about ways to map these rules to metadata concepts in order to understand how a services model could be implemented and invoked at different locations within the information production flow. For example, one could validate data values as they are created, but you’d have to wait until many records exist to validate a table rule. This suggests that value rules can be mapped to value domains, while table rules are mapped to entities. As this mapping gets fleshed out, I will begin to assemble a service model for data validation that ultimately links, through the metadata, to the original definitions associated with business policies. Given that model, we can spec out an operational governance framework to manage quality as it pertains to those business policies.
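
As a rough sketch of how that mapping might be represented (with hypothetical rule names and structure, since the service model described above is still being worked out), a registry keyed by metadata concept type lets a validation service decide which rules it can invoke at a given point in the production flow:

```python
# Sketch: register each rule against the metadata concept it validates, so a
# validation service can look up which rules can fire at a given point in the
# information production flow. All names here are hypothetical illustrations.
import re

RULE_REGISTRY = {
    # value-domain rules can fire as soon as a single value is created
    "value_domain": {
        "ZIP_PLUS_4": lambda v: bool(re.fullmatch(r"\d{5}-\d{4}", str(v))),
        "SEX_CODE": lambda v: v in {"M", "F"},
    },
    # entity-level (table) rules must wait until many records are available
    "entity": {
        "VALID_RECORD_RATIO": lambda records: len(records) > 0
            and sum(r.get("valid", False) for r in records) / len(records) > 0.85,
    },
}

def rules_for(concept_type: str) -> dict:
    """Rules a validation service could invoke for a given metadata concept type."""
    return RULE_REGISTRY.get(concept_type, {})

print(list(rules_for("value_domain")))   # ['ZIP_PLUS_4', 'SEX_CODE']
```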