Data Quantization: Transformations for Analysis
We are defining a high-level process flow for preparing data for data mining and analysis. One of the objectives is to use undirected data mining to identify the variables that are most suitable for classification. The idea is that if there are specific characteristics that distinguish what we have identified as “good customers,” then we can look for similar characteristics in other customers and see if they can be transitioned into the “good customer” category. Therefore, we are looking for discriminating variables and their corresponding value sets.
One of the challenges involves data attributes with a broad, yet continuous, range of values, such as magnitudes or ages. The issue is that the large number of distinct values diffuses the attribute’s meaning across the records, even when a broader categorization of those values is evident. As an example, consider age: instead of looking at how each individual age is used for differentiation or classification, we often bundle groups of ages into smaller buckets for descriptive purposes, such as “18-34” or “over 65.”
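The age example can be sketched as a simple mapping from a raw value to a descriptive bucket. This is a minimal illustration; the exact boundary values (under 18, 35-64, and so on) are assumptions chosen for the example, not prescribed by the text:

```python
def age_bucket(age: int) -> str:
    """Map a raw age to a descriptive bucket label.

    The boundaries here are illustrative; a real application would
    choose them from the business context or the data distribution.
    """
    if age < 18:
        return "under 18"
    if age <= 34:
        return "18-34"
    if age <= 64:
        return "35-64"
    return "over 65"

print([age_bucket(a) for a in (16, 25, 50, 70)])
# → ['under 18', '18-34', '35-64', 'over 65']
```

Once every record carries the bucket label instead of the raw age, the attribute has only a handful of values and can participate in classification.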
Assigning discrete values to buckets reduces the variability of the attribute and allows it to contribute to the analysis. Doing so requires a process called “quantization,” a frequently used transformation for preparing data for analysis:
1) Identify data attributes that take a large set of discrete values within a continuous range – these are the attributes whose values can be bucketed;
2) Determine the most suitable distribution – in some cases an even distribution would suggest even ranges, while a normal value distribution might suggest variable-sized ranges scaled by the standard deviation;
3) Deploy the transformation – provide some process or service that “quantizes” each value into the right bucket.
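The three steps above can be sketched in a short program: one helper derives even-width boundaries (for a roughly uniform distribution), another derives boundaries scaled by the standard deviation around the mean (for a roughly normal distribution), and a `quantize` function maps each raw value into its bucket. The function names and the choice of mean ± 1 and 2 standard deviations as boundaries are assumptions for illustration:

```python
import statistics


def even_bins(values, k):
    """Step 2, uniform case: k even-width buckets over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]


def stdev_bins(values):
    """Step 2, normal case: boundaries at mean +/- 1 and 2 standard
    deviations (an assumed, illustrative scaling)."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [mu - 2 * sigma, mu - sigma, mu + sigma, mu + 2 * sigma]


def quantize(value, boundaries):
    """Step 3: return the index of the bucket the value falls into."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)


ages = [22, 25, 31, 38, 41, 47, 52, 58, 63, 70]
bounds = even_bins(ages, 4)          # boundaries at 34.0, 46.0, 58.0
print([quantize(a, bounds) for a in ages])
# → [0, 0, 0, 1, 1, 2, 2, 3, 3, 3]
```

In a deployed pipeline, `quantize` would run as part of the transformation service so that every incoming record is bucketed the same way before the analysis sees it.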