Is Data Profiling a Commodity?
Here are some quick thoughts about the basic functionality of data profiling that make me wonder how far it has become a commodity capability. If it has, then I have a few observations at the end to make folks think about what they are actually using profiling for.
Here is the list:
- There are a handful of open source vendors providing core profiling functionality without charging a licensing fee.
- There are a handful of non-open-source vendors providing core profiling functionality without charging a licensing fee.
- For small data sets you can do a lot of what is typically done using a desktop productivity tool that almost everyone has (namely, a spreadsheet).
- For larger data sets you can do a lot of what is typically done using some template SQL queries.
- For even larger data sets, if you are a programmer and took a freshman-level algorithms class, you can do a lot of what is typically done using hash tables.
- For still larger data sets, if you are a programmer and have some understanding of parallel processing, you can do a lot of what is typically done using MapReduce and/or Hadoop.
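As a minimal sketch of the do-it-yourself routes above, here are the template-SQL and hash-table versions of the same frequency count side by side. The table and column names (`orders`, `status`) and the sample values are invented for illustration:

```python
import sqlite3
from collections import Counter

# Invented sample column, including a null and a likely typo.
values = ["shipped", "shipped", "pending", None, "shipped", "PENDNG"]

# Route 1: template SQL -- value frequency via GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)", [(v,) for v in values])
freq_sql = conn.execute(
    "SELECT status, COUNT(*) AS n FROM orders "
    "GROUP BY status ORDER BY n DESC"
).fetchall()

# Route 2: a hash table -- the same frequency analysis in memory.
freq_hash = Counter(values)
```

The MapReduce route is the same `(value, count)` computation, just with the counting split across mappers and summed in reducers.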
What do I mean by “typically done”? Many aspects of profiling are based on value frequency analysis, that is, counting the number of times each value appears in a column. This enables completeness evaluation (the count of nulls), outlier analysis (sort by frequency to look for either high-frequency or low-frequency values), other outlier analysis (sort by the values themselves to look for weird ones), distribution analysis (do any values show up more frequently than expected?), etc.
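Concretely, once the frequency table exists, each of the checks above falls out in a line or two. This is a sketch with invented sample values:

```python
from collections import Counter

# Invented sample column: a mix of valid codes, a null, and two oddballs.
values = ["NY", "NY", "CA", "CA", "CA", None, "ny", "XX"]
freq = Counter(values)
total = len(values)

# Completeness evaluation: the count of nulls.
null_count = freq[None]

# Outlier analysis by frequency: most common and singleton values.
non_null = [(v, n) for v, n in freq.items() if v is not None]
most_common = max(non_null, key=lambda vn: vn[1])
singletons = [v for v, n in non_null if n == 1]

# Outlier analysis by value: sort the values themselves to spot weird ones.
sorted_values = sorted(v for v in freq if v is not None)

# Distribution analysis: each value's share of the rows.
shares = {v: n / total for v, n in freq.items()}
```

Everything here is one pass to build `freq`, then cheap lookups over the (usually much smaller) set of distinct values.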
The harder stuff in profiling is usually redundancy analysis (which, by the way, I can do with bit vectors for set functions like intersection) and dependency analysis (much more complicated and computationally intense). Yet who is actually using these features? If you are, then the freebies might not do it for you. The same goes if you use the product as a service component, or for managing your data quality rules.
What are you using data profiling to do? If it is solely base-level identification of gross errors, that is one thing, but I’d be interested in hearing about your more creative approaches.