Repost from 1999: Much Ado About Nothing – An Orderly Representation of Null Values

July 6, 2011 by
Filed under: Data Profiling, Data Quality, Recommendations 

This is a repost of an article I wrote back in 1999, but I thought it might be interesting to recycle it, since it still seems relevant. I did edit it a little – I wrote it after the birth of my second child, who is now 11, so it did not make sense to refer to him as a baby ;-). Here it is:

A null value is a missing value. Yet a value that is “not there” may provide more information than one might think, because there may be different reasons that the value might be missing. A null value might actually represent an unavailable value, that the attribute is not applicable for this entity, that there is no value in the attribute’s domain that correctly classifies this entity, etc. Or the value may actually be missing!

Even though many databases may provide a default representation for the null value, there may be times when a specific internal representation of a null value is used. When this occurs, it is wise to choose a value that will not get you into trouble down the road. An example of a poor choice for a null value is 20 years worth of using “9/9/99” as a null date.

Instead of relying on the default null, let’s look at the different kinds of null values that exist, and ways of representing those nulls. In turn, we must also create a framework for validating those nulls – basically making sure that “what you don’t see is what you get.”

There are Different Kinds of Nulls

A null value is essentially the absence of a value, although there are different kinds of null values. Our goal for null value specifications is to isolate the difference between a legitimate null value and a missing value. Since a data record may at times allow certain fields to contain null values, we provide these deeper characterizations:

  1. No value – there is no value for this field – a true null.
  2. Unavailable – there is a value for this field, but for some reason it has been omitted. Using the unavailable characterization implies that at some point the value will be available and the field should be completed.
  3. Not applicable – this indicates that in this instance, there is no applicable value.
  4. Not classified – there is a value for this field, but it does not conform to a predefined set of domain values for that field.
  5. Unknown – the fact that there is a value is established, but that value is not known.

If we were only to use the default null value, it would be difficult to distinguish when a value is truly missing and when it is expected to not be there. We will look at some ways to look at this problem so that we can make assertions about what we expect to see (or not), and when action needs to be taken. But first, let’s look at some examples.

Examples

When our second child was born, one of the more mundane tasks at hand involved the notification and subsequent enrollment of our boy in our health insurance plan. I was told that I needed to provide my name, my identification number, the baby’s name and the baby’s social security number. While we had been able to give our son a name, we had only applied for the social security number, but it had not yet been assigned. I was told to send in what information I did have, and to provide the social security number when I received it. This is an example of an unavailable data value.

Another example involves securities trading. An equity product is essentially stock in a company. An option is a derivative security that is basically a promise to buy or sell a specific stock by a given date (the “exercise date”) at a pre-assigned price (the “strike price”). A securities orders database must be able to support trades of either kind of product. “Strike price” and “Exercise date” are two attributes that are relevant to the option product, but are not applicable to the equity product, and are expected to be null.

Setting the Tables…

A way to address the null value question involves both defining different representations for the different null values, and expressing rules that validate the presence (or absence) of null values. Consequently, we will have one method for defining and characterizing null values, and two kinds of null value rules. The first asserts that null values are allowed to be present in a field, and the second asserts that null values are not allowed in a field.

Considering that we allow more than one kind of null value, we also need to allow different actual representations, since there is usually only one system-defined null value. Therefore, any null value specification must include both the kind of null along with an optional assigned representation. We can define a table that contains different kinds of representations along with their related meanings:

representation meaning
“-“ “no value”
“U” “unknown”
“X” “unavailable”
“N/A” “not applicable”
“N/C” “not classified”

Using a more flexible table, we’ll associate each set of null value representations with a named table and attribute. This way, we can have different representations for the different kinds of null values in a way that allows flexibility in defining the null value rules. A null value rule will specify the kinds of null values that may be present, and a representation (if any) used for those null values.

Nulls_representation:

Table_field representation meaning
Customers.SSN “-“ “no value”
Customers.SSN “U” “unknown”
Customers.SSN “X” “unavailable”
Customers.SSN “N/A” “not applicable”
Customers.SSN “N/C” “not classified”
Customers.birthdate “99/00/00“ “unknown”
Customers.birthdate “98/00/00“ “unavailable”

In the table above, we have null information for 2 fields: SSN (social security number), and birthdate. For the SSN field, we have 5 different kinds of nulls, but for the birthdate field the only null values should be unknown and unavailable, we will not allow nulls other than the ones defined in the table.

A null value rule may allow traditional null values (such as system nulls, empty fields, or blanks), generic null values as defined in null value specifications, or a detailed list of specific null value representations. Note that if we only allow certain kinds of null values, this will most likely mean that we want to restrict the appearance of the traditional nulls!

A non-null value rule, as its name implies, is a rule that specifies which kinds of null values are not allowed. If the rule indicates that no nulls are allowed, then the rule is violated if the system null or blanks appear in the field. The non-null value rule can also specify that certain predefined null representations are not allowed, in which case any appearance of those defined null values constitutes a violation.

Validation

With the null value rules, the resulting validation depends on the types of null values allowed. If any nulls are allowed, then there is really nothing to do – whether or not a value is in the field, the field is conformant. But if only certain kinds of nulls are allowed, then the validation for that rule includes checking to make sure that if any other null values appear (such as the system null, or blanks), the record is marked as violating that rule.

How is this done? Well, since we have specific representations for each kind of null, we can identify those records that have unexpected null values by selecting any records that have a real null or a blank string in the field under scrutiny:

For example, any record where the birthdate field has a true null value is invalid, since we have defined the two null representations that we care about:

SELECT * FROM Customers WHERE birthdate IS NULL

Another example would be when pulling out all the records where the social security number was previously unavailable:

SELECT * FROM Customers WHERE SSN = “X”

The real power of this kind of representation is that we do not really need to know the actual representation ahead of time. Instead, we can grab the representation from the Nulls_representation table itself:

SELECT * FROM Customers WHERE SSN =

(SELECT representation FROM Nulls_representation WHERE

meaning = “unknown” AND

table_field = “Customers.SSN”)

Conclusion

No remedy is always going to be foolproof, but this method yields two major benefits. The first is that it gives the DBA and the users some flexibility in using null values, and the second, more important benefit, is that you start to accumulate knowledge about how null values are represented in all the tables in the company – an important piece of enterprise knowledge.

Comments

Tell me what you're thinking...
and oh, if you want a pic to show with your comment, go get a gravatar!