This article describes a method for identifying whether a data set is composed of categorical values. Automatically identifying whether a data set contains categorical values enables applications to make use of such data without requiring users to supply this information.

Categorical data is composed of a limited number of possible values. For example, a table column representing gender would contain the text values *male* and *female*. This column contains categorical data because it holds only two values regardless of the number of rows. In contrast, a table column containing the text of a set of tweets does not contain categorical data: although some of the values might be repeated (as in the case of retweets), most of the values will be unique. Since repeated values can occur in non-categorical data, a test is required to differentiate between a data set composed of categorical values and a data set composed of non-categorical values that may contain repeats.

The following algorithm provides a useful test for identifying categorical data:

- Calculate the number of unique values in the data set.
- Calculate the difference between the total number of values in the data set and the number of unique values. This is the number of values that repeat an earlier value.
- Express that difference as a percentage of the total number of values in the data set.
- If the percentage difference is 90% or more, then the data set is composed of categorical values.

When the data set has fewer than about 50 rows, a lower threshold of 70% works well in practice.
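
The steps above, together with the lower threshold for small data sets, can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `is_categorical` and its parameter names are hypothetical, the cut-off of exactly 50 rows is an assumption standing in for "around 50", and the values are assumed to be hashable (for example strings or numbers).

```python
def is_categorical(values, threshold=0.90, small_threshold=0.70, small_size=50):
    """Return True if `values` looks like categorical data.

    All names and the exact small_size cut-off are illustrative assumptions.
    """
    values = list(values)
    total = len(values)
    if total == 0:
        # An empty data set gives no evidence either way; treat it as non-categorical.
        return False

    # Step 1: number of unique values.
    unique = len(set(values))

    # Steps 2 and 3: difference between total and unique, as a fraction of the total.
    repeat_fraction = (total - unique) / total

    # Step 4: compare against the threshold, using the lower one for small data sets.
    cutoff = small_threshold if total < small_size else threshold
    return repeat_fraction >= cutoff


# A gender column: 100 rows, 2 unique values, so 98% of the values are repeats.
gender = ["male", "female"] * 50
print(is_categorical(gender))  # True

# Mostly unique text with a few repeats: 100 rows, 95 unique values, 5% repeats.
tweets = [f"tweet {i}" for i in range(95)] + ["tweet 0"] * 5
print(is_categorical(tweets))  # False

# A small data set: 12 rows, 3 unique values, 75% repeats, tested against 70%.
small = ["a", "b", "c"] * 4
print(is_categorical(small))  # True
```

The small example shows why the lower threshold matters: with only 12 rows, three categories already leave 25% of the values unique, so the 90% threshold would reject a column that is clearly categorical.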