home * about us * contact us * past features * columns * resource links * site map


9/11 Remembered
Clustering: Making The Most of Information Overload
Posted by Brad Paye, Contributing Writer

As the digital revolution stampedes onward, researchers and applied analysts increasingly confront a problem that seemed unimaginable only a few decades ago: too much information. Sifting through a mountain of data can be a daunting task, especially if the data has been collected without a well-defined application or purpose in mind, which is not unusual. In many instances, there may only have been a notion that the data would be interesting or useful, in which case it is often left to the statistician or analyst to determine what the data means and what it can be used for. To do this, he or a she needs tools, and one such tool now receiving a great deal of attention for its ability to give shape to large quantities of seemingly disparate information is clustering.

What is Clustering?
Clustering is a statistical technique for identifying groups, or clusters, within a data set. Traditionally it has found use in academic disciplines such as genetics and taxonomy, but it is currently finding business and marketing applications as well. A host of different clustering algorithms currently exist, but each is designed to identify clusters that exhibit the following two characteristics:

  1. data points within the clusters are as similar to one another as possible; and
  2. all of the identified clusters are distinct from one another as possible.

Three Approaches
Among the many approaches to clustering, three of the most common ones are hierarchical, partitioning, and overlapping. Hierarchical methods follow either a top-down or a bottom-up approach. In a typical top-down approach, the algorithm begins with each data point represented by a different cluster. Similar data points are then grouped together to form larger groups. This process proceeds until the final step when the entire data set is represented as a single cluster. (Bottom-up methods reverse this process, starting with one cluster and then dividing it into smaller and smaller clusters.) One advantage of the hierarchical approach is that the results can be presented in a tree-like structure called a dendrogram. In applied work, the researcher will often "cut the tree" at a particular stage that appears useful.

In contrast to hierarchical methods, partitioning methods often start by making an assumption about the number of clusters in the data and their center points. (These “centers” may represent actual data points or hypothetical ones.) Data points are then assigned to the nearest cluster center based on such things as minimum Euclidean distance or some other measure of “dissimilarity” (see e.g., Kaufman and Rousseeuw, 1990, chapter 1). Depending on the resulting distribution, the analyst may then adjust the assumptions and repeat the process.

Both hierarchical and partitioning methods produce a strict partition of the data in which each data point is assigned to a single cluster. Some algorithms, however, such as the fuzzy clustering algorithm of Kaufman and Rousseeuw (1990, chapter 4) produce overlapping clusters. The fuzzy clustering algorithm, which is a generalization of conventional partitioning methods, assigns each data point a membership coefficient for each cluster based on its likelihood of belonging to that cluster. The coefficients range between zero and one, and taken together must add up to one. Thus, if there are three clusters, a given data point might have membership coefficients of 0.5, 0.5, and 0.0 for clusters 1, 2 and 3, respectively. In this case, the data point would have a 50% probability of belonging to cluster 1, a 50% probability of belonging to cluster 2, and a 0% probability of belonging to cluster 3.

Data Preparation
Although it might seem that the choice of algorithm would be the most critical decision facing the analyst, this is not always the case. Regardless of the technique, great care must be taken in preparing the data, which in fact can prove to be the most challenging aspect of the entire process. For example, with certain datasets and applications, the choice of variables is obvious. For others, however, a good portion of the analysis consists of searching for the set of variables that yields the most insightful results. One possible solution is a “kitchen sink” strategy that includes any and all relevant variables. Unfortunately, the inclusion of so many variables often obscures what might otherwise be a clear cluster structure.

Another potential problem facing the analyst is that the variables themselves are often of different data types. The embedding space may include interval data, ordinal data, and categorical data. Related to this, there is also the question of scaling or the units of measurement of the different variables, and of course, there is the problem of missing data, which always represents a challenge when comparing objects.

The Pitfalls of Interpretation
In any analysis, the choice of such things as variables, units of measurement, and measures of similarity will fundamentally determine the resulting clustering structure. As a result, changes to these features such as removing variables or changing the scale of measurement will also change the nature of the identified clusters. In fact, in many cases the changes will be significant. Still, there is no precise and objective way to rank or choose among the different permutations, and particularly dangerous is the fact that the analyst can often find what he or she is looking for by fiddling with features of the analysis. If the analyst has strong prior notions as to the underlying cluster structure, repeated experimentation with parameters of the analysis will likely produce something resembling that outcome. It is important to realize, however, that such an exercise yields little new knowledge! In practice, the most useful insights often result from careful consideration of what initially appear to be incongruous or surprising results.

Frontiers in Clustering
Innovation in clustering is driven largely by applications, and two promising areas of research are those of the Adaptive Resonance Theory (ART) of Carpenter and Grossberg, and clustering over time. Oftentimes, instead of a static dataset, the analyst is confronted with an expanding body of data. In ART, each new data point is assigned either to an existing cluster, in which case the cluster centers are then updated, or to an entirely new cluster if it does not “resonate” with any of the existing clusters.

In other applications, the values of measured variables may change over time, in which cases a standard static cluster analysis might be performed at repeated time intervals. A natural question arises, however, as to how to examine and compare the resulting clusters, and while there is relatively little work in the literature addressing the question, Dunn and Landwehr (1980) suggest a number of interesting graphical methods for comparing clusters across two successive time periods.

Cluster Wisely
Employed properly, clustering can give shape to what might otherwise be a formless mass of data. It is an excellent tool, and as the mountains of information continue to rise around us, there is little doubt that it is going to be turning up in more and more places—it has already caught the fancy of the e-world. Still, to be truly informative it must be used with care, and it is wise to remember that like all statistical methods, it is likely to yield most when applied without preconceived notions as to the results.