The Resurgence of Data Mining
Posted by Brad Paye, Contributing Writer
It is hard to resist a great comeback story, and in the world of commercially applied statistics, the recent resurgence of data mining techniques makes for a compelling comeback tale. Prior to the last decade, statisticians and econometricians often used the term data mining as a pejorative for so called "fishing expeditions," that is, research guided by a particular data sample rather than theoretical considerations. Fueled by the explosive growth of internet traffic and the massive quantities of data collected by e-commerce companies, data mining has pulled off an astonishing feat of image reversal over the past decade. Data mining’s potential to revolutionize online marketing has the e-commerce industry abuzz with excitement. Venerated computational software providers such as SAS and Insightful (maker of S-Plus) are scrambling to incorporate the latest data mining techniques into their products, while literally dozens of smaller firms market specialized data mining solutions to e-commerce customers.
Multiple Techniques
Before continuing, it should be mentioned that data mining is in fact an umbrella term for a variety of analytical techniques. The key to understanding data mining’s capabilities is to examine some of the typical questions they are used to answer. For example: How do you predict future values of variables of interest? For an e-commerce company, variables of interest might include the number of visitors to their Website each week or month, the length of time each visitor spends, and whether or not the visitor reaches the site’s "shopping cart." (Note that these examples include both numeric and nominal data.) To answer such a question, data mining techniques seek to identify relationships between the variable of interest and the variables in a data sample. In this example, the technique might simply be that of linear regression, in which the values for the variable of interest (e.g., the number of visitors to the Website each month) are compared to the values for other variables in the dataset (e.g., the month of the year and the sum spent on advertising over the previous few months). The regression would yield a fitted model, which is an equation that expresses the predicted number of monthly visitors to the Website as a linear combination of the other variables in the dataset, with weights determined by the regression analysis of the sample of data. Many e-businesses currently use equations of this sort to predict future site traffic and to assist in decisions such as when to purchase additional servers. Their efficacy rests on a critical assumption, however. They assume that the relationships discovered in the data sample are valid for new samples of data as well. That is to say, the relationships identified in the sample dataset must generalize.
A second type of question or decision for which data mining is used involves classification, such as the lending decisions faced by creditors. Banks and other lending institutions regularly collect data from loan applicants, based on which they decide whether or not to extend credit. Essentially, the goal is to classify applicants as either good or bad credit risks. If the lending institution has stored data from past applications along with records of loan payments, these data can be used to determine a method for classifying new applicants as either good or bad credit risks. Here again, data mining would be used to identify relationships between the payment record and other variables in the data sample, relationships that would then form the basis of a "recipe" for classifying future applicants. This recipe can take several forms, one of the most common of which is a decision tree. Here, tests are conducted at each node of the tree, the navigation continuing until a "leaf" of the tree is reached at which the observation is assigned to a particular class. Once more, for the classification scheme to be effective, the relationships in the sample dataset must generalize to future samples. If not, the scheme may incorrectly classify new applicants as good credit risks when in fact they are poor ones, and vice versa, something that would very quickly prove extremely costly for the creditor!
A third type of question data mining is used to answer is simply: What are the interesting relationships among the variables we observe? The canonical example is the famous "shopping basket" question that asks which items tend to show up together in the purchases made by customers. The major distinction between this question and the preceding two is that previously there was a "variable of interest" that we were attempting to predict. In this case, there is no particular variable of interest. Rather, we are simply looking for relationships among all of the variables in a sample of data, and typically, we "discover" many such relationships. Some of these may be of limited interest. For example, the revelation that hot dogs and hot dog buns often show up together in shopping carts is really not much of a revelation at all. Others, however, may be more surprising and thus of considerable interest. Still, to be of any practical use, the relationships – often referred to associations in the data mining literature—in the sample dataset must generalize to new data samples.
Overfitting
The point to take away from all this is that regardless of the particular question or the technique applied, the key to the successful application of data mining is the extent to which relationships identified in the data generalize. A significant concern, however, is that all data mining techniques tend to overfit data. Consider the loan applicant classification example above. In constructing a classification scheme, the primary interest is the accuracy of the scheme at classifying new loan applicants as good or bad credit risks. The classification scheme is developed using a sample of data called the training set, data for which the outcome is known. The data mining algorithm looks for relationships among payment history and other variables in the training set, and returns a classification scheme. To calculate the error rate we take each loan application in the training set and use the scheme to classify it as either a good or bad credit risk. We then check whether the actual outcome matches the classification assigned by our scheme. The ratio of errant classifications to the total number of applicants is the error rate for the training set. The critical notion to remember, however, is that this error rate is not a good indicator of the error rate that we are likely to observe when we attempt to classify new loan applicants. In fact, the error rate for new applicants is likely to be higher, possibly significantly higher. The reason for this is that the structure of the decision tree we are using in this example is highly dependent on the training set itself. In effect, the branches of the tree are created in order to perform well on the training set. Some of the branches will reflect idiosyncrasies of the training set, so that when new data are encountered the performance is likely to suffer. Furthermore, the danger of overfitting is not limited to decision trees, but to all data mining techniques.
This fact is central to understanding the story of the resurgence of data mining. In its darker days, data mining’s reputation was severely hampered by the inherent problem of overfitting. In fact, the terms "data mining" and "overfitting" were used nearly synonymously. Recently, however, new statistical techniques have been developed that allow for a better understanding of the extent to which relationships identified in the training set will generalize to new data.
Potential Solutions
When In the presence of massive amounts of data (such as the data that many e-commerce companies are likely to capture) one common recipe for accurately assessing how classifiers or rules will generalize is to simply partition the total data into two large sets, one for training and one for testing. The training set is used by the data mining algorithm to find relationships in the data and produces, for example, a decision tree. The error rate is calculated the same way as described above, except the observations in the test set are used instead of those in the training set. Since the tree was not created using the test set, performance on the test set is a reasonably accurate measure of performance on new data. In some cases, the data mining exercise is even expanded to three stages. In the first stage, several different algorithms are run on the training set. Each of these algorithms are then tested on a second dataset known the validation set, at which time the best performing model is selected. Finally, the selected model is tested against a third set of data, the test set.
To obtain a good predictive model, it is always best to have a large amount of data available for training. Similarly, a large amount of data is required to obtain accurate error estimates using the test set. When the amount of data is limited, however, a tradeoff must occur between the size of the training set and test set. Still, even when the quantity of data is not overwhelmingly large, there are techniques exist that allow for reasonably accurate performance evaluation. One technique, known as repeated holdout sampling, begins by randomly dividing the total dataset into a test and training set and proceeding as above. An error rate is determined based on performance over the test ("holdout") set. This process is then repeated a number of times, and the resulting error are averaged. The related technique of cross-validation randomly partitions the dataset into N subsets. Each of the subsets is used in turn as a test set, while the remaining sets are aggregated and used for training. This yields an error rate for each of the N subsets, and these N rates are then averaged to produce an overall error rate estimate. While the number of partitions is arbitrary, ten is a popular choice. Alternative techniques are also widely used, the most prominent being bootstrapping, which relies on the statistical procedure of sampling with replacement.
Ultimately, the significance of all this is that the sizzling hot aura now surrounding data mining is not merely a by-product of our recent data hoarding. The development of statistical techniques such as cross-validation and bootstrapping has been instrumental in transforming the term "data mining" from a disparaging remark into the golden child of applied analytics. Data mining’s allure to the business world is tied inextricably with the ongoing statistical research that allows businesses to confidently apply the knowledge derived from their data.
|