Reality Check for Data Mining
Posted by Dr. Halbert White
Introduction
Last month, our main article, "A Little Knowledge can be a Dangerous Thing" (see below), featured a discussion with Dr. Halbert White on neural networks and their potential pitfalls, such as over-fitting the data. This month we continue our discussion with Dr. White, focusing on his Data Mining Reality Check (DMRC), which, as its name implies, is a means for ensuring the validity of your data mining results.
Reality Check
"Given the large number of nets you’re likely to examine in the course of a data mining exercise, you can easily get something that performs well on the training set but is really a result of over-fitting," explains White. "You need to know if this apparent good performance is real or not, and as a matter of routine, most people, when employing neural networks, have an evaluation set which they use to test each of their networks to tell if the results are real or not. Still, the question remains as to how good the results should be on this evaluation set. A systematic way to approach this is to test a null hypothesis that the best network is no better than some benchmark, say, something really simple like a network with no hidden units, just direct connections from input to target. What DMRC does is to give you a probability distribution for the best net relative to the benchmark over all of the different nets that you happened to have tried on the evaluation data, where the probability distribution is the one that arises under the null hypothesis that your best is no better than your benchmark. This lets you compare what you observe in your data to what you would have expected to see generated by chance under your data mining exercise."
"In other words, the Data Mining Reality Check is technology for giving you the probability distribution against which to compare the best performance from your data mining exercise; that is, the probability distribution of your best network relative to the benchmark, viewed as a random variable generated by a random process in which really there is nothing better than the benchmark. The concept here is analogous to testing whether a regression coefficient is different from zero using a t statistic. The t statistic is the number that comes out of the regression analysis and the t distribution is the distribution that you compare it against. For DMRC, your best performance is the number that comes out and the DMRC distribution is what you compare that against. The thing about this distribution is that it depends on all the different networks you tried, all their correlations across cases, and all the cross-correlations across the different networks you’ve looked at in arriving at your best. As a result, this isn’t a nice distribution you can look up in a textbook, like your normal distribution or a t distribution. The distribution is different in every case, but one way to find it out is by means of statistical methods, such as the bootstrap."
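For readers who prefer symbols, the quantities White describes can be written roughly as follows. The notation is an assumption introduced here for illustration and is not part of the interview: m is the number of networks tried, n is the number of cases in the evaluation set, and f_{k,t} is the performance of network k minus the performance of the benchmark on case t.

```latex
% Average performance of network k relative to the benchmark (assumed notation)
\bar{f}_k = \frac{1}{n} \sum_{t=1}^{n} f_{k,t}, \qquad k = 1, \dots, m

% Null hypothesis: no network among those tried truly beats the benchmark
H_0 : \max_{1 \le k \le m} \mathbb{E}\left[ f_{k,t} \right] \le 0

% Test statistic: the best observed relative performance, scaled by sqrt(n)
V_n = \max_{1 \le k \le m} \sqrt{n} \, \bar{f}_k
```

In the t statistic analogy, V_n plays the role of the number that comes out of the analysis, and the distribution of V_n under H_0 plays the role of the t distribution.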
The Bootstrap
"Using the bootstrap," White continues, "it is possible to arrive at an estimate of the sampling distribution of the statistic that is the best observed performance of a data mining exercise. By comparing that statistic to the DMRC distribution generated by the bootstrap, you get a p value for testing the hypothesis that the best observed performance in a data mining exercise is no better than the benchmark. The p value tells you the probability that results as good as or better than those of your best network could have been generated by chance. So if that p value is low, say 0.02, it’s very unlikely that chance has generated what you are seeing. On the other hand, if the p value that comes out of DMRC is 0.5, even though the performance might look stunningly better than the benchmark, it is still well within the random variation you should expect, given the data mining exercise you just conducted."
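The bootstrap procedure White describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation. The function names, the layout of `diffs` (one list per network, holding that network's performance minus the benchmark's on each evaluation case), the stationary bootstrap with restart probability `q`, and the default settings are all choices made for this example.

```python
import math
import random


def stationary_bootstrap_indices(n, q, rng):
    """Resample case indices in blocks of random (geometric) length, mean 1/q."""
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < q:
            idx.append(rng.randrange(n))    # start a new block at a random case
        else:
            idx.append((idx[-1] + 1) % n)   # continue the block, wrapping around
    return idx


def reality_check_pvalue(diffs, n_boot=500, q=0.1, seed=0):
    """Bootstrap p value for 'the best network is no better than the benchmark'.

    diffs[k][t] is the performance of network k minus the benchmark on case t
    (an assumed data layout for this sketch).
    """
    rng = random.Random(seed)
    n = len(diffs[0])
    means = [sum(d) / n for d in diffs]
    observed = math.sqrt(n) * max(means)    # best observed relative performance
    exceed = 0
    for _ in range(n_boot):
        # One set of indices is shared by all networks, so the resample keeps
        # the cross-correlations between the networks that were tried.
        idx = stationary_bootstrap_indices(n, q, rng)
        boot_best = max(
            math.sqrt(n) * (sum(d[i] for i in idx) / n - m)  # centred at the sample mean
            for d, m in zip(diffs, means)
        )
        if boot_best >= observed:
            exceed += 1
    return exceed / n_boot


if __name__ == "__main__":
    # Simulated data only: ten "networks" whose edge over the benchmark is pure noise.
    demo_rng = random.Random(123)
    noise = [[demo_rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(10)]
    print("p value on pure noise:", reality_check_pvalue(noise, n_boot=200))
```

Each bootstrap draw recomputes the best relative performance after subtracting each network's own sample mean, which is how the sketch imitates a world in which nothing beats the benchmark. The p value is the share of draws in which that chance best is at least as large as the best actually observed.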
"Let me give you a stock market example. There are a number of calendar trading rules, such as ‘trade on Monday’ and ‘trade the day after a holiday.’ If you data mined a set of stock returns data without a reality check, a conventional or classical analysis would conclude that a number of these rules yielded statistically significant predictive power. When you apply DMRC, however, all these apparently significant effects disappear. In fact, in a study in which I was involved, we had a restricted universe of about 300 different calendar rules and a more inclusive universe of almost 9,000 different calendar trading rules—things like buy and hold the stock for a particular day of the week or a particular month, or sell short on Tuesdays. Using one hundred years of daily data on the Dow Jones and about 25 years of daily data on the S&P 500, we found the best calendar rule to be the so-called Monday Effect. In this case, the rule said to stay out of the market on Monday. It was the best, and by a substantial margin—four hundred basis points per year or more!"
"It was a phenomenal difference, and using conventional techniques that ignored the data mining, if you just tested the ‘theory’ that says stay out of the market on Monday, it would be extremely statistically significant. However, once you take into account the fact that there is no scientific basis for a ‘Monday theory’ and that you did a data mining exercise, the p value goes to 0.3, 0.4, in some cases 0.9, well within the range of what you would expect the chance best performance to be when looking at performance of all of those different calendar rules. That was true whether you did it with 300 trading rules or 9,000 trading rules. The Monday Effect is just not real for the markets we studied. Neither is the January Effect nor the October Effect." (For a more in-depth discussion of this subject, see the online paper "Dangers of Data-Driven Inference: The Case of Calendar Effects on Stock Returns," by Sullivan, Timmermann and White, recently accepted for publication in the Journal of Econometrics.)
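The gap between the conventional p value and the data-mining-aware p value can be reproduced on purely simulated data. The sketch below is an illustration only and does not use the study's data or method. It generates hypothetical "rules" whose excess returns are pure noise, picks the best one, and compares its naive one-sided p value with a simple adjustment that assumes the rules are independent. That independence assumption is a stand-in chosen for brevity. DMRC itself uses the bootstrap precisely because real rules are correlated. All names and parameter values are assumptions made for this example.

```python
import math
import random
import statistics


def t_statistic(returns):
    """Conventional t statistic for 'mean excess return is zero'."""
    n = len(returns)
    return statistics.mean(returns) / (statistics.stdev(returns) / math.sqrt(n))


def best_rule_on_noise(n_rules=300, n_days=250, seed=1):
    """Search n_rules worthless rules and report how good the best one looks."""
    rng = random.Random(seed)
    t_stats = []
    for _ in range(n_rules):
        # Simulated excess returns with a true mean of zero: no rule has any edge.
        excess = [rng.gauss(0.0, 1.0) for _ in range(n_days)]
        t_stats.append(t_statistic(excess))
    best_t = max(t_stats)
    # Naive p value: treats the best rule as if it were the only one ever tested
    # (one-sided, normal approximation to the t distribution).
    naive_p = 0.5 * math.erfc(best_t / math.sqrt(2))
    # Chance that the best of n_rules independent worthless rules looks this good.
    adjusted_p = 1.0 - (1.0 - naive_p) ** n_rules
    return best_t, naive_p, adjusted_p


if __name__ == "__main__":
    best_t, naive_p, adjusted_p = best_rule_on_noise()
    print("best t statistic:", round(best_t, 2))
    print("naive p value:", naive_p)
    print("p value allowing for the search:", adjusted_p)
```

The naive p value answers the question "how unusual is this rule on its own?", while the adjusted one answers "how unusual is it for the best of many worthless rules to look this good?". The second question is the one a data mining exercise actually poses.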
"The danger that neural networks present is that you can create new myths on the fly and never know it. What the Data Mining Reality Check does is to let you know whether you’ve found something real or not. If it turns out you haven’t found something real, you can keep going, knowing that you will always be able to account for the impact of your data mining. This gives you the freedom to data mine with the knowledge that you can assay what you find. It may be hard to find things, but you can have some faith in what you do find."
----------
Mr. Paul Lasky has pointed out by email that this article contains no discussion of the power properties of the DMRC, an inadvertent consequence of its interview format, but nevertheless a serious omission. Just as with any test, the power properties of the DMRC are a critical aspect of its usefulness. The power of the DMRC is the probability of detecting systems that do in fact beat the benchmark. In his recent paper "A Reality Check for Data Snooping" (Econometrica, September 2000), Dr. White proves mathematically that when there is a system that truly beats the benchmark among the set of systems tested, the DMRC power approaches one as the number of cases in the evaluation set grows larger (that is, the best system is eventually detected with certainty), and the greater the improvement over the benchmark, the higher the power (that is, the easier it is to detect the improvement). In addition, Dr. White undertakes further study of the power properties of DMRC in the paper "Finite Sample Properties of the Bootstrap Reality Check for Data Snooping: A Monte Carlo Assessment," written jointly with Dr. Ryan Sullivan. Dr. White also notes that in his study with Sullivan and Timmermann of technical trading rules on the Dow and the S&P 500 ("Data-Snooping, Technical Trading Rule Performance, and the Bootstrap"), they consistently found technical trading rules beating their benchmark to a highly statistically significant degree until 1986. After 1986, the advantage disappears. The reason for this is a matter of conjecture, but a likely possibility is the combination of ever cheaper computing and ever more powerful methods for detecting tradable market structures, such as artificial neural networks, beginning in the mid-1980s. On the other hand, calendar trading rules do not appear to have any sub-periods during which they beat the buy-and-hold benchmark.
----------
About the Author
Dr. White is a professor of economics at the University of California, San Diego and is one of the world’s foremost experts on artificial neural networks and on econometrics. He is also a Senior Partner of Bates White & Ballentine, LLC. The firm provides economics, business analysis, and litigation consulting services. Dr. White is a member of the advisory board of Stone Analytics, sponsor of Second Moment. He has received U.S. Patent 5,893,069 for computer implementations of the Data Mining Reality Check.