WANTED: A better algorithm - The Current State of Market Basket Analysis
Posted by Scott Cunningham and Tej Anand

Algorithms for finding rules or affinities between items in a database are both well known and well documented within the data mining community. One reason for this is that they have obvious commercial application. Affinity algorithms play a major role in current “market basket” analyses, i.e. the use of affinity rules in the analysis of consumer purchases. For their part, market basket analyses are of particular importance to the consumer package goods industry, which generates more than 300 billion dollars in sales a year in the United States alone. Despite the size of the industry, however, analysts have yet to develop data mining solutions that specifically address the needs of this vast market.

A Complex Environment
The consumer package goods industry exists within a complex economic and informational environment. Mass merchandizing is in decline. U.S. consumers are increasingly recognized as belonging to 50 (or more) distinct segments, each with its own demographic profile, buying power, product preferences, and media access. The items being sold are also more diverse than ever before. A single category of food can easily contain hundreds of competing products. Within this highly differentiated environment, strong brand names continue to offer a competitive advantage. Temporary price reductions by themselves are not sufficient for establishing consumer loyalty to either a store or a product. Consumers are knowledgeable and mobile enough to seek out the lowest possible price for a product. Ultimately, the advantage goes to those retailers most able to negotiate favorable terms with their suppliers.

To make matters even more complicated, the consumer package goods business is a mature industry here in the United States. Profit is no longer merely a matter of opening more stores and selling to increasing numbers of consumers. The market is becoming more and more saturated, and the available consumer disposable income is already largely consumed. In fact, in many instances, maintenance of one’s existing customer base is more important than gaining new customers. Growth is based on selling both more and a greater variety of products to existing consumers. In this environment, the most profitable retailers are those that are able to maintain or reduce their operating costs. Economies of scope, not scale, determine profitability.

Given this environment, more and more retailers are using such things as consumer loyalty programs and on-line transaction processing systems to gather detailed information about their customers. Data warehousing is seen as one of the foremost technological means for increasing operational efficiency. Consumer response systems and category management, an organizational strategy for enhancing retailer-wholesaler coordination, are expected to save the industry $30 billion a year.

Case Studies: The Reality
A major international food manufacturer with significant brand equity and a wide variety of manufactured products was interested in optimizing its product advertising budget. Like many package goods retailers, the manufacturer had an extensive and rapidly growing advertising budget. Essential to the endeavor was cooperation between the manufacturer and their independent retail outlets in the creation and design of product promotions. The manufacturer sought to create a suite of software tools that would employ the latest data mining technology in the design of the promotions, and to make these tools available in real time to their category managers and to the managers of their retail outlets. The business case suggested that there would be several sources of return for the creation of these tools:

  1. improved coordination with retailers,
  2. more effective cross-sales across product categories,
  3. reduced promotional competition from other manufacturers, and
  4. enhanced promotional returns.

My colleagues and I designed a state-of-the-art neural network for forecasting and optimizing planned promotions. The network met or exceeded industry standards for promotional forecasts. It was within 15% of actual sales, 85% of the time. Despite the statistical quality of the results, however, the application was never put into production by the manufacturer because the software design necessary to implement the results was too complex. Part of the complexity stemmed from the hierarchical data types necessitated by the varied products and markets. Another component of the complexity had to do with reconciling the different product world-views of manufacturer and retailer.

At the same time, a major regional food retailer wanted to analyze consumer transactions within its produce and salad dressing departments. The retailer anticipated that such analysis would enable it to improve store layout, improve promotional design, and gain insight into the market role of the various highly differentiated products within the category. The retailer clearly anticipated a causal analysis that would reveal the products that would lead to additional add-on sales of other products.

In this instance, we produced a market basket analysis that revealed the distinctive purchasing profiles associated with each major brand of interest. The analysis showed that the best-selling brands were not those that resulted in the greatest number of attendant sales. The analysis supported the retailer's existing category management plans, and also independently confirmed the results of a demographic panel survey. Still, despite these successes, the market basket analysis, by itself, did not produce any new actionable results for the retailer.

The Problem
As stated at the beginning, affinity algorithms are both well understood and well documented. For example, when applied to market basket analysis these algorithms produce rules such as, “Those baskets containing product X are 75% likely to contain product Y.” Recently, work has also been done to optimize the speed and efficiency with which these rules are found. Still, to be truly useful, additional applied research is needed to support the decision-making needs of the consumer goods industry (and other relevant business groups).
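
To make the form of such a rule concrete, here is a minimal Python sketch of how the "75% likely" figure (usually called the rule's confidence) and its support could be computed from raw baskets. The function name `rule_stats` and the toy baskets are hypothetical illustrations, not data or code from the case studies.

```python
def rule_stats(baskets, x, y):
    """Return (support, confidence) for the rule "x -> y".

    support    = share of all baskets that contain both x and y
    confidence = share of the baskets containing x that also contain y
    """
    baskets = list(baskets)
    n_total = len(baskets)
    n_x = sum(1 for b in baskets if x in b)
    n_xy = sum(1 for b in baskets if x in b and y in b)
    support = n_xy / n_total if n_total else 0.0
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical toy data: each basket is the set of products in one transaction.
baskets = [
    {"lettuce", "dressing"},
    {"lettuce", "dressing", "tomato"},
    {"lettuce", "dressing"},
    {"lettuce", "bread"},
    {"bread", "milk"},
]

support, confidence = rule_stats(baskets, "lettuce", "dressing")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

In this toy data, four baskets contain lettuce and three of those also contain dressing, so the rule "lettuce -> dressing" has a confidence of 75%.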

Affinity algorithms produce individual, isolated rules. They do not reveal associations between groups of products. Furthermore, while the analysis can be repeated across all the products in a category, or even in a store, the number of rules grows exponentially. Not only is this computationally complex, but the resulting welter of rules is hard to interpret. At the same time, the output of affinity algorithms seems to suggest causal relationships between products, while the algorithms themselves embody no causal assumptions. As a result, the nature of product affinities needs to be reconsidered. Either a new and causal form of affinity analysis needs to be produced, or a thorough understanding of non-causal applications and use of affinity rules needs to be obtained.
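
The exponential growth can be illustrated with a short Python sketch. It assumes, for illustration only, that a candidate rule is any pair of disjoint, non-empty product sets (antecedent -> consequent); under that assumption, n products admit 3^n - 2^(n+1) + 1 candidate rules. The function names are hypothetical.

```python
from itertools import combinations


def count_rules_closed_form(n_items):
    """Candidate rules A -> B with A, B disjoint and non-empty.

    Each item is in A, in B, or in neither (3**n assignments); remove the
    assignments with A empty or B empty (2**n each), adding back the one
    assignment counted twice (both empty).
    """
    return 3 ** n_items - 2 ** (n_items + 1) + 1


def count_rules_brute_force(n_items):
    """Enumerate every antecedent/consequent pair explicitly (small n only)."""
    items = list(range(n_items))
    count = 0
    for size_a in range(1, n_items):
        for antecedent in combinations(items, size_a):
            rest = [i for i in items if i not in antecedent]
            for size_c in range(1, len(rest) + 1):
                for _consequent in combinations(rest, size_c):
                    count += 1
    return count


for n in (3, 6, 10):
    print(n, count_rules_closed_form(n))
```

Even a category of ten products yields tens of thousands of candidate rules before any pruning, which is why practical tools rely on support and confidence thresholds and why the surviving rules can still be hard to interpret.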

Affinity algorithms also lack robustness. The algorithms produce a point estimate of affinity, yet retailers need to understand how (and if) these rules apply across larger groups of transactions. A similar issue is the minimum sample size needed to produce robust results. Additionally, market basket analyses carry implicit information about consumer preferences. Even when consumer identification is missing from transaction data, the data can still be grouped or segmented using data mining techniques that reveal distinct groups of consumer preferences. Affinity algorithms imply that samples are taken from homogeneous groups of customers, yet business knowledge suggests that consumers are highly varied in taste and expenditure.
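
One way to see the point-estimate problem is to attach an interval to a rule's confidence. The sketch below uses the Wilson score interval for a proportion; this is a standard statistical technique chosen here for illustration, not something the affinity algorithms themselves provide, and it assumes the baskets behave like independent draws from one homogeneous population (the very assumption questioned above). The counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
        / denom
    )
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))


# Two hypothetical rules, both with a point estimate of 75% confidence:
# one backed by 4 baskets containing X, one backed by 1000.
small = wilson_interval(3, 4)
large = wilson_interval(750, 1000)
print("4 baskets:    ", small)
print("1000 baskets: ", large)
```

Both rules report the same 75% figure, but the interval for the rule backed by only four baskets is far wider, which is the sample-size concern in concrete form.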

Another problem with affinity algorithms is that for some business questions, the market basket analyses may require the rigor of a properly designed statistical experiment. Reasoning from standard to promotional pricing, or from standard display conditions to promotional display conditions, is unwarranted. Still, much of the potential of market basket analysis stems from the capacity of retailers to manipulate product pricing, display, or other attributes to meet consumer need. An appropriate approach to address these multiple concerns may be a binomial test of proportions. The appropriate question for such a test is “Is the proportion of customers buying products A and B any higher than that of customers buying from their respective categories?” If not, the association lacks interest to the retailer.
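
A minimal sketch of such a test, in Python, is given below. It is one possible reading of the question, not a definitive procedure: it assumes that n is the number of baskets containing product A, that k of those also contain product B, and that the baseline rate p0 is the share of baskets containing B among all baskets with any product from A's category. It also treats p0 as a known constant, which a two-sample test would not. The function name and the counts are hypothetical.

```python
import math


def binomial_upper_tail(k, n, p0):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p0)."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")

    def log_pmf(i):
        # Work in log space so large n does not overflow or underflow.
        return (
            math.lgamma(n + 1)
            - math.lgamma(i + 1)
            - math.lgamma(n - i + 1)
            + i * math.log(p0)
            + (n - i) * math.log1p(-p0)
        )

    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(k, n + 1)))


# Hypothetical counts: 100 baskets contain A, 45 of them also contain B,
# and the assumed category-level baseline rate for B is 30%.
p_value = binomial_upper_tail(45, 100, 0.30)
print(f"one-sided p-value: {p_value:.4f}")
```

A small p-value suggests that the A-and-B association exceeds what the category-level rate alone would explain; a large one suggests the rule adds little beyond what the retailer already knows about the categories.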

The fact is, despite the great strides the industry has taken in gathering large amounts of relevant information, that information is still not yielding the hoped-for results. When it will do so depends, to a large extent, on the capacity of algorithms to model complex, hierarchical arrangements of goods and products.

About the Authors
Scott Cunningham is a Senior Software Engineer for @TheMoment (http://www.themoment.com), a leading provider of dynamic trading solutions. Prior to that, he was a data mining developer for the Teradata data warehouse solution with NCR.

Tej Anand is a business strategist, technical innovator, and IT practitioner. He has served as Chief Information Officer for both NetCreations, a permission based e-mail marketing company, and Golden Books, the largest publisher of children's books in North America.