Web Traffic Analysis: Understanding and Affecting Visitor Behavior
Posted by Andrew Caffrey
Introduction
With the rise of the Internet and the explosive growth in the number of Web sites, Web traffic analysis is fast becoming the Holy Grail of the information superhighway. Businesses, especially those involved in e-commerce, see it as the key to success in everything from sales and marketing to Customer Relationship Management.
But what exactly is Web traffic analysis and how is it carried out? For the most part, it begins by taking raw data on Web site usage and employing it to describe, model, and understand visitor behavior. Regardless of the type of site, commercial or otherwise, Web designers generally want visitors to reach a particular page such as the registration page or the “cash register” page. Where visitors enter a site, where they exit it, the percentage who reach key pages, and the paths they take to get there are all critical elements in determining the effectiveness of a site and what can be done to improve it.
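As a rough illustration of these measures, the short Python sketch below computes entry pages, exit pages, and the share of visits that reach a key page. It assumes the visits have already been reconstructed as ordered lists of page paths; the function name summarize_visits, the page names, and the sample data are all hypothetical.

```python
from collections import Counter

def summarize_visits(visits, key_pages):
    """Summarize entry pages, exit pages, and key-page reach.

    visits: list of visits, each an ordered list of page paths (assumed input).
    key_pages: pages the site operator wants visitors to reach.
    """
    non_empty = [v for v in visits if v]
    entries = Counter(v[0] for v in non_empty)   # first page of each visit
    exits = Counter(v[-1] for v in non_empty)    # last page of each visit
    total = len(non_empty)
    # Fraction of visits that touch each key page at least once.
    reach = {p: sum(1 for v in non_empty if p in v) / total
             for p in key_pages} if total else {}
    return entries, exits, reach

# Hypothetical sample data, for illustration only.
visits = [
    ["/index", "/products", "/order"],
    ["/index", "/about"],
    ["/ad-landing", "/products", "/register"],
    ["/index", "/products"],
]
entries, exits, reach = summarize_visits(visits, ["/order", "/register"])
print(entries.most_common(1))
print(reach)
```

On real data, the same counts would normally be computed in a database or a batch job rather than in memory, but the definitions stay the same.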
As an example, consider the Web site of a hypothetical e-commerce company, InternetBusiness.com. InternetBusiness.com operates a site consisting of 101 individual pages, including index pages, product information pages, order pages, customer registration pages, and corporate information pages. Picture it as a tree structure or node-link diagram, with the nodes representing Web pages and the links representing hyperlinks between the pages. Additionally, visitors can enter the site via ad pages, search engines, or other referring pages, and leave it via banner ads placed on InternetBusiness pages, or simply by typing a new Web address. InternetBusiness is interested in having its visitors reach three locations: its product brochure download page, its order page, and its customer registration page. Web traffic analysis allows InternetBusiness to study the effectiveness of individual pages in driving visitors to these locations, the typical paths of the visitors who reach them, as well as common "dead-end" paths and the pages that drive visitors off the site. As a result, the company is able to continually refine its site, maximizing its probability of successfully "converting" each visitor.
Pitfalls and Problems
One of the primary problems encountered in Web traffic analysis involves the data. The raw data taken from Web server logs, which includes summary information for each individual file request, is often both incomplete and noisy. Individual visits have to be "built" from the data, a task made difficult by the existence of proxy servers, firewalls, and caching. Although it is possible to use embedded scripts or cookies to track individual visitors, such methods can be easily thwarted by browser settings. Another problem arises from the sheer volume of data. Large Web sites have hundreds of thousands, if not millions, of file requests each day. This presents difficulties both in terms of database storage and run-time constraints during analysis, which underscores a third problem: the scalability of analytical techniques to massive archives.
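To make the idea of "building" visits concrete, here is a minimal Python sketch that groups log requests by a visitor key and starts a new visit whenever the gap between consecutive requests exceeds a timeout. The 30-minute timeout, the use of the IP address as the visitor key, and the function name build_visits are assumptions for illustration. As noted above, proxies and caching mean that an IP address is an imperfect stand-in for a visitor.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def build_visits(requests, timeout=timedelta(minutes=30)):
    """Group (visitor_key, timestamp, url) records into visits.

    A new visit starts when the gap between two consecutive requests from
    the same visitor key exceeds `timeout` (an assumed heuristic).
    """
    by_visitor = defaultdict(list)
    for visitor, ts, url in requests:
        by_visitor[visitor].append((ts, url))

    visits = []
    for visitor, hits in by_visitor.items():
        hits.sort()                      # order each visitor's requests by time
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:   # long silence: close the visit
                visits.append((visitor, current))
                current = []
            current.append(url)
            last_ts = ts
        visits.append((visitor, current))
    return visits

# Hypothetical log records, for illustration only.
requests = [
    ("10.0.0.1", datetime(2001, 1, 1, 9, 0), "/index"),
    ("10.0.0.1", datetime(2001, 1, 1, 9, 5), "/products"),
    ("10.0.0.2", datetime(2001, 1, 1, 9, 6), "/index"),
    ("10.0.0.1", datetime(2001, 1, 1, 11, 0), "/order"),
]
visits = build_visits(requests)
for visitor, pages in visits:
    print(visitor, pages)
```

A real pipeline would also have to filter out requests for images and other embedded files, and would need a streaming or database-backed version to cope with the data volumes described above.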
Typical Approaches to Analysis
Much of the literature on Web traffic analysis comes from computer scientists employing data mining methods. Typical techniques include cluster analysis, graph-theoretic methods, and assorted algorithms for mining sequential patterns from large datasets, as well as combinations thereof. Such techniques seek to identify similarities and regularities in visitor behavior, grouping similar visitors to facilitate Web site development and content customization. (See Dr. Bamshad Mobasher’s homepage at http://maya.cs.depaul.edu/~mobasher/ for a selection of relevant papers.) There is a pervasive focus on developing speedy and efficient algorithms for discovering the "what" to the detriment of explaining the "why," and one of the weaknesses of most current techniques, at least in the eyes of statisticians, is the lack of a rigorous theoretical underpinning.
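As a toy illustration of what mining sequential patterns can mean in practice, the Python sketch below counts contiguous page sequences of a fixed length and keeps those that occur in at least a minimum number of visits. This is a deliberately simplified stand-in, not any particular published algorithm, and the function name frequent_paths, the thresholds, and the sample visits are hypothetical.

```python
from collections import Counter

def frequent_paths(visits, length=2, min_support=2):
    """Return contiguous page sequences of `length` found in >= min_support visits."""
    counts = Counter()
    for pages in visits:
        # Use a set so each sequence is counted at most once per visit.
        seen = {tuple(pages[i:i + length])
                for i in range(len(pages) - length + 1)}
        counts.update(seen)
    return {path: n for path, n in counts.items() if n >= min_support}

# Hypothetical sample data, for illustration only.
visits = [
    ["/index", "/products", "/order"],
    ["/index", "/about"],
    ["/ad-landing", "/products", "/register"],
    ["/index", "/products"],
]
patterns = frequent_paths(visits)
print(patterns)
```

The sketch also shows the limitation raised above: it reports which paths are common, but says nothing about why visitors follow them.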
Modeling Approaches
Recently, attempts have been made to apply more rigorous statistical methodologies in developing models of visitor behavior, but it is no simple undertaking. Given that visitors can retrace their steps, there are an infinite number of ways for a visitor to traverse a site. Still, such approaches allow the standard tools of statistical analysis to be used both to analyze changes in behavior and to compare behavior across visitors or groups of visitors, as well as to facilitate prediction.
One approach that moves in this direction treats visitor histories as time series of variable length and uses model-based clustering techniques to analyze Web traffic data. Dr. Padhraic Smyth of the University of California, Irvine, has written several papers in this area that discuss using mixtures of Markov models (see http://www.ics.uci.edu/~datalab/papers.html). Such developments are clearly a step toward a more rigorous analysis of Web traffic, and hold promise for the development of a theory of Web visitor behavior.
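To give a feel for the building block behind such models, the Python sketch below estimates the transition probabilities of a single first-order Markov chain from a set of visits. It is not the mixture model itself, which would fit several such chains and assign each visit to one of them. The function name transition_probabilities, the "<exit>" end marker, and the sample visits are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(visits, end="<exit>"):
    """Estimate P(next page | current page) from observed visits.

    Each visit is an ordered list of pages; `end` is an assumed marker
    for leaving the site after the last page.
    """
    counts = defaultdict(Counter)
    for pages in visits:
        # Pair each page with its successor, ending with the exit marker.
        for src, dst in zip(pages, pages[1:] + [end]):
            counts[src][dst] += 1
    # Normalize each row of counts into probabilities.
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }

# Hypothetical sample data, for illustration only.
visits = [
    ["/index", "/products", "/order"],
    ["/index", "/about"],
    ["/ad-landing", "/products", "/register"],
    ["/index", "/products"],
]
probs = transition_probabilities(visits)
for page, row in probs.items():
    print(page, row)
```

Because the model is defined over transitions rather than whole paths, it can represent visitors who retrace their steps without having to enumerate every possible traversal of the site.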
Presentation of Web Traffic Analysis Results
In addition to data analysis, there is a secondary issue involving the efficient representation of results in a manner easily understood by the analytical layperson. Typically, Web traffic analysis tools rely on a series of simple charts and two- or three-dimensional graphs to communicate the state of Web traffic, which may not be easily assimilated by the user. As a result, some researchers are developing more efficient and intuitive methods of presenting the results of Web traffic analyses.
Conclusion
The growth of the Internet has brought to light the difficulties involved in tracking and analyzing the behavior of Web site visitors. These difficulties range from weaknesses in the raw data to the problems involved in developing rigorous, efficient, and scalable techniques to analyze that behavior. Additionally, there is the difficulty of presenting high-dimensional results to the Web site operator. Plainly, current approaches leave the door wide open to the development of a new paradigm for analyzing and understanding the behavior of Web site visitors.