

Text Mining: The Next Gold Rush
Posted by William Abrams, Contributing Editor

Mining the Future
Like gold, information is both an object of desire and a medium of exchange. Also like gold, it is rarely found just lying about. It must be mined, and by a commonly cited estimate, 80 percent of the world's electronic information exists not as numeric data, but as text. Text drives the Internet, just as it drives newspapers, magazines, and books, and it doesn’t end there. Text is also the language of customer service records, market surveys, and advertisements. It is ubiquitous, and two critical questions now face businesses, scientific endeavors, and organizations of every kind: how to retrieve the information contained within that text, and how best to use it once you have it. These are the provinces of text mining, one of the newest fronts in the information revolution.

The Problem
Unlike data mining, which focuses on the well-structured collections that exist in relational databases and data warehouses, text mining focuses on material that is far less structured. This lack of structure is the reason so much irrelevant information tends to come back when you type a word into a typical search engine and hit the Enter key. In addition, it has become abundantly clear that simply adding more words to your query is no guarantee of greater accuracy. In fact, the opposite is often true, which is why businesses and scientists are increasingly applying statistical analysis and other tools to the problem. The goal is to intelligently exploit the ever-increasing flow of textual data.

“Traditionally, information has been retrieved by literally matching terms in a user’s query with terms extracted from the documents within a collection,” explains Dr. Eric Jiang, Assistant Professor of Computer Science at the University of San Diego and a senior staff scientist with Stone Analytics. “Unfortunately, these lexical-based retrieval techniques can be incomplete and inaccurate due to the variability in word usage. On one side, people typically use different words to describe the same concept [synonymy]. On the other, many words can have multiple meanings [polysemy].”

To circumvent these problems, Jiang himself employs a method called Latent Semantic Indexing (LSI). In LSI, a document is expressed as a vector whose coefficients are calculated from the occurrence and frequency of different terms inside the document. LSI goes much deeper than the simple co-occurrence of words. It assumes that there is an underlying semantic structure in the way words are used across a document collection, a structure that can be tapped by replacing individual literal terms with statistically derived conceptual indices.
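The idea can be illustrated with a minimal Python sketch. It assumes raw term counts as the vector coefficients and a truncated singular value decomposition as the statistical step, which is the usual textbook formulation of LSI; the article does not say which variant Jiang uses. The function names, the five toy documents, and the choice of two latent dimensions are all hypothetical, and the sketch relies on the NumPy library.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw term counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, index

def lsi_scores(docs, query, k=2):
    """Cosine similarity between the query and each document in a k-dimensional latent space."""
    A, index = build_term_document_matrix(docs)
    # Truncated SVD: keep only the k largest singular values and their vectors.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T
    # Build the query's term vector, ignoring words outside the vocabulary.
    q = np.zeros(A.shape[0])
    for w in query.lower().split():
        if w in index:
            q[index[w]] += 1.0
    # Fold the query into the latent space: q_hat = q^T U_k S_k^{-1}.
    q_hat = (q @ U_k) / s_k
    # Compare the folded query with each document's latent coordinates.
    denom = np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat)
    return (V_k @ q_hat) / np.where(denom == 0, 1.0, denom)

# Toy collection (illustrative only): three documents about cars, two about gold.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "gold mining ore",
    "gold ore price",
]
scores = lsi_scores(docs, "car")
for doc, score in zip(docs, scores):
    print(f"{score:+.2f}  {doc}")
```

On this toy collection, the query "car" scores close to 1 against "automobile engine repair", a document that a literal term match would miss because it never contains the word "car". The two gold-mining documents score close to 0. Real collections are far noisier, and the number of latent dimensions is a tuning choice.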

Summaries and Technical Support
Essentially, semantic analysis uses statistical tools and grammatical rules to look for the meaning underlying a text. That information can then be used to create a summary of a given document. Such summaries serve the same purpose as traditional abstracts, which readers use to sift through articles and papers, and they are especially useful when a search is likely to bring back dozens, if not hundreds, of potentially relevant documents.

Text mining is also finding use in the categorization and routing of email correspondence, and even in technical support. Not all customers are as organized or technically sophisticated as companies would like. When calling for help, some customers are not even able to provide the serial number for their particular product, and even when they are, they may have difficulty describing the exact nature of their problem. Furthermore, not all customer issues are indexed. Text mining is well suited to sifting through specification documents, user guides, and even transcripts of other customer calls to quickly retrieve the required information.

Although text mining remains a relatively new field, its importance is by no means unrecognized. A number of firms already offer a variety of consulting services as well as a wide range of tools, including everything from one-size-fits-all products to more specialized solutions geared to analyzing complex collections of documents like scientific publications, newswires, and other press reports.

Ongoing Development
One of the keys to text mining is the classification of words, a subject that is hardly new. “It is now nearly 50 years since I first projected a system of verbal classification similar to that on which the present work is founded.” So wrote Dr. Peter Roget in the preface to the first edition of his famous thesaurus, which collected words in related categories such as Hearing, Deafness, Sound, Silence, Faintness of Sound, and Loudness. The year was 1852.

It is this idea that is now being expanded by WordNet. WordNet, a product of an ongoing research project at Princeton University, is an electronic lexical reference system for the English language. Its design is inspired by current psycholinguistic theories of human lexical memory. It is a popular tool among text and natural language researchers, and its relative comprehensiveness and free-of-charge distribution, along with its numerous tools and associated data sets, give it the potential to become a standard. It has already inspired compatible developments in other languages, paving the road to multi-lingual and cross-lingual applications.

In WordNet, English nouns, verbs, adjectives, and adverbs are organized into synonym sets called synsets. Each synset consists of a list of synonymous word forms representing one underlying lexical concept. For instance, {industry, manufacture, manufacturing} is one such synset. WordNet contains approximately 122,000 terms grouped into approximately 99,000 synsets and is constantly being updated and expanded. Semantic pointers describe relationships between the synsets. The main relationships are the hyponym/hypernym relationship (“is a kind of”) and the meronym/holonym relationship (“is part of”). An “oak” is a kind of “tree,” which is an example of the first; a “pitcher” is part of a “baseball team,” which is an example of the second.
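To make the structure concrete, here is a toy in-memory sketch in Python. It is not WordNet's actual database format or API; the dictionary layout, the synset identifiers, and the helper functions are hypothetical, and the "tree is a kind of plant" link is added only to show that the pointers can be followed transitively. The remaining entries come from the examples above.

```python
# Each synset groups word forms that share one underlying lexical concept.
SYNSETS = {
    "industry": {"industry", "manufacture", "manufacturing"},
    "oak": {"oak"},
    "tree": {"tree"},
    "plant": {"plant"},  # illustrative addition, not from the article
    "pitcher": {"pitcher"},
    "baseball_team": {"baseball team"},
}

# Semantic pointers between synsets, keyed by synset identifier.
HYPERNYM = {"oak": "tree", "tree": "plant"}  # "is a kind of"
HOLONYM = {"pitcher": "baseball_team"}       # "is part of"

def synonyms(word):
    """Return the other word forms that share a synset with `word`."""
    found = set()
    for forms in SYNSETS.values():
        if word in forms:
            found |= forms - {word}
    return found

def is_kind_of(synset_id, ancestor_id):
    """Follow hypernym pointers upward until the ancestor is found or the chain ends."""
    current = HYPERNYM.get(synset_id)
    while current is not None:
        if current == ancestor_id:
            return True
        current = HYPERNYM.get(current)
    return False

def is_part_of(synset_id, whole_id):
    """Check a direct meronym/holonym pointer."""
    return HOLONYM.get(synset_id) == whole_id

print(synonyms("manufacture"))
print(is_kind_of("oak", "plant"))
print(is_part_of("pitcher", "baseball_team"))
```

A retrieval system could use a structure like this to expand a query for "manufacture" with the other members of its synset, or to treat a document about oaks as relevant to a query about trees.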

Along with new statistical models and more powerful processors, reference tools like WordNet are sure to play a major role as text mining continues to advance in the coming years.

Limitations
In spite of its vast potential, it must be noted that text mining is still in its infancy. For example, the empirical results from a number of recent studies on document retrieval place the improvement rate over searches done strictly by the co-occurrence of words at between 5 and 10%. Still, in many instances, the increased accuracy represented by even such small percentages is not merely significant, but substantial, and researchers point out that they have been able to consistently improve their results.