

Bayesian Networks, Causal Inference and Knowledge Discovery
Posted by Dr. Judea Pearl

Introduction
One of the most exciting prospects in recent years has been the possibility of using graphical models to discover causal structures in raw statistical data [Pearl and Verma, 1991; Spirtes et al., 1993; Pearl, 2000], a task previously considered impossible without controlled experiments. Consider, for example, the following intransitive pattern of dependencies among three events: A and B are dependent, B and C are dependent, yet A and C are independent. If you ask a person to supply an example of three such events, the example would invariably portray A and C as two independent causes and B as their common effect, namely A -> B <- C. For instance, A and C could be the outcomes of two fair coins, while B represents a bell that rings whenever either coin comes up heads. Fitting this dependence pattern with a scenario in which B is the cause and A and C are the effects is mathematically feasible but very unnatural, because it entails fine-tuning the probabilities involved. In other words, the desired dependence pattern will be destroyed as soon as the probabilities undergo a slight change.
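The coin-and-bell example can be checked by direct enumeration. The following is a minimal Python sketch, not part of the original argument; the helper names `prob` and `dependent` are illustrative, and the only modeling assumptions are the ones stated in the example (two independent fair coins, a bell that rings when either shows heads).

```python
from fractions import Fraction
from itertools import product

# Collider A -> B <- C: A and C are independent fair coins (1 = heads),
# and the bell B rings (1) whenever either coin comes up heads.
joint = {}
for a, c in product((0, 1), repeat=2):
    b = int(a == 1 or c == 1)
    joint[(a, b, c)] = Fraction(1, 4)

INDEX = {"a": 0, "b": 1, "c": 2}

def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(p for outcome, p in joint.items()
               if all(outcome[INDEX[k]] == v for k, v in fixed.items()))

def dependent(x, y):
    """True if x and y are marginally dependent, i.e. P(x, y) != P(x) P(y)."""
    return any(prob(**{x: i, y: j}) != prob(**{x: i}) * prob(**{y: j})
               for i, j in product((0, 1), repeat=2))

ab_dependent = dependent("a", "b")
bc_dependent = dependent("b", "c")
ac_dependent = dependent("a", "c")
print(ab_dependent, bc_dependent, ac_dependent)
```

Exact fractions are used so that the independence test is an exact equality rather than a floating-point comparison.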

Such thought experiments tell us that certain patterns of dependency that are totally void of temporal information are conceptually characteristic of certain causal directionalities and not others. When put together systematically, such patterns can be used to infer causal structures from raw data and to guarantee that any alternative structure compatible with the data must be less stable than the one(s) inferred; that is, slight fluctuations in parameters will eventually render the alternative structure incompatible with the data.

Using this mild assumption of stability, methods have been developed for identifying genuine and spurious causes, with or without temporal information (Pearl, 2000; Chapter 2).

Bayesian Networks
The nodes in a Bayesian network represent variables of interest (e.g., the temperature of a device, the gender of a patient, the price of a product, the occurrence of an event) and the links represent informational or causal dependencies among the variables. The dependencies are quantified by conditional probabilities for each node given its parents in the network. The network supports the computation of the probabilities of any subset of variables given evidence about any other subset.

Figure 1 illustrates a simple yet typical Bayesian network. It describes the causal relationships among the season of the year (X1), whether it’s raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5). Here, the absence of a direct link between X1 and X5, for example, captures our understanding that there is no direct influence of season on slipperiness; the influence is mediated by the wetness of the pavement. (If freezing is a possibility, then a direct link could be added.)

Perhaps the most important aspect of Bayesian networks is that they are direct representations of the world, not of reasoning processes. The arrows in the diagram represent real causal connections and not the flow of information during reasoning, as in rule-based systems and neural networks. Inferences can be derived from Bayesian networks by propagating information in any direction. For example, if the sprinkler is on, then the pavement is probably wet (prediction); if someone slips on the pavement, that also provides evidence that it is wet (abduction). On the other hand, if we see that the pavement is wet, that makes it more likely that the sprinkler is on or that it is raining (abduction); but if we then observe that the sprinkler is on, that reduces the likelihood that it is raining (explaining away). It is this last form of reasoning, explaining away, that is especially difficult to model in rule-based systems and neural networks in any natural way.
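The inference patterns described above can be reproduced on a small numerical version of the sprinkler network. The sketch below is illustrative only: the article gives no probabilities, so every number in the tables is a made-up assumption, the season is simplified to two values, and the brute-force `query` function is a hypothetical helper rather than an efficient propagation algorithm.

```python
from itertools import product

# Hypothetical conditional probability tables (not from the article).
P_SEASON = {"dry": 0.5, "rainy": 0.5}                # P(X1)
P_RAIN = {"dry": 0.1, "rainy": 0.6}                  # P(X2=True | X1)
P_SPRINKLER = {"dry": 0.7, "rainy": 0.1}             # P(X3=True | X1)
P_WET = {(False, False): 0.0, (False, True): 0.9,
         (True, False): 0.9, (True, True): 0.99}     # P(X4=True | X2, X3)
P_SLIPPERY = {False: 0.0, True: 0.8}                 # P(X5=True | X4)

def bern(p_true, value):
    """Probability of a Boolean value, given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(season, rain, sprinkler, wet, slippery):
    """Product of each node's probability given its parents."""
    return (P_SEASON[season]
            * bern(P_RAIN[season], rain)
            * bern(P_SPRINKLER[season], sprinkler)
            * bern(P_WET[(rain, sprinkler)], wet)
            * bern(P_SLIPPERY[wet], slippery))

NAMES = ("season", "rain", "sprinkler", "wet", "slippery")
WORLDS = [dict(zip(NAMES, values))
          for values in product(P_SEASON, *([(False, True)] * 4))]

def query(target, evidence):
    """P(target is True | evidence), summing the joint over all worlds."""
    consistent = [w for w in WORLDS
                  if all(w[k] == v for k, v in evidence.items())]
    total = sum(joint(**w) for w in consistent)
    return sum(joint(**w) for w in consistent if w[target]) / total

p_rain = query("rain", {})
p_wet_given_sprinkler = query("wet", {"sprinkler": True})        # prediction
p_wet_given_slip = query("wet", {"slippery": True})              # abduction
p_rain_given_wet = query("rain", {"wet": True})                  # abduction
p_rain_given_wet_sprinkler = query("rain", {"wet": True, "sprinkler": True})

print(f"P(rain)                 = {p_rain:.3f}")
print(f"P(wet | sprinkler)      = {p_wet_given_sprinkler:.3f}")
print(f"P(wet | slippery)       = {p_wet_given_slip:.3f}")
print(f"P(rain | wet)           = {p_rain_given_wet:.3f}")
print(f"P(rain | wet, sprinkler) = {p_rain_given_wet_sprinkler:.3f}")
```

With these assumed numbers, seeing wet pavement raises the probability of rain above its prior, and additionally seeing the sprinkler on lowers it again. In this particular network part of that drop also flows through the season node, since an active sprinkler is itself evidence of a dry season.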

Causal Networks
All probabilistic models, no matter how refined and accurate, Bayesian included, describe a distribution over possible observed events, but say nothing about what will happen if a certain intervention occurs. For example, what if I turn on the sprinkler? What effect does that have on the season, or on the connection between wetness and slipperiness? A causal network is a Bayesian network with the added property that the parents of each node are its direct causes. In such a network, the result of an intervention is obvious: the sprinkler node is set to “on” and the causal link between the season and the sprinkler is removed. All other causal links and conditional probabilities remain intact. This added property endows the causal network with the capability of representing and responding to external or spontaneous changes. For example, to represent a disabled sprinkler in the story of Figure 1, we simply delete from the network all links incident to the node Sprinkler. To represent the policy of turning the sprinkler off if it rains, we simply add a link between Rain and Sprinkler. Such changes would require much greater remodeling efforts if the network were not constructed along the causal direction. This remodeling flexibility may well be cited as the ingredient that marks the division between deliberative systems and reactive systems such as neural networks, and that enables the former to manage novel situations instantaneously, without requiring training or adaptation.
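The difference between observing that the sprinkler is on and turning it on can be made concrete with a small sketch. The code below covers only a three-node fragment of Figure 1 (Season -> Rain and Season -> Sprinkler); all probabilities are hypothetical, and the `do_sprinkler` parameter is an illustrative way of expressing the link removal described above, not a standard API.

```python
# Hypothetical numbers for a three-node fragment of Figure 1.
P_SEASON = {"dry": 0.5, "rainy": 0.5}
P_RAIN = {"dry": 0.1, "rainy": 0.6}        # P(Rain=True | Season)
P_SPRINKLER = {"dry": 0.7, "rainy": 0.1}   # P(Sprinkler=True | Season)

def bern(p_true, value):
    """Probability of a Boolean value, given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(season, rain, sprinkler, do_sprinkler=None):
    """Joint probability; do_sprinkler=True/False models an intervention."""
    if do_sprinkler is None:
        # No intervention: the sprinkler still depends on the season.
        p_sprinkler = bern(P_SPRINKLER[season], sprinkler)
    else:
        # Intervention: the Season -> Sprinkler link is removed and the
        # node is clamped to the chosen value.
        p_sprinkler = 1.0 if sprinkler == do_sprinkler else 0.0
    # The Season and Rain mechanisms are left intact in both cases.
    return P_SEASON[season] * bern(P_RAIN[season], rain) * p_sprinkler

def p_rain_given_sprinkler_on(do_sprinkler=None):
    """P(Rain=True | Sprinkler=True), with or without an intervention."""
    num = sum(joint(s, True, True, do_sprinkler) for s in P_SEASON)
    den = sum(joint(s, r, True, do_sprinkler)
              for s in P_SEASON for r in (False, True))
    return num / den

p_rain = sum(P_SEASON[s] * P_RAIN[s] for s in P_SEASON)
p_rain_seen = p_rain_given_sprinkler_on()                   # observation
p_rain_done = p_rain_given_sprinkler_on(do_sprinkler=True)  # intervention

print(f"P(rain)                    = {p_rain:.4f}")
print(f"P(rain | sprinkler on)     = {p_rain_seen:.4f}")
print(f"P(rain | do(sprinkler on)) = {p_rain_done:.4f}")
```

Under these assumptions, observing the sprinkler on changes the probability of rain, because it is evidence about the season, whereas switching it on by hand leaves the probability of rain at its prior value. Only the sprinkler's own mechanism is edited; the other tables are reused unchanged.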

Causal Structures and Knowledge Mining
Many statistical routines are currently being developed under the enterprises of “knowledge mining” or “knowledge discovery,” but none deserves this fancy title, because knowledge connotes stable relationships, invariant to local interventions, and transportable across contexts—statistical routines are blind to considerations of stability. The general attitude is that statistical associations alone would be sufficient in prediction tasks that involve no manipulation.

This attitude is shortsighted. First, black-box predictions are not as useful as those that are accompanied by a causal understanding of the underlying processes. For example, when a statistical package predicts that customers who purchased product A are likely to purchase product B in the future, the question always arises whether the association discovered is long-lived, and whether it is transportable across contexts. If one product is functionally supplementary to another, the association between the two demands is stable. If, on the other hand, demands for products A and B are correlated merely because the two were advertised simultaneously in the same medium, the association is short-lived, and will disappear as soon as advertising strategies change.

Second, models are rarely used exclusively for passive predictions. Using an e-commerce example again, vendors constantly try new techniques of presentation, and new methods of capturing users’ attention. These changes are the commercial analogue of scientific experimentation, and only causal models can capture the results of these experiments so as to predict response to future changes.

Finally, even purely predictive tasks can benefit from the modularity inherent in causal models. When some conditions in the environment undergo change, it is usually only a few causal mechanisms that are affected by the change; the rest remain unaltered. It is simpler and more effective, then, to reassess (judgmentally) or re-estimate (statistically) the model parameters knowing that the corresponding change in the model is also local, involving just a few parameters, than to re-estimate the entire model from scratch. In non-causal systems, such as neural nets or those based on regression equations, a local change in mechanism space would spread its effect over all model parameters, and that normally requires a major effort of re-estimation or re-training.

Where Does the Structure Come From?
In many applications, users of statistical methods possess valuable theoretical and professional knowledge (e.g., symptoms do not cause diseases) that permits one to combine causal and statistical information effectively: the human expert provides the qualitative causal structure, and the data provide the basis for assessing the strengths of the causal connections. This symbiosis was in fact the motivating paradigm behind econometric modeling, before it went into hiding. (Most econometric texts in the past decade have refrained from defining what an economic model is, and those that attempt a definition erroneously view models as compact representations of density functions [see Pearl, 2000, pp. 135-138].) However, there have been two major (mental) barriers to implementing this symbiosis: (1) investigators (especially statisticians) are reluctant to state causal information explicitly, because such information cannot be tested directly in non-experimental data; and (2) causal information, even when tested, cannot be expressed in the standard vocabulary of probability calculus. The second barrier, in my view, far outweighs the first, and the development of new mathematical tools for causation, both algebraic and graphical, now promises to reinstate causal modeling to its proper place in data interpretation and knowledge mining.

References
J. Pearl. Causality. Cambridge University Press, New York, NY, 2000.

J. Pearl and S. Russell. Bayesian networks. In M. Arbib, editor, Handbook of Brain Theory and Neural Networks, second edition. MIT Press, forthcoming, 2001.

J. Pearl and T. Verma. A theory of inferred causation. In J.A. Allen, R. Fikes, and E. Sandewall, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, pages 441-452. Morgan Kaufmann, San Mateo, CA, 1991.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, New York, 1993.