Shmueli's argument is that the terms "predictive" and "explanatory" have become conflated in the statistical modeling context, and that the statistical literature lacks a thorough discussion of the differences. In the paper, she contrasts the two and discusses their practical implications. I encourage you to read the paper.
The question I'd like to pose to the practitioner community is: how do you know which one to use? It would be useful if you could talk about specific applications. Predictive modelling is all about "What is likely to happen?" I think the main difference is what is intended to be done with the analysis. I would suggest explanation is much more important for intervention than prediction.
If you want to do something to alter an outcome, then you had best be looking to explain why it is the way it is. Explanatory modelling, if done well, will tell you how to intervene: which input should be adjusted.
However, if you simply want to understand what the future will be like, without any intention or ability to intervene, then predictive modelling is more likely to be appropriate. Predictive modelling using "cancer data" would be appropriate, or at least useful, if you were funding the cancer wards of different hospitals.
You don't really need to explain why people get cancer; rather, you only need an accurate estimate of how much service will be required. Explanatory modelling probably wouldn't help much here. For example, knowing that smoking leads to higher risk of cancer doesn't on its own tell you whether to give more funding to ward A or ward B. Explanatory modelling of "cancer data" would be appropriate if you wanted to decrease the national cancer rate - predictive modelling would be of little use here.
The ability to accurately predict cancer rates is hardly likely to help you decide how to reduce them. However, knowing that smoking leads to higher risk of cancer is valuable information - because if you decrease smoking rates, cancer rates should fall as well. Looking at the problem this way, I would think that explanatory modelling would mainly focus on variables which are in the control of the user, either directly or indirectly.
There may be a need to collect other variables, but if you can't change any of the variables in the analysis, then I doubt that explanatory modelling will be useful, except perhaps to motivate you to gain control or influence over the variables that matter.
Predictive modelling, crudely, just looks for associations between variables, whether controlled by the user or not. Is it true that exercising regularly, say 30 minutes per day, leads to lower blood pressure? To answer this question we may collect data from patients about their exercise regimen and their blood pressure values over time. The goal is to see if we can explain variations in blood pressure by variations in exercise regimen. Blood pressure is affected not only by exercise but by a wide variety of other factors as well, such as the amount of sodium a person eats.
These other factors would be considered noise in the above example as the focus is on teasing out the relationship between exercise regimen and blood pressure. When doing a predictive exercise, we are extrapolating into the unknown using the known relationships between the data we have at hand.
If I exercise 1 hour per day, to what extent is my blood pressure likely to drop? To answer this question, we may use a previously uncovered relationship between blood pressure and exercise regimen to perform the prediction.
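A toy sketch of the two exercises in sequence (all data simulated; the numbers are hypothetical): first estimate the exercise/blood-pressure relationship from data, then use the fitted relationship to predict blood pressure at one hour of exercise per day.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical numbers): minutes of daily exercise
# and systolic blood pressure, with a true slope of -0.2 mmHg per minute.
exercise = rng.uniform(0, 60, size=200)
bp = 135 - 0.2 * exercise + rng.normal(0, 8, size=200)

# Explanatory step: estimate the exercise/blood-pressure relationship.
slope, intercept = np.polyfit(exercise, bp, 1)

# Predictive step: extrapolate to 60 minutes of exercise per day.
predicted_bp = intercept + slope * 60
print(round(slope, 2), round(predicted_bp, 1))
```

The same fitted line serves both purposes here; the two goals diverge once noise variables, correlated predictors, and model complexity enter the picture.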
In the above context, the focus is not on explanation, although an explanatory model can help with the prediction process. There are also non-explanatory approaches, e.g., black-box machine-learning methods, that can predict well without explaining anything.
One practical issue that arises here is variable selection in modelling. A variable can be an important explanatory variable (e.g., highly statistically significant) yet contribute little or nothing to predictive accuracy. I see this mistake almost every day in published papers. Another difference lies in the distinction between principal components analysis and factor analysis. PCA is often used in prediction, but is not so useful for explanation.
FA involves the additional step of rotation, which is done to improve interpretation and hence explanation. There is a nice post today on Galit Shmueli's blog about this. Another issue is predictors of predictors: for example, home loans may be strongly related to GDP, but that isn't much use for predicting future home loans unless we also have good predictions of GDP. Here is a deck of slides that I use in my data mining course to teach linear regression from both angles. Even with linear regression alone, and with this tiny example, various issues emerge that lead to different models for explanatory vs. predictive purposes.
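To make the variable-selection pitfall concrete: with a large enough sample, a predictor can be highly statistically significant while explaining almost none of the outcome's variance, so it matters for explanation-style inference yet adds nothing useful to prediction. A minimal simulation (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)
# A tiny true effect: detectable at this sample size,
# but practically useless for prediction.
y = 0.02 * x + rng.normal(size=n)

slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
resid = y - slope * x
se = np.std(resid, ddof=1) / (np.std(x, ddof=1) * np.sqrt(n))
t_stat = slope / se  # clearly "significant"

r_squared = np.corrcoef(x, y)[0, 1] ** 2  # yet essentially zero R^2
print(round(t_stat, 1), round(r_squared, 5))
```

The t-statistic clears any conventional threshold while the R-squared is negligible, which is exactly the situation where "important explanatory variable" and "useful predictor" come apart.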
A classic example that I have seen is in the context of predicting human performance. Self-efficacy (a person's belief in their own capability) correlates strongly with subsequent task performance. Thus, if you put self-efficacy into a multiple regression along with other variables such as intelligence and degree of prior experience, you often find that self-efficacy is a strong predictor. This has led some researchers to suggest that self-efficacy causes task performance.
And that effective interventions are those which focus on increasing a person's sense of self-efficacy. However, the alternative theoretical model sees self-efficacy largely as a consequence of task performance. In this framework, interventions should focus on increasing actual competence rather than perceived competence. Thus, including a variable like self-efficacy might improve prediction, but if you adopt the self-efficacy-as-consequence model, it should not be included as a predictor when the aim of the model is to elucidate the causal processes influencing performance.
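A small simulation can illustrate the point (the causal structure and numbers here are hypothetical): if self-efficacy is generated purely as a consequence of performance, it still comes out as the strongest "predictor" in a regression of performance on ability and self-efficacy.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical causal structure: ability drives performance, and
# self-efficacy is generated as a *consequence* of performance.
ability = rng.normal(size=n)
performance = ability + rng.normal(size=n)
self_efficacy = performance + rng.normal(scale=0.5, size=n)

# Regress performance on ability and self-efficacy by least squares.
X = np.column_stack([np.ones(n), ability, self_efficacy])
coefs, *_ = np.linalg.lstsq(X, performance, rcond=None)

# The downstream variable dominates the regression even though
# it is an effect, not a cause, of performance.
print(coefs.round(2))
```

The self-efficacy coefficient dwarfs the ability coefficient, which is fine for a prediction machine but badly misleading if read causally.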
This of course raises the issue of how to develop and validate a causal theoretical model. This clearly relies on multiple studies, ideally with some experimental manipulation, and a coherent argument about dynamic processes. I've seen similar issues when researchers are interested in the effects of distal and proximal causes. Proximal causes tend to predict better than distal causes.
However, theoretical interest may be in understanding the ways in which distal and proximal causes operate. Finally, a huge issue in social science research is variable selection. In any given study, there is an infinite number of variables that could have been measured but weren't. Thus, interpretation of models needs to consider the implications of this when making theoretical interpretations.
"Statistical Modeling: The Two Cultures" by L. Breiman is, perhaps, the best paper on this point. His main conclusions (see also the replies from other prominent statisticians at the end of the document) are well worth reading. I haven't read her work beyond the abstract of the linked paper, but my sense is that the distinction between "explanation" and "prediction" should be thrown away and replaced with the distinction between the aims of the practitioner, which are either "causal" or "predictive". In general, I think "explanation" is such a vague word that it means nearly nothing.
For example, is Hooke's Law explanatory or predictive? On the other end of the spectrum, are predictively accurate recommendation systems good causal models of explicit item ratings? I think we all share the intuition that the goal of science is explanation, while the goal of technology is prediction; and this intuition somehow gets lost in consideration of the tools we use, like supervised learning algorithms, that can be employed for both causal inference and predictive modeling, but are really purely mathematical devices that are not intrinsically linked to "prediction" or "explanation".
Having said all of that, maybe the only word that I would apply to a model is interpretable. Regressions are usually interpretable; neural nets with many layers are often not so. I think people sometimes naively assume that a model that is interpretable is providing causal information, while uninterpretable models only provide predictive information.
This attitude seems simply confused to me. I am still a bit unclear as to what the question is. Having said that, to my mind the fundamental difference between predictive and explanatory models is the difference in their focus. By definition explanatory models have as their primary focus the goal of explaining something in the real world.
In most instances, we seek to offer simple and clean explanations. By simple I mean parsimony: explain the phenomenon with as few parameters as possible. By clean I mean that we would like to make clear statements about how each variable affects the outcome. Given these goals of simple and clean explanations, explanatory models penalize overly complex models using appropriate criteria such as AIC, and prefer to obtain orthogonal (independent) variables, either via controlled experiments or via suitable data transformations.
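One way to operationalize that parsimony preference is an information criterion. A minimal sketch (simulated data; the AIC here is the standard Gaussian-OLS form up to an additive constant):

```python
import numpy as np

def aic_ols(y, X):
    """AIC for an OLS fit with Gaussian errors (up to an additive constant)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (k + 1)  # +1 for the error variance

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)  # the true relationship is a simple line

X_simple = np.column_stack([np.ones(n), x])
# A needlessly complex model: 30 pure-noise predictors added on.
X_complex = np.column_stack([X_simple, rng.normal(size=(n, 30))])

# The parsimonious model wins on AIC despite its slightly larger RSS.
print(aic_ols(y, X_simple) < aic_ols(y, X_complex))
```

The complex model always fits the sample a bit better, but the penalty term charges it for the extra parameters, which is exactly the simplicity preference described above.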
The goal of predictive models is to predict something. Thus, they tend to focus less on parsimony or simplicity but more on their ability to predict the dependent variable. However, the above is somewhat of an artificial distinction as explanatory models can be used for prediction and sometimes predictive models can explain something.
With respect, this question could be better focused. Have people ever used one term when the other was more appropriate?
Sometimes it's clear enough from context, or you don't want to be pedantic. Sometimes people are just sloppy or lazy in their terminology. This is true of many people, and I'm certainly no better. What's of potential value here is discussing explanation vs. prediction as distinct modeling goals. In short, the distinction centers on the role of causality.
If you want to understand some dynamic in the world, and explain why something happens the way it does, you need to identify the causal relationships amongst the relevant variables. To predict, you can ignore causality. For example, you can predict an effect from knowledge about its cause; you can predict the existence of the cause from knowledge that the effect occurred; and you can predict the approximate level of one effect by knowledge of another effect that is driven by the same cause.
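The second kind of prediction, inferring a cause from an observed effect, is just Bayes' rule run against the causal arrow. A minimal sketch with hypothetical numbers:

```python
# Predicting a cause from an observed effect via Bayes' rule.
# All numbers are hypothetical.
p_cause = 0.01                  # base rate of the cause
p_effect_given_cause = 0.90     # effect is likely when the cause is present
p_effect_given_no_cause = 0.05  # the effect sometimes occurs anyway

p_effect = (p_effect_given_cause * p_cause
            + p_effect_given_no_cause * (1 - p_cause))
p_cause_given_effect = p_effect_given_cause * p_cause / p_effect
print(round(p_cause_given_effect, 3))  # observing the effect raises 0.01 to ~0.154
```

Nothing in this calculation requires knowing the causal mechanism; the conditional probabilities alone suffice, which is why prediction can run in either direction while explanation cannot.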
Why would someone want to be able to do this? To increase their knowledge of what might happen in the future, so that they can plan accordingly. For example, a parole board may want to be able to predict the probability that a convict will recidivate if paroled.
However, this is not sufficient for explanation. Of course, estimating the true causal relationship between two variables can be extremely difficult. In addition, models that do capture what are thought to be the real causal relationships are often worse for making predictions. So why do it, then? First, most of this is done in science, where understanding is pursued for its own sake.
Second, if we can reliably pick out true causes, and can develop the ability to affect them, we can exert some influence over the effects. With regard to the statistical modeling strategy, there isn't a large difference. Primarily the difference lies in how to conduct the study. If your goal is to be able to predict, find out what information will be available to users of the model when they will need to make the prediction.
Information they won't have access to is of no value. If they will most likely want to be able to predict at a certain level or within a narrow range of the predictors, try to center the sampled range of the predictor on that level and oversample there.
For instance, if a parole board will mostly want to know about criminals with 2 major convictions, you might gather info about criminals with 1, 2, and 3 convictions. On the other hand, assessing the causal status of a variable basically requires an experiment. That is, experimental units need to be assigned at random to prespecified levels of the explanatory variables.
If there is concern about whether or not the nature of the causal effect is contingent on some other variable, that variable must be included in the experiment. If it is not possible to conduct a true experiment, then you face a much more difficult situation, one that is too complex to go into here. Brad Efron, one of the commentators on The Two Cultures paper, made the following observation (as discussed in my earlier question):
Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from hand-written scrawls. Peter Gregory undertook his study for prediction purposes, but also to better understand the medical basis of hepatitis.
Most statistical surveys have the identification of causal factors as their ultimate goal. Medicine places a heavy weight on model fitting as an explanatory process (examining the fitted distribution, residuals, etc.).
Other fields are less concerned with this, and will be happy with a "black box" model that has a very high predictive success. This can work its way into the model building process as well.
Most of the answers have helped clarify what modeling for explanation and modeling for prediction are and why they differ. What is not clear, thus far, is how they differ. So, I thought I would offer an example that might be useful. Suppose we are interested in modeling College GPA as a function of academic preparation.
As measures of academic preparation, we have several correlated variables (say, high school grades and standardized test scores). If the goal is prediction, I might use all of these variables simultaneously in a linear model, and my primary concern would be predictive accuracy. Whichever of the variables prove most useful for predicting College GPA would be included in the final model. If the goal is explanation, I might be more concerned about data reduction and think carefully about the correlations among the independent variables. My primary concern would be interpreting the coefficients.
In a typical multivariate problem with correlated predictors, it would not be uncommon to observe regression coefficients that are "unexpected". Given the interrelationships among the independent variables, it would not be surprising to see partial coefficients for some of these variables that are not in the same direction as their zero-order relationships and which may seem counter intuitive and tough to explain.
This is not a problem for prediction, but it does pose problems for an explanatory model where such a relationship is difficult to interpret. This model might provide the best out of sample predictions but it does little to help us understand the relationship between academic preparation and College GPA.
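A toy simulation (hypothetical numbers) of such a sign flip: a measure that is positively correlated with GPA on its own can take a negative partial coefficient once a correlated measure enters the model; this is harmless for prediction but awkward to explain.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

# Two highly correlated preparation measures (hypothetical, e.g. two
# related test scores); GPA loads positively on x1 and negatively on x2.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)
gpa = x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# Zero-order: x2 is positively correlated with GPA on its own...
r_x2 = np.corrcoef(x2, gpa)[0, 1]

# ...but its partial regression coefficient is negative.
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, gpa, rcond=None)

print(round(r_x2, 2), coefs.round(2))
```

The predictions from the full model are fine; it is only the causal reading of the negative coefficient that gets you into trouble.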
Instead, an explanatory strategy might seek some form of variable reduction, such as principal components, factor analysis, or SEM, to combine the correlated measures into a smaller number of interpretable dimensions. Strategies such as these might reduce the predictive power of the model, but they may yield a better understanding of how Academic Preparation is related to College GPA. Predictive modeling is what happens in most analyses. For example, a researcher sets up a regression model with a bunch of predictors.
The regression coefficients then represent predictive comparisons between groups. The predictive aspect comes from the probability model: the purpose of this model is to predict new outcomes for units emerging from this superpopulation. Often, this is a vain objective, because things are always changing, especially in the social world.
Or because your model is about rare units such as countries and you cannot draw a new sample. The usefulness of the model in this case is left to the appreciation of the analyst. When you try to generalize the results to other groups or future units, this is still prediction but of a different kind.
We may call it forecasting, for example. The key point is that the predictive power of estimated models is, by default, of a descriptive nature. You compare an outcome across groups and hypothesize a probability model for these comparisons, but you cannot conclude that these comparisons constitute causal effects. There is an enormous amount of information available on the internet and in libraries. A literature search may include magazines, newspapers, trade literature, and academic literature.
Another method is the depth interview. The interviewees can be professionals or people outside the organization; anyone with information related to the problem is a strong candidate for a depth interview. A further method is the focus group: a gathering of people who have a common objective and information about the specific problem at hand. When selecting the members, it should be kept in mind that each individual has information about the problem.
Researchers can understand and tackle the problem more efficiently by dealing with carefully selected cases of the phenomenon.
Analysis of an organization that has gone through the same situation can help in dealing with the problem more efficiently. For example, L.L. Bean is renowned for its excellent order fulfillment; hence, different organizations have tried to enhance their own order fulfillment by benchmarking L.L. Bean. Exploratory research allows the researcher to gain deep insight into a specific subject, which gives rise to new subjects and provides more opportunities for researchers to study new things and question new things.
Exploratory research is normally conducted when an issue is not clearly defined. It permits the investigator to become familiar with the problem or concept to be studied, and possibly to generate hypotheses to be tested. Such research can be immensely valuable for social research.
Exploratory studies are vital when an investigator is breaking new ground, and they ordinarily yield new information about a topic for research. Exploratory research, in this sense, is a pillar that supports the other types of research.
Causal research, also known as explanatory research, is conducted in order to identify the extent and nature of cause-and-effect relationships. Causal research can be conducted in order to assess the impacts of specific changes on existing norms, processes, etc.
Explanatory research is defined as an attempt to connect ideas to understand cause and effect; researchers want to explain what is going on and why. Explanatory, analytical, and experimental studies explain why a phenomenon occurs and can be used for hypothesis testing.
Qualitative research is designed to explore the human elements of a given topic, while specific qualitative methods examine how individuals see and experience it. Explanatory research is research whose primary purpose is to explain why events occur, in order to build, elaborate, extend, or test theory.