Yesterday I was talking to my friend Anna, who had a great question about multiple regression:
"When you have a regression of Y on X and Z, what happens to the variations in X and Z that influence Y but are perfectly correlated?"
This is a serious question that doesn't seem to be taken seriously enough by many researchers. I think in econometrics 101, they teach us the Frisch-Waugh Theorem because they want us to think about this, but few of us do.
Anna was concerned about this because many people run a regression of Y on X, observe a correlation, then include a "control variable" Z and observe that the correlation between Y and X vanishes. They then conclude that "X does not affect Y." This conclusion need not be true. Anna was right.
If you use multiple regression in research, I would make sure you understand why Anna was right. If you don't, I would sit and think about Frisch-Waugh until you do.
If you don't believe me and you know how to use Stata, run this code. It might help.
/*WHY YOU SHOULD THINK VERY HARD ABOUT THE FRISCH-WAUGH THEOREM IF YOU USE MULTIPLE REGRESSION FOR CAUSAL INFERENCE*/
/*SOLOMON HSIANG, 2010*/
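/*Start from an empty dataset so that "set obs" and "gen" do not fail on data already in memory*/
clear
set obs 1000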
/*Here, X is the true exogenous variable*/
gen X = runiform()
/*Let Z and Y be influenced by X*/
gen Z = X
gen Y = 2*X
/*There are three independent sources of observational error*/
gen e_x = 0.1*runiform()
gen e_z = 0.1*runiform()
gen e_y = 0.1*runiform()
/*Then the following observations are observed*/
gen x = X + e_x
gen z = Z + e_z
gen y = Y + e_y
/*To be completely clear about what is observable, let's throw
away the fundamental variables and only keep the observed variables*/
drop X Y Z e_x e_y e_z
/*Suppose you thought that a unit change in x would increase y, so you estimate:*/
reg y x
/*This gives you an estimate close to 2, the true coefficient (it is pulled down very slightly by the observational error in x).*/
/*Now suppose you are anxious because someone tells you that you haven't controlled
for every variable in the world. In your panic, you concede and include the variable
z in your regression:*/
reg y x z
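/*This is bizarre. We know that Z was not involved at all in the creation of Y. But
including z in our regression suggests that z is not only highly significantly correlated
with y, but it also cuts the coefficient on x to about half of its true value. Because x and z
are equally noisy observations of the same X, the regression splits the effect of X
roughly evenly between them.*/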
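/*A minimal sketch of the Frisch-Waugh logic, continuing from the x, y and z that are
still in memory. The names x_res and y_res are my own, chosen only for illustration.
First, strip out of x and y everything that z can linearly explain:*/
reg x z
predict x_res, residuals
reg y z
predict y_res, residuals
/*Then regress what is left of y on what is left of x:*/
reg y_res x_res
/*The coefficient on x_res should match the coefficient on x from "reg y x z" (the
standard errors can differ slightly because of a degrees-of-freedom adjustment).
In other words, the multiple regression only uses the variation in x that is not
shared with z. Here almost all of the variation in x comes from X, which z shares,
so what is left over is mostly observational error.*/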