## 9.24.2010

### Misuse of "control variables" in multi-variable regression

Yesterday I was talking to my friend Anna, who had a great question about multi-variable regression:

"When you have a regression of Y on X and Z, what happens to the variations in X and Z that influence Y but are perfectly correlated?"

This is a serious question that doesn't seem to be taken seriously enough by many researchers.  I think in econometrics 101, they teach us the Frisch-Waugh Theorem because they want us to think about this, but few of us do.

Anna was concerned about this because many people run a regression of Y on X, observe a correlation, then include a "control variable" Z and observe that the correlation between Y and X vanishes.  They then conclude that "X does not affect Y."  This conclusion need not be true. Anna was right.

If you use multi-variable regression in research, I would make sure you understand why Anna was right. If you don't, I would sit and think about Frisch-Waugh until you do.
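
For reference, here is a rough statement of what the theorem says for a regression of y on x and z with a constant. The notation (the tildes, the Greek letters) is an assumption chosen for this sketch and is not taken from any particular textbook:

$$
y_i = \alpha + \beta_x x_i + \beta_z z_i + \varepsilon_i
\quad\Longrightarrow\quad
\hat{\beta}_x = \frac{\sum_i \tilde{x}_i \, \tilde{y}_i}{\sum_i \tilde{x}_i^{2}}
$$

Here $\tilde{x}_i$ and $\tilde{y}_i$ are the residuals from regressing x and y, respectively, on z and a constant. In other words, the coefficient on x is estimated using only the variation in x that z cannot linearly predict. Variation that x and z share is credited to neither coefficient, and if x and z move nearly together, very little variation is left to estimate anything from.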

If you don't believe me and you know how to use Stata, run this code. It might help.

/*WHY YOU SHOULD THINK VERY HARD ABOUT THE FRISCH-WAUGH THEOREM IF YOU USE MULTIPLE REGRESSION FOR CAUSAL INFERENCE*/
/*SOLOMON HSIANG, 2010*/

clear
set obs 1000

/*Here, X is the true exogenous variable*/
gen X = runiform()

/*Let Z and Y be influenced by X*/
gen Z = X
gen Y = 2*X

/*There are three independent sources of observational error*/
gen e_x = 0.1*runiform()
gen e_z = 0.1*runiform()
gen e_y = 0.1*runiform()

/*Then the following observations are observed*/
gen x = X + e_x
gen z = Z + e_z
gen y = Y + e_y

/*To be completely clear about what is observable, let's throw
away the fundamental variables and only keep the observed variables*/
drop X Y Z e_x e_y e_z

/*Suppose you thought that a unit change in x would increase y, so you estimate:*/
reg y x

/*This would give you a fairly good estimate of the coefficient 2, which is correct.*/
/**/
/*Now suppose you are anxious because someone tells you that you haven't controlled
for every variable in the world. In your panic, you concede and include the variable
z as a control:*/
reg y x z

/*This is bizarre.  We know that Z was not involved at all in the creation of Y. But
including z in our regression suggests that z is not only highly significantly correlated
with y, but that it also dramatically changes the coefficient on x to half of its true value.*/
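
The same point can be checked directly with the partialling-out step that Frisch-Waugh describes. This is a minimal sketch, meant to be run right after the code above; the variable name `x_res` is my own and is not part of the original script.

```stata
/*Strip from x everything that z can linearly predict, and keep the leftover*/
reg x z
predict x_res, residuals

/*The slope on x_res should reproduce the coefficient on x from "reg y x z"*/
reg y x_res

/*Compare how much variation x has before and after partialling out z*/
summarize x x_res
```

Because x and z are both just X plus a little independent noise, the leftover `x_res` should have a much smaller standard deviation than x and consist mostly of observational error. That leftover is all the regression has to work with when it estimates the coefficient on x.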