9.24.2010

Misuse of "control variables" in multi-variable regression

Yesterday I was talking to my friend Anna, who had a great question about multi-variable regression:

"When you have a regression of Y on X and Z, what happens to the variations in X and Z that influence Y but are perfectly correlated?"

This is a serious question that doesn't seem to be taken seriously enough by many researchers.  I think in econometrics 101, they teach us the Frisch-Waugh Theorem because they want us to think about this, but few of us do.

Anna was concerned about this because many people run a regression of Y on X, observe a correlation, then include a "control variable" Z and observe that the correlation between Y and X vanishes.  They then conclude that "X does not affect Y."  This conclusion need not be true, and Anna was right to be concerned.

If you use multi-variable regression in research, I would make sure you understand why Anna was right. If you don't, I would sit and think about Frisch-Waugh until you do.

If you don't believe me and you know how to use Stata, run this code. It might help.



/*WHY YOU SHOULD THINK VERY HARD ABOUT THE FRISCH-WAUGH THEOREM IF YOU USE MULTIPLE REGRESSION FOR CAUSAL INFERENCE*/
/*SOLOMON HSIANG, 2010*/


clear
set obs 1000


/*Here, X is the true exogenous variable*/
gen X = runiform()


/*Let Z and Y be influenced by X*/
gen Z = X
gen Y = 2*X


/*There are three independent sources of observational error*/
gen e_x = 0.1*runiform()
gen e_z = 0.1*runiform()
gen e_y = 0.1*runiform()


/*Then the following noisy variables are what we actually observe*/
gen x = X + e_x
gen z = Z + e_z
gen y = Y + e_y


/*To be completely clear about what is observable, let's throw
away the fundamental variables and only keep the observed variables*/
drop X Y Z e_x e_y e_z


/*Suppose you thought that a unit change in x would increase y, so you estimate:*/
reg y x


/*This would give you a fairly good estimate of the coefficient 2, which is correct.*/
/**/
/*Now suppose you are anxious because someone tells you that you haven't controlled 
for every variable in the world. In your panic, you concede and include the variable 
z in your regression:*/
reg y x z


/*This is bizarre.  We know that Z was not involved at all in the creation of Y. But
including z in our regression suggests that z is not only highly statistically significant,
but that it also cuts the coefficient on x to roughly half of its true value.*/
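
One way to see what Frisch-Waugh is doing here is to run the "partialling out" steps by hand. The sketch below is illustrative and is not part of the original script: it assumes the same data-generating process as the code above, and the seed and the names y_res, x_res, b_full and b_fw are additions made for this example. It regresses y and x each on z, keeps the residuals, and then regresses one set of residuals on the other. The coefficient from that last regression should match the coefficient on x in the full regression, which shows that once z is "controlled for," the estimate for x rests only on the small part of x that z cannot predict. In this setup that leftover part is mostly measurement noise.

```stata
/*Rebuild the observed data as in the script above (the seed is arbitrary)*/
clear
set seed 12345
set obs 1000

gen double X = runiform()
gen double x = X + 0.1*runiform()
gen double z = X + 0.1*runiform()
gen double y = 2*X + 0.1*runiform()
drop X

/*Full regression: store the coefficient on x*/
reg y x z
scalar b_full = _b[x]

/*Frisch-Waugh step 1: remove from y everything that z can predict.
  Note that z alone "explains" y about as well as x alone does.*/
reg y z
predict double y_res, residuals

/*Frisch-Waugh step 2: remove from x everything that z can predict*/
reg x z
predict double x_res, residuals

/*Frisch-Waugh step 3: regress the leftover y on the leftover x*/
reg y_res x_res
scalar b_fw = _b[x_res]

/*The two coefficients should agree up to rounding*/
display "coefficient on x, full regression:     " b_full
display "coefficient on x, residual regression: " b_fw
```

Because x and z are both noisy copies of the same underlying X, the data cannot tell them apart, so the regression splits the true effect between them. That split follows from the correlation structure of the observed variables and says nothing about whether Z had any role in creating Y.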

Madagascar

Jesse and I have been working hard and haven't posted anything in a while. I don't have much time to offer comments, but this is a really interesting piece in last month's NGM.  It follows groups of loggers and others in Madagascar and I think it does a good job capturing some of the challenges of development and the political economy of resource extraction.

For anyone who's interested in commitment problems of political economy, this is a good quote:

In September 2009, after months during which up to 460,000 dollars' worth of rosewood was being illegally harvested every day, the cash-strapped new government reversed a 2000 ban on the export of rosewood and released a decree legalizing the sale of stockpiled logs. Pressured by an alarmed international community, the government reinstated the ban in April.

And I also liked this breakdown of who gets to keep what from the forest:

For weeks they camp out in small groups beside the trees they've singled out for cutting, subsisting on rice and coffee, until the boss shows up. He inspects the rosewood, gives the order. They chop away with axes. Within hours a tree that first took root perhaps 500 years ago has fallen to the ground. The cutters hack away at its white exterior until all that remains is its telltale violet heart. The rosewood is cut into logs about seven feet long. Another team of two men tie ropes around each log and proceed to drag it out of the forest to the river's edge, a feat that will take them two days and earn them $10 to $20 a log, depending on the distance. While staggering through the forest myself, from time to time I come upon the jarring apparition of two stoic figures tugging a 400-pound log up some impossible gradient or down a waterfall or across quicksand-like bogs—a hard labor of biblical scale, except that these men are doing this for money. As is the man the pair would meet up with at the river, waiting to tie the log to a handcrafted radeau, or raft, to help it float down the rapids ($25 a log). As is the pirogueman awaiting the radeau where the rapids subside ($12 a log). As is the park ranger whom the timber bosses have bribed to stay away ($200 for two weeks). As are police at checkpoints along the road to Antalaha ($20 an officer). The damage to the forest is far more than the loss of the precious hardwoods: For each dense rosewood log, four or five lighter trees are cut down to create the raft that will transport it down the river.
At a bend in the river, the pirogues pull up to shore. A man with a mustache squats in a tent, smoking a hand-rolled cigarette. His name is Dieudonne. He works with the middleman, the boss on the ground, entrusted by the timber baron to select the trees for cutting and oversee the logs from the riverbank to the transport trucks. There have been 18 trucks this morning. Thirty or so rosewood logs lie scattered around Dieudonne's tent. His cut is $12 a log.