Showing posts with label Stata.

3.14.2013

Now on Stata-bloggers...

Francis Smart has set up Stata-bloggers, a new blog aggregator for Stata users (modeled after R-bloggers). FE will be contributing there, but there are lots of other goodies worth checking out from more prolific bloggers.

Everyone say "Thank you, Francis."

2.18.2013

Plotting restricted cubic splines in Stata [with controls]

Michael Roberts has been trying to convince me to use restricted cubic splines to plot highly nonlinear functions, in part because they are extremely flexible and have nice properties near their edges.  Unlike polynomials, information at one end of the support only weakly influences fitted values at the other end of the support. Unlike the binned non-parametric methods I posted a few weeks ago, RC-splines are differentiable (smooth). Unlike other smooth non-parametric methods, RC-splines are fast to compute and easily account for control variables (like fixed effects) because they are summarized by just a few variables in an OLS regression. They can also be used with spatially robust standard errors or clustering, so they are great for nonlinear modeling of spatially correlated processes. 

In short: they have lots of advantages. The only disadvantage is that it takes a bit of effort to plot them since there's no standard Stata command to do it.

Here's my function plot_rcspline.ado, which generates the spline variables for the independent variable, fits the spline while accounting for control variables, and plots the partial effect of the specified independent variable (adjusting for the controls) with confidence intervals (computed via the delta method). It's as easy as

plot_rcspline y x

and you get something like


where the "knots" are plotted as the vertical lines (optional).

Help file below the fold.  Enjoy!

Related non-linear plotting functions previously posted on FE:
  1. Boxplot regression
  2. Visually-weighted regression
  3. Watercolor regression
  4. Non-parametric three-dimensional regression
  5. Binned non-parametric regression
  6. Polynomial regression of any degree

1.24.2013

Quickly plotting nonparametric response functions with binned independent variables [in Stata]

Yesterday's post described how we can bin the independent variable in a regression to get a nice non-parametric response function even when we have large data sets, complex standard errors, and many control variables.  Today's post is a function to plot these kinds of results.

After calling bin_parameter.ado to discretize an independent variable (see yesterday's post), run a regression of the outcome variable on the sequence of generated dummy variables (this command can be as complicated as you like, so feel free to throw your worst semi-hemi-spatially-correlated-auto-regressive-multi-dimensional-cluster-block-bootstrap standard errors at it). Then run plot_response.ado (today's function) to plot the results of that regression (with your fancy standard errors included). It's that easy.

Here's an example. Generate some data where Y is a quadratic function of X and a linear function of Z:
clear
set obs 1000
gen x = 10*runiform()-5
gen z = rnormal()
gen e = rnormal()
gen y = x^2+z+5*e
Then bin the parameter using yesterday's function and run a regression of your choosing, using the dummy variables output by bin_parameter:
bin_parameter x, s(1) t(4) b(-4) drop(-1.6) pref(_dummy)
reg y _dummy* z
After the regression, call plot_response.ado to plot the results of that regression (only the component related to the binned variables). The arguments describing the bins are the same format as those used by bin_parameter to make this easier:
plot_response, s(1) t(4) b(-4) drop(-1.6) pref(_dummy)
The result is a plot that clearly shows us the correct functional form:


Note: plot_response.ado requires parmest.ado (download from SSC by typing "net install st0043.pkg" at the command line). It also calls a function, parm_bin_center.ado, that is included in the download.

Citation note: If you use this suite of functions in publication, please cite: Hsiang, Lobell, Roberts & Schlenker (2012): "Climate and the location of crops."

Help file below the fold.

1.23.2013

Binning a continuous independent variable for flexible nonparametric models [in Stata]


Sometimes we want a flexible statistical model to allow for non-linearities (or to test if an observed relationship is actually linear). It's easy to run a model containing a high-degree polynomial (or something similar), but these can become complicated to interpret if the model contains many controls, such as location-specific fixed effects. Fully non-parametric models can be nice, but they require partialling out the data, and standard errors can become awkward if the sample is large or something sophisticated (like accounting for spatial correlation) is required.

An alternative that is easy to interpret, and handles large samples and complex standard errors well, is to convert the independent variable into discrete bins and to regress the outcome variable on dummy variables that represent each bin.

For example, in a paper with Jesse we take the typhoon exposure of Filipino households (a continuous variable) and make dummy variables for each 5 m/s bin of exposure. So there is a 10_to_15_ms dummy variable that is zero for all households except for those households whose exposure was between 10 and 15 m/s, and there is a different dummy for exposure between 15 and 20 m/s, etc.  When we regress our outcomes on all these dummy variables (and controls) at the same time, we recover their respective coefficients -- which together describe the nonlinear response of the outcome. In this case, the response turned out to be basically linear:


The effect of typhoon exposure on Filipino households finances.
From Anttila-Hughes & Hsiang (2011)


This approach coarsens the data somewhat, so there is some efficiency loss and we should be wary of Type 2 error if we compare bins to one another. But as a way to determine the functional form of a model, it works great so long as you have enough data.

I found myself rewriting Stata code to bin variables like this in many different contexts, so I wrote bin_parameter.ado to do it for me quickly.  Running these models can now be done in two lines of code (one of which is the regression command). bin_parameter allows you to specify a bin width, a top bin, a bottom bin and a dropped bin (for your comparison group). It spits out a bunch of dummy variables that represent all the bins which cover the range of the specified variable. It also has options for naming the dummy variables so you can use the wildcard notation in regression commands. Here's a short example of how it can be used:

clear
set obs 1000
gen x = 10*runiform()-5
gen y = x^2
bin_parameter x, s(1) t(4) b(-4) drop(-1.6) pref(x_dummy)
reg y x_dummy*
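
Roughly speaking, bin_parameter is just automating dummy-variable construction like the following hand-rolled sketch (not the ado itself, and its exact edge conventions may differ):

* Hand-rolled version of the binning above (a sketch; edges match s(1) t(4) b(-4))
gen x_hand_bot = (x < -4)                        // open-ended bottom bin
gen x_hand_top = (x >= 4)                        // open-ended top bin
forvalues b = -4(1)3 {
    local name = subinstr("`b'", "-", "n", .)    // minus signs aren't legal in variable names
    gen x_hand_`name' = (x >= `b') & (x < `b'+1) // interior bins of width 1
}
drop x_hand_n2                                   // the bin containing -1.6 is the omitted group
reg y x_hand_*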

Help file below the fold.

1.22.2013

Prettier graphs with less headache: use schemes in Stata

I'm picky about graphs looking nice. So for a long time I did silly things that drove me nuts, like typing "ylabel(, angle(horizontal))" for every single Stata graph I made (since some of the default settings in Stata are not pretty). I always knew that you could set a default scheme in Stata to manage colors, but I didn't realize that it could do more or be customized. See here to learn more.
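For reference, a scheme is just a plain-text file of directives that override Stata's defaults. A stripped-down sketch (not my actual scheme-sol.scheme; the entries below are borrowed from other published schemes) might look like:

* scheme-sketch.scheme (a minimal example, not scheme-sol)
#include s2color
color background white
anglestyle vertical_tick horizontal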

After a bit of playing around, I wrote my own scheme. The text file looks pitiful, I know. But it saves me lots of headache and makes each plot look nicer with zero marginal effort.  You can make your own, or if you download mine, put the file scheme-sol.scheme in ado/personal/ (where you put other ado files) and then type
set scheme sol
at the command line. Or set this as the default on your machine permanently with
set scheme sol, perm
It will make your plots look kind of like this:


1.21.2013

Plot polynomial of any degree in Stata (with controls)

FE has been a little sluggish to recover from break. To kick start us back in gear, I'm making good on one resolution by making this FE Week-of-Code. I'll try to post something useful that I've written from the past year each day.

It always bugged me that I could easily plot a linear or quadratic fit in Stata, but if I used a third-order polynomial I could no longer plot the results easily.  Stata has built-in functions like lowess, fpfitci and lpolyci that will plot very flexible functions, but those tend to be too flexible for many purposes. Sometimes I just want a function that is flexibly non-linear, but still smooth (so not lowess), something I can easily write down analytically (so not fpfit or lpoly), and perhaps not symmetric (so not qfit).  We use high-degree polynomials all the time, but we just don't plot them very often (I think this is because there is no built-in command to do it for us).

Here's my function plot_margins.ado that does it. It takes a polynomial that you use in a regression and plots the response function. Added bonuses: it plots the confidence interval you specify and can handle control variables (which are not plotted).

Reed Walker actually came up with this idea in an email exchange we had last year. So he deserves credit for this. I just wrote it up into an .ado file that you can call easily.

Basically, the idea is that you run any regression using Stata's factor variable notation, where you tell Stata that a variable X is continuous and should be interacted with itself, e.g.
reg y c.x##c.x
is the notation to regress Y on X and X^2. (Check out Stata's documentation on factor variables if this isn't familiar to you.)

Reed's idea was to then use Stata 11's new margins command to evaluate the response of Y to X and X^2 at several points along the support of X, and then to use parmest to plot the result. (To download parmest.ado, type "net install st0043.pkg" at the command line in Stata. plot_margins calls parmest, so you need to have it installed to run this function.)

The idea works really well, so long as you have Stata 11 or later (margins was introduced in Stata 11). 

Here's an example. First generate some false data. Y is the outcome of interest. X is the independent variable of interest. W is a relevant covariate. Z is an irrelevant covariate.
clear 
set obs 1000 
gen year = ceil(_n/100) 
gen x=5*uniform()-2 
gen z=2*uniform() 
gen w=2*uniform() 
gen e = 40*rnormal()  
gen y= w + 3*z + 4*x - x^2 + x^3 + e 
Then run a regression of Y on a polynomial of X (here, it's third degree) along with controls. The standard errors can be computed any fancy way you like. Here, I've done a block-bootstrap by the variable year.
reg y w z c.x##c.x##c.x , vce(bootstrap , reps(500) cluster(year))
Then, right after running the regression (or whenever the regression's estimates are the active estimates in memory), call plot_margins to plot the marginal contribution of X at different values.
plot_margins x
Easy enough? I've added a few features for programming ease. Use the plotcommand() option to add labels, etc. to the graph
plot_margins x, plotcommand("xtit(X) ytit(Y) tit(Third order polynomial) subtit(plotted with PLOT_MARGINS) note(SE are block-bootstrapped by year and model controls for X and Z)") 
 The result:


Or specify the option "line" to have the CI plotted as lines instead of a shaded area:


There is also an option to save the function generated by parmest. Help file is below the fold. Have fun and don't forget to email Reed and tell him "thank you!"

Update: Reed points us to marginsplot in Stata 12, which basically does the same thing. Funny that the function names are all so unique...
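
For those on Stata 12 or newer, the built-in route looks roughly like this (same simulated data and regression as above; the at() grid is an arbitrary choice):

* Built-in alternative using margins + marginsplot (Stata 12+)
reg y w z c.x##c.x##c.x , vce(bootstrap , reps(500) cluster(year))
margins, at(x = (-2(0.25)3))                 // predicted y over a grid of x values
marginsplot, recast(line) recastci(rarea)    // smooth line with a shaded CI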

7.30.2012

Visually-Weighted Regression

[This is the overdue earth-shattering sequel to this earlier post.]

I recently posted this working paper online. It's very short, so you should probably just read it (I was originally going to write it as a blog post), but I'll run through the basic idea here.

Since I'm proposing a method, I've written functions in Matlab (vwregress.m) and Stata (vwlowess.ado) to accompany the paper. You can download them here, but I expect that other folks can do a much better job implementing this idea.

Solomon M. Hsiang
Abstract: Uncertainty in regression can be efficiently and effectively communicated using the visual properties of regression lines.  Altering the "visual weight" of lines to depict the quality of information represented clearly communicates statistical confidence even when readers are unfamiliar or reckless with the formal and abstract definitions of statistical uncertainty. Here, we present an example by decreasing the color-saturation of nonparametric regression lines when the variance of estimates increases. The result is a simple, visually intuitive and graphically compact display of statistical uncertainty. This approach is generalizable to almost all forms of regression.
Here's the issue. Statistical uncertainty seems to be important for two different reasons. (1) If you have to make a decision based on data, you want to have a strong understanding of the possible outcomes that might result from your decision, which itself rests on how we interpret the data.  This is the "standard" logic, I think, and it requires a precise, quantitative estimate of uncertainty.  (2) Because there is noise in data, and because sampling is uneven across independent variables, a lot of data analysis techniques generate artifacts that we should mostly just ignore.  We are often unnecessarily focused/intrigued by the wackier results that show up in analyses, but thinking carefully about statistical uncertainty reminds us not to focus too much on these features. Except when it doesn't.

"Visually-weighted regression" is a method for presenting regression results that tries to address issue (2), taking a normal person's psychological response to graphical displays into account. I had grown a little tired of talks and referee reports where people speculate about the cause of some strange non-linearity at the edge of a regression sample, where there was no reason to believe the non-linear structure was real.  I think this and related behaviors emerge because (i) there seems to be an intellectual predisposition to thinking that "nonlinearity" is inherently more interesting that "linearity" and (ii) the traditional method for presenting uncertainty subconsciously focuses viewers attention on features of the data that are less reliable. I can't solve issue (i) with data visualization, but we can try to fix (ii). 

The goal of visually-weighted regression is to take advantage of viewers' psychological response to images in order to focus their attention on the results that are the most informative.  "Visual weight" is a concept from art and graphical design that is used to direct a viewer's focus within an image.  Large, dark, high-contrast, and complex structures tend to "grab" a viewer's attention.  Our brains are constantly looking for visual information and, somewhere along the evolutionary line, detailed/high-contrast structures in our field of view were probably more informative and more useful for survival, so we are programmed to give them more of our attention.  Unfortunately, the traditional approaches to displaying statistical uncertainty give more visual weight to the uncertain portions of the analysis, which is exactly backwards of what we want. Ideally, a viewer will focus more of their attention on the portions of analysis that have some statistical confidence and they will mostly ignore the portions of analysis that are so uncertain that they contain little or no information.

[continued below the fold]

4.30.2012

Why visualizing instrumental variables made me less excited about it

Instrumental variables (IV and 2SLS) is a statistical approach commonly used in the empirical social sciences ("instrumental variables" has > 84,000 hits on google scholar) but it often seems to be over-interpreted. I realized this when, as a grad student, I tried to draw a graphical description of what IV was doing. Ever since, I've never bothered to estimate IV and only spend my time on reduced form models.  These are my two cents.

IV & 2SLS

IV was originally developed to combat attenuation bias in ordinary least-squares, but somebody (I don't know who) realized it could be used to estimate local average treatment effects in experimental settings when not all subjects in a treatment group actually received the treatment of interest.  From there, someone figured out that IV could be used in retrospective studies to isolate treatment effects when assignment to treatment had some endogenous component. This last interpretation is where I think a lot of folks run into trouble. (If you don't know what any of this means, you probably won't care about the rest of this post...)

All of these conceptual innovations were big contributions.  But what happened afterwards led to a lot of sloppy thinking.  IV is often taught as a method for "removing" endogeneity from an independent variable (variable X) in a regression.  While strictly a correct statement (if some relatively strong assumptions are met), it often feels as if it's abused.  My perception from papers/talks/discussion is that most students interpret IV in a similar way to how they interpret a band-pass filter: if you find a plausible instrument (variable Z), you can "filter out" the bad (endogenous) part of your independent variable and just retain the good (exogenous) part.  While a useful heuristic for teaching, its overuse leads to bad intuition.

IV with one instrument

The first stage regression in IV "filters out" all variation in X that does not project (for now, linearly) onto Z.  This linear projection is just Z*b, but it is often written as "X-hat". While not mathematically incorrect, and appealing for heuristic purposes, I think that using X-hat as a notational tool is why a lot of students get confused. It makes students think that X-hat is just like X, except without all the bad endogenous parts.  But really, X-hat is more like Z than it is like X. In fact, X-hat IS Z! It's just linearly rescaled by b, which is just a change in units.  The reason why X-hat is exogenous if Z is exogenous is because

X-hat = Z * b

where b is a constant. It seems silly and patronizing to reiterate an equation that is taught in metrics 101 a gazillion times, but people seem to forget that this little rescaling equation is doing all the heavy lifting in an IV model.  (I bet that if we never used "X-hat" notation and we renamed "the first stage" something less exciting, like "linearly converting units," then grad students would think IV is much less magical...)

Once X-hat is estimated, it is used as the new regressor in the "second stage" equation

Y = X-hat * a + error

and then a big fuss is made about how a is the unbiased effect of X on Y.  But if we drop the X-hat notation and replace it with the rescaled-Z notation, we get something less exciting:

Y = Z * b * a + error

Where a is now the unbiased effect of Z on Y, but rescaled by a factor 1/b.  For those familiar with IV, this looks a lot like the reduced form equation

Y = Z * c + error

which was always valid because Z is exogenous. Clearly,

a = c / b.

If you're still reading, you should be yawning, since this all seems very boring: these are all just variations on the same reduced form equation. And that's my main point, which I don't think is taught enough: Instrumental variables is only as good as your reduced form, because it is a linear rescaling of your reduced form. Of course, there are adjustments to standard errors, but if you're doing IV to remove bias, then you were focused on your point estimate to begin with. (An aside: it drives me nuts when people are "uninterested" in the reduced form effect of natural variations [eg. weather] on some outcome, but then are fascinated if you use the same variation as an instrument for some otherwise endogenous intermediate variable. The latter regression is the same as the first, only the coefficient has been rescaled by 1/b!)

Enough ranting. How do we visualize this? It's as easy as it sounds: you rescale the horizontal axis in your reduced form regression.

To illustrate this, I generated 100 false data points with the equations

Z ~ uniform on [0,1]
X =  Z * 2 + e1
Y = X + e2


where e1 and e2 are N(0,1). And then plot the reduced form equation (Z vs. Y) as the blue line in the upper-left panel:


I then estimate the first stage regression in the lower left panel and show the predicted values X-hat as open red circles. In the lower right panel, I just plot X-hat against X-hat to show that I am reflecting these values from the vertical axis (ordinate) to the horizontal axis (abscissa). Then, keeping X-hat on the horizontal axis, I plot Y against X-hat in the upper right panel. This is the second stage estimate from IV, which gives us our unbiased estimate of the coefficient a (the slope of the red line).

[The code to generate these plots in Stata is at the bottom of this post.]

What is different between the scatter on the upper left (blue line, representing the reduced form regression) and the scatter on the upper right (red line, representing the second stage of IV)? Not too much. The vertical axis is the same and the relative locations of the data points are the same. The only change is that the horizontal variable has been rescaled by a factor of two (recall the data generating process above). The IV regression is just a rescaled version of the reduced form.  What happened to X? It was left behind in the lower left panel and it never went anywhere else. It only feels like it is somehow present in the upper right panel because we renamed Z*b the glitzier X-hat.
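
To make the rescaling point concrete, here is a compact sketch (not the original plotting script referenced above) of the simulation and the a = c/b relationship:

* Simulation sketch: reduced form vs. first stage vs. IV (not the original code)
clear
set obs 100
set seed 1234
gen z = runiform()              // instrument, uniform on [0,1]
gen x = 2*z + rnormal()         // first stage:  X = Z*2 + e1
gen y = x + rnormal()           // outcome:      Y = X   + e2
reg x z                         // first stage slope b (about 2)
reg y z                         // reduced form slope c (about 2)
ivregress 2sls y (x = z)        // IV slope a = c/b (about 1)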

Multiple instruments (2SLS)

Sometimes people have more than one exogenous variable that influences X (eg. temperature and rainfall). Both of these variables can be used as instruments in the first stage. What happens to the intuition above when we do this? Not too much, except things get harder to draw. But the fact that it's harder to draw doesn't mean that the estimate is necessarily more impressive or magical.

Suppose we now have instruments Z1 and Z2 such that

Multivariate first stage: X = Z1 - Z2
Second stage: Y = 2 * X

then the first stage regression (X-hat) looks like this:


where the two horizontal axes are Z1 and Z2 and the vertical axis is X.

If we substitute this first stage regression into the second stage, we get

Y = 2 * (Z1 - Z2)

We can easily plot this version of Y as a function of Z1 and Z2 (the reduced form). Here, it's the purple plane:


The resulting second stage regression would take the observed value for Y (purple) and project it onto the predicted values for X (green). The parameter of interest (2 here, but the variable "a" earlier) is just the ratio of the height of the purple surface to the height of the green surface. 

Again, everything can be stated in terms of the instruments Z1 and Z2 (the two horizontal axes), which means the reduced form is again just as good as the second stage.  X only enters by determining how steep the green surface is.
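
In Stata, the three fits are again just variations on each other (a sketch with assumed variable names y, x, z1, z2):

* With two instruments the logic is unchanged (sketch only)
reg x z1 z2                     // first stage: the green plane (X-hat)
reg y z1 z2                     // reduced form: the purple plane
ivregress 2sls y (x = z1 z2)    // 2SLS: the ratio of the planes' heights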

Nonlinear first stage with multiple instruments

Finally, some people use a non-linear first stage.  Again, this is not terribly different. Suppose we have

Nonlinear multivariate first stage: X = Z1^2 - Z2
Second stage: Y = 2 * X

Then the first stage looks like:


and the reduced form regression of Y on Z1 and Z2 gives us

Y = 2 * (Z1^2 - Z2)

which we overlay in purple again:


Since the second stage didn't change, the height of the purple surface is still always twice the distance from zero relative to the green surface (a scatter plot of purple vs. green values would be a straight line with slope = 2 = a).  This graph looks fancier, but the instruments Z1 and Z2 are still driving everything.  The endogenous variable X only enters passively by determining the height of the green surface. Once that's done, Z1 and Z2 do the rest.

Take away: Reduced form models can be very interesting. However, instrumental variables models are rarely much more interesting.  In fact, all the additional interestingness of the IV model arises from the exclusion restriction that is assumed, but this assumption is usually false (recall: how many papers use weather as an instrument?), so the IV model is probably exactly as interesting as the reduced form, except that it has larger standard errors and the wrong units.

[If you disagree, send me a clear visualization of your 2SLS results and all the steps you used to get there.  I'd love to be wrong here.]

7.06.2011

Some shapefile utilities for Matlab (and data transfer to Stata)

Here are some extremely simple but extremely useful functions when working with shapefiles in Matlab.

Functions to drop or keep polygons in your workspace based on their attributes (drop_by_attribute.m, keep_by_attribute.m).

A simple command to plot all your polygons (plot_districts.m).

A Matlab function to merge attributes (combine_attributes.m) and an accompanying Stata function to "un-merge" those attributes (break_var.ado).  This is particularly useful if you want a single attribute to uniquely identify a polygon (eg. combining State and County into a single attribute State_County) which you use when you do some calculations in Matlab (eg. compute avg. temperature by State_County) and export to Stata (using my scripts here) where you then want to retrieve the original distinct attributes (eg. State and County) for statistical analysis.
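
On the Stata side, the un-merge step can also be sketched with the built-in split command (this is not break_var.ado; the variable name state_county and the "_" separator are assumptions):

* Split a combined string id like "California_Alameda" back into its parts (sketch)
split state_county, parse("_") gen(piece)    // creates piece1, piece2
rename piece1 state
rename piece2 county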


4.01.2011

Stata's new geocode function

Our friend and colleague Reed Walker points out that there's a new Stata program called geocode that allows an internet-connected copy of Stata to query Google Maps for latitude and longitude. From Adam Ozimek at the excellent econ and general social science blog Modeled Behavior :
In the upcoming Stata Journal I have a paper with a coauthor that lets Stata query Google Maps in order to find latitude and longitude for addresses or other locations, also known as geocoding. What makes this useful is that you can have weird formatting, spelling errors, or missing information in your address or location variable and the program can still geocode it as well as Google Maps can place it on a map.
There's another program that calculates the "Google maps distance" (i.e., actual travel time distance as opposed to as-the-crow-flies geometric distance) between any two addresses. More details can be found here. This is very exciting.

3.13.2011

Stata blog post on understanding matrices (with bonus Stata cheat sheet)


William Gould on Stata's blog (previously mentioned here) has two great posts (here and here) on the intuition behind matrices and regression coefficients. The section on near-singular matrices is characteristically nice:
Singular matrices are an extreme case of nearly singular matrices, which are the bane of my existence here at StataCorp. Here is what it means for a matrix to be nearly singular: [see figure]
Nearly singular matrices result in spaces that are heavily but not fully compressed. In nearly singular matrices, the mapping from x to y is still one-to-one, but x's that are far away from each other can end up having nearly equal y values. Nearly singular matrices cause finite-precision computers difficulty. Calculating y = Ax is easy enough, but to calculate the reverse transform x = A^(-1)y means taking small differences and blowing them back up, which can be a numeric disaster in the making.
Both posts are great and I recommend them for anyone struggling with the intuition behind what exactly you're doing when you type in reg y x.

As an added bonus, earlier this week I stumbled across Kenneth Simon's excellent pdf cheat sheet of Stata commands for intermediate / advanced econometrics, here. I was trying to figure out a way to do something cute with distributed lag models and post-estimation tests, but the sheet covers everything from the simple but important (e.g., the difference between gen old = age >= 18 and gen old = age >= 18 if age<. ) to the arcane but potentially important (e.g., nonlinear hypothesis testing). If you're in applied work and use Stata I highly recommend flipping through it. I've already found several useful techniques I wasn't even aware existed.
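
That first example is worth spelling out, since it bites everyone at least once: Stata treats missing values as larger than any number, so the "if age<." qualifier matters. A quick demonstration:

* Why "gen old = age >= 18" mis-codes missing ages
clear
input age
12
25
.
end
gen old_wrong = age >= 18             // the missing age gets old_wrong = 1
gen old_right = age >= 18 if age < .  // the missing age stays missing
list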

11.17.2010

Official Stata Blog Now Up!

Our colleague Reed Walker points out that there's now an official Stata blog, "Not Elsewhere Classified":
Here we will try to keep you up-to-date about all things related to Stata Statistical Software. That includes not only product announcements from StataCorp and others, but timely tips (and sometimes comments) on other news related to the use of Stata.

Many entries will be signed by members of the StataCorp staff.

If you have any tips or comments for us, email blogteam@stata.com
Fun topics they've covered thus far include which are the most powerful commercially available computers and the advantages of running Stata-MP.

11.05.2010

Regression coefficient stability over time (in Stata)

If you're estimating a regression model with time series or panel data, you often would like to know if the coefficient you're interested in is changing over time or if it's stable for subsamples of the time series or panel.

Here's my simple but handy script that lets you see if your coefficient is stable or changing with a single command.  You specify your multiple regression model (for OLS, but you can change it easily to run a different estimator) and which coefficient to examine.  It does a sequence of regressions for a moving window of specified length and stores the changing coefficient, SE and CI (t-test).

For example, the figure at right shows the coefficient from the model

GDP_it = B * cereal_yields_it + e_it

for a panel of all countries estimated for a moving window that covers 3 years at a time.

The description of the code is just commented out in the script and pasted below the fold.
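
If you want to roll your own, the mechanics are easy to sketch with postfile (this is not the script itself; gdp, cereal_yield, and year are assumed variable names):

* Moving-window coefficients by hand (a sketch; 3-year windows)
tempname h
postfile `h' midyear b se using window_coefs, replace
quietly summarize year
local y0 = r(min)
local y1 = r(max) - 2
forvalues y = `y0'/`y1' {
    quietly reg gdp cereal_yield if inrange(year, `y', `y'+2)
    post `h' (`y'+1) (_b[cereal_yield]) (_se[cereal_yield])
}
postclose `h'
use window_coefs, clear
gen lo = b - 1.96*se
gen hi = b + 1.96*se
twoway rarea lo hi midyear || line b midyear    // coefficient path with its CI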

[7/14/2011: A small but important error in the code was fixed. Thanks to Kyle for catching it.]

9.24.2010

Misuse of "control variables" in multi-variable regression

Yesterday I was talking to my friend Anna who had a great question about multi-variate regression:

"When you have a regression of Y on X and Z, what happens to the variations in X and Z that influence Y but are perfectly correlated?"

This is a serious question that doesn't seem to be taken seriously enough by many researchers.  I think in econometrics 101, they teach us the Frisch-Waugh Theorem because they want us to think about this, but few of us do.

Anna was concerned about this because many people run a regression of Y on X, observe a correlation, then include a "control variable" Z and observe that the correlation between Y and X vanishes.  They then conclude that "X does not affect Y."  This conclusion need not be true; Anna was right.

If you use multi-variable regression in research, I would be sure you understand why Anna was right. If you don't, I would sit and think about Frisch-Waugh until you do.

If you don't believe me and you know how to use Stata, run this code. It might help.



/*WHY YOU SHOULD THINK VERY HARD ABOUT THE FRISCH-WAUGH THEOREM IF YOU USE MULTIPLE REGRESSION FOR CAUSAL INFERENCE*/
/*SOLOMON HSIANG, 2010*/


clear
set obs 1000


/*Here, X is the true exogenous variable*/
gen X = runiform()


/*Let Z and Y be influenced by X*/
gen Z = X
gen Y = 2*X


/*There are three independent sources of observational error*/
gen e_x = 0.1*runiform()
gen e_z = 0.1*runiform()
gen e_y = 0.1*runiform()


/*Then the following observations are observed*/
gen x = X + e_x
gen z = Z + e_z
gen y = Y + e_y


/*To be completely clear about what is observable, let's throw
away the fundamental variables and only keep the observed variables*/
drop X Y Z e_x e_y e_z


/*Suppose you thought that a unit change in x would increase y, so you estimate:*/
reg y x


/*This would give you a fairly good estimate of the coefficient 2, which is correct.*/
/**/
/*Now suppose you are anxious because someone tells you that you haven't controlled 
for every variable in the world. In your panic, you concede and include the variable 
z in your regression:*/
reg y x z


/*This is bizarre.  We know that Z was not involved at all in the creation of Y. But
including z in our regression suggests it is not only highly significantly correlated,
but it also dramatically changes the coefficient on x to half of its true value.*/

6.16.2010

Standard error adjustment (OLS) for spatial correlation and serial correlation in panel data in (Stata and Matlab)

I've been doing statistical work on climate impacts (see a typhoon climatology of the Philippines to the right) and have been having trouble finding code that will properly account for spatial correlation and serial correlation when estimating linear regression models (OLS) with panel (longitudinal) data.  So I wrote my own scripts for Matlab and Stata.  They implement GMM estimates similar to Newey-West (see Conley, 2008). You can download them here.

They are "beta versions" at this point, they may contain errors.  If you find one, please let me know.

In the future, I may try to document what I've learned about the different estimators out there. But for the time being, some of the advantages of what I've written are:
  • The weighting kernels are isotropic in space
  • The spatial weighting kernel can be uniform (as Conley, 2008 recommends) or decay linearly (a la Bartlett).
  • Serial correlation at a specific location is accounted for non-parametrically
  • Locations are specified with Lat-Lon, but kernel cutoffs are specified in kilometers; the code accounts for the curvature of the earth (to first order) so relatively large spatial domains can be accounted for in a single sample
  • An option allows you to examine the impact of adjustments for spatial correlation and spatial + serial correlation on your standard errors
  • The Stata version follows the format of all Stata estimates, so it should be compatible with post-estimation commands (eg. you can output your results using "outreg2").
[If you use this code, please cite: Hsiang (PNAS 2010)]


STATA VERSION 2 UPDATE 2013: Thanks to my field-testing team (Gordon McCord and Kyle Meng), several bugs in the code have been fixed and additional options added.  Most useful changes: the code now correctly accepts the wildcard "*" when specifying variables, and the option "dropvar" can be used to drop variables that Stata regards as 'too collinear' (contributed by KM).

STATA VERSION 3 UPDATE 2018: Thanks to some careful bug-chasing by Mathias Thoenig and Jordan Adamson, I've fixed a single line of code that led to a miscalculation in the weights for autocorrelation across within-unit panel observations. This error did not affect the Matlab implementation. This error did not affect calculations for the influence of spatial autocorrelation within contemporary observations or adjustments for heteroskedasticity. The previous version (v2) sometimes over-inflated or under-estimated the standard error estimate adjustment due to auto-correlation, but the magnitude of this effect depends on both the serial-correlation structure in the data and the maximum lag length parameter (lagcutoff) determined by the user. For very long lag lengths (e.g. infinity) the standard error estimates are too small, but for shorter lag lengths the bias may be of either sign.

My (currently ad-hoc) help file for the Stata script is below the fold. The Matlab code has an associated help-file embedded.

5.18.2010

Data transfer from Matlab to Stata and reverse

Zip file with Matlab code to export or import datasets to Stata. For use in conjunction with "insheet/outsheet". Importing data from Stata is restricted to numeric variables (but not the reverse).