Standard error adjustment (OLS) for spatial correlation and serial correlation in panel data in (Stata and Matlab)

I've been doing statistical work on climate impacts (see a typhoon climatology of the Philippines to the right) and have been having trouble finding code that will properly account for spatial correlation and serial correlation when estimating linear regression models (OLS) with panel (longitudinal) data.  So I wrote my own scripts for Matlab and Stata.  They implement GMM estimates similar to Newey-West (see Conley, 2008). You can download them here.

They are "beta versions" at this point, they may contain errors.  If you find one, please let me know.

In the future, I may try to document what I've learned about the different estimators out there. But for the time being, some of the advantages of what I've written are:
  • The weighting kernels are isotropic in space
  • The spatial weighting kernel can be uniform (as Conley, 2008 recommends) or decay linearly (a la Bartlett).
  • Serial correlation at a specific location is accounted for non-parametrically
  • Locations are specified with Lat-Lon, but kernel cutoffs are specified in kilometers; the code accounts for the curvature of the earth (to first order) so relatively large spatial domains can be accounted for in a single sample
  • An option allows you to examine the impact of adjustments for spatial correlation and spatial + serial correlation on your standard errors
  • The Stata version follows the format of all Stata estimates, so it should be compatible with post-estimation commands (eg. you can output your results using "outreg2").
[If you use this code, please cite: Hsiang (PNAS 2010)]

STATA VERSION 2 UPDATE 2013: Thanks to my field-testing team (Gordon McCord and Kyle Meng), several bugs in the code and additional options have been added.  Most useful changes: the code now correctly accepts the wildcard "*" when specifying variables and the option "dropvar" can be used to drop variables that Stata regards as 'too collinear' (contributed by KM).

STATA VERSION 3 UPDATE 2018: Thanks to some careful bug-chasing by Mathias Thoenig and Jordan Adamson, I've fixed a single line of code that led to a miscalculation in the weights for autocorrelation across within-unit panel observations. This error did not affect the Matlab implementation. This error did not affect calculations for the influence of spatial autocorrelation within contemporary observations or adjustments for heteroskedasticity. The previous version (v2) sometimes over-inflated or under-estimated the standard error estimate adjustment due to auto-correlation, but the magnitude of this effect depends on both the serial-correlation structure in the data and the maximum lag length parameter (lagcutoff) determined by the user. For very long lag lengths (e.g. infinity) the standard error estimates are too small, but for shorter lag lengths the bias may be of either sign.

My (currently ad-hoc) help file for the Stata script is below the fold. The Matlab code has an associated help-file embedded.

I've just cut and pasted the documentation in the Stata file below.  Eventually I will figure out how to write a proper Stata help file.


 ols_spatial_HAC Yvar Xvarlist, lat(latvar) lon(lonvar) Timevar(tvar) Panelvar(pvar) [DISTcutoff(#) LAGcutoff(#) bartlett DISPlay star dropvar]

 Function calculates non-parametric (GMM) spatial and autocorrelation 
 structure using a panel data set.  Spatial correlation is estimated for all
 observations within a given period.  Autocorrelation is estimated for a
 given individual over multiple periods up to some lag length. Var-Covar
 matrix is robust to heteroskedasticity.

 A variable equal to 1 is required to estimate a constant term.

 Example commands:

 ols_spatial_HAC dep indep1 indep2 const, lat(C1) lon(C2) t(year) p(id) dist(300) lag(3) bartlett disp

 ols_spatial_HAC dep indep*, lat(C1) lon(C2) timevar(year) panelvar(id) dist(100) lag(2) star dropvar


 Requred arguments: 

 Yvar: dependent variable  
 Xvarlist: independnet variables (INCLUDE constant as column)
 latvar: variable containing latitude in DEGREES of each obs
 lonvar: same, but longitude
 tvar: varible containing time variable
 pvar: variable containing panel variable (must be numeric, see "encode")


 Optional arguments:

 distcutoff(#): {abbrev dist(#)} describes the distance cutoff in KILOMETERS for the spatial kernal (the distance at which spatial correlation is assumed to vanish). Default is 1 KM.

 lagcutoff(#): {abbrev lag(#)} describes the maximum number of temporal periods for the linear Bartlett window that weights serial correlation across time periods (the distance at which serial correlation is assumed to vanish). Default is 0 PERIODS (no serial correlation). {Note, Greene recommends at least T^0.25}  



 bartlett: use a linear bartlett window for spatial correlations, instead of a uniform kernal

 display: {abbrev disp} display a table with estimated coeff and SE & t-stat using OLS, adjusting for spatial correlation and adjusting for both spatial and serial correlation. Can be used with star option. Ex:

     Variable |   OLS      spatial    spatHAC   
       indep1 |    0.568      0.568      0.568  
              |    0.198      0.206      0.240  
              |    2.876      2.761      2.369  
        const |    6.415      6.415      6.415  
              |    0.790      1.176      1.340  
              |    8.119      5.454      4.786  
                                  legend: b/se/t

 star: same as display, but uses stars to denote significance and does not show SE & t-stat. Can be used with display option. Ex:

     Variable |    OLS        spatial      spatHAC    
       indep1 |   0.568***     0.568***     0.568**   
        const |   6.415***     6.415***     6.415***  
                   legend: * p<.1; ** p<.05; *** p<.01
 dropvar: Drops variables that Stata would drop due to collinearity. This requires that an additiona regression is run, so it slows the code down. For large datasets, if this function is called many times, it may be faster to ensure that colinear variables are dropped in advance rather than using the option dropvar. If Stata returns "estimates post: matrix has missing values", than including the option dropvar may solve the problem. (This option written by Kyle Meng).



 The default kernal used to weight spatial correlations is a uniform kernal that
 discontinously falls from 1 to zero at length locCutoff in all directions (it is isotropic). This is the kernal recommented by Conley (2008). If the option "bartlett" is selected, a conical kernal that decays linearly with distance in all directions is used instead.

 Serial correlation bewteen observations of the same individual over multiple periods seperated by lag L are weighted by 

       w(L) = 1 - L/(lagCutoff+1)


 Location arguments should specify lat-lon units in DEGREES, however
 distcutoff should be specified in KILOMETERS. 

 distcutoff must exceed zero. CAREFUL: do not supply
 coordinate locations in modulo(360) if observations straddle the
 zero-meridian or in modulo(180) if they straddle the date-line. 

 Distances are computed by approximating the planet's surface as a plane
 around each observation.  This allows for large changes in LAT to be
 present in the dataset (it corrects for changes in the length of
 LON-degrees associated with changes in LAT). However, it does not account
 for the local curvature of the surface around a point, so distances will
 be slightly lower than true geodesics. This should not be a concern so
 long as locCutoff is < O(~2000km), probably.

 Each time-series for an individual observation in the panel is treated
 with Heteroskedastic and Autocorrelation Standard Errors. If lagcutoff =
 0, than this estimate is equivelent to White standard errors (with spatial correlations 
 accounted for). If lagcutoff = infinity, than this treatment is
 equivelent to the "cluster" command in Stata at the panel variable level.

 This script stores estimation results in standard Stata formats, so most "ereturn" commands should work properly.  It is also compatible with "outreg2," although I have not tested other programs.

 The R^2 statistics output by this function will differ from analogous R^2 stats
 computed using "reg" since this function omits the constant. 


      TG Conley "GMM Estimation with Cross Sectional Dependence" 
      Journal of Econometrics, Vol. 92 Issue 1(September 1999) 1-45

      Conley "Spatial Econometrics" New Palgrave Dictionary of Economics,
      2nd Edition, 2008


      Greene, Econometric Analysis, p. 546


 Modified from scripts written by Ruben Lebowski and Wolfram Schlenker and Jean-Pierre Dube and Solomon Hsiang
 Debugging help provided by Mathias Thoenig.


1 comment:

  1. Dear Sol,

    I have started using this code, but adopted it to pass demeaned variables to the procedure to make running it a bit more simple for users with higher dimensional fixed effects.

    See http://www.trfetzer.com/conley-spatial-hac-errors-with-fixed-effects/