
2.12.2014

Spatial Data and Analysis

I developed a new course last fall titled "Spatial Data and Analysis". Because several people have asked for the material, I've finally posted the syllabus and assignments online here.

Description of the course:
The recent explosion of spatially explicit data and analytical tools, such as "Geographic Information Systems" (GIS) and spatial econometrics, has aided researchers and decision-makers faced with a variety of challenges. This course introduces students to spatial data and its analysis, as well as the modeling of spatially dependent social processes and policy problems. Students will be introduced to the types, sources, and display of spatial data. Through hands-on analysis, students will learn to extract quantitative information from spatial data for applied research and public policy. Students will be introduced to spatial statistics, spatially dependent simulation, and spatial optimization. Students will learn to think creatively about spatial problems through examples drawn from economics, politics, epidemiology, criminology, agriculture, social networks, and the environment. The goal of the course is to equip advanced master's students and doctoral students with tools that will help them be effective analysts and communicators of spatial information in their future research or policy-related work. Because hands-on analysis plays a central role in the class, students will benefit from prior experience with basic computer programming -- although prior experience is not required. Prerequisites: introductory statistics or equivalent.

3.14.2013

Now on Stata-bloggers...

Francis Smart has set up Stata-bloggers, a new blog aggregator for Stata users (modeled after R-bloggers). FE will be contributing there, but there are lots of other goodies worth checking out from more prolific bloggers.

Everyone say "Thank you, Francis."

2.18.2013

Plotting restricted cubic splines in Stata [with controls]

Michael Roberts has been trying to convince me to use restricted cubic splines to plot highly nonlinear functions, in part because they are extremely flexible and they have nice properties near their edges.  Unlike polynomials, information at one end of the support only weakly influences fitted values at the other end of the support. Unlike the binned non-parametric methods I posted a few weeks ago, RC-splines are differentiable (smooth). Unlike other smooth non-parametric methods, RC-splines are fast to compute and easily account for control variables (like fixed effects) because they are summarized by just a few variables in an OLS regression. They can also be used with spatially robust standard errors or clustering, so they are great for nonlinear modeling of spatially correlated processes. 

In short: they have lots of advantages. The only disadvantage is that it takes a bit of effort to plot them since there's no standard Stata command to do it.

Here's my function plot_rcspline.ado, which generates the spline variables for the independent variable, fits a spline while accounting for control variables, and plots the partial effect of the specified independent variable (adjusting for the control vars) with confidence intervals (computed via delta method). It's as easy as

plot_rcspline y x

and you get something like


where the "knots" are plotted as the vertical lines (optional).

Help file below the fold.  Enjoy!

Related non-linear plotting functions previously posted on FE:
  1. Boxplot regression
  2. Visually-weighted regression
  3. Watercolor regression
  4. Non-parametric three-dimensional regression
  5. Binned non-parametric regression
  6. Polynomial regression of any degree

1.24.2013

Quickly plotting nonparametric response functions with binned independent variables [in Stata]

Yesterday's post described how we can bin the independent variable in a regression to get a nice non-parametric response function even when we have large data sets, complex standard errors, and many control variables.  Today's post is a function to plot these kinds of results.

After calling bin_parameter.ado to discretize an independent variable (see yesterday's post), run a regression of the outcome variable on the sequence of generated dummy variables (this command can be as complicated as you like, so feel free to throw your worst semi-hemi-spatially-correlated-auto-regressive-multi-dimensional-cluster-block-bootstrap standard errors at it). Then run plot_response.ado (today's function) to plot the results of that regression (with your fancy standard errors included). It's that easy.

Here's an example. Generate some data where Y is a quadratic function of X and a linear function of Z:
set obs 1000
gen x = 10*runiform()-5
gen z = rnormal()
gen e = rnormal()
gen y = x^2+z+5*e
Then bin the parameter using yesterday's function and run a regression of your choosing, using the dummy variables output by bin_parameter:
bin_parameter x, s(1) t(4) b(-4) drop(-1.6) pref(_dummy)
reg y _dummy* z
After the regression, call plot_response.ado to plot the results of that regression (only the component related to the binned variables). The arguments describing the bins are the same format as those used by bin_parameter to make this easier:
plot_response, s(1) t(4) b(-4) drop(-1.6) pref(_dummy)
The result is a plot that clearly shows us the correct functional form:


Note: plot_response.ado requires parmest.ado (download from SSC  by typing "net install st0043.pkg" at the command line). It also calls a function parm_bin_center.ado that is included in the download.

Citation note: If you use this suite of functions in publication, please cite: Hsiang, Lobell, Roberts & Schlenker (2012): "Climate and the location of crops."

Help file below the fold.

1.23.2013

Binning a continuous independent variable for flexible nonparametric models [in Stata]


Sometimes we want a flexible statistical model that allows for non-linearities (or to test if an observed relationship is actually linear). It's easy to run a model containing a high-degree polynomial (or something similar), but these can become complicated to interpret if the model contains many controls, such as location-specific fixed effects. Fully non-parametric models can be nice, but they require partialling out the data, and standard errors can become awkward if the sample is large or something sophisticated (like accounting for spatial correlation) is required.

An alternative that is easy to interpret, and handles large samples and complex standard errors well, is to convert the independent variable into discrete bins and to regress the outcome variable on dummy variables that represent each bin.

For example, in a paper with Jesse we take the typhoon exposure of Filipino households (a continuous variable) and make dummy variables for each 5 m/s bin of exposure. So there is a 10_to_15_ms dummy variable that is zero for all households except for those households whose exposure was between 10 and 15 m/s, and there is a different dummy for exposure between 15 and 20 m/s, etc.  When we regress our outcomes on all these dummy variables (and controls) at the same time, we recover their respective coefficients -- which together describe the nonlinear response of the outcome. In this case, the response turned out to be basically linear:


The effect of typhoon exposure on Filipino household finances.
From Anttila-Hughes & Hsiang (2011)


This approach coarsens the data somewhat, so there is some efficiency loss and we should be wary of Type 2 error if we compare bins to one another. But as a way to determine the functional form of a model, it works well so long as you have enough data.
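For concreteness, here is a rough manual version of the binning step in Stata (the bin edges assume x roughly spans [-5, 5), as in the example below; the function introduced next automates this and handles the edge bins and the omitted category for you):

* cut x into 1-unit bins and turn each bin into a dummy variable
egen xbin = cut(x), at(-5(1)5)
tabulate xbin, generate(xdum)
* regress on all but one dummy (the omitted bin is the comparison group)
reg y xdum2-xdum10 z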

I found myself rewriting Stata code to bin variables like this in many different contexts, so I wrote bin_parameter.ado to do it for me quickly.  Running these models can now be done in two lines of code (one of which is the regression command). bin_parameter allows you to specify a bin width, a top bin, a bottom bin and a dropped bin (for your comparison group). It spits out a bunch of dummy variables that represent all the bins which cover the range of the specified variable. It also has options for naming the dummy variables so you can use the wildcard notation in regression commands. Here's a short example of how it can be used:

set obs 1000
gen x = 10*runiform()-5
gen y = x^2
bin_parameter x, s(1) t(4) b(-4) drop(-1.6) pref(x_dummy)
reg y x_dummy*

Help file below the fold.

1.22.2013

Prettier graphs with less headache: use schemes in Stata

I'm picky about graphs looking nice. So for a long time I did silly things that drove me nuts, like typing "ylabel(, angle(horizontal))" for every single Stata graph I made (since some of the default settings in Stata are not pretty). I always knew that you could set a default scheme in Stata to manage colors, but I didn't realize that it could do more or be customized. See here to learn more.

After a bit of playing around, I wrote my own scheme. The text file looks pitiful, I know. But it saves me lots of headache and makes each plot look nicer with zero marginal effort.  You can make your own, or if you download mine, put the file scheme-sol.scheme in ado/personal/ (where you put other ado files) and then type
set scheme sol
at the command line. Or set this as the default on your machine permanently with
set scheme sol, perm
It will make your plots look kind of like this:


1.21.2013

Plot polynomial of any degree in Stata (with controls)

FE has been a little sluggish to recover from break. To kick start us back in gear, I'm making good on one resolution by making this FE Week-of-Code. I'll try to post something useful that I've written from the past year each day.

It always bugged me that I could easily plot a linear or quadratic fit in Stata, but if I used a third-order polynomial I could no longer plot the results easily.  Stata has built-in commands like lowess, fpfitci and lpolyci that will plot very flexible functions, but those tend to be too flexible for many purposes. Sometimes I just want a function that is flexibly non-linear, but still smooth (so not lowess) and something I can easily write down analytically (so not fpfit or lpoly) and perhaps not symmetric (so not qfit).  We use high-degree polynomials all the time, but we just don't plot them very often (I think this is because there is no built-in command to do it for us).

Here's my function plot_margins.ado that does it. It takes a polynomial that you use in a regression, and plots the response function. Added bonuses: it plots the confidence interval you specify and can handle control variables (which are not plotted).

Reed Walker actually came up with this idea in an email exchange we had last year. So he deserves credit for this. I just wrote it up into an .ado file that you can call easily.

Basically, the idea is that you run any regression using Stata's factor variable notation, where you tell Stata that a variable X is continuous and should be interacted with itself, e.g.
reg y c.x##c.x
is the notation to regress Y on X and X^2. (Check out Stata's documentation on factor variables if this isn't familiar to you.)

Reed's idea was to then use Stata 11's new margins command to evaluate the response of Y to X and X^2 at several points along the support of X, and then to use parmest to plot the result. (To download parmest.ado, type "net install st0043.pkg" at the command line in Stata. plot_margins will call parmest, so you need to have it installed to run this function.)

The idea works really well, so long as you have Stata 11 or later (margins was introduced in Stata 11). 

Here's an example. First generate some false data. Y is the outcome of interest. X is the independent variable of interest. W is a relevant covariate. Z is an irrelevant covariate.
clear 
set obs 1000 
gen year = ceil(_n/100) 
gen x=5*uniform()-2 
gen z=2*uniform() 
gen w=2*uniform() 
gen e = 40*rnormal()  
gen y= w + 3*z + 4*x - x^2 + x^3 + e 
Then run a regression of Y on a polynomial of X (here, it's third degree) along with controls. The standard errors can be computed any fancy way you like. Here, I've done a block-bootstrap by the variable year.
reg y w z c.x##c.x##c.x , vce(bootstrap , reps(500) cluster(year))
Then, right after running the regression (or while the estimates from the regression are still active in memory), call plot_margins to plot the marginal contribution of X at different values.
plot_margins x
Easy enough? I've added a few features for programming ease. Use the plotcommand() option to add labels, etc. to the graph:
plot_margins x, plotcommand("xtit(X) ytit(Y) tit(Third order polynomial) subtit(plotted with PLOT_MARGINS) note(SE are block-bootstrapped by year and model controls for X and Z)") 
 The result:


 or specify the option "line" to have the CI plotted as lines instead of a shaded area:


There is also an option to save the function generated by parmest. Help file is below the fold. Have fun and don't forget to email Reed and tell him "thank you!"

Update: Reed points us to marginsplot in Stata 12, which basically does the same thing. Funny that the function names are all so unique...
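For anyone on Stata 12 or newer, a rough margins + marginsplot version of the example above might look like this (the evaluation grid is arbitrary, and marginsplot plots predictive margins rather than the partial contribution of X alone):

* rerun the regression from above, then evaluate the fit along a grid of x
reg y w z c.x##c.x##c.x , vce(bootstrap , reps(500) cluster(year))
margins, at(x=(-2(0.25)3))
* plot the fitted response with its confidence band
marginsplot, recast(line) recastci(rarea)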

8.31.2012

Watercolor regression

I'm in a rush, so I will explain this better later. But Andrew Gelman posted my idea for a type of visually-weighted regression that I jokingly called "watercolor regression" without a picture, so it's a little tough to see what I was talking about. Here is the same email but with the pictures to go along with it. The code to do your own watercolor regression is here as the 'SMOOTH' option to vwregress. (Update: I've added a separate cleaner function watercolor_reg.m for folks who want to tinker with the code but don't want to wade through all the other options built into vwregress. Update 2: I've added watercolor regression as a second example in the revised paper here.)

This was the email with figures included:

Two small follow-ups based on the discussion (the second/bigger one is to address your comment about the 95% CI edges).

1. I realized that if we plot the confidence intervals as a solid color that fades (eg. using the "fixed ink" scheme from before) we can make sure the regression line also has heightened visual weight where confidence is high by plotting the line white. This makes the contrast (and thus visual weight) between the regression line and the CI highest when the CI is narrow and dark. As the CI fade near the edges, so does the contrast with the regression line. This is a small adjustment, but I like it because it is so simple and it makes the graph much nicer. 


My posted code has been updated to do this automatically.

2. You and your readers didn't like that the edges of the filled CI were so sharp and arbitrary. But I didn't like that the contrast between the spaghetti lines and the background had so much visual weight.  So to meet in the middle, I smoothed the spaghetti plot to get a nonparametric estimate of the probability that the conditional mean is at a given value:


To do this, after generating the spaghetti through bootstrapping, I estimate a kernel density of the spaghetti in the Y dimension for each value of X.  I set the visual-weighting scheme so it still "preserves ink" along a vertical line-integral, so the distribution dims where it widens since the ink is being "stretched out". To me, it kind of looks like a watercolor painting -- maybe we should call it a "watercolor regression" or something like that.

The watercolor regression turned out to be more of a coding challenge than I expected, because the bandwidth for the kernel smoothing has to adjust to the width of the CI. And since several people seem to like R better than Matlab, I attached 2 figs to show them how I did this. Once you have the bootstrapped spaghetti plot:


I defined a new coordinate system that spanned the range of bootstrapped estimates for each value in X:


The kernel smoothing is then executed along the vertical columns of this new coordinate system.

I've updated the code posted online to include this new option. This Matlab code will generate a similar plot using my vwregress function:

x = randn(100,1);
e = randn(100,1);
y = 2*x+x.^2+4*e;

bins = 200;
color = [.5 0 0];
resamples = 500;
bw = 0.8;

vwregress(x, y, bins, bw, resamples, color, 'SMOOTH');


NOTE TO R USERS: The day after my email to Andrew, Felix Schönbrodt posted a nice similar variant with code in R here.

Update: For overlaid regressions, I still prefer the simpler visually-weighted line (last two figs here) since this is what overlaid watercolor regressions look like:


It might look better if the scheme made the blue overlay fade from blue-to-clear rather than blue-to-white, but then it would be mixing (in the color sense) with the red so the overlaid region would then start looking like very dark purple. If someone wants to code that up, I'd like to see it. But I'm predicting it won't look so nice.

8.20.2012

Visually-weighted confidence intervals

Following up on my earlier post describing visually weighted regression (paper here) and some suggestions from Andrew Gelman and others, I adjusted my Matlab function (vwregress.m, posted here) and just thought I'd visually document some of the options I've tried.

All of this information is available if you type help vwregress once the program is installed, but I think looking at the pictures helps.

The basic visually weighted regression is just a conditional mean where the visual weight of the line reflects uncertainty.  Personally, I like the simple non-parametric plot overlaid with the OLS regression since it's clean and helps us see whether a linear approximation is a reasonable fit or not:
vwregress(x, y, 300, .5,'OLS',[0 0 1])
 


Confidence intervals (CI) can be added, and visually-weighted according to the same scheme as the conditional mean:
vwregress(x, y, 300, .5, 200,[0 0 1]);

Since the CI band is bootstrapped, Gelman suggested that we overlay the spaghetti plot of resampled estimates, so I added the option 'SPAG' to do this. If the spaghetti are plotted using a solid color (option 'SOLID'), this ends up looking quasi-visually-weighted:
vwregress(x, y, 300, .5,'SPAG','SOLID',200,[0 0 1]);

But since it gets kind of nasty looking near the edges, where the estimates go a little haywire because the observations get thin, we can visually weight the spaghetti too to keep it from getting distracting (just omit the 'SOLID' option).
vwregress(x, y, 300, .5,'SPAG',200,[0 0 1]);

Gelman also suggested that we try approximating the spaghetti by smoothly filling in the CI band, using the original visual-weighting scheme. To do this, I added the 'FILL' option. I like the result quite a bit (even more than the spaghetti, but others may disagree). [Warning: this plotting combination may be very slow, especially with a lot of resamples.]
vwregress(x, y, 300, .5,'FILL',200,[0 0 1]);

If 'SOLID' is combined with 'FILL', only the conditional mean is plotted with solid coloring. (This differs from 'SPAG' and the simple CI bounds).
vwregress(x, y, 300, .5,'FILL','SOLID',200,[0 0 1]);

Finally, I included the 'CI' option which changes the visual-weighting scheme from using weights 1/sqrt(N) to using 1/(CI_max - CI_min), where CI_max is the upper limit (point-wise) of the CI and CI_min is the lower limit. 

I like this because if we combine this with 'FILL', then the confidence band "conserves ink" (which we equate with confidence) in the y-dimension. Imagine that we squirt out ink uniformly to draw the conditional mean and then smear the ink vertically so that it stretches from the lower confidence bound to the upper confidence bound.  In places where the CI band is narrow, this will cause very little spreading of the ink so the CI band will be dark. But in places where the CI band is wide, the ink is smeared a lot so it gets lighter. For any vertical sliver of the CI band (think dx) the amount of ink displayed (integrated along a vertical line) will be constant.
vwregress(x, y, 300, .5,'CI','FILL',200,[0 0 1]);

For Stata users, I have written vwlowess.ado (here), but unfortunately it does not yet have any of these options.

Lucas Leeman has implemented some of these ideas in R (see here), so maybe he'll make that code available.

All of the above plots were made with the random data:
x = randn(200,1); 
e = randn(200,1).*(1+abs(x)).^1.5; 
y = 2*x+x.^2+4*e;

7.30.2012

Visually-Weighted Regression

[This is the overdue earth-shattering sequel to this earlier post.]

I recently posted this working paper online. It's very short, so you should probably just read it (I was actually originally going to write it as a blog post), but I'll run through the basic idea here.  

Since I'm proposing a method, I've written functions in Matlab (vwregress.m) and Stata (vwlowess.ado) to accompany the paper. You can download them here, but I expect that other folks can do a much better job implementing this idea.

Solomon M. Hsiang
Abstract: Uncertainty in regression can be efficiently and effectively communicated using the visual properties of regression lines.  Altering the "visual weight" of lines to depict the quality of information represented clearly communicates statistical confidence even when readers are unfamiliar or reckless with the formal and abstract definitions of statistical uncertainty. Here, we present an example by decreasing the color-saturation of nonparametric regression lines when the variance of estimates increases. The result is a simple, visually intuitive and graphically compact display of statistical uncertainty. This approach is generalizable to almost all forms of regression.
Here's the issue. Statistical uncertainty seems to be important for two different reasons. (1) If you have to make a decision based on data, you want to have a strong understanding of the possible outcomes that might result from your decision, which itself rests on how we interpret the data.  This is the "standard" logic, I think, and it requires a precise, quantitative estimate of uncertainty.  (2) Because there is noise in data, and because sampling is uneven across independent variables, a lot of data analysis techniques generate artifacts that we should mostly just ignore.  We are often unnecessarily focused/intrigued by the wackier results that show up in analyses, but thinking carefully about statistical uncertainty reminds us to not focus too much on these features. Except when it doesn't.

"Visually-weighted regression" is a method for presenting regression results that tries to address issue (2), taking a normal person's psychological response to graphical displays into account. I had grown a little tired of talks and referee reports where people speculate about the cause of some strange non-linearity at the edge of a regression sample, where there was no reason to believe the non-linear structure was real.  I think this and related behaviors emerge because (i) there seems to be an intellectual predisposition to thinking that "nonlinearity" is inherently more interesting that "linearity" and (ii) the traditional method for presenting uncertainty subconsciously focuses viewers attention on features of the data that are less reliable. I can't solve issue (i) with data visualization, but we can try to fix (ii). 

The goal of visually-weighted regression is to take advantage of viewers' psychological response to images in order to focus their attention on the results that are the most informative.  "Visual weight" is a concept from art and graphical design that is used to direct a viewer's focus within an image.  Large, dark, high-contrast, and complex structures tend to "grab" a viewer's attention.  Our brains are constantly looking for visual information and, somewhere along the evolutionary line, detailed/high-contrast structures in our field of view were probably more informative and more useful for survival, so we are programmed to give them more of our attention.  Unfortunately, the traditional approaches to displaying statistical uncertainty give more visual weight to the uncertain portions of the analysis, which is exactly backwards of what we want. Ideally, a viewer will focus more of their attention on the portions of analysis that have some statistical confidence and they will mostly ignore the portions of analysis that are so uncertain that they contain little or no information.

[continued below the fold]

6.18.2012

Only the finest in Swiss data visualization

The data visualization specialists over at Chart Porn point us to Datavisualization.ch's Selected Tools page. From their blog post announcing it:
When I meet with people and talk about our work, I get asked a lot what technology we use to create interactive and dynamic data visualizations. [...] That’s why we have put together a selection of tools that we use the most and that we enjoy working with. We called it selection.datavisualization.ch. It includes libraries for plotting data on maps, frameworks for creating charts, graphs and diagrams and tools to simplify the handling of data. Even if you’re not into programming, you’ll find applications that can be used without writing one single line of code. We will keep this list as a living repository and add / remove things as technology develops. We hope this will help you find the best tool for your next job.
Guess it's time to learn a little javascript...

5.18.2012

Plotting a two-dimensional line using color to depict a third dimension (in Matlab)

I had seen other people do this in publications, but was annoyed that I couldn't find a quick function to do it in a general case. So I'm sharing my function to do this.

If you have three vectors describing data, eg. lat, lon and windspeed of Hurricane Ivan (yes, this data is from my env. sci. labs), you could plot it in 3 dimensions, which is awkward for readers:


or using my nifty script, you can plot wind speed as a color in two dimensions:


Not earth-shattering, but useful. Next week, I will post an earth-shattering application.

[BTW, if someone sends me code to do the same thing in Stata, I will be very grateful.]

Also, you can make art:


Help file below the fold.

4.30.2012

Why visualizing instrumental variables made me less excited about it

Instrumental variables (IV and 2SLS) is a statistical approach commonly used in the empirical social sciences ("instrumental variables" has > 84,000 hits on Google Scholar) but it often seems to be over-interpreted. I realized this when, as a grad student, I tried to draw a graphical description of what IV was doing. Ever since, I've never bothered to estimate IV and only spend my time on reduced form models.  These are my two cents.

IV & 2SLS

IV was originally developed to combat attenuation bias in ordinary least-squares, but somebody (I don't know who) realized it could be used to estimate local average treatment effects in experimental settings when not all subjects in a treatment group actually received the treatment of interest.  From there, someone figured out that IV could be used in retrospective studies to isolate treatment effects when assignment to treatment had some endogenous component. This last interpretation is where I think a lot of folks run into trouble. (If you don't know what any of this means, you probably won't care about the rest of this post...)

All of these conceptual innovations were big contributions.  But what happened afterwards led to a lot of sloppy thinking.  IV is often taught as a method for "removing" endogeneity from an independent variable (variable X) in a regression.  While strictly a correct statement (if some relatively strong assumptions are met), it often feels as if it's abused.  My perception from papers/talks/discussion is that most students interpret IV in a similar way to how they interpret a band-pass filter: if you find a plausible instrument (variable Z), you can "filter out" the bad (endogenous) part of your independent variable and just retain the good (exogenous) part.  While a useful heuristic for teaching, its overuse leads to bad intuition.

IV with one instrument

The first stage regression in IV "filters out" all variation in X that does not project (for now, linearly) onto Z.  This linear projection is just Z*b, but it is often written as "X-hat". While not mathematically incorrect, and appealing for heuristic purposes, I think that using X-hat as a notational tool is why a lot of students get confused. It makes students think that X-hat is just like X, except without all the bad endogenous parts.  But really, X-hat is more like Z than it is like X. In fact, X-hat IS Z! It's just linearly rescaled by b, which is just a change in units.  The reason why X-hat is exogenous if Z is exogenous is because

X-hat = Z * b

where b is a constant. It seems silly and patronizing to reiterate an equation that is taught in metrics 101 a gazillion times, but people seem to forget that this little rescaling equation is doing all the heavy lifting in an IV model.  (I bet that if we never used "X-hat" notation and we renamed "the first stage" something less exciting, like "linearly converting units," then grad students would think IV is much less magical...)

Once X-hat is estimated, it is used as the new regressor in the "second stage" equation

Y = X-hat * a + error

and then a big fuss is made about how a is the unbiased effect of X on Y.  But if we drop the X-hat notation and replace it with the rescaled-Z notation, we get something less exciting:

Y = Z * b * a + error

where a is now the unbiased effect of Z on Y, but rescaled by a factor 1/b.  For those familiar with IV, this looks a lot like the reduced form equation

Y = Z * c + error

which was always valid because Z is exogenous. Clearly,

a = c / b.

If you're still reading, you should be yawning, since this all seems very boring: these are all just variations on the same reduced form equation. And that's my main point, which I don't think is taught enough: Instrumental variables is only as good as your reduced form, because it is a linear rescaling of your reduced form. Of course, there are adjustments to standard errors, but if you're doing IV to remove bias, then you were focused on your point estimate to begin with. (An aside: it drives me nuts when people are "uninterested" in the reduced form effect of natural variations [eg. weather] on some outcome, but then are fascinated if you use the same variation as an instrument for some otherwise endogenous intermediate variable. The latter regression is the same as the first, only the coefficient has been rescaled by 1/b!)
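A quick numerical sanity check of a = c/b with a single instrument, in Stata (just a throwaway simulation, not the plotting code below the fold):

* simulate an exogenous instrument z, an endogenous-looking x, and an outcome y
clear
set obs 1000
gen z = runiform()
gen x = 2*z + rnormal()
gen y = x + rnormal()
reg y z                      // reduced form: c = _b[z]
local c = _b[z]
reg x z                      // first stage: b = _b[z]
local b = _b[z]
ivregress 2sls y (x = z)     // IV: the coefficient on x is a
display `c'/`b'              // matches the IV coefficient, i.e. a = c/b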

Enough ranting. How do we visualize this? It's as easy as it sounds: you rescale the horizontal axis in your reduced form regression.

To illustrate this, I generated 100 false data points with the equations

Z ~ uniform on [0,1]
X =  Z * 2 + e1
Y = X + e2


where e1 and e2 are N(0,1). I then plot the reduced form equation (Z vs. Y) as the blue line in the upper-left panel:


I then estimate the first stage regression in the lower left panel and show the predicted values X-hat as open red circles. In the lower right panel, I just plot X-hat against X-hat to show that I am reflecting these values from the vertical axis (ordinate) to the horizontal axis (abscissa). Then, keeping X-hat on the horizontal axis, I plot Y against X-hat in the upper right panel. This is the second stage estimate from IV, which gives us our unbiased estimate of the coefficient a (the slope of the red line).

[The code to generate these plots in Stata is at the bottom of this post.]

What is different between the scatter on the upper left (blue line, representing the reduced form regression) and the scatter on the upper right (red line, representing the second stage of IV)? Not too much. The vertical axis is the same and the relative locations of the data points are the same. The only change is that the horizontal variable has been rescaled by a factor of two (recall the data generating process above). The IV regression is just a rescaled version of the reduced form.  What happened to X? It was left behind in the lower left panel and it never went anywhere else. It only feels like it is somehow present in the upper right panel because we renamed Z*b the glitzier X-hat.

Multiple instruments (2SLS)

Sometimes people have more than one exogenous variable that influences X (eg. temperature and rainfall). Both of these variables can be used as instruments in the first stage. What happens to the intuition above when we do this? Not too much, except things get harder to draw. But the fact that it's harder to draw doesn't mean that the estimate is necessarily more impressive or magical.

Suppose we now have instruments Z1 and Z2 such that

Multivariate first stage: X = Z1 - Z2
Second stage: Y = 2 * X

then the first stage regression (X-hat) looks like this:


where the two horizontal axes are Z1 and Z2 and the vertical axis is X.

If we substitute this first stage regression into the second stage, we get

Y = 2 * (Z1 - Z2)

We can easily plot this version of Y as a function of Z1 and Z2 (the reduced form). Here, it's the purple plane:


The resulting second stage regression would take the observed value for Y (purple) and project it onto the predicted values for X (green). The parameter of interest (2 here, but the variable "a" earlier) is just the ratio of the height of the purple surface to the height of the green surface. 

Again, everything can be stated in terms of the instruments Z1 and Z2 (the two horizontal axes), which means the reduced form is again just as good as the second stage.  X only enters by determining how steep the green surface is.

Nonlinear first stage with multiple instruments

Finally, some people use a non-linear first stage.  Again, this is not terribly different. Suppose we have

Nonlinear multivariate first stage: X = Z1^2 - Z2
Second stage: Y = 2 * X

Then the first stage looks like:


and the reduced form regression of Y on Z1 and Z2 gives us

Y = 2 * (Z1^2 - Z2)

which we overlay in purple again:


Since the second stage didn't change, the height of the purple surface is still always twice the distance from zero relative to the green surface (a scatter plot of purple vs. green values would be a straight line with slope = 2 = a).  This graph looks fancier, but the instruments Z1 and Z2 are still driving everything.  The endogenous variable X only enters passively by determining the height of the green surface. Once that's done, Z1 and Z2 do the rest.

Take away: Reduced form models can be very interesting. However, instrumental variables models are rarely much more interesting.  In fact, all the additional interestingness of the IV model arises from the exclusion restriction that is assumed, but this assumption is usually false (recall: how many papers use weather as an instrument?), so the IV model is probably exactly as interesting as the reduced form, except that it has larger standard errors and the wrong units.

[If you disagree, send me a clear visualization of your 2SLS results and all the steps you used to get there.  I'd love to be wrong here.]

4.16.2012

Cheap parallel computing for students

I don't like to advertise for companies, but Matlab recently released its 2012 version for students and I know that a lot of students don't realize how much computational bang they can get for their buck.  The parallel computing package for Matlab only costs students $29 and allows them to run parallel code on an unlimited number of cores. (Compare this to Stata, which charges hundreds of dollars per additional core.)  And even more importantly, parallelizing your code is amazingly easy (in the best cases, it may only involve changing your "for" commands to "parfor").  As a PhD student, I think this $29 investment saved me several months on my dissertation (I'm not kidding).

I'm guessing there's a similar free package for R which I don't know about (please comment).

3.27.2012

Research papers + GitHub = SciGit

Eric H. (@scigit) points us to SciGit, a management platform for collaborative research papers modeled after GitHub. It's still in prelaunch but should be available soon, and you can sign up your email for the official announcement if you like.
  • Collaborate with others without the mess: no more "-version19 John" edits anymore. Keep track of changes made by each person and see the differences between papers. Merge edits easily.
  • Version control made simple: simply drag and drop and it'll do the rest; no more dealing with the terminal. Supports SVN, CVS, Mercurial, and Git.
  • Supports your favorite editor: use the same text editor you have now, with the convenience of working offline.
  • Secure: your paper stays private, accessible only to those you give permission.


2.10.2012

Carefully interpreting climate-conflict results (JPR 2012) - Part 1


The Journal of Peace Research recently published an entire special issue dedicated to climate and conflict which contains papers from a 2010 conference.  Nils Petter Gleditsch is the editor of the issue, and in his introduction he suggests that he's not worried about future climate change (WRT conflict) based on findings presented in the issue.  We had a brief exchange about his interpretation on the Monkey Cage where he suggests that climate may influence conflict (recognizing our 2011 findings), but he is skeptical that anthropogenic climate change will matter.  I suppose this is fair enough, since nobody can say anything about the future with certainty. But his conclusions are based on empirical studies which are also necessarily historical, so it seems a little contradictory to think that they are somehow more informative about the future than other empirical historical studies.

This point aside, I've taken Nils' advice and started reading the papers in the issue.  My first reaction was a second wave of surprise at his conclusions, since most of the empirical papers in the issue seem to actually find a link between climatological parameters and conflict, although I haven't carefully kept score yet.  When I finish reading the issue, I will post my conclusions on this.

The first paper I got to that seemed to have a reasonable identification strategy was this one:

Climate change, rainfall, and social conflict in Africa 
Cullen S Hendrix and Idean Salehyan 
Abstract: Much of the debate over the security implications of climate change revolves around whether changing weather patterns will lead to future conflict. This article addresses whether deviations from normal rainfall patterns affect the propensity for individuals and groups to engage in disruptive activities such as demonstrations, riots, strikes, communal conflict, and anti-government violence. In contrast to much of the environmental security literature, it uses a much broader definition of conflict that includes, but is not limited to, organized rebellion. Using a new database of over 6,000 instances of social conflict over 20 years – the Social Conflict in Africa Database (SCAD) – it examines the effect of deviations from normal rainfall patterns on various types of conflict. The results indicate that rainfall variability has a significant effect on both large-scale and smaller-scale instances of political conflict. Rainfall correlates with civil war and insurgency, although wetter years are more likely to suffer from violent events. Extreme deviations in rainfall – particularly dry and wet years – are associated positively with all types of political conflict, though the relationship is strongest with respect to violent events, which are more responsive to abundant than scarce rainfall. By looking at a broader spectrum of social conflict, rather than limiting the analysis to civil war, we demonstrate a robust relationship between environmental shocks and unrest.

I wasn't sure if these results (HS) contradicted the results in our 2011 paper which showed a link between ENSO and conflict (HMC), since El Nino tends to cause the tropics to dry on average but also causes heavier-than-normal rainfall in some parts of Africa.  So I downloaded the replication data and merged it with the NINO3 time series from our paper.  The first thing I noticed (which isn't substantively important but is concerning from an editorial standpoint) is that the replication code and the published table didn't quite match.  The main conflict table from the paper is this:




But just running the code produced this:



The highlighted coeffs don't match, but they are just "controls" so we may not care much about them being mistabulated.  The underlined variables aren't reported in the first model (although I think they must have been included in the model the authors ran, because when I dropped them, the rest of the coeffs went haywire).  I mention this not to be a nag, but to explain why my estimates below don't match those in the paper exactly.

Having merged the HS data with our ENSO data (exactly from HMC), I replicated their main result (Column 1), then restricted it to the years which overlap with our paper (Column 2), then include our measure of ENSO {NINO3 averaged May-December} (Column 3), then drop 1997 since it's a strong El Nino year that many people have complained about WRT our paper (Column 4):


Two things happen when I try to reconcile HS with HMC.

First, dropping 2005-2008 causes their strong positive correlation between conflict-onset and normalized-standardized-rainfall (GPCP_precip_mm_deviation_sd in the table) to drop by half in magnitude.  Because the standard errors don't change much, this also makes the coeff much less significant.  However, the coeffs are not actually significantly different from one another, so this change is hard to interpret conclusively.  Although it is worth noting that almost none of the other coeffs (except current normalized-standardized-rainfall) move very much.

Second, when I introduce our measure of ENSO (nino3_late in the table), it comes out with a strong positive correlation, consistent with earlier results of HMC.  The coeff on normalized-standardized-rainfall doesn't change much, suggesting these correlations are reasonably orthogonal.  When I drop 1997, the coeff on ENSO doubles in size, but so does the uncertainty.

Overall, I think this tells us that HS doesn't overturn the findings of HMC, suggesting that the effect they may be measuring is not related to ENSO. It also suggests, though, that the positive correlation HS observe (between local rainfall and conflict onset) might be a little sensitive to the sample/specification (notably, they also don't control for temperature, which is definitely correlated with rainfall).

I was also interested in the SCAD data HS use, since my coauthors and I have played with the idea of using it to look at smaller scale conflicts.  Unfortunately, the data doesn't go back in time very far, so it's pretty tough to look at the effects of ENSO since ENSO varies annually.  Nonetheless, I did the quick and dirty thing of collapsing all SCAD events in Africa to yearly observations (following the ACR approach of HMC), linearly de-trending the series, and plotting the scatter against ENSO:


Since the panel is so short, 1997 is a monstrous outlier. But if we ignore it (as the critics who don't like the results of HMC keep arguing for) we see a pretty large positive correlation (although, unsurprisingly, it's not quite significant with N = 13).  I think this is pretty interesting, since the SCAD database contains less-organized, less-deadly or less-political conflict than the PRIO data analyzed in HMC and it still exhibits a positive correlation with ENSO, similar to the organized, deadly, political violence presented in HMC.
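For the record, the collapse-and-detrend step looks roughly like this in Stata (file and variable names here are placeholders, not the actual replication files):

* collapse event-level SCAD records to a yearly count (placeholder names)
use scad_africa_events, clear
collapse (count) n_events = event_id, by(year)
* linearly detrend the yearly series
reg n_events year
predict n_detrended, residuals
* merge in the ENSO series and plot the scatter with a fitted line
merge 1:1 year using nino3_annual, nogenerate
twoway (scatter n_detrended nino3_late) (lfit n_detrended nino3_late)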

[Code to run the models in the table are below the fold.]

1.09.2012

"Boxplot regression" - my new favorite non-parametric method

I like non-parametric regressions since they let you see more of what's going on in the data (see this post too).  I also like when people plot more than just a mean regression, since seeing the dispersion in a conditional distribution is often helpful.  So why don't people combine both to make a souped-up graph with the best of both worlds?  Probably because there's no standard command in software packages (that I know of, at least).  So I decided to fix that.  Introducing boxplot regression! I'm not the first to do this, but I'm giving this code away for free, so I'm taking the liberty of making up a name since I haven't seen one before (heaven knows that I may be way off here...).

For one of our projects, Michael Roberts (his blog is solid) suggested we mimic this plot from one of his earlier papers.  This seemed like such a good idea that I wrote up a Matlab function so I can make a similar plot for every paper I write from here on out.

The function takes a continuous independent variable and bins it according to an arbitrary bin-width.  It then makes a boxplot for values of the dependent variable over each bin in the independent variable.  The result is basically a non-parametric quantile regression showing the median, 25th and 75th percentiles.  I then plot the mean for each bin overlaid as a line, so one can see how more traditional non-parametric mean regressions will compare.  Simple.

I'm not the first, but I'm not sure why this isn't done more often. Boxplots are usually used to compare distributions over qualitatively different groups, like races, sexes or treatment groups.  But it's not a huge conceptual leap to discretize an independent variable so we can apply the approach to standard regression. It's just annoying to code up.
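A rough Stata analogue of the idea, for anyone who doesn't use Matlab (the bin edges are arbitrary, and this skips the overlaid bin-mean line that my function adds):

* discretize the independent variable, then draw one boxplot per bin
egen xbin = cut(x), at(0(0.1)1)
graph box y, over(xbin)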

My boxplot regression function is here (along with a utility to bin any variable without making the plot).  Now making this plot takes a single command.

Example: We take a globally gridded dataset from the SAGE group (Google Earth file of it here) and do a boxplot regression of area planted with crops on the agriculture suitability index of that grid cell:


We get a bi-variate graph packed full of information, right?  I hope Tufte would approve.  If you specify >25 bins, I've set the function to switch to a slightly more compact style for the boxplots.


Enjoy!

[If you like this, see this earlier post too. Help-file cut and pasted below the fold.]