For one of our projects, Michael Roberts (his blog is solid) suggested we mimic this plot from one of his earlier papers. This seemed like such a good idea that I wrote up a Matlab function so I can make a similar plot for every paper I write from here on out.
The function takes a continuous independent variable and bins it using a bin width you choose. It then draws a boxplot of the dependent variable's values within each bin. The result is basically a non-parametric quantile regression showing the median and the 25th and 75th percentiles. I then overlay the mean for each bin as a line, so one can see how a more traditional non-parametric mean regression would compare. Simple.
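For anyone who wants to see the idea in a few lines, here is a minimal sketch of the bin-then-boxplot approach. It is not the code of the function itself. The data are synthetic, the variable names (`x`, `y`, `w`, `bin`, `binMeans`) are made up for illustration, and `boxplot` assumes the Statistics and Machine Learning Toolbox is installed.

```matlab
% Hypothetical synthetic data; names and values are illustrative only.
rng(1);                                   % reproducible example
x = 10 * rand(500, 1);                    % continuous independent variable
y = 2 * x + randn(500, 1);                % dependent variable

w = 2;                                    % assumed bin width
bin = ceil(x / w);                        % bin k covers ((k-1)*w, k*w]
bin(bin < 1) = 1;                         % guard: x == 0 goes in the first bin

% Conditional mean of y within each occupied bin.
[binIds, ~, grp] = unique(bin);           % grp maps each point to 1..numel(binIds)
binMeans = accumarray(grp, y, [], @mean);

% One box per bin, then the bin means overlaid as a line. Numeric groups
% are drawn in sorted order at positions 1..K, which matches binIds.
boxplot(y, bin);
hold on
plot(1:numel(binIds), binMeans, '-o');
hold off
xlabel('bin of x');
ylabel('y');
```

Because the means are indexed by occupied bins only, the overlay still lines up with the boxes if some bin happens to be empty.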
I'm not the first, but I'm not sure why this isn't done more often. Boxplots are usually used to compare distributions over qualitatively different groups, like races, sexes or treatment groups. But it's not a huge conceptual leap to discretize an independent variable so we can apply the approach to standard regression. It's just annoying to code up.
My boxplot regression function is here (along with a utility to bin any variable without making the plot). Now making this plot takes a single command.
Example: We take a globally gridded dataset from the SAGE group (Google Earth file of it here) and do a boxplot regression of area planted with crops on the agriculture suitability index of that grid cell:
We get a bi-variate graph packed full of information, right? I hope Tufte would approve. If you specify >25 bins, I've set the function to switch to a slightly more compact style for the boxplots.
Enjoy!
[If you like this, see this earlier post too. Help-file cut and pasted below the fold.]
% ----------------------------
% S. HSIANG, 1/12
% SHSIANG@PRINCETON.EDU
% ----------------------------
%
% OUTPUT = boxplot_regression(X, Y, width, bottom_bin_upper_bound, Nbins)
%
% BOXPLOT_REGRESSION bins data from X and Y according to the bins in X
% specified by WIDTH, BOTTOM_BIN_UPPER_BOUND and NBINS. It then plots a
% boxplot over these bins in X for the values in Y. It also overlays the
% conditional mean of Y for each bin.
%
% X - independent variable
% Y - dependent variable
% width - the width of each interior bin in X
% bottom_bin_upper_bound - the upper bound for the lowest bin
% Nbins - the number of total bins (including top and bottom bins)
%
% NOTES:
%
% The top and bottom bins are for all values beyond the limits of the
% Nbins-2 interior bins. They are infinite in width. However, all
% interior bins are WIDTH units in width.
%
% Each bin includes its upper bound but excludes its lower bound.
%
% X and Y can be matrices or vectors, however they must be the same size
% and have elements that correspond 1:1.
%
% The OUTPUT stored is a structure containing values of
%   Y        - reshaped to a vector
%   X_binned - corresponding new values of X that are unique to each bin
%              (indexed by each bin's upper bound, except the top bin)
%   Y_means  - the mean values of Y for each bin
%   labels   - a cell array describing the values of X in each bin
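
The binning rules in the notes above (two open-ended end bins, interior bins of equal width, upper bounds included) can be sketched roughly as follows. This is only an illustration under my own assumptions, not the function's actual code: it uses `discretize`, which needs MATLAB R2015a or later, and the input values, the variable names and the label format are all made up for the example.

```matlab
% Hypothetical inputs, chosen only to illustrate the binning rules.
x      = [-3 0 0.5 1 1.2 2 2.5 3 7];      % values of the independent variable
width  = 1;                               % width of each interior bin
bottom = 0;                               % plays the role of bottom_bin_upper_bound
Nbins  = 5;                               % total bins, including both end bins

% Finite edges: the bottom bin's upper bound, then one edge per interior bin.
inner = bottom + width * (0:Nbins-2);     % here: 0 1 2 3
edges = [-Inf, inner, Inf];               % end bins are infinite in width

% 'IncludedEdge','right' makes each bin include its upper bound.
bin = discretize(x, edges, 'IncludedEdge', 'right');

% Illustrative text labels for each bin (this format is an assumption).
labels = cell(1, Nbins);
labels{1} = sprintf('<= %g', inner(1));
for k = 2:Nbins-1
    labels{k} = sprintf('(%g, %g]', inner(k-1), inner(k));
end
labels{Nbins} = sprintf('> %g', inner(end));
```

With these inputs, a value sitting exactly on an edge (such as 1 or 2) lands in the bin below it, and everything above the last finite edge falls into the top bin.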
There are some simple ways to knock down the number of tick labels on that compact boxplot. I'll leave it as an exercise for the reader. Also, is it a bit intellectually disingenuous to show binned data when the line is generated from continuous data?
The mean line is simply plotting the mean value for Y in each bin, so it is also discretized and is no longer continuous.
Hi, thanks for figuring out ways to do boxplot regression! I wonder: are you familiar with ways to do this in SPSS/PASW, R or Sam? I would very much like to try this out, but I don't use MatLab :-)