1.09.2012

"Boxplot regression" - my new favorite non-parametric method

I like non-parametric regressions because they let you see more of what's going on in the data (see this post too). I also like it when people plot more than just a mean regression, since seeing the dispersion in a conditional distribution is often helpful. So why don't people combine both to make a souped-up graph with the best of both worlds? Probably because there's no standard command in software packages (that I know of, at least). So I decided to fix that. Introducing boxplot regression! I'm not the first to do this, but I'm giving this code away for free, so I'm taking the liberty of making up a name since I haven't seen one before (heaven knows that I may be way off here...).

For one of our projects, Michael Roberts (his blog is solid) suggested we mimic this plot from one of his earlier papers. This seemed like such a good idea that I wrote a Matlab function so I can make a similar plot for every paper I write from here on out.

The function takes a continuous independent variable and bins it using a bin width that you choose. It then makes a boxplot of the dependent variable's values within each bin of the independent variable. The result is basically a non-parametric quantile regression showing the median and the 25th and 75th percentiles. I then overlay the mean for each bin as a line, so one can see how a more traditional non-parametric mean regression compares. Simple.
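To make the steps concrete, here is a minimal sketch of the binning-and-summarizing idea. It is not the code from my function: the data are synthetic, every variable name is made up for illustration, and it assumes `quantile` is available (in Matlab that means the Statistics Toolbox; Octave ships it in core).

```matlab
% Minimal sketch of the idea only (NOT the boxplot_regression source).
% Synthetic data; all names below are illustrative assumptions.
n = 200;
x = linspace(0, 1, n)';              % continuous independent variable
y = 2*x + 0.1*sin(37*(1:n)');        % dependent variable with deterministic wiggle

width = 0.25;                        % chosen bin width
edges = 0:width:1;                   % bin edges: 0, 0.25, 0.5, 0.75, 1
nbins = numel(edges) - 1;

% Assign each x to a bin of the form (lower, upper]; the max(..., 1)
% places x == edges(1) in the first bin instead of "bin 0".
bin = max(ceil((x - edges(1)) / width), 1);

bin_mean = zeros(nbins, 1);          % conditional mean of y per bin
bin_q    = zeros(nbins, 3);          % 25th, 50th and 75th percentiles per bin
for k = 1:nbins
    yk = y(bin == k);
    bin_mean(k) = mean(yk);
    bin_q(k, :) = quantile(yk, [0.25 0.5 0.75]);
end
```

With the Statistics Toolbox, `boxplot(y, bin)` would then draw one box per bin at positions 1 to `nbins`, and `hold on; plot(1:nbins, bin_mean)` would overlay the mean line.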

I'm not sure why this isn't done more often. Boxplots are usually used to compare distributions over qualitatively different groups, like races, sexes or treatment groups. But it's not a huge conceptual leap to discretize an independent variable so we can apply the approach to standard regression. It's just annoying to code up.

My boxplot regression function is here (along with a utility to bin any variable without making the plot).  Now making this plot takes a single command.

Example: We take a globally gridded dataset from the SAGE group (Google Earth file of it here) and do a boxplot regression of area planted with crops on the agriculture suitability index of that grid cell:


We get a bivariate graph packed full of information, right? I hope Tufte would approve. If you specify more than 25 bins, I've set the function to switch to a slightly more compact style for the boxplots.


Enjoy!

[If you like this, see this earlier post too. Help-file cut and pasted below the fold.]

% ----------------------------
% S. HSIANG, 1/12
% SHSIANG@PRINCETON.EDU
% ----------------------------
%
% OUTPUT = boxplot_regression(X, Y, width, bottom_bin_upper_bound, Nbins)
%
% BOXPLOT_REGRESSION bins data from X and Y according to the bins in X 
% specified by WIDTH, BOTTOM_BIN_UPPER_BOUND and NBINS. It then plots a 
% boxplot over these bins in X for the values in Y.  It also overlays the
% conditional mean of Y for each bin.
%
% X - independent variable
% Y - dependent variable
% width - the width of each interior bin in X
% bottom_bin_upper_bound - the upper bound for the lowest bin
% Nbins - the number of total bins (including top and bottom bins)
%
% NOTES:
%
% The top and bottom bins are for all values beyond the limits of the
% Nbins-2 interior bins.  They are infinite in width.  However, all
% interior bins are WIDTH units in width.
%
% Each bin includes its upper bound but excludes its lower bound. 
%
% X and Y can be matrices or vectors, however they must be the same size
% and have elements that correspond 1:1.
%
% The OUTPUT stored is a structure containing values of 
% Y         - reshaped to a vector
% X_binned  - corresponding new values of X that are unique to each bin 
%               (indexed by each bin's upper bound, except the top bin)
% Y_means   - the mean values of Y for each bin
% labels    - a cell array describing the values of X in each bin
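
The bin rule in the notes above (open-ended bottom and top bins, Nbins-2 interior bins of equal width, upper bound included and lower bound excluded) can be sketched in a few lines. This is a hypothetical reimplementation based only on the help text, not the function's actual source; the input values and the names `upper` and `idx` are made up for illustration.

```matlab
% Sketch of the bin-assignment rule from the help text (an assumption-based
% reimplementation, NOT the code inside boxplot_regression).
width = 10;
bottom_bin_upper_bound = 0;
Nbins = 5;                            % bottom bin + 3 interior bins + top bin
X = [-5 0 0.1 10 25 30 30.1 100];     % illustrative values

% Upper bounds of the bottom bin and of the Nbins-2 interior bins.
upper = bottom_bin_upper_bound + width * (0:Nbins-2);   % [0 10 20 30]

% A value falls in bin 1 + (number of upper bounds strictly below it),
% so each bin includes its upper bound and excludes its lower bound.
idx = zeros(size(X));
for i = 1:numel(X)
    idx(i) = 1 + sum(X(i) > upper);
end
% idx is [1 1 2 2 4 4 5 5]
```

Here 0 and 10 land in the bin they cap (bins 1 and 2), while 30.1 and 100 both fall into the open-ended top bin.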


3 comments:

  1. There are some simple ways to knock down the number of tick labels on that compact boxplot. I'll leave it as an exercise for the reader. Also, is it a bit intellectually disingenuous to show binned data when the line is generated from continuous data?

  2. The mean line is simply plotting the mean value for Y in each bin, so it is also discretized and is no longer continuous.

  3. Hi, thanks for figuring out ways to do boxplot regression! I wonder: are you familiar with ways to do this in SPSS/PASW, R or Sam? I would very much like to try this out, but I don't use MatLab :-)
