2.18.2013

Plotting restricted cubic splines in Stata [with controls]

Michael Roberts has been trying to convince me to us restricted cubic splines to plot highly nonlinear functions, in part because they are extremely flexible and they have nice properties near their edges.  Unlike polynomials, information at one end of the support only weakly influences fitted values at the other end of the support. Unlike the binned non-parametric methods I posted a few weeks ago, RC-splines are differentiable (smooth). Unlike other smooth non-parametric methods, RC-splines are fast to compute and easily account for control variables (like fixed effects) because they are summarized by just a few variables in an OLS regression. They can also be used with spatially robust standard errors or clustering, so they are great for nonlinear modeling of spatially correlated processes. 

In short: they have lots of advantages. The only disadvantage is that it takes a bit of effort to plot them since there's no standard Stata command to do it.

Here's my function plot_rcspline.ado, which generates the spline variables for the independent variable, fits a spline while accounting for control variables, and plots the partial effect of the specified independent variables (adjusting for the control vars) with confidence intervals (computed via delta method). It's as easy as

plot_rcspline y x

and you get something like


where the "knots" are plotted as the vertical lines (optional).

Help file below the fold.  Enjoy!

Related non-linear plotting functions previously posted on FE:
  1. Boxplot regression
  2. Visually-weighted regression
  3. Watercolor regression
  4. Non-parametric three-dimensional regression
  5. Binned non-parametric regression
  6. Polynomial regression of any degree
S. HSIANG
SHSIANG@PRINCETON.EDU
2/2013
-------------------------------------------------------

SYNTAX

PLOT_RCSPLINE Yvar Xvar [Zvars] [if] , [NKNOTS(integer 5) KNOTPREF(string) BINS(integer 100) SAVE_data_file(string) plotcommand(string) regcommand(string) cilevel(integer 95) PLOTKNOTS NOCI]

-------------------------------------------------------

PLOT_RCSPLINE is designed to show a nonlinear relationship between Xvar and Yvar, adjusted for other variables Zvar1 ... Zvarn that are specified as controls. Confidence intervals are computed by the delta method.

PLOT_RCSPLINE uses a restricted cubic spline approach to estimating the nonlinear relationship between Yvar and Xvar. The spline is estimated with NKNOTS (5 is the default) and the spline variables in Xvar are generated by PLOT_RCSPLINE. One usage of PLOT_RCSPLINE is to generate the spline variables that will be used elsewhere in an analysis, while allowing the user to visually check that the partial response function (in Xvar) is reasonably specified using the spline.

PLOT_RCSPLINE does not automatically save the figure. Use GRAPH EXPORT or GRAPH SAVE afterwards to save the figure.

-------------------------------------------------------

Required arguments:

Yvar - the dependent variable to be plotted
Xvar - the independent variable to be plotted

Optional arguments: 

Zvars - control variables in the model

PLOTKNOTS - plots the knots as vertical lines

NKNOTS - the number of knots used in the spline, default = 5 (the number of variables representing the spline in the regression is one less than NKNOTS)

KNOTPREF() - a prefix used to name the variables generated for the spline

NOCI - suppress the plotting of the confidence intervals

CILEVEL() - Integer specifying the confidence level to be plotted. The default is cilevel(95).

REGCOMMAND() - String that is added to the regression command as an option, eg. "cluster(year)"

BINS() - Number of "bins" that are used to plot the final response funtion. Fewer bins will make the response function look "jagged", more bins will make the response look smoother. Default is 100. 

SAVE_data_file() - String that specifies a path/file to save the dataset of estimates used to plot the response function. These are the values of the response function (with SE) at each bin of the independent variable. If nothing is specified, the data used to plot the results will not be saved.

PLOTCOMMAND() - String arguments that are appending to the final plotting command. Eg. to label the plot with the title "X vs Y" and xtitle "X" type: plot_options("title(X vs Y) xtit(X)").

-------------------------------------------------------

EXAMPLES: 

set obs 1000
gen year = ceil(_n/100)
gen x=4*uniform()
gen z=2*uniform()
gen w=2*uniform()
gen e = 4*rnormal()
gen y= w + 3*z + 4*x - x^2 + .1*x^3 + e

plot_rcspline y x

plot_rcspline y x z w, nknots(7)  noci plotknots bins(20) knotpref(_x_knots_)

plot_rcspline y x z w, nknots(7) plotknots cilevel(90) plotcommand("tit(E[y|x,z,w])") regcommand("cluster(year)")

-------------------------------------------------------

*/

4 comments:

  1. Solomon: I agree with your advocacy of restricted cubic splines as a good scatter plot smoother.

    You are correct that there is no official command to do this in Stata, if that's what you mean by "standard". As far as user-written commands are concerned, -rcspline- has been downloadable from SSC since 2007. It doesn't have quite the same objectives as your program: yours goes further in supporting what you call control variables). -rcspline- always shows the original data for the variables plotted.

    In terms of your program, Stata users would typically expect a help file that would explain the syntax in a standard way, not documentation embedded in the code. Your implementation makes use of permanent variable names in a way that poses a small but not zero risk of messing up existing datasets and in a future revision could usefully support -in- as well as -if-.

    In a quick test of your program I went

    sysuse auto, clear
    plot_rcspline mpg weight

    and I see predictions shown for -mpg- that are all negative. What did I do wrong?

    ReplyDelete
    Replies
    1. Nick: thanks for pointing out rcspline (and writing it!). It looks really nice.

      One day, I will learn how to write proper help files for Stata... but until then, I apologize that getting the details on my functions is a little unwieldy.

      In your test, the predictions for mpg are all negative because plot_rcspline does not account for the constant term. The reason is that when there are multiple additive constants, I was only interested in the marginal effect of the variable of interest, all other non-weight variables are not included in the prediction. If you center the variables first (so the constant is zero), you see that the scatter is identical to the spline:

      sysuse auto, clear
      center mpg weight
      sc c_mpg c_weight
      plot_rcspline c_mpg c_weight

      In the next version, I'll make including the constant an option. Thanks for pointing this out.

      Delete
    2. Thanks for this. When there are no control variables, it's possible just to use -predict- directly to get predictions in terms of the response. I'd guess other users than myself might expect that to be the default with no control variables, but as you imply that clashes with your broader purpose in getting marginal effects. That in turn raises the question of getting Stata's -marginsplot- to do your work for you.

      Delete
  2. Alright, a few other folks emailed about plots that dropped the constant term in a bivariate model, so I've updated the code to have an option "keepcons". This option will keep the constant term when computing the spline, which is appropriate only in models where there are no control variables. So if you are plotting the relationship between X and Y, and you want it look similar to what you would get if you typed "lowess Y X" then you should type

    plot_rcspline Y X, keepcons

    and you should recover a curve that looks more familiar. If you include controls Z but type

    plot_rcspline Y X Z, keepcons

    then you will keep the constant but drop the average effect of Z when you plot the spline. This can be interpreted as the prediction of the model when all covariates are set to zero (note that this doesn't make too much sense in many models, so I would stick to using the "keepcons" option only in bivariate models).

    ReplyDelete